Object Stores
Parameters can be defined in the configuration file or as environment variables. For Amazon S3, add an entry like this to config.toml:
[[storages]]
type = "aws"
access_key_id = "AKIA..."
secret_access_key = "SECRET"
region = "us-east-1"
bucket = "my-bucket"
If you’re using multiple buckets, you would write multiple entries.
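For example, a configuration with two S3 buckets might look like this (a sketch; the credentials, regions, and bucket names are placeholders):
[[storages]]
type = "aws"
access_key_id = "AKIA..."
secret_access_key = "SECRET"
region = "us-east-1"
bucket = "my-bucket"

[[storages]]
type = "aws"
access_key_id = "AKIA..."
secret_access_key = "SECRET"
region = "eu-west-1"
bucket = "my-other-bucket"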
If you specify values through environment variables, set the following keys. Definitions from environment variables and those in config.toml are merged.
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
AWS_DEFAULT_REGION
AWS_BUCKET
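When running the server binary directly, these can be exported in the shell beforehand, for example (a sketch; the values are placeholders):
export AWS_ACCESS_KEY_ID="AKIA..."
export AWS_SECRET_ACCESS_KEY="SECRET"
export AWS_DEFAULT_REGION="us-east-1"
export AWS_BUCKET="my-bucket"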
Parameters defined in environment variables also take effect when the server runs in Docker:
docker run -d --rm \
  -p 4000:4000 \
  -v ./data:/var/datafusion-server/data \
  -e AWS_ACCESS_KEY_ID="AKIA..." \
  -e AWS_SECRET_ACCESS_KEY="SECRET" \
  -e AWS_DEFAULT_REGION="us-west-2" \
  -e AWS_BUCKET="my-bucket" \
  --name datafusion-server \
  datafusion-server:x.y.z
Google Cloud Storage (GCS), like S3, takes its parameters from the configuration file or environment variables.
[[storages]]
type = "gcp"
service_account_key = "SERVICE_ACCOUNT_KEY"
bucket = "my-bucket"
The service_account_key should be set to the JSON-serialized service account credentials.
Likewise, for environment variables:
GOOGLE_SERVICE_ACCOUNT_KEY
GOOGLE_BUCKET
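For example, the serialized credentials can be loaded from a service account key file when setting the variable (a sketch; the file path and bucket name are placeholders):
export GOOGLE_SERVICE_ACCOUNT_KEY="$(cat /path/to/service-account.json)"
export GOOGLE_BUCKET="my-bucket"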
For Microsoft Azure Blob Storage, specify parameters in the configuration file or environment variables as well.
[[storages]]
type = "azure"
account_name = "AZURE_STORAGE_ACCOUNT_NAME"
access_key = "AZURE_STORAGE_ACCESS_KEY"
container = "my-container"
Likewise, for environment variables:
AZURE_STORAGE_ACCOUNT_NAME
AZURE_STORAGE_ACCESS_KEY
AZURE_CONTAINER
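For example (a sketch; the account name, key, and container are placeholders):
export AZURE_STORAGE_ACCOUNT_NAME="myaccount"
export AZURE_STORAGE_ACCESS_KEY="ACCESS_KEY"
export AZURE_CONTAINER="my-container"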
For WebDAV, specify the scheme and authority part of the URL (for example, https://server.com). When a data source location uses the http or https scheme and its authority matches a server defined here, DataFusion Server handles the access as WebDAV, an extension of HTTP.
[[storages]]
type = "webdav"
url = "https://server.com"
user = "USER"
password = "PASSWORD"
Likewise, for environment variables:
HTTP_URL
HTTP_USER
HTTP_PASSWORD
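For example (a sketch; the URL and credentials are placeholders):
export HTTP_URL="https://server.com"
export HTTP_USER="USER"
export HTTP_PASSWORD="PASSWORD"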
Please refer to the details of the data source definition here. The only thing that changes when using a data source on an object store is the location key.
[
  {
    "format": "csv",
    "name": "example",
    "location": "s3://my-bucket/example.csv",
    "options": {
      "hasHeader": true
    }
  }
]
In this example, a data source is defined to read and write “example.csv” from and to an S3 bucket. Similarly, reading and writing Parquet from and to Google Cloud Storage would look like this:
[
  {
    "format": "parquet",
    "name": "example",
    "location": "gs://my-bucket/path/to/example.parquet"
  }
]
The same applies to Microsoft Azure Blob Storage:
[
  {
    "format": "ndJson",
    "name": "example",
    "location": "az://my-container/path/to/example.json"
  }
]
In addition to az, commonly used schemes such as adl, abfs, and abfss can also be specified.
WebDAV needs a bit of explanation: from the location alone, it is not clear whether it refers to regular http(s) access or to WebDAV, which is an extension of HTTP.
[
  {
    "format": "avro",
    "name": "example",
    "location": "https://server.com/path/to/example.avro"
  }
]
If DataFusion Server has the following entry defined in the configuration file or environment variables, it treats access to server.com over http(s) as WebDAV. This includes issuing additional methods such as PROPFIND and adding basic authentication.
[[storages]]
type = "webdav"
url = "https://server.com"
user = "USER"
password = "PASSWORD"
The url defined in the configuration includes only the scheme and authority. Any path or query parameters are
completely ignored.