Definition Basics
The example shown below defines a data source for a CSV format file located at settings.server.data_dir
directory with consisting of two columns and including a header row.
[
{
"format": "csv",
"name": "example",
"location": "file:///example.csv",
"options": {
"hasHeader": true
},
"schema": [
{
"name": "id",
"dataType": "Integer",
"nullable": false
},
{
"name": "name",
"dataType": "String",
"nullable": true
}
]
}
]
This example specifies schema information, but the schema can be inferred using the inferSchemaRows
option.
[
{
"format": "csv",
"name": "example",
"location": "file:///example.csv",
"options": {
"hasHeader": true,
"inferSchemaRows": 100
}
}
]
This instructs to prefetch data up to the first 100 rows to determine the schema. Increasing the number of rows to infer can improve the accuracy of the schema information, but also raises the cost of inference.
If schema
is not defined, the schema is determined unconditionally using the default value of inferSchemaRows
, which is 100
. Therefore, the previous example can be simplified as follows.
[
{
"format": "csv",
"name": "example",
"location": "file:///example.csv",
"options": {
"hasHeader": true
}
}
]
For schema
, define the necessary number of columns according to their order (for CSV, it needs to match the data order in the file; for JSON, the keys must match the name
attribute). If the schema definition has fewer columns than the actual data, those extra columns will be ignored.
Details of each key in the schema
array:
Key | Value | Required |
---|---|---|
name | Column Name | True |
dataType | Specifies the type of data that a column can hold, such as String, Integer, Float, Boolean, etc. Each data type determines the nature of the data, how much space it occupies, and how the system processes it. | True |
nullable | Indicates whether the column allows null values or not. If true, the column can contain null values; if false, it cannot. If omitted, it as false . | False |
Although dataType
can be set to data types supported by Apache Arrow, not all are applicable. Below is a list of data types supported by DataFusion Server.
Data Type | Description |
---|---|
Boolean | |
Int8 | |
Int16 | |
Int32 | |
Int64 | |
UInt8 | |
UInt16 | |
UInt32 | |
UInt64 | |
Integer | Alias for Int64 |
Float16 | |
Float32 | |
Float64 | |
Float | Alias for Float64 |
Decimal128 | 128-bit width decimal value with maximum precision and scale is 38 digits |
Decimal256 | 256-bit width decimal value with maximum precision and scale is 72 digits |
Decimal | Alias for Decimal256 |
Date32 | Elapsed time since 1 January 1970 in days |
Date64 | Elapsed time since 00:00:00.000 on 1 January 1970 in milliseconds |
Date | Alias for Date64 |
Time32 | Elapsed time since midnight in seconds or milliseconds |
Time64 | Elapsed time since midnight in microseconds or nanoseconds |
Time | Alias for Time32 |
Timestamp | Counting the specific precisions from 00:00:00 on 1 January 1970 as UTC |
Duration | Measure of elapsed time in specific precisions |
Interval | “Calendar” based interval in YearMonth , DayTime , MonthDayNano |
String | variable length string in Unicode with UTF-8 encoding |
List | Array of some data type with variable length, Can be stored up to 2,147,483,647 elements |
LargeList | Array of some data type with variable length, Can be stored up to 9,223,372,036,854,775,807 elements |
Struct | Nested data types that contains a number of sub-fields |
Map | Map is a nestable key-value store |
Union | Can represent slots of differing types |
Data types removed in DataFusion Server v0.11.0
Data Type | Description |
---|---|
TimestampSecond | Counting the seconds from 00:00:00 on 1 January 1970 as UTC |
TimestampMicro | Counting the microseconds from 00:00:00.000000 on 1 January 1970 as UTC |
TimestampNano | Counting the nanoseconds from 00:00:00.000000000 on 1 January 1970 as UTC |
DurationSecond | Measure of elapsed time in seconds |
DurationMicro | Measure of elapsed time in microseconds |
DurationNano | Measure of elapsed time in nanoseconds |
A simple data type, such as Int64
, is defined as follows.
{
"schema": [
{
"name": "quantity",
"dataType": "Int64",
"nullable": true
}
]
}
Complex data types have different definitions individually. Let’s look at each one by one.
First, here are examples of Decimal128
and Decimal256
.
{
"schema": [
{
"name": "global_population",
"dataType": {
"Decimal256": {
"precision": 8,
"scale": 2
}
},
"nullable": false
}
]
}
- precision is the total number of digits
- scale is the number of digits past the decimal
In this example, it is possible to represent 123456.78
.
https://docs.rs/arrow-schema/51.0.0/arrow_schema/enum.DataType.html#variant.Decimal256
An example of a Timestamp
would look as follows.
{
"schema": [
{
"name": "accessed_at",
"dataType": {
"Timestamp": {
"unit": "Millisecond",
"timezone": "UTC"
}
},
"nullable": false
}
]
}
The possible precisions for unit
are Second
, Millisecond
, Microsecond
, and Nanosecond
. The timezone
is optional, and the default value when omitted is UTC
.
https://docs.rs/arrow-schema/51.0.0/arrow_schema/enum.DataType.html#variant.Timestamp
The definition of Duration
is similar to that of Timestamp
, but conceptually, it does not have a timezone.
{
"schema": [
{
"name": "processed_time",
"dataType": {
"Timestamp": {
"unit": "Microsecond"
}
},
"nullable": true
}
]
}
https://docs.rs/arrow-schema/51.0.0/arrow_schema/enum.DataType.html#variant.Duration
Interval
represents a period according to the calendar. The units that can be specified for unit
are YearMonth
, DayTime
, and MonthDayNano
.
{
"schema": [
{
"name": "construction_period",
"dataType": {
"Interval": {
"unit": "YearMonth"
}
},
"nullable": true
}
]
}
https://docs.rs/arrow-schema/51.0.0/arrow_schema/enum.DataType.html#variant.Interval
List
and LargeList
differ only in the maximum number of array elements they can hold.
{
"schema": [
{
"name": "purity_numbers",
"dataType": {
"List": "UInt64"
},
"nullable": true
}
]
}
Elements can be nested, and there is no logical upper limit to the depth of nesting.
{
"schema": [
{
"name": "trend_by_years",
"dataType": {
"LargeList": {
"List": "String"
}
},
"nullable": true
}
]
}
Nestable elements are not limited to List
; Map
, Struct
, and any other combinations can also be defined.
https://docs.rs/arrow-schema/51.0.0/arrow_schema/enum.DataType.html#variant.List
A Struct
is a composite of any combination of fields.
{
"schema": [
{
"name": "person",
"dataType": {
"Struct": [
{
"name": "name",
"dataType": "String",
"nullable": false
},
{
"name": "age",
"dataType": "UInt8",
"nullable": true
}
]
},
"nullable": true
}
]
}
https://docs.rs/arrow-schema/51.0.0/arrow_schema/enum.DataType.html#variant.Struct
A Map
is a key-value store. Any data type can be specified for key
and value
, and nesting is also possible here.
{
"schema": [
{
"name": "revenue_by_area",
"dataType": {
"Map": {
"key": "String",
"value": "Int64",
"ordered": false
}
},
"nullable": true
}
]
}
If ordered
is set to true
, the entries are stored sorted by key. ordered
is optional, and the default value when omitted is false
.
https://docs.rs/arrow-schema/51.0.0/arrow_schema/enum.DataType.html#variant.Map
Union
can represent slots of differing types.
{
"schema": [
{
"name": "age_or_name",
"dataType": {
"Union": {
"types": [
{"id": 1, "type": "Int32"},
{"id": 2, "type": "String"}
],
"mode": "Sparse"
}
},
"nullable": true
}
]
}
The id
array is used to indicate which type of union each Field
belongs to. The memory layout for Union
differs depending on whether the type is Dense
or Sparse
. In a dense union, all elements are contiguously placed in memory, whereas in a sparse union, they are not. The id
array serves as a mapping to show which field each piece of data belongs to.
https://docs.rs/arrow-schema/51.0.0/arrow_schema/enum.DataType.html#variant.Union
The parameters that can be specified in options
vary depending on the format
and the scheme of location
. Refer to the respective data source page for the options available in each format. Here, we will discuss options common to all formats.
Specifies whether to overwrite the table specified by name
if it already exists in the session context. The default is false
. Refer to information about the Session Context.
Although the schema has already been discussed, there is one additional point to note. If the format is parquet
, the schema is predefined, so any specifications via inferSchemaRows
or schema
are completely ignored.
While it’s not necessary to present another example, it should be noted that the data source definition is an array. Therefore, multiple data sources can be defined at once.
Here, shows a complete query sample that does not use a session with multiple data source definitions.
curl -X POST http://127.0.0.1:4000/dataframe/query \
-H 'Content-Type: application/json' \
-H 'Accept: text/csv' \
-d $'{
"dataSources": [
{
"name": "table1",
"format": "parquet",
"location": "file:///table1.parquet"
},
{
"name": "table2",
"format": "json",
"location": "file:///table2.json",
"options": {
"inferSchemaRows": 300
}
}
]
"query": {
"sql": "SELECT id, name, category FROM table1, table2 WHERE table1.category = table2.category ORDER BY id"
}
}'