11 minute read · October 11, 2022
Introducing MAP Data Type in Dremio
· Senior Product Manager, Dremio
Introduction
Data types in Dremio fall into two categories: primitive types such as INT and VARCHAR that hold single values, and semi-structured types like LIST and STRUCT that hold complex values.
Dremio Cloud as of late October and Dremio Software version 23.0 now have a semi-structured MAP data type that allows you to query map data from Apache Parquet files, Apache Iceberg, and Delta Lake. The MAP data type is a collection of key-value pairs and is useful for holding sparse data.
The remainder of this article describes how to query MAP data in Dremio.
Using the MAP Data Type
Getting Started
If you have queried tables with MAP data using an earlier release of Dremio Cloud or Dremio Software, your cached table metadata has STRUCT as the data type instead of MAP. You must run ALTER TABLE table_name FORGET METADATA
on those tables to refresh their metadata caches before you query them so that Dremio knows they have MAP rather than STRUCT. Otherwise, Dremio will continue classifying your data as STRUCT and give an error prompting you to reformat your dataset.
This feature is ON
by default. If prefer the old behavior of representing MAP data as STRUCT, set dremio.data_types.map.enabled
to OFF
under ⚙ (Settings) > Support > Support Keys.
For most data sources, you can define internal schemas that override the data types or names of columns that Dremio detected using ALTER TABLE table_name MODIFY COLUMN
or ALTER TABLE table_name RENAME COLUMN
, respectively. You will need to re-run ALTER TABLE table_name MODIFY COLUMN
or ALTER TABLE table_name RENAME COLUMN
to restore any internal schemas you defined earlier. This is because ALTER TABLE table_name FORGET METADATA
causes Dremio to forget internal schemas in addition to metadata caches.
Tip: The above steps are only necessary for old tables that you have queried at least once; Dremio will automatically detect MAP data in new tables.
Using the MAP Data Type
As mentioned above, the MAP data type is a collection of key-value pairs. MAP keys are case-insensitive strings. All values in a given map have the same type. For example, map<string, int>
represents a mapping where the keys are strings and the values are integers.
The section below is an overview of how you can query the MAP data type. Suppose you have a table called "company" with a column named "employees" that is map<string, string>
. You can retrieve the entire MAP column, which returns a collection of key-value pairs:
SELECT employees FROM company
You retrieve the value of MAP elements using column['key']
syntax:
SELECT employees['first_name'] FROM company
You can pass values of MAP elements as arguments to both scalar and aggregate functions:
SELECT UPPER(employees['first_name']) FROM company; SELECT MAX(employees['first_name']) FROM company;
You can use values of MAP elements in filters:
SELECT * FROM company WHERE employees['first_name'] = 'Fred'
If your JDBC client program expects JSON, you can convert from MAP to JSON using CAST
and CONVERT_TOJSON
like this:
SELECT CAST(CONVERT_TOJSON(employee) AS VARCHAR) FROM company
You can ORDER BY
and GROUP BY
elements of a MAP if they evaluate to primitive types such as INT or VARCHAR:
SELECT employees['first_name'], COUNT(*) FROM company GROUP BY employees['first_name'] ORDER BY employees['first_name']
In addition, the following functions are available for MAP expressions:
Signature | Description |
SIZE(M) | Returns number of elements in MAP expression M. |
MAP_KEYS(M) | Returns all keys from MAP expression M. |
MAP_VALUES(M) | Returns all values from MAP expression M. |
TYPEOF(M) | Returns "MAP" if M is a MAP expression. |
Supported Sources
Dremio can read MAP data from Parquet files from any of the sources and table formats that it can read Parquet, including filesystem sources, metastores, Iceberg, and Delta Lake.
Differences from STRUCT
Both STRUCT and MAP are semi-structured data types that hold collections of key-value pairs, but they differ in the following ways:
- MAP values can only be primitive types; STRUCT values can be any type. [This restriction was removed in Dremio v24.]
- All values in a given MAP must be the same type; STRUCT values can be of different types.
Example
An e-commerce store sells a variety of products. Each category has particular properties that need to be tracked. For example, clothing has size and color, but books have publication dates and cover types (hardcover vs. paperback). Instead of having a "products" table with dozens of columns — most of which are NULL
since they are not applicable — the table has a single MAP column named "properties," which is map<string, string>
.
The Python 3 script below uses the pyarrow library to create a Parquet file with sample data:
#!python import pyarrow as pa import pyarrow.parquet as pq description = ["Dictionary", "Dress", "Wagon"] properties = [ [("Cover", "Hardcover"), ("Publication Year", "2002"), ("Color", "Blue")], [("Size", "Medium"), ("Color", "Red")], [("Material", "Metal"), ("Color", "Red")], ] assert len(description) == len(properties) maparr = pa.array(properties, type=pa.map_(pa.string(), pa.string())) table = pa.table([description, maparr], names=["description", "properties"]) pq.write_table(table, "products.parquet")
Using the parquet-cli package, you can view the raw sample file with parquet cat products.parquet
:
{"description": "Dictionary", "properties": {"Publication Year": "2002", "Cover": "Hardcover", "Color": "Blue"}} {"description": "Dress", "properties": {"Size": "Medium", "Color": "Red"}} {"description": "Wagon", "properties": {"Color": "Red", "Material": "Metal"}}
To query the sample data, do the following in the Dremio SQL Runner:
- Select the "Datasets" icon in the navigation bar.
- Select ⊕ then "Upload File" in the upper right.
- Browse to
products.parquet
and select "Next." - Select "Save" to accept the auto-detected format of Parquet.
Now you can query the "products" table in the usual way. For example:
SELECT TYPEOF(properties) FROM products; -- MAP -- MAP -- MAP SELECT SIZE(properties) FROM products; -- 3 -- 2 -- 2 SELECT MAP_KEYS(properties) FROM products; -- ["Cover","Publication Year","Color"] -- ["Size","Color"] -- ["Material","Color"] SELECT MAP_VALUES(properties) FROM products; -- ["Hardcover","2002","Blue"] -- ["Medium","Red"] -- ["Metal","Red"] SELECT properties['Cover'] FROM products; -- Hardcover -- null -- null SELECT description FROM products WHERE properties['Color'] = 'Red'; -- Dress -- Wagon SELECT properties FROM products; -- { "Cover": "Hardcover", "Publication Year": "2002", "Color": "Blue" } -- { "Size": "Medium", "Color": "Red" } -- { "Material": "Metal", "Color": "Red" } SELECT properties['Color'], COUNT(*) FROM products GROUP BY properties['Color'] ORDER BY COUNT(*) DESC LIMIT 2 -- Red, 2 -- Blue, 1
Get Started with Dremio Cloud – It’s Free!
Dremio Cloud: The easy and open, fully managed data lakehouse platform.
Everything you need to build, automate, and query your data lakehouse in production.