Reading Parquet files with array or list columns
David Kincaid
2017-06-18 02:04:51 UTC
I'm having a problem querying Parquet files that were created from Spark
and have columns that are array or list types. When I do a SELECT on these
columns they show up like this:

{"list": [{"element": "My first value"}, {"element": "My second value"}]}

Drill does not recognize this as a REPEATED column, and hacking around it
the way I did in DRILL-5183 (
https://issues.apache.org/jira/browse/DRILL-5183) is not really workable
here. I can get to one value using something like
t.columnName.`list`.`element`, but that is not feasible to use in a query.
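
To make the shape of the data concrete, here is a minimal Python sketch of
the unwrapping I would like Drill to do for me. It is not a Drill
workaround. The function name unwrap_list is hypothetical, and the sketch
assumes the column value has already been parsed from the JSON form shown
above:

```python
import json


def unwrap_list(value):
    """Flatten {"list": [{"element": x}, ...]} into a plain list [x, ...].

    Assumes the wrapper keys are exactly "list" and "element", as in the
    output shown above. A null column value is passed through as None.
    """
    if value is None:
        return None
    return [item.get("element") for item in value.get("list", [])]


# The value Drill currently returns for one row of the array column.
raw = '{"list": [{"element": "My first value"}, {"element": "My second value"}]}'

print(unwrap_list(json.loads(raw)))
# prints ['My first value', 'My second value']
```

The flat list on the last line is what I would expect a REPEATED column to
look like.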

The little I could find on this by searching led me to this document in the
Parquet format GitHub repository:
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md. It
seems to say that Spark is writing these files correctly, but Drill is not
interpreting them properly.

Does anyone know of a workaround that turns these columns into values that
Drill understands as repeated values? This is a fairly urgent issue for
us.