Discussion:
Reading Parquet files with array or list columns
David Kincaid
2017-06-18 02:04:51 UTC
I'm having a problem querying Parquet files that were created from Spark
and have columns that are array or list types. When I do a SELECT on these
columns they show up like this:

{"list": [{"element": "My first value"}, {"element": "My second value"}]}

which Drill does not recognize as a REPEATED column and is not really
workable to hack around like I did in DRILL-5183 (
https://issues.apache.org/jira/browse/DRILL-5183). I can get to one value
using something like t.columnName.`list`.`element` but that's not really
feasible to use in a query.

The little I could find on this by Googling around led me to this document
on the Parquet format GitHub page:
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md. This
seems to say that Spark is writing these files correctly, but Drill is not
interpreting them properly.

Is there a workaround anyone can suggest for turning these columns into
values that Drill understands as repeated values? This is a fairly urgent
issue for us.

Thanks,

Dave
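
The nested shape in the SELECT output above matches the three-level LIST layout described in the Parquet LogicalTypes document: an outer group, a repeated group named `list`, and a field named `element`. As a rough illustration only, the sketch below collapses that wrapper back into a plain array once rows have been exported to ordinary Python objects (for example from JSON). It does not run inside Drill, and the function name `unwrap_parquet_list` is hypothetical.

```python
from typing import Any


def unwrap_parquet_list(value: Any) -> Any:
    """Recursively collapse {"list": [{"element": x}, ...]} into [x, ...]."""
    if isinstance(value, dict):
        # A three-level LIST appears as a single-key map named "list"
        # whose entries are single-key maps named "element".
        if set(value) == {"list"} and isinstance(value["list"], list):
            return [
                unwrap_parquet_list(item["element"])
                if isinstance(item, dict) and set(item) == {"element"}
                else unwrap_parquet_list(item)
                for item in value["list"]
            ]
        # Any other map is kept, but its values are unwrapped too.
        return {key: unwrap_parquet_list(val) for key, val in value.items()}
    if isinstance(value, list):
        return [unwrap_parquet_list(item) for item in value]
    return value


row = {"list": [{"element": "My first value"}, {"element": "My second value"}]}
flat = unwrap_parquet_list(row)
print(flat)  # ['My first value', 'My second value']
```

The check on the exact key sets is deliberate: a map that merely happens to contain a `list` key alongside other keys is left alone, on the assumption that only the bare wrapper should be rewritten.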
François Méthot
2017-06-30 17:06:44 UTC
Hi,

Have you tried:
select column['list'][0]['element'] from ...
This should return "My first value".

Or try:
select flatten(column['list'])['element'] from ...

Hope it helps. In our data we have a column that looks like this:
[{"NAME:":"Aname", "DATA":"thedata"},{"NAME:":"Aname2", "DATA":"thedata2"},.....]

We ended up writing a custom function to do the lookup instead of using the
costly flatten technique.

Francois
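
To make the trade-off between flattening and a lookup concrete, here is a minimal Python sketch of the two approaches on data shaped like the example in the message above. It is not the custom function mentioned there (the thread does not show it, and a real Drill function would be written against Drill's own UDF interface). The names `flatten_rows`, `lookup`, `id`, and `attrs` are hypothetical, and the keys are assumed to be `NAME` and `DATA` without the trailing colon that appears in the message.

```python
from typing import Any, Dict, Iterator, List, Optional

Row = Dict[str, Any]


def flatten_rows(rows: List[Row], column: str) -> Iterator[Row]:
    """Emit one output row per array element, copying the other fields."""
    for row in rows:
        for element in row[column]:
            out = dict(row)        # shallow copy of the parent row
            out[column] = element  # replace the array with one element
            yield out


def lookup(entries: List[Dict[str, str]], name: str) -> Optional[str]:
    """Return DATA of the first entry whose NAME matches, else None."""
    for entry in entries:
        if entry.get("NAME") == name:
            return entry.get("DATA")
    return None


rows = [
    {"id": 1, "attrs": [{"NAME": "Aname", "DATA": "thedata"},
                        {"NAME": "Aname2", "DATA": "thedata2"}]},
]

flattened = list(flatten_rows(rows, "attrs"))
found = lookup(rows[0]["attrs"], "Aname2")
print(len(flattened), found)  # 2 thedata2
```

Flattening multiplies the row count by the array length before any filter is applied, whereas the lookup scans each array in place and returns a single value per row, which is presumably why it was the cheaper option here.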
David Kincaid
2017-06-30 17:41:07 UTC
As far as I was able to discern, it is not possible to actually use this
column as an array in Drill at all. It just does not read the Parquet
correctly. I filed a very similar defect in Jira back in January
that has had no attention at all, so we are moving on to other tools. I
understand Drill is free and no one developing it owes me anything. It's
just not going to work for us without proper support for nested objects in
Parquet format.

Thanks for the reply though. It's much appreciated to have some
acknowledgment that I raised a valid issue.

- Dave
rahul challapalli
2017-06-30 18:38:52 UTC
As I suggested in the comment on DRILL-5183, can you try using a view as
a workaround until the issue gets resolved?
David Kincaid
2017-06-30 18:46:38 UTC
The view only works for the first example in the Jira I created. That was
the workaround we have been using since January.

Recently we've had a use case where we run a Spark script to
pre-join some data before we try to use it in Drill. That was the subject
of the initial e-mail in this thread and the topic of the comment I made in
the Jira on 6/17. As far as I've been able to tell, there isn't a similar
workaround for this case that will make the column appear as an array.

Note, I tried to use Drill to do that pre-join of the Parquet data using
CTAS, but it ran for about 4 hours then crashed. The Spark script to do it
runs in 14 minutes successfully.

- Dave

rahul challapalli
2017-06-30 18:50:53 UTC
Hmm... I too see no simple workaround for the second case. Can you also
file a Jira for the CTAS case? Drill could have been running short on heap
memory.

- Rahul