Discussion:
Parquet filter pushdown and string fields that use dictionary encoding
Stefán Baxter
2017-05-29 20:41:30 UTC
Hi,

I would like to verify that my understanding of parquet filter pushdown in
Drill (https://drill.apache.org/docs/parquet-filter-pushdown/) is correct.

Is it correctly understood that Drill does not support predicate push-down
for string fields when dictionary-based string encoding is enabled? (It
looks like Presto can do this.)

We save a lot of space using dictionary encoding (not enabled by default in
Drill 1.10), and if my understanding of how it works is correct, the segment
dictionary could be used to determine whether a value is present in a
segment, or whether the segment can be pruned/skipped, when filtering on
columns that are compressed/encoded with a dictionary.

I may be misunderstanding how this works, and perhaps the dictionary is
created for the file as a whole rather than for individual sections, but I
know that min/max values would not be a good way to determine whether a
segment scan is needed.

I was hoping we could use partitioning on field(s) with lower cardinality
to create partitions for typical partition pruning and then sort the
contents of individual fields by session/customer IDs (which include
alphanumeric characters here) so that segments would only contain a
relatively low number of those unique values to facilitate "segment
pruning" when looking for data belonging to individual sessions/customers.
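The layout idea above (partition pruning, plus sorting within partitions so each segment covers a narrow ID range) can be sketched in plain Python. Everything here is hypothetical for illustration — the field names, segment size, and the `write_layout`/`lookup` helpers are invented, not Drill or Parquet code:

```python
# Hypothetical sketch: partition on a low-cardinality field, sort each
# partition by session ID before writing, then cut it into fixed-size
# "segments" (row groups). Each segment then covers a narrow, contiguous
# ID range, so a lookup for one session skips almost all of them.
import itertools
import random

random.seed(0)
rows = [{"country": random.choice(["is", "uk"]),
         "session_id": f"s{random.randrange(1000):04d}"}
        for _ in range(10_000)]

SEGMENT_SIZE = 1_000

def write_layout(rows):
    """Group rows by partition key, sort by session_id, cut into segments."""
    layout = {}
    ordered = sorted(rows, key=lambda r: (r["country"], r["session_id"]))
    for country, part in itertools.groupby(ordered, key=lambda r: r["country"]):
        part = list(part)
        segments = [part[i:i + SEGMENT_SIZE]
                    for i in range(0, len(part), SEGMENT_SIZE)]
        # Store (min_id, max_id, rows) per segment, like row-group statistics.
        layout[country] = [(seg[0]["session_id"], seg[-1]["session_id"], seg)
                           for seg in segments]
    return layout

def lookup(layout, country, session_id):
    """Partition pruning, then per-segment min/max pruning."""
    hits, segments_read = [], 0
    for lo, hi, seg in layout[country]:      # other partitions never touched
        if lo <= session_id <= hi:           # skip segments outside the range
            segments_read += 1
            hits += [r for r in seg if r["session_id"] == session_id]
    return hits, segments_read

layout = write_layout(rows)
target = next(r["session_id"] for r in rows if r["country"] == "is")
hits, segments_read = lookup(layout, "is", target)
assert hits and segments_read <= 2           # most segments were skipped
```

Because the IDs are contiguous after the sort, a single session can span at most two adjacent segments, regardless of how many segments the partition holds.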

Best regards,
-Stefán Baxter
Kunal Khatua
2017-05-31 17:55:37 UTC
Even though filter pushdown is supported in Drill, it is limited to pushdown of numeric values, including dates. We do not support pushdown of varchar because of this bug in the Parquet library:

https://issues.apache.org/jira/browse/PARQUET-686

This comparison-correctness issue is what makes relying on the Parquet library's min/max statistics unreliable.
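As a rough, hypothetical illustration of the PARQUET-686 problem (plain Python, not parquet-mr code): older parquet-mr computed binary min/max statistics with a signed byte comparator, so statistics for strings containing bytes >= 0x80 — any non-ASCII UTF-8 — can be wrong, and pruning against them can silently drop matching rows:

```python
# Simulate the buggy signed-byte ordering against correct unsigned ordering.

def signed_order(b: bytes) -> bytes:
    """Sort key that orders bytes as signed 8-bit values (the buggy order)."""
    return bytes((x + 128) % 256 for x in b)

values = [s.encode("utf-8") for s in ["apple", "zebra", "éclair"]]
# 'éclair' starts with byte 0xC3, which a signed comparator reads as -61.

true_min, true_max = min(values), max(values)     # unsigned, correct
stat_min = min(values, key=signed_order)          # what the bug stored
stat_max = max(values, key=signed_order)

assert true_max == "éclair".encode("utf-8")       # correct max
assert stat_max == b"zebra"                       # buggy max

# A planner that trusts the buggy stats but compares unsigned would prune
# a row group that actually contains the value:
needle = "éclair".encode("utf-8")
row_group_pruned = not (stat_min <= needle <= stat_max)
assert row_group_pruned                           # matching rows silently lost
```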


Stefán Baxter
2017-06-01 00:08:23 UTC
Thank you Kunal.

Can you please explain to me why min/max values would be relevant for
dictionary-encoded fields? (I think I may be completely misunderstanding
how they work.)

Regards,
-Stefán
Kunal Khatua
2017-06-01 00:47:53 UTC
I might not be completely accurate, but the min/max technique lets you figure out whether a string-based filter value potentially exists in a row group (currently, Drill doesn't check at the page level). The comparison might be incorrect in cases where the bytes of the text are not interpreted as unsigned bytes. Drill applies the Parquet filter pushdown at planning time.


However, for dictionary-encoded fields, the reader/scanner would need to decode the dictionary page to identify whether a filter condition's value is present in the subsequent data pages. This would (most likely) have to happen at execution time, and I don't believe Drill does that yet.
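A tiny sketch of what execution-time dictionary pruning could look like (hypothetical plain Python, not Drill code — the names and structure here are invented): when a column chunk is fully dictionary-encoded, one lookup in its dictionary page decides whether any of its data pages need to be decoded at all:

```python
# Each row group's dictionary page for a column is, in effect, the set of
# distinct values in that column chunk.
row_groups = [
    {"dictionary": {"alice", "bob"},  "rows": 100_000},
    {"dictionary": {"carol", "dave"}, "rows": 100_000},
    {"dictionary": {"erin", "frank"}, "rows": 100_000},
]

def groups_to_scan(row_groups, filter_value):
    """Return indices of row groups whose data pages must be decoded."""
    scan = []
    for i, rg in enumerate(row_groups):
        # One small dictionary-page lookup replaces decoding ~100k rows.
        if filter_value in rg["dictionary"]:
            scan.append(i)
    return scan

print(groups_to_scan(row_groups, "carol"))    # [1]
print(groups_to_scan(row_groups, "mallory"))  # []
```

Unlike min/max pruning, this membership test is exact, so it works regardless of how the raw bytes compare.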



Jinfeng Ni
2017-06-01 04:46:23 UTC
Kunal is correct that Drill currently supports filter pruning at the Parquet
row-group level, using min/max statistics. That support is limited to
numeric/timestamp types, due to the potentially corrupted varchar min/max
issue Kunal mentioned.

For now, Drill does not support dictionary-based pruning. It would be great
if someone in the community could contribute to make it happen. That would
probably require substantial work in the Parquet reader at execution time.
Stefán Baxter
2017-06-01 06:22:45 UTC
Hi Jinfeng,

Netflix already has this working in Presto with the current Parquet version,
so the fundamentals are all there.

I wish we had the resources to do this ourselves, as this is massively
important to us, and I would think the performance gain is substantial
enough that it would be of high value to others as well.

If it helps, I would be more than happy to commission this work, or offer a
bounty if that is considered appropriate, and then assign resources to such
problems as soon as we can.

Regards,
-Stefán