Discussion:
Clarification on Drill Options
John Omernik
2016-05-29 11:55:11 UTC
Hey all, when looking at the Drill options, and specifically while I was
trying to understand the Parquet options, I realized that the naming of the
options raised a question for me as I looked at them. What do I mean?
Consider:

+--------------------------------------------+
| name                                       |
+--------------------------------------------+
| store.parquet.block-size                   |
| store.parquet.compression                  |
| store.parquet.dictionary.page-size         |
| store.parquet.enable_dictionary_encoding   |
| store.parquet.page-size                    |
| store.parquet.use_new_reader               |
| store.parquet.vector_fill_check_threshold  |
| store.parquet.vector_fill_threshold        |
+--------------------------------------------+



So I will remove "store.parquet" as I refer to them here:


use_new_reader - This seems fairly obviously an "on read" option and
(maybe?) does not affect the Parquet writer, yet "enable_dictionary_encoding"
is likely ONLY an "on write" option... correct? I mean, if the Parquet file
was written somewhere else, and written with dictionary encoding, Drill
will still read it OK, regardless of this setting. The same goes for
compression: if the Parquet file was created with gzip and this setting is
snappy, Drill will still read it. Likewise for block size. Thus, those seem
to be "writer" settings rather than reader settings.
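
To make that reasoning concrete, here is a toy model in Python. It is not Drill code; the function names and the metadata dictionary are made up for illustration, and the option values are just examples. The one thing it relies on is that a Parquet file records its own compression codec and encodings, so a reader can take them from the file while a writer takes them from the options.

```python
# Toy model, not Drill code. All names below are hypothetical.
# A Parquet file records its own codec and encodings, so a reader uses what
# the file says, while a writer uses what the current options say.

SESSION_OPTIONS = {
    "store.parquet.compression": "snappy",              # example value
    "store.parquet.enable_dictionary_encoding": False,  # example value
}


def write_settings(options):
    """Settings a writer would stamp into a new file (taken from the options)."""
    return {
        "codec": options["store.parquet.compression"],
        "dictionary_encoded": options["store.parquet.enable_dictionary_encoding"],
    }


def read_settings(file_metadata, options):
    """Settings a reader would use: whatever the file declares.

    The options argument is deliberately ignored, which is the point.
    """
    return dict(file_metadata)


# A file produced elsewhere with gzip and dictionary encoding.
foreign_file = {"codec": "gzip", "dictionary_encoded": True}

print("read :", read_settings(foreign_file, SESSION_OPTIONS))
print("write:", write_settings(SESSION_OPTIONS))
```

In this model the read path never consults the options, which is what would make compression and enable_dictionary_encoding "writer" settings if the guess above is right.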


So what about the vector settings? Write or read (or both)? For JSON there
is the setting store.json.writer.uglify, whose name makes it obviously
writer-focused, but for other settings, knowing what the setting applies to
(on write, on read, neither, or both) could be very useful for
troubleshooting and for knowing which settings to play with.


Now, renaming these settings is not something I would recommend: even in my
test clusters, I have scripts that alter them for specific ETLs, and I
would hate to have things break. But how hard would it be to add a string
column to sys.options, something like "applies_to", with write, read, both,
neither, n/a as values? I think this could be valuable for users and
administrators of Drill.


One other note: in addition to applies_to, would it be horrifically
difficult to add a "description" field for options? Self-documenting
settings sure would be handy... :)
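
As a rough sketch of what the two proposed columns could look like to a user, here is a small Python model. Everything in it is hypothetical: OptionRow, ROWS and options_for are invented names, the applies_to values are only the guesses made earlier in this message (not confirmed Drill behavior), and the descriptions are placeholders. The vector settings are left out because their scope is exactly the open question.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OptionRow:
    """One hypothetical row of sys.options with the two proposed columns."""
    name: str
    applies_to: str   # proposed column: write, read, both, neither, n/a
    description: str  # proposed column: placeholder text only


# applies_to values are guesses from this thread, not verified facts.
ROWS = [
    OptionRow("store.parquet.block-size", "write", "(description goes here)"),
    OptionRow("store.parquet.compression", "write", "(description goes here)"),
    OptionRow("store.parquet.enable_dictionary_encoding", "write", "(description goes here)"),
    OptionRow("store.parquet.use_new_reader", "read", "(description goes here)"),
    OptionRow("store.json.writer.uglify", "write", "(description goes here)"),
]


def options_for(rows, scope):
    """Names of options relevant to a scope; 'both' matches read and write."""
    return [row.name for row in rows if row.applies_to in (scope, "both")]


print("read :", options_for(ROWS, "read"))
print("write:", options_for(ROWS, "write"))
```

With something like this, troubleshooting a slow read would mean looking only at the "read" and "both" rows instead of guessing from the option names.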


John
John Omernik
2016-05-31 17:51:51 UTC
I added a JIRA related to this:

https://issues.apache.org/jira/browse/DRILL-4699
John Omernik
2017-05-02 17:54:28 UTC
Looks like some work has been done here. Any chance we can move this along?

https://issues.apache.org/jira/browse/DRILL-4699


Thanks!