Discussion:
drill parquet - create table as ... partition by ... non present column
b***@laposte.net.INVALID
2018-12-05 15:39:23 UTC
Permalink
In would like to create a parquet with a partition on computed data (without to have to put the result of the computation in the parquet) :
The goal is to optimize the parquet for typical expecting queries.

Imaginary example :
CREATE TABLE `mytable`
PARTITION BY (substr(name,1,1)) AS
SELECT name, birthdate, birthcity
ORDER BY bithdate;

So, if I do that I obtain a VALIDATION ERROR: Partition column ... is not in the SELECT list of CTAS

And the comment of the code of the function
"public static RelNode qualifyPartitionCol(RelNode input, List<String> partitionColumns)"
confirms that it's not possible actually :
" A partition column is resolved, either (1) the same column appear in the select list of CTAS or (2) CTAS has a * in select list"

But what is the reason of this limitation ?
Is there exists any tricks to do it right now, or can we expect an evolution to allow this possibilities.

I just imagine to do (with the data of the example)
CREATE TABLE `mytable`
PARTITION BY (sname) AS
SELECT substr(name,1,1) sname, name, birthdate, birthcity
ORDER BY bithdate;
Then, next, request each partition file to remove the useless data
, like
CREATE TABLE `mytable_2/partition_x`
SELECT name, birthdate, birthcity
ORDER BY bithdate;
but it's not really satisfying...

I would appreciate yours comments,
Regards,

benj
Anton Gozhiy
2018-12-06 15:42:10 UTC
Permalink
Hi Benj,

Creating partitions as in your first example won't work.
From the docs: "During partitioning, Drill creates separate files, but not
separate directories, for different partitions." (
https://drill.apache.org/docs/how-to-partition-data/).
Also, Drill doesn't write additional metadata regarding partitioning, when
it reads parquet files it determines partitions using min/max values.
That means that if you want for example to partition using the first
letter, you'll need to create a corresponding column. Or you can create
partitions manually as directories.
Post by b***@laposte.net.INVALID
In would like to create a parquet with a partition on computed data
The goal is to optimize the parquet for typical expecting queries.
CREATE TABLE `mytable`
PARTITION BY (substr(name,1,1)) AS
SELECT name, birthdate, birthcity
ORDER BY bithdate;
So, if I do that I obtain a VALIDATION ERROR: Partition column ... is not
in the SELECT list of CTAS
And the comment of the code of the function
"public static RelNode qualifyPartitionCol(RelNode input, List<String> partitionColumns)"
" A partition column is resolved, either (1) the same column appear in the
select list of CTAS or (2) CTAS has a * in select list"
But what is the reason of this limitation ?
Is there exists any tricks to do it right now, or can we expect an
evolution to allow this possibilities.
I just imagine to do (with the data of the example)
CREATE TABLE `mytable`
PARTITION BY (sname) AS
SELECT substr(name,1,1) sname, name, birthdate, birthcity
ORDER BY bithdate;
Then, next, request each partition file to remove the useless data
, like
CREATE TABLE `mytable_2/partition_x`
SELECT name, birthdate, birthcity
ORDER BY bithdate;
but it's not really satisfying...
I would appreciate yours comments,
Regards,
benj
--
Sincerely, Anton Gozhiy
***@gmail.com
benj.dev
2018-12-07 19:20:22 UTC
Permalink
Hi,

Thanks for details.
It's the point, I don't want to write additional metadata,
but just organize the parquet file to have more useful stats.

In a simple GROUP BY it's possible to not SELECT some of "grouped" column.
(Example SELECT a, b FROM ... GROUP BY a, b, c;)
In the same way, I think it will be useful to have the possibilities
to PARTITION BY and say (after the PARTITION BY) which columns put in
the parquet file.

I can imagine, that it will be difficult to add a "SELECT" close in
PARTITION BY clauses like :
[... PARTITION BY (a (SELECT ONLY b, c)) AS SELECT a, b, c FROM ...]

Intermediately, allow the possibilities to "PARTITION BY" a computation
on an existing column seem more accessible and should be useful.
example :
[... PARTITION BY (a < 10) AS SELECT a, b, c FROM ...]

Because it's useless to store the "a < 10" but it may be useful to filer
half of the parquet when request with condition on column a.

But maybe there is a fundamental reason to not allow these facilities.

Regards,
Benj
Post by Anton Gozhiy
Hi Benj,
Creating partitions as in your first example won't work.
Post by Anton Gozhiy
From the docs: "During partitioning, Drill creates separate files, but not
separate directories, for different partitions." (
https://drill.apache.org/docs/how-to-partition-data/).
Also, Drill doesn't write additional metadata regarding partitioning, when
it reads parquet files it determines partitions using min/max values.
That means that if you want for example to partition using the first
letter, you'll need to create a corresponding column. Or you can create
partitions manually as directories.
Post by Anton Gozhiy
In would like to create a parquet with a partition on computed data
The goal is to optimize the parquet for typical expecting queries.
CREATE TABLE `mytable`
PARTITION BY (substr(name,1,1)) AS
SELECT name, birthdate, birthcity
ORDER BY bithdate;
So, if I do that I obtain a VALIDATION ERROR: Partition column ... is not
in the SELECT list of CTAS
And the comment of the code of the function
"public static RelNode qualifyPartitionCol(RelNode input, List<String> partitionColumns)"
" A partition column is resolved, either (1) the same column appear in the
select list of CTAS or (2) CTAS has a * in select list"
But what is the reason of this limitation ?
Is there exists any tricks to do it right now, or can we expect an
evolution to allow this possibilities.
I just imagine to do (with the data of the example)
CREATE TABLE `mytable`
PARTITION BY (sname) AS
SELECT substr(name,1,1) sname, name, birthdate, birthcity
ORDER BY bithdate;
Then, next, request each partition file to remove the useless data
, like
CREATE TABLE `mytable_2/partition_x`
SELECT name, birthdate, birthcity
ORDER BY bithdate;
but it's not really satisfying...
I would appreciate yours comments,
Regards,
benj
Anton Gozhiy
2018-12-10 14:40:21 UTC
Permalink
Benj,

I meant that without the metadata Drill won't recognize files as
partitions. Although that should not be a problem for the optimization
mechanisms.

As for your original question, I think that it's rather not implemented
than an intended limitation. Feel free to submit a feature request to
Apache JIRA.
Post by benj.dev
Hi,
Thanks for details.
It's the point, I don't want to write additional metadata,
but just organize the parquet file to have more useful stats.
In a simple GROUP BY it's possible to not SELECT some of "grouped" column.
(Example SELECT a, b FROM ... GROUP BY a, b, c;)
In the same way, I think it will be useful to have the possibilities
to PARTITION BY and say (after the PARTITION BY) which columns put in
the parquet file.
I can imagine, that it will be difficult to add a "SELECT" close in
[... PARTITION BY (a (SELECT ONLY b, c)) AS SELECT a, b, c FROM ...]
Intermediately, allow the possibilities to "PARTITION BY" a computation
on an existing column seem more accessible and should be useful.
[... PARTITION BY (a < 10) AS SELECT a, b, c FROM ...]
Because it's useless to store the "a < 10" but it may be useful to filer
half of the parquet when request with condition on column a.
But maybe there is a fundamental reason to not allow these facilities.
Regards,
Benj
Post by Anton Gozhiy
Hi Benj,
Creating partitions as in your first example won't work.
Post by Anton Gozhiy
From the docs: "During partitioning, Drill creates separate files, but
not
Post by Anton Gozhiy
separate directories, for different partitions." (
https://drill.apache.org/docs/how-to-partition-data/).
Also, Drill doesn't write additional metadata regarding partitioning,
when
Post by Anton Gozhiy
it reads parquet files it determines partitions using min/max values.
That means that if you want for example to partition using the first
letter, you'll need to create a corresponding column. Or you can create
partitions manually as directories.
Post by Anton Gozhiy
In would like to create a parquet with a partition on computed data
The goal is to optimize the parquet for typical expecting queries.
CREATE TABLE `mytable`
PARTITION BY (substr(name,1,1)) AS
SELECT name, birthdate, birthcity
ORDER BY bithdate;
So, if I do that I obtain a VALIDATION ERROR: Partition column ... is
not
Post by Anton Gozhiy
Post by Anton Gozhiy
in the SELECT list of CTAS
And the comment of the code of the function
"public static RelNode qualifyPartitionCol(RelNode input, List<String>
partitionColumns)"
" A partition column is resolved, either (1) the same column appear in
the
Post by Anton Gozhiy
Post by Anton Gozhiy
select list of CTAS or (2) CTAS has a * in select list"
But what is the reason of this limitation ?
Is there exists any tricks to do it right now, or can we expect an
evolution to allow this possibilities.
I just imagine to do (with the data of the example)
CREATE TABLE `mytable`
PARTITION BY (sname) AS
SELECT substr(name,1,1) sname, name, birthdate, birthcity
ORDER BY bithdate;
Then, next, request each partition file to remove the useless data
, like
CREATE TABLE `mytable_2/partition_x`
SELECT name, birthdate, birthcity
ORDER BY bithdate;
but it's not really satisfying...
I would appreciate yours comments,
Regards,
benj
--
Sincerely, Anton Gozhiy
***@gmail.com
Loading...