drill parquet - create table as ... partition by ... non present column

Discussion:

b***@laposte.net.INVALID

2018-12-05 15:39:23 UTC

In would like to create a parquet with a partition on computed data (without to have to put the result of the computation in the parquet) :
The goal is to optimize the parquet for typical expecting queries.

Imaginary example :
CREATE TABLE `mytable`
PARTITION BY (substr(name,1,1)) AS
SELECT name, birthdate, birthcity
ORDER BY bithdate;

So, if I do that I obtain a VALIDATION ERROR: Partition column ... is not in the SELECT list of CTAS

And the comment of the code of the function
"public static RelNode qualifyPartitionCol(RelNode input, List<String> partitionColumns)"
confirms that it's not possible actually :
" A partition column is resolved, either (1) the same column appear in the select list of CTAS or (2) CTAS has a * in select list"

But what is the reason of this limitation ?
Is there exists any tricks to do it right now, or can we expect an evolution to allow this possibilities.

I just imagine to do (with the data of the example)
CREATE TABLE `mytable`
PARTITION BY (sname) AS
SELECT substr(name,1,1) sname, name, birthdate, birthcity
ORDER BY bithdate;
Then, next, request each partition file to remove the useless data
, like
CREATE TABLE `mytable_2/partition_x`
SELECT name, birthdate, birthcity
ORDER BY bithdate;
but it's not really satisfying...

I would appreciate yours comments,
Regards,

benj

Anton Gozhiy

2018-12-06 15:42:10 UTC

Permalink

Hi Benj,

Creating partitions as in your first example won't work.
From the docs: "During partitioning, Drill creates separate files, but not
separate directories, for different partitions." (
https://drill.apache.org/docs/how-to-partition-data/).
Also, Drill doesn't write additional metadata regarding partitioning, when
it reads parquet files it determines partitions using min/max values.
That means that if you want for example to partition using the first
letter, you'll need to create a corresponding column. Or you can create
partitions manually as directories.

Post by b***@laposte.net.INVALID
In would like to create a parquet with a partition on computed data
The goal is to optimize the parquet for typical expecting queries.
CREATE TABLE `mytable`
PARTITION BY (substr(name,1,1)) AS
SELECT name, birthdate, birthcity
ORDER BY bithdate;
So, if I do that I obtain a VALIDATION ERROR: Partition column ... is not
in the SELECT list of CTAS
And the comment of the code of the function
"public static RelNode qualifyPartitionCol(RelNode input, List<String> partitionColumns)"
" A partition column is resolved, either (1) the same column appear in the
select list of CTAS or (2) CTAS has a * in select list"
But what is the reason of this limitation ?
Is there exists any tricks to do it right now, or can we expect an
evolution to allow this possibilities.
I just imagine to do (with the data of the example)
CREATE TABLE `mytable`
PARTITION BY (sname) AS
SELECT substr(name,1,1) sname, name, birthdate, birthcity
ORDER BY bithdate;
Then, next, request each partition file to remove the useless data
, like
CREATE TABLE `mytable_2/partition_x`
SELECT name, birthdate, birthcity
ORDER BY bithdate;
but it's not really satisfying...
I would appreciate yours comments,
Regards,
benj

--
Sincerely, Anton Gozhiy
***@gmail.com

benj.dev

2018-12-07 19:20:22 UTC

Permalink

Hi,

Thanks for details.
It's the point, I don't want to write additional metadata,
but just organize the parquet file to have more useful stats.

In a simple GROUP BY it's possible to not SELECT some of "grouped" column.
(Example SELECT a, b FROM ... GROUP BY a, b, c;)
In the same way, I think it will be useful to have the possibilities
to PARTITION BY and say (after the PARTITION BY) which columns put in
the parquet file.

I can imagine, that it will be difficult to add a "SELECT" close in
PARTITION BY clauses like :
[... PARTITION BY (a (SELECT ONLY b, c)) AS SELECT a, b, c FROM ...]

Intermediately, allow the possibilities to "PARTITION BY" a computation
on an existing column seem more accessible and should be useful.
example :
[... PARTITION BY (a < 10) AS SELECT a, b, c FROM ...]

Because it's useless to store the "a < 10" but it may be useful to filer
half of the parquet when request with condition on column a.

But maybe there is a fundamental reason to not allow these facilities.

Regards,
Benj

Post by Anton Gozhiy
Hi Benj,
Creating partitions as in your first example won't work.

Post by Anton Gozhiy
From the docs: "During partitioning, Drill creates separate files, but not

separate directories, for different partitions." (
https://drill.apache.org/docs/how-to-partition-data/).
Also, Drill doesn't write additional metadata regarding partitioning, when
it reads parquet files it determines partitions using min/max values.
That means that if you want for example to partition using the first
letter, you'll need to create a corresponding column. Or you can create
partitions manually as directories.

Post by Anton Gozhiy
In would like to create a parquet with a partition on computed data
The goal is to optimize the parquet for typical expecting queries.
CREATE TABLE `mytable`
PARTITION BY (substr(name,1,1)) AS
SELECT name, birthdate, birthcity
ORDER BY bithdate;
So, if I do that I obtain a VALIDATION ERROR: Partition column ... is not
in the SELECT list of CTAS
And the comment of the code of the function
"public static RelNode qualifyPartitionCol(RelNode input, List<String> partitionColumns)"
" A partition column is resolved, either (1) the same column appear in the
select list of CTAS or (2) CTAS has a * in select list"
But what is the reason of this limitation ?
Is there exists any tricks to do it right now, or can we expect an
evolution to allow this possibilities.
I just imagine to do (with the data of the example)
CREATE TABLE `mytable`
PARTITION BY (sname) AS
SELECT substr(name,1,1) sname, name, birthdate, birthcity
ORDER BY bithdate;
Then, next, request each partition file to remove the useless data
, like
CREATE TABLE `mytable_2/partition_x`
SELECT name, birthdate, birthcity
ORDER BY bithdate;
but it's not really satisfying...
I would appreciate yours comments,
Regards,
benj

Anton Gozhiy

2018-12-10 14:40:21 UTC

Permalink

Benj,

I meant that without the metadata Drill won't recognize files as
partitions. Although that should not be a problem for the optimization
mechanisms.

As for your original question, I think that it's rather not implemented
than an intended limitation. Feel free to submit a feature request to
Apache JIRA.

Post by benj.dev
Hi,
Thanks for details.
It's the point, I don't want to write additional metadata,
but just organize the parquet file to have more useful stats.
In a simple GROUP BY it's possible to not SELECT some of "grouped" column.
(Example SELECT a, b FROM ... GROUP BY a, b, c;)
In the same way, I think it will be useful to have the possibilities
to PARTITION BY and say (after the PARTITION BY) which columns put in
the parquet file.
I can imagine, that it will be difficult to add a "SELECT" close in
[... PARTITION BY (a (SELECT ONLY b, c)) AS SELECT a, b, c FROM ...]
Intermediately, allow the possibilities to "PARTITION BY" a computation
on an existing column seem more accessible and should be useful.
[... PARTITION BY (a < 10) AS SELECT a, b, c FROM ...]
Because it's useless to store the "a < 10" but it may be useful to filer
half of the parquet when request with condition on column a.
But maybe there is a fundamental reason to not allow these facilities.
Regards,
Benj

Post by Anton Gozhiy
Hi Benj,
Creating partitions as in your first example won't work.

Post by Anton Gozhiy
From the docs: "During partitioning, Drill creates separate files, but

not

Post by Anton Gozhiy
separate directories, for different partitions." (
https://drill.apache.org/docs/how-to-partition-data/).
Also, Drill doesn't write additional metadata regarding partitioning,

when

Post by Anton Gozhiy
it reads parquet files it determines partitions using min/max values.
That means that if you want for example to partition using the first
letter, you'll need to create a corresponding column. Or you can create
partitions manually as directories.

not

Post by Anton Gozhiy

Post by Anton Gozhiy
in the SELECT list of CTAS
And the comment of the code of the function
"public static RelNode qualifyPartitionCol(RelNode input, List<String>
partitionColumns)"
" A partition column is resolved, either (1) the same column appear in

the

Post by Anton Gozhiy

Post by Anton Gozhiy
select list of CTAS or (2) CTAS has a * in select list"
But what is the reason of this limitation ?
Is there exists any tricks to do it right now, or can we expect an
evolution to allow this possibilities.
I just imagine to do (with the data of the example)
CREATE TABLE `mytable`
PARTITION BY (sname) AS
SELECT substr(name,1,1) sname, name, birthdate, birthcity
ORDER BY bithdate;
Then, next, request each partition file to remove the useless data
, like
CREATE TABLE `mytable_2/partition_x`
SELECT name, birthdate, birthcity
ORDER BY bithdate;
but it's not really satisfying...
I would appreciate yours comments,
Regards,
benj

--
Sincerely, Anton Gozhiy
***@gmail.com