b***@laposte.net.INVALID
2018-12-05 15:39:23 UTC
In would like to create a parquet with a partition on computed data (without to have to put the result of the computation in the parquet) :
The goal is to optimize the parquet for typical expecting queries.
Imaginary example :
CREATE TABLE `mytable`
PARTITION BY (substr(name,1,1)) AS
SELECT name, birthdate, birthcity
ORDER BY bithdate;
So, if I do that I obtain a VALIDATION ERROR: Partition column ... is not in the SELECT list of CTAS
And the comment of the code of the function
"public static RelNode qualifyPartitionCol(RelNode input, List<String> partitionColumns)"
confirms that it's not possible actually :
" A partition column is resolved, either (1) the same column appear in the select list of CTAS or (2) CTAS has a * in select list"
But what is the reason of this limitation ?
Is there exists any tricks to do it right now, or can we expect an evolution to allow this possibilities.
I just imagine to do (with the data of the example)
CREATE TABLE `mytable`
PARTITION BY (sname) AS
SELECT substr(name,1,1) sname, name, birthdate, birthcity
ORDER BY bithdate;
Then, next, request each partition file to remove the useless data
, like
CREATE TABLE `mytable_2/partition_x`
SELECT name, birthdate, birthcity
ORDER BY bithdate;
but it's not really satisfying...
I would appreciate yours comments,
Regards,
benj
The goal is to optimize the parquet for typical expecting queries.
Imaginary example :
CREATE TABLE `mytable`
PARTITION BY (substr(name,1,1)) AS
SELECT name, birthdate, birthcity
ORDER BY bithdate;
So, if I do that I obtain a VALIDATION ERROR: Partition column ... is not in the SELECT list of CTAS
And the comment of the code of the function
"public static RelNode qualifyPartitionCol(RelNode input, List<String> partitionColumns)"
confirms that it's not possible actually :
" A partition column is resolved, either (1) the same column appear in the select list of CTAS or (2) CTAS has a * in select list"
But what is the reason of this limitation ?
Is there exists any tricks to do it right now, or can we expect an evolution to allow this possibilities.
I just imagine to do (with the data of the example)
CREATE TABLE `mytable`
PARTITION BY (sname) AS
SELECT substr(name,1,1) sname, name, birthdate, birthcity
ORDER BY bithdate;
Then, next, request each partition file to remove the useless data
, like
CREATE TABLE `mytable_2/partition_x`
SELECT name, birthdate, birthcity
ORDER BY bithdate;
but it's not really satisfying...
I would appreciate yours comments,
Regards,
benj