Discussion:
Partitioning for parquet
Raz Baluchi
2017-05-31 20:28:31 UTC
Hi all,

Trying to understand how parquet partitioning works.

What is the recommended partitioning scheme for event data that will be
queried primarily by date? I assume that partitioning by year and month
would be optimal?

Let's say I have data that looks like:

application,status,date,message
kafka,down,2017-03-23 04:53,zookeeper is not available


Would I have to create new columns for year and month?

e.g.
application,status,date,message,year,month
kafka,down,2017-03-23 04:53,zookeeper is not available,2017,03

and then perform a CTAS using the year and month columns as the 'partition
by'?

Thanks
rahul challapalli
2017-05-31 20:49:41 UTC
How to partition data depends on how you want to access your data. If
you can foresee that most of the queries will use year and month, then go ahead
and partition the data on those two columns. You can do that like below:

create table partitioned_data partition by (yr, mnth) as
select extract(year from `date`) yr, extract(month from `date`) mnth, `date`, ........
from mydata;

For partitioning to have any benefit, your queries should have filters on
the month and year columns.
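
For example, a query like the following (using the partitioned_data table sketched above) would let the planner prune partitions:

select * from partitioned_data
where yr = 2017 and mnth = 3
and `date` between '2017-03-01' and '2017-03-31';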

- Rahul
Lee, David
2017-05-31 21:12:17 UTC
In addition to partitioning, I would also make subdirectories by year and then month if that is what you are partitioning on. Apache Spark doesn't use parquet metadata; it depends on subdirectory names for its partitioning scheme, so this matters if you want to use your parquet files across multiple platforms.

http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery

Table partitioning is a common optimization approach used in systems like Hive. In a partitioned table, data are usually stored in different directories, with partitioning column values encoded in the path of each partition directory.
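
As a sketch, suppose the files are laid out in a hypothetical directory tree like /events/2017/03/0_0_0.parquet. Drill can then prune on the directory names through its dir0/dir1 pseudo-columns:

select * from dfs.`/events`
where dir0 = '2017' and dir1 = '03';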

Raz Baluchi
2017-05-31 22:22:23 UTC
So, if I understand you correctly, I would have to include the 'yr' and
'mnth' columns in addition to the 'date' column in the query?

e.g.

select * from events where yr in (2016, 2017) and mnth in (11,12,1) and
date between '2016-11-11' and '2017-01-23';

Is that correct?

rahul challapalli
2017-05-31 22:42:06 UTC
If most of your queries use the date column in the filter condition, I would
partition the data on the date column. Then you can simply say:

select * from events where `date` between '2016-11-11' and '2017-01-23';
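
A minimal sketch of the corresponding date-partitioned CTAS, assuming the same mydata source and the column names from the sample data earlier in the thread:

create table events partition by (`date`) as
select `date`, application, status, message
from mydata;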

- Rahul
Raz Baluchi
2017-06-01 01:15:50 UTC
As an experiment, I created an event file with 100 million entries spanning
25 years. I then created tables both ways, one partitioned by year and
month and the other by date. The first table created 410 parquet files and
the second 11837.

Querying the first table is consistently faster, by a factor of 2x to 10x.

Is this because Drill is not very efficient at querying a large number of
small(ish) parquet files?

Padma Penumarthy
2017-06-01 04:33:29 UTC
Are you running the same query on both tables? What is the filter condition?
Since they are partitioned differently, the same filter may prune the files differently.
If possible, can you share the query profiles?
You can check the query profiles to see how many rows are being read from disk
in both cases.

Thanks,
Padma
Jinfeng Ni
2017-06-01 05:05:03 UTC
You may want to check whether the query on the second table is slower because of
planning time or execution time. That can be determined by looking at the
query profile in the web UI.

Two factors might impact the planning time for the second table, which has 11837 files:
1. Reading parquet metadata from those parquet files. A parquet metadata
cache file might help for the case of a large number of small files.
2. Filter expression evaluation cost: the query on the second table would evaluate
the expression 11837 times, vs. just 410 times for the first table.

In general, if you have 100M rows in 11837 files, that's about 8500
rows per file. Performance-wise, this does not seem to be a good choice for
the parquet format.
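
If you do end up with many small files, the parquet metadata cache file mentioned in point 1 can be built ahead of time. A sketch, assuming the date-partitioned table is named events:

refresh table metadata events;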
Raz Baluchi
2017-06-01 13:55:46 UTC
I guess there is such a thing as over-partitioning...

The query on the table partitioned by date spends most of the elapsed time
in the 'planning' phase, with the execution time being roughly equal to that of
the table partitioned by year and month.

Based on these results, I've added a third table which is partitioned
simply by year. I've also added an ORDER BY to the CTAS in an attempt to
sort the table by date.

This third table seems to have the fastest query times so far, with the
least amount of 'planning'. My takeaway from this exercise is to limit
the partitioning to the minimum required to obtain parquet files in the
range of 100 MB or so. Is that a valid lesson learned?
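
For reference, a sketch of that third CTAS (table and column names assumed from earlier in the thread):

create table events_by_year partition by (yr) as
select extract(year from `date`) yr, `date`, application, status, message
from mydata
order by `date`;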
Jinfeng Ni
2017-06-01 17:31:01 UTC
Looks like the default parquet block size is 512MB [1]. There is an ongoing
patch which may put a single parquet block into a single file system
block [2].

In general, if you are doing pruning over a large number of small parquet
files, the filter evaluation during pruning may become a bottleneck. Unlike
filter evaluation at execution time, which uses run-time generated code and
is more efficient, filter evaluation at planning time uses an interpreter and
is less efficient. Partition pruning only makes sense when each partition
contains a large number of rows.

1. https://drill.apache.org/docs/parquet-format/#configuring-the-size-of-parquet-files
2. https://issues.apache.org/jira/browse/DRILL-5379
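
The block size is configurable per [1]; a sketch that sets it explicitly to the 512MB default (536870912 bytes):

alter system set `store.parquet.block-size` = 536870912;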
Andries Engelbrecht
2017-06-01 18:07:13 UTC
Sorting the data by the partition column in the CTAS is normally a good plan: not only does it sort the output by the most likely filter column, it also limits the number of parquet files by writing a single stream per partition. Without a sort operator, Drill can write data per fragment per partition.

And as Jinfeng mentioned, metadata caching is very helpful on large data sets.

There is some info available on partition strategies for Drill on parquet to optimize performance.

--Andries

