In addition to partitioning, I would also create subdirectories by year and then month, if that is what you are partitioning on. Apache Spark doesn't use the Parquet metadata for partitioning; it relies on subdirectory names for its partitioning scheme, which matters if you want to use your Parquet files across multiple platforms.
Table partitioning is a common optimization approach used in systems like Hive. In a partitioned table, data are usually stored in different directories, with partitioning column values encoded in the path of each partition directory.
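For illustration, a Hive-style layout for event data partitioned by year and month might look like the sketch below (the root directory and file names are hypothetical; Drill writes its own file names):

```
events/
├── yr=2016/
│   └── mnth=12/
│       └── 0_0_0.parquet
└── yr=2017/
    ├── mnth=4/
    │   └── 0_0_0.parquet
    └── mnth=5/
        └── 0_0_0.parquet
```

Because the column name and value are encoded in each directory name (yr=2017, mnth=5), engines such as Spark and Hive can discover the partitioning scheme from the paths alone.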
From: rahul challapalli [mailto:***@gmail.com]
Sent: Wednesday, May 31, 2017 1:50 PM
To: user <***@drill.apache.org>
Subject: Re: Partitioning for parquet
How to partition data depends on how you want to access it. If you can foresee that most of the queries use year and month, then go ahead and partition the data on those two columns. You can do that as below:

create table partitioned_data partition by (yr, mnth) as
select extract(year from `date`) yr, extract(month from `date`) mnth, `date`, ........
from mydata;
For partitioning to have any benefit, your queries should have filters on month and year columns.
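For example, assuming the partitioned_data table created by the CTAS above, a query like this one lets Drill prune all partitions except the matching yr/mnth directories (a sketch; adjust column names to your data):

```sql
-- Only the files under yr=2017/mnth=5 need to be scanned
SELECT *
FROM partitioned_data
WHERE yr = 2017 AND mnth = 5;
```

A filter on `date` alone would not benefit from the partitioning, since the data is laid out by the yr and mnth columns, not by the raw date.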
Post by Raz Baluchi
Trying to understand how parquet partitioning works.
What is the recommended partitioning scheme for event data that will
be queried primarily by date? I assume that partitioning by year and
month would be optimal?
kafka,down,2017-03-23 04:53,zookeeper is not available
Would I have to create new columns for year and month?
kafka,down,2017-03-23 04:53,zookeeper is not available,2017,03
and then perform a CTAS using the year and month columns as the
partition columns?