Discussion:
Increasing store.parquet.block-size
Shuporno Choudhury
2017-06-09 07:18:41 UTC
The maximum value that can be assigned to *store.parquet.block-size* is
*2147483647*, even though the value kind of this configuration parameter is LONG.
This translates to a block size of 2GB.
How do I increase it to 3, 4, or 5 GB?
Trying to set this parameter to a higher value using the following command
actually succeeds:
ALTER SYSTEM SET `store.parquet.block-size` = 4294967296;
But when I try to run a query that uses this config, it throws the
following error:
Error: SYSTEM ERROR: NumberFormatException: For input string:
"4294967296"
So, is it possible to assign a higher value to this parameter?
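
For reference, a minimal sketch of inspecting the option, setting it to the
current 2GB ceiling, and resetting it to the default (the sys.options column
names may vary slightly across Drill versions):

  SELECT name, kind, status, num_val
  FROM sys.options
  WHERE name = 'store.parquet.block-size';

  -- 2147483647 bytes (Integer.MAX_VALUE) is the largest value that currently works
  ALTER SYSTEM SET `store.parquet.block-size` = 2147483647;

  -- Revert to the default if a larger value was already stored
  ALTER SYSTEM RESET `store.parquet.block-size`;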
--
Regards,
Shuporno Choudhury
Khurram Faraaz
2017-06-09 10:25:03 UTC
1. DRILL-2478 <https://issues.apache.org/jira/browse/DRILL-2478> is open for this issue.
2. I have added more details in the comments.

Thanks,
Khurram

Vitalii Diravka
2017-06-09 11:49:02 UTC
Khurram,

DRILL-2478 is a good placeholder for the LongValidator issue; it really
does behave incorrectly.

But there is another issue related to the inability to use long values for the
Parquet block size.
That one can be an independent task or a sub-task of updating Drill
to the latest Parquet library.

Kind regards
Vitalii
Kunal Khatua
2017-06-09 17:33:49 UTC
Shuporno


There are some interesting problems when using Parquet files > 2GB on HDFS.


If I'm not mistaken, the HDFS APIs that allow you to read offsets (oddly enough) return an int value. A large Parquet block size also means the file will span multiple HDFS blocks, which makes reading row groups inefficient.


Is there a reason you want to create such a large parquet file?


~ Kunal

Shuporno Choudhury
2017-06-09 17:50:06 UTC
Thanks, Kunal, for your insight.
I am actually converting some .csv files and storing them in Parquet format
in S3, not in HDFS.
The individual .csv source files can be quite large (around 10GB).
So, is there a way to overcome this and create one Parquet file, or do I
have to go ahead with multiple Parquet files?
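
For context, the conversion is essentially a CTAS along these lines (the S3
workspace and file names here are placeholders):

  ALTER SESSION SET `store.format` = 'parquet';

  CREATE TABLE s3.output.`sales_parquet` AS
  SELECT * FROM s3.staging.`sales_10gb.csv`;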
Kunal Khatua
2017-06-09 17:57:48 UTC
If you're storing this in S3... you might want to selectively read the files as well.


I'm only speculating, but if you want to download the data, downloading as a queue of files might be more reliable than one massive file. Similarly, within AWS, it *might* be faster to have an EC2 instance access a couple of large Parquet files versus one massive Parquet file.


Remember that when you use a large block size, Drill tries to write everything within a single row group for each file, so there is no chance of parallelizing the read (i.e. reading parts in parallel). The defaults should work well for S3 too, and with compression (e.g. Snappy) you should get a reasonably small file size.
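
In practice that just means leaving the block size at its default and picking
a codec before the CTAS; these are the standard Drill option values, and
Snappy is the usual default:

  -- smaller files, more CPU at write time
  ALTER SESSION SET `store.parquet.compression` = 'gzip';

  -- or the default: fast and reasonably compact
  ALTER SESSION SET `store.parquet.compression` = 'snappy';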


With the current default settings... have you seen what Parquet file sizes you get with Drill when converting your 10GB CSV source files?


Shuporno Choudhury
2017-06-09 18:23:37 UTC
Thanks for the information, Kunal.
After the conversion, the output shrinks to about half the size if I use gzip
compression.
For a 10GB gzipped .csv source file, I get about 5GB of Parquet split across
three files (2 + 2 + 1 GB), using gzip compression.
So, if I have to produce multiple Parquet files, what block size would be
optimal for reading them later?
Kunal Khatua
2017-06-09 18:36:01 UTC
The ideal size depends on what engine is consuming the Parquet files (Drill, I'm guessing) and the storage layer. For HDFS, which is usually 128-256GB, we recommend bumping it to about 512GB (with the underlying HDFS block size set to match).


You'll probably need to experiment a little with different block sizes stored on S3 to see which works best.
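
A rough way to run that experiment, reusing the ALTER SYSTEM form from earlier
in the thread (byte values shown for 256MB and 512MB; the workspace and file
names are placeholders):

  ALTER SYSTEM SET `store.parquet.block-size` = 268435456;  -- 256 MB
  CREATE TABLE s3.output.`trial_256mb` AS SELECT * FROM s3.staging.`sales_10gb.csv`;

  ALTER SYSTEM SET `store.parquet.block-size` = 536870912;  -- 512 MB
  CREATE TABLE s3.output.`trial_512mb` AS SELECT * FROM s3.staging.`sales_10gb.csv`;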

Padma Penumarthy
2017-06-14 06:13:16 UTC
I think you meant MB (not GB) below.
HDFS allows creation of very large files (theoretically, there is no limit).
I am wondering why a >2GB file is a problem. Maybe it is a block size >2GB that is not recommended.

Anyway, we should not let the user set an arbitrary value and then throw an error later.
I opened a PR to fix this:
https://github.com/apache/drill/pull/852

Thanks,
Padma


On Jun 9, 2017, at 11:36 AM, Kunal Khatua <***@mapr.com> wrote:

The ideal size depends on what engine is consuming the Parquet files (Drill, I'm guessing) and the storage layer. For HDFS, which is usually 128-256GB, we recommend bumping it to about 512GB (with the underlying HDFS block size set to match).


Khurram Faraaz
2017-06-14 10:12:40 UTC
Thanks, Padma. There are some more related failures reported in DRILL-2478; do you think we should fix them too, if it is an easy fix?


Regards,

Khurram

Padma Penumarthy
2017-06-15 03:28:44 UTC
Sure. I will check and try to fix them as well.

Thanks,
Padma
Khurram Faraaz
2017-06-15 18:35:18 UTC
Thanks, Padma.
