Discussion:
Drill performance tuning parquet
Dan Holmes
2017-07-31 12:23:58 UTC
When is it right to add nodes vs adding CPUs? Since my installation is on AWS, adding CPUs is relatively easy. When does it make sense to add nodes instead of CPUs?

Dan Holmes | Revenue Analytics, Inc.
Direct: 770.859.1255
www.revenueanalytics.com

-----Original Message-----
From: Kunal Khatua [mailto:***@mapr.com]
Sent: Friday, July 28, 2017 12:51 PM
To: ***@drill.apache.org
Subject: RE: Drill performance tuning parquet

I also forgot to mention: within drill-override.conf there is a parameter you'd need to set to constrain the async parquet reader's scan pool size. If you have just 4 cores, the pool's 4 threads will compete with your other fragments for CPU. Of course, all of this depends on what the metrics in the query profile reveal.
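For reference, a minimal drill-override.conf sketch is below. To the best of my recollection the relevant key is drill.exec.scan.threadpool_size (please verify the exact name against the asynchronous parquet reader docs for your Drill version); the value of 2 is just an illustration for a 4-core box.

    drill.exec: {
      scan: {
        # Assumed key name - verify against your Drill version's docs.
        # Caps the async parquet scan pool so it doesn't starve other fragments.
        threadpool_size: 2
      }
    }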

-----Original Message-----
From: Jinfeng Ni [mailto:***@apache.org]
Sent: Friday, July 28, 2017 7:41 AM
To: user <***@drill.apache.org>
Subject: Re: Drill performance tuning parquet

The numbers you posted seem to show that the query elapsed time is highly impacted by the number of scan minor fragments (the scan parallelization degree).

In Drill, the scan parallelization degree is capped at the minimum of the number of parquet row groups and 70% of the CPU cores. In your original configuration, since you only have 3 files, each with one row group, Drill will have only up to 3 scan minor fragments (you can confirm that by looking at the query profile).
With a decreased blocksize, you have more parquet files, hence a higher scan parallelization degree and better performance. In the case of 4 cores, the scan parallelization degree is capped at 4 x 70% = 2 (after rounding down), which probably explains why reducing the blocksize does not help.

The 900MB total parquet file size is relatively small. If you want to tune Drill for such a small dataset, you probably need a smaller parquet file size.
In the case of 4 cores, you may consider bumping up the following parameter:

`planner.width.max_per_node`
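For example (session scope; the value of 8 is just an illustration, and the sys.options column names can vary slightly by version):

    ALTER SESSION SET `planner.width.max_per_node` = 8;
    -- sanity-check the value the planner will use
    SELECT name, num_val FROM sys.options WHERE name = 'planner.width.max_per_node';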
Thank you for the tips. I have used 4 different block sizes. It appears to scale linearly, and anything less than the 512MB blocksize was of similar performance. I rounded the numbers to whole seconds. The data is local to the EC2 instance; I did not put the data on EFS. I used the same data files: after I created them the first time, I put the data on S3 and copied it to the other instances.
If there are other configurations that someone is interested in, I would be willing to try them out; I have something to gain in that too.
Here's the data for those interested:
Instance - vCPUs, by blocksize (times in seconds):

                   64MB  128MB  256MB  512MB
m3.xlarge  - 16       6      6      5     12
c3.2xlarge -  8      11     11     11     20
c4.4xlarge -  4      20     20     20     20
Dan Holmes | Revenue Analytics, Inc.
Direct: 770.859.1255
www.revenueanalytics.com
-----Original Message-----
Sent: Friday, July 28, 2017 2:38 AM
Subject: RE: Drill performance tuning parquet
In the UI, look at the query profile's "Operator Profiles - Overview" section. The % Query Time is a good indicator of which operator consumes the most CPU. Changing planner.width.max_per_node will affect this (positively or negatively, depending on the load).
Within the same operator profile, also look at the Average and Max Wait times, and see if the numbers are unusually high.
Scrolling further down (since you are working with parquet), the Parquet RowGroup Scan operator can be expanded to show the minor fragment (worker/leaf fragment) metrics. Since you have 3 files, you will probably see only 3 entries, because each fragment will scan 1 row group in the parquet file. (I'm assuming that you have only 1 rowgroup per file.) At the end of that table, you'll see "OperatorMetrics". This gives you the time (in nanoseconds) and other metrics for how long these fragments wait to be handed data by the pool of Async Parquet Reader threads.
Most likely, changing the number of Parquet files produced by CTAS to a larger value (i.e., with smaller file sizes) might actually help leverage the surplus CPU capacity you have. For that, you'll need to tweak (and experiment with) parquet block sizes of something less than 512MB. You could try 128MB (or even 64MB). All of this depends on the nature of the data, so some experimenting is needed from here on.
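If it helps, a rough sketch of that experiment follows (the 128MB value and the sales_p_128 target name are just placeholders, and this assumes the target workspace is writable):

    ALTER SESSION SET `store.parquet.block-size` = 134217728;  -- 128MB, value in bytes
    CREATE TABLE dfs.tmp.sales_p_128 AS SELECT * FROM dfs.root.sales_p;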
Beyond that, look at the memory (in "Fragment Profiles - Overview") and whether you need to bump up that setting.
It is also possible that, since you're running on AWS, the compute and storage layers are not as tightly coupled as Athena is with S3, which would make sense since they need an incentive for users to try Athena on their AWS infrastructure. :)
Happy Drilling!
-----Original Message-----
Sent: Thursday, July 27, 2017 6:23 PM
Subject: RE: Drill performance tuning parquet
Let's pretend there is only this one query, or a similar-style aggregation perhaps with a WHERE clause. I am trying to understand how to get more out of the Drill instance I have. That set of parquet files is ~900MB; it was 6.25 GB as PSV. There are 13 million records.
How can I tell if I am IO bound and need more reader threads? If there were more files, would that be better?
I don't think I am CPU bound, based on the stats that EC2 gave me. I am using this example both to learn how to scale Drill and to understand it better.
We are considering using it for our text-file processing as an exploratory tool, for ETL (since it will convert to parquet), and, because of its ability to join disparate data sources, as a DB layer for tools like Tableau.
Other tools we have considered include Athena. It is crazy fast: that same query against the text files runs in ~3 seconds. (My 4-vCPU Drill instance ran the same query against S3 text files in 180 seconds.) But it does have drawbacks. It only works on AWS, so my on-premise solutions would have to be designed differently.
I don't need performance parity, but to do this right I need to understand Drill better. That is the essence of this inquiry.
Dan Holmes | Revenue Analytics, Inc.
Direct: 770.859.1255
www.revenueanalytics.com
-----Original Message-----
Sent: Thursday, July 27, 2017 6:52 PM
Subject: Re: Drill performance tuning parquet
Hi Dan,
Here are some thoughts from my end.
So this is just one query and you have the numbers. But how about a
representative collection? Do you have the use cases? Now, I know from
experience that if you can predict the pattern of the queries to about
60%, that would be great. The rest could be ad hoc and you could plan for it.
1. SQL query, response time measured, response time expected, size of the tables that are part of the query
2. Do you have any data skew?
3. What is the EC2 configuration you have: memory, CPU cores?
So the approach would be to tune it for the entire set (which means you will end up trading off the various parameters) and then scale out. (Scaling out is not cheap.)
Thanks,
Saurabh
You haven't specified what kind of query you are running.
The Async Parquet Reader tuning should be more than sufficient in your use case, since you seem to be processing only 3 files.
The feature introduces a small fixed pool of threads that are
responsible for the actual fetching of bytes from the disk, without
blocking the fragments that already have some data available to work on.
The "store.parquet.reader.pagereader.buffersize" might be of interest.
The default for this is 4MB and can be tuned to match the parquet
page size (usually 1MB). This can reduce memory pressure and improve
the pipeline behavior.
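For example (session scope; 1MB shown to match the typical page size mentioned above):

    ALTER SESSION SET `store.parquet.reader.pagereader.buffersize` = 1048576;  -- 1MB, value in bytes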
Apart from this, the primary factors affecting your query performance are the number of cores (which is what you seem to be tuning) and memory.
By design, the parallelization level is a function of the number of cores.
From the look of things, that is helping. You can try
planner.width.max_per_node (default is 70% of the number of cores)
For memory,
planner.memory.max_query_memory_per_node (default is 2GB)
https://drill.apache.org/docs/performance-tuning/
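As a concrete sketch of bumping the memory knob (the 4GB value is purely illustrative; the option takes bytes):

    ALTER SESSION SET `planner.memory.max_query_memory_per_node` = 4294967296;  -- 4GB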
~ Kunal
-----Original Message-----
Sent: Thursday, July 27, 2017 1:06 PM
Subject: RE: Drill performance tuning parquet
I did not partition the data when I created the parquet files (CTAS without a PARTITION BY).
Here is the file list.
Thank you.
-rw-rw-r-- 1 dholmes dholmes 393443418 Jul 27 19:05 1_0_0.parquet
-rw-rw-r-- 1 dholmes dholmes 321665234 Jul 27 19:06 1_1_0.parquet
-rw-rw-r-- 1 dholmes dholmes 330758061 Jul 27 19:06 1_2_0.parquet
Dan Holmes | Revenue Analytics, Inc.
Direct: 770.859.1255
www.revenueanalytics.com
-----Original Message-----
Sent: Thursday, July 27, 2017 3:59 PM
Subject: Drill performance tuning parquet
I am performance testing a single Drill instance with different vCPU configurations in AWS. I have parquet files on an EFS volume and use the same data for each EC2 instance.
I have used 4 vCPUs, 8, and 16. Drill performance is ~25 seconds, 15, and 12 respectively. I have not changed any of the options. This is an out-of-the-box 1.11 installation.
What Drill tuning options should I experiment with? I have read https://drill.apache.org/docs/asynchronous-parquet-reader/ but it is so technical that I can't fully digest it, though it reads like the default options are the best ones.
SELECT store_key, SUM(sales_dollars) sd FROM dfs.root.sales_p GROUP
BY store_key ORDER BY sd DESC LIMIT 10
Dan Holmes | Architect | Revenue Analytics, Inc.
Padma Penumarthy
2017-08-01 02:14:41 UTC
There is one more thing you need to consider: memory. In general, adding more CPUs is better than adding nodes as long as you don't hit other bottlenecks. Once that happens, you might consider adding nodes.


Thanks,

Padma

________________________________
From: Dan Holmes <***@revenueanalytics.com>
Sent: Monday, July 31, 2017 5:23 AM
To: ***@drill.apache.org
Subject: RE: Drill performance tuning parquet

When is it right to add nodes vs adding CPUs? Since my installation is on AWS, adding CPUs is relatively easy. When does it make sense to add nodes instead of CPUs?

Dan Holmes | Revenue Analytics, Inc.
Direct: 770.859.1255
www.revenueanalytics.com
Saurabh Mahapatra
2017-08-01 18:37:49 UTC
This is my understanding of how I approach these problems.

It is important to have a mental model of the execution semantics of Drill.
At a fundamental level, a query is a DAG of "map" and "reduce" tasks. The DAG ensures that the results would be correct even if the plan were executed on a single thread with the right sequencing. Since the whole DAG proceeds upward toward a final result, you want to keep the nodes at the bottom as busy as possible, which brings up the idea of batches. So the question is whether you can parallelize more with the same coordination strategy that ensures correctness of the result.

There are two groups of nodes competing for CPU resources:
(1) The nodes up the DAG that need to crunch through the data batches that have come from below.
(2) The data ingest nodes at the bottom that want to be as busy as possible but must obey the DAG dependencies.

If you give them too few CPU resources, these two groups start competing with each other. In the case where you have one core, all the threads would execute round-robin.

Those parallel units of execution (or opportunities) are called "minor fragments". There are ways to give the query planner limits on how many fragments should be created: at a node level (planner.width.max_per_node), at a cluster level (planner.width.max_per_query), and at a data-ingest level (planner.slice_target). The query planner operates within these constraints. Most users will tune planner.width.max_per_node because it has the notion of symmetry in execution.

https://drill.apache.org/docs/modifying-query-planning-options/
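A quick way to see what the planner is currently working with (a sketch; the sys.options column names may differ slightly across Drill versions):

    SELECT name, num_val
    FROM sys.options
    WHERE name IN ('planner.width.max_per_node',
                   'planner.width.max_per_query',
                   'planner.slice_target');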

So if you increase the number of cores for the same query, you could get more parallelism, which would show improvements at scale (data size). You also reduce the competition with nodes up the DAG that need to execute. The memory consumption would not matter in this case.

The first step is to tune the memory so that you avoid out-of-memory situations. You run your suite of tests. Once satisfied, you tune the number of fragments to see if you get a performance gain. Once you have reached an optimum, add more cores and see if you get more. (You may not, if the network traffic is too high.) A conservative design is to take the worst-case query and optimize for that. I recommend doing it by frequency of occurrence of the query: the query that consumes the most resources could well be the most infrequent, and it is OK to be suboptimal for those.

Now, let us bring in a couple of concurrent queries, where things become more complicated. You will have to budget for multiple queries, so you will have to add memory and cores. But there is a caveat: increasing concurrency does not necessarily improve performance, because you could have CPU- and memory-intensive queries (which would just run longer) that slow the whole system down. In this situation, you are better off giving them as many CPU/memory resources as possible so that you can move on. This is where queues help:

https://drill.apache.org/docs/enabling-query-queuing/

The queues are based on estimated row size, which is a reflection of the memory required.
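For reference, turning queuing on looks roughly like this (system scope; the option names are the standard queuing options, but the threshold and concurrency values are purely illustrative and should be sized to your workload):

    ALTER SYSTEM SET `exec.queue.enable` = true;
    ALTER SYSTEM SET `exec.queue.threshold` = 30000000;  -- estimated-cost boundary between small and large queries
    ALTER SYSTEM SET `exec.queue.small` = 10;            -- max concurrent small queries
    ALTER SYSTEM SET `exec.queue.large` = 2;             -- max concurrent large queries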

All in all, at the end of the day, you will be looking at two parameters:
1. Throughput as a function of input query load (time-based).
2. Performance on a percentile basis, with query response times arranged in ascending order: what is the performance of the 90th-percentile queries?

Thanks,
Saurabh
Post by Padma Penumarthy
There is one more thing you need to consider, memory. In general, adding
more CPUs is better than adding nodes as long as you don't hit other
bottlenecks. Once that happens, you might consider adding nodes.
Thanks,
Padma