Discussion:
Drill performance - Waiting time
Rosenthaler Matthias (PS-DI/ETF1.1)
2018-11-29 12:56:20 UTC
Permalink
Hi,

I am using apache drill to query huge parquet files (100 MB) on a single node.
A SELECT * query takes around 90 Seconds. 80 Seconds are "Waiting time".
Can you explain what this waiting time means and how I am able to optimize it?

Mit freundlichen Grüßen / Best regards

Matthias Rosenthaler

Powertrain Solutions, Engine Testing (PS-DI/ETF1.1)
Robert Bosch AG | Robert-Bosch-Straße 1 | 4020 Linz | AUSTRIA | www.bosch.at<http://www.bosch.at>
Tel. +43 732 7667-479 | ***@at.bosch.com<mailto:***@at.bosch.com>

Sitz: Robert Bosch Aktiengesellschaft, A-1030 Wien, Göllnergasse 15-17 , Registergericht: FN 55722 w HG-Wien
Aufsichtsratsvorsitzender: Dr. Uwe Thomas; Geschäftsführung: Dr. Klaus Peter Fouquet
DVR-Nr.: 0418871- ARA-Lizenz-Nr.: 1831 - UID-Nr.: ATU14719303 - Steuernummer 140/4988
Kunal Khatua
2018-11-29 21:45:07 UTC
Permalink
Hi Matthias

The waiting time for a PARQUET_ROW_GROUP_SCAN operator is the total time that all the fragments took to read the parquet data into memory as Drill's Value Vectors. So, 80 seconds would indicate that the bulk of the time is spent in just getting the data. 

If you scroll down to the operator specific table.. you;ll find an entry on the lines of 
09-xx-01 - PARQUET_ROW_GROUP_SCAN

Within that collapsed table, you should find at the end a sub section for Operator Metrics. 

These metrics should be able to tell you where time is being spent the most on a per-fragment level. 

If the metrics are missing, that means the traditional Parquet reader was used instead of Drill's fast native parquet reader (Drill does this if it encounters parquet files with Nested data) and the time is being spent by the Parquet library in deserializing the file. In this case, you're out of luck and your best bet is to split the parquet file into multiple files or atleast multiple rowgroups. That way, Drill can create more fragments (assuming you've not maxed out that limit) and read the data in parallel.

~ Kunal 
On 11/29/2018 9:13:45 AM, Rosenthaler Matthias (PS-DI/ETF1.1) <***@at.bosch.com> wrote:
Hi,

I am using apache drill to query huge parquet files (100 MB) on a single node.
A SELECT * query takes around 90 Seconds. 80 Seconds are "Waiting time".
Can you explain what this waiting time means and how I am able to optimize it?

Mit freundlichen GrÌßen / Best regards

Matthias Rosenthaler

Powertrain Solutions, Engine Testing (PS-DI/ETF1.1)
Robert Bosch AG | Robert-Bosch-Straße 1 | 4020 Linz | AUSTRIA | www.bosch.at
Tel. +43 732 7667-479 | ***@at.bosch.com

Sitz: Robert Bosch Aktiengesellschaft, A-1030 Wien, Göllnergasse 15-17 , Registergericht: FN 55722 w HG-Wien
Aufsichtsratsvorsitzender: Dr. Uwe Thomas; GeschÀftsfÌhrung: Dr. Klaus Peter Fouquet
DVR-Nr.: 0418871- ARA-Lizenz-Nr.: 1831 - UID-Nr.: ATU14719303 - Steuernummer 140/4988
Ted Dunning
2018-11-29 22:07:56 UTC
Permalink
Matthias,

Kunal gives very good information about how to start from the high level to
debug this, but you should also be suspicious of the lower levels. For
instance, are you sure that your file system is working correctly? Is the
file actually stored on MapR?

How long does it take to run something like wc on this file on the same
node?

This should take much less than a second if you have competent I/O system,
but it would not be unheard of to hear that this is much slower than
expected due any number of reasons.



On Thu, Nov 29, 2018 at 9:13 AM Rosenthaler Matthias (PS-DI/ETF1.1) <
Post by Rosenthaler Matthias (PS-DI/ETF1.1)
Hi,
I am using apache drill to query huge parquet files (100 MB) on a single node.
A SELECT * query takes around 90 Seconds. 80 Seconds are "Waiting time".
Can you explain what this waiting time means and how I am able to optimize it?
Mit freundlichen GrÌßen / Best regards
Matthias Rosenthaler
Powertrain Solutions, Engine Testing (PS-DI/ETF1.1)
Robert Bosch AG | Robert-Bosch-Straße 1 | 4020 Linz | AUSTRIA |
www.bosch.at<http://www.bosch.at>
Sitz: Robert Bosch Aktiengesellschaft, A-1030 Wien, Göllnergasse 15-17 ,
Registergericht: FN 55722 w HG-Wien
Aufsichtsratsvorsitzender: Dr. Uwe Thomas; GeschÀftsfÌhrung: Dr. Klaus
Peter Fouquet
DVR-Nr.: 0418871- ARA-Lizenz-Nr.: 1831 - UID-Nr.: ATU14719303 - Steuernummer 140/4988
Loading...