Discussion:
Parquet on S3 - timeouts
Raz Baluchi
2017-06-01 19:55:41 UTC
Now that I have Drill working with parquet files on dfs, the next step was
to move the parquet files to S3.

I get pretty good performance - I can query for events by date range
within 10 seconds (out of a total of ~800M events across 25 years).
However, there seems to be some threshold beyond which queries start
timing out.

SYSTEM ERROR: ConnectionPoolTimeoutException: Timeout waiting for
connection from pool

My first question: is there a default timeout for queries against
S3? Anything that takes longer than ~150 seconds seems to hit the timeout
error.

The second question has to do with the possible conditions that trigger the
prolonged query time. It seems that if I increase the filters beyond a
certain number - it doesn't take much - the query times out.

For example the query:

select * from events where YEAR in (2012, 2013) works fine - however,
select * from events where YEAR in (2012, 2013, 2014) fails with a timeout.

To make matters worse, I can't run the first query either until I restart
Drill...
Abhishek Girish
2017-06-01 20:08:47 UTC
Can you take a look at [1] and let us know if that helps resolve your issue?

[1]
https://drill.apache.org/docs/s3-storage-plugin/#quering-parquet-format-files-on-s3
Raz Baluchi
2017-06-01 20:56:35 UTC
I noticed that if I precede the query with a select count(*) using the same
filters, I no longer experience timeouts. 'Priming' the query in this
way also makes the second query faster. This seems to be an acceptable
workaround, as it seems to allow me to include essentially all partitions
in the filter and still get results pretty quickly. I am still curious why
this occurs, though.
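
One possible reading of the error, offered only as a toy model: the exception name suggests a wait for a free connection in a bounded pool, rather than a query-level timeout. The sketch below is not Drill's or the S3 client's actual code. The `ConnectionPool` class, the pool sizes, the reader counts, and the wait limit are all hypothetical and chosen for illustration. It shows that once more readers hold connections than the pool allows, the next lease attempt times out, and that a larger pool avoids this.

```python
import threading


class PoolTimeout(Exception):
    """Raised when no pooled connection frees up within the wait limit."""


class ConnectionPool:
    """Toy bounded pool (hypothetical, for illustration only)."""

    def __init__(self, max_connections, wait_seconds):
        self._slots = threading.BoundedSemaphore(max_connections)
        self._wait_seconds = wait_seconds

    def lease(self):
        # Block for at most wait_seconds while waiting for a free slot.
        if not self._slots.acquire(timeout=self._wait_seconds):
            raise PoolTimeout("Timeout waiting for connection from pool")

    def release(self):
        self._slots.release()


def open_readers(pool, reader_count):
    """Lease one connection per reader and hold them all at once.

    Returns the number of readers that got a connection, or raises
    PoolTimeout as soon as one reader cannot get a connection.
    """
    leased = 0
    try:
        for _ in range(reader_count):
            pool.lease()
            leased += 1
    finally:
        # Hand every leased connection back, even after a timeout.
        for _ in range(leased):
            pool.release()
    return leased


if __name__ == "__main__":
    small_pool = ConnectionPool(max_connections=15, wait_seconds=0.01)
    large_pool = ConnectionPool(max_connections=100, wait_seconds=0.01)

    print(open_readers(small_pool, 10))   # fits inside the small pool
    print(open_readers(large_pool, 40))   # fits inside the large pool
    try:
        open_readers(small_pool, 40)      # more readers than connections
    except PoolTimeout as exc:
        print("failed:", exc)
```

If this model is roughly right, adding one more year to the filter could push the number of simultaneously open files past the pool limit, which would match the sudden failure described above. It does not by itself explain why the count(*) 'priming' helps.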
Raz Baluchi
2017-06-01 21:14:50 UTC
Setting

<property>
  <name>fs.s3a.connection.maximum</name>
  <value>100</value>
</property>

does fix the problem. No more timeouts, and responses are very quick. No need
to 'prime' the query...
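
For anyone who wants to script this change, here is a minimal sketch that adds or updates a property in a Hadoop-style configuration document such as the core-site.xml file described on the linked docs page. The helper name `set_hadoop_property` is hypothetical, and the sketch assumes the file has a `<configuration>` root holding `<property>` elements with `<name>` and `<value>` children. It works on a string, so reading and writing the real file is left to the caller.

```python
import xml.etree.ElementTree as ET


def set_hadoop_property(xml_text, name, value):
    """Return xml_text with the named property set to value.

    Updates the property if it already exists, otherwise appends it.
    (Hypothetical helper, not part of Drill or Hadoop.)
    """
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            value_el = prop.find("value")
            if value_el is None:
                value_el = ET.SubElement(prop, "value")
            value_el.text = value
            break
    else:
        # No existing entry: append a new <property> block.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    original = "<configuration></configuration>"
    print(set_hadoop_property(original, "fs.s3a.connection.maximum", "100"))
```

Note that ElementTree drops comments and the XML declaration when parsing this way, so editing a hand-maintained file manually may be preferable. Drill has to be restarted before the new value takes effect.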
Abhishek Girish
2017-06-01 21:29:18 UTC
Cool, thanks for confirming.


