Discussion:
[DRILL HANGOUT] Topics for 5/16/2017
Jinfeng Ni
2017-05-15 20:59:00 UTC
Hi All,

Our bi-weekly Drill hangout is tomorrow (5/16/2017, 10 AM PDT). Please
respond with suggested topics for discussion. We will also collect
topics at the beginning of the hangout tomorrow.

Thanks,

Jinfeng
j***@accenture.com
2017-05-16 02:11:34 UTC
Hi,

I am stuck on a problem where an instance of Apache Drill stops working. My topic of discussion will be -

For my scenario, I have 25 parquet files with around 400K-500K records and around 10 columns each. My select query is such that, for one column, the IN clause contains around 100K values. When I run these queries in parallel, the Apache Drill instance hangs and then shuts down. Therefore, how should we design the select queries so that Drill supports them?
Solutions that we are trying -
a - Create a temp table of the 100K values and then use it in an inner query. But as far as I know, we can't create a temp table at run time from Java code; it needs some data source, either parquet or something else, to create the temp table.
b - Create a separate parquet file of all 100K values and use an inner query against it instead of putting all the values directly in the main query (sketched below).
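
To make option (b) concrete, here is a minimal sketch over Drill's JDBC driver. This is only an illustration of the rewrite we have in mind; the connection URL, workspace, paths, and names (filter_vals, events, key_col) are placeholders, not our actual schema:

import java.sql.*;

public class MaterializeFilter {
    public static void main(String[] args) throws SQLException {
        // Connection URL is a placeholder; adjust the drillbit/zk host as needed.
        try (Connection con = DriverManager.getConnection("jdbc:drill:drillbit=localhost");
             Statement st = con.createStatement()) {
            // One-time step: materialize the 100K values as parquet in a
            // writable workspace (dfs.tmp here, assuming it is writable).
            st.execute("CREATE TABLE dfs.tmp.`filter_vals` AS " +
                       "SELECT key_col FROM dfs.`/staging/filter_vals`");
            // Main query: a semi-join replaces the 100K-literal IN list.
            try (ResultSet rs = st.executeQuery(
                     "SELECT m.* FROM dfs.`/data/events` m " +
                     "WHERE m.key_col IN (SELECT key_col FROM dfs.tmp.`filter_vals`)")) {
                while (rs.next()) {
                    // ... process rows ...
                }
            }
        }
    }
}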

Is there a better way around this problem, or can we solve it with simple configuration changes?

Regards,
Jasbir Singh


Jinfeng Ni
2017-05-16 04:53:43 UTC
My feeling is that either a temp table or putting the 100K values into a
separate parquet file makes more sense than putting 100K values in an
IN list. Although for such a long IN list the Drill planner will convert
it into a JOIN (which is the same as the temp table / parquet table
solutions), there is a big difference in terms of what the query plan
looks like. An IN list with 100K values has to be serialized /
deserialized before the plan can be executed. I guess that would create
a huge serialized plan, which is not the best solution one may use.

Also, putting 100K values in an IN list may not be very typical. RDBMSs
probably impose certain limits on the number of values in an IN list.
For instance, Oracle sets the limit to 1000 [1].

1. http://docs.oracle.com/database/122/SQLRF/Expression-Lists.htm#SQLRF52099
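
As a quick way to see the difference, one could compare the plans Drill produces with EXPLAIN. A hedged sketch over JDBC, reusing the same hypothetical table and column names as the earlier example:

import java.sql.*;

public class ExplainFilterPlan {
    public static void main(String[] args) throws SQLException {
        try (Connection con = DriverManager.getConnection("jdbc:drill:drillbit=localhost");
             Statement st = con.createStatement();
             // Swap the subquery for the literal IN list to compare plan sizes.
             ResultSet rs = st.executeQuery(
                 "EXPLAIN PLAN FOR " +
                 "SELECT m.* FROM dfs.`/data/events` m " +
                 "WHERE m.key_col IN (SELECT key_col FROM dfs.tmp.`filter_vals`)")) {
            while (rs.next()) {
                System.out.println(rs.getString(1)); // text form of the plan
            }
        }
    }
}

With the 100K literals inlined, those values travel inside the serialized plan; with the join against a real table they do not.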
Jinfeng Ni
2017-05-16 17:01:31 UTC
We will start the hangout shortly.

https://plus.google.com/hangouts/_/event/ci4rdiju8bv04a64efj5fedd0lc
Jinfeng Ni
2017-05-17 00:32:13 UTC
Meeting minutes 5/16/2017

Attendees: Aman, Jinfeng, Karthikeyan, Khurram, Kunal, Padma, Parth, Paul,
Pritesh, Vitalii, Volodymyr

Two topics were discussed.
1. Schema change exception.
   Jinfeng is working on bugs related to the 'soft' SchemaChangeException (such
as DRILL-5327), where the data does not actually present a schema change but
Drill intermittently fails the query with either a SchemaChangeException or an
incorrect result. Initial analysis shows the problem comes from either the scan
operator or a schema-loss operator (one example is UnionAll).
   Aman, Paul, and Parth brought up the work on UnionVector. UnionVector
targets 'hard' schema changes, where the data itself presents a schema change.
Although it may help solve the issues in DRILL-5327, it might need quite
extensive work (as of today, enabling UnionVector did not fix the reported
problems). Also, UnionVector might pose a challenge for JDBC/ODBC clients,
which only take regular SQL types.

2. Memory fragmentation
   Paul is working on memory fragmentation and size-aware batches / value
vectors. Drill's allocator keeps a free list of 16 MB chunks; for allocations
beyond 16 MB, it asks for system memory through Netty. With each batch holding
64K rows, a column wider than 256 bytes produces a value vector larger than
16 MB (65,536 rows x 256 bytes = 16 MB), so Drill may hit an OOM even when
there are plenty of free 16 MB chunks.
   Paul's proposal is to impose a size constraint on value vectors by
providing a new set of setSafe() methods. Work is in progress and a PR will be
submitted shortly.
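
For illustration, here is a rough sketch of the size-constrained setSafe() idea in Java. This is not Drill's actual value-vector API; the class, limit, and method below are hypothetical, just showing how a write that would blow the byte budget can be rejected so the caller ends the batch instead of hitting an OOM:

public class SizeAwareWriter {
    // Hypothetical budget matching the allocator's 16 MB chunk size.
    private static final int MAX_VECTOR_BYTES = 16 * 1024 * 1024;
    private int usedBytes;

    /** Returns false when writing this value would exceed the byte budget. */
    public boolean setSafe(byte[] value) {
        if (usedBytes + value.length > MAX_VECTOR_BYTES) {
            return false; // caller flushes the current batch and retries
        }
        // ... copy the value into the vector's buffer at the current offset ...
        usedBytes += value.length;
        return true;
    }
}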