Discussion:
Running cartesian joins on Drill
(too old to reply)
Muhammad Gelbana
2017-05-06 14:34:53 UTC
Permalink
Raw Message
Is there a reason why Drill would intentionally reject cartesian join
queries even if *planner.enable_nljoin_for_scalar_only* is disabled ?

Any ideas how could a query be rewritten to overcome this restriction ?

*---------------------*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana
Gautam Parai
2017-05-06 16:53:29 UTC
Permalink
Raw Message
Can you please specify the query you are trying to execute?


Gautam

________________________________
From: Muhammad Gelbana <***@gmail.com>
Sent: Saturday, May 6, 2017 7:34:53 AM
To: ***@drill.apache.org; ***@drill.apache.org
Subject: Running cartesian joins on Drill

Is there a reason why Drill would intentionally reject cartesian join
queries even if *planner.enable_nljoin_for_scalar_only* is disabled ?

Any ideas how could a query be rewritten to overcome this restriction ?

*---------------------*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana
Muhammad Gelbana
2017-05-06 22:05:06 UTC
Permalink
Raw Message
​​
Here it is:

SELECT * FROM (SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc
LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM
`dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (
​​
`t0`.`UserID` IS NOT DISTINCT FROM
​​
`t1`.`UserID`) LIMIT 2147483647

I debugged Drill code and found it decomposes *IS NOT DISTINCT FROM* into
​
*`t0`.`UserID` = ​`t1`.`UserID` OR (`t0`.`UserID` IS NULL && `t1`.`UserID`
IS NULL**)* while checking if the query is a cartesian join, and when the
check returns true, it throws an excetion saying: *This query cannot be
planned possibly due to either a cartesian join or an inequality join*


*---------------------*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana
Post by Gautam Parai
Can you please specify the query you are trying to execute?
Gautam
________________________________
Sent: Saturday, May 6, 2017 7:34:53 AM
Subject: Running cartesian joins on Drill
Is there a reason why Drill would intentionally reject cartesian join
queries even if *planner.enable_nljoin_for_scalar_only* is disabled ?
Any ideas how could a query be rewritten to overcome this restriction ?
*---------------------*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana
Muhammad Gelbana
2017-05-08 16:09:34 UTC
Permalink
Raw Message
​I believe ​clhubert is referring to this discussion
<http://drill-user.incubator.apache.narkive.com/TIXWiTY4/cartesian-product-in-apache-drill#post1>
.

So why Drill doesn't transform this query into a nested join query ? Simply
because there is no Calcite rule to transform it into a nested loop join ?
Is it not technically possible to write such Rule or is it feasible so I
may take on this challenge ?

Also pardon me for repeating my question but I fail to find an answer in
your replies, why doesn't Drill just run a cartesian join ? Because it's
expensive regarding resources (i.e. CPU\Network\RAM) ?

Thanks a lot Shadi for the query, it works for me.

*---------------------*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana
Hi Muhammad,
I did the following as a workaround to have Cartesian product. The basic
idea is to create a dummy column on the fly that has the value 1 in both
tables and then join on that column leading to having a match of every row
of the first table with every row of the second table, hence do a Cartesian
product. This might not be the most efficient way but it will do the job.
*Original Query:*
SELECT * FROM
( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc LIMIT
2147483647) `t0`
INNER JOIN
( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc LIMIT
2147483647) `t1`
ON (`t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`)
LIMIT 2147483647
*Workaround (add columns **d1a381f3g73 and **d1a381f3g74 to tables one
and two, respectively. Names don't really matter, just need to be unique):*
SELECT * FROM
( SELECT *1 as d1a381f3g73*, 'ABC' `UserID` FROM
`dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t0`
INNER JOIN
( SELECT *1 as d1a381f3g74*, 'ABC' `UserID` FROM
`dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1`
ON (`t0`.*d1a381f3g73 = *`t1`.*d1a381f3g74*)
WHERE `t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`
LIMIT 2147483647
Regards
*Shadi Khalifa, PhD*
Postdoctoral Fellow
Cognitive Analytics Development Hub
Centre for Advanced Computing
Queen’s University
(613) 533-6000 x78347
http://cac.queensu.ca
I'm just a neuron in the society collective brain
*Join us for HPCS in June 2017! Register at:* *http://2017.hpcs.ca/
<http://2017.hpcs.ca/>*
P Please consider your environmental responsibility before printing this
e-mail
*01001001 00100000 01101100 01101111 01110110 01100101 00100000 01000101
01100111 01111001 01110000 01110100 *
*The information transmitted is intended only for the person or entity to
which it is addressed and may contain confidential material. Any review or
dissemination of this information by persons other than the intended
recipient is prohibited. If you received this in error, please contact the
sender and delete the material from any computer. Thank you.*
​​
SELECT * FROM (SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc
LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM
`dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (
​​
`t0`.`UserID` IS NOT DISTINCT FROM
​​
`t1`.`UserID`) LIMIT 2147483647
I debugged Drill code and found it decomposes *IS NOT DISTINCT FROM* into
​
*`t0`.`UserID` = ​`t1`.`UserID` OR (`t0`.`UserID` IS NULL && `t1`.`UserID`
IS NULL**)* while checking if the query is a cartesian join, and when the
check returns true, it throws an excetion saying: *This query cannot be
planned possibly due to either a cartesian join or an inequality join*
*---------------------*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana
Post by Gautam Parai
Can you please specify the query you are trying to execute?
Gautam
________________________________
Sent: Saturday, May 6, 2017 7:34:53 AM
Subject: Running cartesian joins on Drill
Is there a reason why Drill would intentionally reject cartesian join
queries even if *planner.enable_nljoin_for_scalar_only* is disabled ?
Any ideas how could a query be rewritten to overcome this restriction ?
*---------------------*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana
Zelaine Fong
2017-05-08 16:30:50 UTC
Permalink
Raw Message
Cartesian joins in Drill are implemented as nested loop joins, and I think you should see that reflected in the resultant query plan when you run explain plan on the query.

Yes, Cartesian joins/nested loop joins are expensive because you’re effectively doing an MxN read of your tables. There are more efficient ways of processing a nested loop join, e.g., by creating an index on the larger table in the join and then using that index to do lookups into that table. That way, the nested loop join cost is the cost of creating the index + M, where M is the number of rows in the smaller table and assuming the lookup cost into the index does minimize the amount of data read of the second table. Drill currently doesn’t do this.

-- Zelaine

On 5/8/17, 9:09 AM, "Muhammad Gelbana" <***@gmail.com> wrote:

​I believe ​clhubert is referring to this discussion
<http://drill-user.incubator.apache.narkive.com/TIXWiTY4/cartesian-product-in-apache-drill#post1>
.

So why Drill doesn't transform this query into a nested join query ? Simply
because there is no Calcite rule to transform it into a nested loop join ?
Is it not technically possible to write such Rule or is it feasible so I
may take on this challenge ?

Also pardon me for repeating my question but I fail to find an answer in
your replies, why doesn't Drill just run a cartesian join ? Because it's
expensive regarding resources (i.e. CPU\Network\RAM) ?

Thanks a lot Shadi for the query, it works for me.

*---------------------*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana
Hi Muhammad,
I did the following as a workaround to have Cartesian product. The basic
idea is to create a dummy column on the fly that has the value 1 in both
tables and then join on that column leading to having a match of every row
of the first table with every row of the second table, hence do a Cartesian
product. This might not be the most efficient way but it will do the job.
*Original Query:*
SELECT * FROM
( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc LIMIT
2147483647) `t0`
INNER JOIN
( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc LIMIT
2147483647) `t1`
ON (`t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`)
LIMIT 2147483647
*Workaround (add columns **d1a381f3g73 and **d1a381f3g74 to tables one
and two, respectively. Names don't really matter, just need to be unique):*
SELECT * FROM
( SELECT *1 as d1a381f3g73*, 'ABC' `UserID` FROM
`dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t0`
INNER JOIN
( SELECT *1 as d1a381f3g74*, 'ABC' `UserID` FROM
`dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1`
ON (`t0`.*d1a381f3g73 = *`t1`.*d1a381f3g74*)
WHERE `t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`
LIMIT 2147483647
Regards
*Shadi Khalifa, PhD*
Postdoctoral Fellow
Cognitive Analytics Development Hub
Centre for Advanced Computing
Queen’s University
(613) 533-6000 x78347
http://cac.queensu.ca
I'm just a neuron in the society collective brain
*Join us for HPCS in June 2017! Register at:* *http://2017.hpcs.ca/
<http://2017.hpcs.ca/>*
P Please consider your environmental responsibility before printing this
e-mail
*01001001 00100000 01101100 01101111 01110110 01100101 00100000 01000101
01100111 01111001 01110000 01110100 *
*The information transmitted is intended only for the person or entity to
which it is addressed and may contain confidential material. Any review or
dissemination of this information by persons other than the intended
recipient is prohibited. If you received this in error, please contact the
sender and delete the material from any computer. Thank you.*
​​
SELECT * FROM (SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc
LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM
`dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (
​​
`t0`.`UserID` IS NOT DISTINCT FROM
​​
`t1`.`UserID`) LIMIT 2147483647
I debugged Drill code and found it decomposes *IS NOT DISTINCT FROM* into

*`t0`.`UserID` = ​`t1`.`UserID` OR (`t0`.`UserID` IS NULL && `t1`.`UserID`
IS NULL**)* while checking if the query is a cartesian join, and when the
check returns true, it throws an excetion saying: *This query cannot be
planned possibly due to either a cartesian join or an inequality join*
*---------------------*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana
Post by Gautam Parai
Can you please specify the query you are trying to execute?
Gautam
________________________________
Sent: Saturday, May 6, 2017 7:34:53 AM
Subject: Running cartesian joins on Drill
Is there a reason why Drill would intentionally reject cartesian join
queries even if *planner.enable_nljoin_for_scalar_only* is disabled ?
Any ideas how could a query be rewritten to overcome this restriction ?
*---------------------*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana
Anup Tiwari
2017-05-11 11:20:40 UTC
Permalink
Raw Message
Hi,

I have one question here.. so if we have to use Cartesian join in Drill
then do we have to follow some workaround like Shadi mention : adding a
dummy column on the fly that has the value 1 in both tables and then join
on that column leading to having a match of every row of the first table
with every row of the second table, hence do a Cartesian product?
OR
If we just don't specify join condition like :
select a.*, b.* from tt1 as a, tt2 b; then will it internally treat this
query as Cartesian join.

Regards,
*Anup Tiwari*
Post by Zelaine Fong
Cartesian joins in Drill are implemented as nested loop joins, and I think
you should see that reflected in the resultant query plan when you run
explain plan on the query.
Yes, Cartesian joins/nested loop joins are expensive because you’re
effectively doing an MxN read of your tables. There are more efficient
ways of processing a nested loop join, e.g., by creating an index on the
larger table in the join and then using that index to do lookups into that
table. That way, the nested loop join cost is the cost of creating the
index + M, where M is the number of rows in the smaller table and assuming
the lookup cost into the index does minimize the amount of data read of the
second table. Drill currently doesn’t do this.
-- Zelaine
​I believe ​clhubert is referring to this discussion
<http://drill-user.incubator.apache.narkive.com/TIXWiTY4/
cartesian-product-in-apache-drill#post1>
.
So why Drill doesn't transform this query into a nested join query ? Simply
because there is no Calcite rule to transform it into a nested loop join ?
Is it not technically possible to write such Rule or is it feasible so I
may take on this challenge ?
Also pardon me for repeating my question but I fail to find an answer in
your replies, why doesn't Drill just run a cartesian join ? Because it's
expensive regarding resources (i.e. CPU\Network\RAM) ?
Thanks a lot Shadi for the query, it works for me.
*---------------------*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana
Hi Muhammad,
I did the following as a workaround to have Cartesian product. The
basic
idea is to create a dummy column on the fly that has the value 1 in
both
tables and then join on that column leading to having a match of
every row
of the first table with every row of the second table, hence do a
Cartesian
product. This might not be the most efficient way but it will do the
job.
*Original Query:*
SELECT * FROM
( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc LIMIT
2147483647) `t0`
INNER JOIN
( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc LIMIT
2147483647) `t1`
ON (`t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`)
LIMIT 2147483647
*Workaround (add columns **d1a381f3g73 and **d1a381f3g74 to tables
one
and two, respectively. Names don't really matter, just need to be
unique):*
SELECT * FROM
( SELECT *1 as d1a381f3g73*, 'ABC' `UserID` FROM
`dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t0`
INNER JOIN
( SELECT *1 as d1a381f3g74*, 'ABC' `UserID` FROM
`dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1`
ON (`t0`.*d1a381f3g73 = *`t1`.*d1a381f3g74*)
WHERE `t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`
LIMIT 2147483647
Regards
*Shadi Khalifa, PhD*
Postdoctoral Fellow
Cognitive Analytics Development Hub
Centre for Advanced Computing
Queen’s University
(613) 533-6000 x78347
http://cac.queensu.ca
I'm just a neuron in the society collective brain
*Join us for HPCS in June 2017! Register at:* *http://2017.hpcs.ca/
<http://2017.hpcs.ca/>*
P Please consider your environmental responsibility before printing
this
e-mail
*01001001 00100000 01101100 01101111 01110110 01100101 00100000
01000101
01100111 01111001 01110000 01110100 *
*The information transmitted is intended only for the person or
entity to
which it is addressed and may contain confidential material. Any
review or
dissemination of this information by persons other than the intended
recipient is prohibited. If you received this in error, please
contact the
sender and delete the material from any computer. Thank you.*
On Saturday, May 6, 2017 6:05 PM, Muhammad Gelbana <
​​
SELECT * FROM (SELECT 'ABC' `UserID` FROM
`dfs`.`path_to_parquet_file` tc
LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM
`dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (
​​
`t0`.`UserID` IS NOT DISTINCT FROM
​​
`t1`.`UserID`) LIMIT 2147483647
I debugged Drill code and found it decomposes *IS NOT DISTINCT FROM*
into
​
*`t0`.`UserID` = ​`t1`.`UserID` OR (`t0`.`UserID` IS NULL &&
`t1`.`UserID`
IS NULL**)* while checking if the query is a cartesian join, and
when the
check returns true, it throws an excetion saying: *This query cannot
be
planned possibly due to either a cartesian join or an inequality
join*
*---------------------*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana
Post by Gautam Parai
Can you please specify the query you are trying to execute?
Gautam
________________________________
Sent: Saturday, May 6, 2017 7:34:53 AM
Subject: Running cartesian joins on Drill
Is there a reason why Drill would intentionally reject cartesian
join
Post by Gautam Parai
queries even if *planner.enable_nljoin_for_scalar_only* is
disabled ?
Post by Gautam Parai
Any ideas how could a query be rewritten to overcome this
restriction ?
Post by Gautam Parai
*---------------------*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana
Zelaine Fong
2017-05-11 15:05:45 UTC
Permalink
Raw Message
Provided `planner.enable_nljoin_for_scalar_only` is set to false, even without an explicit join condition, the query should use the Cartesian join/nested loop join.

-- Zelaine

On 5/11/17, 4:20 AM, "Anup Tiwari" <***@games24x7.com> wrote:

Hi,

I have one question here.. so if we have to use Cartesian join in Drill
then do we have to follow some workaround like Shadi mention : adding a
dummy column on the fly that has the value 1 in both tables and then join
on that column leading to having a match of every row of the first table
with every row of the second table, hence do a Cartesian product?
OR
If we just don't specify join condition like :
select a.*, b.* from tt1 as a, tt2 b; then will it internally treat this
query as Cartesian join.

Regards,
*Anup Tiwari*
Post by Zelaine Fong
Cartesian joins in Drill are implemented as nested loop joins, and I think
you should see that reflected in the resultant query plan when you run
explain plan on the query.
Yes, Cartesian joins/nested loop joins are expensive because you’re
effectively doing an MxN read of your tables. There are more efficient
ways of processing a nested loop join, e.g., by creating an index on the
larger table in the join and then using that index to do lookups into that
table. That way, the nested loop join cost is the cost of creating the
index + M, where M is the number of rows in the smaller table and assuming
the lookup cost into the index does minimize the amount of data read of the
second table. Drill currently doesn’t do this.
-- Zelaine
​I believe ​clhubert is referring to this discussion
<http://drill-user.incubator.apache.narkive.com/TIXWiTY4/
cartesian-product-in-apache-drill#post1>
.
So why Drill doesn't transform this query into a nested join query ? Simply
because there is no Calcite rule to transform it into a nested loop join ?
Is it not technically possible to write such Rule or is it feasible so I
may take on this challenge ?
Also pardon me for repeating my question but I fail to find an answer in
your replies, why doesn't Drill just run a cartesian join ? Because it's
expensive regarding resources (i.e. CPU\Network\RAM) ?
Thanks a lot Shadi for the query, it works for me.
*---------------------*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana
Hi Muhammad,
I did the following as a workaround to have Cartesian product. The
basic
idea is to create a dummy column on the fly that has the value 1 in
both
tables and then join on that column leading to having a match of
every row
of the first table with every row of the second table, hence do a
Cartesian
product. This might not be the most efficient way but it will do the
job.
*Original Query:*
SELECT * FROM
( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc LIMIT
2147483647) `t0`
INNER JOIN
( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc LIMIT
2147483647) `t1`
ON (`t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`)
LIMIT 2147483647
*Workaround (add columns **d1a381f3g73 and **d1a381f3g74 to tables
one
and two, respectively. Names don't really matter, just need to be
unique):*
SELECT * FROM
( SELECT *1 as d1a381f3g73*, 'ABC' `UserID` FROM
`dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t0`
INNER JOIN
( SELECT *1 as d1a381f3g74*, 'ABC' `UserID` FROM
`dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1`
ON (`t0`.*d1a381f3g73 = *`t1`.*d1a381f3g74*)
WHERE `t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`
LIMIT 2147483647
Regards
*Shadi Khalifa, PhD*
Postdoctoral Fellow
Cognitive Analytics Development Hub
Centre for Advanced Computing
Queen’s University
(613) 533-6000 x78347
http://cac.queensu.ca
I'm just a neuron in the society collective brain
*Join us for HPCS in June 2017! Register at:* *http://2017.hpcs.ca/
<http://2017.hpcs.ca/>*
P Please consider your environmental responsibility before printing
this
e-mail
*01001001 00100000 01101100 01101111 01110110 01100101 00100000
01000101
01100111 01111001 01110000 01110100 *
*The information transmitted is intended only for the person or
entity to
which it is addressed and may contain confidential material. Any
review or
dissemination of this information by persons other than the intended
recipient is prohibited. If you received this in error, please
contact the
sender and delete the material from any computer. Thank you.*
On Saturday, May 6, 2017 6:05 PM, Muhammad Gelbana <
​​
SELECT * FROM (SELECT 'ABC' `UserID` FROM
`dfs`.`path_to_parquet_file` tc
LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM
`dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (
​​
`t0`.`UserID` IS NOT DISTINCT FROM
​​
`t1`.`UserID`) LIMIT 2147483647
I debugged Drill code and found it decomposes *IS NOT DISTINCT FROM*
into

*`t0`.`UserID` = ​`t1`.`UserID` OR (`t0`.`UserID` IS NULL &&
`t1`.`UserID`
IS NULL**)* while checking if the query is a cartesian join, and
when the
check returns true, it throws an excetion saying: *This query cannot
be
planned possibly due to either a cartesian join or an inequality
join*
*---------------------*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana
Post by Gautam Parai
Can you please specify the query you are trying to execute?
Gautam
________________________________
Sent: Saturday, May 6, 2017 7:34:53 AM
Subject: Running cartesian joins on Drill
Is there a reason why Drill would intentionally reject cartesian
join
Post by Gautam Parai
queries even if *planner.enable_nljoin_for_scalar_only* is
disabled ?
Post by Gautam Parai
Any ideas how could a query be rewritten to overcome this
restriction ?
Post by Gautam Parai
*---------------------*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana
Muhammad Gelbana
2017-05-11 22:17:46 UTC
Permalink
Raw Message
​But the query I provided failed to be planned because it's a cartesian
join, although I've set the option you mentioned to false. Is there a
reason why wouldn't Drill rules physically implement the logical join in my
query to a nested loop join ?

*---------------------*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana
Post by Zelaine Fong
Provided `planner.enable_nljoin_for_scalar_only` is set to false, even
without an explicit join condition, the query should use the Cartesian
join/nested loop join.
-- Zelaine
Hi,
I have one question here.. so if we have to use Cartesian join in Drill
then do we have to follow some workaround like Shadi mention : adding a
dummy column on the fly that has the value 1 in both tables and then join
on that column leading to having a match of every row of the first table
with every row of the second table, hence do a Cartesian product?
OR
select a.*, b.* from tt1 as a, tt2 b; then will it internally treat this
query as Cartesian join.
Regards,
*Anup Tiwari*
Post by Zelaine Fong
Cartesian joins in Drill are implemented as nested loop joins, and I
think
Post by Zelaine Fong
you should see that reflected in the resultant query plan when you
run
Post by Zelaine Fong
explain plan on the query.
Yes, Cartesian joins/nested loop joins are expensive because you’re
effectively doing an MxN read of your tables. There are more
efficient
Post by Zelaine Fong
ways of processing a nested loop join, e.g., by creating an index on
the
Post by Zelaine Fong
larger table in the join and then using that index to do lookups
into that
Post by Zelaine Fong
table. That way, the nested loop join cost is the cost of creating
the
Post by Zelaine Fong
index + M, where M is the number of rows in the smaller table and
assuming
Post by Zelaine Fong
the lookup cost into the index does minimize the amount of data read
of the
Post by Zelaine Fong
second table. Drill currently doesn’t do this.
-- Zelaine
​I believe ​clhubert is referring to this discussion
<http://drill-user.incubator.apache.narkive.com/TIXWiTY4/
cartesian-product-in-apache-drill#post1>
.
So why Drill doesn't transform this query into a nested join
query ?
Post by Zelaine Fong
Simply
because there is no Calcite rule to transform it into a nested
loop
Post by Zelaine Fong
join ?
Is it not technically possible to write such Rule or is it
feasible so
Post by Zelaine Fong
I
may take on this challenge ?
Also pardon me for repeating my question but I fail to find an
answer
Post by Zelaine Fong
in
your replies, why doesn't Drill just run a cartesian join ?
Because
Post by Zelaine Fong
it's
expensive regarding resources (i.e. CPU\Network\RAM) ?
Thanks a lot Shadi for the query, it works for me.
*---------------------*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana
On Mon, May 8, 2017 at 6:10 AM, Shadi Khalifa <
Hi Muhammad,
I did the following as a workaround to have Cartesian product.
The
Post by Zelaine Fong
basic
idea is to create a dummy column on the fly that has the value
1 in
Post by Zelaine Fong
both
tables and then join on that column leading to having a match
of
Post by Zelaine Fong
every row
of the first table with every row of the second table, hence
do a
Post by Zelaine Fong
Cartesian
product. This might not be the most efficient way but it will
do the
Post by Zelaine Fong
job.
*Original Query:*
SELECT * FROM
( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc
LIMIT
Post by Zelaine Fong
2147483647) `t0`
INNER JOIN
( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc
LIMIT
Post by Zelaine Fong
2147483647) `t1`
ON (`t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`)
LIMIT 2147483647
*Workaround (add columns **d1a381f3g73 and **d1a381f3g74 to
tables
Post by Zelaine Fong
one
and two, respectively. Names don't really matter, just need to
be
Post by Zelaine Fong
unique):*
SELECT * FROM
( SELECT *1 as d1a381f3g73*, 'ABC' `UserID` FROM
`dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t0`
INNER JOIN
( SELECT *1 as d1a381f3g74*, 'ABC' `UserID` FROM
`dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1`
ON (`t0`.*d1a381f3g73 = *`t1`.*d1a381f3g74*)
WHERE `t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`
LIMIT 2147483647
Regards
*Shadi Khalifa, PhD*
Postdoctoral Fellow
Cognitive Analytics Development Hub
Centre for Advanced Computing
Queen’s University
(613) 533-6000 x78347
http://cac.queensu.ca
I'm just a neuron in the society collective brain
*Join us for HPCS in June 2017! Register at:* *
http://2017.hpcs.ca/
Post by Zelaine Fong
<http://2017.hpcs.ca/>*
P Please consider your environmental responsibility before
printing
Post by Zelaine Fong
this
e-mail
*01001001 00100000 01101100 01101111 01110110 01100101 00100000
01000101
01100111 01111001 01110000 01110100 *
*The information transmitted is intended only for the person or
entity to
which it is addressed and may contain confidential material.
Any
Post by Zelaine Fong
review or
dissemination of this information by persons other than the
intended
Post by Zelaine Fong
recipient is prohibited. If you received this in error, please
contact the
sender and delete the material from any computer. Thank you.*
On Saturday, May 6, 2017 6:05 PM, Muhammad Gelbana <
​​
SELECT * FROM (SELECT 'ABC' `UserID` FROM
`dfs`.`path_to_parquet_file` tc
LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM
`dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (
​​
`t0`.`UserID` IS NOT DISTINCT FROM
​​
`t1`.`UserID`) LIMIT 2147483647
I debugged Drill code and found it decomposes *IS NOT DISTINCT
FROM*
Post by Zelaine Fong
into
​
*`t0`.`UserID` = ​`t1`.`UserID` OR (`t0`.`UserID` IS NULL &&
`t1`.`UserID`
IS NULL**)* while checking if the query is a cartesian join,
and
Post by Zelaine Fong
when the
check returns true, it throws an excetion saying: *This query
cannot
Post by Zelaine Fong
be
planned possibly due to either a cartesian join or an
inequality
Post by Zelaine Fong
join*
*---------------------*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana
Post by Gautam Parai
Can you please specify the query you are trying to execute?
Gautam
________________________________
Sent: Saturday, May 6, 2017 7:34:53 AM
Subject: Running cartesian joins on Drill
Is there a reason why Drill would intentionally reject
cartesian
Post by Zelaine Fong
join
Post by Gautam Parai
queries even if *planner.enable_nljoin_for_scalar_only* is
disabled ?
Post by Gautam Parai
Any ideas how could a query be rewritten to overcome this
restriction ?
Post by Gautam Parai
*---------------------*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana
Zelaine Fong
2017-05-11 22:23:52 UTC
Permalink
Raw Message
I’m not sure why it isn’t working for you. Using Drill 1.10, here’s my output:

0: jdbc:drill:zk=local> alter session set `planner.enable_nljoin_for_scalar_only` = false;
+-------+-------------------------------------------------+
| ok | summary |
+-------+-------------------------------------------------+
| true | planner.enable_nljoin_for_scalar_only updated. |
+-------+-------------------------------------------------+
1 row selected (0.137 seconds)
0: jdbc:drill:zk=local> explain plan for select * from dfs.`/Users/zfong/foo.csv` t1, dfs.`/Users/zfong/foo.csv` t2;
+------+------+
| text | json |
+------+------+
| 00-00 Screen
00-01 ProjectAllowDup(*=[$0], *0=[$1])
00-02 NestedLoopJoin(condition=[true], joinType=[inner])
00-04 Project(T2¦¦*=[$0])
00-06 Scan(groupscan=[EasyGroupScan [selectionRoot=file:/Users/zfong/foo.csv, numFiles=1, columns=[`*`], files=[file:/Users/zfong/foo.csv]]])
00-03 Project(T3¦¦*=[$0])
00-05 Scan(groupscan=[EasyGroupScan [selectionRoot=file:/Users/zfong/foo.csv, numFiles=1, columns=[`*`], files=[file:/Users/zfong/foo.csv]]])

-- Zelaine

On 5/11/17, 3:17 PM, "Muhammad Gelbana" <***@gmail.com> wrote:

​But the query I provided failed to be planned because it's a cartesian
join, although I've set the option you mentioned to false. Is there a
reason why wouldn't Drill rules physically implement the logical join in my
query to a nested loop join ?

*---------------------*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana
Post by Zelaine Fong
Provided `planner.enable_nljoin_for_scalar_only` is set to false, even
without an explicit join condition, the query should use the Cartesian
join/nested loop join.
-- Zelaine
Hi,
I have one question here.. so if we have to use Cartesian join in Drill
then do we have to follow some workaround like Shadi mention : adding a
dummy column on the fly that has the value 1 in both tables and then join
on that column leading to having a match of every row of the first table
with every row of the second table, hence do a Cartesian product?
OR
select a.*, b.* from tt1 as a, tt2 b; then will it internally treat this
query as Cartesian join.
Regards,
*Anup Tiwari*
Post by Zelaine Fong
Cartesian joins in Drill are implemented as nested loop joins, and I
think
Post by Zelaine Fong
you should see that reflected in the resultant query plan when you
run
Post by Zelaine Fong
explain plan on the query.
Yes, Cartesian joins/nested loop joins are expensive because you’re
effectively doing an MxN read of your tables. There are more
efficient
Post by Zelaine Fong
ways of processing a nested loop join, e.g., by creating an index on
the
Post by Zelaine Fong
larger table in the join and then using that index to do lookups
into that
Post by Zelaine Fong
table. That way, the nested loop join cost is the cost of creating
the
Post by Zelaine Fong
index + M, where M is the number of rows in the smaller table and
assuming
Post by Zelaine Fong
the lookup cost into the index does minimize the amount of data read
of the
Post by Zelaine Fong
second table. Drill currently doesn’t do this.
-- Zelaine
​I believe ​clhubert is referring to this discussion
<http://drill-user.incubator.apache.narkive.com/TIXWiTY4/
cartesian-product-in-apache-drill#post1>
.
So why Drill doesn't transform this query into a nested join
query ?
Post by Zelaine Fong
Simply
because there is no Calcite rule to transform it into a nested
loop
Post by Zelaine Fong
join ?
Is it not technically possible to write such Rule or is it
feasible so
Post by Zelaine Fong
I
may take on this challenge ?
Also pardon me for repeating my question but I fail to find an
answer
Post by Zelaine Fong
in
your replies, why doesn't Drill just run a cartesian join ?
Because
Post by Zelaine Fong
it's
expensive regarding resources (i.e. CPU\Network\RAM) ?
Thanks a lot Shadi for the query, it works for me.
*---------------------*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana
On Mon, May 8, 2017 at 6:10 AM, Shadi Khalifa <
Hi Muhammad,
I did the following as a workaround to have Cartesian product.
The
Post by Zelaine Fong
basic
idea is to create a dummy column on the fly that has the value
1 in
Post by Zelaine Fong
both
tables and then join on that column leading to having a match
of
Post by Zelaine Fong
every row
of the first table with every row of the second table, hence
do a
Post by Zelaine Fong
Cartesian
product. This might not be the most efficient way but it will
do the
Post by Zelaine Fong
job.
*Original Query:*
SELECT * FROM
( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc
LIMIT
Post by Zelaine Fong
2147483647) `t0`
INNER JOIN
( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc
LIMIT
Post by Zelaine Fong
2147483647) `t1`
ON (`t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`)
LIMIT 2147483647
*Workaround (add columns **d1a381f3g73 and **d1a381f3g74 to
tables
Post by Zelaine Fong
one
and two, respectively. Names don't really matter, just need to
be
Post by Zelaine Fong
unique):*
SELECT * FROM
( SELECT *1 as d1a381f3g73*, 'ABC' `UserID` FROM
`dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t0`
INNER JOIN
( SELECT *1 as d1a381f3g74*, 'ABC' `UserID` FROM
`dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1`
ON (`t0`.*d1a381f3g73 = *`t1`.*d1a381f3g74*)
WHERE `t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`
LIMIT 2147483647
Regards
*Shadi Khalifa, PhD*
Postdoctoral Fellow
Cognitive Analytics Development Hub
Centre for Advanced Computing
Queen’s University
(613) 533-6000 x78347
http://cac.queensu.ca
I'm just a neuron in the society collective brain
*Join us for HPCS in June 2017! Register at:* *
http://2017.hpcs.ca/
Post by Zelaine Fong
<http://2017.hpcs.ca/>*
P Please consider your environmental responsibility before
printing
Post by Zelaine Fong
this
e-mail
*01001001 00100000 01101100 01101111 01110110 01100101 00100000
01000101
01100111 01111001 01110000 01110100 *
*The information transmitted is intended only for the person or
entity to
which it is addressed and may contain confidential material.
Any
Post by Zelaine Fong
review or
dissemination of this information by persons other than the
intended
Post by Zelaine Fong
recipient is prohibited. If you received this in error, please
contact the
sender and delete the material from any computer. Thank you.*
On Saturday, May 6, 2017 6:05 PM, Muhammad Gelbana <
​​
SELECT * FROM (SELECT 'ABC' `UserID` FROM
`dfs`.`path_to_parquet_file` tc
LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM
`dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (
​​
`t0`.`UserID` IS NOT DISTINCT FROM
​​
`t1`.`UserID`) LIMIT 2147483647
I debugged Drill code and found it decomposes *IS NOT DISTINCT
FROM*
Post by Zelaine Fong
into

*`t0`.`UserID` = ​`t1`.`UserID` OR (`t0`.`UserID` IS NULL &&
`t1`.`UserID`
IS NULL**)* while checking if the query is a cartesian join,
and
Post by Zelaine Fong
when the
check returns true, it throws an excetion saying: *This query
cannot
Post by Zelaine Fong
be
planned possibly due to either a cartesian join or an
inequality
Post by Zelaine Fong
join*
*---------------------*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana
Post by Gautam Parai
Can you please specify the query you are trying to execute?
Gautam
________________________________
Sent: Saturday, May 6, 2017 7:34:53 AM
Subject: Running cartesian joins on Drill
Is there a reason why Drill would intentionally reject
cartesian
Post by Zelaine Fong
join
Post by Gautam Parai
queries even if *planner.enable_nljoin_for_scalar_only* is
disabled ?
Post by Gautam Parai
Any ideas how could a query be rewritten to overcome this
restriction ?
Post by Gautam Parai
*---------------------*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana
Aman Sinha
2017-05-11 23:29:20 UTC
Permalink
Raw Message
I think Muhammad may be trying to run his original query with IS NOT DISTINCT FROM. That discussion got side-tracked into Cartesian joins because his query was not getting planned and the error was about Cartesian join.

Muhammad, can you try with the equivalent version below ? You mentioned the rewrite but did you try the rewritten version ?



SELECT * FROM (SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc

LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM

`dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (

​​

`t0`.`UserID` = `t1`.`UserID` OR (`t0`.`UserID` IS NULL && `t1`.`UserID` IS NULL) )



On 5/11/17, 3:23 PM, "Zelaine Fong" <***@mapr.com> wrote:



I’m not sure why it isn’t working for you. Using Drill 1.10, here’s my output:



0: jdbc:drill:zk=local> alter session set `planner.enable_nljoin_for_scalar_only` = false;

+-------+-------------------------------------------------+

| ok | summary |

+-------+-------------------------------------------------+

| true | planner.enable_nljoin_for_scalar_only updated. |

+-------+-------------------------------------------------+

1 row selected (0.137 seconds)

0: jdbc:drill:zk=local> explain plan for select * from dfs.`/Users/zfong/foo.csv` t1, dfs.`/Users/zfong/foo.csv` t2;

+------+------+

| text | json |

+------+------+

| 00-00 Screen

00-01 ProjectAllowDup(*=[$0], *0=[$1])

00-02 NestedLoopJoin(condition=[true], joinType=[inner])

00-04 Project(T2ŠŠ*=[$0])

00-06 Scan(groupscan=[EasyGroupScan [selectionRoot=file:/Users/zfong/foo.csv, numFiles=1, columns=[`*`], files=[file:/Users/zfong/foo.csv]]])

00-03 Project(T3ŠŠ*=[$0])

00-05 Scan(groupscan=[EasyGroupScan [selectionRoot=file:/Users/zfong/foo.csv, numFiles=1, columns=[`*`], files=[file:/Users/zfong/foo.csv]]])



-- Zelaine



On 5/11/17, 3:17 PM, "Muhammad Gelbana" <***@gmail.com> wrote:



​But the query I provided failed to be planned because it's a cartesian

join, although I've set the option you mentioned to false. Is there a

reason why wouldn't Drill rules physically implement the logical join in my

query to a nested loop join ?



*---------------------*

*Muhammad Gelbana*

http://www.linkedin.com/in/mgelbana
Post by Zelaine Fong
Provided `planner.enable_nljoin_for_scalar_only` is set to false, even
without an explicit join condition, the query should use the Cartesian
join/nested loop join.
-- Zelaine
Hi,
I have one question here.. so if we have to use Cartesian join in Drill
then do we have to follow some workaround like Shadi mention : adding a
dummy column on the fly that has the value 1 in both tables and then
join
on that column leading to having a match of every row of the first
table
with every row of the second table, hence do a Cartesian product?
OR
select a.*, b.* from tt1 as a, tt2 b; then will it internally treat
this
query as Cartesian join.
Regards,
*Anup Tiwari*
Post by Zelaine Fong
Cartesian joins in Drill are implemented as nested loop joins, and I
think
Post by Zelaine Fong
you should see that reflected in the resultant query plan when you
run
Post by Zelaine Fong
explain plan on the query.
Yes, Cartesian joins/nested loop joins are expensive because you’re
effectively doing an MxN read of your tables. There are more
efficient
Post by Zelaine Fong
ways of processing a nested loop join, e.g., by creating an index on
the
Post by Zelaine Fong
larger table in the join and then using that index to do lookups
into that
Post by Zelaine Fong
table. That way, the nested loop join cost is the cost of creating
the
Post by Zelaine Fong
index + M, where M is the number of rows in the smaller table and
assuming
Post by Zelaine Fong
the lookup cost into the index does minimize the amount of data read
of the
Post by Zelaine Fong
second table. Drill currently doesn’t do this.
-- Zelaine
​I believe ​clhubert is referring to this discussion
<http://drill-user.incubator.apache.narkive.com/TIXWiTY4/
cartesian-product-in-apache-drill#post1>
.
So why Drill doesn't transform this query into a nested join
query ?
Post by Zelaine Fong
Simply
because there is no Calcite rule to transform it into a nested
loop
Post by Zelaine Fong
join ?
Is it not technically possible to write such Rule or is it
feasible so
Post by Zelaine Fong
I
may take on this challenge ?
Also pardon me for repeating my question but I fail to find an
answer
Post by Zelaine Fong
in
your replies, why doesn't Drill just run a cartesian join ?
Because
Post by Zelaine Fong
it's
expensive regarding resources (i.e. CPU\Network\RAM) ?
Thanks a lot Shadi for the query, it works for me.
*---------------------*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana
On Mon, May 8, 2017 at 6:10 AM, Shadi Khalifa <
Hi Muhammad,
I did the following as a workaround to have Cartesian product.
The
Post by Zelaine Fong
basic
idea is to create a dummy column on the fly that has the value
1 in
Post by Zelaine Fong
both
tables and then join on that column leading to having a match
of
Post by Zelaine Fong
every row
of the first table with every row of the second table, hence
do a
Post by Zelaine Fong
Cartesian
product. This might not be the most efficient way but it will
do the
Post by Zelaine Fong
job.
*Original Query:*
SELECT * FROM
( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc
LIMIT
Post by Zelaine Fong
2147483647) `t0`
INNER JOIN
( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc
LIMIT
Post by Zelaine Fong
2147483647) `t1`
ON (`t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`)
LIMIT 2147483647
*Workaround (add columns **d1a381f3g73 and **d1a381f3g74 to
tables
Post by Zelaine Fong
one
and two, respectively. Names don't really matter, just need to
be
Post by Zelaine Fong
unique):*
SELECT * FROM
( SELECT *1 as d1a381f3g73*, 'ABC' `UserID` FROM
`dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t0`
INNER JOIN
( SELECT *1 as d1a381f3g74*, 'ABC' `UserID` FROM
`dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1`
ON (`t0`.*d1a381f3g73 = *`t1`.*d1a381f3g74*)
WHERE `t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`
LIMIT 2147483647
Regards
*Shadi Khalifa, PhD*
Postdoctoral Fellow
Cognitive Analytics Development Hub
Centre for Advanced Computing
Queen’s University
(613) 533-6000 x78347
http://cac.queensu.ca
I'm just a neuron in the society collective brain
*Join us for HPCS in June 2017! Register at:* *
http://2017.hpcs.ca/
Post by Zelaine Fong
<http://2017.hpcs.ca/>*
P Please consider your environmental responsibility before
printing
Post by Zelaine Fong
this
e-mail
*01001001 00100000 01101100 01101111 01110110 01100101 00100000
01000101
01100111 01111001 01110000 01110100 *
*The information transmitted is intended only for the person or
entity to
which it is addressed and may contain confidential material.
Any
Post by Zelaine Fong
review or
dissemination of this information by persons other than the
intended
Post by Zelaine Fong
recipient is prohibited. If you received this in error, please
contact the
sender and delete the material from any computer. Thank you.*
On Saturday, May 6, 2017 6:05 PM, Muhammad Gelbana <
​​
SELECT * FROM (SELECT 'ABC' `UserID` FROM
`dfs`.`path_to_parquet_file` tc
LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM
`dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (
​​
`t0`.`UserID` IS NOT DISTINCT FROM
​​
`t1`.`UserID`) LIMIT 2147483647
I debugged Drill code and found it decomposes *IS NOT DISTINCT
FROM*
Post by Zelaine Fong
into
​
*`t0`.`UserID` = ​`t1`.`UserID` OR (`t0`.`UserID` IS NULL &&
`t1`.`UserID`
IS NULL**)* while checking if the query is a cartesian join,
and
Post by Zelaine Fong
when the
check returns true, it throws an excetion saying: *This query
cannot
Post by Zelaine Fong
be
planned possibly due to either a cartesian join or an
inequality
Post by Zelaine Fong
join*
*---------------------*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana
Post by Gautam Parai
Can you please specify the query you are trying to execute?
Gautam
________________________________
Sent: Saturday, May 6, 2017 7:34:53 AM
Subject: Running cartesian joins on Drill
Is there a reason why Drill would intentionally reject
cartesian
Post by Zelaine Fong
join
Post by Gautam Parai
queries even if *planner.enable_nljoin_for_scalar_only* is
disabled ?
Post by Gautam Parai
Any ideas how could a query be rewritten to overcome this
restriction ?
Post by Gautam Parai
*---------------------*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana
Aman Sinha
2017-05-12 04:50:30 UTC
Permalink
Raw Message
Muhammad,
The join condition ‘a = b or (a is null && b is null)’ works. Internally, this is converted to ‘a is not distinct from b’ which is processed by Drill.
For some reason, if the second form is directly supplied in the user query, it is not working and ends up with the Cartesian join condition. Drill leverages Calcite for this (you can see CALCITE-1200 for some background).
Can you file a JIRA for this ?

-Aman

From: "Aman Sinha (***@mapr.com)" <***@mapr.com>
Date: Thursday, May 11, 2017 at 4:29 PM
To: dev <***@drill.apache.org>, user <***@drill.apache.org>
Cc: Shadi Khalifa <***@cs.queensu.ca>
Subject: Re: Running cartesian joins on Drill


I think Muhammad may be trying to run his original query with IS NOT DISTINCT FROM. That discussion got side-tracked into Cartesian joins because his query was not getting planned and the error was about Cartesian join.

Muhammad, can you try with the equivalent version below ? You mentioned the rewrite but did you try the rewritten version ?



SELECT * FROM (SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc

LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM

`dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (

​​

`t0`.`UserID` = `t1`.`UserID` OR (`t0`.`UserID` IS NULL && `t1`.`UserID` IS NULL) )



On 5/11/17, 3:23 PM, "Zelaine Fong" <***@mapr.com> wrote:



I’m not sure why it isn’t working for you. Using Drill 1.10, here’s my output:



0: jdbc:drill:zk=local> alter session set `planner.enable_nljoin_for_scalar_only` = false;

+-------+-------------------------------------------------+

| ok | summary |

+-------+-------------------------------------------------+

| true | planner.enable_nljoin_for_scalar_only updated. |

+-------+-------------------------------------------------+

1 row selected (0.137 seconds)

0: jdbc:drill:zk=local> explain plan for select * from dfs.`/Users/zfong/foo.csv` t1, dfs.`/Users/zfong/foo.csv` t2;

+------+------+

| text | json |

+------+------+

| 00-00 Screen

00-01 ProjectAllowDup(*=[$0], *0=[$1])

00-02 NestedLoopJoin(condition=[true], joinType=[inner])

00-04 Project(T2ŠŠ*=[$0])

00-06 Scan(groupscan=[EasyGroupScan [selectionRoot=file:/Users/zfong/foo.csv, numFiles=1, columns=[`*`], files=[file:/Users/zfong/foo.csv]]])

00-03 Project(T3ŠŠ*=[$0])

00-05 Scan(groupscan=[EasyGroupScan [selectionRoot=file:/Users/zfong/foo.csv, numFiles=1, columns=[`*`], files=[file:/Users/zfong/foo.csv]]])



-- Zelaine



On 5/11/17, 3:17 PM, "Muhammad Gelbana" <***@gmail.com> wrote:



​But the query I provided failed to be planned because it's a cartesian

join, although I've set the option you mentioned to false. Is there a

reason why wouldn't Drill rules physically implement the logical join in my

query to a nested loop join ?



*---------------------*

*Muhammad Gelbana*

http://www.linkedin.com/in/mgelbana
Post by Zelaine Fong
Provided `planner.enable_nljoin_for_scalar_only` is set to false, even
without an explicit join condition, the query should use the Cartesian
join/nested loop join.
-- Zelaine
Hi,
I have one question here.. so if we have to use Cartesian join in Drill
then do we have to follow some workaround like Shadi mention : adding a
dummy column on the fly that has the value 1 in both tables and then
join
on that column leading to having a match of every row of the first
table
with every row of the second table, hence do a Cartesian product?
OR
select a.*, b.* from tt1 as a, tt2 b; then will it internally treat
this
query as Cartesian join.
Regards,
*Anup Tiwari*
Post by Zelaine Fong
Cartesian joins in Drill are implemented as nested loop joins, and I
think
Post by Zelaine Fong
you should see that reflected in the resultant query plan when you
run
Post by Zelaine Fong
explain plan on the query.
Yes, Cartesian joins/nested loop joins are expensive because you’re
effectively doing an MxN read of your tables. There are more
efficient
Post by Zelaine Fong
ways of processing a nested loop join, e.g., by creating an index on
the
Post by Zelaine Fong
larger table in the join and then using that index to do lookups
into that
Post by Zelaine Fong
table. That way, the nested loop join cost is the cost of creating
the
Post by Zelaine Fong
index + M, where M is the number of rows in the smaller table and
assuming
Post by Zelaine Fong
the lookup cost into the index does minimize the amount of data read
of the
Post by Zelaine Fong
second table. Drill currently doesn’t do this.
-- Zelaine
​I believe ​clhubert is referring to this discussion
<http://drill-user.incubator.apache.narkive.com/TIXWiTY4/
cartesian-product-in-apache-drill#post1>
.
So why Drill doesn't transform this query into a nested join
query ?
Post by Zelaine Fong
Simply
because there is no Calcite rule to transform it into a nested
loop
Post by Zelaine Fong
join ?
Is it not technically possible to write such Rule or is it
feasible so
Post by Zelaine Fong
I
may take on this challenge ?
Also pardon me for repeating my question but I fail to find an
answer
Post by Zelaine Fong
in
your replies, why doesn't Drill just run a cartesian join ?
Because
Post by Zelaine Fong
it's
expensive regarding resources (i.e. CPU\Network\RAM) ?
Thanks a lot Shadi for the query, it works for me.
*---------------------*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana
On Mon, May 8, 2017 at 6:10 AM, Shadi Khalifa <
Hi Muhammad,
I did the following as a workaround to have Cartesian product.
The
Post by Zelaine Fong
basic
idea is to create a dummy column on the fly that has the value
1 in
Post by Zelaine Fong
both
tables and then join on that column leading to having a match
of
Post by Zelaine Fong
every row
of the first table with every row of the second table, hence
do a
Post by Zelaine Fong
Cartesian
product. This might not be the most efficient way but it will
do the
Post by Zelaine Fong
job.
*Original Query:*
SELECT * FROM
( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc
LIMIT
Post by Zelaine Fong
2147483647) `t0`
INNER JOIN
( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc
LIMIT
Post by Zelaine Fong
2147483647) `t1`
ON (`t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`)
LIMIT 2147483647
*Workaround (add columns **d1a381f3g73 and **d1a381f3g74 to
tables
Post by Zelaine Fong
one
and two, respectively. Names don't really matter, just need to
be
Post by Zelaine Fong
unique):*
SELECT * FROM
( SELECT *1 as d1a381f3g73*, 'ABC' `UserID` FROM
`dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t0`
INNER JOIN
( SELECT *1 as d1a381f3g74*, 'ABC' `UserID` FROM
`dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1`
ON (`t0`.*d1a381f3g73 = *`t1`.*d1a381f3g74*)
WHERE `t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`
LIMIT 2147483647
Regards
*Shadi Khalifa, PhD*
Postdoctoral Fellow
Cognitive Analytics Development Hub
Centre for Advanced Computing
Queen’s University
(613) 533-6000 x78347
http://cac.queensu.ca
I'm just a neuron in the society collective brain
*Join us for HPCS in June 2017! Register at:* *
http://2017.hpcs.ca/
Post by Zelaine Fong
<http://2017.hpcs.ca/>*
P Please consider your environmental responsibility before
printing
Post by Zelaine Fong
this
e-mail
*01001001 00100000 01101100 01101111 01110110 01100101 00100000
01000101
01100111 01111001 01110000 01110100 *
*The information transmitted is intended only for the person or
entity to
which it is addressed and may contain confidential material.
Any
Post by Zelaine Fong
review or
dissemination of this information by persons other than the
intended
Post by Zelaine Fong
recipient is prohibited. If you received this in error, please
contact the
sender and delete the material from any computer. Thank you.*
On Saturday, May 6, 2017 6:05 PM, Muhammad Gelbana <
​​
SELECT * FROM (SELECT 'ABC' `UserID` FROM
`dfs`.`path_to_parquet_file` tc
LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM
`dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (
​​
`t0`.`UserID` IS NOT DISTINCT FROM
​​
`t1`.`UserID`) LIMIT 2147483647
I debugged Drill code and found it decomposes *IS NOT DISTINCT
FROM*
Post by Zelaine Fong
into
​
*`t0`.`UserID` = ​`t1`.`UserID` OR (`t0`.`UserID` IS NULL &&
`t1`.`UserID`
IS NULL**)* while checking if the query is a cartesian join,
and
Post by Zelaine Fong
when the
check returns true, it throws an excetion saying: *This query
cannot
Post by Zelaine Fong
be
planned possibly due to either a cartesian join or an
inequality
Post by Zelaine Fong
join*
*---------------------*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana
Post by Gautam Parai
Can you please specify the query you are trying to execute?
Gautam
________________________________
Sent: Saturday, May 6, 2017 7:34:53 AM
Subject: Running cartesian joins on Drill
Is there a reason why Drill would intentionally reject
cartesian
Post by Zelaine Fong
join
Post by Gautam Parai
queries even if *planner.enable_nljoin_for_scalar_only* is
disabled ?
Post by Gautam Parai
Any ideas how could a query be rewritten to overcome this
restriction ?
Post by Gautam Parai
*---------------------*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana
Muhammad Gelbana
2017-05-16 09:37:42 UTC
Permalink
Raw Message
You are correct Aman. Here is the JIRA issue
<https://issues.apache.org/jira/browse/DRILL-5515>

This thread has been very helpful. Thank you all.

*---------------------*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana
Muhammad,
The join condition ‘a = b or (a is null && b is null)’ works.
Internally, this is converted to ‘a is not distinct from b’ which is
processed by Drill.
For some reason, if the second form is directly supplied in the user
query, it is not working and ends up with the Cartesian join condition.
Drill leverages Calcite for this (you can see CALCITE-1200 for some
background).
Can you file a JIRA for this ?
-Aman
Date: Thursday, May 11, 2017 at 4:29 PM
Subject: Re: Running cartesian joins on Drill
I think Muhammad may be trying to run his original query with IS NOT
DISTINCT FROM. That discussion got side-tracked into Cartesian joins
because his query was not getting planned and the error was about Cartesian
join.
Muhammad, can you try with the equivalent version below ? You mentioned
the rewrite but did you try the rewritten version ?
SELECT * FROM (SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc
LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM
`dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (
​​
`t0`.`UserID` = `t1`.`UserID` OR (`t0`.`UserID` IS NULL && `t1`.`UserID` IS NULL) )
I’m not sure why it isn’t working for you. Using Drill 1.10, here’s
0: jdbc:drill:zk=local> alter session set `planner.enable_nljoin_for_scalar_only`
= false;
+-------+-------------------------------------------------+
| ok | summary |
+-------+-------------------------------------------------+
| true | planner.enable_nljoin_for_scalar_only updated. |
+-------+-------------------------------------------------+
1 row selected (0.137 seconds)
0: jdbc:drill:zk=local> explain plan for select * from
dfs.`/Users/zfong/foo.csv` t1, dfs.`/Users/zfong/foo.csv` t2;
+------+------+
| text | json |
+------+------+
| 00-00 Screen
00-01 ProjectAllowDup(*=[$0], *0=[$1])
00-02 NestedLoopJoin(condition=[true], joinType=[inner])
00-04 Project(T2ŠŠ*=[$0])
00-06 Scan(groupscan=[EasyGroupScan
[selectionRoot=file:/Users/zfong/foo.csv, numFiles=1, columns=[`*`],
files=[file:/Users/zfong/foo.csv]]])
00-03 Project(T3ŠŠ*=[$0])
00-05 Scan(groupscan=[EasyGroupScan
[selectionRoot=file:/Users/zfong/foo.csv, numFiles=1, columns=[`*`],
files=[file:/Users/zfong/foo.csv]]])
-- Zelaine
​But the query I provided failed to be planned because it's a cartesian
join, although I've set the option you mentioned to false. Is there a
reason why wouldn't Drill rules physically implement the logical join in my
query to a nested loop join ?
*---------------------*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana
Post by Zelaine Fong
Provided `planner.enable_nljoin_for_scalar_only` is set to
false, even
Post by Zelaine Fong
without an explicit join condition, the query should use the
Cartesian
Post by Zelaine Fong
join/nested loop join.
-- Zelaine
Hi,
I have one question here.. so if we have to use Cartesian
join in Drill
Post by Zelaine Fong
then do we have to follow some workaround like Shadi mention
: adding a
Post by Zelaine Fong
dummy column on the fly that has the value 1 in both tables
and then
Post by Zelaine Fong
join
on that column leading to having a match of every row of the
first
Post by Zelaine Fong
table
with every row of the second table, hence do a Cartesian
product?
Post by Zelaine Fong
OR
select a.*, b.* from tt1 as a, tt2 b; then will it
internally treat
Post by Zelaine Fong
this
query as Cartesian join.
Regards,
*Anup Tiwari*
On Mon, May 8, 2017 at 10:00 PM, Zelaine Fong <
Post by Zelaine Fong
Cartesian joins in Drill are implemented as nested loop
joins, and I
Post by Zelaine Fong
think
Post by Zelaine Fong
you should see that reflected in the resultant query plan
when you
Post by Zelaine Fong
run
Post by Zelaine Fong
explain plan on the query.
Yes, Cartesian joins/nested loop joins are expensive
because you’re
Post by Zelaine Fong
Post by Zelaine Fong
effectively doing an MxN read of your tables. There are
more
Post by Zelaine Fong
efficient
Post by Zelaine Fong
ways of processing a nested loop join, e.g., by creating
an index on
Post by Zelaine Fong
the
Post by Zelaine Fong
larger table in the join and then using that index to do
lookups
Post by Zelaine Fong
into that
Post by Zelaine Fong
table. That way, the nested loop join cost is the cost of
creating
Post by Zelaine Fong
the
Post by Zelaine Fong
index + M, where M is the number of rows in the smaller
table and
Post by Zelaine Fong
assuming
Post by Zelaine Fong
the lookup cost into the index does minimize the amount of
data read
Post by Zelaine Fong
of the
Post by Zelaine Fong
second table. Drill currently doesn’t do this.
-- Zelaine
On 5/8/17, 9:09 AM, "Muhammad Gelbana" <
​I believe ​clhubert is referring to this discussion
<http://drill-user.incubator.
apache.narkive.com/TIXWiTY4/
Post by Zelaine Fong
Post by Zelaine Fong
cartesian-product-in-apache-drill#post1>
.
So why Drill doesn't transform this query into a
nested join
Post by Zelaine Fong
query ?
Post by Zelaine Fong
Simply
because there is no Calcite rule to transform it into
a nested
Post by Zelaine Fong
loop
Post by Zelaine Fong
join ?
Is it not technically possible to write such Rule or
is it
Post by Zelaine Fong
feasible so
Post by Zelaine Fong
I
may take on this challenge ?
Also pardon me for repeating my question but I fail to
find an
Post by Zelaine Fong
answer
Post by Zelaine Fong
in
your replies, why doesn't Drill just run a cartesian
join ?
Post by Zelaine Fong
Because
Post by Zelaine Fong
it's
expensive regarding resources (i.e. CPU\Network\RAM) ?
Thanks a lot Shadi for the query, it works for me.
*---------------------*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana
On Mon, May 8, 2017 at 6:10 AM, Shadi Khalifa <
Hi Muhammad,
I did the following as a workaround to have
Cartesian product.
Post by Zelaine Fong
The
Post by Zelaine Fong
basic
idea is to create a dummy column on the fly that has
the value
Post by Zelaine Fong
1 in
Post by Zelaine Fong
both
tables and then join on that column leading to
having a match
Post by Zelaine Fong
of
Post by Zelaine Fong
every row
of the first table with every row of the second
table, hence
Post by Zelaine Fong
do a
Post by Zelaine Fong
Cartesian
product. This might not be the most efficient way
but it will
Post by Zelaine Fong
do the
Post by Zelaine Fong
job.
*Original Query:*
SELECT * FROM
( SELECT 'ABC' `UserID` FROM
`dfs`.`path_to_parquet_file` tc
Post by Zelaine Fong
LIMIT
Post by Zelaine Fong
2147483647) `t0`
INNER JOIN
( SELECT 'ABC' `UserID` FROM
`dfs`.`path_to_parquet_file` tc
Post by Zelaine Fong
LIMIT
Post by Zelaine Fong
2147483647) `t1`
ON (`t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`)
LIMIT 2147483647
*Workaround (add columns **d1a381f3g73 and
**d1a381f3g74 to
Post by Zelaine Fong
tables
Post by Zelaine Fong
one
and two, respectively. Names don't really matter,
just need to
Post by Zelaine Fong
be
Post by Zelaine Fong
unique):*
SELECT * FROM
( SELECT *1 as d1a381f3g73*, 'ABC' `UserID` FROM
`dfs`.`path_to_parquet_file` tc LIMIT 2147483647)
`t0`
Post by Zelaine Fong
Post by Zelaine Fong
INNER JOIN
( SELECT *1 as d1a381f3g74*, 'ABC' `UserID` FROM
`dfs`.`path_to_parquet_file` tc LIMIT 2147483647)
`t1`
Post by Zelaine Fong
Post by Zelaine Fong
ON (`t0`.*d1a381f3g73 = *`t1`.*d1a381f3g74*)
WHERE `t0`.`UserID` IS NOT DISTINCT FROM
`t1`.`UserID`
Post by Zelaine Fong
Post by Zelaine Fong
LIMIT 2147483647
Regards
*Shadi Khalifa, PhD*
Postdoctoral Fellow
Cognitive Analytics Development Hub
Centre for Advanced Computing
Queen’s University
(613) 533-6000 x78347
http://cac.queensu.ca
I'm just a neuron in the society collective brain
*Join us for HPCS in June 2017! Register at:* *
http://2017.hpcs.ca/
Post by Zelaine Fong
<http://2017.hpcs.ca/>*
P Please consider your environmental responsibility
before
Post by Zelaine Fong
printing
Post by Zelaine Fong
this
e-mail
*01001001 00100000 01101100 01101111 01110110
01100101 00100000
Post by Zelaine Fong
Post by Zelaine Fong
01000101
01100111 01111001 01110000 01110100 *
*The information transmitted is intended only for
the person or
Post by Zelaine Fong
Post by Zelaine Fong
entity to
which it is addressed and may contain confidential
material.
Post by Zelaine Fong
Any
Post by Zelaine Fong
review or
dissemination of this information by persons other
than the
Post by Zelaine Fong
intended
Post by Zelaine Fong
recipient is prohibited. If you received this in
error, please
Post by Zelaine Fong
Post by Zelaine Fong
contact the
sender and delete the material from any computer.
Thank you.*
Post by Zelaine Fong
Post by Zelaine Fong
On Saturday, May 6, 2017 6:05 PM, Muhammad Gelbana <
​​
SELECT * FROM (SELECT 'ABC' `UserID` FROM
`dfs`.`path_to_parquet_file` tc
LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC'
`UserID` FROM
Post by Zelaine Fong
Post by Zelaine Fong
`dfs`.`path_to_parquet_file` tc LIMIT 2147483647)
`t1` ON (
Post by Zelaine Fong
Post by Zelaine Fong
​​
`t0`.`UserID` IS NOT DISTINCT FROM
​​
`t1`.`UserID`) LIMIT 2147483647
I debugged Drill code and found it decomposes *IS
NOT DISTINCT
Post by Zelaine Fong
FROM*
Post by Zelaine Fong
into
​
*`t0`.`UserID` = ​`t1`.`UserID` OR (`t0`.`UserID` IS
NULL &&
Post by Zelaine Fong
Post by Zelaine Fong
`t1`.`UserID`
IS NULL**)* while checking if the query is a
cartesian join,
Post by Zelaine Fong
and
Post by Zelaine Fong
when the
*This query
Post by Zelaine Fong
cannot
Post by Zelaine Fong
be
planned possibly due to either a cartesian join or an
inequality
Post by Zelaine Fong
join*
*---------------------*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana
On Sat, May 6, 2017 at 6:53 PM, Gautam Parai <
Post by Gautam Parai
Can you please specify the query you are trying to
execute?
Post by Zelaine Fong
Post by Zelaine Fong
Post by Gautam Parai
Gautam
________________________________
Sent: Saturday, May 6, 2017 7:34:53 AM
Subject: Running cartesian joins on Drill
Is there a reason why Drill would intentionally
reject
Post by Zelaine Fong
cartesian
Post by Zelaine Fong
join
Post by Gautam Parai
queries even if *planner.enable_nljoin_for_scalar_only*
is
Post by Zelaine Fong
Post by Zelaine Fong
disabled ?
Post by Gautam Parai
Any ideas how could a query be rewritten to
overcome this
Post by Zelaine Fong
Post by Zelaine Fong
restriction ?
Post by Gautam Parai
*---------------------*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana
Loading...