Discussion:
Drill data and database locality
(too old to reply)
Ivan Kovacevic
2017-06-10 17:02:18 UTC
Permalink
Raw Message
Greetings,

I have two servers. The first server (A) contains the zookeeper, a mongodb
database and a drillbit. The second server (B) contains a hadoop
distribution with several hive tables, a postgresql database and the other
drillbit. Both drillbits can see eachother on the drill homepages, since
they are both connected to the zookeeper on server A. When a query like the
one on the bottom is run (the view dfs.lineorder consists of the hive and
postgresql data on server B), drillbit B becomes the foreman and executes
everything. It does not delegate the parts dealing with mongodb tables to
the drillbit on server A, so 90% of the time required to run the query
(approx. 30 minutes) is spent on sending the mongodb collection from server
A to server B. Is there a way to force drillbit B to delegate the mongodb
part of the query to drillbit B, or could this be a misconfiguration
problem?
Also, can drillbits be configured to access different databases (e.g. when
each drillbit has access to a distinct subnet, so that not all drillbits
can access the same databases)?
Thank you.

Sample query:
SELECT SUM(revenue) AS revenue
FROM (
SELECT SUM(lo_extendedprice*lo_discount) AS revenue
FROM dfs.tmp.lineorder, dfs.tmp.`date`
WHERE lo_orderdate = d_datekey
AND d_year = 1993
AND lo_discount BETWEEN 1 AND 3
AND lo_quantity < 25
UNION ALL
SELECT SUM(lo_extendedprice * lo_discount) AS revenue
FROM mongo.test.ssb_europe ssb
WHERE ssb.orderdate.d_year = 1993
AND lo_discount BETWEEN 1 AND 3
AND lo_quantity < 25
);

Configuration (drill-override.conf):

On drillbit A (Windows Server 2008):

drill.exec: {
cluster-id: "drillbits1",
zk.connect: "serverA:2181",
impersonation: {
enabled: true,
max_chained_user_hops: 3
}
}

On drillbit B (Cloudera CDH 5.8.0):
drill.exec: {
cluster-id: "drillbits1",
zk.connect: "serverA:2181",
impersonation: {
enabled: true,
max_chained_user_hops: 3
}
}


Best regards,
Ivan
Aman Sinha
2017-06-11 06:32:02 UTC
Permalink
Raw Message
Ivan,
yes, the scans for the various data sources are expected to use locality
information to perform the table scan.

If you only run the query against mongodb (the right side of union-all) and
the foreman is on server B, does it do the table scan on server A which is
hosting the mongodb table ? It is supposed to use locality information, so
if it is not then most likely there is a bug. Does the problem only
occur when the query is accessing multiple data sources including mongo ?

You could enable DEBUG logging level in logback.xml and check the output
in drillbit.log file.. look for entries for MongoGroupScan that should
provide insight into whether the affinity based assignment is working
correctly. Each data source has its own implementation of determining
'affinity'.

You may have already seen this, but just in case..here's the doc page for
the mongo plugin: https://drill.apache.org/blog/2014/11/19/sql-on-mongodb/
I am not familiar with this plugin but hopefully the original authors
(listed on that page) can chime in.

-Aman
Post by Ivan Kovacevic
Greetings,
I have two servers. The first server (A) contains the zookeeper, a mongodb
database and a drillbit. The second server (B) contains a hadoop
distribution with several hive tables, a postgresql database and the other
drillbit. Both drillbits can see eachother on the drill homepages, since
they are both connected to the zookeeper on server A. When a query like the
one on the bottom is run (the view dfs.lineorder consists of the hive and
postgresql data on server B), drillbit B becomes the foreman and executes
everything. It does not delegate the parts dealing with mongodb tables to
the drillbit on server A, so 90% of the time required to run the query
(approx. 30 minutes) is spent on sending the mongodb collection from server
A to server B. Is there a way to force drillbit B to delegate the mongodb
part of the query to drillbit B, or could this be a misconfiguration
problem?
Also, can drillbits be configured to access different databases (e.g. when
each drillbit has access to a distinct subnet, so that not all drillbits
can access the same databases)?
Thank you.
SELECT SUM(revenue) AS revenue
FROM (
SELECT SUM(lo_extendedprice*lo_discount) AS revenue
FROM dfs.tmp.lineorder, dfs.tmp.`date`
WHERE lo_orderdate = d_datekey
AND d_year = 1993
AND lo_discount BETWEEN 1 AND 3
AND lo_quantity < 25
UNION ALL
SELECT SUM(lo_extendedprice * lo_discount) AS revenue
FROM mongo.test.ssb_europe ssb
WHERE ssb.orderdate.d_year = 1993
AND lo_discount BETWEEN 1 AND 3
AND lo_quantity < 25
);
drill.exec: {
cluster-id: "drillbits1",
zk.connect: "serverA:2181",
impersonation: {
enabled: true,
max_chained_user_hops: 3
}
}
drill.exec: {
cluster-id: "drillbits1",
zk.connect: "serverA:2181",
impersonation: {
enabled: true,
max_chained_user_hops: 3
}
}
Best regards,
Ivan
Loading...