2017-06-01 12:17:34 UTC
First of all, I was very happy to at last attend the hangouts meeting, I've
been trying to do so for quite sometime.
I know I confused most of you during the meeting but that's because my
requirements aren't crystal clear at the moment and I'm still learning what
Drill can do. Hopefully I learn enough so I would be confident about the
options I have when I need to make implementation decisions.
Now to the point, and let me restate my case..
We have a proprietary datasource that can perform limits, aggregations,
filters and joins very fast. This datasource can handle SQL queries but not
all possible SQL syntax. I've been successful, so far, to pushdown joins,
filters and limits, but I'm still struggling with aggregates. I've sent an
email about aggregates to Calcite's mailing list.
The amount of data this datasource may be required to process can be
billions of records and 100s of GBs of data. So we are looking forward to
distribute this data among multiple servers to overcome storage limitations
and maximize throughput.
This distribution can be just duplicating the data to maximize throughput,
so each server will have the same set of data, *or* records may be
distributed among different servers, without duplication among these
servers because a single server may not be able to hold all the data. So
some tables may be duplicated and some tables may be distributed among
servers. Let's assume that the distribution details of each table is
available for the plugin.
Now I understand that for Drill to implement a query, it supports a
set of physical
operators <https://drill.apache.org/docs/physical-operators/>. These
operators logic\code is generated at runtime and it's distributed among a
Drill cluster to be compiled and executed.
So to scan a table distributed\duplicated among 3 servers, I may want to
configure Drill to execute *SELECT * FROM TABLE* by running the same query
with an extra filter (to read\scan a specific portion of the table if the
table is duplicated, to maximize throughput) or by running the query
without modifications but having Drill run the query multiple times, once
against each server. I assume this can be done by having the table's
*GroupScan* return 3 different *SubScan*s when the
*GroupScan.getSpecificScan(int)* is called multiple times, with different
parameters of course.
These different parameters can be controlled by the output of
*GroupScan.getMaxParallelizationWidth()*, correct ?
Please correct me if I'm wrong about anything.
Now assuming what I said is correct, I have a couple of questions:
1. If I have multiple *SubScan*s to be executed, will each *SubScan* be
handled by a single *Scan* operator ? So whenever I have *n* *SubScan*s,
I'll have *n* Scan operators distributed among Drill's cluster ?
2. How can I control the amount of any type of physical operators per
Drill cluster or node ? For instance, what if I want to have less
*Filter* operators or more *Scan* operators, how can I do that ?
I'm aware that my distribution goal may not be as simple as I may have made
Pardon me for the huge email and thanks a lot for your time.