Hi Lokendra,
Your use case is a classic old-school sharded-DB application. The design itself is fine. However, as Tim noted, Drill is not designed for this case. Still, perhaps Drill could be extended.
As Tim suggested, Drill assumes any Drillbit can operate in any role. So, in your setup, you would run Drillbits on all your shard storage nodes. Drill would schedule reads on those nodes (more on this shortly), then shuffle data to other nodes to perform query operations.
In this model, one of your nodes would act as the Foreman for each user. ZooKeeper (ZK) tracks all nodes, and each user randomly chooses a Drillbit to act as Foreman, which means Foreman load is shared across all your Drillbits.
Suppose you wanted to change this. You'd have to extend the way Drillbits register themselves in ZK: each Drillbit, when it starts, would be assigned one or more roles, which it would advertise in ZK. The distribution mechanisms in the planner would then have to be aware of scan-only nodes, compute-only nodes, and Foreman-only nodes.
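To make the idea concrete, here is a minimal sketch of what role-aware node selection might look like. This is not actual Drill code; the Role enum, the Node record, and the filtering helper are all hypothetical, standing in for whatever the real planner and ZK registry would do:

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class RoleFilterSketch {
    // Hypothetical roles a Drillbit could be assigned at startup and
    // advertise in ZK alongside its endpoint information.
    enum Role { SCAN, COMPUTE, FOREMAN }

    // Hypothetical node record; real Drill registers endpoint protos in ZK.
    record Node(String host, Set<Role> roles) {}

    // Planner-side helper: keep only the nodes advertising a required role,
    // e.g. when choosing where to place leaf scan fragments.
    static List<Node> nodesWithRole(List<Node> cluster, Role required) {
        return cluster.stream()
                .filter(n -> n.roles().contains(required))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Node> cluster = List.of(
                new Node("shard-1", EnumSet.of(Role.SCAN)),
                new Node("shard-2", EnumSet.of(Role.SCAN)),
                new Node("worker-1", EnumSet.of(Role.COMPUTE, Role.FOREMAN)));

        // Scan fragments would go only to the two shard nodes,
        // and only worker-1 could be chosen as Foreman.
        System.out.println(nodesWithRole(cluster, Role.SCAN).size());    // prints 2
        System.out.println(nodesWithRole(cluster, Role.FOREMAN).size()); // prints 1
    }
}
```

The point of the sketch is just that the registry entry carries a role set, and every placement decision in the planner consults it instead of assuming all nodes are interchangeable.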
Unless you plan to put heavy load on your scan nodes, it is not clear what benefit you'd gain from forcing Drill into a particular distribution model.
Perhaps you could start by running Drill on just your storage nodes and measure the performance.
One final point. Drill today knows how to use HDFS to work out data locality for scans. You'd need to modify this to plug in your own data-distribution mechanism so that Drill knows which shards to scan on which nodes. I don't believe Drill has a plugin API for this, but I could be wrong. If not, this would be a great opportunity to define such an API.
Such an API might be helpful for other storage plugins such as Kafka so that scans are done on nodes with data.
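For illustration only, here is a hedged sketch of what such a locality API might look like. The ShardLocalityProvider interface and the assignment helper are invented names, not part of Drill; the real API would have to fit Drill's existing scan and affinity machinery:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LocalitySketch {
    // Hypothetical SPI: a storage plugin reports which hosts hold each
    // shard, so the planner can place leaf scan fragments next to the data.
    interface ShardLocalityProvider {
        // shard id -> hosts holding a replica of that shard
        Map<String, List<String>> shardHosts();
    }

    // Planner-side use: assign each shard's scan to its first advertised
    // host (a real planner would also balance load across replicas).
    static Map<String, String> assignScans(ShardLocalityProvider provider) {
        Map<String, String> assignment = new TreeMap<>();
        provider.shardHosts().forEach(
                (shard, hosts) -> assignment.put(shard, hosts.get(0)));
        return assignment;
    }

    public static void main(String[] args) {
        // A sharded-DB plugin would implement the provider from its own
        // shard map; here a lambda stands in for that implementation.
        ShardLocalityProvider provider = () -> Map.of(
                "shard-1", List.of("node-a"),
                "shard-2", List.of("node-b"));
        System.out.println(assignScans(provider)); // prints {shard-1=node-a, shard-2=node-b}
    }
}
```

A Kafka plugin could implement the same interface by mapping topic partitions to broker hosts, which is why a generic API seems worth defining.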
Thanks,
- Paul
On Tuesday, November 13, 2018, 5:32:32 PM PST, Lokendra Singh Panwar <***@gmail.com> wrote:
Hi Tim,
Thanks for the reply.
My use case is the following:
 - My main DB table is huge, so it is sharded among multiple
 storage nodes.
 - Each storage node stores its assigned shard in a local relational
 DB engine.
I was planning to use Drill as a distributed query engine that can
scatter-gather data from these storage-nodes.
So, my overall plan for such an architecture, as per my limited understanding
of Drill so far, is:
 - Have a Drillbit instance run on each storage node; this fleet will
 act as the leaf-worker fleet.
 - (I will write a Storage Plugin to transform data from my local
   relational DB engine into the Drill record format.)
 - Maintain another fleet that will serve as the Foreman and intermediate
 query workers, still part of the same Drill cluster.
 - The reason I intend to keep the leaf-query fleet (storage nodes)
 segregated from the Foreman/intermediate workers (which work on major
 fragments) is:
   - Storage nodes (acting as leaf workers) are a premium commodity in
   my cluster, involved in data ingestion as well as serving query
   traffic as leaf workers.
   - So, I do not intend to overload them further with the intermediate
   query-fragment processing and aggregation that the Foreman and
   intermediate pool of workers perform.
Does the above make sense?
Thanks,
Lokendra
Post by Timothy Farkas
Hi Lokendra,
All Drillbits can function as a Foreman if a query is sent to them, and all
Drillbits are considered worker nodes. This is ingrained deeply in the
design of Drill, and it was done with the intention of making Drill
symmetric. Symmetric here means that each Drillbit is identical to all the
others. Making this change would be a significant design change.
Why are you interested in running Drill in this way? Do you have a specific
use case in mind?
Thanks,
Tim
On Tue, Nov 13, 2018 at 3:37 PM Lokendra Singh Panwar <
Post by Lokendra Singh Panwar
Hi,
Is it possible to configure Drill such that the Foreman and leaf-worker
fleets are separate fleets of nodes?
Or, if this requires changing the Drill source, any pointers are
appreciated.