Hi Lokendra,
Your use case is a classic old-school sharded-DB application. The design itself is fine. However, as Tim noted, Drill is not designed for this case. Still, perhaps Drill could be extended.
As Tim suggested, Drill assumes any Drillbit can operate in any role. So, in your setup, you would run Drillbits on all your shard storage nodes. Drill would schedule reads on those nodes (more on this shortly), then shuffle data to other nodes to perform query operations.
In this model, one of your nodes would act as the Foreman for each user. ZooKeeper (ZK) tracks all nodes, and each user randomly chooses a Drillbit to act as Foreman, which means Foreman load is shared across all your Drillbits.
Suppose you wanted to change this. You'd have to extend the way Drillbits register themselves in ZK: each Drillbit, when it starts, would be assigned one or more roles, which it would advertise in ZK. The distribution mechanisms in the planner would then have to be aware of scan-only nodes, compute-only nodes, and Foreman-only nodes.
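To make the idea concrete, here is a minimal sketch of what role-aware node selection might look like. This is not actual Drill code; the Role enum, the Node record, and the filtering helper are all hypothetical, standing in for whatever the real planner and ZK registry would do:

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class RoleFilterSketch {
    // Hypothetical roles a Drillbit could be assigned at startup and
    // advertise in ZK alongside its endpoint information.
    enum Role { SCAN, COMPUTE, FOREMAN }

    // Hypothetical node record; real Drill registers endpoint protos in ZK.
    record Node(String host, Set<Role> roles) {}

    // Planner-side helper: keep only the nodes advertising a required role,
    // e.g. when choosing where to place leaf scan fragments.
    static List<Node> nodesWithRole(List<Node> cluster, Role required) {
        return cluster.stream()
                .filter(n -> n.roles().contains(required))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Node> cluster = List.of(
                new Node("shard-1", EnumSet.of(Role.SCAN)),
                new Node("shard-2", EnumSet.of(Role.SCAN)),
                new Node("worker-1", EnumSet.of(Role.COMPUTE, Role.FOREMAN)));

        // Scan fragments would go only to the two shard nodes,
        // and only worker-1 could be chosen as Foreman.
        System.out.println(nodesWithRole(cluster, Role.SCAN).size());    // prints 2
        System.out.println(nodesWithRole(cluster, Role.FOREMAN).size()); // prints 1
    }
}
```

The point of the sketch is just that the registry entry carries a role set, and every placement decision in the planner consults it instead of assuming all nodes are interchangeable.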
Unless you plan to put heavy load on your scan nodes, it is not clear what benefit you'd gain from forcing Drill into a particular distribution model.
Perhaps you could start by running Drill on just your storage nodes and measure the performance.
One final point. Drill today knows how to use HDFS to work out data locality for scans. You'd need to modify this to plug in your own data-distribution mechanism so that Drill knows which shards to scan on which nodes. I don't believe Drill has a plugin API for this, but I could be wrong. If not, this would be a great opportunity to define such an API.
Such an API might be helpful for other storage plugins such as Kafka so that scans are done on nodes with data.
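For illustration only, here is a hedged sketch of what such a locality API might look like. The ShardLocalityProvider interface and the assignment helper are invented names, not part of Drill; the real API would have to fit Drill's existing scan and affinity machinery:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LocalitySketch {
    // Hypothetical SPI: a storage plugin reports which hosts hold each
    // shard, so the planner can place leaf scan fragments next to the data.
    interface ShardLocalityProvider {
        // shard id -> hosts holding a replica of that shard
        Map<String, List<String>> shardHosts();
    }

    // Planner-side use: assign each shard's scan to its first advertised
    // host (a real planner would also balance load across replicas).
    static Map<String, String> assignScans(ShardLocalityProvider provider) {
        Map<String, String> assignment = new TreeMap<>();
        provider.shardHosts().forEach(
                (shard, hosts) -> assignment.put(shard, hosts.get(0)));
        return assignment;
    }

    public static void main(String[] args) {
        // A sharded-DB plugin would implement the provider from its own
        // shard map; here a lambda stands in for that implementation.
        ShardLocalityProvider provider = () -> Map.of(
                "shard-1", List.of("node-a"),
                "shard-2", List.of("node-b"));
        System.out.println(assignScans(provider)); // prints {shard-1=node-a, shard-2=node-b}
    }
}
```

A Kafka plugin could implement the same interface by mapping topic partitions to broker hosts, which is why a generic API seems worth defining.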
Thanks,
- Paul
On Tuesday, November 13, 2018, 5:32:32 PM PST, Lokendra Singh Panwar <***@gmail.com> wrote:
Hi Tim,
Thanks for the reply.
My use case is the following:
 - My main DB table is huge, so it is sharded among multiple
 storage nodes.
 - Each storage node stores its assigned shard in a local relational
 DB engine.
I was planning to use Drill as a distributed query engine that can
scatter-gather data from these storage-nodes.
So, my overall plan for such an architecture, as per my limited understanding
of Drill so far, is:
 - Have a Drillbit instance run on each storage node; this fleet will
 act as the leaf-worker fleet.
 - (I will write a Storage Plugin to transform data from my local
   relational DB engine into the Drill record format.)
 - Maintain another fleet that will serve as the Foreman and intermediate
 query workers, still part of the same Drill cluster.
 - The reason I intend to keep the leaf-query fleet (storage nodes)
 segregated from the Foreman/intermediate workers (which work on major
 fragments) is:
   - Storage nodes (acting as leaf workers) are a premium commodity in
   my cluster, involved in data ingestion as well as serving query
   traffic as leaf workers.
   - So, I do not intend to overload them further with the intermediate
   query-fragment processing and aggregation that the Foreman and
   intermediate pool of workers perform.
Does the above make sense?
Thanks,
Lokendra
Post by Timothy Farkas
Hi Lokendra,
All Drillbits can function as a Foreman if a query is sent to them, and all
Drillbits are considered worker nodes. This is ingrained deeply in the
design of Drill, and it was done with the intention of making Drill
symmetric. Symmetric here means that each Drillbit is identical to all the
others. Making this change would be a significant design change.
Why are you interested in running Drill in this way? Do you have a specific
use case in mind?
Thanks,
Tim
On Tue, Nov 13, 2018 at 3:37 PM Lokendra Singh Panwar <
Post by Lokendra Singh Panwar
Hi,
Is it possible to configure Drill such that the Foreman and leaf-worker
fleets are separate fleets of nodes?
Or, if this requires changing the Drill source, any pointers are
appreciated.