I wonder if we should pop the discussion up a level? What goals should Drill have as an Apache project?
Drill is a big data query engine, and shares that description with Impala, Presto, Hive and (to some degree) Spark. Drill's adoption is currently lower than Impala or Spark. What unique use cases can Drill address that are unserved (or under-served) by the Impala and Spark juggernauts?
To grow, Drill must define its sweet spot: what it does better than any other project. Let's identify why organizations might want to use Drill rather than (or in addition to) the better-known alternatives. Answer that, and we enter a virtuous cycle: organizations will adopt Drill because it does things that other tools don't do (or do poorly). Some of those adopters will want to contribute to Drill for new use cases, which will encourage more adoption.
Drill is like many other projects in their early years: one core vendor has graciously contributed the bulk of the code. (Impala, Hadoop, Spark, Kudu and Kafka are other examples.) Naturally, the work of the core team focuses on the specific needs of that vendor's customers. Ideally, Drill would, like those other tools, gain sufficient adoption that many other organizations contribute as well, broadening the set of supported use cases, and entering that virtuous growth cycle, to everyone's benefit.
The core question: what does the community see as gaps in their big data stacks that Drill can serve?
On Tuesday, August 14, 2018, 7:08:51 AM PDT, Arina Yelchiyeva <***@gmail.com> wrote:
1. Regarding the Drill metastore, it's under investigation; please follow up
with DRILL-6552.
2. UDFs: I would not say, it's that quit to write UDFs in Drill.
have good manuals. Regarding adding support for different languages like
heavily relies on Java source code when during UDFs initialization. Though
for UDFs.
3. Drill vs Arrow is the topic I heard since I have started working with
Drill. But so far nobody dared to tackle it. I would suspect Drill first
be a show-stopper if Arrow community does not accept them.
I'd like to weigh in here as well. As a long-time user of Drill, I really
would like to see more people using it and I think there are a few key
aspects that could really help on that front.
The first of which is the Arrow integration. I'm not enough of a software
engineer to understand all the internal details here, but as I understand
it, the promise of Arrow is that many tools will share a common memory
model and that it will be possible to transfer data from one tool to the
other without having to serialize/deserialize the data. In the data
science community many of the major platforms, Python-pandas, R, and Spark
are moving or have adopted Arrow.
Drill's strength is the ease with which it can query many different data sources
and if Drill were to adopt Arrow, I suspect that many people would adopt it
as a part of a machine learning pipeline. Just recently, I attempted to do
some data manipulation using Spark, and couldn't help but notice how
difficult it was in contrast with Drill. I'm sure this is a very complex
task, but I do think that it could be worth it in the end.
Secondly, I'd like to second Paul's call to simplify the interfaces for
UDFs, Format and ideally storage plugins. A core strength of Drill is its
extensibility and making it easier would be a great thing. I was wondering
whether it would be possible or even a good idea, to enable users to write
UDFs in a scripting language such as Python.
Thirdly,
your work to build a storage plugin for ElasticSearch is really great and I
think more capabilities like that are really needed. I'd like to see a
generic HTTP storage plugin and a storage plugin for Google Sheets. If I can
figure out how storage plugins work, I'll gladly work on some of these.
Just my .02.
- C
Post by Paul Rogers
Hi Arina,
Another topic would be whether/how to round out Drill's data model.
Drill's scalar and nullable types are pretty solid. Great work was done
recently for Decimal (though the old types still remain.) Good support is
now available for nested types to do implicit joins to produce SQL-friendly
flat records.
But opportunities for improvement still remain. Date/Time has timezone
issues. Union, List and Repeated List never quite worked. There are a few
types identified in the code, but not implemented (dates with TZ, tiny
ints, etc.) How should Drill bridge the gap between arrays and maps (really,
structs) on the one hand, and plain-old-relational ODBC/JDBC/BI tools on
the other?
either keep a type and make it fully work if it has holes, or drop it.
Unions and Lists are the messiest. They are incomplete in part, because
they are trying to do the impossible: to predict the future well enough
that Drill can handle columns with varying or ambiguous data types (that
is, to handle schema changes.) Is there a better way to handle this issue
(such as with metadata hints)? That is, rather than fight with conflicting
types at run time, simply declare the common type in metadata so all
operators and record batches agree on the type.
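To make the metadata-hint idea concrete, here is a minimal sketch in Python. It is not Drill code: `COLUMN_HINTS` and `apply_hints` are hypothetical names invented for illustration, and the assumption is that a hint simply names the one type every batch should use for a column.

```python
from typing import Any, Callable, Dict, List

# Hypothetical metadata hints: column name -> declared common type.
# In this sketch a "type" is just a Python callable that casts a value.
COLUMN_HINTS: Dict[str, Callable[[Any], Any]] = {
    "price": float,
    "zip": str,
}

Record = Dict[str, Any]


def apply_hints(batch: List[Record],
                hints: Dict[str, Callable[[Any], Any]]) -> List[Record]:
    """Cast each hinted column to its declared type; leave other columns alone."""
    coerced = []
    for record in batch:
        row = {}
        for name, value in record.items():
            cast = hints.get(name)
            # Nulls pass through unchanged; unhinted columns keep their type.
            row[name] = cast(value) if cast is not None and value is not None else value
        coerced.append(row)
    return coerced


# Two batches whose raw types disagree (int vs. str for the same columns).
batch_1 = [{"price": 10, "zip": 94105}]
batch_2 = [{"price": "12.5", "zip": "02139"}]

result = apply_hints(batch_1, COLUMN_HINTS) + apply_hints(batch_2, COLUMN_HINTS)
```

With the hints applied at read time, both batches carry `price` as a float and `zip` as a string, so nothing downstream has to reconcile conflicting types.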
And, of course, there is the lingering issue of Drill vectors vs. Arrow.
Arrow did great work in metadata, but seems to have kept some of the
awkward aspects of Drill's original memory model (lack of control over
batch sizes, ability to fragment memory.) Might there be a resyncing of the
two projects: Drill picks up Arrow's metadata and APIs, Arrow picks up
Drill's memory improvements, such as the size-limiting "result set loader"
framework?
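As a rough illustration of what "size-limiting" means here, the sketch below closes a batch once adding another row would exceed a byte budget. It is written in Python under simplifying assumptions and is not Drill's result set loader: `size_limited_batches` and `max_batch_bytes` are hypothetical names, and rows are plain byte strings rather than vectors.

```python
from typing import Iterable, Iterator, List


def size_limited_batches(rows: Iterable[bytes],
                         max_batch_bytes: int) -> Iterator[List[bytes]]:
    """Group rows into batches whose total size stays within max_batch_bytes.

    A single row larger than the limit still gets a batch of its own,
    since this sketch does not split rows.
    """
    batch: List[bytes] = []
    batch_bytes = 0
    for row in rows:
        row_bytes = len(row)
        # Close the current batch before it would overflow the budget.
        if batch and batch_bytes + row_bytes > max_batch_bytes:
            yield batch
            batch, batch_bytes = [], 0
        batch.append(row)
        batch_bytes += row_bytes
    if batch:
        yield batch


rows = [b"aaaa", b"bbbb", b"cc", b"dddddd", b"e"]
batches = list(size_limited_batches(rows, max_batch_bytes=8))
```

The point of the sketch is only that the writer, not the reader, decides when a batch is full, which is what gives control over batch sizes.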
Big-picture issues such as this tend to get lost in the 2270 open Jira
tickets. How might the project create some "theme" tickets (or Wiki pages
or whatever) to help pull the main issues out of the wealth of detail in
Jira?
Thanks,
- Paul
  On Monday, August 13, 2018, 11:07:39 AM PDT, Paul Rogers <
Hi Arina,
Thanks for launching this discussion. A few minor suggestions.
The developers have done a fantastic job stabilizing and improving
Drill's core functionality. Now the opportunity is to expand the use cases
for Drill so that it gets wider adoption within the community. Drill
competes for mindshare with Impala, Presto, Hive, Spark and others. A key
differentiator for Drill can be the ability to extend the core and
integrate Drill into user applications. Of these tools, only Spark has a
fully extensible model. Can Drill provide some of the flexibility that has
powered Spark to success?
1. You mentioned the metastore is under active investigation. Anything
yet to share? Didn't see any activity on the JIRA ticket. Metadata is a key
gap in Drill. Simply adding a Hive-like metastore would repeat the very
errors that Drill was meant to address. Maybe we can toss around ideas for
a metadata API that provides greater flexibility.
2. Users can extend the core with custom UDFs, storage engines, formats
and so on. At present, the code to do this is rather hard to write, debug
and maintain. Is there value in streamlining those interfaces so that a
wider audience can extend Drill for their specific needs?
3. Similarly, we've seen interest in integrating Drill with other
systems, which suggests an opportunity for improved APIs. Ability to
associate options, defaults and restrictions with users. Ability to use the
REST API for larger data sets and with stateful session options. And so on.
Such extensions are best guided by user demands: what can Drill provide
for production applications to enable simpler/faster/more complete
integration?
Thanks,
- Paul
  On Monday, August 13, 2018, 5:42:08 AM PDT, Arina Ielchiieva <
Hi all,
As a new PMC Chair, I would like to thank users for choosing and using
Apache Drill, and contributors / committers for making improvements and
fixes. Recently, Apache Drill 1.14 was released, bundled with many
improvements and new features. Please feel free to try it out and share
your experience. As always, we would love to hear your success stories of
using Apache Drill.
Also, I encourage users to share any problems found in Drill, as well as
any suggestions for future improvements. Feel free to start a discussion on
the mailing list and then file a Jira with the summary. Contributions are
always welcome: minor, major, doc improvements or grammar fixes. Just file
a Jira and open the PR. Do not hesitate to ping developers on the mailing
list if a PR is not being reviewed in a timely manner.
The Apache Drill project has a healthy release schedule, and each release
includes lots of features.
Users on the mailing lists (user / dev), as well as on Stack Overflow and
Twitter, are getting substantial support from the active developers.
New committers are added on a steady basis.
Overall, the project is growing and moving forward. There were discussions
about Drill 2.0 last year, and currently the Drill metastore feature is under
active investigation, which might be the breaking change for 2.0.
Please feel free to reply to this email with your comments / concerns /
ideas about current project state.
Kind regards,
Arina