Discussion: Drill and Tableau
Divya Gehlot
2017-07-25 07:54:19 UTC
Hi,
As a naive user, I would like to know the benefits of using Apache Drill
with Tableau.

As per my understanding, to visualize the data we need to push it into
Tableau for granular visualization.

I would like to understand a few features of Drill in terms of visualization
or data retrieval:
1. Metadata caching
2. Partitioning
3. Generation of Parquet files
4. Custom column calculations


Thanks,
Divya
Jinfeng Ni
2017-07-25 18:59:36 UTC
Some of the questions you have depend on which type of storage you are
querying with Drill. If you are querying a Hive table, Drill uses the Hive
metastore API to get the metadata and internally maintains a cache to speed
this up [1]. If you are querying Parquet files through the file system
plugin, Drill has the concept of a Parquet metadata cache [2]. If you are
querying other datasets, AFAIK, Drill does not maintain a metadata cache.
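
For the Parquet case, the metadata cache file is generated with the REFRESH
TABLE METADATA command. A minimal sketch, assuming a dfs-based workspace and
a table directory named `orders` (both are just placeholders):

  -- build or refresh the Parquet metadata cache for a table directory
  REFRESH TABLE METADATA dfs.tmp.`orders`;

  -- later queries on that directory can plan from the cached metadata
  -- instead of reading the footer of every Parquet file
  SELECT COUNT(*) FROM dfs.tmp.`orders` WHERE order_date = '2017-07-01';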

Drill's CTAS allows users to create partitioned data [3]. You can also have
your data created as a partitioned table by other tools such as Hive, and
Drill will recognize the Hive partitions from the Hive metastore.

Parquet files can be created either by Drill's CTAS or by other tools.
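
Putting the two together, a rough sketch of creating partitioned Parquet
data with CTAS (the workspace, table, and column names below are only
examples):

  -- have CTAS write Parquet (this is typically the default already)
  ALTER SESSION SET `store.format` = 'parquet';

  -- create a table partitioned by month; the partition column must be
  -- part of the SELECT list
  CREATE TABLE dfs.tmp.`sales_by_month`
  PARTITION BY (sale_month)
  AS SELECT sale_month, country, amount
     FROM dfs.`/data/raw/sales`;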

I'm not clear what you mean in the fourth question, though.


1. https://drill.apache.org/docs/hive-metadata-caching/
2. https://drill.apache.org/docs/optimizing-parquet-metadata-reading/
3. https://drill.apache.org/docs/how-to-partition-data/
Saurabh Mahapatra
2017-07-25 19:21:02 UTC
Hi Divya,

There is no such thing as a naive question. Please feel free to post any
questions you have; there is always someone in the community who will help
you out.

This is my opinion:
There are a variety of BI tools on the market that offer excellent
visualization and data-interaction capabilities: Tableau, MicroStrategy, and
Qlik, to name a few. These are tools built by companies, but you could also
build your own web app that is highly customized for your users. The need
for such tools has arisen from the needs of the end BI (business
intelligence) user, who does not have the time or patience to type SQL
queries. If you are slicing and dicing data while following your intuition,
imagine having to rewrite the SQL queries each time and ensure that they
work syntactically.

That is a lot to ask of the average user who wants to look at the data in
different ways and make a decision that hopefully results in some action
(and not just PowerPoint slides). The latter is more important than anything
else inside a company.

Drill provides the SQL query layer for interacting with the underlying data
scattered across various data sources.

So before you jump in and standardize on any BI tool, ask yourself: who are
the business users, what are their needs in terms of decision making, and
what kind of workflows do they envision? Then work backwards to find the
tool that best meets those needs. Be open to the idea of building your own
web app if you believe that will benefit your users in the long term.

As for the other questions, these are related to data retrieval (efficiency,
which translates into performance). I will tell you what I know, and others
can chime in with better info:

1. Metadata cache: only for Parquet files. The idea here is to store the
metadata associated with each file's Parquet row groups in a separate file,
so that you avoid having to open and close every Parquet file to get that
info. The metadata includes basic statistics such as mins and maxes, so you
can skip row groups or files that cannot match your filter condition. This
idea of storing metadata is not new; other query engines do it as well.

Read more here:
https://drill.apache.org/docs/optimizing-parquet-metadata-reading/

2. Partitioning: Drill is "directory aware". This is the age-old concept of
partitioning your data so that Drill can skip directories that do not
satisfy the filter condition; the layout and structure of the data itself
helps Drill. Partitioning schemes depend on query patterns. One rule of
thumb I use is to look at the BI users and observe their workflows. If they
use a time range as the basis of every analysis, then I will partition by
time (say, month). If I know that the likelihood of filtering by location is
60%, then I will create a nested directory structure with time (month) at
the top of the hierarchy and location just below it.

Read more here:
https://drill.apache.org/docs/partition-pruning-introduction/
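
As a rough illustration of the "directory aware" part: assuming data laid
out as /data/sales/<month>/<country>/*.parquet (the paths and column names
are made up for this example), the implicit dir0/dir1 columns let Drill
prune whole directories that fall outside the filter:

  -- dir0 = first directory level (month), dir1 = second level (country)
  SELECT dir0 AS sale_month, dir1 AS country, SUM(amount) AS total
  FROM dfs.`/data/sales`
  WHERE dir0 = '2017-06' AND dir1 = 'SG'
  GROUP BY dir0, dir1;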

3. Generation of Parquet files:

https://drill.apache.org/docs/parquet-format/

Please pay attention to how you configure the writer:
https://drill.apache.org/docs/parquet-format/#configuring-the-parquet-storage-format
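
A couple of the writer knobs you are likely to touch, shown as a sketch (the
values below are only examples, not recommendations):

  -- target row group size in bytes for files written by CTAS
  ALTER SESSION SET `store.parquet.block-size` = 268435456;

  -- compression codec for files written by CTAS
  ALTER SESSION SET `store.parquet.compression` = 'snappy';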

4. Custom column calculations:
There is some out-of-the-box support for this here:
https://drill.apache.org/docs/sql-window-functions-introduction/

You should also be aware of the limitations around nested data:
https://drill.apache.org/docs/nested-data-limitations/

and of course there are UDFs:
https://drill.apache.org/docs/adding-custom-functions-to-drill-introduction/
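
To make the custom column idea concrete, here is a hedged sketch in plain
SQL (the table and column names are invented): a derived column plus a
window function, both computed by Drill at query time rather than inside the
BI tool:

  SELECT order_id,
         amount,
         amount * 0.07 AS tax,                                   -- derived column
         AVG(amount) OVER (PARTITION BY country) AS avg_amount,  -- window function
         amount - AVG(amount) OVER (PARTITION BY country) AS diff_from_avg
  FROM dfs.tmp.`orders`;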

Please let us know if you have any additional questions.

Happy drilling :)

Saurabh
Divya Gehlot
2017-07-31 02:15:47 UTC
Hi Saurabh,

You wrote: "If I know that the likelihood of filtering by location is 60%,
then I will create a nested directory structure with time (month) at the top
of the hierarchy and location just below it."

Do you mean we first have to create the partition directories by year and
month, and then create a nested directory structure underneath? Something
like shown below:

/path/to/directory
    /country1
        datafiles.parquet
    /country2
        datafiles.parquet

Thanks,
Divya
Saurabh Mahapatra
2017-07-31 23:24:05 UTC
Hi Divya,

There are a few factors that play into how much partition pruning helps:
(1) The access pattern, based on the frequency with which dimensions (and
how many of them) are accessed.
(2) The data distribution across the dimensions listed in (1). If there is
skew, partition pruning can be of no use.
(3) The cardinality of the dimensions listed in (1).
(4) The partition directory hierarchy, i.e. the ordering.

In your case, it would be enough to start with the directory structure you
listed. However, I recommend also reordering the directory structure (put
location on top and time below it) and testing both. Data layouts need to be
experimented with.
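
For concreteness, the two layouts being compared might look like this (the
/data/sales path and the month/country values are just placeholders):

  Layout A, time on top:

  /data/sales
      /2017-06
          /SG
              datafiles.parquet
          /MY
              datafiles.parquet
      /2017-07
          /SG
              ...

  Layout B, location on top:

  /data/sales
      /SG
          /2017-06
              datafiles.parquet
          /2017-07
              datafiles.parquet
      /MY
          ...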

Instead of hand-waving, I decided to do a mental exercise to show you how I
think about these problems.

Let us do a thought exercise and make the following assumptions. Assume the
simplest possible query engine, one that returns all the rows matching a
filter condition. It does no optimization on filters; it just scans the
records to see whether the filter condition is met or not. Also assume that
you have limited cluster resources to scan these records (it does not matter
how many ideally parallel machines you have, the resources are not
infinite).

(A) Probability of dimensions appearing in a query (assume they are
independent; even with dependencies you could work this out using Bayesian
inference)
1. Probability of time dimension in query: 90%
2. Probability of location dimension in query: 60%

(B) Number of records and baseline
Let us assume the dataset holds 1,000,000,000 records (1 billion). Without
partition pruning, you would be scanning 1 billion records regardless of the
filter condition. The data could be stored in any order, so the worst-case
measurement makes sense.

(C) Assume the cardinality of the directory partitions (or the cardinality
of the dimensions themselves)
Time dimension: 1000
Location dimension: 10

(D) Assume that there is no data skew and the records are uniformly
distributed. The 1 billion records are then spread evenly across the time
partitions: 1,000,000,000/1000 = 1 million records per time partition. And
within each time partition, you have 1,000,000/10 = 100,000 records in each
location partition.

If the filter predicate contains a maximum of 10 conditions on a subset of
the time dimension, the worst-case cost of access is 10*100,000 = 1 million
records, i.e. you will access 10 folders, and that should be multiplied by
the number of records.

If there is a second filter predicate that contains a maximum of 5
conditions on the location, the worst-case cost of access is reduced further
to 5*100,000 records.

Any query coming into the system has four possibilities with respect to
what is there in the filter predicate:
1. No time, no location dimension. The probability of this happening is
(1-0.9)*(1-0.6)=0.1*0.4=0.04.

Worst case cost of access is 0.04*1,000,000,000=40,000,000=40 million.

2. Only time, no location dimension: The probability of this happening is
0.9*0.4=0.36
Worst case cost of access is 0.36* 1 million=360,000 records=0.36 million

3. Only location, no time dimension: The probability of this happening is
0.6*0.1=0.06
Worst case cost of access is 0.06*5*100,000*1000 (for all time folders)=30
million

4. Both location and time dimension present: The probability of this
happening is 0.54. This is important because this tells you that you have
just over 50% chance of these dimensions present in the filter.

Worst case cost of access is: 0.54*10*5*100,000=2,700,000

Add them all up and you have a worst-case cost of access of 40 million +
0.36 million + 30 million + 2.7 million = 73.6 million. So partition pruning
based on this access pattern has reduced the amount of scanning by roughly
92% (about a 10x improvement).

Now, let us consider the second case when you change the directory
structure to have location partition at the top and time dimension just
below it.

There are 10 location directories, each containing 1 billion/10 = 100
million records. Within each location directory, there are 100 time folders,
so each holds 100,000,000/100 = 1 million records.

Any query coming into the system has four possibilities with respect to
what is there in the filter predicate:
1. No time, no location dimension. The probability of this happening is
(1-0.9)*(1-0.6)=0.1*0.4=0.04.

Worst case cost of access is 0.04*1,000,000,000=40,000,000=40 million.

2. Only time, no location dimension: The probability of this happening is
0.9*0.4=0.36
Worst case cost of access is 0.36*10*10*1 million=36 million

3. Only location, no time dimension: The probability of this happening is
0.6*0.1=0.06
Worst case cost of access is 0.06*5*100 million=30 million

4. Both location and time dimension present: The probability of this
happening is 0.54. This is important because this tells you that you have
just over 50% chance of these dimensions present in the filter.

Worst case cost of access is: 0.54* 5*10*100,000=2,700,000. No change here.

Add them all up and you have a worst-case cost of access of 40 million +
36 million + 30 million + 2.7 million = 108 million. So partition pruning
based on this access pattern still reduces the amount of scanning by roughly
89% (close to a 10x improvement). The benefit is a little smaller because of
the change in the directory structure.

Thanks,
Saurabh