Discussion:
querying from multiple directories in S3
Wesley Chow
2017-05-09 14:06:48 UTC
What is the recommended way to issue a query against a large number of
tables in S3? At the moment I'm aliasing the table as a giant UNION ALL,
but is there a better way to do this?

Our data is stored as a time hierarchy, like YYYY/MM/DD/HH/MM in UTC, but
unfortunately I can't simply run the query recursively on an entire day of
data. I usually need a day of data in a non-UTC time zone. Is there some
elegant way to grab that data using the dir0, dir1 magic columns?

Thanks,
Wes
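
For reference, the "giant UNION ALL" workaround described above might look something like the sketch below (the `s3.root` workspace name and the exact paths are hypothetical; one SELECT per directory to be covered):

```sql
-- Hypothetical workspace s3.root pointing at the bucket root;
-- each subquery reads one directory of CSV files.
SELECT * FROM s3.`root/2017/03/16/23/00`
UNION ALL
SELECT * FROM s3.`root/2017/03/16/23/30`
UNION ALL
SELECT * FROM s3.`root/2017/03/17/00/00`;
```

The query text grows linearly with the number of directories covered, which is why the question asks for a better way.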
Abhishek Girish
2017-05-10 05:56:30 UTC
Can you share more details of how the data is structured within the S3
bucket, using some examples? Also some representative queries of what you
are currently doing and what you hope was possible to do instead? I'm not
clear on what your question is.

The Drill special attributes (filename, dir0, dir1, ...) do work for data
within the S3 storage plugin.
Wesley Chow
2017-05-10 17:27:46 UTC
Yep, sure. So the directory structure looks something like:

root/2017/03/17/00/00/{300 - 400 CSV files}
root/2017/03/17/00/30/{300 - 400 CSV files}
root/2017/03/17/01/00/{300 - 400 CSV files}
...

And there's perhaps a year's worth of files, only increasing over time. Our
queries tend to run over no more than a day's worth of data, usually 2 - 6
hours. Right now these files import into Redshift, where we run our queries,
but what I'm working on is assessing the feasibility of querying directly
against the CSV files. Our queries run every hour, or a bit more frequently
than that, so if we're looking back over a span of multiple hours we need
to be able to query across day/month boundaries.

The files contain time series data, so each row corresponds to an event
with multiple dimensions and metrics. Our queries tend to be rollups where
we group by some subset of dimensions and aggregate up the metrics. So a
basic example might be, every hour I want the top ten counts grouped by
some dimension. The basic structure of the query would be:

SELECT d, count(1) AS c FROM ... GROUP BY d ORDER BY c DESC LIMIT 10;

The question I'm asking is: what goes into the ... above, or should I use
the WHERE clauses suggested by Charles Givre in a different answer?

For performance reasons, we probably want to convert these into Parquet,
so I'm also looking at whether the table metadata feature would make this
easier. In that case, I'd put the time of the event in the rows and query
against the root directory with WHERE clauses that restrict to my query's
time range. It seems, though, that refreshing the metadata table may be
prohibitively expensive: in my test, a refresh of just a day's worth of
data takes about 5 minutes. I don't quite understand how the refresh works
yet (the documentation isn't very precise on this). I'm still figuring
this out.

Wes
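
A minimal sketch of the Parquet conversion and metadata refresh being considered here, assuming a writable workspace `s3.tmp` and the directory layout above (all names hypothetical):

```sql
-- Convert one hour-directory of CSV into Parquet via CTAS
CREATE TABLE s3.tmp.`events_parquet/2017/03/17/00` AS
SELECT * FROM s3.`root/2017/03/17/00`;

-- Build/refresh the Parquet metadata cache for the whole tree,
-- so planning doesn't have to open every file footer on each query
REFRESH TABLE METADATA s3.tmp.`events_parquet`;
```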
Charles Givre
2017-05-10 12:15:09 UTC
Hi Wes,
Are you putting the dirX fields in the WHERE clause?
I.e., couldn't you do something like:

SELECT <fields>
FROM s3.data
WHERE (dir2 = 15 AND dir3 < 20) AND (dir2 = 14 AND dir3 > 4)

In theory this could work for UTC -4. It’s ugly… but I think it would work.
— C
Wesley Chow
2017-05-10 16:04:12 UTC
I don't think so: since the two conditions are joined by AND, wouldn't
dir2 = 15 AND dir2 = 14 always be false?

Even if there is some comparison that works, isn't there still an issue
that the S3 file source has to scan all the files on every query?

Wes
Chunhui Shi
2017-05-10 16:15:08 UTC
I think what Charles meant was "WHERE (dir2 = 15 AND dir3 < 20) OR (dir2 = 14 AND dir3 > 4)", and of course you need to add dir0 and dir1 for year and month.


And what do you mean by "scan all the files on every query"? Scanning all the files for one day of data was your purpose, I thought?
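
Spelled out with the year and month directories included, the corrected predicate for one local day in UTC-4 (2017-03-16 local; values illustrative, and assuming zero-padded hour directories compared as strings) might look like:

```sql
SELECT d, COUNT(1) AS c
FROM s3.`root`
WHERE dir0 = '2017' AND dir1 = '03'
  AND ( (dir2 = '16' AND dir3 >= '04')   -- local midnight = 04:00 UTC, same UTC day
     OR (dir2 = '17' AND dir3 <  '04') ) -- spillover into the next UTC day
GROUP BY d
ORDER BY c DESC
LIMIT 10;
```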

Charles Givre
2017-05-10 16:16:14 UTC
Yeah…
Thanks Chunhui!
Wesley Chow
2017-05-10 17:32:06 UTC
Suppose that I have a directory structure in S3 like so:

root/YYYY/MM/{lots of files}

Where YYYY and MM are year and month numbers. If I run a query like:

SELECT count(1) FROM root WHERE dir0='2017' AND dir1='03';

Does Drill do a scan to find all files in root, thus picking up files from
2016, and then filter them down to ones matching dir0='2017' and dir1='03'
before reading the data? That's what I meant by "scan all the files." Or
does Drill know that it only has to do a scan of files in the 2017/03
directory?

Wes
Zelaine Fong
2017-05-10 17:47:06 UTC
Drill will only scan the files in the 2017/03 directory. See https://drill.apache.org/docs/how-to-partition-data/, which describes an example very similar to your use case.

-- Zelaine
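
One way to check the pruning on a given build is to inspect the physical plan and confirm that only files under 2017/03 appear in the scan (workspace name hypothetical):

```sql
EXPLAIN PLAN FOR
SELECT COUNT(1) FROM s3.`root`
WHERE dir0 = '2017' AND dir1 = '03';
```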

Wesley Chow
2017-05-10 17:50:46 UTC
That is neat, thanks!

Wes