Discussion:
delimiter in column values
Paul Rogers
2017-08-02 16:10:36 UTC
Hi Divya,

Drill follows the commonly accepted practice for CSV files. The general rules are:

1. Column headers all on one line, comma-separated. (Drill 1.11 has fixes in this area, so you’ll want to use that if you have any problems.)
2. Each record on its own line, comma-separated, no leading or trailing spaces.
3. No need for quotes unless your value contains commas.

You can customize behavior using the storage plugin config:

* Choose delimiter (tab for TSV, | for PSV, etc.)
* Choose to read or skip the header.

You’ll want to make sure to use the “,” delimiter and to read and use the header. The docs have an example of the required setup.

Values are always read as text, so even your numbers will start as VarChar. You can convert to a numeric type in the query.

Example using your data:

Column1,Column2,Column3,Column4,Column5
colonedata1,coltwodata1,-35.924476,138.5987123,
colonedata2,coltwodata2,-27.4372536,153.0304583,137

Note that if columns are empty (like your first row), you still should include the comma separators. (Another bug fix in 1.11 fixes this case; 1.10 and earlier have problems if trailing columns are missing.)

Thanks,

- Paul
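
A minimal sketch of the numeric conversion described above, applied to this data set. It assumes the csv format is configured to extract headers, so that columns can be referenced by name. The file path and the output aliases are hypothetical. NULLIF is used so that empty values become NULL instead of failing the cast.

```sql
-- Sketch only: the path is hypothetical, and header extraction is assumed
-- to be enabled so that Column1..Column5 are addressable by name.
SELECT
  Column1,
  Column2,
  -- Every value is read as VARCHAR; empty strings cannot be cast to a
  -- number, so they are mapped to NULL before the cast.
  CAST(NULLIF(Column3, '') AS DOUBLE) AS column3_num,
  CAST(NULLIF(Column4, '') AS DOUBLE) AS column4_num,
  CAST(NULLIF(Column5, '') AS INT)    AS column5_num
FROM dfs.`/tmp/sample_data.csv`;
```

The text columns are left untouched; only the three numeric-looking columns are converted.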


On Aug 1, 2017, at 11:51 PM, Divya Gehlot <***@gmail.com> wrote:

Hi,
My column headers are on a single line only, i.e.
Column1,Column2,Column3,Column4,Column5
"colonedata1","coltwodata1","-35.924476","138.5987123",""
"colonedata2","coltwodata2","-27.4372536","153.0304583","137"
colonedata3","coltwodata3","-35.2793885","149.1233503","134"
"colonedata4","coltwodata4","-33.8724176","151.2067579",""

As you advised, I put quotes around each column value as the string delimiter and ran the select query.
I am attaching the data file too.

Appreciate the help !

Thanks,
Divya

On 2 August 2017 at 12:37, Kunal Khatua <***@mapr.com> wrote:
So, the way you’ve shown your data is basically in this format:

<List of column headers, one per line>
<actual column data, one row per line>

Unfortunately, I don't believe the text reader in Drill is advanced enough to interpret a list of column headers spread across multiple lines while the actual data is on a single line per row.

Typically, text data is in CSV (or uses another delimiter in place of the comma) and can have the first line representing a header.

Also, I'm not sure if there was ever an option introduced to allow skipping of the initial set of lines within a text file being read.


-----Original Message-----
From: Divya Gehlot [mailto:***@gmail.com]
Sent: Tuesday, August 01, 2017 7:06 PM
To: ***@drill.apache.org
Subject: Re: delimiter in column values

For my sample dataset, as you advised, I also surrounded the single-word column values with quotes, and the results are as below:
col_Column1
Column2
Column3
Column4
Column5
"Chifley" "coltwodata5" "" "" ""
"colonedata1" "coltwodata1" "-35.924476" "138.5987123" ""
"colonedata2" "coltwodata2" "-27.4372536" "153.0304583" "137"
"colonedata4" "coltwodata4" "-33.8724176" "151.2067579" ""
"colonedata5" "coltwodata5" "" "" ""
"This col6 data" "coltwodata6" "-33.869732" "151.2055553"
"This col7 data yes." "coltwodata7" "1.2845045" "103.8482739"
colonedata3" "coltwodata3" "-35.2793885" "149.1233503" "134"

Thanks,
Divya
I think you need quotes around the single word datasets as well,
because the quotes act as String delimiters and help in indicating the
start and end of a String.
Is there a reason why the single word strings cannot be in quotes as well?
-----Original Message-----
Sent: Tuesday, August 01, 2017 3:04 AM
Subject: delimiter in column values
Hi,
I have a data set which has the delimiter in the first column value when I read
col_Column1
Column2
Column3
Column4
Column5
"This col6 data" coltwodata6 -33.869732 151.2055553
"This col7 data yes." coltwodata7 1.2845045 103.8482739
Chifley coltwodata5
colonedata1 coltwodata1 -35.924476 138.5987123
colonedata2 coltwodata2 -27.4372536 153.0304583 137
colonedata3 coltwodata3 -35.2793885 149.1233503 134
colonedata4 coltwodata4 -33.8724176 151.2067579
colonedata5 coltwodata5
How can I read the Column1 values as is, without them being split into two
columns? For instance, the column values should be:
Column1
colonedata1,
colonedata2,
colonedata3,
colonedata4,
colonedata5,
"This, col6 data"
"This, col7 data"
Chifley,
Appreciate the help !
Thanks ,
Divya
<sample_data.csv>
Divya Gehlot
2017-08-02 02:00:14 UTC
Hi,
I received the data in the format that I posted.


Thanks,
Divya
Kunal Khatua
2017-08-02 04:39:08 UTC
You could just try to have the headers in a single line too... emulating the structure that the rest of the data follows.

Divya Gehlot
2017-08-03 01:44:57 UTC
Hi ,

I am using Drill 1.11 with all the settings you mentioned in the
plugin configuration.
As Kunal advised, I surrounded the column values with quotes, which act as
a string delimiter, because one of my column values contains the same
character as the field delimiter.
I am still getting the same results, i.e. the first column values are
split into two columns, as shown in my earlier posts.
I am wondering how to resolve the column split issue, as the data set is
received from a third party.


Appreciate the help!

Thanks,
Divya
Kunal Khatua
2017-08-03 04:45:13 UTC
Based on your sample data, which contains this:
Column1,Column2,Column3,Column4,Column5
"colonedata1","coltwodata1","-35.924476","138.5987123",""
"colonedata2","coltwodata2","-27.4372536","153.0304583","137"
colonedata3","coltwodata3","-35.2793885","149.1233503","134"
"colonedata4","coltwodata4","-33.8724176","151.2067579",""
"colonedata5","coltwodata5","","",""
"This, col6 data","coltwodata6","-33.869732","151.2055553","351"
"This, col7 data yes.","coltwodata7","1.2845045","103.8482739","80"
"Chifley","coltwodata5","","",""

I got this and it looks like this...


0: jdbc:drill:schema=dfs.root> select * from `sample_data.csv`;
+------------------------------------------------------------------------+
| columns |
+------------------------------------------------------------------------+
| ["Column1","Column2","Column3","Column4","Column5"] |
| ["colonedata1","coltwodata1","-35.924476","138.5987123",""] |
| ["colonedata2","coltwodata2","-27.4372536","153.0304583","137"] |
| ["colonedata3\"","coltwodata3","-35.2793885","149.1233503","134"] |
| ["colonedata4","coltwodata4","-33.8724176","151.2067579",""] |
| ["colonedata5","coltwodata5","","",""] |
| ["This, col6 data","coltwodata6","-33.869732","151.2055553","351"] |
| ["This, col7 data yes.","coltwodata7","1.2845045","103.8482739","80"] |
| ["Chifley","coltwodata5","","",""] |
+------------------------------------------------------------------------+
9 rows selected (0.502 seconds)
0: jdbc:drill:schema=dfs.root> select columns[0] from `sample_data.csv`;
+-----------------------+
| EXPR$0 |
+-----------------------+
| Column1 |
| colonedata1 |
| colonedata2 |
| colonedata3" |
| colonedata4 |
| colonedata5 |
| This, col6 data |
| This, col7 data yes. |
| Chifley |
+-----------------------+
9 rows selected (0.581 seconds)

I was wondering if there is something else you're seeing because you're running this on Windows. So I tried after converting the Unix format and got the exact same result. Is this what you're getting?
I'm running this on a Linux machine.
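
The output above comes from reading the file without header extraction, so every line, including the header line, arrives as a single columns array. The following is a hedged sketch of how named, typed columns could be derived from that array. The aliases are assumptions, and the header line is filtered out by its literal first value.

```sql
-- Sketch only: assumes the file is read without header extraction, as in
-- the output above, so each line is exposed through the columns array.
SELECT
  columns[0] AS column1,
  columns[1] AS column2,
  -- Values are VARCHAR; map empty strings to NULL before casting.
  CAST(NULLIF(columns[2], '') AS DOUBLE) AS column3_num,
  CAST(NULLIF(columns[3], '') AS DOUBLE) AS column4_num,
  CAST(NULLIF(columns[4], '') AS INT)    AS column5_num
FROM `sample_data.csv`
WHERE columns[0] <> 'Column1';  -- skip the header line, which is returned as data
```

The stray quote in the colonedata3 line is a defect in the data file itself (its opening quote is missing), so it would still appear in column1 of the result.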

Divya Gehlot
2017-08-03 06:14:56 UTC
Hi ,
This is my output when run in sqlline on Windows in embedded mode:

0: jdbc:drill:zk=local> select * from `dfs`.`installedsoftwares/ApacheDrill/apache-drill-1.10.0.tar/apache-drill-1.10.0/sample-data/sample_data.csv`;
+----------------+-------------------+----------------+----------------+----------------+
| col_Column1    | Column2           | Column3        | Column4        | Column5        |
+----------------+-------------------+----------------+----------------+----------------+
| "colonedata1"  | "coltwodata1"     | "-35.924476"   | "138.5987123"  | ""             |
| "colonedata2"  | "coltwodata2"     | "-27.4372536"  | "153.0304583"  | "137"          |
| "colonedata3"  | "coltwodata3"     | "-35.2793885"  | "149.1233503"  | "134"          |
| "colonedata4"  | "coltwodata4"     | "-33.8724176"  | "151.2067579"  | ""             |
| "colonedata5"  | "coltwodata5"     | ""             | ""             | ""             |
| "This          | col6 data"        | "coltwodata6"  | "-33.869732"   | "151.2055553"  |
| "This          | col7 data yes."   | "coltwodata7"  | "1.2845045"    | "103.8482739"  |
| "Chifley"      | "coltwodata5"     | ""             | ""             | ""             |
+----------------+-------------------+----------------+----------------+----------------+
8 rows selected (0.147 seconds)
0: jdbc:drill:zk=local> select `col_Column1` from `dfs`.`installedsoftwares/ApacheDrill/apache-drill-1.10.0.tar/apache-drill-1.10.0/sample-data/sample_data.csv`;
+----------------+
| col_Column1    |
+----------------+
| "colonedata1"  |
| "colonedata2"  |
| "colonedata3"  |
| "colonedata4"  |
| "colonedata5"  |
| "This          |
| "This          |
| "Chifley"      |
+----------------+
8 rows selected (0.1 seconds)


Is the query returning different results because of the host operating system?


Thanks,
Divya
Kunal Khatua
2017-08-03 13:06:37 UTC
A couple of things...

1. Your delimiter is a pipe in this example, and not a comma as originally seen in the attached file. For such scenarios, we either modify the storage plugin or rename the extension to 'psv' so that Drill understands what the delimiter is.

2. Can you try Drill-1.11.0 ?

3. There are table functions in Drill that guide it with additional inputs on how to manage the preparation of the table.

I'll try this on a Windows machine in the meantime.
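
Point 3 refers to Drill's table functions, which pass format options inline in the query instead of through the storage plugin. The following is a sketch, assuming the file is comma-delimited with a header line and double-quoted values. The path is hypothetical. Depending on the Drill version, options not listed here may fall back to built-in defaults rather than to the plugin configuration, so it is worth checking the docs for the version in use.

```sql
-- Sketch only: the path is hypothetical; the option names follow Drill's
-- text format attributes (type, fieldDelimiter, quote, extractHeader).
SELECT *
FROM table(
  dfs.`/tmp/sample_data.csv`(
    type => 'text',
    fieldDelimiter => ',',   -- split fields on commas
    quote => '"',            -- commas inside double quotes stay in the value
    extractHeader => true    -- use the first line as column names
  )
);
```

With these options, a value such as "This, col6 data" would be expected to stay in a single column rather than being split at the comma.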


________________________________
From: Divya Gehlot <***@gmail.com>
Sent: Wednesday, August 2, 2017 11:14:56 PM
To: ***@drill.apache.org
Subject: Re: delimiter in column values

Hi ,
This is my output when run in sqlline on Windows Embedded mode

0: jdbc:drill:zk=local> select * from
`dfs`.`installedsoftwares/ApacheDrill/apache-drill-1.10.0.tar/apache-drill-1.10.0/sample-data/sample_data.csv`;
+----------------+-------------------+----------------+----------------+----------------+
| col_Column1 | Column2 | Column3 | Column4 |
Column5 |
+----------------+-------------------+----------------+----------------+----------------+
| "colonedata1" | "coltwodata1" | "-35.924476" | "138.5987123" | ""
|
| "colonedata2" | "coltwodata2" | "-27.4372536" | "153.0304583" |
"137" |
| "colonedata3" | "coltwodata3" | "-35.2793885" | "149.1233503" |
"134" |
| "colonedata4" | "coltwodata4" | "-33.8724176" | "151.2067579" | ""
|
| "colonedata5" | "coltwodata5" | "" | "" | ""
|
| "This | col6 data" | "coltwodata6" | "-33.869732" |
"151.2055553" |
| "This | col7 data yes." | "coltwodata7" | "1.2845045" |
"103.8482739" |
| "Chifley" | "coltwodata5" | "" | "" | ""
|
+----------------+-------------------+----------------+----------------+----------------+
8 rows selected (0.147 seconds)
0: jdbc:drill:zk=local> select `col_Column1` from
`dfs`.`installedsoftwares/ApacheDrill/apache-drill-1.10.0.tar/apache-drill-1.10.0/sample-data/sample_data.csv`;
+----------------+
| col_Column1 |
+----------------+
| "colonedata1" |
| "colonedata2" |
| "colonedata3" |
| "colonedata4" |
| "colonedata5" |
| "This |
| "This |
| "Chifley" |
+----------------+
8 rows selected (0.1 seconds)


The query returning the different results due to host operating system?


Thanks,
Divya
Post by Paul Rogers
Column1,Column2,Column3,Column4,Column5
"colonedata1","coltwodata1","-35.924476","138.5987123",""
"colonedata2","coltwodata2","-27.4372536","153.0304583","137"
colonedata3","coltwodata3","-35.2793885","149.1233503","134"
"colonedata4","coltwodata4","-33.8724176","151.2067579",""
"colonedata5","coltwodata5","","",""
"This, col6 data","coltwodata6","-33.869732","151.2055553","351"
"This, col7 data yes.","coltwodata7","1.2845045","103.8482739","80"
"Chifley","coltwodata5","","",""
I got this and it looks like this...
0: jdbc:drill:schema=dfs.root> select * from `sample_data.csv`;
+------------------------------------------------------------------------+
| columns |
+------------------------------------------------------------------------+
| ["Column1","Column2","Column3","Column4","Column5"]
|
| ["colonedata1","coltwodata1","-35.924476","138.5987123",""] |
| ["colonedata2","coltwodata2","-27.4372536","153.0304583","137"] |
| ["colonedata3\"","coltwodata3","-35.2793885","149.1233503","134"] |
| ["colonedata4","coltwodata4","-33.8724176","151.2067579",""] |
| ["colonedata5","coltwodata5","","",""] |
| ["This, col6 data","coltwodata6","-33.869732","151.2055553","351"] |
| ["This, col7 data yes.","coltwodata7","1.2845045","103.8482739","80"] |
| ["Chifley","coltwodata5","","",""] |
+------------------------------------------------------------------------+
9 rows selected (0.502 seconds)
0: jdbc:drill:schema=dfs.root> select columns[0] from `sample_data.csv`;
+-----------------------+
| EXPR$0 |
+-----------------------+
| Column1 |
| colonedata1 |
| colonedata2 |
| colonedata3" |
| colonedata4 |
| colonedata5 |
| This, col6 data |
| This, col7 data yes. |
| Chifley |
+-----------------------+
9 rows selected (0.581 seconds)
I was wondering if there is something else you're seeing because you're
running this on Windows. So I tried after converting the Unix format and
got the exact same result. Is this what you're getting?
I'm running this on a Linux machine.
-----Original Message-----
Sent: Wednesday, August 02, 2017 6:45 PM
Subject: Re: delimiter in column values
Hi ,
I am using Drill 1.11 with all the settings you mentioned in the plugin
configuration.
As Kunal advised, I surrounded the column values with quotes, which act
as string delimiters, because one of my column values contains the same
character as the field delimiter. I am still getting the same results, i.e.
the first column's values are split into two columns, as shown in my earlier posts.
I am wondering how to resolve the column split issue, as the data set is
received from a third party.
Appreciate the help!
Thanks,
Divya
Post by Paul Rogers
Hi Divya,
1. Column headers all on one line, comma separated. (Drill 1.11 has
fixes in this area, so you’ll want to use that if you have any problems.)
2. Each record on its own line, comma-separated, no leading or trailing spaces.
3. No need for quotes unless your value contains commas.
* Choose delimiter (tab for TSV, | for PSV, etc.)
* Choose to read or skip the header.
You’ll want to make sure to use the “,” delimiter, read and use the
header. The docs have an example of the required setup.
Values are always read as text, so even your numbers will start as
VarChar. You can convert to a numeric type in the query.
Column1,Column2,Column3,Column4,Column5
colonedata1,coltwodata1,-35.924476,138.5987123,
colonedata2,coltwodata2,-27.4372536,153.0304583,137
Note that if columns are empty (like your first row), you still should
include the comma separators. (Another bug fix in 1.11 fixes this case;
1.10 and earlier have problems if trailing columns are missing.)
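The two points above (every value arrives as text, and an empty trailing column still needs its comma) can be illustrated outside Drill. This is a rough Python sketch, not Drill SQL; `DATA`, `to_float_or_none`, and `load` are hypothetical names used only for the illustration, and in Drill the conversion would be done in the query itself.

```python
import csv
import io

# The example rows from above; the first data row ends with a comma, so its
# fifth column is present but empty.
DATA = '''Column1,Column2,Column3,Column4,Column5
colonedata1,coltwodata1,-35.924476,138.5987123,
colonedata2,coltwodata2,-27.4372536,153.0304583,137
'''


def to_float_or_none(text):
    """Hypothetical helper: convert a text field to float, mapping empty text to None."""
    return float(text) if text.strip() else None


def load(text):
    """Read rows keyed by the header line, then convert Column3..Column5 from text."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        for name in ("Column3", "Column4", "Column5"):
            record[name] = to_float_or_none(record[name])
        rows.append(record)
    return rows


if __name__ == "__main__":
    for row in load(DATA):
        print(row)
```

Every field comes back from the reader as a string, including the empty one, so the numeric conversion has to decide explicitly what an empty value means.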
Thanks,
- Paul
Hi,
My column headers are in single line only i.e.
Column1,Column2,Column3,Column4,Column5
"colonedata1","coltwodata1","-35.924476","138.5987123",""
"colonedata2","coltwodata2","-27.4372536","153.0304583","137"
colonedata3","coltwodata3","-35.2793885","149.1233503","134"
"colonedata4","coltwodata4","-33.8724176","151.2067579",""
As you advised, I put quotes as string delimiters around each column value
and ran the select query.
I am attaching the data file too.
Appreciate the help!
Thanks,
Divya
<List of column headers, one per line> <actual column data, one row
per line>
Unfortunately, I don't believe the text reader in Drill is advanced
enough to interpret a list of column headers spread across multiple
lines while the actual data is on a single line per row.
Typically text data is in CSV (or other delimiters similar to the
comma) and can have the first line representing a header.
Also, I'm not sure if there was ever an option introduced to allow
skipping of the initial set of lines within a text file being read.
-----Original Message-----
From: Divya Gehlot
Sent: Tuesday, August 01, 2017 7:06 PM
Subject: Re: delimiter in column values
For my sample dataset as you advised I surrounded with single columns
col_Column1
Column2
Column3
Column4
Column5
"Chifley" "coltwodata5" "" "" ""
"colonedata1" "coltwodata1" "-35.924476" "138.5987123" ""
"colonedata2" "coltwodata2" "-27.4372536" "153.0304583" "137"
"colonedata4" "coltwodata4" "-33.8724176" "151.2067579" ""
"colonedata5" "coltwodata5" "" "" ""
"This col6 data" "coltwodata6" "-33.869732" "151.2055553"
"This col7 data yes." "coltwodata7" "1.2845045" "103.8482739"
colonedata3" "coltwodata3" "-35.2793885" "149.1233503" "134"
Thanks,
Divya
I think you need quotes around the single word datasets as well,
because the quotes act as String delimiters and help in indicating
the start and end of a String.
Is there a reason why the single word strings cannot be in quotes as
well?
-----Original Message-----
From: Divya Gehlot
Sent: Tuesday, August 01, 2017 3:04 AM
Subject: delimiter in column values
Hi,
I have a data set which has the delimiter in the first column value when I
col_Column1
Column2
Column3
Column4
Column5
"This col6 data" coltwodata6 -33.869732 151.2055553
"This col7 data yes." coltwodata7 1.2845045 103.8482739
Chifley coltwodata5
colonedata1 coltwodata1 -35.924476 138.5987123
colonedata2 coltwodata2 -27.4372536 153.0304583 137
colonedata3 coltwodata3 -35.2793885 149.1233503 134
colonedata4 coltwodata4 -33.8724176 151.2067579
colonedata5 coltwodata5
How can I read the Column1 values as-is, without them getting split into
two columns? For instance, the column values should be
Column1
colonedata1,
colonedata2,
colonedata3,
colonedata4,
colonedata5,
"This, col6 data"
"This, col7 data"
Chifley,
Appreciate the help !
Thanks ,
Divya
<sample_data.csv>
Divya Gehlot
2017-08-07 02:41:26 UTC
Hi,

Please find the response inline :

1. Your delimiter is a pipe in this example, and not a comma as originally
seen in the attached file. For such scenarios, either we modify the storage
plugin, or rename the extension to 'psv' so that Drill understands what the
delimiter is.
The delimiter in the file is a comma, not a pipe. It is the same file that I
queried in the Drill console and shared in earlier email messages.

2. Can you try Drill-1.11.0 ?
I am using Drill 1.11.0.
3. There are table functions in Drill that guide it with additional inputs
on how to manage the preparation of the table.
Can you please share the link?


Thanks,
Divya
Post by Kunal Khatua
A couple of things...
1. Your delimiter is a pipe in this example, and not a comma as originally
seen in the attached file. For such scenarios, either we modify the storage
plugin, or rename the extension to 'psv' so that Drill understands what the
delimiter is.
2. Can you try Drill-1.11.0 ?
3. There are table functions in Drill that guide it with additional inputs
on how to manage the preparation of the table.
I'll try this in a Windows machine in the meanwhile.
________________________________
Sent: Wednesday, August 2, 2017 11:14:56 PM
Subject: Re: delimiter in column values
Hi ,
This is my output when run in sqlline on Windows Embedded mode
0: jdbc:drill:zk=local> select * from `dfs`.`installedsoftwares/ApacheDrill/apache-drill-1.10.0.tar/apache-drill-1.10.0/sample-data/sample_data.csv`;
+----------------+-------------------+----------------+----------------+----------------+
| col_Column1    | Column2           | Column3        | Column4        | Column5        |
+----------------+-------------------+----------------+----------------+----------------+
| "colonedata1"  | "coltwodata1"     | "-35.924476"   | "138.5987123"  | ""             |
| "colonedata2"  | "coltwodata2"     | "-27.4372536"  | "153.0304583"  | "137"          |
| "colonedata3"  | "coltwodata3"     | "-35.2793885"  | "149.1233503"  | "134"          |
| "colonedata4"  | "coltwodata4"     | "-33.8724176"  | "151.2067579"  | ""             |
| "colonedata5"  | "coltwodata5"     | ""             | ""             | ""             |
| "This          | col6 data"        | "coltwodata6"  | "-33.869732"   | "151.2055553"  |
| "This          | col7 data yes."   | "coltwodata7"  | "1.2845045"    | "103.8482739"  |
| "Chifley"      | "coltwodata5"     | ""             | ""             | ""             |
+----------------+-------------------+----------------+----------------+----------------+
8 rows selected (0.147 seconds)
0: jdbc:drill:zk=local> select `col_Column1` from `dfs`.`installedsoftwares/ApacheDrill/apache-drill-1.10.0.tar/apache-drill-1.10.0/sample-data/sample_data.csv`;
+----------------+
| col_Column1 |
+----------------+
| "colonedata1" |
| "colonedata2" |
| "colonedata3" |
| "colonedata4" |
| "colonedata5" |
| "This |
| "This |
| "Chifley" |
+----------------+
8 rows selected (0.1 seconds)
Is the query returning different results because of the host operating system?
Thanks,
Divya
Kunal Khatua
2017-08-07 18:37:45 UTC
You'd probably want to look at this to understand how to manage your CSV files:
https://drill.apache.org/docs/text-files-csv-tsv-psv/

The table functions I'm referring to are described here, along with an example that I think addresses your needs:
https://drill.apache.org/docs/plugin-configuration-basics/#using-the-formats-attributes-as-table-function-parameters
See the preceding table ( https://drill.apache.org/docs/plugin-configuration-basics/#using-the-formats-attributes-as-table-function-parameters ) for other options beyond the example.


