Paul, thank you for the workaround; it worked perfectly in my case!
Post by Lee, David
This is a pretty ugly JSON file: 568 megs for 7,227 records.
=> ls -l test.jsonl
-rw-r--r-- 1 my_login users 568693075 Aug 28 15:15 test.jsonl
There is a difference of one, 7,226 vs. 7,227, but that comes from wc:
wc -l does not count the last line of a file if the file does not end with a
newline character.
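A quick way to see that behavior (a throwaway illustration; the file name is
made up):

# Write two records, the second one without a trailing newline.
printf '{"var1": "foo"}\n{"var1": "fo"}' > two_records.jsonl
# wc -l counts newline characters, so it reports 1 here, not 2.
wc -l two_records.jsonl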
-----Original Message-----
From: Lee, David
Sent: Tuesday, August 28, 2018 12:11 PM
Subject: RE: RE: Error: DATA_READ ERROR: Error parsing JSON - Cannot read
from the middle of a record
select count(*) on a jsonl file comes back instantly
/u1/my_login=> wc -l test.jsonl
7226 test.jsonl
select count(*) from dfs.`/u1/my_login/test.jsonl`
EXPR$0
7227
Overview
Operator ID | Type                | Avg Setup Time | Max Setup Time | Avg Process Time | Max Process Time | Min Wait Time | Avg Wait Time | Max Wait Time | % Fragment Time | % Query Time | Rows  | Avg Peak Memory | Max Peak Memory
00-xx-00    | JSON_SUB_SCAN       | 0.000s         | 0.000s         | 1.096s           | 3.287s           | 0.000s        | 0.181s        | 0.543s        | 99.58%          | 99.58%       | 7,228 | 24KB            | 32KB
00-xx-01    | PROJECT             | 0.001s         | 0.001s         | 0.000s           | 0.000s           | 0.000s        | 0.000s        | 0.000s        | 0.00%           | 0.00%        | 1     | 32KB            | 32KB
00-xx-02    | STREAMING_AGGREGATE | 0.022s         | 0.022s         | 0.001s           | 0.001s           | 0.000s        | 0.000s        | 0.000s        | 0.04%           | 0.04%        | 1     | 64KB            | 64KB
00-xx-03    | STREAMING_AGGREGATE | 0.040s         | 0.040s         | 0.011s           | 0.011s           | 0.000s        | 0.000s        | 0.000s        | 0.34%           | 0.34%        | 7,227 | 48KB            | 48KB
00-xx-04    | PROJECT             | 0.032s         | 0.032s         | 0.001s           | 0.001s           | 0.000s        | 0.000s        | 0.000s        | 0.04%           | 0.04%        | 7,227 | 16KB            | 16KB
-----Original Message-----
Sent: Tuesday, August 28, 2018 11:23 AM
Subject: Re: RE: Error: DATA_READ ERROR: Error parsing JSON - Cannot read
from the middle of a record
Hi Scott,
Bingo. I just tried this very case with the sample file from the previous
post and got exactly the failure described in the post you provided. I
notice that a "select *" query returns immediately, but a "count(*)" query
hangs for 30+ seconds before it errors out. Mine is only a two-record file,
so taking 30 seconds to fail is excessive.
Clearly, something is wrong. At the very least, a count(*) should simply
read all records and discard the data, using exactly the same JSON parser
as for a "SELECT *" query. That Drill is not doing so suggests that perhaps
the code is trying to be clever to optimize for the "count(*)" case, and is
doing so incorrectly.
SELECT COUNT(*) FROM `test.json` WHERE 1 = 1;
+---------+
| EXPR$0  |
+---------+
| 2       |
+---------+
As it turns out, I'm in the (very slow) process of issuing PRs for a
revised JSON record reader to handle other issues. A side effect of that
change is that the new implementation does use the same parse path for both
the "SELECT *" an "SELECT count(*)" paths. So, even if someone cannot fix
this bug short term, there is a longer-term fix coming.
Thanks,
- Paul
On Tuesday, August 28, 2018, 8:46:11 AM PDT, scott <
Paul,
Thanks for prompting the right questions. I went back and took another
look at my queries. It turns out that some condition causes this error when
running functions like "count(*)" on the data, where a normal unqualified
select does not. I also ran across this
article from MapR that led me to conclude Drill just doesn't support it.
https://mapr.com/support/s/article/Apache-Drill-cannot-read-from-middle-of-a-record?language=en_US
I think if we can confirm exactly which conditions cause the problem, we
should open a high priority Jira. What do you think?
Post by Paul Rogers
Hi Scott,
[ { "var1": "foo", "var2":"bar"},{"var1": "fo", "var2": "baz"}]
The oldest build I have readily available is Drill 1.13. I ran that as
select * from `test.json`;
+-------+-------+
| var1  | var2  |
+-------+-------+
| foo   | bar   |
| fo    | baz   |
+-------+-------+
I can try with Drill 1.12, once I find and download it. Or, you can
try with Drill 1.14 (the latest release).
I do wonder, however, if we are talking about the same thing. My test
puts your JSON in a JSON file with ".json" extension so that Drill
chooses the JSON parser. I'm using default JSON (session) options.
Is this what you are doing? Or, is your JSON coming from some other
source? Kafka? A field from a CSV file, say?
Thanks,
- Paul
On Monday, August 27, 2018, 10:31:00 PM PDT, scott <
Paul,
I'm using version 1.12. Can you tell me which version you think it was
fixed in? The ticket I referenced is still open, with no comments.
Scott
On Mon, Aug 27, 2018 at 5:47 PM Paul Rogers
Post by Paul Rogers
Hi David,
JSON files are never splittable: there is no single-character way to
find the start of a JSON record within a file.
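For contrast, a line-delimited file can be split on newline boundaries
without any parsing. A minimal sketch using the standard split utility (the
file name and chunk size are just illustrative):

# Each chunk starts at a record boundary because every record ends with a newline.
split -l 1000 test.jsonl chunk_
# Every chunk is itself valid line-delimited JSON.
wc -l chunk_*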
Drill is supposed to support two JSON formats: the array format from the
earlier post, and the non-JSON (but very common) list-of-objects format in
this example.
Thanks,
- Paul
On Monday, August 27, 2018, 5:38:32 PM PDT, Lee, David <
Get rid of the opening and closing brackets and see if you can turn the
commas between records into newlines. I think the file needs to be
splittable to reduce the memory overhead of parsing one giant string. The
result should look like this (a conversion sketch follows the example):
{"var1": "foo", "var2":"bar"}
{"var1": "fo", "var2": "baz"}
{"var1": "f2o", "var2": "baz2"}
{"var1": "f3o", "var2": "baz3"}
{"var1": "f4o", "var2": "baz4"}
{"var1": "f5o", "var2": "baz5"}
-----Original Message-----
Sent: Monday, August 27, 2018 4:59 PM
Subject: Error: DATA_READ ERROR: Error parsing JSON - Cannot read from
the middle of a record
Hi All,
I'm getting an error querying some of my json files.
The error I'm getting is: Error: DATA_READ ERROR: Error parsing JSON
- Cannot read from the middle of a record. Current token was
START_ARRAY
"bar"},{"var1": "fo", "var2": "baz"}]
I found a ticket that indicates this format is not supported by Drill yet:
https://jira.apache.org/jira/browse/DRILL-1755
but I find it hard to believe there is no workaround or solution since this
was reported 4 years back. Does anyone have a solution or workaround to
this problem?
Thanks,
Scott