Paul, thank you for the workaround; it worked perfectly in my case!
Post by Lee, David
This is a pretty ugly JSON file: 568 megs for 7,227 records.
=> ls -l test.jsonl
-rw-r--r-- 1 my_login users 568693075 Aug 28 15:15 test.jsonl
There is a difference of one, 7,226 vs. 7,227, but that comes from wc:
wc -l does not count the last line of a file if the file does not end with a
newline character.
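A quick way to see that behavior (a throwaway illustration; the file name is
made up):

# Write two records, the second one without a trailing newline.
printf '{"var1": "foo"}\n{"var1": "fo"}' > two_records.jsonl
# wc -l counts newline characters, so it reports 1 here, not 2.
wc -l two_records.jsonl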
-----Original Message-----
From: Lee, David
Sent: Tuesday, August 28, 2018 12:11 PM
Subject: RE: RE: Error: DATA_READ ERROR: Error parsing JSON - Cannot read
from the middle of a record
select count(*) on a jsonl file comes back instantly
/u1/my_login=> wc -l test.jsonl
7226 test.jsonl
select count(*) from dfs.`/u1/my_login/test.jsonl`
EXPR$0
7227
Overview
Operator ID | Type                | Avg Setup Time | Max Setup Time | Avg Process Time | Max Process Time | Min Wait Time | Avg Wait Time | Max Wait Time | % Fragment Time | % Query Time | Rows  | Avg Peak Memory | Max Peak Memory
00-xx-00    | JSON_SUB_SCAN       | 0.000s         | 0.000s         | 1.096s           | 3.287s           | 0.000s        | 0.181s        | 0.543s        | 99.58%          | 99.58%       | 7,228 | 24KB            | 32KB
00-xx-01    | PROJECT             | 0.001s         | 0.001s         | 0.000s           | 0.000s           | 0.000s        | 0.000s        | 0.000s        | 0.00%           | 0.00%        | 1     | 32KB            | 32KB
00-xx-02    | STREAMING_AGGREGATE | 0.022s         | 0.022s         | 0.001s           | 0.001s           | 0.000s        | 0.000s        | 0.000s        | 0.04%           | 0.04%        | 1     | 64KB            | 64KB
00-xx-03    | STREAMING_AGGREGATE | 0.040s         | 0.040s         | 0.011s           | 0.011s           | 0.000s        | 0.000s        | 0.000s        | 0.34%           | 0.34%        | 7,227 | 48KB            | 48KB
00-xx-04    | PROJECT             | 0.032s         | 0.032s         | 0.001s           | 0.001s           | 0.000s        | 0.000s        | 0.000s        | 0.04%           | 0.04%        | 7,227 | 16KB            | 16KB
-----Original Message-----
Sent: Tuesday, August 28, 2018 11:23 AM
Subject: Re: RE: Error: DATA_READ ERROR: Error parsing JSON - Cannot read
from the middle of a record
Hi Scott,
Bingo. I just tried this very case with the sample file from the previous
post and got exactly the failure described in the post you provided. I
notice that a "select *" query returns immediately, but a "count(*)" query
hangs for 30+ seconds before it errors out. Mine is only a two-record file,
so taking 30 seconds to fail is excessive.
Clearly, something is wrong. At the very least, a count(*) should simply
read all records and discard the data, using exactly the same JSON parser
as for a "SELECT *" query. That Drill is not doing so suggests that perhaps
the code is trying to be clever to optimize for the "count(*)" case, and is
doing so incorrectly.
SELECT COUNT(*) FROM `test.json` WHERE 1 = 1;
+---------+
| EXPR$0  |
+---------+
| 2       |
+---------+
As it turns out, I'm in the (very slow) process of issuing PRs for a
revised JSON record reader to handle other issues. A side effect of that
change is that the new implementation does use the same parse path for both
the "SELECT *" an "SELECT count(*)" paths. So, even if someone cannot fix
this bug short term, there is a longer-term fix coming.
Thanks,
- Paul
On Tuesday, August 28, 2018, 8:46:11 AM PDT, scott <
Paul,
Thanks for prompting the right questions. I went back and took another
look at my queries. It turns out that some condition causes this error when
running functions like "count(*)" on the data, where a normal unqualified
select does not. I also ran across this
article from MapR that led me to conclude Drill just doesn't support it.
https://mapr.com/support/s/article/Apache-Drill-cannot-read-from-middle-of-a-record?language=en_US
I think if we can confirm exactly which conditions cause the problem, we
should open a high priority Jira. What do you think?
Post by Paul Rogers
Hi Scott,
[ { "var1": "foo", "var2":"bar"},{"var1": "fo", "var2": "baz"}]
The oldest build I have readily available is Drill 1.13. I ran that as
select * from `test.json`;
+-------+-------+
| var1  | var2  |
+-------+-------+
| foo   | bar   |
| fo    | baz   |
+-------+-------+
I can try with Drill 1.12, once I find and download it. Or, you can
try with Drill 1.14 (the latest release).
I do wonder, however, if we are talking about the same thing. My test
puts your JSON in a JSON file with ".json" extension so that Drill
chooses the JSON parser. I'm using default JSON (session) options.
Is this what you are doing? Or, is your JSON coming from some other
source? Kafka? A field from a CSV file, say?
Thanks,
- Paul
On Monday, August 27, 2018, 10:31:00 PM PDT, scott <
Paul,
I'm using version 1.12. Can you tell me which version you think it was
fixed in? The ticket I referenced is still open, with no comments.
Scott
On Mon, Aug 27, 2018 at 5:47 PM Paul Rogers
Post by Paul Rogers
Hi David,
JSON files are never splittable: there is no single-character way to
find the start of a JSON record within a file.
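For contrast, a line-delimited file can be split on newline boundaries
without any parsing. A minimal sketch using the standard split utility (the
file name and chunk size are just illustrative):

# Each chunk starts at a record boundary because every record ends with a newline.
split -l 1000 test.jsonl chunk_
# Every chunk is itself valid line-delimited JSON.
wc -l chunk_*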
Drill is supposed to support two JSON formats: the array format from the
earlier post, and the non-JSON (but very common) list-of-objects format in
this example.
Thanks,
- Paul
On Monday, August 27, 2018, 5:38:32 PM PDT, Lee, David <
Get rid of the opening and closing brackets and see if you can turn the
commas between records into newlines. I think the file needs to be
splittable to reduce the memory overhead of parsing one giant string. The
result should look like this (a conversion sketch follows the example):
{"var1": "foo", "var2":"bar"}
{"var1": "fo", "var2": "baz"}
{"var1": "f2o", "var2": "baz2"}
{"var1": "f3o", "var2": "baz3"}
{"var1": "f4o", "var2": "baz4"}
{"var1": "f5o", "var2": "baz5"}
-----Original Message-----
Sent: Monday, August 27, 2018 4:59 PM
Subject: Error: DATA_READ ERROR: Error parsing JSON - Cannot read from
the middle of a record
Hi All,
I'm getting an error querying some of my json files.
The error I'm getting is: Error: DATA_READ ERROR: Error parsing JSON
- Cannot read from the middle of a record. Current token was
START_ARRAY
"bar"},{"var1": "fo", "var2": "baz"}]
I found a ticket that indicates this format is not supported by Drill yet:
https://jira.apache.org/jira/browse/DRILL-1755
but I find it hard to believe there is no workaround or solution since this
was reported 4 years back. Does anyone have a solution or workaround to
this problem?
Thanks,
Scott