Discussion:
how to use LIKE operator with a binary column
(too old to reply)
franca perrina
2017-04-25 15:45:50 UTC
Permalink
Raw Message
Hi,

I would like to use Drill to query data formatted in avro.

My avro schema looks like

..
{"name":"payload",
"type":"bytes"}
..

and the result to the query

SELECT payload FROM `dfs`.`myfile.avro` LIMIT 1

looks like:

+-----------------+
| payload |
+-----------------+
| [***@3b8e004e |
+-----------------+


My problem is that when I run a query like:

SELECT * FROM `dfs`.`myfile.avro` WHERE `PAYLOAD` LIKE '%abcd%'

then I have
org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR:
DrillRuntimeException: Unexpected byte 0xfd at position 1008556 encountered
while decoding UTF8 string. Fragment 0:0 [Error Id:
0c247c14-0e51-402c-ad9a-411cbc445597
on maprdemo:31010]

It seems like drill tries to decode the payload's bytes to UTF8.

What I would need is a grep like behaviour, where my payload data is
considered as is, i.e. binary data, and it is not converted to a string
data type.

Thanks a lot for your help.
franca
Boaz Ben-Zvi
2017-04-28 22:02:00 UTC
Permalink
Raw Message
Hi Franca,

This issue is specific to the “bytes” type; for other Avro types the LIKE clause matches the printed representation, like:

select * from dfs.`/data/avro/twitter.snappy.avro` where `timestamp` like '%66%';
+-------------+--------------------------------------+-------------+
| username | tweet | timestamp |
+-------------+--------------------------------------+-------------+
| miguno | Rock: Nerf paper, scissors is fine. | 1366150681 |
| BlizzardCS | Works as intended. Terran is IMBA. | 1366154481 |
+-------------+--------------------------------------+-------------+

Can you share some sample avro file with “bytes” type? (I couldn’t find any such sample online) Maybe we’ll need to open a Jira for this case …

Thanks,

-- Boaz

On 4/25/17, 8:45 AM, "franca perrina" <***@gmail.com> wrote:

Hi,

I would like to use Drill to query data formatted in avro.

My avro schema looks like

..
{"name":"payload",
"type":"bytes"}
..

and the result to the query

SELECT payload FROM `dfs`.`myfile.avro` LIMIT 1

looks like:

+-----------------+
| payload |
+-----------------+
| [***@3b8e004e |
+-----------------+


My problem is that when I run a query like:

SELECT * FROM `dfs`.`myfile.avro` WHERE `PAYLOAD` LIKE '%abcd%'

then I have
org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR:
DrillRuntimeException: Unexpected byte 0xfd at position 1008556 encountered
while decoding UTF8 string. Fragment 0:0 [Error Id:
0c247c14-0e51-402c-ad9a-411cbc445597
on maprdemo:31010]

It seems like drill tries to decode the payload's bytes to UTF8.

What I would need is a grep like behaviour, where my payload data is
considered as is, i.e. binary data, and it is not converted to a string
data type.

Thanks a lot for your help.
franca
Jinfeng Ni
2017-04-28 22:13:35 UTC
Permalink
Raw Message
The error seems to indicated 'PAYLOAD' does not contain UTF8-encoded
bytes. The like function is a string function, and it only accepts
varchar/char type, which assumes inputs are UTF8 bytes.

You may consider implementing a Drill UDF 'blike" which works similar
to string function 'like', but could operate on non-UTF8 bytes.
Post by Boaz Ben-Zvi
Hi Franca,
select * from dfs.`/data/avro/twitter.snappy.avro` where `timestamp` like '%66%';
+-------------+--------------------------------------+-------------+
| username | tweet | timestamp |
+-------------+--------------------------------------+-------------+
| miguno | Rock: Nerf paper, scissors is fine. | 1366150681 |
| BlizzardCS | Works as intended. Terran is IMBA. | 1366154481 |
+-------------+--------------------------------------+-------------+
Can you share some sample avro file with “bytes” type? (I couldn’t find any such sample online) Maybe we’ll need to open a Jira for this case …
Thanks,
-- Boaz
Hi,
I would like to use Drill to query data formatted in avro.
My avro schema looks like
..
{"name":"payload",
"type":"bytes"}
..
and the result to the query
SELECT payload FROM `dfs`.`myfile.avro` LIMIT 1
+-----------------+
| payload |
+-----------------+
+-----------------+
SELECT * FROM `dfs`.`myfile.avro` WHERE `PAYLOAD` LIKE '%abcd%'
then I have
DrillRuntimeException: Unexpected byte 0xfd at position 1008556 encountered
0c247c14-0e51-402c-ad9a-411cbc445597
on maprdemo:31010]
It seems like drill tries to decode the payload's bytes to UTF8.
What I would need is a grep like behaviour, where my payload data is
considered as is, i.e. binary data, and it is not converted to a string
data type.
Thanks a lot for your help.
franca
franca perrina
2017-05-05 16:03:58 UTC
Permalink
Raw Message
Hello,

Thanks for your answers.

My payload is not encoded in UTF8, it can contain some non printable
characters, new lines and it may contain bytes that are not valid in UTF8.
An the latter should be my case.

I have tried with regexp_matches:

SELECT * FROM `dfs`.`myfile.avro` WHERE regexp_matches(payload,
'(?s).*abcd.*');

but I have the same problem, and I have the same error if I do, obviously,

SELECT CAST(payload as VARCHAR) FROM `dfs`.`myfile.avro`;

So, I implemented an UDF function to convert the bytes into a hex encoded
string


public class AsciiStringBinaryFunc {

// Converts a varbinary type into a hex encoded string.
// (byte[]) {(byte)0xca, (byte)0xfe, (byte)0xba, (byte)0xbe} =>
"\xca\xfe\xba\xbe"
@FunctionTemplate(name = "ascii_string_binary", scope =
FunctionScope.SIMPLE, nulls = NullHandling.NULL_IF_NULL)
public static class StringBinary implements DrillSimpleFunc {
@Param VarBinaryHolder in;
@Output VarCharHolder out;
@Workspace Charset charset;
@Inject DrillBuf buffer;

@Override
public void setup() {
charset = java.nio.charset.Charset.forName("US-ASCII");
}

@Override
public void eval() {
byte[] buf =
org.apache.drill.common.util.DrillStringUtils.toBinaryString(in.buffer,
in.start, in.end).getBytes(charset);
buffer.setBytes(0, buf);
buffer.setIndex(0, buf.length);

out.start = 0;
out.end = buf.length;
out.buffer = buffer;
}
}
}

but then, I have a new problem

SELECT ascii_string_binary(payload) FROM `dfs`.`myfile.avro` LIMIT 1;

Error: SYSTEM ERROR: IndexOutOfBoundsException: index: 0, length: 3484
(expected: range(0, 256))

Fragment 0:0

[Error Id: d0ab90d6-8b2a-4200-8809-534138c217fb on maprdemo:31010]
(state=,code=0)


knowing that

SELECT length(payload) FROM `dfs`.`myfile.avro` LIMIT 1;

+---------+
| EXPR$0 |
+---------+
| 3484 |
+---------+




Thanks a lot for your help,
Franca
Post by Jinfeng Ni
The error seems to indicated 'PAYLOAD' does not contain UTF8-encoded
bytes. The like function is a string function, and it only accepts
varchar/char type, which assumes inputs are UTF8 bytes.
You may consider implementing a Drill UDF 'blike" which works similar
to string function 'like', but could operate on non-UTF8 bytes.
Post by Boaz Ben-Zvi
Hi Franca,
This issue is specific to the “bytes” type; for other Avro types the
select * from dfs.`/data/avro/twitter.snappy.avro` where `timestamp`
like '%66%';
Post by Boaz Ben-Zvi
+-------------+--------------------------------------+-------------+
| username | tweet | timestamp |
+-------------+--------------------------------------+-------------+
| miguno | Rock: Nerf paper, scissors is fine. | 1366150681 |
| BlizzardCS | Works as intended. Terran is IMBA. | 1366154481 |
+-------------+--------------------------------------+-------------+
Can you share some sample avro file with “bytes” type? (I couldn’t find
any such sample online) Maybe we’ll need to open a Jira for this case 

Post by Boaz Ben-Zvi
Thanks,
-- Boaz
Hi,
I would like to use Drill to query data formatted in avro.
My avro schema looks like
..
{"name":"payload",
"type":"bytes"}
..
and the result to the query
SELECT payload FROM `dfs`.`myfile.avro` LIMIT 1
+-----------------+
| payload |
+-----------------+
+-----------------+
SELECT * FROM `dfs`.`myfile.avro` WHERE `PAYLOAD` LIKE '%abcd%'
then I have
org.apache.drill.common.exceptions.UserRemoteException: SYSTEM
DrillRuntimeException: Unexpected byte 0xfd at position 1008556
encountered
Post by Boaz Ben-Zvi
0c247c14-0e51-402c-ad9a-411cbc445597
on maprdemo:31010]
It seems like drill tries to decode the payload's bytes to UTF8.
What I would need is a grep like behaviour, where my payload data is
considered as is, i.e. binary data, and it is not converted to a
string
Post by Boaz Ben-Zvi
data type.
Thanks a lot for your help.
franca
franca perrina
2017-05-10 15:53:03 UTC
Permalink
Raw Message
I found the way to resize the buffer

@Override
public void eval() {

byte[] buf =
org.apache.drill.common.util.DrillStringUtils.toBinaryString(in.buffer,
in.start, in.end).getBytes(charset);
* out.buffer = buffer = buffer.reallocIfNeeded(buf.length);*
...
}

Thanks a lot for you help.

Franca
Post by franca perrina
Hello,
Thanks for your answers.
My payload is not encoded in UTF8, it can contain some non printable
characters, new lines and it may contain bytes that are not valid in UTF8.
An the latter should be my case.
SELECT * FROM `dfs`.`myfile.avro` WHERE regexp_matches(payload,
'(?s).*abcd.*');
but I have the same problem, and I have the same error if I do, obviously,
SELECT CAST(payload as VARCHAR) FROM `dfs`.`myfile.avro`;
So, I implemented an UDF function to convert the bytes into a hex encoded
string
public class AsciiStringBinaryFunc {
// Converts a varbinary type into a hex encoded string.
// (byte[]) {(byte)0xca, (byte)0xfe, (byte)0xba, (byte)0xbe} =>
"\xca\xfe\xba\xbe"
@FunctionTemplate(name = "ascii_string_binary", scope =
FunctionScope.SIMPLE, nulls = NullHandling.NULL_IF_NULL)
public static class StringBinary implements DrillSimpleFunc {
@Param VarBinaryHolder in;
@Output VarCharHolder out;
@Workspace Charset charset;
@Inject DrillBuf buffer;
@Override
public void setup() {
charset = java.nio.charset.Charset.forName("US-ASCII");
}
@Override
public void eval() {
byte[] buf = org.apache.drill.common.util.DrillStringUtils.toBinaryString(in.buffer,
in.start, in.end).getBytes(charset);
buffer.setBytes(0, buf);
buffer.setIndex(0, buf.length);
out.start = 0;
out.end = buf.length;
out.buffer = buffer;
}
}
}
but then, I have a new problem
SELECT ascii_string_binary(payload) FROM `dfs`.`myfile.avro` LIMIT 1;
Error: SYSTEM ERROR: IndexOutOfBoundsException: index: 0, length: 3484
(expected: range(0, 256))
Fragment 0:0
[Error Id: d0ab90d6-8b2a-4200-8809-534138c217fb on maprdemo:31010]
(state=,code=0)
knowing that
SELECT length(payload) FROM `dfs`.`myfile.avro` LIMIT 1;
+---------+
| EXPR$0 |
+---------+
| 3484 |
+---------+
Thanks a lot for your help,
Franca
Post by Jinfeng Ni
The error seems to indicated 'PAYLOAD' does not contain UTF8-encoded
bytes. The like function is a string function, and it only accepts
varchar/char type, which assumes inputs are UTF8 bytes.
You may consider implementing a Drill UDF 'blike" which works similar
to string function 'like', but could operate on non-UTF8 bytes.
Post by Boaz Ben-Zvi
Hi Franca,
This issue is specific to the “bytes” type; for other Avro types
select * from dfs.`/data/avro/twitter.snappy.avro` where `timestamp`
like '%66%';
Post by Boaz Ben-Zvi
+-------------+--------------------------------------+-------------+
| username | tweet | timestamp |
+-------------+--------------------------------------+-------------+
| miguno | Rock: Nerf paper, scissors is fine. | 1366150681 |
| BlizzardCS | Works as intended. Terran is IMBA. | 1366154481 |
+-------------+--------------------------------------+-------------+
Can you share some sample avro file with “bytes” type? (I couldn’t
find any such sample online) Maybe we’ll need to open a Jira for this
case 

Post by Boaz Ben-Zvi
Thanks,
-- Boaz
Hi,
I would like to use Drill to query data formatted in avro.
My avro schema looks like
..
{"name":"payload",
"type":"bytes"}
..
and the result to the query
SELECT payload FROM `dfs`.`myfile.avro` LIMIT 1
+-----------------+
| payload |
+-----------------+
+-----------------+
SELECT * FROM `dfs`.`myfile.avro` WHERE `PAYLOAD` LIKE '%abcd%'
then I have
org.apache.drill.common.exceptions.UserRemoteException: SYSTEM
DrillRuntimeException: Unexpected byte 0xfd at position 1008556
encountered
Post by Boaz Ben-Zvi
0c247c14-0e51-402c-ad9a-411cbc445597
on maprdemo:31010]
It seems like drill tries to decode the payload's bytes to UTF8.
What I would need is a grep like behaviour, where my payload data is
considered as is, i.e. binary data, and it is not converted to a
string
Post by Boaz Ben-Zvi
data type.
Thanks a lot for your help.
franca
Loading...