Discussion:
Dealing with bad data when trying to do date computations
(too old to reply)
John Omernik
2017-02-28 16:03:09 UTC
Permalink
Raw Message
I have a data set that has birthdays in YYYY-MM-DD format.

Most of this data is great. I am trying to compute the age using

EXTRACT(year from age(dob))


But some of my data is crapola... let's call it alternative data...


When I try to run the Extract function, I get

Error: SYSTEM ERROR: IllegalFieldValueException: Value 0 for monthOfYear
must be in the range [1,12]

Fragment 5:17

[Error Id: 62f90784-c9f4-4362-9710-a37464fc801a on drillnode:20005]


I've tried an ugly where clause, and this works:

where

(dob LIKE '%-01-%' or dob LIKE '%-02-%' or dob LIKE '%-03-%' or dob LIKE
'%-04-%' or dob LIKE '%-05-%' or dob LIKE '%-06-%' or dob LIKE '%-07-%' or
dob LIKE '%-08-%' or dob LIKE '%-09-%' or

dob LIKE '%-1-%' or dob LIKE '%-2-%' or dob LIKE '%-3-%' or dob LIKE
'%-4-%' or dob LIKE '%-5-%' or dob LIKE '%-6-%' or dob LIKE '%-7-%' or dob
LIKE '%-8-%' or dob LIKE '%-9-%' or

dob LIKE '%-10-%' or dob LIKE '%-11-%' or dob LIKE '%-12-%')


But WOW is that ugly. I could add the jar for regex contains, and make it
much easier (do we have a regex search function built into drill? I think
we should at this point...)


Is there another way to say try the extra function, and catch a failure,
and ignore on failure? What if we had a cast function that returned NULL
on failure so we could use it in the where clause? Any other more elegant
ways to handle this?


Thanks!


John
Charles Givre
2017-02-28 16:15:10 UTC
Permalink
Raw Message
Hi John,
I believe that Drill 1.9 includes a REGEXP_MATCHES( <source>, <pattern> )
function which does what you'd expect it to. I'm not sure when this was
introduced, so it maybe in earlier versions of Drill.
Best,
-- C
Post by John Omernik
I have a data set that has birthdays in YYYY-MM-DD format.
Most of this data is great. I am trying to compute the age using
EXTRACT(year from age(dob))
But some of my data is crapola... let's call it alternative data...
When I try to run the Extract function, I get
Error: SYSTEM ERROR: IllegalFieldValueException: Value 0 for monthOfYear
must be in the range [1,12]
Fragment 5:17
[Error Id: 62f90784-c9f4-4362-9710-a37464fc801a on drillnode:20005]
where
(dob LIKE '%-01-%' or dob LIKE '%-02-%' or dob LIKE '%-03-%' or dob LIKE
'%-04-%' or dob LIKE '%-05-%' or dob LIKE '%-06-%' or dob LIKE '%-07-%' or
dob LIKE '%-08-%' or dob LIKE '%-09-%' or
dob LIKE '%-1-%' or dob LIKE '%-2-%' or dob LIKE '%-3-%' or dob LIKE
'%-4-%' or dob LIKE '%-5-%' or dob LIKE '%-6-%' or dob LIKE '%-7-%' or dob
LIKE '%-8-%' or dob LIKE '%-9-%' or
dob LIKE '%-10-%' or dob LIKE '%-11-%' or dob LIKE '%-12-%')
But WOW is that ugly. I could add the jar for regex contains, and make it
much easier (do we have a regex search function built into drill? I think
we should at this point...)
Is there another way to say try the extra function, and catch a failure,
and ignore on failure? What if we had a cast function that returned NULL
on failure so we could use it in the where clause? Any other more elegant
ways to handle this?
Thanks!
John
John Omernik
2017-02-28 16:48:29 UTC
Permalink
Raw Message
Thanks Charles, that worked even on my 1.8.

Drill folks: We need to do some documentation updates. We've added
functions (like REGEXP_MATCHES, and it's in 1.8, so I am not sure where it
was added) and other functions like SPLIT and yet no mention in
https://drill.apache.org/docs/string-manipulation/

So, yes, this is "meh" work compared to programming all the cool things in
Drill. But there are a number of reasons that this needs to be done
besides common practices.

1. Users, and more importantly POTENTIAL users get frustrated when trying
to use drill for the first time. Coming from other Big Data systems like
Hive, not having Regex, split, and other functions is frustrating. But what
is more frustrating is to find that they actually exist, and are just not
documented. Nothing will turn people off faster.

2. Without the knowledge of these functions, people try "hacky" work
arounds like what I did, killing performance, and setting Drill in a bad
light.

3. It provides an over all feeling of lack of effort by the community. I
am know that resources are not unlimited, and these things need to be
addressed by "someone" but issues like this are really important for
getting more people into the community who may be able to help contribute!

4. I think as part of developer review and pull requests that add
functions/functionality should require a pull request to also provide a
documentation update. This helps to ensure that the docs keep up to date,
as well as keeping users appraised of what is happening... i.e. it's a good
"feeling" to see a great tool like Drill "improving" with new
functionality.

Please, folks, we need to do some one time clean up (go back through pull
requests to ensure all functions are documented up to now) and then then
get processes in place to ensure ongoing updates.

Thanks

John Omernik
Post by Charles Givre
Hi John,
I believe that Drill 1.9 includes a REGEXP_MATCHES( <source>, <pattern> )
function which does what you'd expect it to. I'm not sure when this was
introduced, so it maybe in earlier versions of Drill.
Best,
-- C
Post by John Omernik
I have a data set that has birthdays in YYYY-MM-DD format.
Most of this data is great. I am trying to compute the age using
EXTRACT(year from age(dob))
But some of my data is crapola... let's call it alternative data...
When I try to run the Extract function, I get
Error: SYSTEM ERROR: IllegalFieldValueException: Value 0 for monthOfYear
must be in the range [1,12]
Fragment 5:17
[Error Id: 62f90784-c9f4-4362-9710-a37464fc801a on drillnode:20005]
where
(dob LIKE '%-01-%' or dob LIKE '%-02-%' or dob LIKE '%-03-%' or dob LIKE
'%-04-%' or dob LIKE '%-05-%' or dob LIKE '%-06-%' or dob LIKE '%-07-%'
or
Post by John Omernik
dob LIKE '%-08-%' or dob LIKE '%-09-%' or
dob LIKE '%-1-%' or dob LIKE '%-2-%' or dob LIKE '%-3-%' or dob LIKE
'%-4-%' or dob LIKE '%-5-%' or dob LIKE '%-6-%' or dob LIKE '%-7-%' or
dob
Post by John Omernik
LIKE '%-8-%' or dob LIKE '%-9-%' or
dob LIKE '%-10-%' or dob LIKE '%-11-%' or dob LIKE '%-12-%')
But WOW is that ugly. I could add the jar for regex contains, and make it
much easier (do we have a regex search function built into drill? I think
we should at this point...)
Is there another way to say try the extra function, and catch a failure,
and ignore on failure? What if we had a cast function that returned NULL
on failure so we could use it in the where clause? Any other more
elegant
Post by John Omernik
ways to handle this?
Thanks!
John
Jinfeng Ni
2017-02-28 17:02:02 UTC
Permalink
Raw Message
Post by John Omernik
4. I think as part of developer review and pull requests that add
functions/functionality should require a pull request to also provide a
documentation update. This helps to ensure that the docs keep up to date,
as well as keeping users appraised of what is happening... i.e. it's a good
"feeling" to see a great tool like Drill "improving" with new
functionality.
Please, folks, we need to do some one time clean up (go back through pull
requests to ensure all functions are documented up to now) and then then
get processes in place to ensure ongoing updates.
That's a good suggestion. We should try our best to keep the code and
doc in sync.

+1
Jinfeng Ni
2017-02-28 18:11:48 UTC
Permalink
Raw Message
Regarding the list of functions (build-in or UDF), someone once
suggested that we make the functions self-documented by adding a
sys.functions table.

select * from sys.functions where name like '%SPLIT%';

return function_name, parameter_list, description etc.

This way, use could simply query sys.functions using Drill.
Post by Jinfeng Ni
Post by John Omernik
4. I think as part of developer review and pull requests that add
functions/functionality should require a pull request to also provide a
documentation update. This helps to ensure that the docs keep up to date,
as well as keeping users appraised of what is happening... i.e. it's a good
"feeling" to see a great tool like Drill "improving" with new
functionality.
Please, folks, we need to do some one time clean up (go back through pull
requests to ensure all functions are documented up to now) and then then
get processes in place to ensure ongoing updates.
That's a good suggestion. We should try our best to keep the code and
doc in sync.
+1
John Omernik
2017-02-28 18:20:27 UTC
Permalink
Raw Message
You could also generate documentation updates via query at each release.
This would be a great feature, move the information close to the analysts
hands, I love how that would work. (I think I remember some talk about
extending sys.options to be self documenting as well.... )
Post by Jinfeng Ni
Regarding the list of functions (build-in or UDF), someone once
suggested that we make the functions self-documented by adding a
sys.functions table.
select * from sys.functions where name like '%SPLIT%';
return function_name, parameter_list, description etc.
This way, use could simply query sys.functions using Drill.
Post by Jinfeng Ni
Post by John Omernik
4. I think as part of developer review and pull requests that add
functions/functionality should require a pull request to also provide a
documentation update. This helps to ensure that the docs keep up to
date,
Post by Jinfeng Ni
Post by John Omernik
as well as keeping users appraised of what is happening... i.e. it's a
good
Post by Jinfeng Ni
Post by John Omernik
"feeling" to see a great tool like Drill "improving" with new
functionality.
Please, folks, we need to do some one time clean up (go back through
pull
Post by Jinfeng Ni
Post by John Omernik
requests to ensure all functions are documented up to now) and then then
get processes in place to ensure ongoing updates.
That's a good suggestion. We should try our best to keep the code and
doc in sync.
+1
John Omernik
2017-03-01 14:50:24 UTC
Permalink
Raw Message
So what would need to be done to get this process kick started? I see a
few components here:

1. Develop the table in sys (sys.functions) that stores the information
about the function. For this I propose this for discussions

name - The name of the function
description - The Description of the function
docgroup- This should be the groupings under "SQL Functions" in the
documentation. See * below for proposal
tags - A way to tag functions for easy searching like select from
sys.functions where tags like '%string%'
arguments - The Ordered list of arguments and their types
return - What is returned (and it's type)
Examples: Usage examples that can display nicely in both select * from
sys.functions as well as provide enough information for use in HTML Docs.


* docgroup.

Currently, in the documentation, the nesting is as below (I didn't expand
all the stuff on SQL Window and Nested Data). I want to outline my proposal
for to hand handle this... So I propose moving "SQL Window Functions",
"Nested Data Functions" and "Query Directory Functions" all under SQL
Functions and adding another group here "Common Functions". This will now
be the "Functions root". The Docgroup field in sys.function will have
levels and grouping separated by ":". Thus, firstlevel:second:level or
firstlevel:secondlevel:thirdlevel This will allow us to keep the grouping
in the sys.functions table, and make it so that on every release, the
documentation could be updated with a query. Note the level "SQL
Functions" would not be represented in the docgroup.

Here are some examples of functions

LOWER()
Current Doc Placement: SQL Functions -> String Manipulation
Proposed docgroup: 'Common Functions:String Manipulation'
How it would appear on the Doc page: SQL Functions -> Common Functions ->
String Manipulation

COUNT()
Current Doc Placement: SQL Functions -> SQL Window Functions -> Aggregate
Window Functions
Proposed docgroup: 'SQL Window Functions:Aggregate Window Functions'
How it would appear on the Doc pages: SQL Functions -> SQL Window Functions
-> Aggregate Window Functions

Basically it would be used to add the raw HTML to the final page in the
docgroup (Aggregate Window Functions)


Current Doc Layout:

SQL Reference
SQL Reference Introduction
-> Data Types
Lexical Structure
Operators
-> SQL Functions
->-> About SQL Function Examples
->-> Math and Trig
->->Data Type Conversion
->->Date/Time Functions and Arithmetic
->->String Manipulation
->->Aggregate and Aggregate Statistical
->-> Functions for handling Nulls
-> SQL Window Functions
->-> ...
->-> ...
-> Nested Data Functions
->->...
->->...
-> Query Directory Functions

2. Once we get to a certain point in development on sys.functions, we need
a call to arms. We need to come up with an initial "fill" of
sys.functions. For that we'll need to take current data fill it in, as
well as getting a list of all the ninja functions that have been added to
drill and not documented... not sure how get those with out an intense
code/jira review.

3. Come up with new function proposal guidelines. If you do a pull
request with a function, what will need to be included for your pull
request to be approved? We should not allow functions to be added to Drill
without a basic doc addition.

4. Update procedures?

This is complicated, but done well, it could really put the knowledge and
analyst needs right in the system itself!

John
Post by John Omernik
You could also generate documentation updates via query at each release.
This would be a great feature, move the information close to the analysts
hands, I love how that would work. (I think I remember some talk about
extending sys.options to be self documenting as well.... )
Post by Jinfeng Ni
Regarding the list of functions (build-in or UDF), someone once
suggested that we make the functions self-documented by adding a
sys.functions table.
select * from sys.functions where name like '%SPLIT%';
return function_name, parameter_list, description etc.
This way, use could simply query sys.functions using Drill.
Post by Jinfeng Ni
Post by John Omernik
4. I think as part of developer review and pull requests that add
functions/functionality should require a pull request to also provide a
documentation update. This helps to ensure that the docs keep up to
date,
Post by Jinfeng Ni
Post by John Omernik
as well as keeping users appraised of what is happening... i.e. it's a
good
Post by Jinfeng Ni
Post by John Omernik
"feeling" to see a great tool like Drill "improving" with new
functionality.
Please, folks, we need to do some one time clean up (go back through
pull
Post by Jinfeng Ni
Post by John Omernik
requests to ensure all functions are documented up to now) and then
then
Post by Jinfeng Ni
Post by John Omernik
get processes in place to ensure ongoing updates.
That's a good suggestion. We should try our best to keep the code and
doc in sync.
+1
John Omernik
2017-05-02 17:52:07 UTC
Permalink
Raw Message
I just want to say, there is a great JIRA already opened here:

https://issues.apache.org/jira/browse/DRILL-4258

I added a comment, I would encourage others to add comments if they think
this idea would be beneficial.
Post by John Omernik
So what would need to be done to get this process kick started? I see a
1. Develop the table in sys (sys.functions) that stores the information
about the function. For this I propose this for discussions
name - The name of the function
description - The Description of the function
docgroup- This should be the groupings under "SQL Functions" in the
documentation. See * below for proposal
tags - A way to tag functions for easy searching like select from
sys.functions where tags like '%string%'
arguments - The Ordered list of arguments and their types
return - What is returned (and it's type)
Examples: Usage examples that can display nicely in both select * from
sys.functions as well as provide enough information for use in HTML Docs.
* docgroup.
Currently, in the documentation, the nesting is as below (I didn't expand
all the stuff on SQL Window and Nested Data). I want to outline my proposal
for to hand handle this... So I propose moving "SQL Window Functions",
"Nested Data Functions" and "Query Directory Functions" all under SQL
Functions and adding another group here "Common Functions". This will now
be the "Functions root". The Docgroup field in sys.function will have
levels and grouping separated by ":". Thus, firstlevel:second:level or
firstlevel:secondlevel:thirdlevel This will allow us to keep the
grouping in the sys.functions table, and make it so that on every release,
the documentation could be updated with a query. Note the level "SQL
Functions" would not be represented in the docgroup.
Here are some examples of functions
LOWER()
Current Doc Placement: SQL Functions -> String Manipulation
Proposed docgroup: 'Common Functions:String Manipulation'
How it would appear on the Doc page: SQL Functions -> Common Functions ->
String Manipulation
COUNT()
Current Doc Placement: SQL Functions -> SQL Window Functions -> Aggregate
Window Functions
Proposed docgroup: 'SQL Window Functions:Aggregate Window Functions'
How it would appear on the Doc pages: SQL Functions -> SQL Window
Functions -> Aggregate Window Functions
Basically it would be used to add the raw HTML to the final page in the
docgroup (Aggregate Window Functions)
SQL Reference
SQL Reference Introduction
-> Data Types
Lexical Structure
Operators
-> SQL Functions
->-> About SQL Function Examples
->-> Math and Trig
->->Data Type Conversion
->->Date/Time Functions and Arithmetic
->->String Manipulation
->->Aggregate and Aggregate Statistical
->-> Functions for handling Nulls
-> SQL Window Functions
->-> ...
->-> ...
-> Nested Data Functions
->->...
->->...
-> Query Directory Functions
2. Once we get to a certain point in development on sys.functions, we need
a call to arms. We need to come up with an initial "fill" of
sys.functions. For that we'll need to take current data fill it in, as
well as getting a list of all the ninja functions that have been added to
drill and not documented... not sure how get those with out an intense
code/jira review.
3. Come up with new function proposal guidelines. If you do a pull
request with a function, what will need to be included for your pull
request to be approved? We should not allow functions to be added to Drill
without a basic doc addition.
4. Update procedures?
This is complicated, but done well, it could really put the knowledge and
analyst needs right in the system itself!
John
Post by John Omernik
You could also generate documentation updates via query at each release.
This would be a great feature, move the information close to the analysts
hands, I love how that would work. (I think I remember some talk about
extending sys.options to be self documenting as well.... )
Post by Jinfeng Ni
Regarding the list of functions (build-in or UDF), someone once
suggested that we make the functions self-documented by adding a
sys.functions table.
select * from sys.functions where name like '%SPLIT%';
return function_name, parameter_list, description etc.
This way, use could simply query sys.functions using Drill.
Post by Jinfeng Ni
Post by John Omernik
4. I think as part of developer review and pull requests that add
functions/functionality should require a pull request to also provide
a
Post by Jinfeng Ni
Post by John Omernik
documentation update. This helps to ensure that the docs keep up to
date,
Post by Jinfeng Ni
Post by John Omernik
as well as keeping users appraised of what is happening... i.e. it's
a good
Post by Jinfeng Ni
Post by John Omernik
"feeling" to see a great tool like Drill "improving" with new
functionality.
Please, folks, we need to do some one time clean up (go back through
pull
Post by Jinfeng Ni
Post by John Omernik
requests to ensure all functions are documented up to now) and then
then
Post by Jinfeng Ni
Post by John Omernik
get processes in place to ensure ongoing updates.
That's a good suggestion. We should try our best to keep the code and
doc in sync.
+1
Loading...