Discussion:
CTAS memory leak
scott
2018-08-29 23:09:59 UTC
Hi all,
I've got a problem using the CREATE TABLE AS (CTAS) option that I was hoping someone
could help with. I am trying to create Parquet files from existing JSON
files using this method. It works on smaller datasets, but when I try it
on a large dataset, Drill takes up all the memory on my servers until they
swap and then crash. I'm running version 1.12 on CentOS 7. I've got my
drillbits set to -Xmx8G, which works for most queries and is not exceeded
by much, but when I run the CTAS, memory use just keeps growing without
bounds.
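For illustration, the query has this shape (the workspace and path names
here are made up, not my real ones):

      alter session set `store.format` = 'parquet';
      create table dfs.tmp.`events_parquet` as
        select * from dfs.`/data/json/events/`;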
I run 4 drillbits on each server with these settings: -Xms8G -Xmx8G
-XX:MaxDirectMemorySize=10G, on servers that have 48G of RAM each.
Has anyone else experienced this? Are there any workarounds you can suggest?

Thanks for your time,
Scott
Boaz Ben-Zvi
2018-08-30 00:10:46 UTC
Hi Scott,

1. "swaps and then crashes" - do you mean an Out-Of-Memory error?

2. Version 1.14 is available now, with several memory-control
improvements (e.g., Hash Join spilling and output batch sizing).

3. Direct memory is only 10G - why not go higher? This is where most of
Drill's in-memory data is held (not so much the stack and heap); see the
drill-env.sh sketch after this list.

4. You may want to increase the memory available to each query on each
node; the default (2GB) is too conservative (i.e., low).

    E.g., to go to 8GB (8589934592 = 8 * 1024^3 bytes), do

      alter session set `planner.memory.max_query_memory_per_node` = 8589934592;
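On point 3, a minimal sketch of raising the direct-memory limit, assuming
the stock drill-env.sh launch script (DRILL_HEAP and
DRILL_MAX_DIRECT_MEMORY are the variables it exposes; 16G is only an
example value):

      export DRILL_HEAP="8G"
      export DRILL_MAX_DIRECT_MEMORY="16G"

With 4 drillbits per 48G server, size these so that the sum of the heap
and direct limits across all the drillbits still fits in physical RAM.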

Thanks,

Boaz
Kunal Khatua
2018-08-30 18:04:02 UTC
Scott,

I think I can explain why you are getting the OutOfMemory error.

Drill essentially has two pools of memory: the standard JVM heap and the Netty-managed direct memory. When Drill reads a JSON document, it first has to be deserialized into Java heap objects because of the JSON parser libraries Drill uses. After that, Drill converts it into its internal representation within the direct memory space. The issue you are seeing is most likely that this initial step is consuming a very large amount of heap memory.
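One way to confirm this on a live drillbit is to watch heap occupancy
while the CTAS runs, using stock JDK tooling (5000 is just a 5-second
sampling interval; substitute your drillbit's actual pid):

      jstat -gcutil <drillbit-pid> 5000

If the old-generation column (O) climbs steadily toward 100% during the
read phase, it is the heap-side JSON deserialization described above.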

So, the options you have are:
1. Reduce the size of the individual units of the dataset (I'm assuming it is one giant JSON document within the source file).
2. Increase the heap, possibly at the cost of direct memory (say, 12GB Xmx and 6GB direct).
3. Reduce the parallelization, so that fewer JSON files are read and materialized in heap memory at a given time (see the sketch below).
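For option 3, a minimal sketch (the option name is Drill's; the value 4
is only an illustration to be tuned for your hardware):

      alter session set `planner.width.max_per_node` = 4;

This caps the number of minor fragments per drillbit, so fewer JSON
readers hold deserialized documents in the heap at once. Option 2 is the
same drill-env.sh edit sketched in Boaz's reply, with the balance shifted
toward heap (e.g., DRILL_HEAP="12G" and DRILL_MAX_DIRECT_MEMORY="6G").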

~ Kunal


