Discussion:
External Sort - Unable to Allocate Buffer error
Nate Butler
2017-05-01 14:44:49 UTC
We keep running into this issue when running a query with hashagg
disabled. When I look at system memory usage, though, Drill doesn't seem to
be using much of it but still hits this error.

Our environment:

- 1 r3.8xl
- 1 drillbit version 1.10.0 configured with 4GB of Heap and 230G of Direct
- Data stored on S3 is compressed CSV

I've tried increasing planner.memory.max_query_memory_per_node to 230G and
lowering planner.width.max_per_query to 1, and it still fails.

We've applied the patch from this bug in the hopes that it would resolve
the issue but it hasn't:

https://issues.apache.org/jira/browse/DRILL-5226

Stack Trace:

(org.apache.drill.exec.exception.OutOfMemoryException) Unable to allocate buffer of size 16777216 due to memory limit. Current allocation: 8445952
org.apache.drill.exec.memory.BaseAllocator.buffer():220
org.apache.drill.exec.memory.BaseAllocator.buffer():195
org.apache.drill.exec.vector.VarCharVector.reAlloc():425
org.apache.drill.exec.vector.VarCharVector.copyFromSafe():278
org.apache.drill.exec.vector.NullableVarCharVector.copyFromSafe():379
org.apache.drill.exec.test.generated.PriorityQueueCopierGen328.doCopy():22
org.apache.drill.exec.test.generated.PriorityQueueCopierGen328.next():75
org.apache.drill.exec.physical.impl.xsort.ExternalSortBatch.mergeAndSpill():602
org.apache.drill.exec.physical.impl.xsort.ExternalSortBatch.innerNext():428
org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.record.AbstractRecordBatch.next():119
org.apache.drill.exec.record.AbstractRecordBatch.next():109
org.apache.drill.exec.physical.impl.aggregate.StreamingAggBatch.innerNext():137
org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.physical.impl.BaseRootExec.next():104
org.apache.drill.exec.physical.impl.partitionsender.PartitionSenderRootExec.innerNext():144
org.apache.drill.exec.physical.impl.BaseRootExec.next():94
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():232
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():226
java.security.AccessController.doPrivileged():-2
javax.security.auth.Subject.doAs():422
org.apache.hadoop.security.UserGroupInformation.doAs():1657
org.apache.drill.exec.work.fragment.FragmentExecutor.run():226
org.apache.drill.common.SelfCleaningRunnable.run():38
java.util.concurrent.ThreadPoolExecutor.runWorker():1142
java.util.concurrent.ThreadPoolExecutor$Worker.run():617
java.lang.Thread.run():745 (state=,code=0)

Is there something I'm missing here? Any help/direction would be
appreciated.

Thanks,
Nate
Zelaine Fong
2017-05-01 15:09:59 UTC
Nate,

The Jira you’ve referenced relates to the new external sort, which is not enabled by default, as it is still going through some additional testing. If you’d like to try it to see if it resolves your problem, you’ll need to set “sort.external.disable_managed” as follows in your drill-override.conf file:

drill.exec: {
  cluster-id: "drillbits1",
  zk.connect: "localhost:2181",
  sort.external.disable_managed: false
}

and run the following query:

ALTER SESSION SET `exec.sort.disable_managed` = false;

-- Zelaine

Nate Butler
2017-05-02 15:24:50 UTC
Zelaine, thanks for the suggestion. I added this option both to
drill-override.conf and in the session, and this time the query stayed running
for much longer, but it eventually failed with the same error,
although with very different memory values.

(org.apache.drill.exec.exception.OutOfMemoryException) Unable to allocate buffer of size 134217728 due to memory limit. Current allocation: 10653214316
org.apache.drill.exec.memory.BaseAllocator.buffer():220
org.apache.drill.exec.memory.BaseAllocator.buffer():195
org.apache.drill.exec.vector.VarCharVector.reAlloc():425
org.apache.drill.exec.vector.VarCharVector.copyFromSafe():278
org.apache.drill.exec.vector.NullableVarCharVector.copyFromSafe():379
org.apache.drill.exec.test.generated.PriorityQueueCopierGen8.doCopy():22
org.apache.drill.exec.test.generated.PriorityQueueCopierGen8.next():76
org.apache.drill.exec.physical.impl.xsort.managed.CopierHolder$BatchMerger.next():234
org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.doMergeAndSpill():1408
org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.mergeAndSpill():1376
org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.spillFromMemory():1339
org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.processBatch():831
org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.loadBatch():618
org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.load():660
org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.innerNext():559
org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.record.AbstractRecordBatch.next():119
org.apache.drill.exec.record.AbstractRecordBatch.next():109
org.apache.drill.exec.physical.impl.aggregate.StreamingAggBatch.innerNext():137
org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.physical.impl.BaseRootExec.next():104
org.apache.drill.exec.physical.impl.partitionsender.PartitionSenderRootExec.innerNext():144
org.apache.drill.exec.physical.impl.BaseRootExec.next():94
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():232
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():226
java.security.AccessController.doPrivileged():-2
javax.security.auth.Subject.doAs():422
org.apache.hadoop.security.UserGroupInformation.doAs():1657
org.apache.drill.exec.work.fragment.FragmentExecutor.run():226
org.apache.drill.common.SelfCleaningRunnable.run():38
java.util.concurrent.ThreadPoolExecutor.runWorker():1142
java.util.concurrent.ThreadPoolExecutor$Worker.run():617
java.lang.Thread.run():745 (state=,code=0)

At first I didn't change planner.width.max_per_query, and the default on a
32-core machine is 23. That query failed after 34 minutes. I then
set planner.width.max_per_query=1, and that query also failed but
of course took longer, about 2 hours. In both cases,
planner.memory.max_query_memory_per_node was set to 230G.
rahul challapalli
2017-05-02 17:47:41 UTC
This is clearly a bug and, as Zelaine suggested, the new sort is still a work
in progress. We have a few similar bugs open for the new sort. I would have
pointed you to the JIRAs, but unfortunately JIRA is not working for me due to
firewall issues.

Another suggestion is to build Drill from the latest master and try it out, if
you are willing to spend some time. But again, there is no guarantee yet.

Please go ahead and raise a new JIRA. If it is a duplicate, I will mark it
as such later. Thank you.

- Rahul
Nate Butler
2017-05-02 18:35:38 UTC
Ok, thanks Rahul, I will do that.

Paul Rogers
2017-05-02 18:44:59 UTC
Hi Nate,

I’ll give you three separate suggestions. The first two build on the discussion with Zelaine. The third gets at a separate problem that could be the root cause.

First, let’s discuss logging. When we hit a bug such as this, the logs are incredibly useful to learn what is going on. Turn on debug logging. If you are familiar with Java logging, then you only need to enable the debug level for the org.apache.drill.exec.physical.impl.xsort.managed package. Then, look for lines that say “ExternalSortBatch”.

You will see a number of entries early on that identify the amount of memory available to the sort, the size of the incoming batches, and how we will slice up memory. Please post those lines to your JIRA entry.

Then, later, you’ll see an entry for the OOM error. Review the preceding entries to get a sense of where the sort was: was it still reading and spilling data from upstream (the sort phase)? Or had it gotten to the merge phase, in which we reread spilled data?

The log entries, while cryptic on first glance, make a bit more sense after you scan through the full set. Post those lines with summary info.
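
As a rough aid for pulling those entries out of a large log, here is a minimal Python sketch. It assumes nothing about Drill's log format beyond the “ExternalSortBatch” marker mentioned above; the helper name `sort_log_lines` is hypothetical and not part of Drill.

```python
from typing import Iterable, Iterator

# Marker taken from the advice above: sort-related entries mention this class name.
SORT_MARKER = "ExternalSortBatch"


def sort_log_lines(lines: Iterable[str], marker: str = SORT_MARKER) -> Iterator[str]:
    """Yield only the log lines that mention the sort operator.

    Hypothetical helper for illustration; it does a plain substring match
    and strips the trailing newline so lines can be pasted into a JIRA entry.
    """
    for line in lines:
        if marker in line:
            yield line.rstrip("\n")


# Example with in-memory lines; in practice, pass an open log file instead,
# e.g. open("/path/to/drillbit.log") (the path is an assumption).
sample = [
    "DEBUG o.a.d.e.p.i.x.m.ExternalSortBatch - some sort message\n",
    "INFO  o.a.d.e.w.f.FragmentExecutor - some other message\n",
]
for entry in sort_log_lines(sample):
    print(entry)
```

Because the function accepts any iterable of strings, it works equally on a file handle and on a list of lines copied from a terminal.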

Also, the query profile will tell you how much memory was actually used at the time of the OOM. You can compare that with the “budget” explained in the log file entry mentioned above.

Second, we can better define how Drill works with sort memory to help you properly configure your setup.

Here is some background.

* Your system has some amount of memory. In your case, 230 GB.
* To allocate memory to the sort, Drill does not use the actual memory. Instead, we use planner.memory.max_query_memory_per_node. (The idea is that you set this value as, roughly, system memory / number of concurrent queries.)
* Drill divides up memory to compute per-sort memory as: query memory per node / no. of slices / no. of sorts in the query.
* In your system, the number of slices is 23, so each fragment gets 10 GB of memory.
* If your query has a single sort, then each sort gets 10 GB of memory.
* However, memory per query is capped by the boot-time drill.memory.top.max option (see below), which defaults to 20 GB. Not an issue here, but it is an issue if the numbers above come out differently.
* Changing planner.width.max_per_query had no effect on memory.
* You’d ideally change planner.width.max_per_node to 1 to run the query single-threaded. But, due to the cap described above, no sort will get more than 20 GB anyway.

For the actual code, see [1].
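
The arithmetic in the list above can be sketched in Python as follows. This is only an illustration of the division described in this message, not the code in [1]; the function name `per_sort_memory`, the `cap` parameter, and the way the cap is applied (a simple minimum) are assumptions.

```python
from typing import Optional

GB = 1024 ** 3  # assumption: binary gigabytes; the thread does not say which unit is meant


def per_sort_memory(query_memory_per_node: int, slices: int, sorts_in_query: int,
                    cap: Optional[int] = None) -> int:
    """query memory per node / no. of slices / no. of sorts in the query.

    `cap` models the drill.memory.top.max limit as a plain upper bound,
    which is an assumption about how the limit is enforced.
    """
    if slices < 1 or sorts_in_query < 1:
        raise ValueError("slices and sorts_in_query must be at least 1")
    share = query_memory_per_node // slices // sorts_in_query
    return share if cap is None else min(share, cap)


# 230 GB per node, 23 slices, one sort: each sort gets 10 GB.
print(per_sort_memory(230 * GB, slices=23, sorts_in_query=1) / GB)  # 10.0

# Single-threaded with a 20 GB cap: the sort is limited to the cap.
print(per_sort_memory(230 * GB, slices=1, sorts_in_query=1, cap=20 * GB) / GB)  # 20.0
```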

Despite all this, the likely original 10 GB allocation should be plenty; the sort is supposed to spill. How much it spills depends on your input data size. When sorting, performance is affected by memory:

* If your data is smaller than sort memory, sorting happens in memory, and performance is optimal.
* If your data is larger than memory, but smaller than 8x memory, you’ll get a “single generation” spill/merge and performance should be no worse than 3x an in-memory sort. (1x is the original data read, another 1x is the spill, and the third 1x is the read/merge.)
* If your data is larger than 8x memory, sorting will need multiple generations of spill/merge/re-spill, and run-time will increase accordingly.
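
To make the three cases concrete, here is a small Python sketch that classifies a sort by the ratio of data size to sort memory. The 8x threshold comes from the list above; the function name `spill_regime` and the treatment of the exact boundaries (inclusive) are assumptions.

```python
def spill_regime(data_bytes: int, sort_memory_bytes: int, merge_width: int = 8) -> str:
    """Classify a sort using the rules of thumb above.

    merge_width=8 is the "8x memory" factor quoted in the list; whether the
    boundaries are inclusive is not stated there, so <= is an assumption.
    """
    if sort_memory_bytes <= 0:
        raise ValueError("sort_memory_bytes must be positive")
    if data_bytes <= sort_memory_bytes:
        return "in-memory"
    if data_bytes <= merge_width * sort_memory_bytes:
        return "single-generation spill"
    return "multi-generation spill"


GB = 1024 ** 3
print(spill_regime(5 * GB, 10 * GB))    # in-memory
print(spill_regime(60 * GB, 10 * GB))   # single-generation spill
print(spill_regime(200 * GB, 10 * GB))  # multi-generation spill
```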

Some options:

* Set planner.width.max_per_node to 1 to run the query single-threaded. This will use all memory for the single sort.
* But, we’ve got that pesky 20 GB global cap. So, change your drill-override.conf file as follows:

drill.memory.top.max: 100000000000

(Sorry for all the zeros. It is supposed to be 100 GB. We really should switch to a better format to specify memory…) 100 GB seems plenty without going larger.

You can verify that these changes take effect by looking for the log line that explains the managed sort’s memory calculations (when debug logging is enabled).

Third, all that said, I wonder if the problem is elsewhere. Yes, you are getting an Out of Memory (OOM) error. But, not in the usual place that indicates a sort issue. Instead, you are getting it in the allocation of a “value vector.” This raises some questions:

* How big is your input data (size on disk)?
* How many columns?
* How wide are your VarChar columns, on average?

You mentioned data is compressed CSV. With typical 8x compression, actual data sorted will be ~8x your on-disk size.

The column width question is critical. I see that the vector is trying to allocate 16 MB of data, which suggests that your column widths are 250 or larger. If so, we are probably looking at a different error that happens to be showing up while sorting.
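
The width estimate can be reproduced with a quick back-of-the-envelope calculation. The sketch below assumes a batch holds up to 65,536 rows and that the failed buffer request would have held one batch worth of values for a single VarChar column; both are assumptions for illustration, and the function name `avg_value_width` is hypothetical.

```python
ROWS_PER_BATCH = 65_536  # assumption: maximum rows in one batch


def avg_value_width(buffer_bytes: int, rows_per_batch: int = ROWS_PER_BATCH) -> float:
    """Average bytes per value if one buffer holds a full batch of one column."""
    if rows_per_batch < 1:
        raise ValueError("rows_per_batch must be at least 1")
    return buffer_bytes / rows_per_batch


# Buffer sizes taken from the two stack traces in this thread.
print(avg_value_width(16_777_216))   # 256.0
print(avg_value_width(134_217_728))  # 2048.0
```

Under those assumptions the 16 MB request corresponds to about 256 bytes per value, in line with the "250 or larger" estimate, and the 128 MB request from the managed sort corresponds to about 2 KB per value.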

Once we see the details of your data size, we can determine if we should focus more closely in that area.

Thanks,

- Paul

[1] https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/util/MemoryAllocationUtilities.java
Post by rahul challapalli
This is clearly a bug and like zelaine suggested the new sort is still work
in progress. We have a few similar bugs open for the new sort. I could have
pointed to the jira's but unfortunately JIRA is not working for me due to
firewall issues.
Another suggestion is build drill from the latest master and try it out, if
you are willing to spend some time. But again there is no guarantee yet.
Please go ahead and raise a new jira. If it is a duplicate, I will mark it
as such later. Thank You.
- Rahul
Post by Nate Butler
Zelaine, thanks for the suggestion. I added this option both to the
drill-override and in the session and this time the query did stay running
for much longer but it still eventually failed with the same error,
although much different memory values.
(org.apache.drill.exec.exception.OutOfMemoryException) Unable to allocate
10653214316
org.apache.drill.exec.memory.BaseAllocator.buffer():220
org.apache.drill.exec.memory.BaseAllocator.buffer():195
org.apache.drill.exec.vector.VarCharVector.reAlloc():425
org.apache.drill.exec.vector.VarCharVector.copyFromSafe():278
org.apache.drill.exec.vector.NullableVarCharVector.copyFromSafe():379
org.apache.drill.exec.test.generated.PriorityQueueCopierGen8.doCopy():22
org.apache.drill.exec.test.generated.PriorityQueueCopierGen8.next():76
org.apache.drill.exec.physical.impl.xsort.managed.CopierHolder$BatchMerger.next():234
org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.doMergeAndSpill():1408
org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.mergeAndSpill():1376
org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.spillFromMemory():1339
org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.processBatch():831
org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.loadBatch():618
org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.load():660
org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.innerNext():559
org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.record.AbstractRecordBatch.next():119
org.apache.drill.exec.record.AbstractRecordBatch.next():109
org.apache.drill.exec.physical.impl.aggregate.StreamingAggBatch.innerNext():137
org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.physical.impl.BaseRootExec.next():104
org.apache.drill.exec.physical.impl.partitionsender.PartitionSenderRootExec.innerNext():144
org.apache.drill.exec.physical.impl.BaseRootExec.next():94
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():232
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():226
java.security.AccessController.doPrivileged():-2
javax.security.auth.Subject.doAs():422
org.apache.hadoop.security.UserGroupInformation.doAs():1657
org.apache.drill.exec.work.fragment.FragmentExecutor.run():226
org.apache.drill.common.SelfCleaningRunnable.run():38
java.util.concurrent.ThreadPoolExecutor.runWorker():1142
java.util.concurrent.ThreadPoolExecutor$Worker.run():617
java.lang.Thread.run():745 (state=,code=0)
At first I didn't change planner.width.max_per_query and the default on a
32 core machine makes it 23. This query failed after 34 minutes. I then
tried setting planner.width.max_per_query=1 and this query also failed but
of course took longer, about 2 hours. In both cases,
planner.memory.max_query_memory_per_node was set to 230G.
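For reference, the session settings described above could be reproduced roughly as follows. This is only a sketch, not the exact statements that were run: it assumes "230G" means 230 GiB expressed in bytes (the unit this option takes), and that hashagg was disabled through `planner.enable_hashagg`.

```sql
-- Assumption: this is how hashagg was disabled for the query.
ALTER SESSION SET `planner.enable_hashagg` = false;

-- Assumption: "230G" = 230 GiB = 230 * 1024^3 bytes = 246960619520.
ALTER SESSION SET `planner.memory.max_query_memory_per_node` = 246960619520;

-- Second run only: cap the query's parallelism at 1.
ALTER SESSION SET `planner.width.max_per_query` = 1;
```

Session-level settings apply only to the connection that issued them, so they need to be set again in each new session before re-running the query.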
Post by Zelaine Fong
Nate,
The Jira you’ve referenced relates to the new external sort, which is not
enabled by default, as it is still going through some additional testing.
If you’d like to try it to see if it resolves your problem, you’ll need to
set “sort.external.disable_managed” as follows in your drill-override.conf:
drill.exec: {
cluster-id: "drillbits1",
zk.connect: "localhost:2181",
sort.external.disable_managed: false
}
ALTER SESSION SET `exec.sort.disable_managed` = false;
-- Zelaine
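One way to confirm that the session option took effect before re-running the query is to look it up in `sys.options`. This is a minimal sketch; the columns returned differ between Drill versions, so it selects everything rather than assuming specific value columns.

```sql
-- Show the current entry (or entries) for the managed-sort switch.
SELECT *
FROM sys.options
WHERE name = 'exec.sort.disable_managed';
```

The setting made in drill-override.conf is a boot-time option, so the drillbit needs a restart before it is picked up.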
We keep running into this issue when trying to issue a query with hashagg
disabled. When I look at system memory usage though, drill doesn't seem to
be using much of it but still hits this error.
- 1 r3.8xl
- 1 drillbit version 1.10.0 configured with 4GB of Heap and 230G of Direct
- Data stored on S3 is compressed CSV
I've tried increasing planner.memory.max_query_memory_per_node to 230G and
lowered planner.width.max_per_query to 1 and it still fails.
We've applied the patch from this bug in the hopes that it would resolve
the issue but it hasn't:
https://issues.apache.org/jira/browse/DRILL-5226
(org.apache.drill.exec.exception.OutOfMemoryException) Unable to allocate
buffer of size 16777216 due to memory limit. Current allocation: 8445952
org.apache.drill.exec.memory.BaseAllocator.buffer():220
org.apache.drill.exec.memory.BaseAllocator.buffer():195
org.apache.drill.exec.vector.VarCharVector.reAlloc():425
org.apache.drill.exec.vector.VarCharVector.copyFromSafe():278
org.apache.drill.exec.vector.NullableVarCharVector.copyFromSafe():379
org.apache.drill.exec.test.generated.PriorityQueueCopierGen328.doCopy():22
org.apache.drill.exec.test.generated.PriorityQueueCopierGen328.next():75
org.apache.drill.exec.physical.impl.xsort.ExternalSortBatch.mergeAndSpill():602
org.apache.drill.exec.physical.impl.xsort.ExternalSortBatch.innerNext():428
org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.record.AbstractRecordBatch.next():119
org.apache.drill.exec.record.AbstractRecordBatch.next():109
org.apache.drill.exec.physical.impl.aggregate.StreamingAggBatch.innerNext():137
org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.physical.impl.BaseRootExec.next():104
org.apache.drill.exec.physical.impl.partitionsender.PartitionSenderRootExec.innerNext():144
org.apache.drill.exec.physical.impl.BaseRootExec.next():94
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():232
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():226
java.security.AccessController.doPrivileged():-2
javax.security.auth.Subject.doAs():422
org.apache.hadoop.security.UserGroupInformation.doAs():1657
org.apache.drill.exec.work.fragment.FragmentExecutor.run():226
org.apache.drill.common.SelfCleaningRunnable.run():38
java.util.concurrent.ThreadPoolExecutor.runWorker():1142
java.util.concurrent.ThreadPoolExecutor$Worker.run():617
java.lang.Thread.run():745 (state=,code=0)
Is there something I'm missing here? Any help/direction would be
appreciated.
Thanks,
Nate