
How to handle a large context with Machine Comprehension (bi-att-flow) & ResourceExhaustedError

See original GitHub issue

I have a test file named mytest1.json. The context is a large text with many words in it.
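
For reference, a quick way to sanity-check the file and count the context length, assuming it follows the standard SQuAD v1.1 layout (`data` → `paragraphs` → `context` / `qas`):

```python
import json

# mytest1.json is assumed to follow the standard SQuAD v1.1 layout.
with open("mytest1.json") as f:
    squad = json.load(f)

paragraph = squad["data"][0]["paragraphs"][0]
print("context length (words):", len(paragraph["context"].split()))
print("questions:", [qa["question"] for qa in paragraph["qas"]])
```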

When I run the following: `basic/run_single.sh $HOME/data/squad/mytest1.json single.json`

Some errors happen. Do you have any idea how to solve this? Thanks so much.

```
File "/home/weijiang/bi-att-flow/inference/main.py", line 29, in main
    eval_data = _forward(config, data, shared)
File "/home/weijiang/bi-att-flow/inference/main.py", line 88, in _forward
    models = get_multi_gpu_models(config)
File "/home/weijiang/bi-att-flow/inference/model.py", line 19, in get_multi_gpu_models
    model = Model(config, scope, rep=gpu_idx == 0)
File "/home/weijiang/bi-att-flow/inference/model.py", line 58, in __init__
    self._build_forward()
File "/home/weijiang/bi-att-flow/inference/model.py", line 164, in _build_forward
    p0 = attention_layer(config, self.is_train, h, u, h_mask=self.x_mask, u_mask=self.q_mask, scope="p0", tensor_dict=self.tensor_dict)
File "/home/weijiang/bi-att-flow/inference/model.py", line 421, in attention_layer
    u_a, h_a = bi_attention(config, is_train, h, u, h_mask=h_mask, u_mask=u_mask, tensor_dict=tensor_dict)
File "/home/weijiang/bi-att-flow/inference/model.py", line 398, in bi_attention
    is_train=is_train, func=config.logit_func, scope='u_logits')  # [N, M, JX, JQ]
File "/home/weijiang/bi-att-flow/my/tensorflow/nn.py", line 127, in get_logits
    new_arg = args[0] * args[1]
File "/home/weijiang/anaconda2/envs/tensorflow-0.11-py3.5/lib/python3.5/site-packages/tensorflow/python/ops/math_ops.py", line 751, in binary_op_wrapper
    return func(x, y, name=name)
File "/home/weijiang/anaconda2/envs/tensorflow-0.11-py3.5/lib/python3.5/site-packages/tensorflow/python/ops/math_ops.py", line 910, in _mul_dispatch
    return gen_math_ops.mul(x, y, name=name)
File "/home/weijiang/anaconda2/envs/tensorflow-0.11-py3.5/lib/python3.5/site-packages/tensorflow/python/ops/gen_math_ops.py", line 1519, in mul
    result = _op_def_lib.apply_op("Mul", x=x, y=y, name=name)
File "/home/weijiang/anaconda2/envs/tensorflow-0.11-py3.5/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 749, in apply_op
    op_def=op_def)
File "/home/weijiang/anaconda2/envs/tensorflow-0.11-py3.5/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2380, in create_op
    original_op=self._default_original_op, op_def=op_def)
File "/home/weijiang/anaconda2/envs/tensorflow-0.11-py3.5/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1298, in __init__
    self._traceback = _extract_stack()

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape [1,1,34680,7,200]
    [[Node: model_0/main/p0/bi_attention/mul = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](model_0/main/p0/bi_attention/Tile, model_0/main/p0/bi_attention/Tile_1)]]
    [[Node: model_0/main/g2/BW/BW/Assert/AssertGuard/Assert/Switch/_333 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_4307_model_0/main/g2/BW/BW/Assert/AssertGuard/Assert/Switch", tensor_type=DT_BOOL, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
```

Issue Analytics

  • State: open
  • Created: 6 years ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

1 reaction
seominjoon commented, Apr 7, 2017

@webeng You are right that you can control those params to avoid OOM during training, but I think he is using a pre-trained model and only testing.

Looking at the error log, I am assuming that you have a single example and your context is 34,680 words. Unfortunately, you can only fit around 60 * 500 words, so it might be a little over the limit.
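
For a rough sense of scale, the shape in the error message pins down how big just one of the offending tensors is (this is only the single `Mul` input named in the log; the attention layer materializes several tensors of this size, so the total is considerably larger):

```python
# One float32 tensor of shape [1, 1, 34680, 7, 200] from the error message.
elements = 1 * 1 * 34680 * 7 * 200      # ~48.6 million elements
print(elements * 4 / 2**20)             # ~185 MiB for a single tensor

# Rough limit quoted above: ~60 * 500 = 30,000 words, versus a 34,680-word context.
print(34680 / (60 * 500))               # ~1.16, i.e. about 16% over the limit
```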

The easiest thing you can do is run this on the CPU. Then you won't have the memory error (though it will take longer…).
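
One generic way to do that (not a flag of this repo, just standard CUDA/TensorFlow behavior, and it assumes the session allows soft device placement so GPU-pinned ops can fall back to the CPU) is to hide the GPUs when launching, e.g. `CUDA_VISIBLE_DEVICES="" basic/run_single.sh $HOME/data/squad/mytest1.json single.json`.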

Another way is to split the context into a few chunks (for this example, splitting it in two seems to be enough) and copy the question for each chunk, so that every chunk gets the same question. Then you can treat each pair as an independent question and run the inference (batch_size 1).
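
A sketch of that chunking step, assuming the input follows the standard SQuAD v1.1 JSON layout (the function and file names here are illustrative, not part of the repo):

```python
import copy
import json

def split_squad_example(in_path, out_path, num_chunks=2):
    """Split each paragraph's context into word-level chunks and duplicate its
    questions for every chunk, so each (chunk, question) pair runs independently."""
    with open(in_path) as f:
        squad = json.load(f)

    new_articles = []
    for article in squad["data"]:
        new_paragraphs = []
        for para in article["paragraphs"]:
            words = para["context"].split()
            chunk_size = max(1, (len(words) + num_chunks - 1) // num_chunks)
            for i in range(0, len(words), chunk_size):
                new_paragraphs.append({
                    "context": " ".join(words[i:i + chunk_size]),
                    # Same questions for every chunk.
                    # NOTE: if downstream evaluation keys results by qa["id"],
                    # you may need to make the ids unique per chunk.
                    "qas": copy.deepcopy(para["qas"]),
                })
        new_articles.append({"title": article.get("title", ""),
                             "paragraphs": new_paragraphs})

    with open(out_path, "w") as f:
        json.dump({"data": new_articles,
                   "version": squad.get("version", "1.1")}, f)

split_squad_example("mytest1.json", "mytest1_chunked.json", num_chunks=2)
```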

Then, in your output, you will have two answers with confidence levels. You can compare the answers and take the more confident one. The only caveat is that, for the confidence score, you shouldn't use the probability output in the answer folder, because it is locally normalized via softmax. Instead, you will need to use the logits, which are unnormalized; these are written to the eval folder.
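
A sketch of that comparison step; how the best span and its logit are pulled out of the eval-folder output is an assumption here (the exact field names depend on the repo's eval format), so only the selection logic is shown:

```python
# Hypothetical: suppose you have already extracted, for each chunk, the best
# answer span and its *unnormalized* logit score from the eval-folder output.
candidates = [
    {"chunk": 0, "answer": "…", "logit": 12.3},
    {"chunk": 1, "answer": "…", "logit": 9.8},
]

# Compare chunks by raw logit, not by softmax probability: the softmax is
# normalized within each chunk, so probabilities are not comparable across chunks.
best = max(candidates, key=lambda c: c["logit"])
print(best["answer"])
```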

Of course, these are the easiest approaches that don't require modifying the code. You could modify the code to use multiple GPUs, etc., but I think that would be more difficult.

I will leave this issue open as a possible feature in the future.

0 reactions
seominjoon commented, Apr 10, 2017

@rubby33 It is a trial-and-error thing; I usually figure it out by looking at the memory usage and whether it gives an OOM. 30k is a rough estimate.

Read more comments on GitHub >

Top Results From Across the Web

  • Resource exhausted error when running tf.keras.Model. ...
    I'm trying to run tf.keras.Model.predict on a complex tensor with shape: (1532, 128, 2049, 2). With a batch size of 4, the model...
  • Resource Exhausted Error on allocating tensor of sizes that ...
    I am running a keras/tensorflow model with a custom loss function that creates some very large tensors. But when I run the code ...
  • How to solve Error of ResourceExhaustedError in Tensorflow
    Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. So...
