How to handle a large context with Machine Comprehension (bi-att-flow) & ResourceExhausted error
I have a test file named mytest1.json. Its context is a very long text containing many words.
When I run the following: `basic/run_single.sh $HOME/data/squad/mytest1.json single.json`
some errors happen. Do you guys have an idea about how to solve this? Thanks so much.
```
File "/home/weijiang/bi-att-flow/inference/main.py", line 29, in main
  eval_data = _forward(config, data, shared)
File "/home/weijiang/bi-att-flow/inference/main.py", line 88, in _forward
  models = get_multi_gpu_models(config)
File "/home/weijiang/bi-att-flow/inference/model.py", line 19, in get_multi_gpu_models
  model = Model(config, scope, rep=gpu_idx == 0)
File "/home/weijiang/bi-att-flow/inference/model.py", line 58, in __init__
  self._build_forward()
File "/home/weijiang/bi-att-flow/inference/model.py", line 164, in _build_forward
  p0 = attention_layer(config, self.is_train, h, u, h_mask=self.x_mask, u_mask=self.q_mask, scope="p0", tensor_dict=self.tensor_dict)
File "/home/weijiang/bi-att-flow/inference/model.py", line 421, in attention_layer
  u_a, h_a = bi_attention(config, is_train, h, u, h_mask=h_mask, u_mask=u_mask, tensor_dict=tensor_dict)
File "/home/weijiang/bi-att-flow/inference/model.py", line 398, in bi_attention
  is_train=is_train, func=config.logit_func, scope='u_logits')  # [N, M, JX, JQ]
File "/home/weijiang/bi-att-flow/my/tensorflow/nn.py", line 127, in get_logits
  new_arg = args[0] * args[1]
File "/home/weijiang/anaconda2/envs/tensorflow-0.11-py3.5/lib/python3.5/site-packages/tensorflow/python/ops/math_ops.py", line 751, in binary_op_wrapper
  return func(x, y, name=name)
File "/home/weijiang/anaconda2/envs/tensorflow-0.11-py3.5/lib/python3.5/site-packages/tensorflow/python/ops/math_ops.py", line 910, in _mul_dispatch
  return gen_math_ops.mul(x, y, name=name)
File "/home/weijiang/anaconda2/envs/tensorflow-0.11-py3.5/lib/python3.5/site-packages/tensorflow/python/ops/gen_math_ops.py", line 1519, in mul
  result = _op_def_lib.apply_op("Mul", x=x, y=y, name=name)
File "/home/weijiang/anaconda2/envs/tensorflow-0.11-py3.5/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 749, in apply_op
  op_def=op_def)
File "/home/weijiang/anaconda2/envs/tensorflow-0.11-py3.5/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2380, in create_op
  original_op=self._default_original_op, op_def=op_def)
File "/home/weijiang/anaconda2/envs/tensorflow-0.11-py3.5/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1298, in __init__
  self._traceback = _extract_stack()

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[1,1,34680,7,200]
  [[Node: model_0/main/p0/bi_attention/mul = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](model_0/main/p0/bi_attention/Tile, model_0/main/p0/bi_attention/Tile_1)]]
  [[Node: model_0/main/g2/BW/BW/Assert/AssertGuard/Assert/Switch/_333 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_4307_model_0/main/g2/BW/BW/Assert/AssertGuard/Assert/Switch", tensor_type=DT_BOOL, _device="/job:localhost/replica:0/task:0/cpu:0"]]
```
Top GitHub Comments
@webeng You are right that you can control those params to avoid OOM during training, but I think he is using a pre-trained model and is only running inference.
Looking at the error log, I am assuming that you have a single example and your context is 34680 words. Unfortunately, you can only fit around 60*500 words, so it might be a little over the limit.
The easiest thing you can do is run this on CPU. Then you won't hit the memory error (though it will take a longer time…).
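For reference, here is a minimal sketch of one generic way to keep TensorFlow off the GPU entirely: hide the GPUs from CUDA before TensorFlow is imported. This is not a repo-specific flag, and the suggested placement in inference/main.py is only an assumption.

```python
# Hide all GPUs from CUDA before the first tensorflow import, so every op
# is placed on CPU. This would need to run at the very top of the entry
# script (e.g. inference/main.py) -- the placement is an assumption.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""
```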
Another way is to arbitrarily split the context into a few chunks (for this example, splitting into 2 seems to be enough), and copy the question for each chunk so that you have the same question for all chunks. Then you can treat each (chunk, question) pair as an independent example and run the inference (batch_size 1).
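Here is a rough sketch of that splitting step, assuming mytest1.json follows the SQuAD v1.1 layout (data -> paragraphs -> context/qas); the chunk size and output filename are arbitrary choices, not part of the repo.

```python
# Sketch only -- not part of the bi-att-flow repo.
import json

CHUNK_WORDS = 15000  # rough per-chunk budget; tune by trial and error


def split_article(article, chunk_words=CHUNK_WORDS):
    """Split every long context into word chunks and repeat its questions."""
    new_paragraphs = []
    for para in article["paragraphs"]:
        words = para["context"].split()
        chunks = [" ".join(words[i:i + chunk_words])
                  for i in range(0, len(words), chunk_words)]
        for k, chunk in enumerate(chunks):
            new_paragraphs.append({
                "context": chunk,
                # Same questions for every chunk, with unique ids.
                # answer_start offsets are NOT adjusted -- acceptable for
                # inference-only runs where the answers are ignored.
                "qas": [{"question": qa["question"],
                         "id": "{}_chunk{}".format(qa["id"], k),
                         "answers": qa.get("answers", [])}
                        for qa in para["qas"]],
            })
    return {"title": article.get("title", ""), "paragraphs": new_paragraphs}


with open("mytest1.json") as f:
    squad = json.load(f)

squad["data"] = [split_article(a) for a in squad["data"]]

with open("mytest1_chunked.json", "w") as f:
    json.dump(squad, f)
```

Note that splitting on raw word counts can cut the answer span across a chunk boundary; splitting on sentence or paragraph boundaries is safer if you can afford it.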
Then, for your output, you will have two answers with confidence levels. You can compare the answers and take the more confident one. The only caveat here is that, for the confidence score, you shouldn't use the probability output in the answer folder because it is locally normalized via softmax. Instead, you will need to use the logits, which are unnormalized. These are output in the eval folder.
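The comparison itself is just a max over unnormalized scores; a tiny sketch is below. The record layout is hypothetical, so adapt the field names to whatever the files in the eval folder actually contain.

```python
# Sketch only: pick the more confident answer across chunks by comparing
# unnormalized logits. The records below are placeholders, not real output.
chunk_results = [
    {"answer": "span predicted from chunk 0", "start_logit": 4.1, "end_logit": 3.7},
    {"answer": "span predicted from chunk 1", "start_logit": 6.2, "end_logit": 5.9},
]


def confidence(rec):
    # Sum of the span's start and end logits. Unlike softmax probabilities,
    # logits are not normalized per chunk, so they are comparable across chunks.
    return rec["start_logit"] + rec["end_logit"]


best = max(chunk_results, key=confidence)
print(best["answer"])
```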
Of course, these are the easiest ways that avoid modifying the code. You could modify the code to use multiple GPUs, etc., but I think that will be more difficult.
I will leave this issue open as a possible feature for the future.
@rubby33 It is a trial-and-error thing, and I usually figure it out by looking at the memory usage and whether it gives OOM. 30k is a rough estimate.