Ludwig does not use tensorflow-gpu?
I have tensorflow-gpu installed and Keras can use the GPU effectively. I only have one GPU.
With Ludwig, I tried a regression problem and found that training is very slow. This is the call I used:
train_stats = ludwig_model.train(data_df=df, logging_level=logging.ERROR, gpus=[0])
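For context, here is the full setup around that call, reduced to a minimal sketch; the DataFrame contents and the model definition below are illustrative stand-ins, not my actual data:

import logging

import numpy as np
import pandas as pd
from ludwig.api import LudwigModel

# Illustrative stand-in data; my real DataFrame has different columns.
df = pd.DataFrame({
    'x1': np.random.rand(10000),
    'x2': np.random.rand(10000),
    'y': np.random.rand(10000),
})

# Illustrative model definition for a numerical regression target.
model_definition = {
    'input_features': [
        {'name': 'x1', 'type': 'numerical'},
        {'name': 'x2', 'type': 'numerical'},
    ],
    'output_features': [{'name': 'y', 'type': 'numerical'}],
}

ludwig_model = LudwigModel(model_definition)
train_stats = ludwig_model.train(data_df=df, logging_level=logging.ERROR, gpus=[0])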
Watching watch -n 1 nvidia-smi, I found that training did not actually utilize the GPU, although it allocated GPU memory anyway:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.93       Driver Version: 410.93       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------|
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro P6000        Off  | 00000000:03:00.0  On |                  Off |
| 26%   44C    P8    19W / 250W |  24289MiB / 24449MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0       822      G   /usr/bin/gnome-shell                        186MiB  |
|    0      4687      C   /home/yshi1/anaconda3/bin/python          22929MiB  |
|    0     10166      C   /home/yshi1/anaconda3/bin/python            977MiB  |
|    0     23459      G   /usr/bin/X                                  191MiB  |
+-----------------------------------------------------------------------------+
Top GitHub Comments
The fact that the GPU memory is fully utilized by TensorFlow means that the model is running on the GPU. Try running the same model from the command line instead of through the API and you should see the TensorFlow messages printed on stderr. The low GPU utilization may have to do with a couple of things: your model is really small, so there's not much computation to be done per batch, or your batch is really small, so again there's not much computation per batch. To test for this, try increasing the batch size considerably. Finally, the process that reads data and feeds it to TensorFlow is not highly optimized at the moment; we are working on improving it, but you may be hitting an I/O bottleneck if your computation per batch is too small.
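To make that suggestion concrete, here is a minimal sketch of raising the batch size through the programmatic API; the feature names, dataset sizes, and the 1024 value are arbitrary illustrations, not recommendations:

import logging

import numpy as np
import pandas as pd
from ludwig.api import LudwigModel

# Illustrative data; substitute your own DataFrame.
df = pd.DataFrame({
    'x1': np.random.rand(10000),
    'y': np.random.rand(10000),
})

model_definition = {
    'input_features': [{'name': 'x1', 'type': 'numerical'}],
    'output_features': [{'name': 'y', 'type': 'numerical'}],
    # More samples per step means more GPU work per step and
    # fewer round trips through the input pipeline.
    'training': {'batch_size': 1024},
}

model = LudwigModel(model_definition)
train_stats = model.train(data_df=df, logging_level=logging.ERROR, gpus=[0])

If utilization climbs as batch_size grows, the bottleneck was the per-batch overhead rather than the GPU itself.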
As for the YAML examples, you can find a bunch here. Be mindful of the - and the indentation.

Glad you were able to make it work decently fast with a bigger batch size. Regarding the initialization, you can specify which initializer to use, so playing around with that may give you better results. Regarding the reproducible example, you can use the data_synthesizer script in ludwig/data to create a dataset that looks like yours pretty easily; we use it for integration tests. That should resolve the data issue. I'm closing the issue, but feel free to either open another one or reach out in private if you can provide me with the comparison script. You're welcome.
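If the synthesizer script is inconvenient, a plain numpy/pandas sketch can also fabricate a shareable lookalike dataset; every column name and the generating function here are invented for illustration:

import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
n = 10000

# Fabricated inputs that mimic the shape of a private regression dataset.
synthetic = pd.DataFrame({
    'x1': rng.normal(0.0, 1.0, n),
    'x2': rng.uniform(-1.0, 1.0, n),
})
# An arbitrary target function plus noise, standing in for the real signal.
synthetic['y'] = 3.0 * synthetic['x1'] - 2.0 * synthetic['x2'] + rng.normal(0.0, 0.1, n)

synthetic.to_csv('synthetic_regression.csv', index=False)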