Ludwig does not use tensorflow-gpu?
I have tensorflow-gpu installed and Keras can use the GPU effectively. I only have one GPU.
With Ludwig, I tried a regression problem and found that training is very slow. This is the call I used:
train_stats = ludwig_model.train(data_df=df, logging_level=logging.ERROR, gpus=[0])
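For context, here is the full setup around that call, reduced to a minimal sketch; the DataFrame contents and the model definition below are illustrative stand-ins, not my actual data:

import logging

import numpy as np
import pandas as pd
from ludwig.api import LudwigModel

# Illustrative stand-in data; my real DataFrame has different columns.
df = pd.DataFrame({
    'x1': np.random.rand(10000),
    'x2': np.random.rand(10000),
    'y': np.random.rand(10000),
})

# Illustrative model definition for a numerical regression target.
model_definition = {
    'input_features': [
        {'name': 'x1', 'type': 'numerical'},
        {'name': 'x2', 'type': 'numerical'},
    ],
    'output_features': [{'name': 'y', 'type': 'numerical'}],
}

ludwig_model = LudwigModel(model_definition)
train_stats = ludwig_model.train(data_df=df, logging_level=logging.ERROR, gpus=[0])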
Watching watch -n 1 nvidia-smi, I found that training did not actually utilize the GPU, although it allocated GPU memory anyway:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.93       Driver Version: 410.93       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------|
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro P6000        Off  | 00000000:03:00.0  On |                  Off |
| 26%   44C    P8    19W / 250W |  24289MiB / 24449MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0       822      G   /usr/bin/gnome-shell                        186MiB  |
|    0      4687      C   /home/yshi1/anaconda3/bin/python          22929MiB  |
|    0     10166      C   /home/yshi1/anaconda3/bin/python            977MiB  |
|    0     23459      G   /usr/bin/X                                  191MiB  |
+-----------------------------------------------------------------------------+
Top GitHub Comments
The fact that the GPU memory is fully utilized by TensorFlow means that the model is running on the GPU. Try running the same model from the command line instead of through the API and you should see the TensorFlow messages printed on stderr. The low GPU utilization may have to do with a couple of things: your model is really small, so there's not much computation to be done per batch, or your batch is really small, so again there's not much computation per batch. To test for this, try increasing the batch size considerably. Finally, the process that reads data and feeds it to TensorFlow is not highly optimized at the moment; we are working on improving it, but you may be hitting an I/O bottleneck if your computation per batch is too small.
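To make that suggestion concrete, here is a minimal sketch of raising the batch size through the programmatic API; the feature names, dataset sizes, and the 1024 value are arbitrary illustrations, not recommendations:

import logging

import numpy as np
import pandas as pd
from ludwig.api import LudwigModel

# Illustrative data; substitute your own DataFrame.
df = pd.DataFrame({
    'x1': np.random.rand(10000),
    'y': np.random.rand(10000),
})

model_definition = {
    'input_features': [{'name': 'x1', 'type': 'numerical'}],
    'output_features': [{'name': 'y', 'type': 'numerical'}],
    # More samples per step means more GPU work per step and
    # fewer round trips through the input pipeline.
    'training': {'batch_size': 1024},
}

model = LudwigModel(model_definition)
train_stats = model.train(data_df=df, logging_level=logging.ERROR, gpus=[0])

If utilization climbs as batch_size grows, the bottleneck was the per-batch overhead rather than the GPU itself.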
As for the YAML examples, you can find a bunch here. Be mindful of the - and the indentation.

Glad you were able to make it work decently fast with a bigger batch size. Regarding the initialization, you can specify which initializer to use, so playing around with that may give you better results. Regarding the reproducible example, you can use the data_synthesizer script in ludwig/data to create a dataset that looks like yours pretty easily; we use it for integration tests. That should resolve the data issue. I'm closing the issue, but feel free to either open another one or reach out in private if you can provide me with the comparison script. You're welcome.
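If the synthesizer script is inconvenient, a plain numpy/pandas sketch can also fabricate a shareable lookalike dataset; every column name and the generating function here are invented for illustration:

import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
n = 10000

# Fabricated inputs that mimic the shape of a private regression dataset.
synthetic = pd.DataFrame({
    'x1': rng.normal(0.0, 1.0, n),
    'x2': rng.uniform(-1.0, 1.0, n),
})
# An arbitrary target function plus noise, standing in for the real signal.
synthetic['y'] = 3.0 * synthetic['x1'] - 2.0 * synthetic['x2'] + rng.normal(0.0, 0.1, n)

synthetic.to_csv('synthetic_regression.csv', index=False)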