
Metrics Implementation Question

See original GitHub issue

Thanks for the great library, especially the metrics. I have a few questions to better understand the implementation:

During the update stage, why are the values converted to Python floats instead of keeping them as torch values (e.g. here)? This operation incurs a device->host transfer, so the operation is blocking, right? Wouldn’t it be better to keep the metric values as torch values on the GPU so the update is async? Then, they can be converted to python floats in the compute method.

In the distributed case, the values are put back in a tensor before the all-reduce, so why not keep them as tensors to begin with?
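The difference the question describes can be sketched as below. This is a minimal illustration under stated assumptions, not Ignite's actual implementation: `FloatSumMetric` and `TensorSumMetric` are hypothetical names, and a simple running mean stands in for a real metric.

```python
import torch


class FloatSumMetric:
    """Hypothetical metric that converts to a Python float on every update."""

    def __init__(self):
        self._sum = 0.0
        self._count = 0

    def update(self, batch_values: torch.Tensor) -> None:
        # .item() copies the value to the host. For a CUDA tensor this waits
        # until the kernels producing the value have finished.
        self._sum += batch_values.sum().item()
        self._count += batch_values.numel()

    def compute(self) -> float:
        return self._sum / self._count


class TensorSumMetric:
    """Hypothetical metric that keeps its accumulator as a tensor."""

    def __init__(self, device="cpu"):
        self._sum = torch.zeros((), device=device)
        self._count = 0

    def update(self, batch_values: torch.Tensor) -> None:
        # The partial sum stays on the accumulator's device, so no
        # device-to-host transfer happens during update.
        self._sum += batch_values.detach().sum().to(self._sum.device)
        self._count += batch_values.numel()

    def compute(self) -> float:
        # The single host transfer is deferred to compute().
        return self._sum.item() / self._count
```

Both classes return the same value; they differ only in when the host transfer happens.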

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 16 (9 by maintainers)

Top GitHub Comments

2 reactions
vfdev-5 commented, Jun 7, 2020

@n2cholas in some sense, it could make sense to update the metrics code so that internal accumulators become tensors where that is not already the case, and the user can specify where they are stored via a device argument. We already have this argument, but it is unused in most cases… We should also be careful about specific implementations where double precision is required…
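One way to picture this suggestion is an accumulator that takes its storage device and precision as arguments. The sketch below is an assumption-laden illustration, not Ignite's API: `RunningMean` and its `device` and `dtype` parameters are hypothetical names.

```python
import torch


class RunningMean:
    """Hypothetical tensor accumulator with user-selected device and dtype."""

    def __init__(self, device="cpu", dtype=torch.float64):
        self._device = torch.device(device)
        # A 0-dim tensor on the chosen device holds the running sum.
        self._sum = torch.zeros((), dtype=dtype, device=self._device)
        self._count = 0

    def update(self, values: torch.Tensor) -> None:
        # Move and cast before summing so the accumulator's dtype, not the
        # input's dtype, governs the precision of the running sum.
        batch = values.detach().to(device=self._device, dtype=self._sum.dtype)
        self._sum += batch.sum()
        self._count += values.numel()

    def compute(self) -> float:
        return (self._sum / self._count).item()
```

The double-precision concern shows up quickly: a float32 accumulator that already holds 16777216.0 (2**24) cannot represent the result of adding 1.0, so the increment is lost to rounding, while a float64 accumulator keeps it.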

2 reactions
n2cholas commented, Jun 6, 2020

@vfdev-5 thanks for the rerun and choosing a more realistic batch size.

@sdesrozis Here is a similar script for evaluating a validation loop instead of training. I used a batch size of 512 (like @vfdev-5) and 50 runs to get a tighter standard deviation. This and my previous runs were both on a GTX 1080. The results are pretty similar, but the traces are much cleaner. The completely async nature of the custom CUDA implementation is much more apparent.

Ignite:

Mean Time: 1.059605598449707s
Std Time: 0.014668694697320461s
All Times: [1.111772705078125, 1.0760010986328126, 1.0779454345703126, 1.0475296630859374, 1.063631591796875, 1.0401522216796875, 1.08554150390625, 1.0513619384765625, 1.0520213623046875, 1.0756719970703126, 1.05441259765625, 1.06146484375, 1.0483077392578126, 1.0805594482421874, 1.047625732421875, 1.050948486328125, 1.0686212158203126, 1.040948974609375, 1.068043701171875, 1.0732552490234375, 1.0576787109375, 1.038017578125, 1.0690546875, 1.0786173095703124, 1.0792237548828125, 1.067673583984375, 1.0534521484375001, 1.048385498046875, 1.0591666259765624, 1.0659741210937501, 1.0362990722656251, 1.0420843505859376, 1.053845947265625, 1.0706207275390625, 1.0574412841796874, 1.05014111328125, 1.0607860107421876, 1.0439991455078126, 1.060889404296875, 1.051964599609375, 1.0492752685546876, 1.070322021484375, 1.0469660644531251, 1.0494393310546875, 1.0538858642578126, 1.0563857421875, 1.0503135986328125, 1.0760842285156251, 1.043965087890625, 1.0625087890625]

Custom on CPU:

Mean Time: 1.0600998401641846s
Std Time: 0.016978954896330833s
All Times: [1.0984403076171876, 1.0494051513671876, 1.035371826171875, 1.0610870361328124, 1.045078125, 1.0598719482421874, 1.0756278076171875, 1.0297471923828125, 1.027935302734375, 1.065294189453125, 1.0613526611328126, 1.0637432861328124, 1.0517838134765625, 1.0540438232421876, 1.0337780761718751, 1.0409871826171875, 1.0708486328125, 1.0449569091796875, 1.0629097900390625, 1.0386883544921874, 1.0573258056640624, 1.0498575439453126, 1.0445811767578126, 1.0379779052734375, 1.0409044189453125, 1.05157470703125, 1.03104833984375, 1.080220703125, 1.0673245849609376, 1.0762749023437501, 1.062837646484375, 1.051602294921875, 1.0489697265625, 1.0705521240234375, 1.070385498046875, 1.057185546875, 1.0765167236328126, 1.0771417236328125, 1.058330322265625, 1.0861689453125, 1.083775146484375, 1.0895032958984374, 1.077263916015625, 1.0667071533203125, 1.0633187255859375, 1.0628101806640626, 1.0629976806640624, 1.0817080078125, 1.0624017333984375, 1.0867723388671875]

Custom on GPU:

Mean Time: 0.9974555969238281s
Std Time: 0.012949894182384014s
All Times: [1.026460693359375, 1.0256392822265625, 0.9810296630859375, 0.992499267578125, 0.9777049560546875, 0.9932820434570313, 0.9784326171875001, 0.9781923828125, 0.990439453125, 0.984932373046875, 1.008616455078125, 1.013127197265625, 0.992532470703125, 0.9864641723632813, 0.9835632934570313, 1.0048839721679688, 0.9925211791992188, 0.9954232177734376, 1.000791015625, 0.999962646484375, 0.9947289428710938, 0.98918603515625, 0.9813984985351563, 1.003957275390625, 0.9944736938476563, 0.9956331787109375, 0.9954109497070313, 1.0007716064453125, 0.99830859375, 1.0178343505859375, 1.0066236572265626, 0.9960570678710937, 0.9823938598632813, 1.0055481567382814, 0.997375, 1.0058368530273438, 0.9903257446289063, 0.9843046264648437, 0.9871902465820312, 1.0250125732421875, 1.0187028198242187, 0.996167236328125, 0.9972598266601562, 0.9860249633789062, 1.012158447265625, 0.98768896484375, 1.024879638671875, 1.0026895141601562, 0.9879009399414063, 1.000437744140625]
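To put the three mean times in proportion, the relative difference against the Ignite baseline can be computed directly. This is a small worked calculation using only the means reported above; `relative_speedup` is a helper name introduced here for illustration.

```python
# Mean times in seconds, copied from the results above.
mean_times = {
    "Ignite": 1.059605598449707,
    "Custom on CPU": 1.0600998401641846,
    "Custom on GPU": 0.9974555969238281,
}


def relative_speedup(baseline: float, candidate: float) -> float:
    """Fraction of the baseline time saved by the candidate (negative if slower)."""
    return (baseline - candidate) / baseline


baseline = mean_times["Ignite"]
for name, seconds in mean_times.items():
    print(f"{name}: {relative_speedup(baseline, seconds):+.2%} vs. Ignite")
```

By this measure the GPU-resident accumulator is a little under 6% faster than the Ignite baseline, while the CPU variant differs from it by well under one reported standard deviation.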

Traces (images in the original issue):

  • Ignite: ignite2
  • Custom CPU: custom-cpu2
  • Custom GPU: custom-cuda2
