Metrics Implementation Question
See original GitHub issue.

Thanks for the great library, especially the metrics. I have a few questions to better understand the implementation:
During the update stage, why are the values converted to Python floats instead of kept as torch tensors (e.g. here)? That conversion incurs a device->host transfer, so the operation is blocking, right? Wouldn't it be better to keep the metric values as torch tensors on the GPU so that the update is asynchronous, and convert them to Python floats only in the compute method?
In the distributed case, the values are put back into a tensor before the all-reduce anyway, so why not keep them as tensors to begin with?
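To make the trade-off concrete, here is a minimal sketch (not Ignite's actual code; the class names are illustrative) contrasting the two accumulation strategies. It uses CPU tensors so it runs anywhere; on a GPU, the `.item()` call inside `update` would force a blocking device->host copy on every batch, while the tensor accumulator defers that single copy to `compute`.

```python
import torch


class FloatAccumulator:
    """Converts each batch value to a Python float during update."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, batch_loss: torch.Tensor) -> None:
        # On a GPU tensor, .item() blocks until the value is copied to host.
        self.total += batch_loss.item()
        self.count += 1

    def compute(self) -> float:
        return self.total / self.count


class TensorAccumulator:
    """Keeps the running sum as a tensor; syncs only in compute()."""

    def __init__(self, device="cpu"):
        self.total = torch.zeros((), dtype=torch.float64, device=device)
        self.count = 0

    def update(self, batch_loss: torch.Tensor) -> None:
        # Stays on the device; no host transfer, so the update can be async.
        self.total += batch_loss.detach().to(self.total.dtype)
        self.count += 1

    def compute(self) -> float:
        # Single device->host transfer at the very end.
        return (self.total / self.count).item()


losses = [torch.tensor(v) for v in (0.5, 0.25, 0.75)]
a, b = FloatAccumulator(), TensorAccumulator()
for loss in losses:
    a.update(loss)
    b.update(loss)
```

Both accumulators produce the same mean; they differ only in when the device->host synchronization happens.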
Issue Analytics
- State:
- Created: 3 years ago
- Reactions: 1
- Comments: 16 (9 by maintainers)
Top GitHub Comments
@n2cholas In some sense, it could make sense to update the metrics code so that the internal accumulators that are not already tensors become tensors, and the user can specify the storage device. We already have this argument, but it is unused in most cases… We should also be careful about specific implementations where double precision is required…
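A rough sketch of what this suggestion could look like (the `TensorMetric` name and API are hypothetical, not Ignite's): the accumulator is a tensor with a user-selectable device and a float64 dtype where precision matters. Because the state is already a tensor, it can be passed straight to `torch.distributed.all_reduce` with no repacking.

```python
import torch
import torch.distributed as dist


class TensorMetric:
    """Hypothetical mean metric with tensor state on a chosen device."""

    def __init__(self, device="cpu", dtype=torch.float64):
        # float64 guards against precision loss over many small updates.
        self._sum = torch.zeros((), dtype=dtype, device=device)
        self._num = torch.zeros((), dtype=torch.long, device=device)

    def update(self, value: torch.Tensor) -> None:
        # .to(self._sum) matches dtype and device; no host transfer here.
        self._sum += value.detach().to(self._sum)
        self._num += 1

    def compute(self) -> float:
        if dist.is_available() and dist.is_initialized():
            # State is already a tensor, so it goes into the all-reduce as-is.
            dist.all_reduce(self._sum)
            dist.all_reduce(self._num)
        return (self._sum / self._num).item()
```

The distributed branch only runs when a process group has been initialized; in a single-process run, `compute` is just the local mean.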
@vfdev-5 thanks for the rerun and choosing a more realistic batch size.
@sdesrozis Here is a similar script for evaluating a validation loop instead of training. I used a batch size of 512 (like @vfdev-5 ) and 50 runs to get a tighter standard deviation. This and my previous runs were both on a GTX 1080. The results are pretty similar, but the traces are much cleaner. The completely async nature of the custom cuda implementation is much more apparent.
(Timing results for Ignite, the custom metric on CPU, and the custom metric on GPU, along with the corresponding profiler traces, were attached as images.)
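The benchmarking methodology above (many runs, mean ± standard deviation, with GPU synchronization so queued kernels don't skew the readings) can be sketched as follows. This is an illustrative harness, not the script referenced in the comment; the workload in the example is a stand-in.

```python
import statistics
import time

import torch


def time_runs(fn, n_runs=50, warmup=5):
    """Time fn over n_runs repetitions; returns (mean, stdev) in seconds."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(n_runs):
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # flush queued kernels before timing
        t0 = time.perf_counter()
        fn()
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # ensure the work finished before stopping
        samples.append(time.perf_counter() - t0)
    return statistics.mean(samples), statistics.stdev(samples)


# Stand-in workload; a real run would time the validation loop itself.
mean_s, std_s = time_runs(lambda: torch.randn(512, 10).sum(), n_runs=20)
print(f"{mean_s * 1e6:.1f} +/- {std_s * 1e6:.1f} us")
```

Without the synchronize calls, `time.perf_counter()` would measure only kernel launch time on a GPU, which is why a fully async implementation can look deceptively fast in naive timings.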