Computation of the accuracy scores when there are compilation and runtime errors
Hi, thank you for this great dataset! I have some questions about how you compute the accuracy scores in https://github.com/hendrycks/apps/blob/c55cce35806c14423b41decf7241615261cf9de0/eval/test_one_solution.py#L22-L42. I was curious why you use `-2` and `-1` for compilation and runtime errors and include them in the average computation of the accuracy, which can lead to a negative score. It seems more natural to give a `False` label to code with a syntax/runtime error, similarly to code that just doesn't pass the unit tests.
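For illustration, here is a minimal sketch (not the repository's code; the result layout and function name are assumptions) of how per-problem outcomes could be mapped to booleans before averaging, so that `-2`/`-1` count as failures instead of dragging the average below zero:

```python
import numpy as np

# Hypothetical per-problem results (simplified layout): 1 = test passed,
# 0 = failed, -1 = runtime error, -2 = compilation error.
results = {0: [1, 1, -1], 1: [-2], 2: [1, 1, 1]}

def test_case_average(results):
    # Treat -1/-2 like any other failure by mapping every non-1 outcome
    # to False before averaging, so errors cannot make the score negative.
    per_problem = [np.mean([r == 1 for r in res]) for res in results.values()]
    return float(np.mean(per_problem))

print(test_case_average(results))  # ~0.56 instead of a negative average
```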
Also, the expression `all_correct.append(np.all(results[index]))` will consider `-2` and `-1` as `True`, since `np.all` evaluates non-zero numbers to `True`, which can give a false accuracy.
Below is an example:
print_results({0: [[-2]], 1: [[-2]], 2: [[-2]], 3: [[-2]]}, args)
number of compile errors = 1 avg = 0.25
number of runtime errors = 1 avg = 0.25
number of test cases run = 4
Test Case Average (average accuracy over problems) = -2.0
Strict Accuracy (all test cases passed / total problems) = 1.0
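The Strict Accuracy of 1.0 above follows from `np.all` treating any non-zero value, including `-2` and `-1`, as truthy. A small sketch illustrates the behaviour and one possible guard (an assumption for illustration, not the repository's fix):

```python
import numpy as np

print(np.all([-2]))        # True: -2 is non-zero, so it still counts as "all passed"
print(np.all([1, -1, 1]))  # also True, despite the runtime error

# One possible guard: only count outcomes that are exactly 1/True.
def strictly_correct(problem_results):
    return all(r == 1 for r in problem_results)

print(strictly_correct([-2]))       # False
print(strictly_correct([1, 1, 1]))  # True
```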
Another thing, regarding the expressions:

`compile_errors = len(tmp_results[tmp_results==-2])`
`runtiome_errors = len(tmp_results[tmp_results==-1])`

If I'm not mistaken, this doesn't work (at least on Python 3.9); another implementation could be:

`compile_errors = len([e for e in tmp_results if -2 in e])`
`runtiome_errors = len([e for e in tmp_results if -1 in e])`
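For context, boolean-mask indexing like `tmp_results[tmp_results==-2]` only behaves as intended on NumPy arrays; on a plain Python list or dict it either raises an error or silently indexes element 0. Below is a self-contained sketch of the list-comprehension alternative, with a simplified result layout assumed for illustration:

```python
# Assumed structure (simplified): one list of outcomes per problem; the
# script's actual nesting may differ.
tmp_results = [[-2], [-1], [1, 1], [0, 1]]

# A problem counts as a compile/runtime error if the sentinel value
# appears anywhere in its result list.
compile_errors = len([res for res in tmp_results if -2 in res])
runtime_errors = len([res for res in tmp_results if -1 in res])

print(compile_errors)  # 1
print(runtime_errors)  # 1
```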
Okay, I think that with the examples and documentation it is now working correctly and as intended, so this issue is good to close. Feel free to reopen if there's something that was missed.
Great
I'll open a PR!

I saw that you already changed it, thanks!