
Baseline and modified version show different benchmark results even though the codebase is the same

See original GitHub issue

Hi @mikemccand,

I have cloned the Lucene 9 code into both the baseline and candidate folders (so the codebase is 100% identical), yet I saw a performance difference after running this command:

python3 src/python/localrun.py -source wikimedium10k

The first output table shows:

                        Task    QPS baseline    StdDev    QPS my_modified_version    StdDev      Pct diff    p-value
                     Prefix3      407.18      (0.0%)      314.24      (0.0%)  -22.8% ( -22% -  -22%) 1.000
                 LowSpanNear     1073.76      (0.0%)      852.06      (0.0%)  -20.6% ( -20% -  -20%) 1.000
             MedSloppyPhrase     1140.22      (0.0%)      927.42      (0.0%)  -18.7% ( -18% -  -18%) 1.000
                   MedPhrase      964.51      (0.0%)      848.50      (0.0%)  -12.0% ( -12% -  -12%) 1.000
        HighIntervalsOrdered     1002.98      (0.0%)      884.65      (0.0%)  -11.8% ( -11% -  -11%) 1.000
           HighTermMonthSort     4017.92      (0.0%)     3660.73      (0.0%)   -8.9% (  -8% -   -8%) 1.000
                     Respell      512.33      (0.0%)      467.72      (0.0%)   -8.7% (  -8% -   -8%) 1.000
                HighSpanNear      893.76      (0.0%)      821.69      (0.0%)   -8.1% (  -8% -   -8%) 1.000
                      IntNRQ     1828.06      (0.0%)     1682.03      (0.0%)   -8.0% (  -7% -   -7%) 1.000
                    HighTerm     5614.10      (0.0%)     5200.05      (0.0%)   -7.4% (  -7% -   -7%) 1.000
       BrowseMonthTaxoFacets     4142.06      (0.0%)     3870.82      (0.0%)   -6.5% (  -6% -   -6%) 1.000
       HighTermDayOfYearSort     3782.61      (0.0%)     3538.93      (0.0%)   -6.4% (  -6% -   -6%) 1.000
   BrowseDayOfYearSSDVFacets     2665.19      (0.0%)     2514.64      (0.0%)   -5.6% (  -5% -   -5%) 1.000
                     LowTerm     6806.33      (0.0%)     6460.07      (0.0%)   -5.1% (  -5% -   -5%) 1.000
            HighSloppyPhrase      886.16      (0.0%)      845.10      (0.0%)   -4.6% (  -4% -   -4%) 1.000
                   OrHighMed      898.26      (0.0%)      858.97      (0.0%)   -4.4% (  -4% -   -4%) 1.000
                   LowPhrase      988.79      (0.0%)      947.64      (0.0%)   -4.2% (  -4% -   -4%) 1.000
                   OrHighLow     1171.10      (0.0%)     1124.50      (0.0%)   -4.0% (  -3% -   -3%) 1.000
        BrowseDateTaxoFacets     3796.98      (0.0%)     3648.76      (0.0%)   -3.9% (  -3% -   -3%) 1.000
                    PKLookup      326.99      (0.0%)      315.53      (0.0%)   -3.5% (  -3% -   -3%) 1.000
       BrowseMonthSSDVFacets     3212.18      (0.0%)     3110.22      (0.0%)   -3.2% (  -3% -   -3%) 1.000
                  AndHighLow     2763.74      (0.0%)     2691.30      (0.0%)   -2.6% (  -2% -   -2%) 1.000
                 MedSpanNear      634.86      (0.0%)      624.48      (0.0%)   -1.6% (  -1% -   -1%) 1.000
                    Wildcard      581.94      (0.0%)      572.55      (0.0%)   -1.6% (  -1% -   -1%) 1.000
                  HighPhrase      729.77      (0.0%)      720.61      (0.0%)   -1.3% (  -1% -   -1%) 1.000
   BrowseDayOfYearTaxoFacets     3111.47      (0.0%)     3073.01      (0.0%)   -1.2% (  -1% -   -1%) 1.000
                  OrHighHigh      430.85      (0.0%)      426.77      (0.0%)   -0.9% (   0% -    0%) 1.000
                 AndHighHigh     1029.49      (0.0%)     1028.71      (0.0%)   -0.1% (   0% -    0%) 1.000
             LowSloppyPhrase     1351.24      (0.0%)     1365.14      (0.0%)    1.0% (   1% -    1%) 1.000
                      Fuzzy2       70.31      (0.0%)       71.83      (0.0%)    2.2% (   2% -    2%) 1.000
                      Fuzzy1      324.58      (0.0%)      338.44      (0.0%)    4.3% (   4% -    4%) 1.000
         LowIntervalsOrdered     1721.13      (0.0%)     1807.65      (0.0%)    5.0% (   5% -    5%) 1.000
                     MedTerm     5749.70      (0.0%)     6042.57      (0.0%)    5.1% (   5% -    5%) 1.000
         MedIntervalsOrdered     1291.17      (0.0%)     1382.36      (0.0%)    7.1% (   7% -    7%) 1.000
                  AndHighMed     1322.11      (0.0%)     1575.31      (0.0%)   19.2% (  19% -   19%) 1.000

My expectation was that the same code would perform the same way in both runs, but you can see deviations. Can you please explain this? Is this the right way to run the benchmark?

Thanks!

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 5 (1 by maintainers)

Top GitHub Comments

1 reaction
msokolov commented, Oct 12, 2021

There is no statistically significant difference here. The final column, the p-value, tells you the probability that the difference you are observing is due to random chance. Here it is 1.000 for every row.
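
To make that concrete, here is a minimal sketch of the kind of two-sample significance test behind such a column; the QPS samples are invented, and this is not luceneutil's actual implementation:

    # Minimal sketch: a two-sample t-test on per-iteration QPS measurements.
    # The sample values are hypothetical, purely for illustration.
    from scipy.stats import ttest_ind

    baseline_qps = [407.1, 409.8, 405.2, 408.6, 406.9]  # run A (hypothetical)
    modified_qps = [404.9, 410.3, 407.7, 405.8, 409.1]  # run B, same code (hypothetical)

    t_stat, p_value = ttest_ind(baseline_qps, modified_qps)
    print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
    # A p-value near 1.0 says the observed gap is entirely consistent with
    # random noise -- exactly what an A/A comparison should produce.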

You can shrink the absolute size of these random differences by running larger test samples: more iterations, larger indexes, more queries.
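
As a rough, standalone illustration of why that works (a simulation with made-up numbers, unrelated to luceneutil's internals), the spread of a measured mean shrinks roughly with the square root of the number of iterations:

    # Standalone simulation: noise on a measured mean falls ~ 1/sqrt(n).
    # TRUE_QPS and NOISE_SD are invented for illustration.
    import random
    import statistics

    TRUE_QPS = 1000.0  # hypothetical "true" throughput
    NOISE_SD = 100.0   # hypothetical per-iteration noise

    def measured_mean(iterations: int) -> float:
        samples = [random.gauss(TRUE_QPS, NOISE_SD) for _ in range(iterations)]
        return statistics.mean(samples)

    for n in (5, 20, 100, 500):
        means = [measured_mean(n) for _ in range(200)]
        print(f"iterations={n:4d}  stdev of measured mean = {statistics.pstdev(means):6.1f}")
    # Expect roughly NOISE_SD / sqrt(n): more iterations give tighter
    # estimates, so spurious per-task percent differences get smaller.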

What your A/A test shows you is the magnitude of noise on the system given your sample size.
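
One practical way to use those A/A numbers (a hypothetical helper, not part of luceneutil) is to treat the spread of the observed percent differences as a noise floor, and only trust future baseline-vs-candidate deltas that clearly exceed it:

    # Hypothetical helper: derive a noise floor from the A/A percent diffs.
    import statistics

    # A few of the Pct-diff values from the table above, as fractions:
    aa_pct_diffs = [-0.228, -0.206, -0.187, -0.120, 0.051, 0.071, 0.192]

    noise_floor = max(abs(d) for d in aa_pct_diffs)
    print(f"max |A/A diff| = {noise_floor:.1%}, "
          f"stdev = {statistics.pstdev(aa_pct_diffs):.1%}")
    # With the tiny wikimedium10k corpus, swings above 20% are pure noise;
    # a real change should be judged against this floor, or measured on a
    # larger corpus.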


0 reactions
praveennish commented, Jan 10, 2022

I am very sorry, @mikemccand, for the late reply!

I wanted to retest today, but after the latest pull I am getting a FileNotFoundException for the file enwiki-20120502-lines-1k-fixed-utf8-with-random-label.txt in the data folder.

Where can I download this file, please?

Read more comments on GitHub >
