
Baseline and modified version show different benchmark results even though the codebase is the same

See original GitHub issue

Hi @mikemccand,

I have cloned the Lucene 9 code into both the baseline and candidate folders (so the codebase is 100% identical), yet I saw a performance difference after running this command:

python3 src/python/localrun.py -source wikimedium10k

The first output table shows:

                        Task    QPS baseline    StdDev    QPS my_modified_version    StdDev      Pct diff    p-value
                     Prefix3      407.18      (0.0%)      314.24      (0.0%)  -22.8% ( -22% -  -22%) 1.000
                 LowSpanNear     1073.76      (0.0%)      852.06      (0.0%)  -20.6% ( -20% -  -20%) 1.000
             MedSloppyPhrase     1140.22      (0.0%)      927.42      (0.0%)  -18.7% ( -18% -  -18%) 1.000
                   MedPhrase      964.51      (0.0%)      848.50      (0.0%)  -12.0% ( -12% -  -12%) 1.000
        HighIntervalsOrdered     1002.98      (0.0%)      884.65      (0.0%)  -11.8% ( -11% -  -11%) 1.000
           HighTermMonthSort     4017.92      (0.0%)     3660.73      (0.0%)   -8.9% (  -8% -   -8%) 1.000
                     Respell      512.33      (0.0%)      467.72      (0.0%)   -8.7% (  -8% -   -8%) 1.000
                HighSpanNear      893.76      (0.0%)      821.69      (0.0%)   -8.1% (  -8% -   -8%) 1.000
                      IntNRQ     1828.06      (0.0%)     1682.03      (0.0%)   -8.0% (  -7% -   -7%) 1.000
                    HighTerm     5614.10      (0.0%)     5200.05      (0.0%)   -7.4% (  -7% -   -7%) 1.000
       BrowseMonthTaxoFacets     4142.06      (0.0%)     3870.82      (0.0%)   -6.5% (  -6% -   -6%) 1.000
       HighTermDayOfYearSort     3782.61      (0.0%)     3538.93      (0.0%)   -6.4% (  -6% -   -6%) 1.000
   BrowseDayOfYearSSDVFacets     2665.19      (0.0%)     2514.64      (0.0%)   -5.6% (  -5% -   -5%) 1.000
                     LowTerm     6806.33      (0.0%)     6460.07      (0.0%)   -5.1% (  -5% -   -5%) 1.000
            HighSloppyPhrase      886.16      (0.0%)      845.10      (0.0%)   -4.6% (  -4% -   -4%) 1.000
                   OrHighMed      898.26      (0.0%)      858.97      (0.0%)   -4.4% (  -4% -   -4%) 1.000
                   LowPhrase      988.79      (0.0%)      947.64      (0.0%)   -4.2% (  -4% -   -4%) 1.000
                   OrHighLow     1171.10      (0.0%)     1124.50      (0.0%)   -4.0% (  -3% -   -3%) 1.000
        BrowseDateTaxoFacets     3796.98      (0.0%)     3648.76      (0.0%)   -3.9% (  -3% -   -3%) 1.000
                    PKLookup      326.99      (0.0%)      315.53      (0.0%)   -3.5% (  -3% -   -3%) 1.000
       BrowseMonthSSDVFacets     3212.18      (0.0%)     3110.22      (0.0%)   -3.2% (  -3% -   -3%) 1.000
                  AndHighLow     2763.74      (0.0%)     2691.30      (0.0%)   -2.6% (  -2% -   -2%) 1.000
                 MedSpanNear      634.86      (0.0%)      624.48      (0.0%)   -1.6% (  -1% -   -1%) 1.000
                    Wildcard      581.94      (0.0%)      572.55      (0.0%)   -1.6% (  -1% -   -1%) 1.000
                  HighPhrase      729.77      (0.0%)      720.61      (0.0%)   -1.3% (  -1% -   -1%) 1.000
   BrowseDayOfYearTaxoFacets     3111.47      (0.0%)     3073.01      (0.0%)   -1.2% (  -1% -   -1%) 1.000
                  OrHighHigh      430.85      (0.0%)      426.77      (0.0%)   -0.9% (   0% -    0%) 1.000
                 AndHighHigh     1029.49      (0.0%)     1028.71      (0.0%)   -0.1% (   0% -    0%) 1.000
             LowSloppyPhrase     1351.24      (0.0%)     1365.14      (0.0%)    1.0% (   1% -    1%) 1.000
                      Fuzzy2       70.31      (0.0%)       71.83      (0.0%)    2.2% (   2% -    2%) 1.000
                      Fuzzy1      324.58      (0.0%)      338.44      (0.0%)    4.3% (   4% -    4%) 1.000
         LowIntervalsOrdered     1721.13      (0.0%)     1807.65      (0.0%)    5.0% (   5% -    5%) 1.000
                     MedTerm     5749.70      (0.0%)     6042.57      (0.0%)    5.1% (   5% -    5%) 1.000
         MedIntervalsOrdered     1291.17      (0.0%)     1382.36      (0.0%)    7.1% (   7% -    7%) 1.000
                  AndHighMed     1322.11      (0.0%)     1575.31      (0.0%)   19.2% (  19% -   19%) 1.000

My expectation was that the same code would perform the same way in both runs, but you can see deviations. Can you please explain this? Is this the right way to run the benchmark?

Thanks!

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 5 (1 by maintainers)

Top GitHub Comments

1 reaction
msokolov commented, Oct 12, 2021

There is no statistically significant difference here. The final column, the p-value, tells you the probability that the difference you are observing is due to random chance. Here it is 1.000 for every row.
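
To make that concrete, here is a minimal sketch of the kind of two-sample significance test behind such a column; the QPS samples are invented, and this is not luceneutil's actual implementation:

    # Minimal sketch: a two-sample t-test on per-iteration QPS measurements.
    # The sample values are hypothetical, purely for illustration.
    from scipy.stats import ttest_ind

    baseline_qps = [407.1, 409.8, 405.2, 408.6, 406.9]  # run A (hypothetical)
    modified_qps = [404.9, 410.3, 407.7, 405.8, 409.1]  # run B, same code (hypothetical)

    t_stat, p_value = ttest_ind(baseline_qps, modified_qps)
    print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
    # A p-value near 1.0 says the observed gap is entirely consistent with
    # random noise -- exactly what an A/A comparison should produce.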

You can shrink the absolute size of these random differences by running larger test samples: more iterations, larger indexes, more queries.
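
As a rough, standalone illustration of why that works (a simulation with made-up numbers, unrelated to luceneutil's internals), the spread of a measured mean shrinks roughly with the square root of the number of iterations:

    # Standalone simulation: noise on a measured mean falls ~ 1/sqrt(n).
    # TRUE_QPS and NOISE_SD are invented for illustration.
    import random
    import statistics

    TRUE_QPS = 1000.0  # hypothetical "true" throughput
    NOISE_SD = 100.0   # hypothetical per-iteration noise

    def measured_mean(iterations: int) -> float:
        samples = [random.gauss(TRUE_QPS, NOISE_SD) for _ in range(iterations)]
        return statistics.mean(samples)

    for n in (5, 20, 100, 500):
        means = [measured_mean(n) for _ in range(200)]
        print(f"iterations={n:4d}  stdev of measured mean = {statistics.pstdev(means):6.1f}")
    # Expect roughly NOISE_SD / sqrt(n): more iterations give tighter
    # estimates, so spurious per-task percent differences get smaller.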

What your A/A test shows you is the magnitude of noise on the system given your sample size.
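
One practical way to use those A/A numbers (a hypothetical helper, not part of luceneutil) is to treat the spread of the observed percent differences as a noise floor, and only trust future baseline-vs-candidate deltas that clearly exceed it:

    # Hypothetical helper: derive a noise floor from the A/A percent diffs.
    import statistics

    # A few of the Pct-diff values from the table above, as fractions:
    aa_pct_diffs = [-0.228, -0.206, -0.187, -0.120, 0.051, 0.071, 0.192]

    noise_floor = max(abs(d) for d in aa_pct_diffs)
    print(f"max |A/A diff| = {noise_floor:.1%}, "
          f"stdev = {statistics.pstdev(aa_pct_diffs):.1%}")
    # With the tiny wikimedium10k corpus, swings above 20% are pure noise;
    # a real change should be judged against this floor, or measured on a
    # larger corpus.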


0 reactions
praveennish commented, Jan 10, 2022

I am very sorry, @mikemccand, for the late reply!

I wanted to retest today, but after the latest pull I am getting a FileNotFoundException for the file enwiki-20120502-lines-1k-fixed-utf8-with-random-label.txt in the data folder.

Where can I download this file, please?

Read more comments on GitHub >
