Complete Results

These results are based on the Stanley (2017), Alinaghi (2018), Bom (2019), and Carter (2019) data-generating mechanisms, with a total of 1665 conditions.

Average Performance

Method performance measures are aggregated across all simulated conditions to provide an overall impression of method performance. However, keep in mind that a method with a high overall ranking is not necessarily the “best” method for a particular application. To select a suitable method for your application, consider also non-aggregated performance measures in conditions most relevant to your application.
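To make the “Mean Rank” columns below concrete, here is a minimal sketch of the aggregation, assuming a hypothetical long-format results table with columns condition, method, and a lower-is-better performance value; the ranking follows the condition-wise scheme described in the measure captions below.

```python
import pandas as pd

# Hypothetical long-format simulation results: one row per
# (condition, method) pair with a lower-is-better measure (e.g., RMSE).
results = pd.DataFrame({
    "condition": [1, 1, 1, 2, 2, 2],
    "method":    ["RMA", "PET", "RoBMA"] * 2,
    "rmse":      [0.30, 0.25, 0.20, 0.10, 0.40, 0.15],
})

# Rank methods within each condition (1 = best), then average the
# ranks across conditions to obtain the reported mean rank.
results["rank"] = results.groupby("condition")["rmse"].rank()
mean_rank = results.groupby("method")["rank"].mean().sort_values()
```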

Conditional on Convergence | Replacement if Non-Convergence
Rank Method Mean Rank | Rank Method Mean Rank
1 AK (AK1) 6.339 1 AK (AK1) 6.384
2 RoBMA (PSMA) 7.038 2 RoBMA (PSMA) 7.204
3 WAAPWLS (default) 7.520 3 WAAPWLS (default) 7.533
4 FMA (default) 7.958 4 FMA (default) 7.971
5 WLS (default) 7.965 5 WLS (default) 7.978
6 trimfill (default) 8.514 6 trimfill (default) 8.532
7 SM (3PSM) 8.941 7 SM (3PSM) 8.969
8 PEESE (default) 8.951 8 PEESE (default) 8.977
9 PETPEESE (default) 9.559 9 PETPEESE (default) 9.582
10 WILS (default) 9.701 10 WILS (default) 9.712
11 puniform (star) 9.782 11 puniform (star) 9.799
12 RMA (default) 10.416 12 RMA (default) 10.434
13 AK (AK2) 11.341 13 AK (AK2) 10.780
14 EK (default) 11.449 14 EK (default) 11.471
15 SM (4PSM) 11.497 15 SM (4PSM) 11.538
16 PET (default) 11.583 16 PET (default) 11.608
17 pcurve (default) 11.686 17 pcurve (default) 11.703
18 puniform (default) 13.138 18 puniform (default) 13.157
19 mean (default) 14.269 19 mean (default) 14.317

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average RMSE is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of RMSE values on the corresponding outcome scale.
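In symbols, with $\hat\theta_s$ the meta-analytic estimate in simulation run $s = 1, \dots, S$ and $\theta$ the true effect, the RMSE described above is

$$
\mathrm{RMSE} = \sqrt{\frac{1}{S} \sum_{s=1}^{S} \left(\hat\theta_s - \theta\right)^2},
\qquad
\mathrm{RMSE}^2 = \mathrm{Bias}^2 + \mathrm{EmpSE}^2,
$$

where the decomposition is exact when the empirical SE is computed with the divide-by-$S$ (population) variance.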

Conditional on Convergence | Replacement if Non-Convergence
Rank Method Mean Rank | Rank Method Mean Rank
1 WAAPWLS (default) 7.392 1 WAAPWLS (default) 7.489
2 AK (AK1) 7.741 2 AK (AK1) 7.810
3 PETPEESE (default) 7.910 3 PETPEESE (default) 7.967
4 PEESE (default) 8.189 4 PEESE (default) 8.250
5 SM (3PSM) 8.389 5 SM (3PSM) 8.380
6 RoBMA (PSMA) 8.643 6 RoBMA (PSMA) 8.610
7 puniform (star) 9.047 7 puniform (star) 9.112
8 EK (default) 9.088 8 EK (default) 9.147
9 PET (default) 9.171 9 PET (default) 9.229
10 WLS (default) 9.213 10 WLS (default) 9.305
11 FMA (default) 9.217 11 FMA (default) 9.309
12 SM (4PSM) 9.627 12 SM (4PSM) 9.682
13 WILS (default) 10.234 13 AK (AK2) 9.971
14 trimfill (default) 10.392 14 WILS (default) 10.283
15 AK (AK2) 10.969 15 trimfill (default) 10.467
16 RMA (default) 12.153 16 RMA (default) 12.220
17 puniform (default) 12.634 17 puniform (default) 12.677
18 pcurve (default) 12.804 18 pcurve (default) 12.836
19 mean (default) 15.017 19 mean (default) 15.086

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0. Methods are compared using condition-wise ranks. Direct comparison using the average bias is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of bias values on the corresponding outcome scale.
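Using the same notation,

$$
\mathrm{Bias} = \frac{1}{S} \sum_{s=1}^{S} \hat\theta_s \;-\; \theta,
$$

so, for example, estimates averaging 0.31 against a true effect of 0.30 give a bias of 0.01.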

Conditional on Convergence | Replacement if Non-Convergence
Rank Method Mean Rank | Rank Method Mean Rank
1 RMA (default) 3.868 1 RMA (default) 3.726
2 AK (AK1) 5.163 2 WLS (default) 5.267
3 WLS (default) 5.375 3 FMA (default) 5.272
4 FMA (default) 5.380 4 AK (AK1) 5.315
5 trimfill (default) 6.847 5 trimfill (default) 6.799
6 mean (default) 7.882 6 mean (default) 7.821
7 pcurve (default) 7.912 7 pcurve (default) 7.871
8 RoBMA (PSMA) 8.532 8 WAAPWLS (default) 8.538
9 WAAPWLS (default) 8.625 9 RoBMA (PSMA) 9.159
10 PEESE (default) 10.629 10 PEESE (default) 10.601
11 SM (3PSM) 10.903 11 SM (3PSM) 10.861
12 puniform (default) 11.119 12 puniform (default) 11.085
13 WILS (default) 12.284 13 WILS (default) 12.230
14 puniform (star) 12.305 14 puniform (star) 12.254
15 AK (AK2) 13.214 15 PETPEESE (default) 13.235
16 PETPEESE (default) 13.253 16 AK (AK2) 13.374
17 SM (4PSM) 14.047 17 SM (4PSM) 13.998
18 EK (default) 15.160 18 EK (default) 15.126
19 PET (default) 15.235 19 PET (default) 15.204

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance. Methods are compared using condition-wise ranks. Direct comparison using the average empirical standard error is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of empirical standard error values on the corresponding outcome scale.
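A minimal sketch of how bias, empirical SE, and RMSE fit together for a single condition, using hypothetical estimates and a made-up true effect:

```python
import numpy as np

# Hypothetical meta-analytic estimates from S = 5 simulation runs
# of one condition, plus the true effect used to generate the data.
estimates = np.array([0.21, 0.35, 0.28, 0.41, 0.30])
true_effect = 0.30

bias = estimates.mean() - true_effect                    # average deviation
emp_se = estimates.std(ddof=0)                           # SD across runs
rmse = np.sqrt(np.mean((estimates - true_effect) ** 2))  # overall error

# With the divide-by-S variance (ddof=0), RMSE^2 = bias^2 + empSE^2 exactly.
assert np.isclose(rmse**2, bias**2 + emp_se**2)
```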

Conditional on Convergence | Replacement if Non-Convergence
Rank Method Mean Rank | Rank Method Mean Rank
1 RoBMA (PSMA) 5.554 1 RoBMA (PSMA) 5.711
2 AK (AK1) 5.962 2 AK (AK1) 5.971
3 SM (3PSM) 6.729 3 SM (3PSM) 6.820
4 puniform (star) 7.120 4 puniform (star) 7.214
5 WAAPWLS (default) 7.590 5 WAAPWLS (default) 7.648
6 trimfill (default) 8.422 6 trimfill (default) 8.461
7 SM (4PSM) 8.810 7 SM (4PSM) 8.605
8 WLS (default) 9.283 8 WLS (default) 9.339
9 PEESE (default) 9.313 9 PEESE (default) 9.347
10 PETPEESE (default) 9.744 10 AK (AK2) 9.412
11 AK (AK2) 9.890 11 PETPEESE (default) 9.754
12 RMA (default) 10.127 12 RMA (default) 10.170
13 EK (default) 10.463 13 EK (default) 10.469
14 PET (default) 11.374 14 PET (default) 11.376
15 WILS (default) 11.458 15 WILS (default) 11.449
16 FMA (default) 11.743 16 FMA (default) 11.776
17 puniform (default) 11.791 17 puniform (default) 11.810
18 mean (default) 15.172 18 mean (default) 15.214
19 pcurve (default) 18.977 19 pcurve (default) 18.977

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average interval score is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of interval score values on the corresponding outcome scale.
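The score itself is presumably the standard interval score of Gneiting and Raftery (2007): for a central $(1-\alpha)$ interval $[\ell, u]$ and true effect $\theta$ (here $\alpha = 0.05$),

$$
\mathrm{IS}_\alpha(\ell, u; \theta) = (u - \ell)
+ \frac{2}{\alpha}(\ell - \theta)\,\mathbf{1}[\theta < \ell]
+ \frac{2}{\alpha}(\theta - u)\,\mathbf{1}[\theta > u],
$$

so width always counts against a method and misses are penalized in proportion to how far the interval lands from the true value.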

Conditional on Convergence | Replacement if Non-Convergence
Rank Method Value | Rank Method Value
1 RoBMA (PSMA) 0.800 1 RoBMA (PSMA) 0.798
2 AK (AK2) 0.795 2 AK (AK2) 0.769
3 SM (4PSM) 0.765 3 SM (4PSM) 0.760
4 puniform (star) 0.733 4 puniform (star) 0.733
5 SM (3PSM) 0.733 5 SM (3PSM) 0.728
6 EK (default) 0.641 6 EK (default) 0.641
7 PETPEESE (default) 0.629 7 PETPEESE (default) 0.629
8 PET (default) 0.620 8 PET (default) 0.620
9 AK (AK1) 0.609 9 AK (AK1) 0.609
10 WAAPWLS (default) 0.582 10 WAAPWLS (default) 0.582
11 trimfill (default) 0.544 11 trimfill (default) 0.543
12 PEESE (default) 0.526 12 PEESE (default) 0.526
13 WILS (default) 0.504 13 WILS (default) 0.504
14 puniform (default) 0.484 14 puniform (default) 0.484
15 WLS (default) 0.464 15 WLS (default) 0.464
16 RMA (default) 0.457 16 RMA (default) 0.457
17 FMA (default) 0.342 17 FMA (default) 0.342
18 mean (default) 0.260 18 mean (default) 0.260
19 pcurve (default) NaN 19 pcurve (default) NaN

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Conditional on Convergence | Replacement if Non-Convergence
Rank Method Mean Rank | Rank Method Mean Rank
1 FMA (default) 2.348 1 FMA (default) 2.310
2 WLS (default) 3.883 2 WLS (default) 3.841
3 WILS (default) 4.891 3 WILS (default) 4.871
4 mean (default) 7.108 4 mean (default) 7.107
5 PEESE (default) 7.142 5 PEESE (default) 7.129
6 WAAPWLS (default) 7.474 6 WAAPWLS (default) 7.433
7 RMA (default) 7.726 7 RMA (default) 7.682
8 trimfill (default) 7.771 8 trimfill (default) 7.693
9 AK (AK1) 8.774 9 AK (AK1) 8.777
10 PETPEESE (default) 9.258 10 PETPEESE (default) 9.298
11 RoBMA (PSMA) 10.541 11 RoBMA (PSMA) 10.617
12 puniform (default) 11.068 12 puniform (default) 11.079
13 SM (3PSM) 12.107 13 SM (3PSM) 12.180
14 puniform (star) 13.280 14 PET (default) 13.437
15 PET (default) 13.342 15 puniform (star) 13.447
16 EK (default) 14.474 16 AK (AK2) 14.238
17 AK (AK2) 14.521 17 EK (default) 14.577
18 SM (4PSM) 14.836 18 SM (4PSM) 14.831
19 pcurve (default) 18.977 19 pcurve (default) 18.977

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average CI width is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of CI width values on the corresponding outcome scale.
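A short sketch computing coverage, average width, and the interval score from per-run CI bounds; the arrays and the true effect are hypothetical, and the score follows the Gneiting and Raftery (2007) definition given above:

```python
import numpy as np

# Hypothetical 95% CI bounds from four simulation runs.
lower = np.array([0.05, 0.12, -0.02, 0.35])
upper = np.array([0.45, 0.58,  0.30, 0.70])
true_effect, alpha = 0.30, 0.05

coverage = np.mean((lower <= true_effect) & (true_effect <= upper))  # 0.75
width = np.mean(upper - lower)

# Interval score: width plus a 2/alpha penalty per unit of miss.
interval_score = np.mean(
    (upper - lower)
    + (2 / alpha) * (lower - true_effect) * (true_effect < lower)
    + (2 / alpha) * (true_effect - upper) * (true_effect > upper)
)
```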

Conditional on Convergence | Replacement if Non-Convergence
Rank Method Log Value | Rank Method Log Value
1 RoBMA (PSMA) 3.703 1 RoBMA (PSMA) 3.577
2 AK (AK2) 2.124 2 AK (AK2) 1.735
3 PETPEESE (default) 1.532 3 PETPEESE (default) 1.532
4 PET (default) 1.515 4 PET (default) 1.515
5 EK (default) 1.515 5 EK (default) 1.515
6 puniform (default) 1.515 6 puniform (default) 1.501
7 puniform (star) 1.325 7 puniform (star) 1.325
8 SM (3PSM) 1.325 8 SM (3PSM) 1.321
9 AK (AK1) 1.215 9 AK (AK1) 1.205
10 SM (4PSM) 1.156 10 SM (4PSM) 1.185
11 RMA (default) 0.998 11 RMA (default) 0.998
12 WAAPWLS (default) 0.945 12 WAAPWLS (default) 0.945
13 trimfill (default) 0.922 13 trimfill (default) 0.922
14 WILS (default) 0.871 14 WILS (default) 0.871
15 PEESE (default) 0.843 15 PEESE (default) 0.843
16 WLS (default) 0.790 16 WLS (default) 0.790
17 FMA (default) 0.503 17 FMA (default) 0.503
18 mean (default) 0.487 18 mean (default) 0.487
19 pcurve (default) NaN 19 pcurve (default) NaN

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.
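In terms of the two quantities reported further below, this is the diagnostic-test ratio

$$
\mathrm{LR}^{+} = \frac{\text{power}}{\text{type I error rate}},
$$

so, for example, power of 0.70 with a type I error rate of 0.05 gives $\mathrm{LR}^{+} = 14$, i.e., a log value of $\ln 14 \approx 2.64$.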

Conditional on Convergence | Replacement if Non-Convergence
Rank Method Log Value | Rank Method Log Value
1 PETPEESE (default) -4.626 1 AK (AK2) -4.661
2 EK (default) -4.496 2 PETPEESE (default) -4.626
3 PET (default) -4.496 3 EK (default) -4.496
4 WAAPWLS (default) -4.042 4 PET (default) -4.496
5 PEESE (default) -3.890 5 WAAPWLS (default) -4.042
6 SM (3PSM) -3.560 6 PEESE (default) -3.890
7 WLS (default) -3.450 7 SM (3PSM) -3.593
8 trimfill (default) -3.445 8 WLS (default) -3.450
9 puniform (default) -3.374 9 trimfill (default) -3.446
10 RoBMA (PSMA) -3.331 10 puniform (default) -3.376
11 AK (AK1) -3.277 11 RoBMA (PSMA) -3.332
12 puniform (star) -3.208 12 AK (AK1) -3.281
13 AK (AK2) -3.158 13 puniform (star) -3.208
14 RMA (default) -3.121 14 RMA (default) -3.121
15 FMA (default) -3.058 15 FMA (default) -3.058
16 WILS (default) -3.037 16 WILS (default) -3.037
17 SM (4PSM) -2.636 17 SM (4PSM) -2.873
18 mean (default) -2.503 18 mean (default) -2.503
19 pcurve (default) NaN 19 pcurve (default) NaN

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.
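Analogously,

$$
\mathrm{LR}^{-} = \frac{1 - \text{power}}{1 - \text{type I error rate}},
$$

so power of 0.70 with a type I error rate of 0.05 gives $\mathrm{LR}^{-} = 0.30 / 0.95 \approx 0.32$, i.e., a log value of about $-1.15$.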

Conditional on Convergence | Replacement if Non-Convergence
Rank Method Value | Rank Method Value
1 RoBMA (PSMA) 0.102 1 RoBMA (PSMA) 0.106
2 AK (AK2) 0.125 2 AK (AK2) 0.237
3 SM (4PSM) 0.245 3 SM (4PSM) 0.248
4 PET (default) 0.257 4 PET (default) 0.257
5 EK (default) 0.257 5 EK (default) 0.257
6 PETPEESE (default) 0.270 6 PETPEESE (default) 0.270
7 SM (3PSM) 0.277 7 SM (3PSM) 0.280
8 puniform (star) 0.293 8 puniform (star) 0.293
9 WILS (default) 0.391 9 WILS (default) 0.391
10 WAAPWLS (default) 0.523 10 WAAPWLS (default) 0.523
11 PEESE (default) 0.546 11 PEESE (default) 0.546
12 AK (AK1) 0.581 12 AK (AK1) 0.581
13 trimfill (default) 0.586 13 trimfill (default) 0.586
14 puniform (default) 0.608 14 puniform (default) 0.607
15 WLS (default) 0.621 15 WLS (default) 0.621
16 RMA (default) 0.622 16 RMA (default) 0.622
17 FMA (default) 0.772 17 FMA (default) 0.772
18 mean (default) 0.779 18 mean (default) 0.779
19 pcurve (default) NaN 19 pcurve (default) NaN

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Conditional on Convergence | Replacement if Non-Convergence
Rank Method Value | Rank Method Value
1 mean (default) 0.990 1 mean (default) 0.990
2 FMA (default) 0.989 2 FMA (default) 0.989
3 WLS (default) 0.976 3 WLS (default) 0.976
4 RMA (default) 0.974 4 RMA (default) 0.974
5 AK (AK1) 0.969 5 AK (AK1) 0.969
6 trimfill (default) 0.965 6 trimfill (default) 0.965
7 PEESE (default) 0.953 7 PEESE (default) 0.953
8 puniform (default) 0.939 8 puniform (default) 0.939
9 WAAPWLS (default) 0.934 9 WAAPWLS (default) 0.934
10 PETPEESE (default) 0.893 10 PETPEESE (default) 0.893
11 EK (default) 0.873 11 AK (AK2) 0.885
12 PET (default) 0.873 12 EK (default) 0.873
13 WILS (default) 0.864 13 PET (default) 0.873
14 SM (3PSM) 0.828 14 WILS (default) 0.864
15 AK (AK2) 0.812 15 SM (3PSM) 0.835
16 puniform (star) 0.808 16 puniform (star) 0.808
17 SM (4PSM) 0.754 17 SM (4PSM) 0.772
18 RoBMA (PSMA) 0.703 18 RoBMA (PSMA) 0.706
19 pcurve (default) NaN 19 pcurve (default) NaN

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

By-Condition Performance (Conditional on Method Convergence)

The results below are conditional on method convergence. Note that the methods might differ in convergence rate and are therefore not compared on the same data sets.

Raincloud plot showing convergence rates across different methods

Raincloud plot showing RMSE (Root Mean Square Error) across different methods

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average RMSE is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of RMSE values on the corresponding outcome scale.

Raincloud plot showing bias across different methods

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0. Methods are compared using condition-wise ranks. Direct comparison using the average bias is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of bias values on the corresponding outcome scale.

Raincloud plot showing empirical SE across different methods

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance. Methods are compared using condition-wise ranks. Direct comparison using the empirical standard error is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of empirical standard error values on the corresponding outcome scale.

Raincloud plot showing interval scores across different methods

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the interval score is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of interval score values on the corresponding outcome scale.

Raincloud plot showing 95% confidence interval coverage across different methods

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Raincloud plot showing 95% confidence interval width across different methods

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average 95% CI width is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of 95% CI width values on the corresponding outcome scale.

Raincloud plot showing positive likelihood ratio across different methods

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Raincloud plot showing negative likelihood ratio across different methods

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.

Raincloud plot showing Type I Error rates across different methods

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Raincloud plot showing statistical power across different methods

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

By-Condition Performance (Replacement in Case of Non-Convergence)

The results below incorporate method replacement to handle non-convergence. If a method fails to converge, its results are replaced with the results of a simpler method (e.g., a random-effects meta-analysis without publication bias adjustment). This emulates what a data analyst might do in practice when a method does not converge. However, note that these results do not correspond to “pure” method performance, as they may combine results from multiple methods. See Method Replacement Strategy for details of the method replacement specification.
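A minimal sketch of the replacement idea, with hypothetical per-run records and RMA standing in as the fallback (the actual fallback chain is specified in Method Replacement Strategy):

```python
def with_replacement(primary, fallback):
    """Use the primary method's per-run result when it converged;
    otherwise substitute the fallback method's result for that run."""
    return [p if p["converged"] else f for p, f in zip(primary, fallback)]

# Hypothetical per-run results for one simulated condition.
robma = [{"est": 0.21, "converged": True}, {"est": None, "converged": False}]
rma = [{"est": 0.25, "converged": True}, {"est": 0.31, "converged": True}]

combined = with_replacement(robma, rma)  # run 2 now carries the RMA result
```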

Raincloud plot showing convergence rates across different methods

Raincloud plot showing RMSE (Root Mean Square Error) across different methods

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average RMSE is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of RMSE values on the corresponding outcome scale.

Raincloud plot showing bias across different methods

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0. Methods are compared using condition-wise ranks. Direct comparison using the average bias is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of bias values on the corresponding outcome scale.

Raincloud plot showing empirical SE across different methods

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance. Methods are compared using condition-wise ranks. Direct comparison using the empirical standard error is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of empirical standard error values on the corresponding outcome scale.

Raincloud plot showing interval scores across different methods

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the interval score is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of interval score values on the corresponding outcome scale.

Raincloud plot showing 95% confidence interval coverage across different methods

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Raincloud plot showing 95% confidence interval width across different methods

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average 95% CI width is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of 95% CI width values on the corresponding outcome scale.

Raincloud plot showing positive likelihood ratio across different methods

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Raincloud plot showing negative likelihood ratio across different methods

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.

Raincloud plot showing Type I Error rates across different methods

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Raincloud plot showing statistical power across different methods

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

Subset: Publication Bias Present

These results are based on the Stanley (2017), Alinaghi (2018), Bom (2019), and Carter (2019) data-generating mechanisms, with a total of 1143 conditions.

Average Performance

Method performance measures are aggregated across all simulated conditions to provide an overall impression of method performance. However, keep in mind that a method with a high overall ranking is not necessarily the “best” method for a particular application. To select a suitable method for your application, consider also non-aggregated performance measures in conditions most relevant to your application.

Conditional on Convergence | Replacement if Non-Convergence
Rank Method Mean Rank | Rank Method Mean Rank
1 AK (AK1) 6.486 1 AK (AK1) 6.475
2 WAAPWLS (default) 7.641 2 WAAPWLS (default) 7.654
3 RoBMA (PSMA) 7.654 3 RoBMA (PSMA) 7.863
4 trimfill (default) 8.411 4 trimfill (default) 8.414
5 PEESE (default) 8.544 5 PEESE (default) 8.573
6 FMA (default) 8.585 6 FMA (default) 8.590
7 WLS (default) 8.590 7 WLS (default) 8.594
8 PETPEESE (default) 8.877 8 PETPEESE (default) 8.891
9 WILS (default) 9.296 9 WILS (default) 9.304
10 SM (3PSM) 9.509 10 SM (3PSM) 9.535
11 puniform (star) 9.916 11 puniform (star) 9.920
12 pcurve (default) 10.421 12 AK (AK2) 10.412
13 EK (default) 10.623 13 pcurve (default) 10.435
14 PET (default) 10.752 14 EK (default) 10.644
15 AK (AK2) 10.911 15 PET (default) 10.777
16 SM (4PSM) 11.726 16 SM (4PSM) 11.760
17 RMA (default) 11.883 17 RMA (default) 11.911
18 puniform (default) 12.356 18 puniform (default) 12.374
19 mean (default) 15.540 19 mean (default) 15.594

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average RMSE is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of RMSE values on the corresponding outcome scale.

Conditional on Convergence | Replacement if Non-Convergence
Rank Method Mean Rank | Rank Method Mean Rank
1 WAAPWLS (default) 7.444 1 WAAPWLS (default) 7.535
2 PETPEESE (default) 7.618 2 PETPEESE (default) 7.651
3 AK (AK1) 8.054 3 RoBMA (PSMA) 8.066
4 RoBMA (PSMA) 8.150 4 AK (AK1) 8.081
5 PEESE (default) 8.301 5 PEESE (default) 8.353
6 SM (3PSM) 8.668 6 SM (3PSM) 8.652
7 EK (default) 8.883 7 EK (default) 8.918
8 WILS (default) 8.899 8 WILS (default) 8.953
9 PET (default) 9.010 9 PET (default) 9.047
10 puniform (star) 9.025 10 puniform (star) 9.101
11 trimfill (default) 9.208 11 trimfill (default) 9.266
12 SM (4PSM) 9.817 12 SM (4PSM) 9.866
13 FMA (default) 9.937 13 AK (AK2) 9.892
14 WLS (default) 9.944 14 FMA (default) 10.025
15 AK (AK2) 10.712 15 WLS (default) 10.032
16 pcurve (default) 11.549 16 pcurve (default) 11.581
17 puniform (default) 11.905 17 puniform (default) 11.964
18 RMA (default) 13.843 18 RMA (default) 13.926
19 mean (default) 16.829 19 mean (default) 16.886

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0. Methods are compared using condition-wise ranks. Direct comparison using the average bias is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of bias values on the corresponding outcome scale.

Conditional on Convergence | Replacement if Non-Convergence
Rank Method Mean Rank | Rank Method Mean Rank
1 RMA (default) 3.972 1 RMA (default) 3.817
2 AK (AK1) 4.755 2 AK (AK1) 4.822
3 WLS (default) 5.384 3 WLS (default) 5.251
4 FMA (default) 5.388 4 FMA (default) 5.255
5 trimfill (default) 6.523 5 trimfill (default) 6.468
6 pcurve (default) 6.996 6 pcurve (default) 6.943
7 mean (default) 7.560 7 mean (default) 7.476
8 WAAPWLS (default) 8.716 8 WAAPWLS (default) 8.607
9 RoBMA (PSMA) 9.600 9 PEESE (default) 10.407
10 PEESE (default) 10.482 10 RoBMA (PSMA) 10.519
11 puniform (default) 10.866 11 puniform (default) 10.818
12 SM (3PSM) 11.488 12 SM (3PSM) 11.402
13 WILS (default) 12.593 13 WILS (default) 12.507
14 puniform (star) 12.708 14 puniform (star) 12.605
15 AK (AK2) 13.037 15 PETPEESE (default) 13.332
16 PETPEESE (default) 13.399 16 AK (AK2) 13.437
17 SM (4PSM) 14.441 17 SM (4PSM) 14.350
18 EK (default) 14.930 18 EK (default) 14.873
19 PET (default) 14.989 19 PET (default) 14.936

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance. Methods are compared using condition-wise ranks. Direct comparison using the average empirical standard error is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of empirical standard error values on the corresponding outcome scale.

Conditional on Convergence | Replacement if Non-Convergence
Rank Method Mean Rank | Rank Method Mean Rank
1 RoBMA (PSMA) 5.997 1 RoBMA (PSMA) 6.184
2 AK (AK1) 6.247 2 AK (AK1) 6.209
3 puniform (star) 7.006 3 puniform (star) 7.083
4 SM (3PSM) 7.037 4 SM (3PSM) 7.133
5 WAAPWLS (default) 7.719 5 WAAPWLS (default) 7.766
6 trimfill (default) 8.204 6 trimfill (default) 8.228
7 SM (4PSM) 8.972 7 SM (4PSM) 8.787
8 PEESE (default) 9.093 8 AK (AK2) 9.111
9 PETPEESE (default) 9.225 9 PEESE (default) 9.126
10 AK (AK2) 9.500 10 PETPEESE (default) 9.212
11 EK (default) 9.646 11 EK (default) 9.647
12 WLS (default) 9.950 12 WLS (default) 10.008
13 PET (default) 10.542 13 PET (default) 10.542
14 puniform (default) 10.857 14 puniform (default) 10.869
15 WILS (default) 10.905 15 WILS (default) 10.870
16 RMA (default) 11.525 16 RMA (default) 11.588
17 FMA (default) 11.958 17 FMA (default) 11.989
18 mean (default) 16.203 18 mean (default) 16.234
19 pcurve (default) 18.975 19 pcurve (default) 18.975

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average interval score is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of interval score values on the corresponding outcome scale.

Conditional on Convergence | Replacement if Non-Convergence
Rank Method Value | Rank Method Value
1 RoBMA (PSMA) 0.759 1 RoBMA (PSMA) 0.756
2 AK (AK2) 0.754 2 SM (4PSM) 0.717
3 SM (4PSM) 0.722 3 AK (AK2) 0.716
4 SM (3PSM) 0.688 4 puniform (star) 0.688
5 puniform (star) 0.688 5 SM (3PSM) 0.682
6 EK (default) 0.611 6 EK (default) 0.611
7 PETPEESE (default) 0.599 7 PETPEESE (default) 0.599
8 PET (default) 0.588 8 PET (default) 0.588
9 AK (AK1) 0.526 9 AK (AK1) 0.526
10 WAAPWLS (default) 0.523 10 WAAPWLS (default) 0.523
11 trimfill (default) 0.485 11 trimfill (default) 0.484
12 WILS (default) 0.479 12 WILS (default) 0.479
13 puniform (default) 0.478 13 puniform (default) 0.478
14 PEESE (default) 0.467 14 PEESE (default) 0.467
15 WLS (default) 0.393 15 WLS (default) 0.393
16 RMA (default) 0.358 16 RMA (default) 0.358
17 FMA (default) 0.288 17 FMA (default) 0.288
18 mean (default) 0.148 18 mean (default) 0.148
19 pcurve (default) NaN 19 pcurve (default) NaN

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Conditional on Convergence | Replacement if Non-Convergence
Rank Method Mean Rank | Rank Method Mean Rank
1 FMA (default) 2.332 1 FMA (default) 2.283
2 WLS (default) 3.981 2 WLS (default) 3.934
3 WILS (default) 5.625 3 WILS (default) 5.608
4 mean (default) 6.780 4 mean (default) 6.776
5 PEESE (default) 6.983 5 PEESE (default) 6.973
6 WAAPWLS (default) 7.553 6 WAAPWLS (default) 7.510
7 trimfill (default) 7.796 7 trimfill (default) 7.729
8 RMA (default) 7.836 8 RMA (default) 7.800
9 AK (AK1) 8.204 9 AK (AK1) 8.205
10 PETPEESE (default) 9.124 10 PETPEESE (default) 9.162
11 puniform (default) 10.870 11 puniform (default) 10.900
12 RoBMA (PSMA) 11.163 12 RoBMA (PSMA) 11.255
13 SM (3PSM) 12.385 13 SM (3PSM) 12.484
14 PET (default) 13.105 14 PET (default) 13.203
15 puniform (star) 13.246 15 puniform (star) 13.417
16 EK (default) 14.241 16 AK (AK2) 13.943
17 AK (AK2) 14.346 17 EK (default) 14.343
18 SM (4PSM) 15.018 18 SM (4PSM) 15.061
19 pcurve (default) 18.975 19 pcurve (default) 18.975

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average CI width is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of CI width values on the corresponding outcome scale.

Conditional on Convergence | Replacement if Non-Convergence
Rank Method Log Value | Rank Method Log Value
1 RoBMA (PSMA) 3.072 1 RoBMA (PSMA) 2.925
2 AK (AK2) 1.862 2 puniform (default) 1.707
3 puniform (default) 1.712 3 PETPEESE (default) 1.411
4 PETPEESE (default) 1.411 4 PET (default) 1.402
5 PET (default) 1.402 5 EK (default) 1.401
6 EK (default) 1.401 6 AK (AK2) 1.346
7 SM (3PSM) 1.058 7 puniform (star) 1.057
8 puniform (star) 1.057 8 SM (3PSM) 1.047
9 SM (4PSM) 0.857 9 SM (4PSM) 0.891
10 AK (AK1) 0.816 10 AK (AK1) 0.814
11 WILS (default) 0.780 11 WILS (default) 0.780
12 trimfill (default) 0.710 12 trimfill (default) 0.710
13 WAAPWLS (default) 0.694 13 WAAPWLS (default) 0.694
14 RMA (default) 0.601 14 RMA (default) 0.601
15 PEESE (default) 0.592 15 PEESE (default) 0.592
16 WLS (default) 0.520 16 WLS (default) 0.520
17 FMA (default) 0.267 17 FMA (default) 0.267
18 mean (default) 0.143 18 mean (default) 0.143
19 pcurve (default) NaN 19 pcurve (default) NaN

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Conditional on Convergence | Replacement if Non-Convergence
Rank Method Log Value | Rank Method Log Value
1 PETPEESE (default) -4.416 1 PETPEESE (default) -4.416
2 PET (default) -4.295 2 PET (default) -4.295
3 EK (default) -4.295 3 EK (default) -4.295
4 WAAPWLS (default) -3.646 4 AK (AK2) -4.176
5 PEESE (default) -3.433 5 WAAPWLS (default) -3.646
6 puniform (default) -3.371 6 PEESE (default) -3.433
7 RoBMA (PSMA) -2.917 7 puniform (default) -3.370
8 SM (3PSM) -2.841 8 RoBMA (PSMA) -2.912
9 trimfill (default) -2.827 9 SM (3PSM) -2.882
10 WLS (default) -2.818 10 trimfill (default) -2.827
11 AK (AK2) -2.697 11 WLS (default) -2.818
12 AK (AK1) -2.628 12 AK (AK1) -2.629
13 puniform (star) -2.613 13 puniform (star) -2.613
14 WILS (default) -2.580 14 WILS (default) -2.580
15 RMA (default) -2.441 15 RMA (default) -2.441
16 FMA (default) -2.367 16 FMA (default) -2.367
17 SM (4PSM) -1.917 17 SM (4PSM) -2.168
18 mean (default) -1.692 18 mean (default) -1.692
19 pcurve (default) NaN 19 pcurve (default) NaN

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.

Conditional on Convergence | Replacement if Non-Convergence
Rank Method Value | Rank Method Value
1 RoBMA (PSMA) 0.124 1 RoBMA (PSMA) 0.129
2 AK (AK2) 0.150 2 SM (4PSM) 0.284
3 SM (4PSM) 0.279 3 PET (default) 0.293
4 PET (default) 0.293 4 EK (default) 0.294
5 EK (default) 0.294 5 AK (AK2) 0.307
6 PETPEESE (default) 0.311 6 PETPEESE (default) 0.311
7 SM (3PSM) 0.318 7 SM (3PSM) 0.323
8 puniform (star) 0.327 8 puniform (star) 0.327
9 WILS (default) 0.404 9 WILS (default) 0.404
10 puniform (default) 0.618 10 puniform (default) 0.618
11 WAAPWLS (default) 0.650 11 WAAPWLS (default) 0.650
12 PEESE (default) 0.664 12 PEESE (default) 0.664
13 trimfill (default) 0.706 13 trimfill (default) 0.706
14 AK (AK1) 0.731 14 AK (AK1) 0.731
15 WLS (default) 0.762 15 WLS (default) 0.762
16 RMA (default) 0.782 16 RMA (default) 0.782
17 FMA (default) 0.883 17 FMA (default) 0.883
18 mean (default) 0.931 18 mean (default) 0.931
19 pcurve (default) NaN 19 pcurve (default) NaN

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Conditional on Convergence | Replacement if Non-Convergence
Rank Method Value | Rank Method Value
1 mean (default) 0.996 1 mean (default) 0.996
2 FMA (default) 0.991 2 FMA (default) 0.991
3 RMA (default) 0.979 3 RMA (default) 0.979
4 WLS (default) 0.978 4 WLS (default) 0.978
5 AK (AK1) 0.974 5 AK (AK1) 0.974
6 trimfill (default) 0.971 6 trimfill (default) 0.971
7 PEESE (default) 0.953 7 PEESE (default) 0.953
8 WAAPWLS (default) 0.937 8 WAAPWLS (default) 0.937
9 puniform (default) 0.929 9 puniform (default) 0.929
10 PETPEESE (default) 0.886 10 PETPEESE (default) 0.886
11 EK (default) 0.865 11 AK (AK2) 0.870
12 PET (default) 0.865 12 EK (default) 0.865
13 WILS (default) 0.839 13 PET (default) 0.865
14 SM (3PSM) 0.789 14 WILS (default) 0.839
15 AK (AK2) 0.777 15 SM (3PSM) 0.798
16 puniform (star) 0.766 16 puniform (star) 0.766
17 SM (4PSM) 0.711 17 SM (4PSM) 0.732
18 RoBMA (PSMA) 0.665 18 RoBMA (PSMA) 0.669
19 pcurve (default) NaN 19 pcurve (default) NaN

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

By-Condition Performance (Conditional on Method Convergence)

The results below are conditional on method convergence. Note that the methods might differ in convergence rate and are therefore not compared on the same data sets.

Raincloud plot showing convergence rates across different methods

Raincloud plot showing RMSE (Root Mean Square Error) across different methods

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average RMSE is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of RMSE values on the corresponding outcome scale.

Raincloud plot showing bias across different methods

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0. Methods are compared using condition-wise ranks. Direct comparison using the average bias is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of bias values on the corresponding outcome scale.

Raincloud plot showing empirical SE across different methods

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance. Methods are compared using condition-wise ranks. Direct comparison using the empirical standard error is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of empirical standard error values on the corresponding outcome scale.

Raincloud plot showing interval scores across different methods

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the interval score is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of interval score values on the corresponding outcome scale.

Raincloud plot showing 95% confidence interval coverage across different methods

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Raincloud plot showing 95% confidence interval width across different methods

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average 95% CI width is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of 95% CI width values on the corresponding outcome scale.

Raincloud plot showing positive likelihood ratio across different methods

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Raincloud plot showing negative likelihood ratio across different methods

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.

Raincloud plot showing Type I Error rates across different methods

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Raincloud plot showing statistical power across different methods

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

By-Condition Performance (Replacement in Case of Non-Convergence)

The results below incorporate method replacement to handle non-convergence. If a method fails to converge, its results are replaced with the results of a simpler method (e.g., a random-effects meta-analysis without publication bias adjustment). This emulates what a data analyst might do in practice when a method does not converge. However, note that these results do not correspond to “pure” method performance, as they may combine results from multiple methods. See Method Replacement Strategy for details of the method replacement specification.

Raincloud plot showing convergence rates across different methods

Raincloud plot showing RMSE (Root Mean Square Error) across different methods

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average RMSE is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of RMSE values on the corresponding outcome scale.

Raincloud plot showing bias across different methods

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0. Methods are compared using condition-wise ranks. Direct comparison using the average bias is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of bias values on the corresponding outcome scale.

Raincloud plot showing empirical SE across different methods

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance. Methods are compared using condition-wise ranks. Direct comparison using the empirical standard error is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of empirical standard error values on the corresponding outcome scale.

Raincloud plot showing interval scores across different methods

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the interval score is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of interval score values on the corresponding outcome scale.

Raincloud plot showing 95% confidence interval coverage across different methods

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Raincloud plot showing 95% confidence interval width across different methods

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average 95% CI width is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of 95% CI width values on the corresponding outcome scale.

Raincloud plot showing positive likelihood ratio across different methods

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Raincloud plot showing negative likelihood ratio across different methods

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.

Raincloud plot showing Type I Error rates across different methods

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Raincloud plot showing statistical power across different methods

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

Subset: Publication Bias Absent

These results are based on the Stanley (2017), Alinaghi (2018), Bom (2019), and Carter (2019) data-generating mechanisms, with a total of 522 conditions.

Average Performance

Method performance measures are aggregated across all simulated conditions to provide an overall impression of method performance. However, keep in mind that a method with a high overall ranking is not necessarily the “best” method for a particular application. To select a suitable method for your application, consider also non-aggregated performance measures in conditions most relevant to your application.

Conditional on Convergence | Replacement if Non-Convergence
Rank Method Mean Rank | Rank Method Mean Rank
1 RoBMA (PSMA) 5.690 1 RoBMA (PSMA) 5.762
2 AK (AK1) 6.019 2 AK (AK1) 6.186
3 FMA (default) 6.584 3 FMA (default) 6.615
4 WLS (default) 6.598 4 WLS (default) 6.628
5 RMA (default) 7.205 5 RMA (default) 7.199
6 WAAPWLS (default) 7.255 6 WAAPWLS (default) 7.268
7 SM (3PSM) 7.695 7 SM (3PSM) 7.730
8 trimfill (default) 8.739 8 trimfill (default) 8.791
9 puniform (star) 9.489 9 puniform (star) 9.533
10 PEESE (default) 9.843 10 PEESE (default) 9.860
11 WILS (default) 10.588 11 WILS (default) 10.605
12 SM (4PSM) 10.994 12 SM (4PSM) 11.050
13 PETPEESE (default) 11.054 13 PETPEESE (default) 11.096
14 mean (default) 11.487 14 mean (default) 11.521
15 AK (AK2) 12.284 15 AK (AK2) 11.586
16 EK (default) 13.257 16 EK (default) 13.284
17 PET (default) 13.402 17 PET (default) 13.427
18 pcurve (default) 14.458 18 pcurve (default) 14.479
19 puniform (default) 14.851 19 puniform (default) 14.870

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average RMSE is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of RMSE values on the corresponding outcome scale.

Conditional on Convergence | Replacement if Non-Convergence
Rank Method Mean Rank | Rank Method Mean Rank
1 AK (AK1) 7.056 1 AK (AK1) 7.216
2 WAAPWLS (default) 7.278 2 WAAPWLS (default) 7.389
3 WLS (default) 7.613 3 WLS (default) 7.713
4 FMA (default) 7.642 4 FMA (default) 7.741
5 SM (3PSM) 7.780 5 SM (3PSM) 7.785
6 PEESE (default) 7.943 6 PEESE (default) 8.025
7 RMA (default) 8.454 7 RMA (default) 8.487
8 PETPEESE (default) 8.550 8 PETPEESE (default) 8.659
9 puniform (star) 9.094 9 puniform (star) 9.136
10 SM (4PSM) 9.211 10 SM (4PSM) 9.280
11 PET (default) 9.523 11 PET (default) 9.628
12 EK (default) 9.538 12 EK (default) 9.649
13 RoBMA (PSMA) 9.724 13 RoBMA (PSMA) 9.799
14 mean (default) 11.050 14 AK (AK2) 10.144
15 AK (AK2) 11.533 15 mean (default) 11.146
16 trimfill (default) 12.985 16 trimfill (default) 13.096
17 WILS (default) 13.157 17 WILS (default) 13.195
18 puniform (default) 14.232 18 puniform (default) 14.239
19 pcurve (default) 15.552 19 pcurve (default) 15.584

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0. Methods are compared using condition-wise ranks. Direct comparison using the average bias is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of bias values on the corresponding outcome scale.

Conditional on Convergence | Replacement if Non-Convergence
Rank Method Mean Rank | Rank Method Mean Rank
1 RMA (default) 3.640 1 RMA (default) 3.527
2 WLS (default) 5.356 2 WLS (default) 5.303
3 FMA (default) 5.362 3 FMA (default) 5.308
4 AK (AK1) 6.057 4 RoBMA (PSMA) 6.180
5 RoBMA (PSMA) 6.193 5 AK (AK1) 6.393
6 trimfill (default) 7.557 6 trimfill (default) 7.523
7 WAAPWLS (default) 8.427 7 WAAPWLS (default) 8.385
8 mean (default) 8.586 8 mean (default) 8.577
9 SM (3PSM) 9.621 9 SM (3PSM) 9.676
10 pcurve (default) 9.918 10 pcurve (default) 9.902
11 PEESE (default) 10.952 11 PEESE (default) 11.025
12 puniform (star) 11.423 12 puniform (star) 11.485
13 WILS (default) 11.607 13 WILS (default) 11.623
14 puniform (default) 11.672 14 puniform (default) 11.670
15 PETPEESE (default) 12.933 15 PETPEESE (default) 13.025
16 SM (4PSM) 13.186 16 SM (4PSM) 13.226
17 AK (AK2) 13.602 17 AK (AK2) 13.238
18 EK (default) 15.663 18 EK (default) 15.678
19 PET (default) 15.776 19 PET (default) 15.789

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance. Methods are compared using condition-wise ranks. Direct comparison using the average empirical standard error is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of empirical standard error values on the corresponding outcome scale.

Conditional on Convergence | Replacement if Non-Convergence
Rank Method Mean Rank | Rank Method Mean Rank
1 RoBMA (PSMA) 4.582 1 RoBMA (PSMA) 4.674
2 AK (AK1) 5.339 2 AK (AK1) 5.450
3 SM (3PSM) 6.054 3 SM (3PSM) 6.134
4 RMA (default) 7.065 4 RMA (default) 7.065
5 WAAPWLS (default) 7.308 5 WAAPWLS (default) 7.389
6 puniform (star) 7.370 6 puniform (star) 7.502
7 WLS (default) 7.824 7 WLS (default) 7.874
8 SM (4PSM) 8.456 8 SM (4PSM) 8.205
9 trimfill (default) 8.900 9 trimfill (default) 8.969
10 PEESE (default) 9.795 10 PEESE (default) 9.831
11 AK (AK2) 10.743 11 AK (AK2) 10.071
12 PETPEESE (default) 10.879 12 PETPEESE (default) 10.943
13 FMA (default) 11.272 13 FMA (default) 11.310
14 EK (default) 12.253 14 EK (default) 12.270
15 WILS (default) 12.669 15 WILS (default) 12.718
16 mean (default) 12.916 16 mean (default) 12.981
17 PET (default) 13.195 17 PET (default) 13.201
18 puniform (default) 13.837 18 puniform (default) 13.870
19 pcurve (default) 18.981 19 pcurve (default) 18.981

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average interval score is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of interval score values on the corresponding outcome scale.

Conditional on Convergence | Replacement if Non-Convergence
Rank Method Value | Rank Method Value
1 RoBMA (PSMA) 0.888 1 RoBMA (PSMA) 0.888
2 AK (AK2) 0.879 2 AK (AK2) 0.879
3 SM (4PSM) 0.860 3 SM (4PSM) 0.856
4 puniform (star) 0.832 4 puniform (star) 0.832
5 SM (3PSM) 0.831 5 SM (3PSM) 0.830
6 AK (AK1) 0.791 6 AK (AK1) 0.790
7 WAAPWLS (default) 0.711 7 WAAPWLS (default) 0.711
8 EK (default) 0.706 8 EK (default) 0.706
9 PETPEESE (default) 0.695 9 PETPEESE (default) 0.695
10 PET (default) 0.689 10 PET (default) 0.689
11 RMA (default) 0.675 11 RMA (default) 0.675
12 trimfill (default) 0.673 12 trimfill (default) 0.673
13 PEESE (default) 0.656 13 PEESE (default) 0.656
14 WLS (default) 0.619 14 WLS (default) 0.619
15 WILS (default) 0.557 15 WILS (default) 0.557
16 mean (default) 0.505 16 mean (default) 0.505
17 puniform (default) 0.497 17 puniform (default) 0.499
18 FMA (default) 0.461 18 FMA (default) 0.461
19 pcurve (default) NaN 19 pcurve (default) NaN

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.
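
With the same hypothetical `lower`, `upper`, and `theta` as in the sketch above, coverage is simply the proportion of intervals that bracket the true effect:

# Proportion of 95% CIs containing the true effect (target: 0.95).
coverage <- mean(lower <= theta & theta <= upper)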

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Mean Rank Rank Method Mean Rank
1 FMA (default) 2.385 1 FMA (default) 2.370
2 WILS (default) 3.284 2 WILS (default) 3.257
3 WLS (default) 3.669 3 WLS (default) 3.636
4 WAAPWLS (default) 7.303 4 WAAPWLS (default) 7.264
5 RMA (default) 7.485 5 RMA (default) 7.423
6 PEESE (default) 7.490 6 PEESE (default) 7.469
7 trimfill (default) 7.715 7 trimfill (default) 7.615
8 mean (default) 7.828 8 mean (default) 7.831
9 RoBMA (PSMA) 9.180 9 RoBMA (PSMA) 9.222
10 PETPEESE (default) 9.550 10 PETPEESE (default) 9.596
11 AK (AK1) 10.023 11 AK (AK1) 10.029
12 SM (3PSM) 11.498 12 puniform (default) 11.471
13 puniform (default) 11.504 13 SM (3PSM) 11.513
14 puniform (star) 13.354 14 puniform (star) 13.511
15 PET (default) 13.862 15 PET (default) 13.948
16 SM (4PSM) 14.437 16 SM (4PSM) 14.328
17 AK (AK2) 14.906 17 AK (AK2) 14.883
18 EK (default) 14.987 18 EK (default) 15.090
19 pcurve (default) 18.981 19 pcurve (default) 18.981

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average CI width is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of CI width values on the corresponding outcome scale.
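
Correspondingly, the average CI width over simulation runs (same hypothetical vectors) is:

# Average length of the 95% CI; shorter is better at comparable coverage.
ci_width <- mean(upper - lower)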

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Log Value Rank Method Log Value
1 RoBMA (PSMA) 5.048 1 RoBMA (PSMA) 4.967
2 AK (AK2) 2.632 2 AK (AK2) 2.489
3 AK (AK1) 2.066 3 AK (AK1) 2.038
4 puniform (star) 1.896 4 SM (3PSM) 1.903
5 SM (3PSM) 1.894 5 puniform (star) 1.896
6 RMA (default) 1.842 6 RMA (default) 1.842
7 SM (4PSM) 1.794 7 SM (4PSM) 1.814
8 PETPEESE (default) 1.790 8 PETPEESE (default) 1.790
9 EK (default) 1.759 9 EK (default) 1.759
10 PET (default) 1.758 10 PET (default) 1.758
11 WAAPWLS (default) 1.480 11 WAAPWLS (default) 1.480
12 PEESE (default) 1.380 12 PEESE (default) 1.380
13 trimfill (default) 1.375 13 trimfill (default) 1.375
14 WLS (default) 1.367 14 WLS (default) 1.367
15 mean (default) 1.222 15 mean (default) 1.222
16 puniform (default) 1.095 16 WILS (default) 1.065
17 WILS (default) 1.065 17 puniform (default) 1.063
18 FMA (default) 1.006 18 FMA (default) 1.006
19 pcurve (default) NaN 19 pcurve (default) NaN

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.
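
In terms of the rejection rates reported further below, the log positive likelihood ratio can be sketched as follows (hypothetical 0/1 rejection indicators: `reject_alt` from runs simulated under the alternative, `reject_null` from runs simulated under the null):

power       <- mean(reject_alt)   # rejection rate when a true effect exists
type1_error <- mean(reject_null)  # rejection rate when the null is true
log_plr     <- log(power / type1_error)  # > 0: significance is informative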

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Log Value Rank Method Log Value
1 SM (3PSM) -5.093 1 AK (AK2) -5.601
2 PETPEESE (default) -5.073 2 SM (3PSM) -5.111
3 EK (default) -4.924 3 PETPEESE (default) -5.073
4 PET (default) -4.924 4 EK (default) -4.924
5 WAAPWLS (default) -4.886 5 PET (default) -4.924
6 PEESE (default) -4.863 6 WAAPWLS (default) -4.886
7 WLS (default) -4.799 7 PEESE (default) -4.863
8 trimfill (default) -4.763 8 WLS (default) -4.799
9 AK (AK1) -4.662 9 trimfill (default) -4.764
10 RMA (default) -4.572 10 AK (AK1) -4.670
11 FMA (default) -4.531 11 RMA (default) -4.572
12 puniform (star) -4.477 12 FMA (default) -4.531
13 mean (default) -4.232 13 puniform (star) -4.477
14 RoBMA (PSMA) -4.216 14 SM (4PSM) -4.377
15 SM (4PSM) -4.170 15 mean (default) -4.232
16 AK (AK2) -4.051 16 RoBMA (PSMA) -4.227
17 WILS (default) -4.013 17 WILS (default) -4.013
18 puniform (default) -3.380 18 puniform (default) -3.388
19 pcurve (default) NaN 19 pcurve (default) NaN

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.
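
Using the same hypothetical `power` and `type1_error` as in the sketch above:

log_nlr <- log((1 - power) / (1 - type1_error))  # < 0: non-significance is informative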

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 RoBMA (PSMA) 0.053 1 RoBMA (PSMA) 0.054
2 AK (AK2) 0.071 2 AK (AK2) 0.090
3 SM (4PSM) 0.167 3 SM (4PSM) 0.167
4 PET (default) 0.172 4 PET (default) 0.172
5 EK (default) 0.172 5 EK (default) 0.172
6 PETPEESE (default) 0.176 6 PETPEESE (default) 0.176
7 SM (3PSM) 0.183 7 SM (3PSM) 0.183
8 puniform (star) 0.216 8 puniform (star) 0.216
9 WAAPWLS (default) 0.233 9 WAAPWLS (default) 0.233
10 AK (AK1) 0.236 10 AK (AK1) 0.236
11 RMA (default) 0.255 11 RMA (default) 0.255
12 PEESE (default) 0.276 12 PEESE (default) 0.276
13 WLS (default) 0.296 13 WLS (default) 0.296
14 trimfill (default) 0.310 14 trimfill (default) 0.310
15 WILS (default) 0.361 15 WILS (default) 0.361
16 mean (default) 0.430 16 mean (default) 0.430
17 FMA (default) 0.518 17 FMA (default) 0.518
18 puniform (default) 0.587 18 puniform (default) 0.582
19 pcurve (default) NaN 19 pcurve (default) NaN

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 FMA (default) 0.984 1 FMA (default) 0.984
2 mean (default) 0.978 2 mean (default) 0.978
3 WLS (default) 0.971 3 WLS (default) 0.971
4 RMA (default) 0.963 4 RMA (default) 0.963
5 AK (AK1) 0.960 5 puniform (default) 0.960
6 puniform (default) 0.959 6 AK (AK1) 0.960
7 trimfill (default) 0.953 7 trimfill (default) 0.953
8 PEESE (default) 0.952 8 PEESE (default) 0.952
9 WAAPWLS (default) 0.927 9 WAAPWLS (default) 0.927
10 WILS (default) 0.918 10 WILS (default) 0.918
11 SM (3PSM) 0.911 11 AK (AK2) 0.915
12 PETPEESE (default) 0.910 12 SM (3PSM) 0.914
13 puniform (star) 0.898 13 PETPEESE (default) 0.910
14 EK (default) 0.889 14 puniform (star) 0.898
15 PET (default) 0.889 15 EK (default) 0.889
16 AK (AK2) 0.883 16 PET (default) 0.889
17 SM (4PSM) 0.843 17 SM (4PSM) 0.858
18 RoBMA (PSMA) 0.784 18 RoBMA (PSMA) 0.785
19 pcurve (default) NaN 19 pcurve (default) NaN

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

By-Condition Performance (Conditional on Method Convergence)

The results below are conditional on method convergence. Note that the methods might differ in convergence rate and are therefore not compared on the same data sets.

Raincloud plot showing convergence rates across different methods

Raincloud plot showing RMSE (Root Mean Square Error) across different methods

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average RMSE is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of RMSE values on the corresponding outcome scale.

Raincloud plot showing bias across different methods

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0. Methods are compared using condition-wise ranks. Direct comparison using the average bias is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of bias values on the corresponding outcome scale.

Raincloud plot showing empirical standard error across different methods

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance. Methods are compared using condition-wise ranks. Direct comparison using the average empirical standard error is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of empirical standard error values on the corresponding outcome scale.

Raincloud plot showing interval score across different methods

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average interval score is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of interval score values on the corresponding outcome scale.

Raincloud plot showing 95% confidence interval coverage across different methods

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Raincloud plot showing 95% confidence interval width across different methods

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average 95% CI width is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of 95% CI width values on the corresponding outcome scale.

Raincloud plot showing positive likelihood ratio across different methods

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Raincloud plot showing negative likelihood ratio across different methods

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.

Raincloud plot showing Type I Error rates across different methods

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Raincloud plot showing statistical power across different methods

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

By-Condition Performance (Replacement in Case of Non-Convergence)

The results below incorporate method replacement to handle non-convergence. If a method fails to converge, its results are replaced with the results from a simpler method (e.g., random-effects meta-analysis without publication bias adjustment). This emulates what a data analyst might do in practice when a method does not converge. However, note that these results do not correspond to “pure” method performance, as they might combine multiple different methods. See Method Replacement Strategy for details of the method replacement specification.
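
Schematically, the replacement amounts to the following per-data-set rule (a hypothetical sketch, not the package's actual code; the real fallback mapping is documented under Method Replacement Strategy):

# Where the preferred method failed to converge, fall back to the
# estimate from a simpler method (e.g., a standard random-effects model).
est_replaced <- ifelse(converged, est_method, est_fallback)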

Raincloud plot showing convergence rates across different methods

Raincloud plot showing RMSE (Root Mean Square Error) across different methods

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average RMSE is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of RMSE values on the corresponding outcome scale.

Raincloud plot showing bias across different methods

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0. Methods are compared using condition-wise ranks. Direct comparison using the average bias is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of bias values on the corresponding outcome scale.

Raincloud plot showing empirical standard error across different methods

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance. Methods are compared using condition-wise ranks. Direct comparison using the average empirical standard error is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of empirical standard error values on the corresponding outcome scale.

Raincloud plot showing interval score across different methods

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average interval score is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of interval score values on the corresponding outcome scale.

Raincloud plot showing 95% confidence interval coverage across different methods

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Raincloud plot showing 95% confidence interval width across different methods

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average 95% CI width is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of 95% CI width values on the corresponding outcome scale.

Raincloud plot showing positive likelihood ratio across different methods

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Raincloud plot showing negative likelihood ratio across different methods

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.

Raincloud plot showing Type I Error rates across different methods

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Raincloud plot showing statistical power across different methods

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

Session Info

This report was compiled on Thu Oct 23 14:07:32 2025 (UTC) using the following computational environment:

## R version 4.5.1 (2025-06-13)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## time zone: UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] scales_1.4.0                   ggdist_3.3.3                  
## [3] ggplot2_4.0.0                  PublicationBiasBenchmark_0.1.0
## 
## loaded via a namespace (and not attached):
##  [1] generics_0.1.4       sandwich_3.1-1       sass_0.4.10         
##  [4] xml2_1.4.0           stringi_1.8.7        lattice_0.22-7      
##  [7] httpcode_0.3.0       digest_0.6.37        magrittr_2.0.4      
## [10] evaluate_1.0.5       grid_4.5.1           RColorBrewer_1.1-3  
## [13] fastmap_1.2.0        jsonlite_2.0.0       crul_1.6.0          
## [16] urltools_1.7.3.1     httr_1.4.7           purrr_1.1.0         
## [19] viridisLite_0.4.2    textshaping_1.0.4    jquerylib_0.1.4     
## [22] Rdpack_2.6.4         cli_3.6.5            rlang_1.1.6         
## [25] triebeard_0.4.1      rbibutils_2.3        withr_3.0.2         
## [28] cachem_1.1.0         yaml_2.3.10          tools_4.5.1         
## [31] memoise_2.0.1        kableExtra_1.4.0     curl_7.0.0          
## [34] vctrs_0.6.5          R6_2.6.1             clubSandwich_0.6.1  
## [37] zoo_1.8-14           lifecycle_1.0.4      stringr_1.5.2       
## [40] fs_1.6.6             htmlwidgets_1.6.4    ragg_1.5.0          
## [43] pkgconfig_2.0.3      desc_1.4.3           osfr_0.2.9          
## [46] pkgdown_2.1.3        bslib_0.9.0          pillar_1.11.1       
## [49] gtable_0.3.6         Rcpp_1.1.0           glue_1.8.0          
## [52] systemfonts_1.3.1    xfun_0.53            tibble_3.3.0        
## [55] rstudioapi_0.17.1    knitr_1.50           farver_2.1.2        
## [58] htmltools_0.5.8.1    labeling_0.4.3       svglite_2.2.2       
## [61] rmarkdown_2.30       compiler_4.5.1       S7_0.2.0            
## [64] distributional_0.5.0