
Complete Results

These results are based on the Alinaghi (2018) data-generating mechanism with a total of 81 conditions.

Average Performance

Method performance measures are aggregated across all simulated conditions to provide an overall impression of method performance. However, keep in mind that a method with a high overall ranking is not necessarily the “best” method for a particular application. To select a suitable method for your application, consider also non-aggregated performance measures in conditions most relevant to your application.

Conditional on Convergence | Replacement if Non-Convergence
Rank Method Value | Rank Method Value
1 RoBMA (PSMA) 0.216 1 RoBMA (PSMA) 0.216
2 AK (AK2) 0.229 2 trimfill (default) 0.236
3 trimfill (default) 0.236 3 AK (AK2) 0.245
4 AK (AK1) 0.255 4 AK (AK1) 0.255
5 SM (4PSM) 0.263 5 SM (4PSM) 0.263
6 SM (3PSM) 0.310 6 SM (3PSM) 0.310
7 puniform (star) 0.316 7 puniform (star) 0.316
8 RMA (default) 0.320 8 RMA (default) 0.320
9 FMA (default) 0.345 9 FMA (default) 0.345
9 WLS (default) 0.345 9 WLS (default) 0.345
11 PEESE (default) 0.359 11 PEESE (default) 0.359
12 PETPEESE (default) 0.363 12 PETPEESE (default) 0.363
13 WAAPWLS (default) 0.372 13 WAAPWLS (default) 0.372
14 EK (default) 0.437 14 EK (default) 0.437
15 PET (default) 0.438 15 PET (default) 0.438
16 mean (default) 0.496 16 mean (default) 0.496
17 WILS (default) 0.571 17 WILS (default) 0.571
18 puniform (default) 0.643 18 puniform (default) 0.643
19 pcurve (default) 1.376 19 pcurve (default) 1.376

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method.
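
As a concrete sketch of the definition above (pure Python; the estimate values are hypothetical, not taken from the study), RMSE can be computed from per-run estimates like this:

```python
import math

def rmse(estimates, true_effect):
    """Root mean square error: the square root of the average squared
    difference between the meta-analytic estimates and the true effect."""
    return math.sqrt(
        sum((est - true_effect) ** 2 for est in estimates) / len(estimates)
    )

# Hypothetical estimates from five simulation runs of one condition:
estimates = [0.12, 0.08, 0.15, 0.10, 0.05]
print(rmse(estimates, true_effect=0.10))  # about 0.034
```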

Conditional on Convergence | Replacement if Non-Convergence
Rank Method Value | Rank Method Value
1 SM (4PSM) 0.025 1 SM (4PSM) 0.025
2 PET (default) 0.078 2 PET (default) 0.078
3 EK (default) 0.080 3 EK (default) 0.080
4 AK (AK2) 0.086 4 trimfill (default) 0.096
5 trimfill (default) 0.096 5 SM (3PSM) 0.098
6 SM (3PSM) 0.098 6 RoBMA (PSMA) 0.099
7 RoBMA (PSMA) 0.099 7 AK (AK2) 0.108
8 puniform (star) 0.111 8 puniform (star) 0.111
9 PETPEESE (default) 0.112 9 PETPEESE (default) 0.112
10 WAAPWLS (default) 0.115 10 WAAPWLS (default) 0.115
11 PEESE (default) 0.116 11 PEESE (default) 0.116
12 FMA (default) 0.131 12 FMA (default) 0.131
12 WLS (default) 0.131 12 WLS (default) 0.131
14 AK (AK1) 0.183 14 AK (AK1) 0.182
15 WILS (default) -0.183 15 WILS (default) -0.183
16 RMA (default) 0.262 16 RMA (default) 0.262
17 mean (default) 0.429 17 mean (default) 0.429
18 puniform (default) 0.606 18 puniform (default) 0.606
19 pcurve (default) -1.219 19 pcurve (default) -1.219

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0.

Conditional on Convergence | Replacement if Non-Convergence
Rank Method Value | Rank Method Value
1 pcurve (default) 0.056 1 pcurve (default) 0.056
2 RMA (default) 0.124 2 RMA (default) 0.124
3 AK (AK1) 0.132 3 AK (AK1) 0.132
4 RoBMA (PSMA) 0.138 4 RoBMA (PSMA) 0.138
5 mean (default) 0.149 5 mean (default) 0.149
6 puniform (default) 0.155 6 puniform (default) 0.155
7 trimfill (default) 0.158 7 trimfill (default) 0.158
8 puniform (star) 0.160 8 puniform (star) 0.160
9 SM (3PSM) 0.161 9 SM (3PSM) 0.161
10 SM (4PSM) 0.191 10 SM (4PSM) 0.191
11 AK (AK2) 0.192 11 AK (AK2) 0.198
12 FMA (default) 0.286 12 FMA (default) 0.286
13 WLS (default) 0.286 13 WLS (default) 0.286
14 PEESE (default) 0.307 14 PEESE (default) 0.307
15 PETPEESE (default) 0.312 15 PETPEESE (default) 0.312
16 WAAPWLS (default) 0.324 16 WAAPWLS (default) 0.324
17 EK (default) 0.395 17 EK (default) 0.395
18 PET (default) 0.395 18 PET (default) 0.395
19 WILS (default) 0.453 19 WILS (default) 0.453

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance.
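
Bias and empirical SE connect back to RMSE through the decomposition MSE = bias^2 + empirical SE^2, which is exact when the SE uses the population-variance form. A minimal sketch with hypothetical numbers (not taken from the study):

```python
import math

def bias(estimates, true_effect):
    # Average signed deviation of the estimates from the true effect.
    return sum(estimates) / len(estimates) - true_effect

def empirical_se(estimates):
    # Standard deviation of the estimates across runs
    # (population form, ddof = 0, so the RMSE identity below is exact).
    mean = sum(estimates) / len(estimates)
    return math.sqrt(sum((e - mean) ** 2 for e in estimates) / len(estimates))

# Hypothetical estimates; verify MSE = bias^2 + empirical SE^2.
estimates = [0.12, 0.08, 0.15, 0.10, 0.05]
true_effect = 0.08
mse = sum((e - true_effect) ** 2 for e in estimates) / len(estimates)
assert math.isclose(mse, bias(estimates, true_effect) ** 2
                    + empirical_se(estimates) ** 2)
```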

Conditional on Convergence | Replacement if Non-Convergence
Rank Method Value | Rank Method Value
1 RoBMA (PSMA) 2.220 1 RoBMA (PSMA) 2.220
2 AK (AK2) 3.037 2 AK (AK2) 3.550
3 FMA (default) 3.780 3 FMA (default) 3.780
4 SM (4PSM) 3.999 4 SM (4PSM) 3.999
5 trimfill (default) 4.897 5 trimfill (default) 4.897
6 AK (AK1) 5.772 6 AK (AK1) 5.764
7 RMA (default) 6.200 7 RMA (default) 6.200
8 SM (3PSM) 6.625 8 SM (3PSM) 6.625
9 puniform (star) 7.284 9 puniform (star) 7.284
10 WAAPWLS (default) 7.800 10 WAAPWLS (default) 7.800
11 WLS (default) 8.687 11 WLS (default) 8.687
12 PEESE (default) 8.983 12 PEESE (default) 8.983
13 PETPEESE (default) 9.060 13 PETPEESE (default) 9.060
14 EK (default) 10.612 14 EK (default) 10.612
15 PET (default) 10.651 15 PET (default) 10.651
16 mean (default) 14.940 16 mean (default) 14.940
17 WILS (default) 16.151 17 WILS (default) 16.151
18 puniform (default) 20.205 18 puniform (default) 20.205
19 pcurve (default) NaN 19 pcurve (default) NaN

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method.
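
One common formalization of this idea is the interval score of Gneiting and Raftery: the interval's width plus a penalty of 2/alpha per unit of distance by which the interval misses the true value. The sketch below implements that definition; the exact variant used in the simulation may differ.

```python
def interval_score(lower, upper, truth, alpha=0.05):
    """Interval score for a central (1 - alpha) confidence interval:
    width plus a 2/alpha-scaled penalty for missing the true value
    (a common definition; the study's exact variant may differ)."""
    score = upper - lower
    if truth < lower:
        score += (2 / alpha) * (lower - truth)
    elif truth > upper:
        score += (2 / alpha) * (truth - upper)
    return score

# Covering interval: the score is just the width.
covering = interval_score(-0.1, 0.3, truth=0.1)
# Missing the truth by 0.1 adds a penalty of (2 / 0.05) * 0.1 = 4.
missing = interval_score(0.2, 0.4, truth=0.1)
```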

Conditional on Convergence | Replacement if Non-Convergence
Rank Method Value | Rank Method Value
1 RoBMA (PSMA) 0.868 1 RoBMA (PSMA) 0.868
2 AK (AK2) 0.802 2 AK (AK2) 0.759
3 SM (4PSM) 0.749 3 SM (4PSM) 0.749
4 AK (AK1) 0.652 4 AK (AK1) 0.651
5 SM (3PSM) 0.625 5 SM (3PSM) 0.625
6 trimfill (default) 0.614 6 trimfill (default) 0.614
7 RMA (default) 0.597 7 RMA (default) 0.597
8 puniform (star) 0.597 8 puniform (star) 0.597
9 FMA (default) 0.597 9 FMA (default) 0.597
10 WAAPWLS (default) 0.555 10 WAAPWLS (default) 0.555
11 PETPEESE (default) 0.467 11 PETPEESE (default) 0.467
12 PEESE (default) 0.455 12 PEESE (default) 0.455
13 WLS (default) 0.441 13 WLS (default) 0.441
14 EK (default) 0.412 14 EK (default) 0.412
15 PET (default) 0.361 15 PET (default) 0.361
16 puniform (default) 0.344 16 puniform (default) 0.344
17 WILS (default) 0.307 17 WILS (default) 0.307
18 mean (default) 0.299 18 mean (default) 0.299
19 pcurve (default) NaN 19 pcurve (default) NaN

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Conditional on Convergence | Replacement if Non-Convergence
Rank Method Value | Rank Method Value
1 WILS (default) 0.154 1 WILS (default) 0.154
2 WLS (default) 0.163 2 WLS (default) 0.163
3 PEESE (default) 0.168 3 PEESE (default) 0.168
4 PETPEESE (default) 0.170 4 PETPEESE (default) 0.170
5 EK (default) 0.208 5 EK (default) 0.208
6 PET (default) 0.208 6 PET (default) 0.208
7 trimfill (default) 0.229 7 trimfill (default) 0.229
8 AK (AK1) 0.246 8 AK (AK1) 0.246
9 mean (default) 0.247 9 mean (default) 0.247
10 WAAPWLS (default) 0.289 10 WAAPWLS (default) 0.289
11 puniform (star) 0.290 11 puniform (star) 0.290
12 SM (3PSM) 0.317 12 SM (3PSM) 0.317
13 puniform (default) 0.321 13 puniform (default) 0.321
14 SM (4PSM) 0.395 14 AK (AK2) 0.393
15 AK (AK2) 0.404 15 SM (4PSM) 0.395
16 RMA (default) 0.448 16 RMA (default) 0.448
17 RoBMA (PSMA) 0.494 17 RoBMA (PSMA) 0.494
18 FMA (default) 0.980 18 FMA (default) 0.980
19 pcurve (default) NaN 19 pcurve (default) NaN

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method.
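
Coverage and average width are both simple summaries of the per-run confidence intervals. A sketch with hypothetical intervals (not taken from the study):

```python
def coverage_and_width(intervals, truth):
    """Proportion of CIs containing the true effect, and their
    average width, across simulation runs."""
    hits = sum(1 for lo, hi in intervals if lo <= truth <= hi)
    widths = [hi - lo for lo, hi in intervals]
    return hits / len(intervals), sum(widths) / len(widths)

# Hypothetical 95% CIs from four runs; truth = 0.1 lies in three of them.
cis = [(-0.05, 0.25), (0.02, 0.30), (0.12, 0.40), (-0.10, 0.20)]
cov, width = coverage_and_width(cis, truth=0.1)  # cov = 0.75
```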

Conditional on Convergence | Replacement if Non-Convergence
Rank Method Log Value | Rank Method Log Value
1 RoBMA (PSMA) 5.904 1 RoBMA (PSMA) 5.904
2 AK (AK2) 2.307 2 AK (AK2) 2.211
3 RMA (default) 2.161 3 RMA (default) 2.161
4 AK (AK1) 1.616 4 AK (AK1) 1.618
5 SM (4PSM) 1.521 5 SM (4PSM) 1.521
6 trimfill (default) 1.487 6 trimfill (default) 1.489
7 EK (default) 1.101 7 EK (default) 1.101
7 PET (default) 1.101 7 PET (default) 1.101
9 PETPEESE (default) 1.059 9 PETPEESE (default) 1.059
10 mean (default) 1.039 10 mean (default) 1.039
11 FMA (default) 1.010 11 FMA (default) 1.010
12 WAAPWLS (default) 0.905 12 WAAPWLS (default) 0.905
13 SM (3PSM) 0.819 13 SM (3PSM) 0.819
14 WLS (default) 0.811 14 WLS (default) 0.811
15 PEESE (default) 0.795 15 PEESE (default) 0.795
16 puniform (default) 0.749 16 puniform (default) 0.749
17 puniform (star) 0.742 17 puniform (star) 0.742
18 WILS (default) 0.446 18 WILS (default) 0.446
19 pcurve (default) NaN 19 pcurve (default) NaN

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Conditional on Convergence | Replacement if Non-Convergence
Rank Method Log Value | Rank Method Log Value
1 RoBMA (PSMA) -6.322 1 AK (AK2) -6.345
2 AK (AK2) -5.588 2 RoBMA (PSMA) -6.322
3 SM (4PSM) -5.405 3 SM (4PSM) -5.405
4 WAAPWLS (default) -5.289 4 WAAPWLS (default) -5.289
5 PETPEESE (default) -5.199 5 PETPEESE (default) -5.199
6 EK (default) -5.148 6 EK (default) -5.148
6 PET (default) -5.148 6 PET (default) -5.148
8 RMA (default) -5.086 8 AK (AK1) -5.101
9 AK (AK1) -5.078 9 RMA (default) -5.086
10 trimfill (default) -4.937 10 trimfill (default) -4.936
11 mean (default) -4.588 11 mean (default) -4.588
12 WLS (default) -4.347 12 WLS (default) -4.347
13 PEESE (default) -4.310 13 PEESE (default) -4.310
14 SM (3PSM) -3.941 14 SM (3PSM) -3.941
15 FMA (default) -3.575 15 FMA (default) -3.575
16 puniform (star) -3.410 16 puniform (star) -3.410
17 WILS (default) -3.167 17 WILS (default) -3.167
18 puniform (default) -2.490 18 puniform (default) -2.490
19 pcurve (default) NaN 19 pcurve (default) NaN

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.
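
Under the standard diagnostic-test definitions (an assumption; the study may compute them differently), both likelihood ratios follow directly from power and the type I error rate: LR+ = power / type I error, and LR- = (1 - power) / (1 - type I error). A sketch with hypothetical rates:

```python
import math

def likelihood_ratios(power, type1):
    """Log positive and log negative likelihood ratios, computed from
    power and the type I error rate (standard diagnostic-test form)."""
    log_lr_pos = math.log(power / type1)
    log_lr_neg = math.log((1 - power) / (1 - type1))
    return log_lr_pos, log_lr_neg

# Hypothetical rates: 80% power at a 5% type I error rate.
log_lr_pos, log_lr_neg = likelihood_ratios(power=0.80, type1=0.05)
# A useful test has log LR+ > 0 and log LR- < 0.
```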

Conditional on Convergence | Replacement if Non-Convergence
Rank Method Value | Rank Method Value
1 RoBMA (PSMA) 0.092 1 RoBMA (PSMA) 0.092
2 AK (AK2) 0.196 2 AK (AK2) 0.206
3 RMA (default) 0.359 3 RMA (default) 0.359
4 SM (4PSM) 0.372 4 SM (4PSM) 0.372
5 AK (AK1) 0.452 5 AK (AK1) 0.452
6 trimfill (default) 0.484 6 trimfill (default) 0.484
7 FMA (default) 0.535 7 FMA (default) 0.535
8 PETPEESE (default) 0.536 8 PETPEESE (default) 0.536
9 EK (default) 0.545 9 EK (default) 0.545
9 PET (default) 0.545 9 PET (default) 0.545
11 mean (default) 0.552 11 mean (default) 0.552
12 WAAPWLS (default) 0.574 12 WAAPWLS (default) 0.574
13 SM (3PSM) 0.637 13 SM (3PSM) 0.637
14 WLS (default) 0.645 14 WLS (default) 0.645
15 PEESE (default) 0.647 15 PEESE (default) 0.647
16 puniform (star) 0.697 16 puniform (star) 0.697
17 puniform (default) 0.707 17 puniform (default) 0.707
18 WILS (default) 0.775 18 WILS (default) 0.775
19 pcurve (default) NaN 19 pcurve (default) NaN

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Conditional on Convergence | Replacement if Non-Convergence
Rank Method Value | Rank Method Value
1 puniform (default) 1.000 1 puniform (default) 1.000
2 mean (default) 0.998 2 mean (default) 0.998
3 AK (AK1) 0.997 3 AK (AK1) 0.997
4 WLS (default) 0.993 4 WLS (default) 0.993
5 PEESE (default) 0.993 5 PEESE (default) 0.993
6 trimfill (default) 0.992 6 trimfill (default) 0.992
7 EK (default) 0.988 7 EK (default) 0.988
7 PET (default) 0.988 7 PET (default) 0.988
9 PETPEESE (default) 0.986 9 PETPEESE (default) 0.986
10 RMA (default) 0.985 10 RMA (default) 0.985
11 puniform (star) 0.983 11 puniform (star) 0.983
12 WILS (default) 0.982 12 WILS (default) 0.982
13 WAAPWLS (default) 0.976 13 WAAPWLS (default) 0.976
14 AK (AK2) 0.974 14 AK (AK2) 0.974
15 SM (3PSM) 0.971 15 SM (3PSM) 0.971
16 SM (4PSM) 0.957 16 SM (4PSM) 0.957
17 RoBMA (PSMA) 0.945 17 RoBMA (PSMA) 0.945
18 FMA (default) 0.928 18 FMA (default) 0.928
19 pcurve (default) NaN 19 pcurve (default) NaN

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

By-Condition Performance (Conditional on Method Convergence)

The results below are conditional on method convergence. Note that the methods might differ in convergence rate and are therefore not compared on the same data sets.

Raincloud plot showing convergence rates across different methods

Raincloud plot showing RMSE (Root Mean Square Error) across different methods

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method. Values larger than 0.5 are visualized as 0.5.

Raincloud plot showing bias across different methods

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0. Values lower than -0.5 or larger than 0.5 are visualized as -0.5 and 0.5 respectively.

Raincloud plot showing empirical SE across different methods

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance. Values larger than 0.5 are visualized as 0.5.

Raincloud plot showing interval score across different methods

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method. Values larger than 100 are visualized as 100.

Raincloud plot showing 95% confidence interval coverage across different methods

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Raincloud plot showing 95% confidence interval width across different methods

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method.

Raincloud plot showing positive likelihood ratio across different methods

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Raincloud plot showing negative likelihood ratio across different methods

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.

Raincloud plot showing Type I Error rates across different methods

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Raincloud plot showing statistical power across different methods

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

By-Condition Performance (Replacement in Case of Non-Convergence)

The results below incorporate method replacement to handle non-convergence: if a method fails to converge, its results are replaced with those from a simpler method (e.g., a random-effects meta-analysis without publication bias adjustment). This emulates what a data analyst might do in practice when a method does not converge. Note, however, that these results do not reflect “pure” method performance, as they may combine results from multiple methods. See Method Replacement Strategy for details of the replacement specification.

Raincloud plot showing convergence rates across different methods

Raincloud plot showing RMSE (Root Mean Square Error) across different methods

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method. Values larger than 0.5 are visualized as 0.5.

Raincloud plot showing bias across different methods

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0. Values lower than -0.5 or larger than 0.5 are visualized as -0.5 and 0.5 respectively.

Raincloud plot showing empirical SE across different methods

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance. Values larger than 0.5 are visualized as 0.5.

Raincloud plot showing interval score across different methods

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method. Values larger than 100 are visualized as 100.

Raincloud plot showing 95% confidence interval coverage across different methods

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Raincloud plot showing 95% confidence interval width across different methods

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method.

Raincloud plot showing positive likelihood ratio across different methods

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Raincloud plot showing negative likelihood ratio across different methods

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.

Raincloud plot showing Type I Error rates across different methods

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Raincloud plot showing statistical power across different methods

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

Subset: Fixed Effects

These results are based on the Alinaghi (2018) data-generating mechanism with a total of 27 conditions.

Average Performance

Method performance measures are aggregated across all simulated conditions to provide an overall impression of method performance. However, keep in mind that a method with a high overall ranking is not necessarily the “best” method for a particular application. To select a suitable method for your application, consider also non-aggregated performance measures in conditions most relevant to your application.

Conditional on Convergence | Replacement if Non-Convergence
Rank Method Value | Rank Method Value
1 RoBMA (PSMA) 0.008 1 RoBMA (PSMA) 0.008
2 AK (AK2) 0.009 2 PETPEESE (default) 0.010
3 PETPEESE (default) 0.010 3 PEESE (default) 0.012
4 PEESE (default) 0.012 4 WAAPWLS (default) 0.012
5 WAAPWLS (default) 0.012 5 WLS (default) 0.014
6 WLS (default) 0.014 6 FMA (default) 0.014
7 FMA (default) 0.014 7 EK (default) 0.015
8 trimfill (default) 0.015 8 trimfill (default) 0.015
9 EK (default) 0.015 9 WILS (default) 0.016
10 WILS (default) 0.016 10 SM (4PSM) 0.017
11 SM (4PSM) 0.017 11 AK (AK2) 0.017
12 PET (default) 0.020 12 PET (default) 0.020
13 RMA (default) 0.022 13 RMA (default) 0.022
14 AK (AK1) 0.037 14 AK (AK1) 0.036
15 SM (3PSM) 0.041 15 SM (3PSM) 0.041
16 puniform (star) 0.047 16 puniform (star) 0.047
17 puniform (default) 0.080 17 puniform (default) 0.080
18 mean (default) 0.348 18 mean (default) 0.348
19 pcurve (default) 1.340 19 pcurve (default) 1.340

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method.

Conditional on Convergence | Replacement if Non-Convergence
Rank Method Value | Rank Method Value
1 AK (AK2) 0.000 1 RoBMA (PSMA) 0.000
2 RoBMA (PSMA) 0.000 2 PETPEESE (default) -0.001
3 PETPEESE (default) -0.001 3 PEESE (default) 0.001
4 PEESE (default) 0.001 4 trimfill (default) 0.002
5 trimfill (default) 0.002 5 WILS (default) -0.003
6 WILS (default) -0.003 6 WAAPWLS (default) 0.003
7 WAAPWLS (default) 0.003 7 AK (AK2) 0.003
8 SM (4PSM) -0.006 8 SM (4PSM) -0.006
9 FMA (default) 0.007 9 FMA (default) 0.007
10 WLS (default) 0.007 10 WLS (default) 0.007
11 EK (default) -0.008 11 EK (default) -0.008
12 RMA (default) 0.012 12 RMA (default) 0.012
13 PET (default) -0.013 13 PET (default) -0.013
14 puniform (default) 0.018 14 puniform (default) 0.018
15 SM (3PSM) 0.019 15 SM (3PSM) 0.019
16 puniform (star) 0.025 16 puniform (star) 0.025
17 AK (AK1) 0.028 17 AK (AK1) 0.027
18 mean (default) 0.318 18 mean (default) 0.318
19 pcurve (default) -1.305 19 pcurve (default) -1.305

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0.

Conditional on Convergence | Replacement if Non-Convergence
Rank Method Value | Rank Method Value
1 RoBMA (PSMA) 0.008 1 RoBMA (PSMA) 0.008
2 AK (AK2) 0.009 2 PEESE (default) 0.010
3 PEESE (default) 0.010 3 WLS (default) 0.010
4 WLS (default) 0.010 4 FMA (default) 0.010
5 FMA (default) 0.010 5 PETPEESE (default) 0.010
6 PETPEESE (default) 0.010 6 WAAPWLS (default) 0.010
7 WAAPWLS (default) 0.010 7 EK (default) 0.011
8 EK (default) 0.011 8 WILS (default) 0.011
9 WILS (default) 0.011 9 PET (default) 0.011
10 PET (default) 0.011 10 trimfill (default) 0.013
11 trimfill (default) 0.013 11 SM (4PSM) 0.014
12 SM (4PSM) 0.014 12 RMA (default) 0.015
13 RMA (default) 0.015 13 AK (AK2) 0.017
14 puniform (star) 0.017 14 puniform (star) 0.017
15 AK (AK1) 0.018 15 AK (AK1) 0.019
16 SM (3PSM) 0.021 16 SM (3PSM) 0.021
17 pcurve (default) 0.036 17 pcurve (default) 0.036
18 mean (default) 0.068 18 mean (default) 0.068
19 puniform (default) 0.075 19 puniform (default) 0.075

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance.

Conditional on Convergence | Replacement if Non-Convergence
Rank Method Value | Rank Method Value
1 RoBMA (PSMA) 0.035 1 RoBMA (PSMA) 0.035
2 AK (AK2) 0.043 2 PETPEESE (default) 0.053
3 PETPEESE (default) 0.053 3 PEESE (default) 0.096
4 PEESE (default) 0.096 4 AK (AK2) 0.100
5 SM (4PSM) 0.101 5 SM (4PSM) 0.101
6 WAAPWLS (default) 0.108 6 WAAPWLS (default) 0.108
7 trimfill (default) 0.126 7 trimfill (default) 0.127
8 WLS (default) 0.131 8 WLS (default) 0.131
9 FMA (default) 0.136 9 FMA (default) 0.136
10 EK (default) 0.148 10 EK (default) 0.148
11 WILS (default) 0.172 11 WILS (default) 0.172
12 PET (default) 0.242 12 PET (default) 0.242
13 RMA (default) 0.289 13 RMA (default) 0.289
14 puniform (default) 0.397 14 puniform (default) 0.397
15 AK (AK1) 0.692 15 AK (AK1) 0.668
16 SM (3PSM) 0.728 16 SM (3PSM) 0.728
17 puniform (star) 1.047 17 puniform (star) 1.047
18 mean (default) 10.014 18 mean (default) 10.014
19 pcurve (default) NaN 19 pcurve (default) NaN

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method.

Conditional on Convergence | Replacement if Non-Convergence
Rank Method Value | Rank Method Value
1 RoBMA (PSMA) 0.953 1 RoBMA (PSMA) 0.953
2 AK (AK2) 0.952 2 AK (AK2) 0.942
3 SM (4PSM) 0.939 3 SM (4PSM) 0.939
4 puniform (default) 0.927 4 puniform (default) 0.927
5 PETPEESE (default) 0.920 5 PETPEESE (default) 0.920
6 WAAPWLS (default) 0.909 6 WAAPWLS (default) 0.909
7 trimfill (default) 0.905 7 trimfill (default) 0.904
8 AK (AK1) 0.903 8 AK (AK1) 0.901
9 PEESE (default) 0.886 9 PEESE (default) 0.886
10 SM (3PSM) 0.885 10 SM (3PSM) 0.885
11 puniform (star) 0.879 11 puniform (star) 0.879
12 WLS (default) 0.849 12 WLS (default) 0.849
13 RMA (default) 0.839 13 RMA (default) 0.839
14 FMA (default) 0.829 14 FMA (default) 0.829
15 EK (default) 0.757 15 EK (default) 0.757
16 WILS (default) 0.748 16 WILS (default) 0.748
17 PET (default) 0.603 17 PET (default) 0.603
18 mean (default) 0.378 18 mean (default) 0.378
19 pcurve (default) NaN 19 pcurve (default) NaN

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Conditional on Convergence | Replacement if Non-Convergence
Rank Method Value | Rank Method Value
1 RoBMA (PSMA) 0.029 1 RoBMA (PSMA) 0.029
2 WILS (default) 0.033 2 WILS (default) 0.033
3 FMA (default) 0.034 3 FMA (default) 0.034
4 PEESE (default) 0.035 4 PEESE (default) 0.035
5 WAAPWLS (default) 0.036 5 WAAPWLS (default) 0.036
6 PETPEESE (default) 0.036 6 PETPEESE (default) 0.036
7 AK (AK2) 0.036 7 WLS (default) 0.036
8 WLS (default) 0.036 8 EK (default) 0.038
9 EK (default) 0.038 9 AK (AK2) 0.038
10 PET (default) 0.040 10 PET (default) 0.040
11 SM (4PSM) 0.045 11 SM (4PSM) 0.045
12 trimfill (default) 0.053 12 trimfill (default) 0.053
13 AK (AK1) 0.054 13 AK (AK1) 0.053
14 RMA (default) 0.056 14 RMA (default) 0.056
15 SM (3PSM) 0.058 15 SM (3PSM) 0.058
16 puniform (star) 0.060 16 puniform (star) 0.060
17 mean (default) 0.247 17 mean (default) 0.247
18 puniform (default) 0.326 18 puniform (default) 0.326
19 pcurve (default) NaN 19 pcurve (default) NaN

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method.

Conditional on Convergence | Replacement if Non-Convergence
Rank Method Log Value | Rank Method Log Value
1 RoBMA (PSMA) 7.601 1 RoBMA (PSMA) 7.601
2 AK (AK2) 3.114 2 AK (AK2) 2.832
3 EK (default) 2.803 3 EK (default) 2.803
3 PET (default) 2.803 3 PET (default) 2.803
5 PETPEESE (default) 2.631 5 PETPEESE (default) 2.631
6 RMA (default) 2.357 6 RMA (default) 2.357
7 puniform (default) 2.246 7 puniform (default) 2.246
8 AK (AK1) 2.233 8 AK (AK1) 2.236
9 trimfill (default) 2.164 9 trimfill (default) 2.169
10 SM (4PSM) 2.158 10 SM (4PSM) 2.158
11 WAAPWLS (default) 1.936 11 WAAPWLS (default) 1.936
12 WLS (default) 1.900 12 WLS (default) 1.900
13 PEESE (default) 1.860 13 PEESE (default) 1.860
14 mean (default) 1.671 14 mean (default) 1.671
15 FMA (default) 1.398 15 FMA (default) 1.398
16 WILS (default) 1.255 16 WILS (default) 1.255
17 puniform (star) 1.106 17 puniform (star) 1.106
18 SM (3PSM) 1.100 18 SM (3PSM) 1.100
19 pcurve (default) NaN 19 pcurve (default) NaN

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Conditional on Convergence | Replacement if Non-Convergence
Rank Method Log Value | Rank Method Log Value
1 RoBMA (PSMA) -7.601 1 RoBMA (PSMA) -7.601
2 EK (default) -7.539 2 EK (default) -7.539
2 PET (default) -7.539 2 PET (default) -7.539
4 PETPEESE (default) -7.523 4 AK (AK2) -7.531
5 puniform (default) -7.469 5 PETPEESE (default) -7.523
6 SM (4PSM) -7.359 6 puniform (default) -7.469
7 WAAPWLS (default) -6.809 7 SM (4PSM) -7.359
8 AK (AK2) -6.178 8 WAAPWLS (default) -6.809
9 AK (AK1) -6.073 9 AK (AK1) -6.141
10 SM (3PSM) -5.936 10 SM (3PSM) -5.936
11 RMA (default) -5.045 11 RMA (default) -5.045
12 trimfill (default) -5.043 12 trimfill (default) -5.040
13 WLS (default) -5.028 13 WLS (default) -5.028
14 PEESE (default) -5.026 14 PEESE (default) -5.026
15 mean (default) -4.983 15 mean (default) -4.983
16 FMA (default) -4.956 16 FMA (default) -4.956
17 WILS (default) -4.923 17 WILS (default) -4.923
18 puniform (star) -4.716 18 puniform (star) -4.716
19 pcurve (default) NaN 19 pcurve (default) NaN

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.

Conditional on Convergence | Replacement if Non-Convergence
Rank Method Value | Rank Method Value
1 RoBMA (PSMA) 0.000 1 RoBMA (PSMA) 0.000
2 AK (AK2) 0.044 2 EK (default) 0.060
3 EK (default) 0.060 2 PET (default) 0.060
3 PET (default) 0.060 4 AK (AK2) 0.067
5 PETPEESE (default) 0.075 5 PETPEESE (default) 0.075
6 puniform (default) 0.121 6 puniform (default) 0.121
7 SM (4PSM) 0.187 7 SM (4PSM) 0.187
8 WAAPWLS (default) 0.337 8 WAAPWLS (default) 0.337
9 AK (AK1) 0.353 9 AK (AK1) 0.353
10 RMA (default) 0.356 10 RMA (default) 0.356
11 trimfill (default) 0.360 11 trimfill (default) 0.360
12 WLS (default) 0.372 12 WLS (default) 0.372
13 PEESE (default) 0.374 13 PEESE (default) 0.374
14 mean (default) 0.410 14 mean (default) 0.410
15 FMA (default) 0.433 15 FMA (default) 0.433
16 WILS (default) 0.459 16 WILS (default) 0.459
17 SM (3PSM) 0.557 17 SM (3PSM) 0.557
18 puniform (star) 0.563 18 puniform (star) 0.563
19 pcurve (default) NaN 19 pcurve (default) NaN

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 AK (AK1) 1 1 AK (AK1) 1
1 AK (AK2) 1 1 AK (AK2) 1
1 EK (default) 1 1 EK (default) 1
1 FMA (default) 1 1 FMA (default) 1
1 mean (default) 1 1 mean (default) 1
1 PEESE (default) 1 1 PEESE (default) 1
1 PET (default) 1 1 PET (default) 1
1 PETPEESE (default) 1 1 PETPEESE (default) 1
1 puniform (default) 1 1 puniform (default) 1
1 puniform (star) 1 1 puniform (star) 1
1 RMA (default) 1 1 RMA (default) 1
1 RoBMA (PSMA) 1 1 RoBMA (PSMA) 1
1 SM (3PSM) 1 1 SM (3PSM) 1
1 SM (4PSM) 1 1 SM (4PSM) 1
1 trimfill (default) 1 1 trimfill (default) 1
1 WAAPWLS (default) 1 1 WAAPWLS (default) 1
1 WILS (default) 1 1 WILS (default) 1
1 WLS (default) 1 1 WLS (default) 1
19 pcurve (default) NaN 19 pcurve (default) NaN

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

By-Condition Performance (Conditional on Method Convergence)

The results below are conditional on method convergence. Note that the methods might differ in convergence rate and are therefore not compared on the same data sets.

Raincloud plot showing convergence rates across different methods

Raincloud plot showing RMSE (Root Mean Square Error) across different methods

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method. Values larger than 0.5 are visualized as 0.5.
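The RMSE definition above translates directly into code. A minimal Python sketch (names are illustrative, not part of the benchmark):

```python
def rmse(estimates, true_effect):
    # square root of the average squared deviation from the true effect
    squared = [(e - true_effect) ** 2 for e in estimates]
    return (sum(squared) / len(squared)) ** 0.5

# Unbiased but noisy estimates: all of the RMSE comes from spread.
print(rmse([0.5, 1.5], true_effect=1.0))  # 0.5
```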

Raincloud plot showing bias across different methods

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0. Values lower than -0.5 or larger than 0.5 are visualized as -0.5 and 0.5 respectively.
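A corresponding sketch for bias, under the same illustrative setup:

```python
def bias(estimates, true_effect):
    # average estimate minus the true effect
    return sum(estimates) / len(estimates) - true_effect

# Over- and under-estimates cancel: zero bias despite estimation error.
print(bias([0.5, 1.5], true_effect=1.0))  # 0.0
```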

Raincloud plot showing empirical SE across different methods

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance. Values larger than 0.5 are visualized as 0.5.
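The empirical SE can be sketched the same way. With the n denominator shown here (an assumption; simulation reports often use n - 1, in which case the identity below holds only approximately), the mean squared error decomposes exactly into squared bias plus squared empirical SE, which is the sense in which RMSE "combines bias and empirical SE":

```python
def empirical_se(estimates):
    # standard deviation of the meta-analytic estimate across runs
    # (n denominator here; n - 1 is also common in practice)
    mean = sum(estimates) / len(estimates)
    return (sum((e - mean) ** 2 for e in estimates) / len(estimates)) ** 0.5

estimates, true_effect = [0.5, 1.5], 1.25
avg_bias = sum(estimates) / len(estimates) - true_effect      # -0.25
mse = sum((e - true_effect) ** 2 for e in estimates) / len(estimates)
print(mse == avg_bias ** 2 + empirical_se(estimates) ** 2)  # True
```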

Raincloud plot showing interval score across different methods

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method. Values larger than 100 are visualized as 100.
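One common scoring rule matching this description is the Gneiting-Raftery interval score: the interval's width plus a penalty, scaled by 2/alpha, for missing the true value. This is a sketch of that rule; the benchmark's exact scoring may differ in scaling:

```python
def interval_score(lower, upper, truth, alpha=0.05):
    # width plus a miss penalty of (2 / alpha) * (distance to the interval)
    score = upper - lower
    if truth < lower:
        score += (2 / alpha) * (lower - truth)
    elif truth > upper:
        score += (2 / alpha) * (truth - upper)
    return score

print(interval_score(0.0, 1.0, 0.5))   # 1.0  (covers: score = width)
print(interval_score(0.0, 1.0, 1.25))  # 11.0 (miss of 0.25 costs 40 * 0.25)
```

Narrow intervals are rewarded only when they still cover the truth, which is why over-confident methods score poorly.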

Raincloud plot showing 95% confidence interval coverage across different methods

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Raincloud plot showing 95% confidence interval width across different methods

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method.
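The two interval summaries above (coverage and average width) are simple proportions and means across runs; a sketch with illustrative names:

```python
def coverage(intervals, truth):
    # proportion of runs whose CI contains the true effect
    return sum(lo <= truth <= hi for lo, hi in intervals) / len(intervals)

def average_width(intervals):
    # mean CI length across runs
    return sum(hi - lo for lo, hi in intervals) / len(intervals)

cis = [(0.0, 1.0), (0.25, 0.75), (0.5, 1.25), (-0.5, 0.25)]
print(coverage(cis, 0.5))   # 0.75 (three of four intervals cover 0.5)
print(average_width(cis))   # 0.75 (widths 1.0, 0.5, 0.75, 0.75)
```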

Raincloud plot showing positive likelihood ratio across different methods

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.
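Analogously to the negative likelihood ratio, the positive version can be sketched from power and type I error rate (again an assumption about the benchmark's aggregation):

```python
import math

def log_positive_likelihood_ratio(power, type_i_error):
    # LR+ = P(significant | H1) / P(significant | H0)
    return math.log(power / type_i_error)

print(log_positive_likelihood_ratio(0.8, 0.05))  # log(16), about 2.77
```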

Raincloud plot showing negative likelihood ratio across different methods

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.

Raincloud plot showing Type I Error rates across different methods

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Raincloud plot showing statistical power across different methods

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

By-Condition Performance (Replacement in Case of Non-Convergence)

The results below incorporate method replacement to handle non-convergence. If a method fails to converge, its results are replaced with those of a simpler method (e.g., random-effects meta-analysis without publication bias adjustment). This emulates what a data analyst might do in practice when a method does not converge. Note, however, that these results do not correspond to “pure” method performance, as they may combine multiple methods. See Method Replacement Strategy for details of the method replacement specification.

Raincloud plot showing convergence rates across different methods

Raincloud plot showing RMSE (Root Mean Square Error) across different methods

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method. Values larger than 0.5 are visualized as 0.5.

Raincloud plot showing bias across different methods

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0. Values lower than -0.5 or larger than 0.5 are visualized as -0.5 and 0.5 respectively.

Raincloud plot showing empirical SE across different methods

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance. Values larger than 0.5 are visualized as 0.5.

Raincloud plot showing interval score across different methods

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method. Values larger than 100 are visualized as 100.

Raincloud plot showing 95% confidence interval coverage across different methods

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Raincloud plot showing 95% confidence interval width across different methods

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method.

Raincloud plot showing positive likelihood ratio across different methods

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Raincloud plot showing negative likelihood ratio across different methods

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.

Raincloud plot showing Type I Error rates across different methods

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Raincloud plot showing statistical power across different methods

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

Subset: Random Effects

These results are based on Alinaghi (2018) data-generating mechanism with a total of 27 conditions.

Average Performance

Method performance measures are aggregated across all simulated conditions to provide an overall impression of method performance. However, keep in mind that a method with a high overall ranking is not necessarily the “best” method for a particular application. To select a suitable method for your application, consider also non-aggregated performance measures in conditions most relevant to your application.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 RoBMA (PSMA) 0.098 1 RoBMA (PSMA) 0.098
2 AK (AK2) 0.101 2 AK (AK2) 0.120
3 trimfill (default) 0.151 3 trimfill (default) 0.151
4 AK (AK1) 0.173 4 AK (AK1) 0.173
5 SM (4PSM) 0.186 5 SM (4PSM) 0.186
6 PEESE (default) 0.198 6 PEESE (default) 0.198
7 PETPEESE (default) 0.199 7 PETPEESE (default) 0.199
8 FMA (default) 0.199 8 FMA (default) 0.199
8 WLS (default) 0.199 8 WLS (default) 0.199
10 WAAPWLS (default) 0.207 10 WAAPWLS (default) 0.207
11 EK (default) 0.223 11 EK (default) 0.223
11 PET (default) 0.223 11 PET (default) 0.223
13 RMA (default) 0.272 13 RMA (default) 0.272
14 SM (3PSM) 0.280 14 SM (3PSM) 0.280
15 puniform (star) 0.287 15 puniform (star) 0.287
16 puniform (default) 0.448 16 puniform (default) 0.448
17 mean (default) 0.461 17 mean (default) 0.461
18 WILS (default) 0.475 18 WILS (default) 0.475
19 pcurve (default) 1.405 19 pcurve (default) 1.405

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 AK (AK2) 0.003 1 SM (3PSM) 0.020
2 SM (3PSM) 0.020 2 AK (AK2) 0.028
3 EK (default) 0.033 3 EK (default) 0.033
3 PET (default) 0.033 3 PET (default) 0.033
5 puniform (star) 0.036 5 puniform (star) 0.036
6 RoBMA (PSMA) -0.040 6 RoBMA (PSMA) -0.040
7 PETPEESE (default) 0.070 7 PETPEESE (default) 0.070
8 PEESE (default) 0.070 8 PEESE (default) 0.070
9 WAAPWLS (default) 0.072 9 WAAPWLS (default) 0.072
10 FMA (default) 0.086 10 FMA (default) 0.086
11 WLS (default) 0.086 11 WLS (default) 0.086
12 trimfill (default) 0.093 12 trimfill (default) 0.093
13 SM (4PSM) -0.107 13 SM (4PSM) -0.107
14 AK (AK1) 0.131 14 AK (AK1) 0.131
15 RMA (default) 0.247 15 RMA (default) 0.247
16 WILS (default) -0.283 16 WILS (default) -0.283
17 mean (default) 0.430 17 mean (default) 0.430
18 puniform (default) 0.438 18 puniform (default) 0.438
19 pcurve (default) -1.236 19 pcurve (default) -1.236

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 pcurve (default) 0.034 1 pcurve (default) 0.034
2 RMA (default) 0.058 2 RMA (default) 0.058
3 AK (AK1) 0.064 3 AK (AK1) 0.064
4 trimfill (default) 0.069 4 trimfill (default) 0.069
5 mean (default) 0.076 5 mean (default) 0.076
6 puniform (default) 0.080 6 puniform (default) 0.080
7 SM (3PSM) 0.082 7 SM (3PSM) 0.082
8 puniform (star) 0.085 8 puniform (star) 0.085
9 RoBMA (PSMA) 0.085 9 RoBMA (PSMA) 0.085
10 AK (AK2) 0.101 10 SM (4PSM) 0.104
11 SM (4PSM) 0.104 11 AK (AK2) 0.110
12 FMA (default) 0.148 12 FMA (default) 0.148
13 WLS (default) 0.148 13 WLS (default) 0.148
14 PEESE (default) 0.155 14 PEESE (default) 0.155
15 PETPEESE (default) 0.156 15 PETPEESE (default) 0.156
16 WAAPWLS (default) 0.166 16 WAAPWLS (default) 0.166
17 EK (default) 0.190 17 EK (default) 0.190
17 PET (default) 0.190 17 PET (default) 0.190
19 WILS (default) 0.270 19 WILS (default) 0.270

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 RoBMA (PSMA) 0.426 1 RoBMA (PSMA) 0.426
2 AK (AK2) 0.471 2 AK (AK2) 0.887
3 SM (4PSM) 2.479 3 SM (4PSM) 2.479
4 trimfill (default) 2.707 4 trimfill (default) 2.707
5 WAAPWLS (default) 2.898 5 WAAPWLS (default) 2.898
6 AK (AK1) 3.950 6 AK (AK1) 3.950
7 PEESE (default) 4.234 7 PEESE (default) 4.234
8 PETPEESE (default) 4.247 8 PETPEESE (default) 4.247
9 WLS (default) 4.339 9 WLS (default) 4.339
10 EK (default) 4.512 10 EK (default) 4.512
11 PET (default) 4.516 11 PET (default) 4.516
12 FMA (default) 5.792 12 FMA (default) 5.792
13 SM (3PSM) 6.604 13 SM (3PSM) 6.604
14 puniform (star) 6.865 14 puniform (star) 6.865
15 RMA (default) 7.368 15 RMA (default) 7.368
16 puniform (default) 12.917 16 puniform (default) 12.917
17 WILS (default) 14.064 17 WILS (default) 14.064
18 mean (default) 14.386 18 mean (default) 14.386
19 pcurve (default) NaN 19 pcurve (default) NaN

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 AK (AK2) 0.951 1 RoBMA (PSMA) 0.941
2 RoBMA (PSMA) 0.941 2 AK (AK2) 0.866
3 SM (4PSM) 0.835 3 SM (4PSM) 0.835
4 AK (AK1) 0.719 4 AK (AK1) 0.719
5 trimfill (default) 0.626 5 trimfill (default) 0.626
6 SM (3PSM) 0.598 6 SM (3PSM) 0.598
7 puniform (star) 0.586 7 puniform (star) 0.586
8 WAAPWLS (default) 0.529 8 WAAPWLS (default) 0.529
9 RMA (default) 0.422 9 RMA (default) 0.422
10 mean (default) 0.342 10 mean (default) 0.342
11 PETPEESE (default) 0.335 11 PETPEESE (default) 0.335
12 EK (default) 0.335 12 EK (default) 0.335
13 PEESE (default) 0.335 13 PEESE (default) 0.335
14 PET (default) 0.335 14 PET (default) 0.335
15 WLS (default) 0.330 15 WLS (default) 0.330
16 FMA (default) 0.113 16 FMA (default) 0.113
17 puniform (default) 0.098 17 puniform (default) 0.098
18 WILS (default) 0.091 18 WILS (default) 0.091
19 pcurve (default) NaN 19 pcurve (default) NaN

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 FMA (default) 0.051 1 FMA (default) 0.051
2 WILS (default) 0.147 2 WILS (default) 0.147
3 WLS (default) 0.153 3 WLS (default) 0.153
4 PEESE (default) 0.156 4 PEESE (default) 0.156
5 PETPEESE (default) 0.157 5 PETPEESE (default) 0.157
6 PET (default) 0.185 6 PET (default) 0.185
7 EK (default) 0.185 7 EK (default) 0.185
8 RMA (default) 0.228 8 RMA (default) 0.228
9 trimfill (default) 0.234 9 trimfill (default) 0.234
10 mean (default) 0.244 10 mean (default) 0.244
11 AK (AK1) 0.248 11 AK (AK1) 0.248
12 puniform (default) 0.254 12 puniform (default) 0.254
13 WAAPWLS (default) 0.304 13 WAAPWLS (default) 0.304
14 RoBMA (PSMA) 0.310 14 RoBMA (PSMA) 0.310
15 SM (3PSM) 0.314 15 SM (3PSM) 0.314
16 puniform (star) 0.324 16 puniform (star) 0.324
17 AK (AK2) 0.392 17 AK (AK2) 0.383
18 SM (4PSM) 0.408 18 SM (4PSM) 0.408
19 pcurve (default) NaN 19 pcurve (default) NaN

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Log Value Rank Method Log Value
1 RoBMA (PSMA) 6.663 1 RoBMA (PSMA) 6.663
2 AK (AK2) 3.115 2 AK (AK2) 3.117
3 RMA (default) 2.150 3 RMA (default) 2.150
4 AK (AK1) 2.110 4 AK (AK1) 2.110
5 trimfill (default) 1.957 5 trimfill (default) 1.957
6 SM (4PSM) 1.835 6 SM (4PSM) 1.835
7 mean (default) 1.179 7 mean (default) 1.179
8 SM (3PSM) 0.935 8 SM (3PSM) 0.935
9 puniform (star) 0.930 9 puniform (star) 0.930
10 WAAPWLS (default) 0.598 10 WAAPWLS (default) 0.598
11 PETPEESE (default) 0.403 11 PETPEESE (default) 0.403
12 WLS (default) 0.399 12 WLS (default) 0.399
13 PEESE (default) 0.395 13 PEESE (default) 0.395
14 EK (default) 0.384 14 EK (default) 0.384
14 PET (default) 0.384 14 PET (default) 0.384
16 FMA (default) 0.098 16 FMA (default) 0.098
17 WILS (default) 0.044 17 WILS (default) 0.044
18 puniform (default) 0.000 18 puniform (default) 0.000
19 pcurve (default) NaN 19 pcurve (default) NaN

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Log Value Rank Method Log Value
1 RoBMA (PSMA) -6.364 1 AK (AK2) -6.830
2 AK (AK2) -6.064 2 RoBMA (PSMA) -6.364
3 WAAPWLS (default) -5.828 3 WAAPWLS (default) -5.828
4 RMA (default) -5.039 4 RMA (default) -5.039
5 AK (AK1) -5.039 5 AK (AK1) -5.039
6 trimfill (default) -5.023 6 trimfill (default) -5.023
7 mean (default) -4.918 7 mean (default) -4.918
8 PETPEESE (default) -4.860 8 PETPEESE (default) -4.860
9 EK (default) -4.812 9 EK (default) -4.812
9 PET (default) -4.812 9 PET (default) -4.812
11 WLS (default) -4.362 11 WLS (default) -4.362
12 PEESE (default) -4.345 12 PEESE (default) -4.345
13 SM (4PSM) -4.038 13 SM (4PSM) -4.038
14 FMA (default) -3.715 14 FMA (default) -3.715
15 WILS (default) -2.818 15 WILS (default) -2.818
16 SM (3PSM) -1.905 16 SM (3PSM) -1.905
17 puniform (star) -1.874 17 puniform (star) -1.874
18 puniform (default) 0.000 18 puniform (default) 0.000
19 pcurve (default) NaN 19 pcurve (default) NaN

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 RoBMA (PSMA) 0.002 1 RoBMA (PSMA) 0.002
2 AK (AK2) 0.043 2 AK (AK2) 0.043
3 RMA (default) 0.361 3 RMA (default) 0.361
4 AK (AK1) 0.361 4 AK (AK1) 0.361
5 SM (4PSM) 0.369 5 SM (4PSM) 0.369
6 trimfill (default) 0.376 6 trimfill (default) 0.376
7 mean (default) 0.464 7 mean (default) 0.464
8 WAAPWLS (default) 0.594 8 WAAPWLS (default) 0.594
9 puniform (star) 0.684 9 puniform (star) 0.684
9 SM (3PSM) 0.684 9 SM (3PSM) 0.684
11 PETPEESE (default) 0.702 11 PETPEESE (default) 0.702
12 WLS (default) 0.704 12 WLS (default) 0.704
13 PEESE (default) 0.707 13 PEESE (default) 0.707
14 EK (default) 0.712 14 EK (default) 0.712
14 PET (default) 0.712 14 PET (default) 0.712
16 FMA (default) 0.909 16 FMA (default) 0.909
17 WILS (default) 0.937 17 WILS (default) 0.937
18 puniform (default) 1.000 18 puniform (default) 1.000
19 pcurve (default) NaN 19 pcurve (default) NaN

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 AK (AK1) 1.000 1 AK (AK1) 1.000
1 mean (default) 1.000 1 mean (default) 1.000
1 puniform (default) 1.000 1 puniform (default) 1.000
1 RMA (default) 1.000 1 RMA (default) 1.000
1 trimfill (default) 1.000 1 trimfill (default) 1.000
6 FMA (default) 1.000 6 FMA (default) 1.000
7 WLS (default) 1.000 7 WLS (default) 1.000
8 PEESE (default) 1.000 8 PEESE (default) 1.000
9 PETPEESE (default) 0.997 9 PETPEESE (default) 0.997
10 EK (default) 0.997 10 EK (default) 0.997
10 PET (default) 0.997 10 PET (default) 0.997
12 WAAPWLS (default) 0.981 12 WAAPWLS (default) 0.981
13 AK (AK2) 0.978 13 AK (AK2) 0.978
14 WILS (default) 0.978 14 WILS (default) 0.978
15 SM (3PSM) 0.962 15 SM (3PSM) 0.962
16 puniform (star) 0.958 16 puniform (star) 0.958
17 RoBMA (PSMA) 0.944 17 RoBMA (PSMA) 0.944
18 SM (4PSM) 0.935 18 SM (4PSM) 0.935
19 pcurve (default) NaN 19 pcurve (default) NaN

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

By-Condition Performance (Conditional on Method Convergence)

The results below are conditional on method convergence. Note that the methods might differ in convergence rate and are therefore not compared on the same data sets.

Raincloud plot showing convergence rates across different methods

Raincloud plot showing RMSE (Root Mean Square Error) across different methods

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method. Values larger than 0.5 are visualized as 0.5.

Raincloud plot showing bias across different methods

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0. Values lower than -0.5 or larger than 0.5 are visualized as -0.5 and 0.5 respectively.

Raincloud plot showing empirical SE across different methods

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance. Values larger than 0.5 are visualized as 0.5.

Raincloud plot showing interval score across different methods

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method. Values larger than 100 are visualized as 100.

Raincloud plot showing 95% confidence interval coverage across different methods

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Raincloud plot showing 95% confidence interval width across different methods

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method.

Raincloud plot showing positive likelihood ratio across different methods

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Raincloud plot showing negative likelihood ratio across different methods

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.

Raincloud plot showing Type I Error rates across different methods

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Raincloud plot showing statistical power across different methods

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

By-Condition Performance (Replacement in Case of Non-Convergence)

The results below incorporate method replacement to handle non-convergence. If a method fails to converge, its results are replaced with those of a simpler method (e.g., random-effects meta-analysis without publication bias adjustment). This emulates what a data analyst might do in practice when a method does not converge. Note, however, that these results do not correspond to “pure” method performance, as they may combine multiple methods. See Method Replacement Strategy for details of the method replacement specification.

Raincloud plot showing convergence rates across different methods

Raincloud plot showing RMSE (Root Mean Square Error) across different methods

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method. Values larger than 0.5 are visualized as 0.5.

Raincloud plot showing bias across different methods

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0. Values lower than -0.5 or larger than 0.5 are visualized as -0.5 and 0.5 respectively.

Raincloud plot showing empirical SE across different methods

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance. Values larger than 0.5 are visualized as 0.5.

Raincloud plot showing interval score across different methods

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method. Values larger than 100 are visualized as 100.

Raincloud plot showing 95% confidence interval coverage across different methods

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Raincloud plot showing 95% confidence interval width across different methods

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method.

Raincloud plot showing positive likelihood ratio across different methods

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Raincloud plot showing negative likelihood ratio across different methods

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.

Raincloud plot showing Type I Error rates across different methods

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Raincloud plot showing statistical power across different methods

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

Subset: Panel Random Effects

These results are based on Alinaghi (2018) data-generating mechanism with a total of 27 conditions.

Average Performance

Method performance measures are aggregated across all simulated conditions to provide an overall impression of method performance. However, keep in mind that a method with a high overall ranking is not necessarily the “best” method for a particular application. To select a suitable method for your application, consider also non-aggregated performance measures in conditions most relevant to your application.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 trimfill (default) 0.542 1 trimfill (default) 0.542
2 RoBMA (PSMA) 0.543 2 RoBMA (PSMA) 0.543
3 AK (AK1) 0.555 3 AK (AK1) 0.555
4 AK (AK2) 0.575 4 SM (4PSM) 0.587
5 SM (4PSM) 0.587 5 AK (AK2) 0.598
6 SM (3PSM) 0.608 6 SM (3PSM) 0.608
7 puniform (star) 0.615 7 puniform (star) 0.615
8 RMA (default) 0.667 8 RMA (default) 0.667
9 mean (default) 0.678 9 mean (default) 0.678
10 FMA (default) 0.824 10 FMA (default) 0.824
10 WLS (default) 0.824 10 WLS (default) 0.824
12 PEESE (default) 0.868 12 PEESE (default) 0.868
13 PETPEESE (default) 0.879 13 PETPEESE (default) 0.879
14 WAAPWLS (default) 0.897 14 WAAPWLS (default) 0.897
15 PET (default) 1.071 15 PET (default) 1.071
16 EK (default) 1.071 16 EK (default) 1.071
17 WILS (default) 1.222 17 WILS (default) 1.222
18 pcurve (default) 1.381 18 pcurve (default) 1.381
19 puniform (default) 1.400 19 puniform (default) 1.400

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 SM (4PSM) 0.190 1 SM (4PSM) 0.190
2 trimfill (default) 0.194 2 trimfill (default) 0.194
3 PET (default) 0.214 3 PET (default) 0.214
4 EK (default) 0.214 4 EK (default) 0.214
5 SM (3PSM) 0.256 5 SM (3PSM) 0.256
6 AK (AK2) 0.256 6 WILS (default) -0.263
7 WILS (default) -0.263 7 PETPEESE (default) 0.266
8 PETPEESE (default) 0.266 8 WAAPWLS (default) 0.270
9 WAAPWLS (default) 0.270 9 puniform (star) 0.272
10 puniform (star) 0.272 10 PEESE (default) 0.276
11 PEESE (default) 0.276 11 AK (AK2) 0.293
12 WLS (default) 0.301 12 WLS (default) 0.301
13 FMA (default) 0.301 13 FMA (default) 0.301
14 RoBMA (PSMA) 0.336 14 RoBMA (PSMA) 0.336
15 AK (AK1) 0.389 15 AK (AK1) 0.389
16 RMA (default) 0.528 16 RMA (default) 0.528
17 mean (default) 0.538 17 mean (default) 0.538
18 pcurve (default) -1.115 18 pcurve (default) -1.115
19 puniform (default) 1.362 19 puniform (default) 1.362

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 pcurve (default) 0.098 1 pcurve (default) 0.098
2 RMA (default) 0.299 2 RMA (default) 0.299
3 mean (default) 0.302 3 mean (default) 0.302
4 puniform (default) 0.309 4 puniform (default) 0.309
5 AK (AK1) 0.313 5 AK (AK1) 0.313
6 RoBMA (PSMA) 0.321 6 RoBMA (PSMA) 0.321
7 puniform (star) 0.378 7 puniform (star) 0.378
8 SM (3PSM) 0.380 8 SM (3PSM) 0.380
9 trimfill (default) 0.393 9 trimfill (default) 0.393
10 SM (4PSM) 0.454 10 SM (4PSM) 0.454
11 AK (AK2) 0.466 11 AK (AK2) 0.467
12 FMA (default) 0.699 12 FMA (default) 0.699
13 WLS (default) 0.699 13 WLS (default) 0.699
14 PEESE (default) 0.758 14 PEESE (default) 0.758
15 PETPEESE (default) 0.771 15 PETPEESE (default) 0.771
16 WAAPWLS (default) 0.795 16 WAAPWLS (default) 0.795
17 PET (default) 0.985 17 PET (default) 0.985
18 EK (default) 0.985 18 EK (default) 0.985
19 WILS (default) 1.079 19 WILS (default) 1.079

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance.
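Bias, empirical SE, and RMSE are linked by the decomposition RMSE² = bias² + empirical SE² (exact when the SE uses the population standard deviation). A minimal sketch with hypothetical numbers, assuming that convention:

```python
import math

def bias(estimates, true_effect):
    """Average difference between estimate and true effect."""
    return sum(estimates) / len(estimates) - true_effect

def empirical_se(estimates):
    """Standard deviation of the estimates across runs.
    Population SD (divide by n) so the decomposition below is exact."""
    mean = sum(estimates) / len(estimates)
    return math.sqrt(sum((e - mean) ** 2 for e in estimates) / len(estimates))

def rmse(estimates, true_effect):
    return math.sqrt(sum((e - true_effect) ** 2 for e in estimates) / len(estimates))

est, theta = [0.30, 0.40, 0.45, 0.25, 0.35], 0.3
# RMSE^2 equals bias^2 + empirical SE^2
print(round(rmse(est, theta) ** 2, 6))
print(round(bias(est, theta) ** 2 + empirical_se(est) ** 2, 6))
```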

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 FMA (default) 5.413 1 FMA (default) 5.413
2 RoBMA (PSMA) 6.200 2 RoBMA (PSMA) 6.200
3 AK (AK2) 8.596 3 SM (4PSM) 9.418
4 SM (4PSM) 9.418 4 AK (AK2) 9.662
5 RMA (default) 10.943 5 RMA (default) 10.943
6 trimfill (default) 11.857 6 trimfill (default) 11.857
7 SM (3PSM) 12.542 7 SM (3PSM) 12.542
8 AK (AK1) 12.674 8 AK (AK1) 12.674
9 puniform (star) 13.940 9 puniform (star) 13.940
10 WAAPWLS (default) 20.395 10 WAAPWLS (default) 20.395
11 mean (default) 20.420 11 mean (default) 20.420
12 WLS (default) 21.589 12 WLS (default) 21.589
13 PEESE (default) 22.620 13 PEESE (default) 22.620
14 PETPEESE (default) 22.879 14 PETPEESE (default) 22.879
15 EK (default) 27.177 15 EK (default) 27.177
16 PET (default) 27.193 16 PET (default) 27.193
17 WILS (default) 34.217 17 WILS (default) 34.217
18 puniform (default) 47.302 18 puniform (default) 47.302
19 pcurve (default) NaN 19 pcurve (default) NaN

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method.
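The description above matches the standard interval score for a (1 − α) interval: the width, plus a penalty of 2/α per unit by which the true value falls outside. A sketch under that assumption (hypothetical interval endpoints):

```python
def interval_score(lower, upper, true_effect, alpha=0.05):
    """Interval score for a (1 - alpha) confidence interval:
    width plus a 2/alpha penalty per unit the true value lies outside."""
    score = upper - lower
    if true_effect < lower:
        score += (2 / alpha) * (lower - true_effect)
    elif true_effect > upper:
        score += (2 / alpha) * (true_effect - upper)
    return score

# True effect inside the interval: score is just the width
print(round(interval_score(0.1, 0.5, 0.3), 3))
# True effect 0.05 below the lower bound: width plus 40 * 0.05
print(round(interval_score(0.1, 0.5, 0.05), 3))
```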

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 FMA (default) 0.848 1 FMA (default) 0.848
2 RoBMA (PSMA) 0.710 2 RoBMA (PSMA) 0.710
3 RMA (default) 0.531 3 RMA (default) 0.531
4 AK (AK2) 0.502 4 SM (4PSM) 0.474
5 SM (4PSM) 0.474 5 AK (AK2) 0.467
6 SM (3PSM) 0.392 6 SM (3PSM) 0.392
7 AK (AK1) 0.334 7 AK (AK1) 0.334
8 puniform (star) 0.326 8 puniform (star) 0.326
9 trimfill (default) 0.313 9 trimfill (default) 0.313
10 WAAPWLS (default) 0.226 10 WAAPWLS (default) 0.226
11 mean (default) 0.175 11 mean (default) 0.175
12 WLS (default) 0.145 12 WLS (default) 0.145
13 PETPEESE (default) 0.145 13 PETPEESE (default) 0.145
14 EK (default) 0.145 14 EK (default) 0.145
15 PET (default) 0.144 15 PET (default) 0.144
16 PEESE (default) 0.144 16 PEESE (default) 0.144
17 WILS (default) 0.081 17 WILS (default) 0.081
18 puniform (default) 0.006 18 puniform (default) 0.006
19 pcurve (default) NaN 19 pcurve (default) NaN

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 mean (default) 0.251 1 mean (default) 0.251
2 WILS (default) 0.283 2 WILS (default) 0.283
3 WLS (default) 0.299 3 WLS (default) 0.299
4 PEESE (default) 0.312 4 PEESE (default) 0.312
5 PETPEESE (default) 0.318 5 PETPEESE (default) 0.318
6 puniform (default) 0.383 6 puniform (default) 0.383
7 trimfill (default) 0.401 7 trimfill (default) 0.401
8 PET (default) 0.401 8 PET (default) 0.401
9 EK (default) 0.402 9 EK (default) 0.402
10 AK (AK1) 0.437 10 AK (AK1) 0.437
11 puniform (star) 0.487 11 puniform (star) 0.487
12 WAAPWLS (default) 0.527 12 WAAPWLS (default) 0.527
13 SM (3PSM) 0.579 13 SM (3PSM) 0.579
14 SM (4PSM) 0.733 14 SM (4PSM) 0.733
15 AK (AK2) 0.784 15 AK (AK2) 0.759
16 RMA (default) 1.060 16 RMA (default) 1.060
17 RoBMA (PSMA) 1.144 17 RoBMA (PSMA) 1.144
18 FMA (default) 2.855 18 FMA (default) 2.855
19 pcurve (default) NaN 19 pcurve (default) NaN

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method.
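Coverage and average width follow directly from the per-run intervals. A sketch with hypothetical intervals (true effect 0.3):

```python
def coverage_and_width(intervals, true_effect):
    """Proportion of intervals containing the true effect,
    and the average interval width across runs."""
    hits = sum(1 for lo, hi in intervals if lo <= true_effect <= hi)
    widths = [hi - lo for lo, hi in intervals]
    return hits / len(intervals), sum(widths) / len(widths)

cis = [(0.1, 0.5), (0.2, 0.4), (0.35, 0.6), (-0.1, 0.25)]
cov, width = coverage_and_width(cis, 0.3)
print(cov, round(width, 3))
```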

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Log Value Rank Method Log Value
1 RoBMA (PSMA) 3.448 1 RoBMA (PSMA) 3.448
2 RMA (default) 1.974 2 RMA (default) 1.974
3 FMA (default) 1.533 3 FMA (default) 1.533
4 AK (AK2) 0.691 4 AK (AK2) 0.686
5 SM (4PSM) 0.570 5 SM (4PSM) 0.570
6 AK (AK1) 0.506 6 AK (AK1) 0.506
7 SM (3PSM) 0.421 7 SM (3PSM) 0.421
8 trimfill (default) 0.341 8 trimfill (default) 0.341
9 mean (default) 0.265 9 mean (default) 0.265
10 puniform (star) 0.189 10 puniform (star) 0.189
11 WAAPWLS (default) 0.182 11 WAAPWLS (default) 0.182
12 PETPEESE (default) 0.144 12 PETPEESE (default) 0.144
13 WLS (default) 0.134 13 WLS (default) 0.134
14 PEESE (default) 0.132 14 PEESE (default) 0.132
15 EK (default) 0.116 15 EK (default) 0.116
15 PET (default) 0.116 15 PET (default) 0.116
17 WILS (default) 0.038 17 WILS (default) 0.038
18 puniform (default) 0.000 18 puniform (default) 0.000
19 pcurve (default) NaN 19 pcurve (default) NaN

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Log Value Rank Method Log Value
1 RMA (default) -5.174 1 RMA (default) -5.174
2 RoBMA (PSMA) -5.000 2 RoBMA (PSMA) -5.000
3 SM (4PSM) -4.819 3 SM (4PSM) -4.819
4 trimfill (default) -4.744 4 trimfill (default) -4.744
5 AK (AK2) -4.522 5 AK (AK2) -4.674
6 AK (AK1) -4.122 6 AK (AK1) -4.122
7 SM (3PSM) -3.982 7 SM (3PSM) -3.982
8 mean (default) -3.862 8 mean (default) -3.862
9 WLS (default) -3.651 9 WLS (default) -3.651
10 puniform (star) -3.640 10 puniform (star) -3.640
11 PEESE (default) -3.558 11 PEESE (default) -3.558
12 WAAPWLS (default) -3.230 12 WAAPWLS (default) -3.230
13 PETPEESE (default) -3.213 13 PETPEESE (default) -3.213
14 EK (default) -3.094 14 EK (default) -3.094
14 PET (default) -3.094 14 PET (default) -3.094
16 FMA (default) -2.055 16 FMA (default) -2.055
17 WILS (default) -1.760 17 WILS (default) -1.760
18 puniform (default) 0.000 18 puniform (default) 0.000
19 pcurve (default) NaN 19 pcurve (default) NaN

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.
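Both likelihood ratios above derive from power and the type I error rate: LR+ = power / type I error, LR− = (1 − power) / (1 − type I error). A sketch with hypothetical values:

```python
import math

def likelihood_ratios(power, type1_error):
    """Positive and negative likelihood ratios from power and type I error rate."""
    lr_pos = power / type1_error              # odds change after a significant result
    lr_neg = (1 - power) / (1 - type1_error)  # odds change after a non-significant result
    return lr_pos, lr_neg

# Hypothetical method with 80% power at a 5% type I error rate
lr_pos, lr_neg = likelihood_ratios(0.80, 0.05)
print(round(math.log(lr_pos), 3), round(math.log(lr_neg), 3))
```

A useful method has log LR+ above 0 and log LR− below 0, matching the criteria stated above.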

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 FMA (default) 0.263 1 FMA (default) 0.263
2 RoBMA (PSMA) 0.273 2 RoBMA (PSMA) 0.273
3 RMA (default) 0.361 3 RMA (default) 0.361
4 AK (AK2) 0.501 4 AK (AK2) 0.506
5 SM (4PSM) 0.559 5 SM (4PSM) 0.559
6 AK (AK1) 0.642 6 AK (AK1) 0.642
7 SM (3PSM) 0.671 7 SM (3PSM) 0.671
8 trimfill (default) 0.715 8 trimfill (default) 0.715
9 mean (default) 0.783 9 mean (default) 0.783
10 WAAPWLS (default) 0.791 10 WAAPWLS (default) 0.791
11 PETPEESE (default) 0.832 11 PETPEESE (default) 0.832
12 puniform (star) 0.844 12 puniform (star) 0.844
13 WLS (default) 0.860 13 WLS (default) 0.860
14 PEESE (default) 0.861 14 PEESE (default) 0.861
15 EK (default) 0.864 15 EK (default) 0.864
15 PET (default) 0.864 15 PET (default) 0.864
17 WILS (default) 0.931 17 WILS (default) 0.931
18 puniform (default) 1.000 18 puniform (default) 1.000
19 pcurve (default) NaN 19 pcurve (default) NaN

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 puniform (default) 1.000 1 puniform (default) 1.000
2 mean (default) 0.995 2 mean (default) 0.995
3 puniform (star) 0.992 3 puniform (star) 0.992
4 AK (AK1) 0.991 4 AK (AK1) 0.991
5 WLS (default) 0.980 5 WLS (default) 0.980
6 PEESE (default) 0.979 6 PEESE (default) 0.979
7 trimfill (default) 0.977 7 trimfill (default) 0.977
8 EK (default) 0.969 8 EK (default) 0.969
8 PET (default) 0.969 8 PET (default) 0.969
10 WILS (default) 0.968 10 WILS (default) 0.968
11 PETPEESE (default) 0.959 11 PETPEESE (default) 0.959
12 RMA (default) 0.955 12 RMA (default) 0.955
13 SM (3PSM) 0.950 13 SM (3PSM) 0.950
14 WAAPWLS (default) 0.948 14 WAAPWLS (default) 0.948
15 AK (AK2) 0.943 15 AK (AK2) 0.943
16 SM (4PSM) 0.935 16 SM (4PSM) 0.935
17 RoBMA (PSMA) 0.890 17 RoBMA (PSMA) 0.890
18 FMA (default) 0.785 18 FMA (default) 0.785
19 pcurve (default) NaN 19 pcurve (default) NaN

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.
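Both power and the type I error rate are rejection rates, differing only in whether the runs were simulated under the alternative or the null. A sketch with hypothetical p-values:

```python
def rejection_rate(p_values, alpha=0.05):
    """Proportion of runs rejecting the null at level alpha.
    Under the null this is the type I error rate; under the
    alternative it is the power."""
    return sum(1 for p in p_values if p < alpha) / len(p_values)

# Hypothetical p-values from runs simulated under the alternative
print(rejection_rate([0.001, 0.03, 0.20, 0.04, 0.01]))
```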

By-Condition Performance (Conditional on Method Convergence)

The results below are conditional on method convergence. Note that the methods might differ in convergence rate and are therefore not compared on the same data sets.

Raincloud plot showing convergence rates across different methods

Raincloud plot showing RMSE (Root Mean Square Error) across different methods

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method. Values larger than 0.5 are visualized as 0.5.

Raincloud plot showing bias across different methods

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0. Values lower than -0.5 or larger than 0.5 are visualized as -0.5 and 0.5 respectively.

Raincloud plot showing empirical SE across different methods

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance. Values larger than 0.5 are visualized as 0.5.

Raincloud plot showing interval score across different methods

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method. Values larger than 100 are visualized as 100.

Raincloud plot showing 95% confidence interval coverage across different methods

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Raincloud plot showing 95% confidence interval width across different methods

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method.

Raincloud plot showing positive likelihood ratio across different methods

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Raincloud plot showing negative likelihood ratio across different methods

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.

Raincloud plot showing Type I Error rates across different methods

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Raincloud plot showing statistical power across different methods

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

By-Condition Performance (Replacement in Case of Non-Convergence)

The results below incorporate method replacement to handle non-convergence. If a method fails to converge, its results are replaced with those from a simpler method (e.g., a random-effects meta-analysis without publication bias adjustment). This emulates what a data analyst may do in practice when a method does not converge. Note, however, that these results do not reflect “pure” method performance, as they may combine results from multiple methods. See Method Replacement Strategy for details of the replacement specification.
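The replacement logic can be sketched as follows (Python; the per-run estimates and the choice of fallback are hypothetical — the actual specification is given in Method Replacement Strategy):

```python
def with_replacement(primary, fallback):
    """Per-run estimates with non-converged runs (None) replaced by
    estimates from a simpler fallback method (e.g. random-effects RMA)."""
    return [p if p is not None else f for p, f in zip(primary, fallback)]

# Hypothetical estimates; None marks a run where the method did not converge
selection_model = [0.31, None, 0.28, None]
random_effects  = [0.35, 0.42, 0.33, 0.40]
print(with_replacement(selection_model, random_effects))
```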

Raincloud plot showing convergence rates across different methods

Raincloud plot showing RMSE (Root Mean Square Error) across different methods

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method. Values larger than 0.5 are visualized as 0.5.

Raincloud plot showing bias across different methods

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0. Values lower than -0.5 or larger than 0.5 are visualized as -0.5 and 0.5 respectively.

Raincloud plot showing empirical SE across different methods

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance. Values larger than 0.5 are visualized as 0.5.

Raincloud plot showing interval score across different methods

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method. Values larger than 100 are visualized as 100.

Raincloud plot showing 95% confidence interval coverage across different methods

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Raincloud plot showing 95% confidence interval width across different methods

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method.

Raincloud plot showing positive likelihood ratio across different methods

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Raincloud plot showing negative likelihood ratio across different methods

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.

Raincloud plot showing Type I Error rates across different methods

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Raincloud plot showing statistical power across different methods

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

Session Info

This report was compiled on Thu Oct 23 13:54:23 2025 (UTC) using the following computational environment:

## R version 4.5.1 (2025-06-13)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## time zone: UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] scales_1.4.0                   ggdist_3.3.3                  
## [3] ggplot2_4.0.0                  PublicationBiasBenchmark_0.1.0
## 
## loaded via a namespace (and not attached):
##  [1] generics_0.1.4       sandwich_3.1-1       sass_0.4.10         
##  [4] xml2_1.4.0           stringi_1.8.7        lattice_0.22-7      
##  [7] httpcode_0.3.0       digest_0.6.37        magrittr_2.0.4      
## [10] evaluate_1.0.5       grid_4.5.1           RColorBrewer_1.1-3  
## [13] fastmap_1.2.0        jsonlite_2.0.0       crul_1.6.0          
## [16] urltools_1.7.3.1     httr_1.4.7           purrr_1.1.0         
## [19] viridisLite_0.4.2    textshaping_1.0.4    jquerylib_0.1.4     
## [22] Rdpack_2.6.4         cli_3.6.5            rlang_1.1.6         
## [25] triebeard_0.4.1      rbibutils_2.3        withr_3.0.2         
## [28] cachem_1.1.0         yaml_2.3.10          tools_4.5.1         
## [31] memoise_2.0.1        kableExtra_1.4.0     curl_7.0.0          
## [34] vctrs_0.6.5          R6_2.6.1             clubSandwich_0.6.1  
## [37] zoo_1.8-14           lifecycle_1.0.4      stringr_1.5.2       
## [40] fs_1.6.6             htmlwidgets_1.6.4    ragg_1.5.0          
## [43] pkgconfig_2.0.3      desc_1.4.3           osfr_0.2.9          
## [46] pkgdown_2.1.3        bslib_0.9.0          pillar_1.11.1       
## [49] gtable_0.3.6         Rcpp_1.1.0           glue_1.8.0          
## [52] systemfonts_1.3.1    xfun_0.53            tibble_3.3.0        
## [55] rstudioapi_0.17.1    knitr_1.50           farver_2.1.2        
## [58] htmltools_0.5.8.1    labeling_0.4.3       svglite_2.2.2       
## [61] rmarkdown_2.30       compiler_4.5.1       S7_0.2.0            
## [64] distributional_0.5.0