
Complete Results

These results are based on the Alinaghi (2018) data-generating mechanism, with a total of 81 conditions.

Average Performance

Method performance measures are aggregated across all simulated conditions to provide an overall impression of method performance. However, keep in mind that a method with a high overall ranking is not necessarily the “best” method for a particular application. To select a suitable method for your application, also consider the non-aggregated performance measures in the conditions most relevant to your application.

Conditional on Convergence (left) / Replacement if Non-Convergence (right)

| Rank | Method | Value | Rank | Method | Value |
|---|---|---|---|---|---|
| 1 | RoBMA (PSMA) | 0.216 | 1 | RoBMA (PSMA) | 0.216 |
| 2 | AK (AK2) | 0.229 | 2 | trimfill (default) | 0.236 |
| 3 | trimfill (default) | 0.236 | 3 | AK (AK2) | 0.245 |
| 4 | AK (AK1) | 0.255 | 4 | AK (AK1) | 0.255 |
| 5 | SM (4PSM) | 0.263 | 5 | SM (4PSM) | 0.263 |
| 6 | MAIVE (WAIVE) | 0.295 | 6 | MAIVE (WAIVE) | 0.295 |
| 7 | SM (3PSM) | 0.310 | 7 | SM (3PSM) | 0.310 |
| 8 | puniform (star) | 0.316 | 8 | puniform (star) | 0.316 |
| 9 | MAIVE (default) | 0.317 | 9 | MAIVE (default) | 0.317 |
| 10 | RMA (default) | 0.320 | 10 | RMA (default) | 0.320 |
| 11 | FMA (default) | 0.345 | 11 | FMA (default) | 0.345 |
| 11 | WLS (default) | 0.345 | 11 | WLS (default) | 0.345 |
| 13 | PEESE (default) | 0.359 | 13 | PEESE (default) | 0.359 |
| 14 | PETPEESE (default) | 0.363 | 14 | PETPEESE (default) | 0.363 |
| 15 | WAAPWLS (default) | 0.372 | 15 | WAAPWLS (default) | 0.372 |
| 16 | EK (default) | 0.437 | 16 | EK (default) | 0.437 |
| 17 | PET (default) | 0.438 | 17 | PET (default) | 0.438 |
| 18 | mean (default) | 0.496 | 18 | mean (default) | 0.496 |
| 19 | WILS (default) | 0.571 | 19 | WILS (default) | 0.571 |
| 20 | puniform (default) | 0.643 | 20 | puniform (default) | 0.643 |
| 21 | pcurve (default) | 1.376 | 21 | pcurve (default) | 1.376 |

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method.
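To illustrate how RMSE combines bias and empirical SE, the sketch below computes all three summaries from a vector of simulated estimates. The `performance` helper is hypothetical, not part of the benchmark's code; with the population standard deviation, RMSE² = bias² + empirical SE² holds exactly.

```python
import numpy as np

def performance(estimates, true_effect):
    """Bias, empirical SE, and RMSE of meta-analytic estimates
    across simulation runs (hypothetical helper for illustration)."""
    est = np.asarray(estimates, dtype=float)
    bias = est.mean() - true_effect
    emp_se = est.std(ddof=0)  # population SD, so the decomposition is exact
    rmse = np.sqrt(np.mean((est - true_effect) ** 2))
    return bias, emp_se, rmse

# toy data: estimates from five simulated meta-analyses, true effect 0.3
bias, emp_se, rmse = performance([0.25, 0.35, 0.30, 0.40, 0.20], 0.3)
# RMSE^2 decomposes into squared bias plus squared empirical SE
assert abs(rmse**2 - (bias**2 + emp_se**2)) < 1e-12
```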

Conditional on Convergence (left) / Replacement if Non-Convergence (right)

| Rank | Method | Value | Rank | Method | Value |
|---|---|---|---|---|---|
| 1 | SM (4PSM) | 0.025 | 1 | SM (4PSM) | 0.025 |
| 2 | PET (default) | 0.078 | 2 | PET (default) | 0.078 |
| 3 | EK (default) | 0.080 | 3 | EK (default) | 0.080 |
| 4 | AK (AK2) | 0.086 | 4 | trimfill (default) | 0.096 |
| 5 | trimfill (default) | 0.096 | 5 | SM (3PSM) | 0.098 |
| 6 | SM (3PSM) | 0.098 | 6 | RoBMA (PSMA) | 0.099 |
| 7 | RoBMA (PSMA) | 0.099 | 7 | AK (AK2) | 0.108 |
| 8 | puniform (star) | 0.111 | 8 | puniform (star) | 0.111 |
| 9 | PETPEESE (default) | 0.112 | 9 | PETPEESE (default) | 0.112 |
| 10 | WAAPWLS (default) | 0.115 | 10 | WAAPWLS (default) | 0.115 |
| 11 | PEESE (default) | 0.116 | 11 | PEESE (default) | 0.116 |
| 12 | FMA (default) | 0.131 | 12 | FMA (default) | 0.131 |
| 12 | WLS (default) | 0.131 | 12 | WLS (default) | 0.131 |
| 14 | MAIVE (WAIVE) | 0.165 | 14 | MAIVE (WAIVE) | 0.165 |
| 15 | AK (AK1) | 0.183 | 15 | AK (AK1) | 0.182 |
| 16 | WILS (default) | -0.183 | 16 | WILS (default) | -0.183 |
| 17 | MAIVE (default) | 0.187 | 17 | MAIVE (default) | 0.187 |
| 18 | RMA (default) | 0.262 | 18 | RMA (default) | 0.262 |
| 19 | mean (default) | 0.429 | 19 | mean (default) | 0.429 |
| 20 | puniform (default) | 0.606 | 20 | puniform (default) | 0.606 |
| 21 | pcurve (default) | -1.219 | 21 | pcurve (default) | -1.219 |

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0.

Conditional on Convergence (left) / Replacement if Non-Convergence (right)

| Rank | Method | Value | Rank | Method | Value |
|---|---|---|---|---|---|
| 1 | pcurve (default) | 0.056 | 1 | pcurve (default) | 0.056 |
| 2 | RMA (default) | 0.124 | 2 | RMA (default) | 0.124 |
| 3 | AK (AK1) | 0.132 | 3 | AK (AK1) | 0.132 |
| 4 | RoBMA (PSMA) | 0.138 | 4 | RoBMA (PSMA) | 0.138 |
| 5 | mean (default) | 0.149 | 5 | mean (default) | 0.149 |
| 6 | puniform (default) | 0.155 | 6 | puniform (default) | 0.155 |
| 7 | trimfill (default) | 0.158 | 7 | trimfill (default) | 0.158 |
| 8 | puniform (star) | 0.160 | 8 | puniform (star) | 0.160 |
| 9 | SM (3PSM) | 0.161 | 9 | SM (3PSM) | 0.161 |
| 10 | MAIVE (default) | 0.187 | 10 | MAIVE (default) | 0.187 |
| 11 | MAIVE (WAIVE) | 0.188 | 11 | MAIVE (WAIVE) | 0.188 |
| 12 | SM (4PSM) | 0.191 | 12 | SM (4PSM) | 0.191 |
| 13 | AK (AK2) | 0.192 | 13 | AK (AK2) | 0.198 |
| 14 | FMA (default) | 0.286 | 14 | FMA (default) | 0.286 |
| 15 | WLS (default) | 0.286 | 15 | WLS (default) | 0.286 |
| 16 | PEESE (default) | 0.307 | 16 | PEESE (default) | 0.307 |
| 17 | PETPEESE (default) | 0.312 | 17 | PETPEESE (default) | 0.312 |
| 18 | WAAPWLS (default) | 0.324 | 18 | WAAPWLS (default) | 0.324 |
| 19 | EK (default) | 0.395 | 19 | EK (default) | 0.395 |
| 20 | PET (default) | 0.395 | 20 | PET (default) | 0.395 |
| 21 | WILS (default) | 0.453 | 21 | WILS (default) | 0.453 |

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance.

Conditional on Convergence (left) / Replacement if Non-Convergence (right)

| Rank | Method | Value | Rank | Method | Value |
|---|---|---|---|---|---|
| 1 | RoBMA (PSMA) | 2.220 | 1 | RoBMA (PSMA) | 2.220 |
| 2 | AK (AK2) | 3.037 | 2 | MAIVE (WAIVE) | 3.441 |
| 3 | MAIVE (WAIVE) | 3.441 | 3 | AK (AK2) | 3.550 |
| 4 | FMA (default) | 3.780 | 4 | FMA (default) | 3.780 |
| 5 | SM (4PSM) | 3.999 | 5 | SM (4PSM) | 3.999 |
| 6 | MAIVE (default) | 4.031 | 6 | MAIVE (default) | 4.031 |
| 7 | trimfill (default) | 4.897 | 7 | trimfill (default) | 4.897 |
| 8 | AK (AK1) | 5.772 | 8 | AK (AK1) | 5.764 |
| 9 | RMA (default) | 6.200 | 9 | RMA (default) | 6.200 |
| 10 | SM (3PSM) | 6.625 | 10 | SM (3PSM) | 6.625 |
| 11 | puniform (star) | 7.284 | 11 | puniform (star) | 7.284 |
| 12 | WAAPWLS (default) | 7.800 | 12 | WAAPWLS (default) | 7.800 |
| 13 | WLS (default) | 8.687 | 13 | WLS (default) | 8.687 |
| 14 | PEESE (default) | 8.983 | 14 | PEESE (default) | 8.983 |
| 15 | PETPEESE (default) | 9.060 | 15 | PETPEESE (default) | 9.060 |
| 16 | EK (default) | 10.612 | 16 | EK (default) | 10.612 |
| 17 | PET (default) | 10.651 | 17 | PET (default) | 10.651 |
| 18 | mean (default) | 14.940 | 18 | mean (default) | 14.940 |
| 19 | WILS (default) | 16.151 | 19 | WILS (default) | 16.151 |
| 20 | puniform (default) | 20.205 | 20 | puniform (default) | 20.205 |
| 21 | pcurve (default) | NaN | 21 | pcurve (default) | NaN |

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method.
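The verbal description above matches the standard interval score of Gneiting and Raftery (2007): the width of the interval plus a penalty of 2/α for each unit by which the truth falls outside. The exact implementation used for these results is an assumption; the sketch below shows the textbook definition.

```python
def interval_score(lower, upper, truth, alpha=0.05):
    """Interval score for a (1 - alpha) confidence interval:
    interval width, plus a 2/alpha penalty per unit the true
    value lies outside the interval (Gneiting & Raftery, 2007)."""
    score = upper - lower
    if truth < lower:
        score += (2 / alpha) * (lower - truth)
    elif truth > upper:
        score += (2 / alpha) * (truth - upper)
    return score

# a 95% CI that covers the truth is scored only by its width ...
interval_score(0.1, 0.5, 0.3)   # -> 0.4
# ... while missing the truth is penalized heavily (2/0.05 = 40 per unit)
interval_score(0.1, 0.5, 0.6)   # -> 0.4 + 40 * 0.1 = 4.4
```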

Conditional on Convergence (left) / Replacement if Non-Convergence (right)

| Rank | Method | Value | Rank | Method | Value |
|---|---|---|---|---|---|
| 1 | RoBMA (PSMA) | 0.868 | 1 | RoBMA (PSMA) | 0.868 |
| 2 | AK (AK2) | 0.802 | 2 | AK (AK2) | 0.759 |
| 3 | SM (4PSM) | 0.749 | 3 | SM (4PSM) | 0.749 |
| 4 | MAIVE (WAIVE) | 0.720 | 4 | MAIVE (WAIVE) | 0.720 |
| 5 | MAIVE (default) | 0.680 | 5 | MAIVE (default) | 0.680 |
| 6 | AK (AK1) | 0.652 | 6 | AK (AK1) | 0.651 |
| 7 | SM (3PSM) | 0.625 | 7 | SM (3PSM) | 0.625 |
| 8 | trimfill (default) | 0.614 | 8 | trimfill (default) | 0.614 |
| 9 | RMA (default) | 0.597 | 9 | RMA (default) | 0.597 |
| 10 | puniform (star) | 0.597 | 10 | puniform (star) | 0.597 |
| 11 | FMA (default) | 0.597 | 11 | FMA (default) | 0.597 |
| 12 | WAAPWLS (default) | 0.555 | 12 | WAAPWLS (default) | 0.555 |
| 13 | PETPEESE (default) | 0.467 | 13 | PETPEESE (default) | 0.467 |
| 14 | PEESE (default) | 0.455 | 14 | PEESE (default) | 0.455 |
| 15 | WLS (default) | 0.441 | 15 | WLS (default) | 0.441 |
| 16 | EK (default) | 0.412 | 16 | EK (default) | 0.412 |
| 17 | PET (default) | 0.361 | 17 | PET (default) | 0.361 |
| 18 | puniform (default) | 0.344 | 18 | puniform (default) | 0.344 |
| 19 | WILS (default) | 0.307 | 19 | WILS (default) | 0.307 |
| 20 | mean (default) | 0.299 | 20 | mean (default) | 0.299 |
| 21 | pcurve (default) | NaN | 21 | pcurve (default) | NaN |

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Conditional on Convergence (left) / Replacement if Non-Convergence (right)

| Rank | Method | Value | Rank | Method | Value |
|---|---|---|---|---|---|
| 1 | WILS (default) | 0.154 | 1 | WILS (default) | 0.154 |
| 2 | WLS (default) | 0.163 | 2 | WLS (default) | 0.163 |
| 3 | PEESE (default) | 0.168 | 3 | PEESE (default) | 0.168 |
| 4 | PETPEESE (default) | 0.170 | 4 | PETPEESE (default) | 0.170 |
| 5 | EK (default) | 0.208 | 5 | EK (default) | 0.208 |
| 6 | PET (default) | 0.208 | 6 | PET (default) | 0.208 |
| 7 | trimfill (default) | 0.229 | 7 | trimfill (default) | 0.229 |
| 8 | AK (AK1) | 0.246 | 8 | AK (AK1) | 0.246 |
| 9 | mean (default) | 0.247 | 9 | mean (default) | 0.247 |
| 10 | WAAPWLS (default) | 0.289 | 10 | WAAPWLS (default) | 0.289 |
| 11 | puniform (star) | 0.290 | 11 | puniform (star) | 0.290 |
| 12 | SM (3PSM) | 0.317 | 12 | SM (3PSM) | 0.317 |
| 13 | puniform (default) | 0.321 | 13 | puniform (default) | 0.321 |
| 14 | SM (4PSM) | 0.395 | 14 | AK (AK2) | 0.393 |
| 15 | AK (AK2) | 0.404 | 15 | SM (4PSM) | 0.395 |
| 16 | RMA (default) | 0.448 | 16 | RMA (default) | 0.448 |
| 17 | RoBMA (PSMA) | 0.494 | 17 | RoBMA (PSMA) | 0.494 |
| 18 | MAIVE (WAIVE) | 0.664 | 18 | MAIVE (WAIVE) | 0.664 |
| 19 | MAIVE (default) | 0.676 | 19 | MAIVE (default) | 0.676 |
| 20 | FMA (default) | 0.980 | 20 | FMA (default) | 0.980 |
| 21 | pcurve (default) | NaN | 21 | pcurve (default) | NaN |

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method.

Conditional on Convergence (left) / Replacement if Non-Convergence (right)

| Rank | Method | Log Value | Rank | Method | Log Value |
|---|---|---|---|---|---|
| 1 | RoBMA (PSMA) | 5.904 | 1 | RoBMA (PSMA) | 5.904 |
| 2 | MAIVE (default) | 2.518 | 2 | MAIVE (default) | 2.518 |
| 3 | MAIVE (WAIVE) | 2.467 | 3 | MAIVE (WAIVE) | 2.467 |
| 4 | AK (AK2) | 2.307 | 4 | AK (AK2) | 2.211 |
| 5 | RMA (default) | 2.161 | 5 | RMA (default) | 2.161 |
| 6 | AK (AK1) | 1.616 | 6 | AK (AK1) | 1.618 |
| 7 | SM (4PSM) | 1.521 | 7 | SM (4PSM) | 1.521 |
| 8 | trimfill (default) | 1.487 | 8 | trimfill (default) | 1.489 |
| 9 | EK (default) | 1.101 | 9 | EK (default) | 1.101 |
| 9 | PET (default) | 1.101 | 9 | PET (default) | 1.101 |
| 11 | PETPEESE (default) | 1.059 | 11 | PETPEESE (default) | 1.059 |
| 12 | mean (default) | 1.039 | 12 | mean (default) | 1.039 |
| 13 | FMA (default) | 1.010 | 13 | FMA (default) | 1.010 |
| 14 | WAAPWLS (default) | 0.905 | 14 | WAAPWLS (default) | 0.905 |
| 15 | SM (3PSM) | 0.819 | 15 | SM (3PSM) | 0.819 |
| 16 | WLS (default) | 0.811 | 16 | WLS (default) | 0.811 |
| 17 | PEESE (default) | 0.795 | 17 | PEESE (default) | 0.795 |
| 18 | puniform (default) | 0.749 | 18 | puniform (default) | 0.749 |
| 19 | puniform (star) | 0.742 | 19 | puniform (star) | 0.742 |
| 20 | WILS (default) | 0.446 | 20 | WILS (default) | 0.446 |
| 21 | pcurve (default) | NaN | 21 | pcurve (default) | NaN |

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Conditional on Convergence (left) / Replacement if Non-Convergence (right)

| Rank | Method | Log Value | Rank | Method | Log Value |
|---|---|---|---|---|---|
| 1 | RoBMA (PSMA) | -6.322 | 1 | AK (AK2) | -6.345 |
| 2 | MAIVE (default) | -5.986 | 2 | RoBMA (PSMA) | -6.322 |
| 3 | MAIVE (WAIVE) | -5.825 | 3 | MAIVE (default) | -5.986 |
| 4 | AK (AK2) | -5.588 | 4 | MAIVE (WAIVE) | -5.825 |
| 5 | SM (4PSM) | -5.405 | 5 | SM (4PSM) | -5.405 |
| 6 | WAAPWLS (default) | -5.289 | 6 | WAAPWLS (default) | -5.289 |
| 7 | PETPEESE (default) | -5.199 | 7 | PETPEESE (default) | -5.199 |
| 8 | EK (default) | -5.148 | 8 | EK (default) | -5.148 |
| 8 | PET (default) | -5.148 | 8 | PET (default) | -5.148 |
| 10 | RMA (default) | -5.086 | 10 | AK (AK1) | -5.101 |
| 11 | AK (AK1) | -5.078 | 11 | RMA (default) | -5.086 |
| 12 | trimfill (default) | -4.937 | 12 | trimfill (default) | -4.936 |
| 13 | mean (default) | -4.588 | 13 | mean (default) | -4.588 |
| 14 | WLS (default) | -4.347 | 14 | WLS (default) | -4.347 |
| 15 | PEESE (default) | -4.310 | 15 | PEESE (default) | -4.310 |
| 16 | SM (3PSM) | -3.941 | 16 | SM (3PSM) | -3.941 |
| 17 | FMA (default) | -3.575 | 17 | FMA (default) | -3.575 |
| 18 | puniform (star) | -3.410 | 18 | puniform (star) | -3.410 |
| 19 | WILS (default) | -3.167 | 19 | WILS (default) | -3.167 |
| 20 | puniform (default) | -2.490 | 20 | puniform (default) | -2.490 |
| 21 | pcurve (default) | NaN | 21 | pcurve (default) | NaN |

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.
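Both likelihood ratios follow directly from a method's power and type I error rate: LR+ = power / α and LR− = (1 − power) / (1 − α). A minimal sketch (the `likelihood_ratios` helper is hypothetical, for illustration only):

```python
import math

def likelihood_ratios(power, type1_error):
    """Log positive and log negative likelihood ratios of a test.
    LR+ = power / alpha; LR- = (1 - power) / (1 - alpha)."""
    log_lr_pos = math.log(power / type1_error)
    log_lr_neg = math.log((1 - power) / (1 - type1_error))
    return log_lr_pos, log_lr_neg

# e.g. power 0.95 at type I error 0.05:
log_lr_pos, log_lr_neg = likelihood_ratios(0.95, 0.05)
# log LR+ = log(19) > 0: a significant result is informative
# log LR- = -log(19) < 0: a non-significant result is informative
```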

Conditional on Convergence (left) / Replacement if Non-Convergence (right)

| Rank | Method | Value | Rank | Method | Value |
|---|---|---|---|---|---|
| 1 | RoBMA (PSMA) | 0.092 | 1 | RoBMA (PSMA) | 0.092 |
| 2 | AK (AK2) | 0.196 | 2 | AK (AK2) | 0.206 |
| 3 | MAIVE (default) | 0.229 | 3 | MAIVE (default) | 0.229 |
| 4 | MAIVE (WAIVE) | 0.230 | 4 | MAIVE (WAIVE) | 0.230 |
| 5 | RMA (default) | 0.359 | 5 | RMA (default) | 0.359 |
| 6 | SM (4PSM) | 0.372 | 6 | SM (4PSM) | 0.372 |
| 7 | AK (AK1) | 0.452 | 7 | AK (AK1) | 0.452 |
| 8 | trimfill (default) | 0.484 | 8 | trimfill (default) | 0.484 |
| 9 | FMA (default) | 0.535 | 9 | FMA (default) | 0.535 |
| 10 | PETPEESE (default) | 0.536 | 10 | PETPEESE (default) | 0.536 |
| 11 | EK (default) | 0.545 | 11 | EK (default) | 0.545 |
| 11 | PET (default) | 0.545 | 11 | PET (default) | 0.545 |
| 13 | mean (default) | 0.552 | 13 | mean (default) | 0.552 |
| 14 | WAAPWLS (default) | 0.574 | 14 | WAAPWLS (default) | 0.574 |
| 15 | SM (3PSM) | 0.637 | 15 | SM (3PSM) | 0.637 |
| 16 | WLS (default) | 0.645 | 16 | WLS (default) | 0.645 |
| 17 | PEESE (default) | 0.647 | 17 | PEESE (default) | 0.647 |
| 18 | puniform (star) | 0.697 | 18 | puniform (star) | 0.697 |
| 19 | puniform (default) | 0.707 | 19 | puniform (default) | 0.707 |
| 20 | WILS (default) | 0.775 | 20 | WILS (default) | 0.775 |
| 21 | pcurve (default) | NaN | 21 | pcurve (default) | NaN |

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Conditional on Convergence (left) / Replacement if Non-Convergence (right)

| Rank | Method | Value | Rank | Method | Value |
|---|---|---|---|---|---|
| 1 | puniform (default) | 1.000 | 1 | puniform (default) | 1.000 |
| 2 | mean (default) | 0.998 | 2 | mean (default) | 0.998 |
| 3 | AK (AK1) | 0.997 | 3 | AK (AK1) | 0.997 |
| 4 | WLS (default) | 0.993 | 4 | WLS (default) | 0.993 |
| 5 | PEESE (default) | 0.993 | 5 | PEESE (default) | 0.993 |
| 6 | trimfill (default) | 0.992 | 6 | trimfill (default) | 0.992 |
| 7 | EK (default) | 0.988 | 7 | EK (default) | 0.988 |
| 7 | PET (default) | 0.988 | 7 | PET (default) | 0.988 |
| 9 | PETPEESE (default) | 0.986 | 9 | PETPEESE (default) | 0.986 |
| 10 | RMA (default) | 0.985 | 10 | RMA (default) | 0.985 |
| 11 | puniform (star) | 0.983 | 11 | puniform (star) | 0.983 |
| 12 | WILS (default) | 0.982 | 12 | WILS (default) | 0.982 |
| 13 | WAAPWLS (default) | 0.976 | 13 | WAAPWLS (default) | 0.976 |
| 14 | AK (AK2) | 0.974 | 14 | AK (AK2) | 0.974 |
| 15 | SM (3PSM) | 0.971 | 15 | SM (3PSM) | 0.971 |
| 16 | MAIVE (default) | 0.960 | 16 | MAIVE (default) | 0.960 |
| 17 | MAIVE (WAIVE) | 0.959 | 17 | MAIVE (WAIVE) | 0.959 |
| 18 | SM (4PSM) | 0.957 | 18 | SM (4PSM) | 0.957 |
| 19 | RoBMA (PSMA) | 0.945 | 19 | RoBMA (PSMA) | 0.945 |
| 20 | FMA (default) | 0.928 | 20 | FMA (default) | 0.928 |
| 21 | pcurve (default) | NaN | 21 | pcurve (default) | NaN |

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

By-Condition Performance (Conditional on Method Convergence)

The results below are conditional on method convergence. Note that the methods might differ in convergence rate and are therefore not compared on the same data sets.

Raincloud plot showing convergence rates across different methods

Raincloud plot showing RMSE (Root Mean Square Error) across different methods

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method. Values larger than 0.5 are visualized as 0.5.

Raincloud plot showing bias across different methods

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0. Values lower than -0.5 or larger than 0.5 are visualized as -0.5 and 0.5 respectively.

Raincloud plot showing empirical SE across different methods

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance. Values larger than 0.5 are visualized as 0.5.

Raincloud plot showing interval score across different methods

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method. Values larger than 100 are visualized as 100.

Raincloud plot showing 95% confidence interval coverage across different methods

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Raincloud plot showing 95% confidence interval width across different methods

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method.

Raincloud plot showing positive likelihood ratio across different methods

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Raincloud plot showing negative likelihood ratio across different methods

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.

Raincloud plot showing Type I Error rates across different methods

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Raincloud plot showing statistical power across different methods

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

By-Condition Performance (Replacement in Case of Non-Convergence)

The results below incorporate method replacement to handle non-convergence. If a method fails to converge, its results are replaced with the results from a simpler method (e.g., random-effects meta-analysis without publication bias adjustment). This emulates what a data analyst may do in practice when a method does not converge. However, note that these results do not correspond to “pure” method performance, as they may combine results from multiple different methods. See Method Replacement Strategy for details of the method replacement specification.
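The replacement logic described above can be sketched as a simple fallback chain. Everything here is a hypothetical illustration; the actual replacement chain is specified on the Method Replacement Strategy page.

```python
def estimate_with_fallback(data, methods):
    """Try each (name, fit_function) pair in order; on a convergence
    failure, fall back to the next method in the chain, e.g. ending
    with a plain random-effects meta-analysis (RMA)."""
    for name, fit in methods:
        try:
            return name, fit(data)
        except RuntimeError:  # stand-in for a convergence failure
            continue
    raise RuntimeError("no method in the chain converged")

# toy chain: a selection model that "fails", then a simple mean as RMA stand-in
def failing_4psm(data):
    raise RuntimeError("non-convergence")

def simple_rma(data):
    return sum(data) / len(data)

name, estimate = estimate_with_fallback(
    [0.1, 0.2, 0.3], [("SM (4PSM)", failing_4psm), ("RMA", simple_rma)]
)
# -> name == "RMA", estimate == 0.2
```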

Raincloud plot showing convergence rates across different methods

Raincloud plot showing RMSE (Root Mean Square Error) across different methods

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method. Values larger than 0.5 are visualized as 0.5.

Raincloud plot showing bias across different methods

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0. Values lower than -0.5 or larger than 0.5 are visualized as -0.5 and 0.5 respectively.

Raincloud plot showing empirical SE across different methods

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance. Values larger than 0.5 are visualized as 0.5.

Raincloud plot showing interval score across different methods

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method. Values larger than 100 are visualized as 100.

Raincloud plot showing 95% confidence interval coverage across different methods

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Raincloud plot showing 95% confidence interval width across different methods

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method.

Raincloud plot showing positive likelihood ratio across different methods

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Raincloud plot showing negative likelihood ratio across different methods

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.

Raincloud plot showing Type I Error rates across different methods

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Raincloud plot showing statistical power across different methods

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

Subset: Fixed Effects

These results are based on the Alinaghi (2018) data-generating mechanism, with a total of 27 conditions.

Average Performance

Method performance measures are aggregated across all simulated conditions to provide an overall impression of method performance. However, keep in mind that a method with a high overall ranking is not necessarily the “best” method for a particular application. To select a suitable method for your application, also consider the non-aggregated performance measures in the conditions most relevant to your application.

Conditional on Convergence (left) / Replacement if Non-Convergence (right)

| Rank | Method | Value | Rank | Method | Value |
|---|---|---|---|---|---|
| 1 | RoBMA (PSMA) | 0.008 | 1 | RoBMA (PSMA) | 0.008 |
| 2 | AK (AK2) | 0.009 | 2 | PETPEESE (default) | 0.010 |
| 3 | PETPEESE (default) | 0.010 | 3 | PEESE (default) | 0.012 |
| 4 | PEESE (default) | 0.012 | 4 | WAAPWLS (default) | 0.012 |
| 5 | WAAPWLS (default) | 0.012 | 5 | WLS (default) | 0.014 |
| 6 | WLS (default) | 0.014 | 6 | FMA (default) | 0.014 |
| 7 | FMA (default) | 0.014 | 7 | EK (default) | 0.015 |
| 8 | trimfill (default) | 0.015 | 8 | trimfill (default) | 0.015 |
| 9 | EK (default) | 0.015 | 9 | WILS (default) | 0.016 |
| 10 | WILS (default) | 0.016 | 10 | SM (4PSM) | 0.017 |
| 11 | SM (4PSM) | 0.017 | 11 | AK (AK2) | 0.017 |
| 12 | PET (default) | 0.020 | 12 | PET (default) | 0.020 |
| 13 | RMA (default) | 0.022 | 13 | RMA (default) | 0.022 |
| 14 | AK (AK1) | 0.037 | 14 | AK (AK1) | 0.036 |
| 15 | SM (3PSM) | 0.041 | 15 | SM (3PSM) | 0.041 |
| 16 | puniform (star) | 0.047 | 16 | puniform (star) | 0.047 |
| 17 | puniform (default) | 0.080 | 17 | puniform (default) | 0.080 |
| 18 | MAIVE (WAIVE) | 0.085 | 18 | MAIVE (WAIVE) | 0.085 |
| 19 | MAIVE (default) | 0.114 | 19 | MAIVE (default) | 0.114 |
| 20 | mean (default) | 0.348 | 20 | mean (default) | 0.348 |
| 21 | pcurve (default) | 1.340 | 21 | pcurve (default) | 1.340 |

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method.

Conditional on Convergence (left) / Replacement if Non-Convergence (right)

| Rank | Method | Value | Rank | Method | Value |
|---|---|---|---|---|---|
| 1 | AK (AK2) | 0.000 | 1 | RoBMA (PSMA) | 0.000 |
| 2 | RoBMA (PSMA) | 0.000 | 2 | PETPEESE (default) | -0.001 |
| 3 | PETPEESE (default) | -0.001 | 3 | PEESE (default) | 0.001 |
| 4 | PEESE (default) | 0.001 | 4 | trimfill (default) | 0.002 |
| 5 | trimfill (default) | 0.002 | 5 | WILS (default) | -0.003 |
| 6 | WILS (default) | -0.003 | 6 | WAAPWLS (default) | 0.003 |
| 7 | WAAPWLS (default) | 0.003 | 7 | AK (AK2) | 0.003 |
| 8 | SM (4PSM) | -0.006 | 8 | SM (4PSM) | -0.006 |
| 9 | FMA (default) | 0.007 | 9 | FMA (default) | 0.007 |
| 10 | WLS (default) | 0.007 | 10 | WLS (default) | 0.007 |
| 11 | EK (default) | -0.008 | 11 | EK (default) | -0.008 |
| 12 | RMA (default) | 0.012 | 12 | RMA (default) | 0.012 |
| 13 | PET (default) | -0.013 | 13 | PET (default) | -0.013 |
| 14 | puniform (default) | 0.018 | 14 | puniform (default) | 0.018 |
| 15 | SM (3PSM) | 0.019 | 15 | SM (3PSM) | 0.019 |
| 16 | MAIVE (WAIVE) | 0.023 | 16 | MAIVE (WAIVE) | 0.023 |
| 17 | puniform (star) | 0.025 | 17 | puniform (star) | 0.025 |
| 18 | AK (AK1) | 0.028 | 18 | AK (AK1) | 0.027 |
| 19 | MAIVE (default) | 0.038 | 19 | MAIVE (default) | 0.038 |
| 20 | mean (default) | 0.318 | 20 | mean (default) | 0.318 |
| 21 | pcurve (default) | -1.305 | 21 | pcurve (default) | -1.305 |

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0.

Conditional on Convergence (left) / Replacement if Non-Convergence (right)

| Rank | Method | Value | Rank | Method | Value |
|---|---|---|---|---|---|
| 1 | RoBMA (PSMA) | 0.008 | 1 | RoBMA (PSMA) | 0.008 |
| 2 | AK (AK2) | 0.009 | 2 | PEESE (default) | 0.010 |
| 3 | PEESE (default) | 0.010 | 3 | WLS (default) | 0.010 |
| 4 | WLS (default) | 0.010 | 4 | FMA (default) | 0.010 |
| 5 | FMA (default) | 0.010 | 5 | PETPEESE (default) | 0.010 |
| 6 | PETPEESE (default) | 0.010 | 6 | WAAPWLS (default) | 0.010 |
| 7 | WAAPWLS (default) | 0.010 | 7 | EK (default) | 0.011 |
| 8 | EK (default) | 0.011 | 8 | WILS (default) | 0.011 |
| 9 | WILS (default) | 0.011 | 9 | PET (default) | 0.011 |
| 10 | PET (default) | 0.011 | 10 | trimfill (default) | 0.013 |
| 11 | trimfill (default) | 0.013 | 11 | SM (4PSM) | 0.014 |
| 12 | SM (4PSM) | 0.014 | 12 | RMA (default) | 0.015 |
| 13 | RMA (default) | 0.015 | 13 | AK (AK2) | 0.017 |
| 14 | puniform (star) | 0.017 | 14 | puniform (star) | 0.017 |
| 15 | AK (AK1) | 0.018 | 15 | AK (AK1) | 0.019 |
| 16 | SM (3PSM) | 0.021 | 16 | SM (3PSM) | 0.021 |
| 17 | pcurve (default) | 0.036 | 17 | pcurve (default) | 0.036 |
| 18 | MAIVE (WAIVE) | 0.063 | 18 | MAIVE (WAIVE) | 0.063 |
| 19 | mean (default) | 0.068 | 19 | mean (default) | 0.068 |
| 20 | MAIVE (default) | 0.070 | 20 | MAIVE (default) | 0.070 |
| 21 | puniform (default) | 0.075 | 21 | puniform (default) | 0.075 |

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance.

Conditional on Convergence (left) / Replacement if Non-Convergence (right)

| Rank | Method | Value | Rank | Method | Value |
|---|---|---|---|---|---|
| 1 | RoBMA (PSMA) | 0.035 | 1 | RoBMA (PSMA) | 0.035 |
| 2 | AK (AK2) | 0.043 | 2 | PETPEESE (default) | 0.053 |
| 3 | PETPEESE (default) | 0.053 | 3 | PEESE (default) | 0.096 |
| 4 | PEESE (default) | 0.096 | 4 | AK (AK2) | 0.100 |
| 5 | SM (4PSM) | 0.101 | 5 | SM (4PSM) | 0.101 |
| 6 | WAAPWLS (default) | 0.108 | 6 | WAAPWLS (default) | 0.108 |
| 7 | trimfill (default) | 0.126 | 7 | trimfill (default) | 0.127 |
| 8 | WLS (default) | 0.131 | 8 | WLS (default) | 0.131 |
| 9 | FMA (default) | 0.136 | 9 | FMA (default) | 0.136 |
| 10 | EK (default) | 0.148 | 10 | EK (default) | 0.148 |
| 11 | WILS (default) | 0.172 | 11 | WILS (default) | 0.172 |
| 12 | PET (default) | 0.242 | 12 | PET (default) | 0.242 |
| 13 | RMA (default) | 0.289 | 13 | RMA (default) | 0.289 |
| 14 | puniform (default) | 0.397 | 14 | puniform (default) | 0.397 |
| 15 | MAIVE (WAIVE) | 0.681 | 15 | AK (AK1) | 0.668 |
| 16 | AK (AK1) | 0.692 | 16 | MAIVE (WAIVE) | 0.681 |
| 17 | SM (3PSM) | 0.728 | 17 | SM (3PSM) | 0.728 |
| 18 | puniform (star) | 1.047 | 18 | puniform (star) | 1.047 |
| 19 | MAIVE (default) | 1.207 | 19 | MAIVE (default) | 1.207 |
| 20 | mean (default) | 10.014 | 20 | mean (default) | 10.014 |
| 21 | pcurve (default) | NaN | 21 | pcurve (default) | NaN |

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method.

Conditional on Convergence (left) / Replacement if Non-Convergence (right)

| Rank | Method | Value | Rank | Method | Value |
|---|---|---|---|---|---|
| 1 | RoBMA (PSMA) | 0.953 | 1 | RoBMA (PSMA) | 0.953 |
| 2 | AK (AK2) | 0.952 | 2 | AK (AK2) | 0.942 |
| 3 | SM (4PSM) | 0.939 | 3 | SM (4PSM) | 0.939 |
| 4 | puniform (default) | 0.927 | 4 | puniform (default) | 0.927 |
| 5 | PETPEESE (default) | 0.920 | 5 | PETPEESE (default) | 0.920 |
| 6 | WAAPWLS (default) | 0.909 | 6 | WAAPWLS (default) | 0.909 |
| 7 | trimfill (default) | 0.905 | 7 | trimfill (default) | 0.904 |
| 8 | AK (AK1) | 0.903 | 8 | AK (AK1) | 0.901 |
| 9 | PEESE (default) | 0.886 | 9 | PEESE (default) | 0.886 |
| 10 | SM (3PSM) | 0.885 | 10 | SM (3PSM) | 0.885 |
| 11 | puniform (star) | 0.879 | 11 | puniform (star) | 0.879 |
| 12 | WLS (default) | 0.849 | 12 | WLS (default) | 0.849 |
| 13 | RMA (default) | 0.839 | 13 | RMA (default) | 0.839 |
| 14 | FMA (default) | 0.829 | 14 | FMA (default) | 0.829 |
| 15 | MAIVE (WAIVE) | 0.789 | 15 | MAIVE (WAIVE) | 0.789 |
| 16 | EK (default) | 0.757 | 16 | EK (default) | 0.757 |
| 17 | WILS (default) | 0.748 | 17 | WILS (default) | 0.748 |
| 18 | MAIVE (default) | 0.727 | 18 | MAIVE (default) | 0.727 |
| 19 | PET (default) | 0.603 | 19 | PET (default) | 0.603 |
| 20 | mean (default) | 0.378 | 20 | mean (default) | 0.378 |
| 21 | pcurve (default) | NaN | 21 | pcurve (default) | NaN |

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Conditional on Convergence (left) / Replacement if Non-Convergence (right)

| Rank | Method | Value | Rank | Method | Value |
|---|---|---|---|---|---|
| 1 | RoBMA (PSMA) | 0.029 | 1 | RoBMA (PSMA) | 0.029 |
| 2 | WILS (default) | 0.033 | 2 | WILS (default) | 0.033 |
| 3 | FMA (default) | 0.034 | 3 | FMA (default) | 0.034 |
| 4 | PEESE (default) | 0.035 | 4 | PEESE (default) | 0.035 |
| 5 | WAAPWLS (default) | 0.036 | 5 | WAAPWLS (default) | 0.036 |
| 6 | PETPEESE (default) | 0.036 | 6 | PETPEESE (default) | 0.036 |
| 7 | AK (AK2) | 0.036 | 7 | WLS (default) | 0.036 |
| 8 | WLS (default) | 0.036 | 8 | EK (default) | 0.038 |
| 9 | EK (default) | 0.038 | 9 | AK (AK2) | 0.038 |
| 10 | PET (default) | 0.040 | 10 | PET (default) | 0.040 |
| 11 | SM (4PSM) | 0.045 | 11 | SM (4PSM) | 0.045 |
| 12 | trimfill (default) | 0.053 | 12 | trimfill (default) | 0.053 |
| 13 | AK (AK1) | 0.054 | 13 | AK (AK1) | 0.053 |
| 14 | RMA (default) | 0.056 | 14 | RMA (default) | 0.056 |
| 15 | SM (3PSM) | 0.058 | 15 | SM (3PSM) | 0.058 |
| 16 | puniform (star) | 0.060 | 16 | puniform (star) | 0.060 |
| 17 | MAIVE (WAIVE) | 0.220 | 17 | MAIVE (WAIVE) | 0.220 |
| 18 | mean (default) | 0.247 | 18 | mean (default) | 0.247 |
| 19 | MAIVE (default) | 0.267 | 19 | MAIVE (default) | 0.267 |
| 20 | puniform (default) | 0.326 | 20 | puniform (default) | 0.326 |
| 21 | pcurve (default) | NaN | 21 | pcurve (default) | NaN |

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method.

Conditional on Convergence (left) / Replacement if Non-Convergence (right)

| Rank | Method | Log Value | Rank | Method | Log Value |
|---|---|---|---|---|---|
| 1 | RoBMA (PSMA) | 7.601 | 1 | RoBMA (PSMA) | 7.601 |
| 2 | MAIVE (WAIVE) | 3.338 | 2 | MAIVE (WAIVE) | 3.338 |
| 3 | MAIVE (default) | 3.323 | 3 | MAIVE (default) | 3.323 |
| 4 | AK (AK2) | 3.114 | 4 | AK (AK2) | 2.832 |
| 5 | EK (default) | 2.803 | 5 | EK (default) | 2.803 |
| 5 | PET (default) | 2.803 | 5 | PET (default) | 2.803 |
| 7 | PETPEESE (default) | 2.631 | 7 | PETPEESE (default) | 2.631 |
| 8 | RMA (default) | 2.357 | 8 | RMA (default) | 2.357 |
| 9 | puniform (default) | 2.246 | 9 | puniform (default) | 2.246 |
| 10 | AK (AK1) | 2.233 | 10 | AK (AK1) | 2.236 |
| 11 | trimfill (default) | 2.164 | 11 | trimfill (default) | 2.169 |
| 12 | SM (4PSM) | 2.158 | 12 | SM (4PSM) | 2.158 |
| 13 | WAAPWLS (default) | 1.936 | 13 | WAAPWLS (default) | 1.936 |
| 14 | WLS (default) | 1.900 | 14 | WLS (default) | 1.900 |
| 15 | PEESE (default) | 1.860 | 15 | PEESE (default) | 1.860 |
| 16 | mean (default) | 1.671 | 16 | mean (default) | 1.671 |
| 17 | FMA (default) | 1.398 | 17 | FMA (default) | 1.398 |
| 18 | WILS (default) | 1.255 | 18 | WILS (default) | 1.255 |
| 19 | puniform (star) | 1.106 | 19 | puniform (star) | 1.106 |
| 20 | SM (3PSM) | 1.100 | 20 | SM (3PSM) | 1.100 |
| 21 | pcurve (default) | NaN | 21 | pcurve (default) | NaN |

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Conditional on Convergence (left) / Replacement if Non-Convergence (right)

| Rank | Method | Log Value | Rank | Method | Log Value |
|---|---|---|---|---|---|
| 1 | RoBMA (PSMA) | -7.601 | 1 | RoBMA (PSMA) | -7.601 |
| 2 | EK (default) | -7.539 | 2 | EK (default) | -7.539 |
| 2 | PET (default) | -7.539 | 2 | PET (default) | -7.539 |
| 4 | MAIVE (default) | -7.533 | 4 | MAIVE (default) | -7.533 |
| 5 | PETPEESE (default) | -7.523 | 5 | AK (AK2) | -7.531 |
| 6 | puniform (default) | -7.469 | 6 | PETPEESE (default) | -7.523 |
| 7 | SM (4PSM) | -7.359 | 7 | puniform (default) | -7.469 |
| 8 | MAIVE (WAIVE) | -7.314 | 8 | SM (4PSM) | -7.359 |
| 9 | WAAPWLS (default) | -6.809 | 9 | MAIVE (WAIVE) | -7.314 |
| 10 | AK (AK2) | -6.178 | 10 | WAAPWLS (default) | -6.809 |
| 11 | AK (AK1) | -6.073 | 11 | AK (AK1) | -6.141 |
| 12 | SM (3PSM) | -5.936 | 12 | SM (3PSM) | -5.936 |
| 13 | RMA (default) | -5.045 | 13 | RMA (default) | -5.045 |
| 14 | trimfill (default) | -5.043 | 14 | trimfill (default) | -5.040 |
| 15 | WLS (default) | -5.028 | 15 | WLS (default) | -5.028 |
| 16 | PEESE (default) | -5.026 | 16 | PEESE (default) | -5.026 |
| 17 | mean (default) | -4.983 | 17 | mean (default) | -4.983 |
| 18 | FMA (default) | -4.956 | 18 | FMA (default) | -4.956 |
| 19 | WILS (default) | -4.923 | 19 | WILS (default) | -4.923 |
| 20 | puniform (star) | -4.716 | 20 | puniform (star) | -4.716 |
| 21 | pcurve (default) | NaN | 21 | pcurve (default) | NaN |

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.
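The two likelihood ratios described above reduce to simple ratios of power and the type I error rate: LR+ = power / α and LR− = (1 − power) / (1 − α). The sketch below is illustrative Python with made-up rejection counts, not code from this benchmark:

```python
import math

def likelihood_ratios(reject_h1, reject_h0):
    """Positive/negative likelihood ratios from per-run rejection indicators.

    reject_h1: 0/1 rejections in runs where the alternative is true (gives power).
    reject_h0: 0/1 rejections in runs where the null is true (gives type I error).
    """
    power = sum(reject_h1) / len(reject_h1)
    alpha = sum(reject_h0) / len(reject_h0)
    lr_pos = power / alpha                # > 1: a significant result favors H1
    lr_neg = (1 - power) / (1 - alpha)    # < 1: a non-significant result favors H0
    return lr_pos, lr_neg

# Toy numbers: 80% power, 5% type I error rate
lr_pos, lr_neg = likelihood_ratios([1] * 80 + [0] * 20, [1] * 5 + [0] * 95)
print(round(lr_pos, 2), round(math.log(lr_pos), 2))  # 16.0 2.77 (log > 0: useful)
print(round(lr_neg, 3), round(math.log(lr_neg), 3))  # 0.211 -1.558 (log < 0: useful)
```

The log scale used in the tables simply makes "useful" correspond to a sign: positive log LR+ and negative log LR−.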

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 RoBMA (PSMA) 0.000 1 RoBMA (PSMA) 0.000
2 MAIVE (WAIVE) 0.038 2 MAIVE (WAIVE) 0.038
3 MAIVE (default) 0.039 3 MAIVE (default) 0.039
4 AK (AK2) 0.044 4 EK (default) 0.060
5 EK (default) 0.060 4 PET (default) 0.060
5 PET (default) 0.060 6 AK (AK2) 0.067
7 PETPEESE (default) 0.075 7 PETPEESE (default) 0.075
8 puniform (default) 0.121 8 puniform (default) 0.121
9 SM (4PSM) 0.187 9 SM (4PSM) 0.187
10 WAAPWLS (default) 0.337 10 WAAPWLS (default) 0.337
11 AK (AK1) 0.353 11 AK (AK1) 0.353
12 RMA (default) 0.356 12 RMA (default) 0.356
13 trimfill (default) 0.360 13 trimfill (default) 0.360
14 WLS (default) 0.372 14 WLS (default) 0.372
15 PEESE (default) 0.374 15 PEESE (default) 0.374
16 mean (default) 0.410 16 mean (default) 0.410
17 FMA (default) 0.433 17 FMA (default) 0.433
18 WILS (default) 0.459 18 WILS (default) 0.459
19 SM (3PSM) 0.557 19 SM (3PSM) 0.557
20 puniform (star) 0.563 20 puniform (star) 0.563
21 pcurve (default) NaN 21 pcurve (default) NaN

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 AK (AK1) 1.000 1 AK (AK1) 1.000
1 AK (AK2) 1.000 1 AK (AK2) 1.000
1 EK (default) 1.000 1 EK (default) 1.000
1 FMA (default) 1.000 1 FMA (default) 1.000
1 mean (default) 1.000 1 mean (default) 1.000
1 PEESE (default) 1.000 1 PEESE (default) 1.000
1 PET (default) 1.000 1 PET (default) 1.000
1 PETPEESE (default) 1.000 1 PETPEESE (default) 1.000
1 puniform (default) 1.000 1 puniform (default) 1.000
1 puniform (star) 1.000 1 puniform (star) 1.000
1 RMA (default) 1.000 1 RMA (default) 1.000
1 RoBMA (PSMA) 1.000 1 RoBMA (PSMA) 1.000
1 SM (3PSM) 1.000 1 SM (3PSM) 1.000
1 SM (4PSM) 1.000 1 SM (4PSM) 1.000
1 trimfill (default) 1.000 1 trimfill (default) 1.000
1 WAAPWLS (default) 1.000 1 WAAPWLS (default) 1.000
1 WILS (default) 1.000 1 WILS (default) 1.000
1 WLS (default) 1.000 1 WLS (default) 1.000
19 MAIVE (default) 1.000 19 MAIVE (default) 1.000
20 MAIVE (WAIVE) 0.999 20 MAIVE (WAIVE) 0.999
21 pcurve (default) NaN 21 pcurve (default) NaN

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

By-Condition Performance (Conditional on Method Convergence)

The results below are conditional on method convergence. Note that the methods might differ in convergence rate and are therefore not compared on the same data sets.

Raincloud plot showing convergence rates across different methods

Raincloud plot showing RMSE (Root Mean Square Error) across different methods

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method. Values larger than 0.5 are visualized as 0.5.

Raincloud plot showing bias across different methods

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0. Values lower than -0.5 or larger than 0.5 are visualized as -0.5 and 0.5 respectively.

Raincloud plot showing empirical SE across different methods

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance. Values larger than 0.5 are visualized as 0.5.
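The three estimation measures above are linked: squared RMSE equals squared bias plus the (slightly rescaled, since the empirical SE uses the sample standard deviation) squared empirical SE. A minimal Python sketch with hypothetical estimates, not data from this benchmark:

```python
import math

def estimation_summaries(estimates, true_effect):
    """Bias, empirical SE, and RMSE of meta-analytic estimates across runs."""
    n = len(estimates)
    bias = sum(e - true_effect for e in estimates) / n
    # empirical SE: sample standard deviation of the estimates across runs
    mean_est = sum(estimates) / n
    emp_se = math.sqrt(sum((e - mean_est) ** 2 for e in estimates) / (n - 1))
    # RMSE: root of the average squared distance from the TRUE effect
    rmse = math.sqrt(sum((e - true_effect) ** 2 for e in estimates) / n)
    return bias, emp_se, rmse

estimates = [0.35, 0.42, 0.30, 0.51, 0.44]  # hypothetical estimates from 5 runs
bias, emp_se, rmse = estimation_summaries(estimates, true_effect=0.30)
# Decomposition: RMSE^2 = bias^2 + (n-1)/n * empSE^2
assert abs(rmse ** 2 - (bias ** 2 + 4 / 5 * emp_se ** 2)) < 1e-9
```

This is why a method can have excellent bias yet a poor RMSE (high run-to-run variability), or vice versa.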

Raincloud plot showing interval score across different methods

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method. Values larger than 100 are visualized as 100.
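Assuming the benchmark uses the standard interval score for a central (1 − α) interval — the interval's width plus a penalty of 2/α per unit that the true value falls outside it — a minimal sketch with hypothetical intervals looks like:

```python
def interval_score(lower, upper, true_value, alpha=0.05):
    """Interval score for a (1 - alpha) confidence interval; lower is better.

    Width of the interval plus a 2/alpha penalty per unit the true
    value lies outside [lower, upper].
    """
    score = upper - lower
    if true_value < lower:
        score += (2 / alpha) * (lower - true_value)
    elif true_value > upper:
        score += (2 / alpha) * (true_value - upper)
    return score

# A narrow interval that misses the truth scores worse than a wide one that covers it
print(round(interval_score(0.10, 0.30, true_value=0.35), 2))  # 2.2 (0.2 width + 40 * 0.05 miss)
print(round(interval_score(0.00, 0.60, true_value=0.35), 2))  # 0.6 (width only, truth covered)
```

With α = 0.05 the miss penalty factor is 40, which is why methods with poor coverage can rank badly here despite narrow intervals.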

Raincloud plot showing 95% confidence interval coverage across different methods

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Raincloud plot showing 95% confidence interval width across different methods

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method.

Raincloud plot showing positive likelihood ratio across different methods

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Raincloud plot showing negative likelihood ratio across different methods

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.

Raincloud plot showing Type I Error rates across different methods

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Raincloud plot showing statistical power across different methods

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

By-Condition Performance (Replacement in Case of Non-Convergence)

The results below incorporate method replacement to handle non-convergence. If a method fails to converge, its results are replaced with those of a simpler method (e.g., a random-effects meta-analysis without publication bias adjustment). This emulates what a data analyst may do in practice when a method does not converge. However, note that these results do not reflect “pure” method performance, as they may combine results from multiple methods. See Method Replacement Strategy for details of the method replacement specification.
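The replacement logic itself is simple; a minimal Python sketch with hypothetical estimates, using `None` to mark a non-converged run, might look like:

```python
def with_replacement(primary_results, fallback_results):
    """Replace non-converged runs of a method with a simpler fallback.

    Each entry is an estimate, or None when the method did not converge;
    the fallback (e.g. a random-effects model) is assumed to always converge.
    Mirrors the 'Replacement if Non-Convergence' columns above.
    """
    return [p if p is not None else f
            for p, f in zip(primary_results, fallback_results)]

# Hypothetical estimates: a selection model that failed twice, and an RMA fallback
selection = [0.21, None, 0.18, None]
rma = [0.30, 0.28, 0.25, 0.33]
print(with_replacement(selection, rma))  # [0.21, 0.28, 0.18, 0.33]
```

Performance measures are then computed on the merged series, which is why the two columns differ only for methods with imperfect convergence.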

Raincloud plot showing convergence rates across different methods

Raincloud plot showing RMSE (Root Mean Square Error) across different methods

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method. Values larger than 0.5 are visualized as 0.5.

Raincloud plot showing bias across different methods

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0. Values lower than -0.5 or larger than 0.5 are visualized as -0.5 and 0.5 respectively.

Raincloud plot showing empirical SE across different methods

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance. Values larger than 0.5 are visualized as 0.5.

Raincloud plot showing interval score across different methods

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method. Values larger than 100 are visualized as 100.

Raincloud plot showing 95% confidence interval coverage across different methods

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Raincloud plot showing 95% confidence interval width across different methods

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method.

Raincloud plot showing positive likelihood ratio across different methods

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Raincloud plot showing negative likelihood ratio across different methods

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.

Raincloud plot showing Type I Error rates across different methods

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Raincloud plot showing statistical power across different methods

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

Subset: Random Effects

These results are based on the Alinaghi (2018) data-generating mechanism with a total of 27 conditions.

Average Performance

Method performance measures are aggregated across all simulated conditions to provide an overall impression of method performance. However, keep in mind that a method with a high overall ranking is not necessarily the “best” method for a particular application. To select a suitable method, also consider the non-aggregated performance measures in the conditions most relevant to your application.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 RoBMA (PSMA) 0.098 1 RoBMA (PSMA) 0.098
2 AK (AK2) 0.101 2 AK (AK2) 0.120
3 trimfill (default) 0.151 3 trimfill (default) 0.151
4 AK (AK1) 0.173 4 AK (AK1) 0.173
5 SM (4PSM) 0.186 5 SM (4PSM) 0.186
6 MAIVE (WAIVE) 0.191 6 MAIVE (WAIVE) 0.191
7 PEESE (default) 0.198 7 PEESE (default) 0.198
8 PETPEESE (default) 0.199 8 PETPEESE (default) 0.199
9 FMA (default) 0.199 9 FMA (default) 0.199
9 WLS (default) 0.199 9 WLS (default) 0.199
11 WAAPWLS (default) 0.207 11 WAAPWLS (default) 0.207
12 MAIVE (default) 0.216 12 MAIVE (default) 0.216
13 EK (default) 0.223 13 EK (default) 0.223
13 PET (default) 0.223 13 PET (default) 0.223
15 RMA (default) 0.272 15 RMA (default) 0.272
16 SM (3PSM) 0.280 16 SM (3PSM) 0.280
17 puniform (star) 0.287 17 puniform (star) 0.287
18 puniform (default) 0.448 18 puniform (default) 0.448
19 mean (default) 0.461 19 mean (default) 0.461
20 WILS (default) 0.475 20 WILS (default) 0.475
21 pcurve (default) 1.405 21 pcurve (default) 1.405

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 AK (AK2) 0.003 1 SM (3PSM) 0.020
2 SM (3PSM) 0.020 2 AK (AK2) 0.028
3 EK (default) 0.033 3 EK (default) 0.033
3 PET (default) 0.033 3 PET (default) 0.033
5 puniform (star) 0.036 5 puniform (star) 0.036
6 RoBMA (PSMA) -0.040 6 RoBMA (PSMA) -0.040
7 PETPEESE (default) 0.070 7 PETPEESE (default) 0.070
8 PEESE (default) 0.070 8 PEESE (default) 0.070
9 WAAPWLS (default) 0.072 9 WAAPWLS (default) 0.072
10 FMA (default) 0.086 10 FMA (default) 0.086
11 WLS (default) 0.086 11 WLS (default) 0.086
12 trimfill (default) 0.093 12 trimfill (default) 0.093
13 SM (4PSM) -0.107 13 SM (4PSM) -0.107
14 AK (AK1) 0.131 14 AK (AK1) 0.131
15 MAIVE (WAIVE) 0.135 15 MAIVE (WAIVE) 0.135
16 MAIVE (default) 0.158 16 MAIVE (default) 0.158
17 RMA (default) 0.247 17 RMA (default) 0.247
18 WILS (default) -0.283 18 WILS (default) -0.283
19 mean (default) 0.430 19 mean (default) 0.430
20 puniform (default) 0.438 20 puniform (default) 0.438
21 pcurve (default) -1.236 21 pcurve (default) -1.236

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 pcurve (default) 0.034 1 pcurve (default) 0.034
2 RMA (default) 0.058 2 RMA (default) 0.058
3 AK (AK1) 0.064 3 AK (AK1) 0.064
4 trimfill (default) 0.069 4 trimfill (default) 0.069
5 mean (default) 0.076 5 mean (default) 0.076
6 puniform (default) 0.080 6 puniform (default) 0.080
7 SM (3PSM) 0.082 7 SM (3PSM) 0.082
8 MAIVE (WAIVE) 0.084 8 MAIVE (WAIVE) 0.084
9 MAIVE (default) 0.084 9 MAIVE (default) 0.084
10 puniform (star) 0.085 10 puniform (star) 0.085
11 RoBMA (PSMA) 0.085 11 RoBMA (PSMA) 0.085
12 AK (AK2) 0.101 12 SM (4PSM) 0.104
13 SM (4PSM) 0.104 13 AK (AK2) 0.110
14 FMA (default) 0.148 14 FMA (default) 0.148
15 WLS (default) 0.148 15 WLS (default) 0.148
16 PEESE (default) 0.155 16 PEESE (default) 0.155
17 PETPEESE (default) 0.156 17 PETPEESE (default) 0.156
18 WAAPWLS (default) 0.166 18 WAAPWLS (default) 0.166
19 EK (default) 0.190 19 EK (default) 0.190
19 PET (default) 0.190 19 PET (default) 0.190
21 WILS (default) 0.270 21 WILS (default) 0.270

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 RoBMA (PSMA) 0.426 1 RoBMA (PSMA) 0.426
2 AK (AK2) 0.471 2 AK (AK2) 0.887
3 SM (4PSM) 2.479 3 SM (4PSM) 2.479
4 trimfill (default) 2.707 4 trimfill (default) 2.707
5 WAAPWLS (default) 2.898 5 WAAPWLS (default) 2.898
6 MAIVE (WAIVE) 3.371 6 MAIVE (WAIVE) 3.371
7 AK (AK1) 3.950 7 AK (AK1) 3.950
8 MAIVE (default) 4.196 8 MAIVE (default) 4.196
9 PEESE (default) 4.234 9 PEESE (default) 4.234
10 PETPEESE (default) 4.247 10 PETPEESE (default) 4.247
11 WLS (default) 4.339 11 WLS (default) 4.339
12 EK (default) 4.512 12 EK (default) 4.512
13 PET (default) 4.516 13 PET (default) 4.516
14 FMA (default) 5.792 14 FMA (default) 5.792
15 SM (3PSM) 6.604 15 SM (3PSM) 6.604
16 puniform (star) 6.865 16 puniform (star) 6.865
17 RMA (default) 7.368 17 RMA (default) 7.368
18 puniform (default) 12.917 18 puniform (default) 12.917
19 WILS (default) 14.064 19 WILS (default) 14.064
20 mean (default) 14.386 20 mean (default) 14.386
21 pcurve (default) NaN 21 pcurve (default) NaN

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 AK (AK2) 0.951 1 RoBMA (PSMA) 0.941
2 RoBMA (PSMA) 0.941 2 AK (AK2) 0.866
3 SM (4PSM) 0.835 3 SM (4PSM) 0.835
4 AK (AK1) 0.719 4 AK (AK1) 0.719
5 MAIVE (WAIVE) 0.646 5 MAIVE (WAIVE) 0.646
6 trimfill (default) 0.626 6 trimfill (default) 0.626
7 MAIVE (default) 0.615 7 MAIVE (default) 0.615
8 SM (3PSM) 0.598 8 SM (3PSM) 0.598
9 puniform (star) 0.586 9 puniform (star) 0.586
10 WAAPWLS (default) 0.529 10 WAAPWLS (default) 0.529
11 RMA (default) 0.422 11 RMA (default) 0.422
12 mean (default) 0.342 12 mean (default) 0.342
13 PETPEESE (default) 0.335 13 PETPEESE (default) 0.335
14 EK (default) 0.335 14 EK (default) 0.335
15 PEESE (default) 0.335 15 PEESE (default) 0.335
16 PET (default) 0.335 16 PET (default) 0.335
17 WLS (default) 0.330 17 WLS (default) 0.330
18 FMA (default) 0.113 18 FMA (default) 0.113
19 puniform (default) 0.098 19 puniform (default) 0.098
20 WILS (default) 0.091 20 WILS (default) 0.091
21 pcurve (default) NaN 21 pcurve (default) NaN

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 FMA (default) 0.051 1 FMA (default) 0.051
2 WILS (default) 0.147 2 WILS (default) 0.147
3 WLS (default) 0.153 3 WLS (default) 0.153
4 PEESE (default) 0.156 4 PEESE (default) 0.156
5 PETPEESE (default) 0.157 5 PETPEESE (default) 0.157
6 PET (default) 0.185 6 PET (default) 0.185
7 EK (default) 0.185 7 EK (default) 0.185
8 RMA (default) 0.228 8 RMA (default) 0.228
9 trimfill (default) 0.234 9 trimfill (default) 0.234
10 mean (default) 0.244 10 mean (default) 0.244
11 AK (AK1) 0.248 11 AK (AK1) 0.248
12 puniform (default) 0.254 12 puniform (default) 0.254
13 WAAPWLS (default) 0.304 13 WAAPWLS (default) 0.304
14 RoBMA (PSMA) 0.310 14 RoBMA (PSMA) 0.310
15 SM (3PSM) 0.314 15 SM (3PSM) 0.314
16 MAIVE (WAIVE) 0.315 16 MAIVE (WAIVE) 0.315
17 puniform (star) 0.324 17 puniform (star) 0.324
18 MAIVE (default) 0.328 18 MAIVE (default) 0.328
19 AK (AK2) 0.392 19 AK (AK2) 0.383
20 SM (4PSM) 0.408 20 SM (4PSM) 0.408
21 pcurve (default) NaN 21 pcurve (default) NaN

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Log Value Rank Method Log Value
1 RoBMA (PSMA) 6.663 1 RoBMA (PSMA) 6.663
2 AK (AK2) 3.115 2 AK (AK2) 3.117
3 MAIVE (default) 2.279 3 MAIVE (default) 2.279
4 MAIVE (WAIVE) 2.224 4 MAIVE (WAIVE) 2.224
5 RMA (default) 2.150 5 RMA (default) 2.150
6 AK (AK1) 2.110 6 AK (AK1) 2.110
7 trimfill (default) 1.957 7 trimfill (default) 1.957
8 SM (4PSM) 1.835 8 SM (4PSM) 1.835
9 mean (default) 1.179 9 mean (default) 1.179
10 SM (3PSM) 0.935 10 SM (3PSM) 0.935
11 puniform (star) 0.930 11 puniform (star) 0.930
12 WAAPWLS (default) 0.598 12 WAAPWLS (default) 0.598
13 PETPEESE (default) 0.403 13 PETPEESE (default) 0.403
14 WLS (default) 0.399 14 WLS (default) 0.399
15 PEESE (default) 0.395 15 PEESE (default) 0.395
16 EK (default) 0.384 16 EK (default) 0.384
16 PET (default) 0.384 16 PET (default) 0.384
18 FMA (default) 0.098 18 FMA (default) 0.098
19 WILS (default) 0.044 19 WILS (default) 0.044
20 puniform (default) 0.000 20 puniform (default) 0.000
21 pcurve (default) NaN 21 pcurve (default) NaN

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Log Value Rank Method Log Value
1 RoBMA (PSMA) -6.364 1 AK (AK2) -6.830
2 AK (AK2) -6.064 2 RoBMA (PSMA) -6.364
3 WAAPWLS (default) -5.828 3 WAAPWLS (default) -5.828
4 MAIVE (default) -5.648 4 MAIVE (default) -5.648
5 MAIVE (WAIVE) -5.474 5 MAIVE (WAIVE) -5.474
6 RMA (default) -5.039 6 RMA (default) -5.039
7 AK (AK1) -5.039 7 AK (AK1) -5.039
8 trimfill (default) -5.023 8 trimfill (default) -5.023
9 mean (default) -4.918 9 mean (default) -4.918
10 PETPEESE (default) -4.860 10 PETPEESE (default) -4.860
11 EK (default) -4.812 11 EK (default) -4.812
11 PET (default) -4.812 11 PET (default) -4.812
13 WLS (default) -4.362 13 WLS (default) -4.362
14 PEESE (default) -4.345 14 PEESE (default) -4.345
15 SM (4PSM) -4.038 15 SM (4PSM) -4.038
16 FMA (default) -3.715 16 FMA (default) -3.715
17 WILS (default) -2.818 17 WILS (default) -2.818
18 SM (3PSM) -1.905 18 SM (3PSM) -1.905
19 puniform (star) -1.874 19 puniform (star) -1.874
20 puniform (default) 0.000 20 puniform (default) 0.000
21 pcurve (default) NaN 21 pcurve (default) NaN

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 RoBMA (PSMA) 0.002 1 RoBMA (PSMA) 0.002
2 AK (AK2) 0.043 2 AK (AK2) 0.043
3 MAIVE (default) 0.353 3 MAIVE (default) 0.353
4 MAIVE (WAIVE) 0.356 4 MAIVE (WAIVE) 0.356
5 RMA (default) 0.361 5 RMA (default) 0.361
6 AK (AK1) 0.361 6 AK (AK1) 0.361
7 SM (4PSM) 0.369 7 SM (4PSM) 0.369
8 trimfill (default) 0.376 8 trimfill (default) 0.376
9 mean (default) 0.464 9 mean (default) 0.464
10 WAAPWLS (default) 0.594 10 WAAPWLS (default) 0.594
11 puniform (star) 0.684 11 puniform (star) 0.684
11 SM (3PSM) 0.684 11 SM (3PSM) 0.684
13 PETPEESE (default) 0.702 13 PETPEESE (default) 0.702
14 WLS (default) 0.704 14 WLS (default) 0.704
15 PEESE (default) 0.707 15 PEESE (default) 0.707
16 EK (default) 0.712 16 EK (default) 0.712
16 PET (default) 0.712 16 PET (default) 0.712
18 FMA (default) 0.909 18 FMA (default) 0.909
19 WILS (default) 0.937 19 WILS (default) 0.937
20 puniform (default) 1.000 20 puniform (default) 1.000
21 pcurve (default) NaN 21 pcurve (default) NaN

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 AK (AK1) 1.000 1 AK (AK1) 1.000
1 mean (default) 1.000 1 mean (default) 1.000
1 puniform (default) 1.000 1 puniform (default) 1.000
1 RMA (default) 1.000 1 RMA (default) 1.000
1 trimfill (default) 1.000 1 trimfill (default) 1.000
6 FMA (default) 1.000 6 FMA (default) 1.000
7 WLS (default) 1.000 7 WLS (default) 1.000
8 PEESE (default) 1.000 8 PEESE (default) 1.000
9 MAIVE (default) 0.999 9 MAIVE (default) 0.999
10 MAIVE (WAIVE) 0.998 10 MAIVE (WAIVE) 0.998
11 PETPEESE (default) 0.997 11 PETPEESE (default) 0.997
12 EK (default) 0.997 12 EK (default) 0.997
12 PET (default) 0.997 12 PET (default) 0.997
14 WAAPWLS (default) 0.981 14 WAAPWLS (default) 0.981
15 AK (AK2) 0.978 15 AK (AK2) 0.978
16 WILS (default) 0.978 16 WILS (default) 0.978
17 SM (3PSM) 0.962 17 SM (3PSM) 0.962
18 puniform (star) 0.958 18 puniform (star) 0.958
19 RoBMA (PSMA) 0.944 19 RoBMA (PSMA) 0.944
20 SM (4PSM) 0.935 20 SM (4PSM) 0.935
21 pcurve (default) NaN 21 pcurve (default) NaN

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

By-Condition Performance (Conditional on Method Convergence)

The results below are conditional on method convergence. Note that the methods might differ in convergence rate and are therefore not compared on the same data sets.

Raincloud plot showing convergence rates across different methods

Raincloud plot showing RMSE (Root Mean Square Error) across different methods

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method. Values larger than 0.5 are visualized as 0.5.

Raincloud plot showing bias across different methods

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0. Values lower than -0.5 or larger than 0.5 are visualized as -0.5 and 0.5 respectively.

Raincloud plot showing empirical SE across different methods

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance. Values larger than 0.5 are visualized as 0.5.

Raincloud plot showing interval score across different methods

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method. Values larger than 100 are visualized as 100.

Raincloud plot showing 95% confidence interval coverage across different methods

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Raincloud plot showing 95% confidence interval width across different methods

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method.

Raincloud plot showing positive likelihood ratio across different methods

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Raincloud plot showing negative likelihood ratio across different methods

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.

Raincloud plot showing Type I Error rates across different methods

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Raincloud plot showing statistical power across different methods

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

By-Condition Performance (Replacement in Case of Non-Convergence)

The results below incorporate method replacement to handle non-convergence. If a method fails to converge, its results are replaced with those of a simpler method (e.g., a random-effects meta-analysis without publication bias adjustment). This emulates what a data analyst may do in practice when a method does not converge. However, note that these results do not reflect “pure” method performance, as they may combine results from multiple methods. See Method Replacement Strategy for details of the method replacement specification.

Raincloud plot showing convergence rates across different methods

Raincloud plot showing RMSE (Root Mean Square Error) across different methods

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method. Values larger than 0.5 are visualized as 0.5.

Raincloud plot showing bias across different methods

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0. Values lower than -0.5 or larger than 0.5 are visualized as -0.5 and 0.5 respectively.

Raincloud plot showing empirical SE across different methods

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance. Values larger than 0.5 are visualized as 0.5.

Raincloud plot showing interval score across different methods

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method. Values larger than 100 are visualized as 100.

Raincloud plot showing 95% confidence interval coverage across different methods

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Raincloud plot showing 95% confidence interval width across different methods

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method.

Raincloud plot showing positive likelihood ratio across different methods

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Raincloud plot showing negative likelihood ratio across different methods

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.

Raincloud plot showing Type I Error rates across different methods

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Raincloud plot showing statistical power across different methods

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

Subset: Panel Random Effects

These results are based on Alinaghi (2018) data-generating mechanism with a total of 27 conditions.

Average Performance

Method performance measures are aggregated across all simulated conditions to provide an overall impression of method performance. However, keep in mind that a method with a high overall ranking is not necessarily the “best” method for a particular application. To select a suitable method for your application, consider also non-aggregated performance measures in conditions most relevant to your application.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 trimfill (default) 0.542 1 trimfill (default) 0.542
2 RoBMA (PSMA) 0.543 2 RoBMA (PSMA) 0.543
3 AK (AK1) 0.555 3 AK (AK1) 0.555
4 AK (AK2) 0.575 4 SM (4PSM) 0.587
5 SM (4PSM) 0.587 5 AK (AK2) 0.598
6 SM (3PSM) 0.608 6 SM (3PSM) 0.608
7 MAIVE (WAIVE) 0.610 7 MAIVE (WAIVE) 0.610
8 puniform (star) 0.615 8 puniform (star) 0.615
9 MAIVE (default) 0.622 9 MAIVE (default) 0.622
10 RMA (default) 0.667 10 RMA (default) 0.667
11 mean (default) 0.678 11 mean (default) 0.678
12 FMA (default) 0.824 12 FMA (default) 0.824
12 WLS (default) 0.824 12 WLS (default) 0.824
14 PEESE (default) 0.868 14 PEESE (default) 0.868
15 PETPEESE (default) 0.879 15 PETPEESE (default) 0.879
16 WAAPWLS (default) 0.897 16 WAAPWLS (default) 0.897
17 PET (default) 1.071 17 PET (default) 1.071
18 EK (default) 1.071 18 EK (default) 1.071
19 WILS (default) 1.222 19 WILS (default) 1.222
20 pcurve (default) 1.381 20 pcurve (default) 1.381
21 puniform (default) 1.400 21 puniform (default) 1.400

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method.
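The claim that RMSE "combines bias and empirical SE" can be made concrete: with the population-variance convention, RMSE² decomposes exactly into bias² plus the empirical variance. A minimal sketch with hypothetical estimates (the benchmark may use the sample-variance convention, in which case the identity holds only approximately):

```python
import math

# Hypothetical meta-analytic estimates from five simulation runs
estimates = [0.35, 0.42, 0.28, 0.51, 0.39]
true_effect = 0.30

n = len(estimates)
mean_est = sum(estimates) / n
bias = mean_est - true_effect
# Empirical variance with denominator n (population convention)
emp_var = sum((e - mean_est) ** 2 for e in estimates) / n
rmse = math.sqrt(sum((e - true_effect) ** 2 for e in estimates) / n)

# Decomposition: RMSE^2 = bias^2 + empirical variance
assert abs(rmse ** 2 - (bias ** 2 + emp_var)) < 1e-12
```

This is why a method can rank well on bias yet poorly on RMSE: high run-to-run variability inflates the empirical SE term even when the bias term is near zero.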

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 SM (4PSM) 0.190 1 SM (4PSM) 0.190
2 trimfill (default) 0.194 2 trimfill (default) 0.194
3 PET (default) 0.214 3 PET (default) 0.214
4 EK (default) 0.214 4 EK (default) 0.214
5 SM (3PSM) 0.256 5 SM (3PSM) 0.256
6 AK (AK2) 0.256 6 WILS (default) -0.263
7 WILS (default) -0.263 7 PETPEESE (default) 0.266
8 PETPEESE (default) 0.266 8 WAAPWLS (default) 0.270
9 WAAPWLS (default) 0.270 9 puniform (star) 0.272
10 puniform (star) 0.272 10 PEESE (default) 0.276
11 PEESE (default) 0.276 11 AK (AK2) 0.293
12 WLS (default) 0.301 12 WLS (default) 0.301
13 FMA (default) 0.301 13 FMA (default) 0.301
14 RoBMA (PSMA) 0.336 14 RoBMA (PSMA) 0.336
15 MAIVE (WAIVE) 0.338 15 MAIVE (WAIVE) 0.338
16 MAIVE (default) 0.365 16 MAIVE (default) 0.365
17 AK (AK1) 0.389 17 AK (AK1) 0.389
18 RMA (default) 0.528 18 RMA (default) 0.528
19 mean (default) 0.538 19 mean (default) 0.538
20 pcurve (default) -1.115 20 pcurve (default) -1.115
21 puniform (default) 1.362 21 puniform (default) 1.362

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 pcurve (default) 0.098 1 pcurve (default) 0.098
2 RMA (default) 0.299 2 RMA (default) 0.299
3 mean (default) 0.302 3 mean (default) 0.302
4 puniform (default) 0.309 4 puniform (default) 0.309
5 AK (AK1) 0.313 5 AK (AK1) 0.313
6 RoBMA (PSMA) 0.321 6 RoBMA (PSMA) 0.321
7 puniform (star) 0.378 7 puniform (star) 0.378
8 SM (3PSM) 0.380 8 SM (3PSM) 0.380
9 trimfill (default) 0.393 9 trimfill (default) 0.393
10 MAIVE (default) 0.408 10 MAIVE (default) 0.408
11 MAIVE (WAIVE) 0.417 11 MAIVE (WAIVE) 0.417
12 SM (4PSM) 0.454 12 SM (4PSM) 0.454
13 AK (AK2) 0.466 13 AK (AK2) 0.467
14 FMA (default) 0.699 14 FMA (default) 0.699
15 WLS (default) 0.699 15 WLS (default) 0.699
16 PEESE (default) 0.758 16 PEESE (default) 0.758
17 PETPEESE (default) 0.771 17 PETPEESE (default) 0.771
18 WAAPWLS (default) 0.795 18 WAAPWLS (default) 0.795
19 PET (default) 0.985 19 PET (default) 0.985
20 EK (default) 0.985 20 EK (default) 0.985
21 WILS (default) 1.079 21 WILS (default) 1.079

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 FMA (default) 5.413 1 FMA (default) 5.413
2 RoBMA (PSMA) 6.200 2 RoBMA (PSMA) 6.200
3 MAIVE (WAIVE) 6.273 3 MAIVE (WAIVE) 6.273
4 MAIVE (default) 6.690 4 MAIVE (default) 6.690
5 AK (AK2) 8.596 5 SM (4PSM) 9.418
6 SM (4PSM) 9.418 6 AK (AK2) 9.662
7 RMA (default) 10.943 7 RMA (default) 10.943
8 trimfill (default) 11.857 8 trimfill (default) 11.857
9 SM (3PSM) 12.542 9 SM (3PSM) 12.542
10 AK (AK1) 12.674 10 AK (AK1) 12.674
11 puniform (star) 13.940 11 puniform (star) 13.940
12 WAAPWLS (default) 20.395 12 WAAPWLS (default) 20.395
13 mean (default) 20.420 13 mean (default) 20.420
14 WLS (default) 21.589 14 WLS (default) 21.589
15 PEESE (default) 22.620 15 PEESE (default) 22.620
16 PETPEESE (default) 22.879 16 PETPEESE (default) 22.879
17 EK (default) 27.177 17 EK (default) 27.177
18 PET (default) 27.193 18 PET (default) 27.193
19 WILS (default) 34.217 19 WILS (default) 34.217
20 puniform (default) 47.302 20 puniform (default) 47.302
21 pcurve (default) NaN 21 pcurve (default) NaN

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 FMA (default) 0.848 1 FMA (default) 0.848
2 MAIVE (WAIVE) 0.724 2 MAIVE (WAIVE) 0.724
3 RoBMA (PSMA) 0.710 3 RoBMA (PSMA) 0.710
4 MAIVE (default) 0.697 4 MAIVE (default) 0.697
5 RMA (default) 0.531 5 RMA (default) 0.531
6 AK (AK2) 0.502 6 SM (4PSM) 0.474
7 SM (4PSM) 0.474 7 AK (AK2) 0.467
8 SM (3PSM) 0.392 8 SM (3PSM) 0.392
9 AK (AK1) 0.334 9 AK (AK1) 0.334
10 puniform (star) 0.326 10 puniform (star) 0.326
11 trimfill (default) 0.313 11 trimfill (default) 0.313
12 WAAPWLS (default) 0.226 12 WAAPWLS (default) 0.226
13 mean (default) 0.175 13 mean (default) 0.175
14 WLS (default) 0.145 14 WLS (default) 0.145
15 PETPEESE (default) 0.145 15 PETPEESE (default) 0.145
16 EK (default) 0.145 16 EK (default) 0.145
17 PET (default) 0.144 17 PET (default) 0.144
18 PEESE (default) 0.144 18 PEESE (default) 0.144
19 WILS (default) 0.081 19 WILS (default) 0.081
20 puniform (default) 0.006 20 puniform (default) 0.006
21 pcurve (default) NaN 21 pcurve (default) NaN

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 mean (default) 0.251 1 mean (default) 0.251
2 WILS (default) 0.283 2 WILS (default) 0.283
3 WLS (default) 0.299 3 WLS (default) 0.299
4 PEESE (default) 0.312 4 PEESE (default) 0.312
5 PETPEESE (default) 0.318 5 PETPEESE (default) 0.318
6 puniform (default) 0.383 6 puniform (default) 0.383
7 trimfill (default) 0.401 7 trimfill (default) 0.401
8 PET (default) 0.401 8 PET (default) 0.401
9 EK (default) 0.402 9 EK (default) 0.402
10 AK (AK1) 0.437 10 AK (AK1) 0.437
11 puniform (star) 0.487 11 puniform (star) 0.487
12 WAAPWLS (default) 0.527 12 WAAPWLS (default) 0.527
13 SM (3PSM) 0.579 13 SM (3PSM) 0.579
14 SM (4PSM) 0.733 14 SM (4PSM) 0.733
15 AK (AK2) 0.784 15 AK (AK2) 0.759
16 RMA (default) 1.060 16 RMA (default) 1.060
17 RoBMA (PSMA) 1.144 17 RoBMA (PSMA) 1.144
18 MAIVE (default) 1.433 18 MAIVE (default) 1.433
19 MAIVE (WAIVE) 1.456 19 MAIVE (WAIVE) 1.456
20 FMA (default) 2.855 20 FMA (default) 2.855
21 pcurve (default) NaN 21 pcurve (default) NaN

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Log Value Rank Method Log Value
1 RoBMA (PSMA) 3.448 1 RoBMA (PSMA) 3.448
2 RMA (default) 1.974 2 RMA (default) 1.974
3 MAIVE (default) 1.952 3 MAIVE (default) 1.952
4 MAIVE (WAIVE) 1.839 4 MAIVE (WAIVE) 1.839
5 FMA (default) 1.533 5 FMA (default) 1.533
6 AK (AK2) 0.691 6 AK (AK2) 0.686
7 SM (4PSM) 0.570 7 SM (4PSM) 0.570
8 AK (AK1) 0.506 8 AK (AK1) 0.506
9 SM (3PSM) 0.421 9 SM (3PSM) 0.421
10 trimfill (default) 0.341 10 trimfill (default) 0.341
11 mean (default) 0.265 11 mean (default) 0.265
12 puniform (star) 0.189 12 puniform (star) 0.189
13 WAAPWLS (default) 0.182 13 WAAPWLS (default) 0.182
14 PETPEESE (default) 0.144 14 PETPEESE (default) 0.144
15 WLS (default) 0.134 15 WLS (default) 0.134
16 PEESE (default) 0.132 16 PEESE (default) 0.132
17 EK (default) 0.116 17 EK (default) 0.116
17 PET (default) 0.116 17 PET (default) 0.116
19 WILS (default) 0.038 19 WILS (default) 0.038
20 puniform (default) 0.000 20 puniform (default) 0.000
21 pcurve (default) NaN 21 pcurve (default) NaN

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Log Value Rank Method Log Value
1 RMA (default) -5.174 1 RMA (default) -5.174
2 RoBMA (PSMA) -5.000 2 RoBMA (PSMA) -5.000
3 SM (4PSM) -4.819 3 SM (4PSM) -4.819
4 MAIVE (default) -4.778 4 MAIVE (default) -4.778
5 trimfill (default) -4.744 5 trimfill (default) -4.744
6 MAIVE (WAIVE) -4.687 6 MAIVE (WAIVE) -4.687
7 AK (AK2) -4.522 7 AK (AK2) -4.674
8 AK (AK1) -4.122 8 AK (AK1) -4.122
9 SM (3PSM) -3.982 9 SM (3PSM) -3.982
10 mean (default) -3.862 10 mean (default) -3.862
11 WLS (default) -3.651 11 WLS (default) -3.651
12 puniform (star) -3.640 12 puniform (star) -3.640
13 PEESE (default) -3.558 13 PEESE (default) -3.558
14 WAAPWLS (default) -3.230 14 WAAPWLS (default) -3.230
15 PETPEESE (default) -3.213 15 PETPEESE (default) -3.213
16 EK (default) -3.094 16 EK (default) -3.094
16 PET (default) -3.094 16 PET (default) -3.094
18 FMA (default) -2.055 18 FMA (default) -2.055
19 WILS (default) -1.760 19 WILS (default) -1.760
20 puniform (default) 0.000 20 puniform (default) 0.000
21 pcurve (default) NaN 21 pcurve (default) NaN

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 FMA (default) 0.263 1 FMA (default) 0.263
2 RoBMA (PSMA) 0.273 2 RoBMA (PSMA) 0.273
3 MAIVE (default) 0.294 3 MAIVE (default) 0.294
4 MAIVE (WAIVE) 0.297 4 MAIVE (WAIVE) 0.297
5 RMA (default) 0.361 5 RMA (default) 0.361
6 AK (AK2) 0.501 6 AK (AK2) 0.506
7 SM (4PSM) 0.559 7 SM (4PSM) 0.559
8 AK (AK1) 0.642 8 AK (AK1) 0.642
9 SM (3PSM) 0.671 9 SM (3PSM) 0.671
10 trimfill (default) 0.715 10 trimfill (default) 0.715
11 mean (default) 0.783 11 mean (default) 0.783
12 WAAPWLS (default) 0.791 12 WAAPWLS (default) 0.791
13 PETPEESE (default) 0.832 13 PETPEESE (default) 0.832
14 puniform (star) 0.844 14 puniform (star) 0.844
15 WLS (default) 0.860 15 WLS (default) 0.860
16 PEESE (default) 0.861 16 PEESE (default) 0.861
17 EK (default) 0.864 17 EK (default) 0.864
17 PET (default) 0.864 17 PET (default) 0.864
19 WILS (default) 0.931 19 WILS (default) 0.931
20 puniform (default) 1.000 20 puniform (default) 1.000
21 pcurve (default) NaN 21 pcurve (default) NaN

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 puniform (default) 1.000 1 puniform (default) 1.000
2 mean (default) 0.995 2 mean (default) 0.995
3 puniform (star) 0.992 3 puniform (star) 0.992
4 AK (AK1) 0.991 4 AK (AK1) 0.991
5 WLS (default) 0.980 5 WLS (default) 0.980
6 PEESE (default) 0.979 6 PEESE (default) 0.979
7 trimfill (default) 0.977 7 trimfill (default) 0.977
8 EK (default) 0.969 8 EK (default) 0.969
8 PET (default) 0.969 8 PET (default) 0.969
10 WILS (default) 0.968 10 WILS (default) 0.968
11 PETPEESE (default) 0.959 11 PETPEESE (default) 0.959
12 RMA (default) 0.955 12 RMA (default) 0.955
13 SM (3PSM) 0.950 13 SM (3PSM) 0.950
14 WAAPWLS (default) 0.948 14 WAAPWLS (default) 0.948
15 AK (AK2) 0.943 15 AK (AK2) 0.943
16 SM (4PSM) 0.935 16 SM (4PSM) 0.935
17 RoBMA (PSMA) 0.890 17 RoBMA (PSMA) 0.890
18 MAIVE (default) 0.881 18 MAIVE (default) 0.881
19 MAIVE (WAIVE) 0.880 19 MAIVE (WAIVE) 0.880
20 FMA (default) 0.785 20 FMA (default) 0.785
21 pcurve (default) NaN 21 pcurve (default) NaN

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

By-Condition Performance (Conditional on Method Convergence)

The results below are conditional on method convergence. Note that the methods might differ in convergence rate and are therefore not compared on the same data sets.

Raincloud plot showing convergence rates across different methods

Raincloud plot showing RMSE (Root Mean Square Error) across different methods

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method. Values larger than 0.5 are visualized as 0.5.

Raincloud plot showing bias across different methods

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0. Values lower than -0.5 or larger than 0.5 are visualized as -0.5 and 0.5 respectively.

Raincloud plot showing empirical SE across different methods

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance. Values larger than 0.5 are visualized as 0.5.

Raincloud plot showing interval score across different methods

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method. Values larger than 100 are visualized as 100.

Raincloud plot showing 95% confidence interval coverage across different methods

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Raincloud plot showing 95% confidence interval width across different methods

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method.

Raincloud plot showing positive likelihood ratio across different methods

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Raincloud plot showing negative likelihood ratio across different methods

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.

Raincloud plot showing Type I Error rates across different methods

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Raincloud plot showing statistical power across different methods

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

By-Condition Performance (Replacement in Case of Non-Convergence)

The results below incorporate method replacement to handle non-convergence. If a method fails to converge, its results are replaced with those of a simpler method (e.g., random-effects meta-analysis without publication bias adjustment). This emulates what a data analyst might do in practice when a method does not converge. However, note that these results do not correspond to “pure” method performance, as they may combine multiple different methods. See Method Replacement Strategy for details of the method replacement specification.
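The replacement logic described above amounts to a simple fallback chain. The sketch below is purely illustrative: `fit_method` and `fallback` are hypothetical placeholders, not functions from the benchmark package, and the actual replacement specification is documented in Method Replacement Strategy.

```python
def fit_with_replacement(dataset, fit_method, fallback):
    """Try the primary method; on non-convergence, use a simpler fallback.

    Non-convergence is represented here as either a raised RuntimeError or
    a None result -- a stand-in for whatever failure signal the real
    fitting routines emit.
    """
    try:
        result = fit_method(dataset)
        if result is not None:
            return result, "primary"
    except RuntimeError:
        pass  # primary method failed to converge
    return fallback(dataset), "fallback"
```

Because the fallback results are attributed to the primary method in these tables, a method's apparent performance here partly reflects the fallback's behavior whenever its convergence rate is low.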

Raincloud plot showing convergence rates across different methods

Raincloud plot showing RMSE (Root Mean Square Error) across different methods

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method. Values larger than 0.5 are visualized as 0.5.

Raincloud plot showing bias across different methods

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0. Values lower than -0.5 or larger than 0.5 are visualized as -0.5 and 0.5 respectively.

Raincloud plot showing empirical SE across different methods

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance. Values larger than 0.5 are visualized as 0.5.

Raincloud plot showing interval score across different methods

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method. Values larger than 100 are visualized as 100.

Raincloud plot showing 95% confidence interval coverage across different methods

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Raincloud plot showing 95% confidence interval width across different methods

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method.

Raincloud plot showing positive likelihood ratio across different methods

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Raincloud plot showing negative likelihood ratio across different methods

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.

Raincloud plot showing Type I Error rates across different methods

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Raincloud plot showing statistical power across different methods

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

Session Info

This report was compiled on Mon Mar 16 19:04:18 2026 (UTC) using the following computational environment:

## R version 4.5.3 (2026-03-11)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## time zone: UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] scales_1.4.0                   ggdist_3.3.3                  
## [3] ggplot2_4.0.2                  PublicationBiasBenchmark_0.2.0
## 
## loaded via a namespace (and not attached):
##  [1] generics_0.1.4       sandwich_3.1-1       sass_0.4.10         
##  [4] xml2_1.5.2           stringi_1.8.7        lattice_0.22-9      
##  [7] httpcode_0.3.0       digest_0.6.39        magrittr_2.0.4      
## [10] evaluate_1.0.5       grid_4.5.3           RColorBrewer_1.1-3  
## [13] fastmap_1.2.0        jsonlite_2.0.0       crul_1.6.0          
## [16] urltools_1.7.3.1     httr_1.4.8           purrr_1.2.1         
## [19] viridisLite_0.4.3    textshaping_1.0.5    jquerylib_0.1.4     
## [22] Rdpack_2.6.6         cli_3.6.5            rlang_1.1.7         
## [25] triebeard_0.4.1      rbibutils_2.4.1      withr_3.0.2         
## [28] cachem_1.1.0         yaml_2.3.12          otel_0.2.0          
## [31] tools_4.5.3          memoise_2.0.1        kableExtra_1.4.0    
## [34] curl_7.0.0           vctrs_0.7.1          R6_2.6.1            
## [37] clubSandwich_0.6.2   zoo_1.8-15           lifecycle_1.0.5     
## [40] stringr_1.6.0        fs_1.6.7             htmlwidgets_1.6.4   
## [43] ragg_1.5.1           pkgconfig_2.0.3      desc_1.4.3          
## [46] osfr_0.2.9           pkgdown_2.2.0        bslib_0.10.0        
## [49] pillar_1.11.1        gtable_0.3.6         Rcpp_1.1.1          
## [52] glue_1.8.0           systemfonts_1.3.2    xfun_0.56           
## [55] tibble_3.3.1         rstudioapi_0.18.0    knitr_1.51          
## [58] farver_2.1.2         htmltools_0.5.9      labeling_0.4.3      
## [61] svglite_2.2.2        rmarkdown_2.30       compiler_4.5.3      
## [64] S7_0.2.1             distributional_0.6.0