
Complete Results

These results are based on the Carter (2019) data-generating mechanism with a total of 756 conditions.

Average Performance

Method performance measures are aggregated across all simulated conditions to provide an overall impression of each method's performance. Keep in mind, however, that a method with a high overall ranking is not necessarily the “best” method for a particular application. To select a suitable method for your application, also consider the non-aggregated performance measures in the conditions most relevant to it.

Conditional on Convergence (left) / Replacement if Non-Convergence (right)

| Rank | Method | Value | Rank | Method | Value |
|---|---|---|---|---|---|
| 1 | WAAPWLS (default) | 0.105 | 1 | WAAPWLS (default) | 0.105 |
| 2 | PEESE (default) | 0.107 | 2 | PEESE (default) | 0.107 |
| 3 | trimfill (default) | 0.113 | 3 | trimfill (default) | 0.113 |
| 4 | PETPEESE (default) | 0.119 | 4 | PETPEESE (default) | 0.119 |
| 5 | FMA (default) | 0.124 | 5 | FMA (default) | 0.124 |
| 5 | WLS (default) | 0.124 | 5 | WLS (default) | 0.124 |
| 7 | WILS (default) | 0.128 | 7 | WILS (default) | 0.128 |
| 8 | RoBMA (PSMA) | 0.135 | 8 | RoBMA (PSMA) | 0.137 |
| 9 | EK (default) | 0.149 | 9 | EK (default) | 0.149 |
| 10 | PET (default) | 0.149 | 10 | PET (default) | 0.149 |
| 11 | AK (AK1) | 0.163 | 11 | AK (AK1) | 0.161 |
| 12 | RMA (default) | 0.164 | 12 | RMA (default) | 0.164 |
| 13 | pcurve (default) | 0.169 | 13 | pcurve (default) | 0.165 |
| 14 | MAIVE (default) | 0.170 | 14 | MAIVE (default) | 0.170 |
| 15 | AK (AK2) | 0.189 | 15 | AK (AK2) | 0.221 |
| 16 | mean (default) | 0.249 | 16 | mean (default) | 0.249 |
| 17 | SM (3PSM) | 0.300 | 17 | SM (3PSM) | 0.288 |
| 18 | MAIVE (WAIVE) | 0.378 | 18 | MAIVE (WAIVE) | 0.378 |
| 19 | puniform (default) | 0.453 | 19 | puniform (default) | 0.421 |
| 20 | SM (4PSM) | 0.459 | 20 | SM (4PSM) | 0.479 |
| 21 | puniform (star) | 169.671 | 21 | puniform (star) | 169.671 |

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method.
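As a concrete illustration of the definition above, RMSE can be computed directly from the simulated estimates. The numbers below are invented for illustration and do not come from the study:

```python
import numpy as np

# Hypothetical meta-analytic estimates from 8 simulation runs (illustrative only)
estimates = np.array([0.12, 0.08, 0.15, 0.05, 0.11, 0.09, 0.14, 0.10])
true_effect = 0.10  # assumed true effect for this condition

# RMSE: square root of the average squared difference from the truth
rmse = np.sqrt(np.mean((estimates - true_effect) ** 2))
```

Because the errors are squared before averaging, a method can have low bias yet a high RMSE if its estimates are highly variable.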

Conditional on Convergence (left) / Replacement if Non-Convergence (right)

| Rank | Method | Value | Rank | Method | Value |
|---|---|---|---|---|---|
| 1 | PETPEESE (default) | 0.002 | 1 | PETPEESE (default) | 0.002 |
| 2 | AK (AK1) | 0.022 | 2 | AK (AK1) | 0.022 |
| 3 | PEESE (default) | 0.025 | 3 | PEESE (default) | 0.025 |
| 4 | puniform (default) | 0.028 | 4 | puniform (default) | 0.033 |
| 5 | pcurve (default) | 0.052 | 5 | AK (AK2) | -0.035 |
| 6 | WILS (default) | -0.052 | 6 | pcurve (default) | 0.048 |
| 7 | EK (default) | -0.052 | 7 | WILS (default) | -0.052 |
| 8 | PET (default) | -0.052 | 8 | EK (default) | -0.052 |
| 9 | WAAPWLS (default) | 0.054 | 9 | PET (default) | -0.052 |
| 10 | MAIVE (default) | -0.055 | 10 | WAAPWLS (default) | 0.054 |
| 11 | trimfill (default) | 0.067 | 11 | MAIVE (default) | -0.055 |
| 12 | AK (AK2) | -0.075 | 12 | trimfill (default) | 0.067 |
| 13 | FMA (default) | 0.092 | 13 | RoBMA (PSMA) | -0.088 |
| 13 | WLS (default) | 0.092 | 14 | FMA (default) | 0.092 |
| 15 | RoBMA (PSMA) | -0.094 | 14 | WLS (default) | 0.092 |
| 16 | SM (3PSM) | -0.108 | 16 | SM (3PSM) | -0.100 |
| 17 | RMA (default) | 0.150 | 17 | RMA (default) | 0.150 |
| 18 | SM (4PSM) | -0.200 | 18 | SM (4PSM) | -0.203 |
| 19 | mean (default) | 0.235 | 19 | mean (default) | 0.235 |
| 20 | MAIVE (WAIVE) | -0.260 | 20 | MAIVE (WAIVE) | -0.260 |
| 21 | puniform (star) | -31.860 | 21 | puniform (star) | -31.860 |

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0.
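For instance (again with made-up numbers), bias is simply the mean estimation error, so its sign shows the direction of the distortion:

```python
import numpy as np

# Hypothetical estimates from 5 simulation runs; assumed true effect 0.10
estimates = np.array([0.14, 0.12, 0.11, 0.13, 0.15])
true_effect = 0.10

# Positive bias: the method overestimates the true effect on average;
# a negative value would indicate systematic underestimation.
bias = estimates.mean() - true_effect
```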

Conditional on Convergence (left) / Replacement if Non-Convergence (right)

| Rank | Method | Value | Rank | Method | Value |
|---|---|---|---|---|---|
| 1 | RMA (default) | 0.041 | 1 | RMA (default) | 0.041 |
| 2 | trimfill (default) | 0.045 | 2 | trimfill (default) | 0.045 |
| 3 | mean (default) | 0.052 | 3 | mean (default) | 0.052 |
| 4 | FMA (default) | 0.058 | 4 | FMA (default) | 0.058 |
| 4 | WLS (default) | 0.058 | 4 | WLS (default) | 0.058 |
| 6 | WAAPWLS (default) | 0.072 | 6 | WAAPWLS (default) | 0.072 |
| 7 | RoBMA (PSMA) | 0.074 | 7 | PEESE (default) | 0.074 |
| 8 | PEESE (default) | 0.074 | 8 | RoBMA (PSMA) | 0.086 |
| 9 | WILS (default) | 0.087 | 9 | WILS (default) | 0.087 |
| 10 | PETPEESE (default) | 0.096 | 10 | pcurve (default) | 0.094 |
| 11 | pcurve (default) | 0.097 | 11 | PETPEESE (default) | 0.096 |
| 12 | PET (default) | 0.119 | 12 | AK (AK1) | 0.117 |
| 13 | EK (default) | 0.119 | 13 | PET (default) | 0.119 |
| 14 | AK (AK1) | 0.119 | 14 | EK (default) | 0.119 |
| 15 | MAIVE (default) | 0.120 | 15 | MAIVE (default) | 0.120 |
| 16 | AK (AK2) | 0.130 | 16 | AK (AK2) | 0.184 |
| 17 | MAIVE (WAIVE) | 0.223 | 17 | MAIVE (WAIVE) | 0.223 |
| 18 | SM (3PSM) | 0.246 | 18 | SM (3PSM) | 0.235 |
| 19 | SM (4PSM) | 0.366 | 19 | puniform (default) | 0.348 |
| 20 | puniform (default) | 0.381 | 20 | SM (4PSM) | 0.386 |
| 21 | puniform (star) | 165.831 | 21 | puniform (star) | 165.831 |

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance.
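Empirical SE, bias, and RMSE are tied together by a simple decomposition, RMSE² = bias² + empirical SE² (exact when the SD uses the population formula; with the usual n−1 sample formula the identity holds only approximately). A sketch with invented numbers:

```python
import numpy as np

# Invented estimates for one condition; assumed true effect 0.10
estimates = np.array([0.07, 0.11, 0.15, 0.19, 0.23])
true_effect = 0.10

bias = estimates.mean() - true_effect
emp_se = estimates.std(ddof=0)  # population SD makes the identity below exact
rmse = np.sqrt(np.mean((estimates - true_effect) ** 2))

# RMSE combines both components: rmse**2 == bias**2 + emp_se**2
```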

Conditional on Convergence (left) / Replacement if Non-Convergence (right)

| Rank | Method | Value | Rank | Method | Value |
|---|---|---|---|---|---|
| 1 | puniform (star) | 1.402 | 1 | puniform (star) | 1.402 |
| 2 | WAAPWLS (default) | 1.473 | 2 | WAAPWLS (default) | 1.473 |
| 3 | PETPEESE (default) | 1.679 | 3 | PETPEESE (default) | 1.679 |
| 4 | RoBMA (PSMA) | 1.721 | 4 | PEESE (default) | 1.725 |
| 5 | PEESE (default) | 1.725 | 5 | RoBMA (PSMA) | 1.761 |
| 6 | EK (default) | 1.827 | 6 | SM (3PSM) | 1.824 |
| 7 | SM (3PSM) | 1.841 | 7 | EK (default) | 1.827 |
| 8 | PET (default) | 1.901 | 8 | PET (default) | 1.901 |
| 9 | MAIVE (default) | 2.210 | 9 | MAIVE (default) | 2.210 |
| 10 | WILS (default) | 2.220 | 10 | WILS (default) | 2.220 |
| 11 | trimfill (default) | 2.241 | 11 | trimfill (default) | 2.244 |
| 12 | puniform (default) | 2.587 | 12 | puniform (default) | 2.533 |
| 13 | WLS (default) | 2.680 | 13 | WLS (default) | 2.680 |
| 14 | SM (4PSM) | 2.879 | 14 | SM (4PSM) | 2.820 |
| 15 | AK (AK1) | 2.896 | 15 | AK (AK1) | 2.822 |
| 16 | FMA (default) | 3.172 | 16 | FMA (default) | 3.172 |
| 17 | AK (AK2) | 3.599 | 17 | AK (AK2) | 3.480 |
| 18 | RMA (default) | 4.078 | 18 | RMA (default) | 4.078 |
| 19 | MAIVE (WAIVE) | 4.682 | 19 | MAIVE (WAIVE) | 4.682 |
| 20 | mean (default) | 7.200 | 20 | mean (default) | 7.200 |
| 21 | pcurve (default) | NaN | 21 | pcurve (default) | NaN |

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method.
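The report does not print the scoring formula, but the standard interval score for a central (1−α) interval (Gneiting & Raftery, 2007) matches this description: the interval width plus a 2/α penalty proportional to the distance by which the interval misses the true value. A hypothetical implementation, treated here as an illustration rather than the study's exact computation:

```python
def interval_score(lower, upper, truth, alpha=0.05):
    """Interval width plus a 2/alpha penalty when the interval misses the truth."""
    score = upper - lower
    if truth < lower:
        score += (2 / alpha) * (lower - truth)
    elif truth > upper:
        score += (2 / alpha) * (truth - upper)
    return score

# A narrow interval that misses the truth scores far worse than a wider one that covers it
covering = interval_score(-0.10, 0.30, truth=0.10)  # width only
missing = interval_score(0.00, 0.20, truth=0.30)    # width plus miss penalty
```

With α = 0.05 the penalty factor is 40, so even small misses dominate the width term.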

Conditional on Convergence (left) / Replacement if Non-Convergence (right)

| Rank | Method | Value | Rank | Method | Value |
|---|---|---|---|---|---|
| 1 | AK (AK2) | 0.680 | 1 | AK (AK2) | 0.674 |
| 2 | RoBMA (PSMA) | 0.647 | 2 | RoBMA (PSMA) | 0.645 |
| 3 | puniform (star) | 0.643 | 3 | puniform (star) | 0.643 |
| 4 | SM (3PSM) | 0.635 | 4 | SM (3PSM) | 0.630 |
| 5 | SM (4PSM) | 0.634 | 5 | SM (4PSM) | 0.629 |
| 6 | WAAPWLS (default) | 0.604 | 6 | WAAPWLS (default) | 0.604 |
| 7 | AK (AK1) | 0.590 | 7 | AK (AK1) | 0.589 |
| 8 | MAIVE (default) | 0.563 | 8 | MAIVE (default) | 0.563 |
| 9 | puniform (default) | 0.549 | 9 | puniform (default) | 0.549 |
| 10 | trimfill (default) | 0.512 | 10 | PETPEESE (default) | 0.512 |
| 11 | PETPEESE (default) | 0.512 | 11 | trimfill (default) | 0.512 |
| 12 | EK (default) | 0.477 | 12 | EK (default) | 0.477 |
| 13 | MAIVE (WAIVE) | 0.468 | 13 | MAIVE (WAIVE) | 0.468 |
| 14 | PEESE (default) | 0.464 | 14 | PEESE (default) | 0.464 |
| 15 | PET (default) | 0.463 | 15 | PET (default) | 0.463 |
| 16 | WILS (default) | 0.415 | 16 | WILS (default) | 0.415 |
| 17 | WLS (default) | 0.397 | 17 | WLS (default) | 0.397 |
| 18 | RMA (default) | 0.359 | 18 | RMA (default) | 0.359 |
| 19 | FMA (default) | 0.307 | 19 | FMA (default) | 0.307 |
| 20 | mean (default) | 0.179 | 20 | mean (default) | 0.179 |
| 21 | pcurve (default) | NaN | 21 | pcurve (default) | NaN |

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Conditional on Convergence (left) / Replacement if Non-Convergence (right)

| Rank | Method | Value | Rank | Method | Value |
|---|---|---|---|---|---|
| 1 | FMA (default) | 0.091 | 1 | FMA (default) | 0.091 |
| 2 | WLS (default) | 0.133 | 2 | WLS (default) | 0.133 |
| 3 | mean (default) | 0.148 | 3 | mean (default) | 0.148 |
| 4 | PEESE (default) | 0.153 | 4 | PEESE (default) | 0.153 |
| 5 | trimfill (default) | 0.160 | 5 | trimfill (default) | 0.160 |
| 6 | WILS (default) | 0.161 | 6 | WILS (default) | 0.161 |
| 7 | RMA (default) | 0.163 | 7 | RMA (default) | 0.163 |
| 8 | WAAPWLS (default) | 0.189 | 8 | WAAPWLS (default) | 0.189 |
| 9 | PETPEESE (default) | 0.190 | 9 | PETPEESE (default) | 0.190 |
| 10 | PET (default) | 0.244 | 10 | PET (default) | 0.244 |
| 11 | EK (default) | 0.264 | 11 | EK (default) | 0.264 |
| 12 | RoBMA (PSMA) | 0.279 | 12 | RoBMA (PSMA) | 0.277 |
| 13 | puniform (star) | 0.341 | 13 | puniform (star) | 0.341 |
| 14 | MAIVE (default) | 0.381 | 14 | MAIVE (default) | 0.381 |
| 15 | puniform (default) | 0.540 | 15 | puniform (default) | 0.494 |
| 16 | SM (3PSM) | 0.615 | 16 | SM (3PSM) | 0.557 |
| 17 | MAIVE (WAIVE) | 0.638 | 17 | MAIVE (WAIVE) | 0.638 |
| 18 | SM (4PSM) | 1.079 | 18 | SM (4PSM) | 0.985 |
| 19 | AK (AK1) | 1.787 | 19 | AK (AK1) | 1.711 |
| 20 | AK (AK2) | 2.464 | 20 | AK (AK2) | 2.604 |
| 21 | pcurve (default) | NaN | 21 | pcurve (default) | NaN |

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method.
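Both CI measures are simple summaries over the simulated intervals. A sketch with invented intervals:

```python
import numpy as np

# Hypothetical 95% CIs (lower, upper) from four simulation runs; true effect 0.10
cis = np.array([(0.01, 0.21), (-0.05, 0.12), (0.12, 0.30), (0.02, 0.18)])
true_effect = 0.10

# Coverage: proportion of intervals that contain the true effect
covered = (cis[:, 0] <= true_effect) & (true_effect <= cis[:, 1])
coverage = covered.mean()

# Average CI width: mean length of the intervals
avg_width = (cis[:, 1] - cis[:, 0]).mean()
```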

Conditional on Convergence (left) / Replacement if Non-Convergence (right)

| Rank | Method | Log Value | Rank | Method | Log Value |
|---|---|---|---|---|---|
| 1 | RoBMA (PSMA) | 2.323 | 1 | RoBMA (PSMA) | 2.252 |
| 2 | puniform (default) | 1.987 | 2 | puniform (default) | 1.979 |
| 3 | AK (AK2) | 1.351 | 3 | AK (AK2) | 1.165 |
| 4 | PETPEESE (default) | 1.091 | 4 | PETPEESE (default) | 1.091 |
| 5 | AK (AK1) | 1.030 | 5 | AK (AK1) | 1.027 |
| 6 | MAIVE (default) | 0.973 | 6 | MAIVE (default) | 0.973 |
| 7 | PET (default) | 0.961 | 7 | PET (default) | 0.961 |
| 8 | EK (default) | 0.960 | 8 | EK (default) | 0.960 |
| 9 | SM (3PSM) | 0.909 | 9 | SM (3PSM) | 0.895 |
| 10 | WAAPWLS (default) | 0.821 | 10 | WAAPWLS (default) | 0.821 |
| 11 | puniform (star) | 0.798 | 11 | puniform (star) | 0.798 |
| 12 | trimfill (default) | 0.793 | 12 | trimfill (default) | 0.793 |
| 13 | RMA (default) | 0.764 | 13 | RMA (default) | 0.764 |
| 14 | PEESE (default) | 0.627 | 14 | PEESE (default) | 0.627 |
| 15 | WLS (default) | 0.621 | 15 | WLS (default) | 0.621 |
| 16 | WILS (default) | 0.573 | 16 | WILS (default) | 0.573 |
| 17 | FMA (default) | 0.385 | 17 | FMA (default) | 0.385 |
| 18 | SM (4PSM) | 0.368 | 18 | SM (4PSM) | 0.381 |
| 19 | mean (default) | 0.351 | 19 | mean (default) | 0.351 |
| 20 | MAIVE (WAIVE) | 0.073 | 20 | MAIVE (WAIVE) | 0.073 |
| 21 | pcurve (default) | NaN | 21 | pcurve (default) | NaN |

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Conditional on Convergence (left) / Replacement if Non-Convergence (right)

| Rank | Method | Log Value | Rank | Method | Log Value |
|---|---|---|---|---|---|
| 1 | PETPEESE (default) | -4.367 | 1 | PETPEESE (default) | -4.367 |
| 2 | PET (default) | -4.242 | 2 | PET (default) | -4.242 |
| 3 | EK (default) | -4.242 | 3 | EK (default) | -4.242 |
| 4 | WAAPWLS (default) | -4.199 | 4 | WAAPWLS (default) | -4.199 |
| 5 | PEESE (default) | -3.861 | 5 | PEESE (default) | -3.861 |
| 6 | puniform (default) | -3.695 | 6 | AK (AK2) | -3.721 |
| 7 | WLS (default) | -3.403 | 7 | puniform (default) | -3.695 |
| 8 | trimfill (default) | -3.240 | 8 | WLS (default) | -3.403 |
| 9 | AK (AK1) | -3.054 | 9 | trimfill (default) | -3.241 |
| 10 | FMA (default) | -3.028 | 10 | AK (AK1) | -3.057 |
| 11 | RMA (default) | -2.866 | 11 | FMA (default) | -3.028 |
| 12 | SM (3PSM) | -2.688 | 12 | RMA (default) | -2.866 |
| 13 | puniform (star) | -2.427 | 13 | SM (3PSM) | -2.705 |
| 14 | MAIVE (default) | -2.347 | 14 | puniform (star) | -2.427 |
| 15 | WILS (default) | -2.292 | 15 | MAIVE (default) | -2.347 |
| 16 | RoBMA (PSMA) | -2.270 | 16 | WILS (default) | -2.292 |
| 17 | mean (default) | -2.164 | 17 | RoBMA (PSMA) | -2.266 |
| 18 | AK (AK2) | -1.915 | 18 | mean (default) | -2.164 |
| 19 | SM (4PSM) | -1.222 | 19 | SM (4PSM) | -1.314 |
| 20 | MAIVE (WAIVE) | 0.007 | 20 | MAIVE (WAIVE) | 0.007 |
| 21 | pcurve (default) | NaN | 21 | pcurve (default) | NaN |

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.
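Both likelihood ratios can be read off the rejection rates directly; the rates below are hypothetical and not taken from the tables above:

```python
import math

# Hypothetical rejection rates for one method
power = 0.80        # P(significant result | effect exists)
type1_error = 0.05  # P(significant result | no effect)

# Positive LR: how much a significant result shifts the odds toward H1
lr_positive = power / type1_error
# Negative LR: how much a non-significant result shifts the odds toward H1
lr_negative = (1 - power) / (1 - type1_error)

log_lr_positive = math.log(lr_positive)  # > 0: significant results are informative
log_lr_negative = math.log(lr_negative)  # < 0: non-significant results are informative
```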

Conditional on Convergence (left) / Replacement if Non-Convergence (right)

| Rank | Method | Value | Rank | Method | Value |
|---|---|---|---|---|---|
| 1 | AK (AK2) | 0.276 | 1 | RoBMA (PSMA) | 0.283 |
| 2 | RoBMA (PSMA) | 0.278 | 2 | MAIVE (WAIVE) | 0.288 |
| 3 | MAIVE (WAIVE) | 0.288 | 3 | MAIVE (default) | 0.344 |
| 4 | MAIVE (default) | 0.344 | 4 | PETPEESE (default) | 0.403 |
| 5 | PETPEESE (default) | 0.403 | 5 | PET (default) | 0.421 |
| 6 | PET (default) | 0.421 | 6 | EK (default) | 0.421 |
| 7 | EK (default) | 0.421 | 7 | AK (AK2) | 0.439 |
| 8 | puniform (star) | 0.443 | 8 | puniform (star) | 0.443 |
| 9 | puniform (default) | 0.457 | 9 | puniform (default) | 0.457 |
| 10 | SM (3PSM) | 0.468 | 10 | SM (3PSM) | 0.473 |
| 11 | WAAPWLS (default) | 0.506 | 11 | WAAPWLS (default) | 0.506 |
| 12 | SM (4PSM) | 0.543 | 12 | SM (4PSM) | 0.545 |
| 13 | WILS (default) | 0.557 | 13 | WILS (default) | 0.557 |
| 14 | PEESE (default) | 0.638 | 14 | PEESE (default) | 0.638 |
| 15 | AK (AK1) | 0.642 | 15 | AK (AK1) | 0.642 |
| 16 | trimfill (default) | 0.660 | 16 | trimfill (default) | 0.660 |
| 17 | WLS (default) | 0.690 | 17 | WLS (default) | 0.690 |
| 18 | RMA (default) | 0.702 | 18 | RMA (default) | 0.702 |
| 19 | FMA (default) | 0.796 | 19 | FMA (default) | 0.796 |
| 20 | mean (default) | 0.827 | 20 | mean (default) | 0.827 |
| 21 | pcurve (default) | NaN | 21 | pcurve (default) | NaN |

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Conditional on Convergence (left) / Replacement if Non-Convergence (right)

| Rank | Method | Value | Rank | Method | Value |
|---|---|---|---|---|---|
| 1 | mean (default) | 0.995 | 1 | mean (default) | 0.995 |
| 2 | FMA (default) | 0.994 | 2 | FMA (default) | 0.994 |
| 3 | RMA (default) | 0.989 | 3 | RMA (default) | 0.989 |
| 4 | WLS (default) | 0.985 | 4 | WLS (default) | 0.985 |
| 5 | trimfill (default) | 0.980 | 5 | trimfill (default) | 0.980 |
| 6 | AK (AK1) | 0.979 | 6 | AK (AK1) | 0.979 |
| 7 | PEESE (default) | 0.963 | 7 | PEESE (default) | 0.963 |
| 8 | WAAPWLS (default) | 0.925 | 8 | WAAPWLS (default) | 0.925 |
| 9 | puniform (default) | 0.913 | 9 | puniform (default) | 0.913 |
| 10 | PETPEESE (default) | 0.899 | 10 | PETPEESE (default) | 0.899 |
| 11 | EK (default) | 0.885 | 11 | EK (default) | 0.885 |
| 12 | PET (default) | 0.885 | 12 | PET (default) | 0.885 |
| 13 | WILS (default) | 0.846 | 13 | AK (AK2) | 0.858 |
| 14 | SM (3PSM) | 0.806 | 14 | WILS (default) | 0.846 |
| 15 | AK (AK2) | 0.748 | 15 | SM (3PSM) | 0.812 |
| 16 | puniform (star) | 0.746 | 16 | puniform (star) | 0.746 |
| 17 | MAIVE (default) | 0.706 | 17 | SM (4PSM) | 0.709 |
| 18 | SM (4PSM) | 0.697 | 18 | MAIVE (default) | 0.706 |
| 19 | RoBMA (PSMA) | 0.642 | 19 | RoBMA (PSMA) | 0.646 |
| 20 | MAIVE (WAIVE) | 0.281 | 20 | MAIVE (WAIVE) | 0.281 |
| 21 | pcurve (default) | NaN | 21 | pcurve (default) | NaN |

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

By-Condition Performance (Conditional on Method Convergence)

The results below are conditional on method convergence. Note that the methods might differ in convergence rate and are therefore not compared on the same data sets.

Raincloud plot showing convergence rates across different methods

Raincloud plot showing RMSE (Root Mean Square Error) across different methods

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method. Values larger than 0.5 are visualized as 0.5.

Raincloud plot showing bias across different methods

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0. Values lower than -0.5 or larger than 0.5 are visualized as -0.5 and 0.5 respectively.

Raincloud plot showing empirical SE across different methods

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance. Values larger than 0.5 are visualized as 0.5.

Raincloud plot showing interval scores across different methods

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method. Values larger than 100 are visualized as 100.

Raincloud plot showing 95% confidence interval coverage across different methods

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Raincloud plot showing 95% confidence interval width across different methods

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method.

Raincloud plot showing positive likelihood ratio across different methods

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Raincloud plot showing negative likelihood ratio across different methods

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.

Raincloud plot showing Type I Error rates across different methods

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Raincloud plot showing statistical power across different methods

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

By-Condition Performance (Replacement in Case of Non-Convergence)

The results below incorporate method replacement to handle non-convergence. If a method fails to converge, its results are replaced with those of a simpler method (e.g., random-effects meta-analysis without publication-bias adjustment). This emulates what a data analyst might do in practice when a method does not converge. Note, however, that these results do not reflect “pure” method performance, as they may combine results from multiple methods. See Method Replacement Strategy for details of the replacement specification.
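The replacement rule itself is simple to express. The sketch below is a hypothetical illustration only: the field names and the RMA fallback are assumptions, not the study's actual code.

```python
def result_with_replacement(primary, fallback):
    """Return the primary method's result if it converged, else the fallback's."""
    return primary if primary["converged"] else fallback

# Hypothetical fits: the adjusted method failed, the simpler RMA fit succeeded
robma_fit = {"method": "RoBMA (PSMA)", "estimate": None, "converged": False}
rma_fit = {"method": "RMA (default)", "estimate": 0.14, "converged": True}

chosen = result_with_replacement(robma_fit, rma_fit)  # falls back to the RMA fit
```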

Raincloud plot showing convergence rates across different methods

Raincloud plot showing RMSE (Root Mean Square Error) across different methods

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method. Values larger than 0.5 are visualized as 0.5.

Raincloud plot showing bias across different methods

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0. Values lower than -0.5 or larger than 0.5 are visualized as -0.5 and 0.5 respectively.

Raincloud plot showing empirical SE across different methods

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance. Values larger than 0.5 are visualized as 0.5.

Raincloud plot showing interval scores across different methods

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method. Values larger than 100 are visualized as 100.

Raincloud plot showing 95% confidence interval coverage across different methods

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Raincloud plot showing 95% confidence interval width across different methods

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method.

Raincloud plot showing positive likelihood ratio across different methods

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Raincloud plot showing negative likelihood ratio across different methods

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.

Raincloud plot showing Type I Error rates across different methods

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Raincloud plot showing statistical power across different methods

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

Subset: No Questionable Research Practices

These results are based on the Carter (2019) data-generating mechanism with a total of 252 conditions.

Average Performance

Method performance measures are aggregated across all simulated conditions to provide an overall impression of each method's performance. Keep in mind, however, that a method with a high overall ranking is not necessarily the “best” method for a particular application. To select a suitable method for your application, also consider the non-aggregated performance measures in the conditions most relevant to it.

Conditional on Convergence (left) / Replacement if Non-Convergence (right)

| Rank | Method | Value | Rank | Method | Value |
|---|---|---|---|---|---|
| 1 | RMA (default) | 0.057 | 1 | RMA (default) | 0.057 |
| 2 | FMA (default) | 0.063 | 2 | FMA (default) | 0.063 |
| 2 | WLS (default) | 0.063 | 2 | WLS (default) | 0.063 |
| 4 | trimfill (default) | 0.063 | 4 | trimfill (default) | 0.063 |
| 5 | WAAPWLS (default) | 0.077 | 5 | WAAPWLS (default) | 0.077 |
| 6 | PEESE (default) | 0.094 | 6 | PEESE (default) | 0.094 |
| 7 | mean (default) | 0.104 | 7 | mean (default) | 0.104 |
| 8 | RoBMA (PSMA) | 0.109 | 8 | RoBMA (PSMA) | 0.108 |
| 9 | PETPEESE (default) | 0.118 | 9 | PETPEESE (default) | 0.118 |
| 10 | WILS (default) | 0.131 | 10 | WILS (default) | 0.131 |
| 11 | SM (3PSM) | 0.134 | 11 | SM (3PSM) | 0.134 |
| 12 | EK (default) | 0.161 | 12 | EK (default) | 0.161 |
| 13 | PET (default) | 0.161 | 13 | PET (default) | 0.161 |
| 14 | MAIVE (default) | 0.181 | 14 | MAIVE (default) | 0.181 |
| 15 | SM (4PSM) | 0.196 | 15 | SM (4PSM) | 0.198 |
| 16 | pcurve (default) | 0.235 | 16 | pcurve (default) | 0.222 |
| 17 | AK (AK2) | 0.246 | 17 | AK (AK1) | 0.289 |
| 18 | AK (AK1) | 0.297 | 18 | AK (AK2) | 0.356 |
| 19 | MAIVE (WAIVE) | 0.388 | 19 | MAIVE (WAIVE) | 0.388 |
| 20 | puniform (default) | 0.952 | 20 | puniform (default) | 0.862 |
| 21 | puniform (star) | 46.988 | 21 | puniform (star) | 46.988 |

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method.

Conditional on Convergence (left) / Replacement if Non-Convergence (right)

| Rank | Method | Value | Rank | Method | Value |
|---|---|---|---|---|---|
| 1 | FMA (default) | 0.004 | 1 | puniform (default) | -0.001 |
| 1 | WLS (default) | 0.004 | 2 | FMA (default) | 0.004 |
| 3 | WAAPWLS (default) | -0.007 | 2 | WLS (default) | 0.004 |
| 4 | puniform (default) | -0.014 | 4 | WAAPWLS (default) | -0.007 |
| 5 | trimfill (default) | -0.025 | 5 | trimfill (default) | -0.025 |
| 6 | RMA (default) | 0.028 | 6 | RMA (default) | 0.028 |
| 7 | AK (AK1) | -0.031 | 7 | AK (AK1) | -0.032 |
| 8 | PEESE (default) | -0.038 | 8 | PEESE (default) | -0.038 |
| 9 | PETPEESE (default) | -0.052 | 9 | PETPEESE (default) | -0.052 |
| 10 | EK (default) | -0.073 | 10 | pcurve (default) | 0.063 |
| 11 | PET (default) | -0.073 | 11 | EK (default) | -0.073 |
| 12 | pcurve (default) | 0.075 | 12 | PET (default) | -0.073 |
| 13 | mean (default) | 0.075 | 13 | AK (AK2) | -0.074 |
| 14 | SM (3PSM) | -0.085 | 14 | mean (default) | 0.075 |
| 15 | RoBMA (PSMA) | -0.088 | 15 | SM (3PSM) | -0.085 |
| 16 | WILS (default) | -0.096 | 16 | RoBMA (PSMA) | -0.087 |
| 17 | AK (AK2) | -0.101 | 17 | WILS (default) | -0.096 |
| 18 | SM (4PSM) | -0.106 | 18 | SM (4PSM) | -0.106 |
| 19 | MAIVE (default) | -0.113 | 19 | MAIVE (default) | -0.113 |
| 20 | MAIVE (WAIVE) | -0.277 | 20 | MAIVE (WAIVE) | -0.277 |
| 21 | puniform (star) | -4.095 | 21 | puniform (star) | -4.095 |

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0.

Conditional on Convergence (left) / Replacement if Non-Convergence (right)

| Rank | Method | Value | Rank | Method | Value |
|---|---|---|---|---|---|
| 1 | RMA (default) | 0.043 | 1 | RMA (default) | 0.043 |
| 2 | trimfill (default) | 0.048 | 2 | trimfill (default) | 0.048 |
| 3 | RoBMA (PSMA) | 0.054 | 3 | mean (default) | 0.055 |
| 4 | mean (default) | 0.055 | 4 | RoBMA (PSMA) | 0.057 |
| 5 | FMA (default) | 0.058 | 5 | FMA (default) | 0.058 |
| 5 | WLS (default) | 0.058 | 5 | WLS (default) | 0.058 |
| 7 | WAAPWLS (default) | 0.074 | 7 | WAAPWLS (default) | 0.074 |
| 8 | WILS (default) | 0.076 | 8 | WILS (default) | 0.076 |
| 9 | SM (3PSM) | 0.081 | 9 | SM (3PSM) | 0.080 |
| 10 | PEESE (default) | 0.082 | 10 | PEESE (default) | 0.082 |
| 11 | PETPEESE (default) | 0.101 | 11 | PETPEESE (default) | 0.101 |
| 12 | MAIVE (default) | 0.122 | 12 | MAIVE (default) | 0.122 |
| 13 | EK (default) | 0.135 | 13 | EK (default) | 0.135 |
| 14 | PET (default) | 0.135 | 14 | PET (default) | 0.135 |
| 15 | SM (4PSM) | 0.136 | 15 | SM (4PSM) | 0.137 |
| 16 | pcurve (default) | 0.164 | 16 | pcurve (default) | 0.156 |
| 17 | AK (AK2) | 0.187 | 17 | MAIVE (WAIVE) | 0.234 |
| 18 | MAIVE (WAIVE) | 0.234 | 18 | AK (AK1) | 0.268 |
| 19 | AK (AK1) | 0.275 | 19 | AK (AK2) | 0.310 |
| 20 | puniform (default) | 0.888 | 20 | puniform (default) | 0.800 |
| 21 | puniform (star) | 46.707 | 21 | puniform (star) | 46.707 |

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance.

Conditional on Convergence (left) / Replacement if Non-Convergence (right)

| Rank | Method | Value | Rank | Method | Value |
|---|---|---|---|---|---|
| 1 | RMA (default) | 0.419 | 1 | RMA (default) | 0.419 |
| 2 | WLS (default) | 0.511 | 2 | WLS (default) | 0.511 |
| 3 | trimfill (default) | 0.533 | 3 | trimfill (default) | 0.533 |
| 4 | WAAPWLS (default) | 0.539 | 4 | WAAPWLS (default) | 0.539 |
| 5 | FMA (default) | 0.863 | 5 | FMA (default) | 0.863 |
| 6 | PEESE (default) | 0.977 | 6 | PEESE (default) | 0.977 |
| 7 | puniform (star) | 1.130 | 7 | puniform (star) | 1.130 |
| 8 | PETPEESE (default) | 1.153 | 8 | PETPEESE (default) | 1.153 |
| 9 | RoBMA (PSMA) | 1.385 | 9 | RoBMA (PSMA) | 1.363 |
| 10 | SM (3PSM) | 1.563 | 10 | SM (3PSM) | 1.559 |
| 11 | EK (default) | 1.711 | 11 | EK (default) | 1.711 |
| 12 | mean (default) | 1.735 | 12 | SM (4PSM) | 1.717 |
| 13 | SM (4PSM) | 1.762 | 13 | mean (default) | 1.735 |
| 14 | PET (default) | 1.784 | 14 | PET (default) | 1.784 |
| 15 | MAIVE (default) | 1.918 | 15 | MAIVE (default) | 1.918 |
| 16 | WILS (default) | 2.636 | 16 | puniform (default) | 2.538 |
| 17 | puniform (default) | 2.690 | 17 | WILS (default) | 2.636 |
| 18 | MAIVE (WAIVE) | 3.978 | 18 | MAIVE (WAIVE) | 3.978 |
| 19 | AK (AK1) | 5.500 | 19 | AK (AK1) | 5.273 |
| 20 | AK (AK2) | 5.978 | 20 | AK (AK2) | 6.407 |
| 21 | pcurve (default) | NaN | 21 | pcurve (default) | NaN |

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method.

Conditional on Convergence (left) / Replacement if Non-Convergence (right)

| Rank | Method | Value | Rank | Method | Value |
|---|---|---|---|---|---|
| 1 | WAAPWLS (default) | 0.798 | 1 | WAAPWLS (default) | 0.798 |
| 2 | trimfill (default) | 0.749 | 2 | trimfill (default) | 0.748 |
| 3 | RMA (default) | 0.736 | 3 | RMA (default) | 0.736 |
| 4 | AK (AK1) | 0.711 | 4 | AK (AK1) | 0.711 |
| 5 | WLS (default) | 0.698 | 5 | WLS (default) | 0.698 |
| 6 | puniform (star) | 0.656 | 6 | AK (AK2) | 0.659 |
| 7 | AK (AK2) | 0.653 | 7 | puniform (star) | 0.656 |
| 8 | MAIVE (default) | 0.651 | 8 | MAIVE (default) | 0.651 |
| 9 | RoBMA (PSMA) | 0.639 | 9 | RoBMA (PSMA) | 0.640 |
| 10 | SM (4PSM) | 0.619 | 10 | SM (4PSM) | 0.619 |
| 11 | PEESE (default) | 0.618 | 11 | PEESE (default) | 0.618 |
| 12 | PETPEESE (default) | 0.615 | 12 | PETPEESE (default) | 0.615 |
| 13 | puniform (default) | 0.606 | 13 | puniform (default) | 0.607 |
| 14 | SM (3PSM) | 0.595 | 14 | SM (3PSM) | 0.595 |
| 15 | EK (default) | 0.568 | 15 | EK (default) | 0.568 |
| 16 | PET (default) | 0.554 | 16 | PET (default) | 0.554 |
| 17 | MAIVE (WAIVE) | 0.541 | 17 | MAIVE (WAIVE) | 0.541 |
| 18 | FMA (default) | 0.530 | 18 | FMA (default) | 0.530 |
| 19 | mean (default) | 0.437 | 19 | mean (default) | 0.437 |
| 20 | WILS (default) | 0.402 | 20 | WILS (default) | 0.402 |
| 21 | pcurve (default) | NaN | 21 | pcurve (default) | NaN |

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Conditional on Convergence (left) / Replacement if Non-Convergence (right)

| Rank | Method | Value | Rank | Method | Value |
|---|---|---|---|---|---|
| 1 | FMA (default) | 0.096 | 1 | FMA (default) | 0.096 |
| 2 | WILS (default) | 0.138 | 2 | WILS (default) | 0.138 |
| 3 | WLS (default) | 0.143 | 3 | WLS (default) | 0.143 |
| 4 | mean (default) | 0.149 | 4 | mean (default) | 0.149 |
| 5 | trimfill (default) | 0.175 | 5 | trimfill (default) | 0.175 |
| 6 | RMA (default) | 0.177 | 6 | RMA (default) | 0.177 |
| 7 | PEESE (default) | 0.184 | 7 | PEESE (default) | 0.184 |
| 8 | WAAPWLS (default) | 0.206 | 8 | WAAPWLS (default) | 0.206 |
| 9 | RoBMA (PSMA) | 0.211 | 9 | RoBMA (PSMA) | 0.210 |
| 10 | PETPEESE (default) | 0.237 | 10 | PETPEESE (default) | 0.237 |
| 11 | SM (3PSM) | 0.268 | 11 | SM (3PSM) | 0.264 |
| 12 | puniform (star) | 0.270 | 12 | puniform (star) | 0.270 |
| 13 | PET (default) | 0.296 | 13 | PET (default) | 0.296 |
| 14 | EK (default) | 0.322 | 14 | EK (default) | 0.322 |
| 15 | SM (4PSM) | 0.455 | 15 | SM (4PSM) | 0.410 |
| 16 | MAIVE (default) | 0.477 | 16 | MAIVE (default) | 0.477 |
| 17 | MAIVE (WAIVE) | 0.762 | 17 | MAIVE (WAIVE) | 0.762 |
| 18 | puniform (default) | 0.976 | 18 | puniform (default) | 0.845 |
| 19 | AK (AK2) | 4.827 | 19 | AK (AK1) | 4.810 |
| 20 | AK (AK1) | 5.038 | 20 | AK (AK2) | 5.426 |
| 21 | pcurve (default) | NaN | 21 | pcurve (default) | NaN |

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method.

Conditional on Convergence (left) / Replacement if Non-Convergence (right)

| Rank | Method | Log Value | Rank | Method | Log Value |
|---|---|---|---|---|---|
| 1 | RoBMA (PSMA) | 3.582 | 1 | RoBMA (PSMA) | 3.669 |
| 2 | AK (AK1) | 2.778 | 2 | AK (AK1) | 2.770 |
| 3 | RMA (default) | 2.193 | 3 | RMA (default) | 2.193 |
| 4 | trimfill (default) | 2.152 | 4 | trimfill (default) | 2.152 |
| 5 | puniform (default) | 1.977 | 5 | puniform (default) | 1.961 |
| 6 | WLS (default) | 1.676 | 6 | WLS (default) | 1.676 |
| 7 | WAAPWLS (default) | 1.563 | 7 | WAAPWLS (default) | 1.563 |
| 8 | puniform (star) | 1.441 | 8 | AK (AK2) | 1.556 |
| 9 | PEESE (default) | 1.327 | 9 | puniform (star) | 1.441 |
| 10 | PETPEESE (default) | 1.272 | 10 | PEESE (default) | 1.327 |
| 11 | AK (AK2) | 1.240 | 11 | PETPEESE (default) | 1.272 |
| 12 | SM (3PSM) | 1.112 | 12 | SM (3PSM) | 1.120 |
| 13 | FMA (default) | 1.095 | 13 | FMA (default) | 1.095 |
| 14 | EK (default) | 1.066 | 14 | EK (default) | 1.066 |
| 14 | PET (default) | 1.066 | 14 | PET (default) | 1.066 |
| 16 | mean (default) | 1.033 | 16 | mean (default) | 1.033 |
| 17 | MAIVE (default) | 0.980 | 17 | MAIVE (default) | 0.980 |
| 18 | SM (4PSM) | 0.893 | 18 | SM (4PSM) | 0.915 |
| 19 | WILS (default) | 0.856 | 19 | WILS (default) | 0.856 |
| 20 | MAIVE (WAIVE) | -0.131 | 20 | MAIVE (WAIVE) | -0.131 |
| 21 | pcurve (default) | NaN | 21 | pcurve (default) | NaN |

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Conditional on Convergence (left) / Replacement if Non-Convergence (right)

| Rank | Method | Log Value | Rank | Method | Log Value |
|---|---|---|---|---|---|
| 1 | RMA (default) | -6.535 | 1 | RMA (default) | -6.535 |
| 2 | AK (AK1) | -6.401 | 2 | AK (AK1) | -6.404 |
| 3 | WLS (default) | -6.229 | 3 | WLS (default) | -6.229 |
| 4 | trimfill (default) | -6.222 | 4 | trimfill (default) | -6.224 |
| 5 | FMA (default) | -6.193 | 5 | FMA (default) | -6.193 |
| 6 | mean (default) | -5.430 | 6 | mean (default) | -5.430 |
| 7 | PEESE (default) | -5.202 | 7 | PEESE (default) | -5.202 |
| 8 | WAAPWLS (default) | -4.263 | 8 | WAAPWLS (default) | -4.263 |
| 9 | PETPEESE (default) | -4.199 | 9 | PETPEESE (default) | -4.199 |
| 10 | puniform (default) | -4.028 | 10 | puniform (default) | -4.030 |
| 11 | EK (default) | -3.956 | 11 | EK (default) | -3.956 |
| 11 | PET (default) | -3.956 | 11 | PET (default) | -3.956 |
| 13 | puniform (star) | -3.690 | 13 | puniform (star) | -3.690 |
| 14 | RoBMA (PSMA) | -3.314 | 14 | AK (AK2) | -3.483 |
| 15 | SM (3PSM) | -3.252 | 15 | RoBMA (PSMA) | -3.326 |
| 16 | WILS (default) | -3.212 | 16 | SM (3PSM) | -3.259 |
| 17 | SM (4PSM) | -2.447 | 17 | WILS (default) | -3.212 |
| 18 | MAIVE (default) | -2.288 | 18 | SM (4PSM) | -2.519 |
| 19 | AK (AK2) | -1.757 | 19 | MAIVE (default) | -2.288 |
| 20 | MAIVE (WAIVE) | -0.127 | 20 | MAIVE (WAIVE) | -0.127 |
| 21 | pcurve (default) | NaN | 21 | pcurve (default) | NaN |

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.
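The negative likelihood ratio has an analogous closed form in terms of power and the type I error rate; again a minimal sketch, not the simulation's code:

```python
import math

def log_negative_lr(power: float, type1_error: float) -> float:
    """Log negative likelihood ratio: how much a non-significant result
    shifts the odds toward the alternative hypothesis.
    LR- = P(non-significant | H1) / P(non-significant | H0)
        = (1 - power) / (1 - alpha)."""
    return math.log((1.0 - power) / (1.0 - type1_error))

# A method with 80% power and a 5% type I error rate:
print(log_negative_lr(0.80, 0.05))  # log(0.2 / 0.95) ≈ -1.56, below 0
```

The more negative the log value, the more strongly a non-significant result counts as evidence for the null.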

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 AK (AK1) 0.114 1 AK (AK1) 0.114
2 RoBMA (PSMA) 0.143 2 RoBMA (PSMA) 0.143
3 trimfill (default) 0.155 3 trimfill (default) 0.155
4 RMA (default) 0.186 4 RMA (default) 0.186
5 WAAPWLS (default) 0.220 5 WAAPWLS (default) 0.220
6 WLS (default) 0.227 6 WLS (default) 0.227
7 MAIVE (WAIVE) 0.276 7 MAIVE (WAIVE) 0.276
8 MAIVE (default) 0.293 8 MAIVE (default) 0.293
9 PETPEESE (default) 0.302 9 PETPEESE (default) 0.302
10 AK (AK2) 0.312 10 AK (AK2) 0.307
11 PEESE (default) 0.312 11 PEESE (default) 0.312
12 puniform (star) 0.319 12 puniform (star) 0.319
13 EK (default) 0.348 13 EK (default) 0.348
13 PET (default) 0.348 13 PET (default) 0.348
15 puniform (default) 0.378 15 puniform (default) 0.377
16 SM (4PSM) 0.413 16 SM (4PSM) 0.413
17 SM (3PSM) 0.428 17 SM (3PSM) 0.428
18 FMA (default) 0.442 18 FMA (default) 0.442
19 mean (default) 0.499 19 mean (default) 0.499
20 WILS (default) 0.501 20 WILS (default) 0.501
21 pcurve (default) NaN 21 pcurve (default) NaN

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 mean (default) 0.986 1 mean (default) 0.986
2 FMA (default) 0.985 2 FMA (default) 0.985
3 RMA (default) 0.969 3 RMA (default) 0.969
4 WLS (default) 0.964 4 WLS (default) 0.964
5 trimfill (default) 0.949 5 trimfill (default) 0.949
6 AK (AK1) 0.949 6 AK (AK1) 0.948
7 PEESE (default) 0.915 7 PEESE (default) 0.915
8 puniform (default) 0.885 8 puniform (default) 0.885
9 WILS (default) 0.876 9 WILS (default) 0.876
10 WAAPWLS (default) 0.858 10 WAAPWLS (default) 0.858
11 PETPEESE (default) 0.845 11 PETPEESE (default) 0.845
12 EK (default) 0.831 12 EK (default) 0.831
12 PET (default) 0.831 12 PET (default) 0.831
14 SM (3PSM) 0.819 14 SM (3PSM) 0.821
15 puniform (star) 0.811 15 puniform (star) 0.811
16 SM (4PSM) 0.732 16 AK (AK2) 0.795
17 AK (AK2) 0.711 17 SM (4PSM) 0.739
18 RoBMA (PSMA) 0.674 18 RoBMA (PSMA) 0.676
19 MAIVE (default) 0.649 19 MAIVE (default) 0.649
20 MAIVE (WAIVE) 0.286 20 MAIVE (WAIVE) 0.286
21 pcurve (default) NaN 21 pcurve (default) NaN

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.
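Both the type I error rate and power are rejection proportions; they differ only in whether the null or the alternative generated the data. A minimal sketch, using hypothetical p-values rather than actual simulation output:

```python
def rejection_rate(p_values, alpha=0.05):
    """Proportion of simulation runs in which H0 was rejected.
    On data simulated under a true null this estimates the type I
    error rate; on data with a true effect it estimates power."""
    p_values = [p for p in p_values if p == p]  # drop NaN (non-converged runs)
    return sum(p < alpha for p in p_values) / len(p_values)

# Hypothetical p-values from 8 simulated meta-analyses of null data:
print(rejection_rate([0.31, 0.02, 0.64, 0.48, 0.93, 0.04, 0.27, 0.55]))  # 0.25
```

Dropping NaN p-values mirrors the "conditional on convergence" columns above; the "replacement" columns would instead substitute results from a fallback method before computing the proportion.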

By-Condition Performance (Conditional on Method Convergence)

The results below are conditional on method convergence. Note that the methods might differ in convergence rate and are therefore not compared on the same data sets.

Raincloud plot showing convergence rates across different methods


Raincloud plot showing RMSE (Root Mean Square Error) across different methods

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method. Values larger than 0.5 are visualized as 0.5.

Raincloud plot showing bias across different methods

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0. Values lower than -0.5 or larger than 0.5 are visualized as -0.5 and 0.5 respectively.

Raincloud plot showing empirical SE across different methods

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance. Values larger than 0.5 are visualized as 0.5.
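Bias, empirical SE, and RMSE are linked by the identity RMSE² = bias² + empirical SE² (when the empirical SE uses the n-denominator variance). A minimal sketch with made-up estimates, illustrating the definitions rather than reproducing the simulation's code:

```python
import math

def performance_measures(estimates, true_effect):
    """Bias, empirical SE, and RMSE of meta-analytic estimates
    across simulation runs; RMSE^2 = bias^2 + empirical SE^2."""
    n = len(estimates)
    bias = sum(e - true_effect for e in estimates) / n
    mean_est = sum(estimates) / n
    emp_se = math.sqrt(sum((e - mean_est) ** 2 for e in estimates) / n)
    rmse = math.sqrt(sum((e - true_effect) ** 2 for e in estimates) / n)
    return bias, emp_se, rmse

# Four hypothetical estimates of a true effect of 0.20:
bias, emp_se, rmse = performance_measures([0.12, 0.25, 0.31, 0.18], 0.20)
assert abs(rmse**2 - (bias**2 + emp_se**2)) < 1e-12  # decomposition holds
```

This is why a method can rank well on RMSE despite mediocre bias (low variability) or despite a large empirical SE (near-zero bias): RMSE trades the two off.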

Raincloud plot showing interval score across different methods

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method. Values larger than 100 are visualized as 100.
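The sketch below follows the standard interval-score definition (interval width plus a 2/α penalty per unit by which the interval misses the true value); whether the simulation uses exactly this variant is an assumption:

```python
def interval_score(lower, upper, true_value, alpha=0.05):
    """Interval score for a (1 - alpha) confidence interval:
    width, plus a penalty of 2/alpha per unit of miss when the
    true value falls outside the interval."""
    score = upper - lower
    if true_value < lower:
        score += (2.0 / alpha) * (lower - true_value)
    elif true_value > upper:
        score += (2.0 / alpha) * (true_value - upper)
    return score

print(interval_score(0.10, 0.30, 0.20))  # covers the truth: score = width
print(interval_score(0.10, 0.30, 0.35))  # misses by 0.05: width + 40 * 0.05
```

With α = 0.05 the penalty factor is 40, so even small coverage failures dominate the width term, which is why wide-but-honest intervals can outscore narrow ones with poor coverage.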

Raincloud plot showing 95% confidence interval coverage across different methods

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Raincloud plot showing 95% confidence interval width across different methods

95% CI width is the average width of the 95% confidence interval for the true effect across simulation runs. A lower average 95% CI width indicates a better method.

Raincloud plot showing positive likelihood ratio across different methods

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Raincloud plot showing negative likelihood ratio across different methods

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.

Raincloud plot showing Type I Error rates across different methods

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Raincloud plot showing statistical power across different methods

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

By-Condition Performance (Replacement in Case of Non-Convergence)

The results below incorporate method replacement to handle non-convergence. If a method fails to converge, its results are replaced with the results from a simpler method (e.g., random-effects meta-analysis without publication bias adjustment). This emulates what a data analyst may do in practice in case a method does not converge. However, note that these results do not correspond to “pure” method performance as they might combine multiple different methods. See Method Replacement Strategy for details of the method replacement specification.

Raincloud plot showing convergence rates across different methods

Raincloud plot showing RMSE (Root Mean Square Error) across different methods

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method. Values larger than 0.5 are visualized as 0.5.

Raincloud plot showing bias across different methods

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0. Values lower than -0.5 or larger than 0.5 are visualized as -0.5 and 0.5 respectively.

Raincloud plot showing empirical SE across different methods

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance. Values larger than 0.5 are visualized as 0.5.

Raincloud plot showing interval score across different methods

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method. Values larger than 100 are visualized as 100.

Raincloud plot showing 95% confidence interval coverage across different methods

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Raincloud plot showing 95% confidence interval width across different methods

95% CI width is the average width of the 95% confidence interval for the true effect across simulation runs. A lower average 95% CI width indicates a better method.

Raincloud plot showing positive likelihood ratio across different methods

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Raincloud plot showing negative likelihood ratio across different methods

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.

Raincloud plot showing Type I Error rates across different methods

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Raincloud plot showing statistical power across different methods

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

Subset: Medium Questionable Research Practices

These results are based on Carter (2019) data-generating mechanism with a total of 252 conditions.

Average Performance

Method performance measures are aggregated across all simulated conditions to provide an overall impression of method performance. However, keep in mind that a method with a high overall ranking is not necessarily the “best” method for a particular application. To select a suitable method for your application, consider also non-aggregated performance measures in conditions most relevant to your application.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 AK (AK1) 0.087 1 AK (AK1) 0.087
2 PEESE (default) 0.101 2 PEESE (default) 0.101
3 WAAPWLS (default) 0.107 3 AK (AK2) 0.105
4 PETPEESE (default) 0.111 4 WAAPWLS (default) 0.107
5 trimfill (default) 0.117 5 PETPEESE (default) 0.111
6 WILS (default) 0.122 6 trimfill (default) 0.117
7 FMA (default) 0.133 7 WILS (default) 0.122
7 WLS (default) 0.133 8 FMA (default) 0.133
9 RoBMA (PSMA) 0.135 8 WLS (default) 0.133
10 AK (AK2) 0.136 10 RoBMA (PSMA) 0.137
11 pcurve (default) 0.143 11 pcurve (default) 0.142
12 EK (default) 0.145 12 EK (default) 0.145
13 PET (default) 0.145 13 PET (default) 0.145
14 MAIVE (default) 0.162 14 MAIVE (default) 0.162
15 RMA (default) 0.191 15 RMA (default) 0.191
16 puniform (default) 0.259 16 puniform (default) 0.251
17 SM (3PSM) 0.272 17 SM (3PSM) 0.263
18 mean (default) 0.284 18 mean (default) 0.284
19 MAIVE (WAIVE) 0.370 19 MAIVE (WAIVE) 0.370
20 SM (4PSM) 0.444 20 SM (4PSM) 0.462
21 puniform (star) 147.205 21 puniform (star) 147.205

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 PETPEESE (default) 0.011 1 PETPEESE (default) 0.011
2 AK (AK1) 0.036 2 AK (AK2) -0.018
3 PEESE (default) 0.037 3 AK (AK1) 0.036
4 MAIVE (default) -0.041 4 PEESE (default) 0.037
5 pcurve (default) 0.042 5 MAIVE (default) -0.041
6 puniform (default) 0.045 6 pcurve (default) 0.042
7 WILS (default) -0.048 7 puniform (default) 0.046
8 EK (default) -0.052 8 WILS (default) -0.048
9 PET (default) -0.052 9 EK (default) -0.052
10 WAAPWLS (default) 0.068 10 PET (default) -0.052
11 AK (AK2) -0.086 11 WAAPWLS (default) 0.068
12 SM (3PSM) -0.090 12 SM (3PSM) -0.084
13 trimfill (default) 0.090 13 RoBMA (PSMA) -0.085
14 RoBMA (PSMA) -0.091 14 trimfill (default) 0.090
15 FMA (default) 0.112 15 FMA (default) 0.112
15 WLS (default) 0.112 15 WLS (default) 0.112
17 RMA (default) 0.183 17 RMA (default) 0.183
18 SM (4PSM) -0.202 18 SM (4PSM) -0.204
19 MAIVE (WAIVE) -0.254 19 MAIVE (WAIVE) -0.254
20 mean (default) 0.277 20 mean (default) 0.277
21 puniform (star) -23.968 21 puniform (star) -23.968

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 RMA (default) 0.041 1 RMA (default) 0.041
2 AK (AK1) 0.042 2 AK (AK1) 0.042
3 trimfill (default) 0.044 3 trimfill (default) 0.045
4 mean (default) 0.052 4 mean (default) 0.052
5 FMA (default) 0.059 5 FMA (default) 0.059
5 WLS (default) 0.059 5 WLS (default) 0.059
7 pcurve (default) 0.071 7 pcurve (default) 0.070
8 PEESE (default) 0.074 8 PEESE (default) 0.074
9 WAAPWLS (default) 0.074 9 WAAPWLS (default) 0.074
10 RoBMA (PSMA) 0.078 10 AK (AK2) 0.081
11 AK (AK2) 0.079 11 WILS (default) 0.089
12 WILS (default) 0.089 12 RoBMA (PSMA) 0.091
13 PETPEESE (default) 0.095 13 PETPEESE (default) 0.095
14 PET (default) 0.118 14 PET (default) 0.118
15 EK (default) 0.118 15 EK (default) 0.118
16 MAIVE (default) 0.118 16 MAIVE (default) 0.118
17 puniform (default) 0.183 17 puniform (default) 0.175
18 SM (3PSM) 0.220 18 SM (3PSM) 0.212
19 MAIVE (WAIVE) 0.221 19 MAIVE (WAIVE) 0.221
20 SM (4PSM) 0.343 20 SM (4PSM) 0.361
21 puniform (star) 144.875 21 puniform (star) 144.875

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 AK (AK2) 1.212 1 AK (AK2) 0.960
2 AK (AK1) 1.235 2 AK (AK1) 1.240
3 puniform (star) 1.358 3 puniform (star) 1.358
4 WAAPWLS (default) 1.375 4 WAAPWLS (default) 1.375
5 PETPEESE (default) 1.437 5 PETPEESE (default) 1.437
6 PEESE (default) 1.462 6 PEESE (default) 1.462
7 SM (3PSM) 1.698 7 SM (3PSM) 1.674
8 RoBMA (PSMA) 1.724 8 EK (default) 1.745
9 EK (default) 1.745 9 RoBMA (PSMA) 1.780
10 PET (default) 1.819 10 PET (default) 1.819
11 WILS (default) 1.929 11 WILS (default) 1.929
12 MAIVE (default) 1.952 12 MAIVE (default) 1.952
13 trimfill (default) 2.200 13 trimfill (default) 2.202
14 puniform (default) 2.525 14 puniform (default) 2.516
15 WLS (default) 2.871 15 WLS (default) 2.871
16 SM (4PSM) 3.078 16 SM (4PSM) 2.975
17 FMA (default) 3.472 17 FMA (default) 3.472
18 MAIVE (WAIVE) 4.466 18 MAIVE (WAIVE) 4.466
19 RMA (default) 4.763 19 RMA (default) 4.763
20 mean (default) 8.442 20 mean (default) 8.442
21 pcurve (default) NaN 21 pcurve (default) NaN

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 AK (AK2) 0.735 1 AK (AK2) 0.686
2 RoBMA (PSMA) 0.653 2 RoBMA (PSMA) 0.650
3 SM (3PSM) 0.640 3 SM (3PSM) 0.635
4 puniform (star) 0.633 4 puniform (star) 0.633
5 SM (4PSM) 0.619 5 SM (4PSM) 0.614
6 WAAPWLS (default) 0.570 6 WAAPWLS (default) 0.570
7 MAIVE (default) 0.562 7 MAIVE (default) 0.562
8 AK (AK1) 0.553 8 AK (AK1) 0.551
9 puniform (default) 0.529 9 puniform (default) 0.529
10 PETPEESE (default) 0.512 10 PETPEESE (default) 0.512
11 MAIVE (WAIVE) 0.492 11 MAIVE (WAIVE) 0.492
12 EK (default) 0.468 12 EK (default) 0.468
13 PET (default) 0.453 13 PET (default) 0.453
14 PEESE (default) 0.446 14 PEESE (default) 0.446
15 trimfill (default) 0.427 15 trimfill (default) 0.426
16 WILS (default) 0.412 16 WILS (default) 0.412
17 WLS (default) 0.275 17 WLS (default) 0.275
18 FMA (default) 0.213 18 FMA (default) 0.213
19 RMA (default) 0.191 19 RMA (default) 0.191
20 mean (default) 0.059 20 mean (default) 0.059
21 pcurve (default) NaN 21 pcurve (default) NaN

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 FMA (default) 0.089 1 FMA (default) 0.089
2 WLS (default) 0.134 2 WLS (default) 0.134
3 mean (default) 0.147 3 mean (default) 0.147
4 PEESE (default) 0.150 4 PEESE (default) 0.150
5 trimfill (default) 0.160 5 trimfill (default) 0.160
6 WILS (default) 0.160 6 WILS (default) 0.160
7 RMA (default) 0.164 7 RMA (default) 0.164
8 AK (AK1) 0.168 8 AK (AK1) 0.168
9 PETPEESE (default) 0.184 9 PETPEESE (default) 0.184
10 WAAPWLS (default) 0.198 10 WAAPWLS (default) 0.198
11 PET (default) 0.237 11 PET (default) 0.237
12 EK (default) 0.257 12 EK (default) 0.257
13 RoBMA (PSMA) 0.288 13 AK (AK2) 0.266
14 puniform (star) 0.332 14 RoBMA (PSMA) 0.286
15 MAIVE (default) 0.371 15 puniform (star) 0.332
16 puniform (default) 0.375 16 puniform (default) 0.367
17 AK (AK2) 0.433 17 MAIVE (default) 0.371
18 SM (3PSM) 0.543 18 SM (3PSM) 0.498
19 MAIVE (WAIVE) 0.630 19 MAIVE (WAIVE) 0.630
20 SM (4PSM) 1.069 20 SM (4PSM) 0.939
21 pcurve (default) NaN 21 pcurve (default) NaN

95% CI width is the average width of the 95% confidence interval for the true effect across simulation runs. A lower average 95% CI width indicates a better method.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Log Value Rank Method Log Value
1 puniform (default) 1.991 1 puniform (default) 1.983
2 RoBMA (PSMA) 1.807 2 RoBMA (PSMA) 1.743
3 AK (AK2) 1.595 3 PETPEESE (default) 1.261
4 PETPEESE (default) 1.261 4 MAIVE (default) 1.059
5 MAIVE (default) 1.059 5 PET (default) 1.012
6 PET (default) 1.012 6 EK (default) 1.012
7 EK (default) 1.012 7 AK (AK2) 0.847
8 SM (3PSM) 0.833 8 SM (3PSM) 0.828
9 WAAPWLS (default) 0.665 9 WAAPWLS (default) 0.665
10 puniform (star) 0.553 10 puniform (star) 0.553
11 PEESE (default) 0.461 11 PEESE (default) 0.461
12 MAIVE (WAIVE) 0.444 12 MAIVE (WAIVE) 0.444
13 WILS (default) 0.427 13 WILS (default) 0.427
14 AK (AK1) 0.242 14 AK (AK1) 0.241
15 trimfill (default) 0.193 15 trimfill (default) 0.193
16 WLS (default) 0.154 16 WLS (default) 0.154
17 RMA (default) 0.088 17 RMA (default) 0.088
18 SM (4PSM) 0.057 18 SM (4PSM) 0.072
19 FMA (default) 0.053 19 FMA (default) 0.053
20 mean (default) 0.019 20 mean (default) 0.019
21 pcurve (default) NaN 21 pcurve (default) NaN

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Log Value Rank Method Log Value
1 WAAPWLS (default) -4.827 1 WAAPWLS (default) -4.827
2 PETPEESE (default) -4.469 2 PETPEESE (default) -4.469
3 EK (default) -4.350 3 EK (default) -4.350
4 PET (default) -4.350 4 PET (default) -4.350
5 PEESE (default) -4.272 5 PEESE (default) -4.272
6 puniform (default) -3.630 6 AK (AK2) -4.067
7 WLS (default) -2.606 7 puniform (default) -3.629
8 SM (3PSM) -2.571 8 WLS (default) -2.606
9 MAIVE (default) -2.525 9 SM (3PSM) -2.590
10 trimfill (default) -2.517 10 MAIVE (default) -2.525
11 WILS (default) -2.389 11 trimfill (default) -2.519
12 AK (AK2) -2.109 12 WILS (default) -2.389
13 FMA (default) -2.070 13 FMA (default) -2.070
14 puniform (star) -2.025 14 puniform (star) -2.025
15 RoBMA (PSMA) -1.921 15 RoBMA (PSMA) -1.913
16 AK (AK1) -1.790 16 AK (AK1) -1.793
17 RMA (default) -1.448 17 RMA (default) -1.448
18 mean (default) -0.850 18 mean (default) -0.850
19 SM (4PSM) -0.444 19 SM (4PSM) -0.550
20 MAIVE (WAIVE) -0.120 20 MAIVE (WAIVE) -0.120
21 pcurve (default) NaN 21 pcurve (default) NaN

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 AK (AK2) 0.169 1 MAIVE (WAIVE) 0.226
2 MAIVE (WAIVE) 0.226 2 RoBMA (PSMA) 0.333
3 RoBMA (PSMA) 0.324 3 MAIVE (default) 0.342
4 MAIVE (default) 0.342 4 PETPEESE (default) 0.357
5 PETPEESE (default) 0.357 5 PET (default) 0.397
6 PET (default) 0.397 6 EK (default) 0.397
7 EK (default) 0.397 7 puniform (default) 0.483
8 puniform (default) 0.483 8 AK (AK2) 0.492
9 SM (3PSM) 0.491 9 SM (3PSM) 0.493
10 WAAPWLS (default) 0.515 10 WAAPWLS (default) 0.515
11 puniform (star) 0.518 11 puniform (star) 0.518
12 WILS (default) 0.601 12 WILS (default) 0.601
13 SM (4PSM) 0.623 13 SM (4PSM) 0.625
14 PEESE (default) 0.684 14 PEESE (default) 0.684
15 trimfill (default) 0.857 15 trimfill (default) 0.857
16 AK (AK1) 0.867 16 AK (AK1) 0.867
17 WLS (default) 0.876 17 WLS (default) 0.876
18 RMA (default) 0.932 18 RMA (default) 0.932
19 FMA (default) 0.954 19 FMA (default) 0.954
20 mean (default) 0.983 20 mean (default) 0.983
21 pcurve (default) NaN 21 pcurve (default) NaN

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 mean (default) 1.000 1 mean (default) 1.000
2 FMA (default) 0.999 2 FMA (default) 0.999
3 RMA (default) 0.998 3 RMA (default) 0.998
4 trimfill (default) 0.995 4 trimfill (default) 0.995
5 WLS (default) 0.994 5 WLS (default) 0.994
6 AK (AK1) 0.990 6 AK (AK1) 0.990
7 PEESE (default) 0.981 7 PEESE (default) 0.981
8 WAAPWLS (default) 0.948 8 WAAPWLS (default) 0.948
9 puniform (default) 0.923 9 puniform (default) 0.923
10 PETPEESE (default) 0.915 10 PETPEESE (default) 0.915
11 EK (default) 0.900 11 EK (default) 0.900
12 PET (default) 0.900 12 PET (default) 0.900
13 WILS (default) 0.861 13 AK (AK2) 0.879
14 SM (3PSM) 0.821 14 WILS (default) 0.861
15 puniform (star) 0.755 15 SM (3PSM) 0.826
16 AK (AK2) 0.716 16 puniform (star) 0.755
17 MAIVE (default) 0.716 17 MAIVE (default) 0.716
18 SM (4PSM) 0.670 18 SM (4PSM) 0.684
19 RoBMA (PSMA) 0.629 19 RoBMA (PSMA) 0.634
20 MAIVE (WAIVE) 0.288 20 MAIVE (WAIVE) 0.288
21 pcurve (default) NaN 21 pcurve (default) NaN

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

By-Condition Performance (Conditional on Method Convergence)

The results below are conditional on method convergence. Note that the methods might differ in convergence rate and are therefore not compared on the same data sets.

Raincloud plot showing convergence rates across different methods

Raincloud plot showing RMSE (Root Mean Square Error) across different methods

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method. Values larger than 0.5 are visualized as 0.5.

Raincloud plot showing bias across different methods

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0. Values lower than -0.5 or larger than 0.5 are visualized as -0.5 and 0.5 respectively.

Raincloud plot showing empirical SE across different methods

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance. Values larger than 0.5 are visualized as 0.5.

Raincloud plot showing interval score across different methods

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method. Values larger than 100 are visualized as 100.

Raincloud plot showing 95% confidence interval coverage across different methods

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Raincloud plot showing 95% confidence interval width across different methods

95% CI width is the average width of the 95% confidence interval for the true effect across simulation runs. A lower average 95% CI width indicates a better method.

Raincloud plot showing positive likelihood ratio across different methods

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Raincloud plot showing negative likelihood ratio across different methods

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.

Raincloud plot showing Type I Error rates across different methods

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Raincloud plot showing statistical power across different methods

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

By-Condition Performance (Replacement in Case of Non-Convergence)

The results below incorporate method replacement to handle non-convergence. If a method fails to converge, its results are replaced with the results from a simpler method (e.g., random-effects meta-analysis without publication bias adjustment). This emulates what a data analyst may do in practice in case a method does not converge. However, note that these results do not correspond to “pure” method performance as they might combine multiple different methods. See Method Replacement Strategy for details of the method replacement specification.

Raincloud plot showing convergence rates across different methods

Raincloud plot showing RMSE (Root Mean Square Error) across different methods

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method. Values larger than 0.5 are visualized as 0.5.

Raincloud plot showing bias across different methods

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0. Values lower than -0.5 or larger than 0.5 are visualized as -0.5 and 0.5 respectively.

Raincloud plot showing empirical SE across different methods

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance. Values larger than 0.5 are visualized as 0.5.

Raincloud plot showing interval score across different methods

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method. Values larger than 100 are visualized as 100.

Raincloud plot showing 95% confidence interval coverage across different methods

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Raincloud plot showing 95% confidence interval width across different methods

95% CI width is the average width of the 95% confidence interval for the true effect across simulation runs. A lower average 95% CI width indicates a better method.

Raincloud plot showing positive likelihood ratio across different methods

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Raincloud plot showing negative likelihood ratio across different methods

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.

Raincloud plot showing Type I Error rates across different methods

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Raincloud plot showing statistical power across different methods

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.
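Type I error rate and power are both rejection proportions, computed under the null and under the alternative respectively. A sketch with made-up p-values (not from the benchmark):

```python
# Hypothetical p-values from simulation runs (illustrative numbers).
pvals_null = [0.40, 0.03, 0.72, 0.18, 0.55]    # true effect = 0
pvals_alt = [0.001, 0.04, 0.20, 0.01, 0.03]    # true effect != 0

alpha = 0.05
type1_error = sum(p < alpha for p in pvals_null) / len(pvals_null)
power = sum(p < alpha for p in pvals_alt) / len(pvals_alt)

print(type1_error)  # 0.2: one of five null runs falsely rejected
print(power)        # 0.8: four of five alternative runs correctly rejected
```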

Subset: High Questionable Research Practices

These results are based on the Carter (2019) data-generating mechanism with a total of 252 conditions.

Average Performance

Method performance measures are aggregated across all simulated conditions to provide an overall impression of method performance. However, keep in mind that a method with a high overall ranking is not necessarily the “best” method for a particular application. To select a suitable method for your application, consider also non-aggregated performance measures in conditions most relevant to your application.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 AK (AK1) 0.105 1 AK (AK1) 0.105
2 PEESE (default) 0.127 2 AK (AK2) 0.117
3 PETPEESE (default) 0.129 3 PEESE (default) 0.127
4 pcurve (default) 0.129 4 PETPEESE (default) 0.129
5 WILS (default) 0.130 5 pcurve (default) 0.129
6 WAAPWLS (default) 0.131 6 WILS (default) 0.130
7 EK (default) 0.140 7 WAAPWLS (default) 0.131
8 PET (default) 0.140 8 EK (default) 0.140
9 puniform (default) 0.149 9 PET (default) 0.140
10 AK (AK2) 0.158 10 puniform (default) 0.149
11 trimfill (default) 0.159 11 trimfill (default) 0.159
12 RoBMA (PSMA) 0.162 12 RoBMA (PSMA) 0.164
13 MAIVE (default) 0.169 13 MAIVE (default) 0.169
14 FMA (default) 0.175 14 FMA (default) 0.175
14 WLS (default) 0.175 14 WLS (default) 0.175
16 RMA (default) 0.244 16 RMA (default) 0.244
17 mean (default) 0.358 17 mean (default) 0.358
18 MAIVE (WAIVE) 0.378 18 MAIVE (WAIVE) 0.378
19 SM (3PSM) 0.493 19 SM (3PSM) 0.468
20 SM (4PSM) 0.736 20 SM (4PSM) 0.779
21 puniform (star) 314.821 21 puniform (star) 314.821

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method.
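The relationship between RMSE, bias, and variability can be checked numerically: with the population SD, RMSE² = bias² + SE² holds exactly. A sketch with illustrative estimates (not from the benchmark):

```python
from math import sqrt
from statistics import mean, pstdev

# Hypothetical estimates from simulation runs with true effect 0.3
# (illustrative numbers, not from the benchmark).
estimates = [0.28, 0.31, 0.35, 0.27, 0.29]
truth = 0.3

rmse = sqrt(mean((est - truth) ** 2 for est in estimates))
bias = mean(estimates) - truth
emp_sd = pstdev(estimates)  # population SD makes the decomposition exact

print(round(rmse, 4))                          # 0.0283
print(round(sqrt(bias**2 + emp_sd**2), 4))     # 0.0283, matching the RMSE
```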

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 WILS (default) -0.012 1 WILS (default) -0.012
2 MAIVE (default) -0.012 2 MAIVE (default) -0.012
3 EK (default) -0.032 3 EK (default) -0.032
4 PET (default) -0.032 4 PET (default) -0.032
5 pcurve (default) 0.038 5 pcurve (default) 0.038
6 PETPEESE (default) 0.046 6 PETPEESE (default) 0.046
7 AK (AK2) 0.050 7 AK (AK2) 0.046
8 puniform (default) 0.054 8 puniform (default) 0.054
9 AK (AK1) 0.062 9 AK (AK1) 0.062
10 PEESE (default) 0.075 10 PEESE (default) 0.075
11 WAAPWLS (default) 0.101 11 RoBMA (PSMA) -0.094
12 RoBMA (PSMA) -0.103 12 WAAPWLS (default) 0.101
13 trimfill (default) 0.136 13 SM (3PSM) -0.130
14 SM (3PSM) -0.148 14 trimfill (default) 0.136
15 FMA (default) 0.159 15 FMA (default) 0.159
15 WLS (default) 0.159 15 WLS (default) 0.159
17 RMA (default) 0.237 17 RMA (default) 0.237
18 MAIVE (WAIVE) -0.250 18 MAIVE (WAIVE) -0.250
19 SM (4PSM) -0.291 19 SM (4PSM) -0.299
20 mean (default) 0.353 20 mean (default) 0.353
21 puniform (star) -67.517 21 puniform (star) -67.517

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 RMA (default) 0.038 1 RMA (default) 0.038
2 AK (AK1) 0.041 2 AK (AK1) 0.042
3 trimfill (default) 0.042 3 trimfill (default) 0.042
4 mean (default) 0.048 4 mean (default) 0.048
5 pcurve (default) 0.055 5 pcurve (default) 0.055
6 WLS (default) 0.057 6 WLS (default) 0.057
7 FMA (default) 0.057 7 FMA (default) 0.057
8 WAAPWLS (default) 0.067 8 WAAPWLS (default) 0.067
9 PEESE (default) 0.067 9 PEESE (default) 0.067
10 puniform (default) 0.070 10 puniform (default) 0.070
11 AK (AK2) 0.079 11 AK (AK2) 0.074
12 RoBMA (PSMA) 0.090 12 PETPEESE (default) 0.091
13 PETPEESE (default) 0.091 13 WILS (default) 0.097
14 WILS (default) 0.097 14 PET (default) 0.105
15 PET (default) 0.105 15 EK (default) 0.105
16 EK (default) 0.105 16 RoBMA (PSMA) 0.109
17 MAIVE (default) 0.121 17 MAIVE (default) 0.121
18 MAIVE (WAIVE) 0.215 18 MAIVE (WAIVE) 0.215
19 SM (3PSM) 0.436 19 SM (3PSM) 0.414
20 SM (4PSM) 0.618 20 SM (4PSM) 0.661
21 puniform (star) 305.910 21 puniform (star) 305.910

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 puniform (star) 1.717 1 AK (AK2) 1.346
2 AK (AK1) 1.951 2 puniform (star) 1.717
3 EK (default) 2.025 3 AK (AK1) 1.953
4 RoBMA (PSMA) 2.053 4 EK (default) 2.025
5 WILS (default) 2.094 5 WILS (default) 2.094
6 PET (default) 2.098 6 PET (default) 2.098
7 SM (3PSM) 2.262 7 RoBMA (PSMA) 2.140
8 PETPEESE (default) 2.446 8 SM (3PSM) 2.238
9 WAAPWLS (default) 2.507 9 PETPEESE (default) 2.446
10 puniform (default) 2.546 10 WAAPWLS (default) 2.507
11 PEESE (default) 2.736 11 puniform (default) 2.546
12 MAIVE (default) 2.759 12 PEESE (default) 2.736
13 AK (AK2) 2.956 13 MAIVE (default) 2.759
14 SM (4PSM) 3.798 14 SM (4PSM) 3.770
15 trimfill (default) 3.991 15 trimfill (default) 3.996
16 WLS (default) 4.658 16 WLS (default) 4.658
17 FMA (default) 5.182 17 FMA (default) 5.182
18 MAIVE (WAIVE) 5.601 18 MAIVE (WAIVE) 5.601
19 RMA (default) 7.051 19 RMA (default) 7.051
20 mean (default) 11.423 20 mean (default) 11.423
21 pcurve (default) NaN 21 pcurve (default) NaN

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 SM (3PSM) 0.672 1 AK (AK2) 0.691
2 SM (4PSM) 0.663 2 SM (3PSM) 0.659
3 RoBMA (PSMA) 0.648 3 SM (4PSM) 0.654
4 puniform (star) 0.641 4 RoBMA (PSMA) 0.645
5 AK (AK2) 0.596 5 puniform (star) 0.641
6 puniform (default) 0.510 6 puniform (default) 0.510
7 AK (AK1) 0.505 7 AK (AK1) 0.504
8 MAIVE (default) 0.474 8 MAIVE (default) 0.474
9 WAAPWLS (default) 0.444 9 WAAPWLS (default) 0.444
10 WILS (default) 0.432 10 WILS (default) 0.432
11 PETPEESE (default) 0.409 11 PETPEESE (default) 0.409
12 EK (default) 0.396 12 EK (default) 0.396
13 PET (default) 0.382 13 PET (default) 0.382
14 MAIVE (WAIVE) 0.370 14 MAIVE (WAIVE) 0.370
15 trimfill (default) 0.361 15 trimfill (default) 0.360
16 PEESE (default) 0.329 16 PEESE (default) 0.329
17 WLS (default) 0.218 17 WLS (default) 0.218
18 FMA (default) 0.179 18 FMA (default) 0.179
19 RMA (default) 0.149 19 RMA (default) 0.149
20 mean (default) 0.040 20 mean (default) 0.040
21 pcurve (default) NaN 21 pcurve (default) NaN

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 FMA (default) 0.088 1 FMA (default) 0.088
2 WLS (default) 0.122 2 WLS (default) 0.122
3 PEESE (default) 0.126 3 PEESE (default) 0.126
4 trimfill (default) 0.144 4 trimfill (default) 0.144
5 mean (default) 0.147 5 mean (default) 0.147
6 RMA (default) 0.147 6 RMA (default) 0.147
7 PETPEESE (default) 0.148 7 PETPEESE (default) 0.148
8 AK (AK1) 0.155 8 AK (AK1) 0.155
9 WAAPWLS (default) 0.163 9 WAAPWLS (default) 0.163
10 WILS (default) 0.186 10 WILS (default) 0.186
11 PET (default) 0.197 11 PET (default) 0.197
12 EK (default) 0.213 12 EK (default) 0.213
13 puniform (default) 0.270 13 AK (AK2) 0.252
14 MAIVE (default) 0.296 14 puniform (default) 0.270
15 RoBMA (PSMA) 0.337 15 MAIVE (default) 0.296
16 puniform (star) 0.422 16 RoBMA (PSMA) 0.333
17 MAIVE (WAIVE) 0.524 17 puniform (star) 0.422
18 AK (AK2) 0.725 18 MAIVE (WAIVE) 0.524
19 SM (3PSM) 1.035 19 SM (3PSM) 0.907
20 SM (4PSM) 1.711 20 SM (4PSM) 1.606
21 pcurve (default) NaN 21 pcurve (default) NaN

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Log Value Rank Method Log Value
1 puniform (default) 1.992 1 puniform (default) 1.992
2 RoBMA (PSMA) 1.581 2 RoBMA (PSMA) 1.344
3 MAIVE (default) 0.880 3 MAIVE (default) 0.880
4 PET (default) 0.804 4 PET (default) 0.804
5 EK (default) 0.802 5 EK (default) 0.802
6 SM (3PSM) 0.781 6 PETPEESE (default) 0.739
7 PETPEESE (default) 0.739 7 SM (3PSM) 0.737
8 AK (AK2) 0.451 8 WILS (default) 0.435
9 WILS (default) 0.435 9 puniform (star) 0.400
10 puniform (star) 0.400 10 AK (AK2) 0.241
11 WAAPWLS (default) 0.236 11 WAAPWLS (default) 0.236
12 SM (4PSM) 0.153 12 SM (4PSM) 0.157
13 PEESE (default) 0.093 13 PEESE (default) 0.093
14 AK (AK1) 0.071 14 AK (AK1) 0.071
15 trimfill (default) 0.035 15 trimfill (default) 0.035
16 WLS (default) 0.034 16 WLS (default) 0.034
17 RMA (default) 0.013 17 RMA (default) 0.013
18 FMA (default) 0.007 18 FMA (default) 0.007
19 mean (default) 0.000 19 mean (default) 0.000
20 MAIVE (WAIVE) -0.094 20 MAIVE (WAIVE) -0.094
21 pcurve (default) NaN 21 pcurve (default) NaN

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Log Value Rank Method Log Value
1 PETPEESE (default) -4.432 1 PETPEESE (default) -4.432
2 PET (default) -4.420 2 PET (default) -4.420
3 EK (default) -4.420 3 EK (default) -4.420
4 WAAPWLS (default) -3.509 4 WAAPWLS (default) -3.509
5 puniform (default) -3.426 5 puniform (default) -3.426
6 SM (3PSM) -2.241 6 AK (AK2) -3.133
7 MAIVE (default) -2.230 7 SM (3PSM) -2.265
8 PEESE (default) -2.108 8 MAIVE (default) -2.230
9 AK (AK2) -1.792 9 PEESE (default) -2.108
10 RoBMA (PSMA) -1.576 10 puniform (star) -1.566
11 puniform (star) -1.566 11 RoBMA (PSMA) -1.560
12 WLS (default) -1.374 12 WLS (default) -1.374
13 WILS (default) -1.274 13 WILS (default) -1.274
14 trimfill (default) -0.981 14 trimfill (default) -0.982
15 AK (AK1) -0.972 15 AK (AK1) -0.973
16 FMA (default) -0.821 16 SM (4PSM) -0.872
17 SM (4PSM) -0.775 17 FMA (default) -0.821
18 RMA (default) -0.613 18 RMA (default) -0.613
19 mean (default) -0.213 19 mean (default) -0.213
20 MAIVE (WAIVE) 0.267 20 MAIVE (WAIVE) 0.267
21 pcurve (default) NaN 21 pcurve (default) NaN

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 MAIVE (WAIVE) 0.361 1 MAIVE (WAIVE) 0.361
2 RoBMA (PSMA) 0.369 2 RoBMA (PSMA) 0.374
3 MAIVE (default) 0.399 3 MAIVE (default) 0.399
4 SM (3PSM) 0.486 4 puniform (star) 0.492
5 puniform (star) 0.492 5 SM (3PSM) 0.497
6 puniform (default) 0.511 6 puniform (default) 0.511
7 PET (default) 0.518 7 PET (default) 0.518
8 EK (default) 0.518 8 EK (default) 0.518
9 PETPEESE (default) 0.549 9 PETPEESE (default) 0.549
10 WILS (default) 0.569 10 WILS (default) 0.569
11 SM (4PSM) 0.593 11 SM (4PSM) 0.599
12 AK (AK2) 0.637 12 WAAPWLS (default) 0.783
13 WAAPWLS (default) 0.783 13 AK (AK2) 0.860
14 PEESE (default) 0.916 14 PEESE (default) 0.916
15 AK (AK1) 0.945 15 AK (AK1) 0.945
16 trimfill (default) 0.968 16 WLS (default) 0.968
17 WLS (default) 0.968 17 trimfill (default) 0.968
18 RMA (default) 0.988 18 RMA (default) 0.988
19 FMA (default) 0.992 19 FMA (default) 0.992
20 mean (default) 0.999 20 mean (default) 0.999
21 pcurve (default) NaN 21 pcurve (default) NaN

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 mean (default) 1.000 1 mean (default) 1.000
2 RMA (default) 1.000 2 RMA (default) 1.000
3 FMA (default) 1.000 3 FMA (default) 1.000
4 trimfill (default) 0.998 4 trimfill (default) 0.998
5 WLS (default) 0.998 5 WLS (default) 0.998
6 AK (AK1) 0.998 6 AK (AK1) 0.998
7 PEESE (default) 0.994 7 AK (AK2) 0.994
8 WAAPWLS (default) 0.969 8 PEESE (default) 0.994
9 AK (AK2) 0.959 9 WAAPWLS (default) 0.969
10 PETPEESE (default) 0.939 10 PETPEESE (default) 0.939
11 puniform (default) 0.930 11 puniform (default) 0.930
12 EK (default) 0.925 12 EK (default) 0.925
13 PET (default) 0.925 13 PET (default) 0.925
14 WILS (default) 0.801 14 WILS (default) 0.801
15 SM (3PSM) 0.777 15 SM (3PSM) 0.789
16 MAIVE (default) 0.754 16 MAIVE (default) 0.754
17 SM (4PSM) 0.689 17 SM (4PSM) 0.704
18 puniform (star) 0.671 18 puniform (star) 0.671
19 RoBMA (PSMA) 0.624 19 RoBMA (PSMA) 0.629
20 MAIVE (WAIVE) 0.270 20 MAIVE (WAIVE) 0.270
21 pcurve (default) NaN 21 pcurve (default) NaN

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

By-Condition Performance (Conditional on Method Convergence)

The results below are conditional on method convergence. Note that the methods might differ in convergence rate and are therefore not compared on the same data sets.

Raincloud plot showing convergence rates across different methods

Raincloud plot showing RMSE (Root Mean Square Error) across different methods

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method. Values larger than 0.5 are visualized as 0.5.

Raincloud plot showing bias across different methods

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0. Values lower than -0.5 or larger than 0.5 are visualized as -0.5 and 0.5 respectively.

Raincloud plot showing empirical SE across different methods

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance. Values larger than 0.5 are visualized as 0.5.

Raincloud plot showing interval scores across different methods

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method. Values larger than 100 are visualized as 100.

Raincloud plot showing 95% confidence interval coverage across different methods

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Raincloud plot showing 95% confidence interval width across different methods

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method.

Raincloud plot showing positive likelihood ratio across different methods

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Raincloud plot showing negative likelihood ratio across different methods

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.

Raincloud plot showing Type I Error rates across different methods

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Raincloud plot showing statistical power across different methods

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

By-Condition Performance (Replacement in Case of Non-Convergence)

The results below incorporate method replacement to handle non-convergence. If a method fails to converge, its results are replaced with the results from a simpler method (e.g., random-effects meta-analysis without publication bias adjustment). This emulates what a data analyst may do in practice in case a method does not converge. However, note that these results do not correspond to “pure” method performance as they might combine multiple different methods. See Method Replacement Strategy for details of the method replacement specification.
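The replacement idea can be sketched in a few lines. The data structure and field names here are hypothetical; the benchmark's actual specification is documented under Method Replacement Strategy:

```python
# Minimal sketch of the replacement idea (hypothetical structure,
# not the benchmark's actual implementation).
def with_replacement(primary, fallback):
    """Use the primary method's result if it converged,
    otherwise fall back to a simpler method's result."""
    return primary["estimate"] if primary["converged"] else fallback["estimate"]

robma = {"estimate": 0.21, "converged": False}  # adjusted method failed
rma = {"estimate": 0.35, "converged": True}     # simple random-effects fit

print(with_replacement(robma, rma))  # 0.35: the fallback estimate is used
```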

Raincloud plot showing convergence rates across different methods

Raincloud plot showing RMSE (Root Mean Square Error) across different methods

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method. Values larger than 0.5 are visualized as 0.5.

Raincloud plot showing bias across different methods

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0. Values lower than -0.5 or larger than 0.5 are visualized as -0.5 and 0.5 respectively.

Raincloud plot showing empirical SE across different methods

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance. Values larger than 0.5 are visualized as 0.5.

Raincloud plot showing interval scores across different methods

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method. Values larger than 100 are visualized as 100.

Raincloud plot showing 95% confidence interval coverage across different methods

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Raincloud plot showing 95% confidence interval width across different methods

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method.

Raincloud plot showing positive likelihood ratio across different methods

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Raincloud plot showing negative likelihood ratio across different methods

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.

Raincloud plot showing Type I Error rates across different methods

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Raincloud plot showing statistical power across different methods

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

Session Info

This report was compiled on Mon Mar 16 19:09:28 2026 (UTC) using the following computational environment:

## R version 4.5.3 (2026-03-11)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## time zone: UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] scales_1.4.0                   ggdist_3.3.3                  
## [3] ggplot2_4.0.2                  PublicationBiasBenchmark_0.2.0
## 
## loaded via a namespace (and not attached):
##  [1] generics_0.1.4       sandwich_3.1-1       sass_0.4.10         
##  [4] xml2_1.5.2           stringi_1.8.7        lattice_0.22-9      
##  [7] httpcode_0.3.0       digest_0.6.39        magrittr_2.0.4      
## [10] evaluate_1.0.5       grid_4.5.3           RColorBrewer_1.1-3  
## [13] fastmap_1.2.0        jsonlite_2.0.0       crul_1.6.0          
## [16] urltools_1.7.3.1     httr_1.4.8           purrr_1.2.1         
## [19] viridisLite_0.4.3    textshaping_1.0.5    jquerylib_0.1.4     
## [22] Rdpack_2.6.6         cli_3.6.5            rlang_1.1.7         
## [25] triebeard_0.4.1      rbibutils_2.4.1      withr_3.0.2         
## [28] cachem_1.1.0         yaml_2.3.12          otel_0.2.0          
## [31] tools_4.5.3          memoise_2.0.1        kableExtra_1.4.0    
## [34] curl_7.0.0           vctrs_0.7.1          R6_2.6.1            
## [37] clubSandwich_0.6.2   zoo_1.8-15           lifecycle_1.0.5     
## [40] stringr_1.6.0        fs_1.6.7             htmlwidgets_1.6.4   
## [43] ragg_1.5.1           pkgconfig_2.0.3      desc_1.4.3          
## [46] osfr_0.2.9           pkgdown_2.2.0        bslib_0.10.0        
## [49] pillar_1.11.1        gtable_0.3.6         Rcpp_1.1.1          
## [52] glue_1.8.0           systemfonts_1.3.2    xfun_0.56           
## [55] tibble_3.3.1         rstudioapi_0.18.0    knitr_1.51          
## [58] farver_2.1.2         htmltools_0.5.9      labeling_0.4.3      
## [61] svglite_2.2.2        rmarkdown_2.30       compiler_4.5.3      
## [64] S7_0.2.1             distributional_0.6.0