
Complete Results

These results are based on the Stanley (2017), Alinaghi (2018), Bom (2019), and Carter (2019) data-generating mechanisms, with a total of 1665 conditions.

Average Performance

Method performance measures are aggregated across all simulated conditions to provide an overall impression of method performance. Keep in mind, however, that a method with a high overall ranking is not necessarily the “best” method for a particular application. To select a suitable method for your application, also consider the non-aggregated performance measures in the conditions most relevant to your application.
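
The “Mean Rank” columns in the tables below come from ranking the methods within each condition and then averaging those ranks across conditions. A minimal sketch of that aggregation, with invented values for three hypothetical methods:

```python
import numpy as np

# Hypothetical RMSE values for three methods (columns) in four
# simulated conditions (rows); lower is better. Values are invented.
rmse = np.array([
    [0.10, 0.12, 0.30],
    [0.20, 0.15, 0.25],
    [0.05, 0.06, 0.04],
    [0.40, 0.35, 0.50],
])

# Rank the methods within each condition (1 = best), then average the
# ranks across conditions; this is what a "Mean Rank" column reports.
# (Ties are broken arbitrarily here; a real analysis would typically
# use fractional ranks.)
ranks = rmse.argsort(axis=1).argsort(axis=1) + 1
mean_rank = ranks.mean(axis=0)   # one value per method
```

Ranking within conditions first makes methods comparable even when the outcome scale differs across data-generating mechanisms.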

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Mean Rank Rank Method Mean Rank
1 AK (AK1) 6.632 1 AK (AK1) 6.678
2 RoBMA (PSMA) 7.240 2 RoBMA (PSMA) 7.416
3 WAAPWLS (default) 7.920 3 WAAPWLS (default) 7.933
4 FMA (default) 8.452 4 FMA (default) 8.465
5 WLS (default) 8.459 5 WLS (default) 8.472
6 trimfill (default) 8.968 6 trimfill (default) 8.986
7 PEESE (default) 9.306 7 PEESE (default) 9.332
8 SM (3PSM) 9.360 8 SM (3PSM) 9.388
9 PETPEESE (default) 9.865 9 PETPEESE (default) 9.888
10 WILS (default) 10.067 10 WILS (default) 10.079
11 puniform (star) 10.318 11 puniform (star) 10.335
12 RMA (default) 11.079 12 RMA (default) 11.096
13 EK (default) 11.960 13 AK (AK2) 11.421
14 AK (AK2) 12.008 14 EK (default) 11.983
15 PET (default) 12.095 15 PET (default) 12.120
16 SM (4PSM) 12.129 16 SM (4PSM) 12.169
17 pcurve (default) 12.500 17 pcurve (default) 12.516
18 MAIVE (default) 13.745 18 MAIVE (default) 13.772
19 puniform (default) 14.079 19 puniform (default) 14.097
20 mean (default) 15.187 20 mean (default) 15.235
21 MAIVE (WAIVE) 17.279 21 MAIVE (WAIVE) 17.268

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average RMSE is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of RMSE values on the corresponding outcome scale.
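
As a sketch, RMSE can be computed from simulated estimates like this; the true effect, bias, and error distribution are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
true_effect = 0.3

# Hypothetical meta-analytic estimates from 1,000 simulation runs of a
# method that is slightly biased (+0.05) and noisy (SD 0.1).
estimates = true_effect + rng.normal(0.05, 0.1, size=1_000)

# RMSE: square root of the mean squared deviation from the true effect.
rmse = np.sqrt(np.mean((estimates - true_effect) ** 2))
```

Here the RMSE should land near sqrt(0.05² + 0.1²) ≈ 0.112, reflecting both the bias and the sampling variability.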

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Mean Rank Rank Method Mean Rank
1 WAAPWLS (default) 8.069 1 WAAPWLS (default) 8.167
2 PETPEESE (default) 8.398 2 PETPEESE (default) 8.455
3 AK (AK1) 8.438 3 AK (AK1) 8.510
4 PEESE (default) 8.783 4 PEESE (default) 8.844
5 SM (3PSM) 8.987 5 SM (3PSM) 8.968
6 RoBMA (PSMA) 9.219 6 RoBMA (PSMA) 9.184
7 EK (default) 9.688 7 EK (default) 9.747
8 PET (default) 9.771 8 PET (default) 9.829
9 puniform (star) 9.801 9 puniform (star) 9.865
10 WLS (default) 9.990 10 WLS (default) 10.082
11 FMA (default) 9.994 11 FMA (default) 10.086
12 SM (4PSM) 10.354 12 SM (4PSM) 10.407
13 WILS (default) 11.002 13 AK (AK2) 10.828
14 trimfill (default) 11.316 14 WILS (default) 11.050
15 MAIVE (default) 11.725 15 trimfill (default) 11.390
16 AK (AK2) 11.852 16 MAIVE (default) 11.768
17 RMA (default) 13.156 17 RMA (default) 13.223
18 puniform (default) 13.771 18 puniform (default) 13.813
19 pcurve (default) 13.902 19 pcurve (default) 13.933
20 MAIVE (WAIVE) 14.345 20 MAIVE (WAIVE) 14.342
21 mean (default) 16.270 21 mean (default) 16.340

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0. Methods are compared using condition-wise ranks. Direct comparison using the average bias is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of bias values on the corresponding outcome scale.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Mean Rank Rank Method Mean Rank
1 RMA (default) 3.868 1 RMA (default) 3.726
2 AK (AK1) 5.204 2 AK (AK1) 5.356
3 WLS (default) 5.484 3 WLS (default) 5.376
4 FMA (default) 5.489 4 FMA (default) 5.381
5 trimfill (default) 6.962 5 trimfill (default) 6.914
6 mean (default) 7.911 6 mean (default) 7.850
7 pcurve (default) 8.172 7 pcurve (default) 8.132
8 RoBMA (PSMA) 8.721 8 WAAPWLS (default) 8.713
9 WAAPWLS (default) 8.801 9 RoBMA (PSMA) 9.406
10 PEESE (default) 10.856 10 PEESE (default) 10.828
11 SM (3PSM) 11.297 11 SM (3PSM) 11.257
12 puniform (default) 11.446 12 puniform (default) 11.413
13 WILS (default) 12.605 13 WILS (default) 12.551
14 puniform (star) 12.814 14 puniform (star) 12.763
15 PETPEESE (default) 13.629 15 PETPEESE (default) 13.612
16 AK (AK2) 13.950 16 AK (AK2) 14.072
17 SM (4PSM) 14.705 17 SM (4PSM) 14.652
18 EK (default) 15.742 18 EK (default) 15.708
19 PET (default) 15.822 19 PET (default) 15.790
20 MAIVE (default) 15.831 20 MAIVE (default) 15.832
21 MAIVE (WAIVE) 19.424 21 MAIVE (WAIVE) 19.402

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance. Methods are compared using condition-wise ranks. Direct comparison using the average empirical standard error is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of empirical standard error values on the corresponding outcome scale.
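
Bias, empirical SE, and RMSE fit together via the identity RMSE² = bias² + empirical SE² (with the population-SD convention). A small sketch with invented numbers:

```python
import numpy as np

rng = np.random.default_rng(1)
true_effect = 0.3

# Hypothetical estimates from many runs: bias +0.05, sampling SD 0.1.
estimates = true_effect + rng.normal(0.05, 0.1, size=100_000)

bias = np.mean(estimates) - true_effect   # average deviation from truth
emp_se = np.std(estimates)                # spread across runs (ddof=0)
rmse = np.sqrt(np.mean((estimates - true_effect) ** 2))
```

With `np.std`'s default `ddof=0` the decomposition is exact; with the sample-SD convention it holds up to a factor of n/(n-1).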

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Mean Rank Rank Method Mean Rank
1 RoBMA (PSMA) 5.772 1 RoBMA (PSMA) 5.941
2 AK (AK1) 6.519 2 AK (AK1) 6.529
3 SM (3PSM) 7.151 3 SM (3PSM) 7.250
4 puniform (star) 7.521 4 puniform (star) 7.615
5 WAAPWLS (default) 8.259 5 WAAPWLS (default) 8.317
6 trimfill (default) 9.195 6 SM (4PSM) 9.137
7 SM (4PSM) 9.358 7 trimfill (default) 9.233
8 PEESE (default) 10.011 8 PEESE (default) 10.045
9 WLS (default) 10.113 9 AK (AK2) 10.147
10 PETPEESE (default) 10.392 10 WLS (default) 10.168
11 AK (AK2) 10.598 11 PETPEESE (default) 10.403
12 RMA (default) 10.993 12 RMA (default) 11.036
13 EK (default) 11.112 13 EK (default) 11.118
14 MAIVE (default) 11.702 14 MAIVE (default) 11.700
15 PET (default) 12.071 15 PET (default) 12.073
16 WILS (default) 12.321 16 WILS (default) 12.312
17 FMA (default) 12.841 17 FMA (default) 12.874
18 puniform (default) 12.884 18 puniform (default) 12.903
19 MAIVE (WAIVE) 14.234 19 MAIVE (WAIVE) 14.204
20 mean (default) 16.499 20 mean (default) 16.541
21 pcurve (default) 20.977 21 pcurve (default) 20.977

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average interval score is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of interval score values on the corresponding outcome scale.
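
The exact scoring rule is not restated on this page; a common choice with exactly these properties is the Gneiting–Raftery interval score, sketched below for a central 95% interval (the formula is an assumption, not taken from the study):

```python
def interval_score(lower, upper, truth, alpha=0.05):
    """Interval score of a central (1 - alpha) confidence interval.

    Width plus a 2/alpha-per-unit penalty for missing the true value;
    lower scores are better. This is the Gneiting-Raftery (2007)
    definition, assumed here for illustration.
    """
    score = upper - lower
    if truth < lower:
        score += (2 / alpha) * (lower - truth)
    elif truth > upper:
        score += (2 / alpha) * (truth - upper)
    return score

# An interval that covers the truth pays only its width ...
covering = interval_score(0.1, 0.5, truth=0.3)   # 0.4
# ... while missing the truth by 0.1 adds (2 / 0.05) * 0.1 = 4.0.
missing = interval_score(0.1, 0.5, truth=0.6)    # 4.4
```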

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 RoBMA (PSMA) 0.800 1 RoBMA (PSMA) 0.798
2 AK (AK2) 0.795 2 AK (AK2) 0.769
3 SM (4PSM) 0.765 3 SM (4PSM) 0.760
4 puniform (star) 0.733 4 puniform (star) 0.733
5 SM (3PSM) 0.733 5 SM (3PSM) 0.728
6 MAIVE (default) 0.695 6 MAIVE (default) 0.695
7 MAIVE (WAIVE) 0.647 7 MAIVE (WAIVE) 0.647
8 EK (default) 0.641 8 EK (default) 0.641
9 PETPEESE (default) 0.629 9 PETPEESE (default) 0.629
10 PET (default) 0.620 10 PET (default) 0.620
11 AK (AK1) 0.609 11 AK (AK1) 0.609
12 WAAPWLS (default) 0.582 12 WAAPWLS (default) 0.582
13 trimfill (default) 0.544 13 trimfill (default) 0.543
14 PEESE (default) 0.526 14 PEESE (default) 0.526
15 WILS (default) 0.504 15 WILS (default) 0.504
16 puniform (default) 0.484 16 puniform (default) 0.484
17 WLS (default) 0.464 17 WLS (default) 0.464
18 RMA (default) 0.457 18 RMA (default) 0.457
19 FMA (default) 0.342 19 FMA (default) 0.342
20 mean (default) 0.260 20 mean (default) 0.260
21 pcurve (default) NaN 21 pcurve (default) NaN

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Mean Rank Rank Method Mean Rank
1 FMA (default) 2.381 1 FMA (default) 2.342
2 WLS (default) 3.884 2 WLS (default) 3.842
3 WILS (default) 4.932 3 WILS (default) 4.912
4 PEESE (default) 7.142 4 PEESE (default) 7.129
5 mean (default) 7.180 5 mean (default) 7.178
6 WAAPWLS (default) 7.534 6 WAAPWLS (default) 7.492
7 RMA (default) 7.733 7 RMA (default) 7.689
8 trimfill (default) 7.775 8 trimfill (default) 7.698
9 AK (AK1) 8.858 9 AK (AK1) 8.859
10 PETPEESE (default) 9.278 10 PETPEESE (default) 9.318
11 RoBMA (PSMA) 10.820 11 RoBMA (PSMA) 10.892
12 puniform (default) 11.416 12 puniform (default) 11.426
13 SM (3PSM) 12.494 13 SM (3PSM) 12.565
14 PET (default) 13.507 14 PET (default) 13.601
15 puniform (star) 13.659 15 puniform (star) 13.826
16 EK (default) 14.740 16 AK (AK2) 14.833
17 AK (AK2) 15.248 17 EK (default) 14.843
18 SM (4PSM) 15.479 18 SM (4PSM) 15.454
19 MAIVE (default) 16.941 19 MAIVE (default) 17.033
20 MAIVE (WAIVE) 18.543 20 MAIVE (WAIVE) 18.611
21 pcurve (default) 20.977 21 pcurve (default) 20.977

95% CI width is the average length of the 95% confidence interval for the true effect. A narrower average 95% CI indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average CI width is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of CI width values on the corresponding outcome scale.
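
Both CI measures come from the same simulated intervals: coverage is the fraction of intervals containing the truth, width is the average upper-minus-lower distance. A sketch with idealized normal intervals (all numbers invented):

```python
import numpy as np

rng = np.random.default_rng(2)
true_effect = 0.3
n_runs = 10_000

# Hypothetical 95% CIs from repeated runs: a normally distributed
# estimate plus/minus 1.96 standard errors.
estimates = rng.normal(true_effect, 0.1, size=n_runs)
se = np.full(n_runs, 0.1)
lower = estimates - 1.96 * se
upper = estimates + 1.96 * se

# Coverage: fraction of intervals containing the truth (ideally ~0.95).
coverage = np.mean((lower <= true_effect) & (true_effect <= upper))
# Width: average length of the intervals (2 * 1.96 * 0.1 = 0.392 here).
mean_width = np.mean(upper - lower)
```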

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Log Value Rank Method Log Value
1 RoBMA (PSMA) 3.703 1 RoBMA (PSMA) 3.577
2 AK (AK2) 2.124 2 AK (AK2) 1.735
3 MAIVE (default) 1.579 3 MAIVE (default) 1.579
4 PETPEESE (default) 1.532 4 PETPEESE (default) 1.532
5 PET (default) 1.515 5 PET (default) 1.515
6 EK (default) 1.515 6 EK (default) 1.515
7 puniform (default) 1.515 7 puniform (default) 1.501
8 puniform (star) 1.325 8 puniform (star) 1.325
9 SM (3PSM) 1.325 9 SM (3PSM) 1.321
10 AK (AK1) 1.215 10 AK (AK1) 1.205
11 SM (4PSM) 1.156 11 SM (4PSM) 1.185
12 MAIVE (WAIVE) 1.111 12 MAIVE (WAIVE) 1.111
13 RMA (default) 0.998 13 RMA (default) 0.998
14 WAAPWLS (default) 0.945 14 WAAPWLS (default) 0.945
15 trimfill (default) 0.922 15 trimfill (default) 0.922
16 WILS (default) 0.871 16 WILS (default) 0.871
17 PEESE (default) 0.843 17 PEESE (default) 0.843
18 WLS (default) 0.790 18 WLS (default) 0.790
19 FMA (default) 0.503 19 FMA (default) 0.503
20 mean (default) 0.487 20 mean (default) 0.487
21 pcurve (default) NaN 21 pcurve (default) NaN

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Log Value Rank Method Log Value
1 PETPEESE (default) -4.626 1 AK (AK2) -4.661
2 EK (default) -4.496 2 PETPEESE (default) -4.626
3 PET (default) -4.496 3 EK (default) -4.496
4 WAAPWLS (default) -4.042 4 PET (default) -4.496
5 PEESE (default) -3.890 5 WAAPWLS (default) -4.042
6 SM (3PSM) -3.560 6 PEESE (default) -3.890
7 WLS (default) -3.450 7 SM (3PSM) -3.593
8 trimfill (default) -3.445 8 WLS (default) -3.450
9 MAIVE (default) -3.394 9 trimfill (default) -3.446
10 puniform (default) -3.374 10 MAIVE (default) -3.394
11 RoBMA (PSMA) -3.331 11 puniform (default) -3.376
12 AK (AK1) -3.277 12 RoBMA (PSMA) -3.332
13 puniform (star) -3.208 13 AK (AK1) -3.281
14 AK (AK2) -3.158 14 puniform (star) -3.208
15 RMA (default) -3.121 15 RMA (default) -3.121
16 FMA (default) -3.058 16 FMA (default) -3.058
17 WILS (default) -3.037 17 WILS (default) -3.037
18 SM (4PSM) -2.636 18 SM (4PSM) -2.873
19 mean (default) -2.503 19 mean (default) -2.503
20 MAIVE (WAIVE) -1.547 20 MAIVE (WAIVE) -1.547
21 pcurve (default) NaN 21 pcurve (default) NaN

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.
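
Both likelihood ratios follow directly from a method's rejection rates: power under the alternative and the type I error rate under the null. A sketch with textbook values (0.80 power, 0.05 type I error; not taken from the tables above):

```python
import math

# Hypothetical rejection rates for one method.
power = 0.80         # rejection rate when the alternative is true
type1_error = 0.05   # rejection rate when the null is true

# Positive LR: odds multiplier for H1 after a significant result.
plr = power / type1_error              # 16.0
# Negative LR: odds multiplier for H1 after a non-significant result.
nlr = (1 - power) / (1 - type1_error)  # ~0.21

log_plr = math.log(plr)  # > 0 for a useful test
log_nlr = math.log(nlr)  # < 0 for a useful test
```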

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 RoBMA (PSMA) 0.102 1 RoBMA (PSMA) 0.106
2 AK (AK2) 0.125 2 MAIVE (WAIVE) 0.176
3 MAIVE (WAIVE) 0.176 3 AK (AK2) 0.237
4 SM (4PSM) 0.245 4 SM (4PSM) 0.248
5 PET (default) 0.257 5 PET (default) 0.257
6 EK (default) 0.257 6 EK (default) 0.257
7 MAIVE (default) 0.264 7 MAIVE (default) 0.264
8 PETPEESE (default) 0.270 8 PETPEESE (default) 0.270
9 SM (3PSM) 0.277 9 SM (3PSM) 0.280
10 puniform (star) 0.293 10 puniform (star) 0.293
11 WILS (default) 0.391 11 WILS (default) 0.391
12 WAAPWLS (default) 0.523 12 WAAPWLS (default) 0.523
13 PEESE (default) 0.546 13 PEESE (default) 0.546
14 AK (AK1) 0.581 14 AK (AK1) 0.581
15 trimfill (default) 0.586 15 trimfill (default) 0.586
16 puniform (default) 0.608 16 puniform (default) 0.607
17 WLS (default) 0.621 17 WLS (default) 0.621
18 RMA (default) 0.622 18 RMA (default) 0.622
19 FMA (default) 0.772 19 FMA (default) 0.772
20 mean (default) 0.779 20 mean (default) 0.779
21 pcurve (default) NaN 21 pcurve (default) NaN

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 mean (default) 0.990 1 mean (default) 0.990
2 FMA (default) 0.989 2 FMA (default) 0.989
3 WLS (default) 0.976 3 WLS (default) 0.976
4 RMA (default) 0.974 4 RMA (default) 0.974
5 AK (AK1) 0.969 5 AK (AK1) 0.969
6 trimfill (default) 0.965 6 trimfill (default) 0.965
7 PEESE (default) 0.953 7 PEESE (default) 0.953
8 puniform (default) 0.939 8 puniform (default) 0.939
9 WAAPWLS (default) 0.934 9 WAAPWLS (default) 0.934
10 PETPEESE (default) 0.893 10 PETPEESE (default) 0.893
11 EK (default) 0.873 11 AK (AK2) 0.885
12 PET (default) 0.873 12 EK (default) 0.873
13 WILS (default) 0.864 13 PET (default) 0.873
14 SM (3PSM) 0.828 14 WILS (default) 0.864
15 AK (AK2) 0.812 15 SM (3PSM) 0.835
16 puniform (star) 0.808 16 puniform (star) 0.808
17 MAIVE (default) 0.779 17 MAIVE (default) 0.779
18 SM (4PSM) 0.754 18 SM (4PSM) 0.772
19 RoBMA (PSMA) 0.703 19 RoBMA (PSMA) 0.706
20 MAIVE (WAIVE) 0.498 20 MAIVE (WAIVE) 0.498
21 pcurve (default) NaN 21 pcurve (default) NaN

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

By-Condition Performance (Conditional on Method Convergence)

The results below are conditional on method convergence. Note that the methods might differ in convergence rate and are therefore not compared on the same data sets.

Raincloud plot showing convergence rates across different methods

Raincloud plot showing RMSE (Root Mean Square Error) across different methods

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average RMSE is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of RMSE values on the corresponding outcome scale.

Raincloud plot showing bias across different methods

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0. Methods are compared using condition-wise ranks. Direct comparison using the average bias is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of bias values on the corresponding outcome scale.

Raincloud plot showing empirical SE across different methods

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance. Methods are compared using condition-wise ranks. Direct comparison using the empirical standard error is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of empirical standard error values on the corresponding outcome scale.

Raincloud plot showing interval score across different methods

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the interval score is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of interval score values on the corresponding outcome scale.

Raincloud plot showing 95% confidence interval coverage across different methods

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Raincloud plot showing 95% confidence interval width across different methods

95% CI width is the average length of the 95% confidence interval for the true effect. A narrower average 95% CI indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average 95% CI width is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of 95% CI width values on the corresponding outcome scale.

Raincloud plot showing positive likelihood ratio across different methods

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Raincloud plot showing negative likelihood ratio across different methods

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.

Raincloud plot showing Type I Error rates across different methods

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Raincloud plot showing statistical power across different methods

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

By-Condition Performance (Replacement in Case of Non-Convergence)

The results below incorporate method replacement to handle non-convergence: if a method fails to converge, its results are replaced with those from a simpler method (e.g., a random-effects meta-analysis without publication bias adjustment). This emulates what a data analyst might do in practice when a method does not converge. Note, however, that these results do not reflect “pure” method performance, as they may combine results from multiple methods. See Method Replacement Strategy for details of the method replacement specification.
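
The replacement logic can be sketched as a simple try-in-order fallback. The function, the `ConvergenceError` type, and both stand-in methods below are illustrative, not the study's actual replacement chain:

```python
class ConvergenceError(Exception):
    """Raised when an adjustment method fails to converge (illustrative)."""

def estimate_with_fallback(dataset, methods):
    """Try each method in order; return the first converged result.

    `methods` is an ordered sequence of callables ending with a simple
    estimator that is expected to always converge. All names here are
    illustrative stand-ins.
    """
    for fit in methods:
        try:
            return fit(dataset)
        except ConvergenceError:
            continue
    raise RuntimeError("no method converged")

def selection_model(data):   # stand-in for a complex adjustment method
    raise ConvergenceError("did not converge")

def random_effects(data):    # stand-in for the simple fallback estimator
    return sum(data) / len(data)

result = estimate_with_fallback([0.2, 0.4], [selection_model, random_effects])
```

Here the complex method fails, so the fallback's estimate (the plain average, 0.3) is reported in its place.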

Raincloud plot showing convergence rates across different methods

Raincloud plot showing RMSE (Root Mean Square Error) across different methods

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average RMSE is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of RMSE values on the corresponding outcome scale.

Raincloud plot showing bias across different methods

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0. Methods are compared using condition-wise ranks. Direct comparison using the average bias is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of bias values on the corresponding outcome scale.

Raincloud plot showing empirical SE across different methods

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance. Methods are compared using condition-wise ranks. Direct comparison using the empirical standard error is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of empirical standard error values on the corresponding outcome scale.

Raincloud plot showing interval score across different methods

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the interval score is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of interval score values on the corresponding outcome scale.

Raincloud plot showing 95% confidence interval coverage across different methods

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Raincloud plot showing 95% confidence interval width across different methods

95% CI width is the average length of the 95% confidence interval for the true effect. A narrower average 95% CI indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average 95% CI width is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of 95% CI width values on the corresponding outcome scale.

Raincloud plot showing positive likelihood ratio across different methods

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Raincloud plot showing negative likelihood ratio across different methods

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.

Raincloud plot showing Type I Error rates across different methods

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Raincloud plot showing statistical power across different methods

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

Subset: Publication Bias Present

These results are based on the Stanley (2017), Alinaghi (2018), Bom (2019), and Carter (2019) data-generating mechanisms with a total of 1143 conditions in which publication bias is present.

Average Performance

Method performance measures are aggregated across all simulated conditions to provide an overall impression of method performance. Keep in mind, however, that a method with a high overall ranking is not necessarily the “best” method for a particular application. To select a suitable method for your application, also consider the non-aggregated performance measures in the conditions most relevant to your application.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Mean Rank Rank Method Mean Rank
1 AK (AK1) 6.812 1 AK (AK1) 6.802
2 RoBMA (PSMA) 7.892 2 RoBMA (PSMA) 8.115
3 WAAPWLS (default) 8.104 3 WAAPWLS (default) 8.117
4 PEESE (default) 8.926 4 trimfill (default) 8.934
5 trimfill (default) 8.931 5 PEESE (default) 8.955
6 FMA (default) 9.160 6 FMA (default) 9.164
7 WLS (default) 9.164 7 WLS (default) 9.169
8 PETPEESE (default) 9.169 8 PETPEESE (default) 9.183
9 WILS (default) 9.677 9 WILS (default) 9.686
10 SM (3PSM) 10.033 10 SM (3PSM) 10.059
11 puniform (star) 10.527 11 puniform (star) 10.531
12 pcurve (default) 11.071 12 AK (AK2) 11.037
13 EK (default) 11.085 13 pcurve (default) 11.085
14 PET (default) 11.215 14 EK (default) 11.106
15 AK (AK2) 11.555 15 PET (default) 11.240
16 SM (4PSM) 12.421 16 SM (4PSM) 12.460
17 RMA (default) 12.692 17 RMA (default) 12.720
18 puniform (default) 13.221 18 puniform (default) 13.240
19 MAIVE (default) 13.465 19 MAIVE (default) 13.477
20 mean (default) 16.636 20 mean (default) 16.690
21 MAIVE (WAIVE) 16.963 21 MAIVE (WAIVE) 16.951

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average RMSE is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of RMSE values on the corresponding outcome scale.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Mean Rank Rank Method Mean Rank
1 PETPEESE (default) 8.015 1 PETPEESE (default) 8.048
2 WAAPWLS (default) 8.118 2 WAAPWLS (default) 8.210
3 RoBMA (PSMA) 8.571 3 RoBMA (PSMA) 8.487
4 AK (AK1) 8.762 4 AK (AK1) 8.789
5 PEESE (default) 8.876 5 PEESE (default) 8.927
6 SM (3PSM) 9.185 6 SM (3PSM) 9.158
7 EK (default) 9.407 7 EK (default) 9.442
8 WILS (default) 9.454 8 WILS (default) 9.507
9 PET (default) 9.538 9 PET (default) 9.575
10 puniform (star) 9.705 10 puniform (star) 9.780
11 trimfill (default) 9.988 11 trimfill (default) 10.045
12 SM (4PSM) 10.448 12 SM (4PSM) 10.498
13 FMA (default) 10.730 13 AK (AK2) 10.675
14 WLS (default) 10.737 14 FMA (default) 10.818
15 AK (AK2) 11.499 15 WLS (default) 10.825
16 pcurve (default) 12.471 16 pcurve (default) 12.501
17 MAIVE (default) 12.668 17 MAIVE (default) 12.706
18 puniform (default) 12.914 18 puniform (default) 12.974
19 MAIVE (WAIVE) 14.523 19 MAIVE (WAIVE) 14.500
20 RMA (default) 14.961 20 RMA (default) 15.044
21 mean (default) 18.226 21 mean (default) 18.283

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0. Methods are compared using condition-wise ranks. Direct comparison using the average bias is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of bias values on the corresponding outcome scale.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Mean Rank Rank Method Mean Rank
1 RMA (default) 3.972 1 RMA (default) 3.817
2 AK (AK1) 4.770 2 AK (AK1) 4.837
3 WLS (default) 5.483 3 WLS (default) 5.350
4 FMA (default) 5.487 4 FMA (default) 5.354
5 trimfill (default) 6.640 5 trimfill (default) 6.584
6 pcurve (default) 7.114 6 pcurve (default) 7.061
7 mean (default) 7.591 7 mean (default) 7.507
8 WAAPWLS (default) 8.873 8 WAAPWLS (default) 8.765
9 RoBMA (PSMA) 9.836 9 PEESE (default) 10.623
10 PEESE (default) 10.698 10 RoBMA (PSMA) 10.841
11 puniform (default) 11.138 11 puniform (default) 11.090
12 SM (3PSM) 11.991 12 SM (3PSM) 11.907
13 WILS (default) 12.922 13 WILS (default) 12.836
14 puniform (star) 13.302 14 puniform (star) 13.199
15 AK (AK2) 13.785 15 PETPEESE (default) 13.727
16 PETPEESE (default) 13.794 16 AK (AK2) 14.148
17 SM (4PSM) 15.194 17 SM (4PSM) 15.105
18 EK (default) 15.492 18 EK (default) 15.435
19 PET (default) 15.556 19 PET (default) 15.503
20 MAIVE (default) 15.810 20 MAIVE (default) 15.788
21 MAIVE (WAIVE) 19.379 21 MAIVE (WAIVE) 19.348

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance. Methods are compared using condition-wise ranks. Direct comparison using the average empirical standard error is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of empirical standard error values on the corresponding outcome scale.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Mean Rank Rank Method Mean Rank
1 RoBMA (PSMA) 6.260 1 RoBMA (PSMA) 6.462
2 AK (AK1) 6.892 2 AK (AK1) 6.856
3 puniform (star) 7.453 3 puniform (star) 7.530
4 SM (3PSM) 7.522 4 SM (3PSM) 7.628
5 WAAPWLS (default) 8.458 5 WAAPWLS (default) 8.505
6 trimfill (default) 9.039 6 trimfill (default) 9.064
7 SM (4PSM) 9.576 7 SM (4PSM) 9.380
8 PEESE (default) 9.828 8 PETPEESE (default) 9.842
9 PETPEESE (default) 9.855 9 PEESE (default) 9.861
10 AK (AK2) 10.218 10 AK (AK2) 9.883
11 EK (default) 10.228 11 EK (default) 10.229
12 WLS (default) 10.870 12 WLS (default) 10.927
13 PET (default) 11.171 13 PET (default) 11.171
14 MAIVE (default) 11.460 14 MAIVE (default) 11.436
15 WILS (default) 11.735 15 WILS (default) 11.700
16 puniform (default) 11.859 16 puniform (default) 11.871
17 RMA (default) 12.553 17 RMA (default) 12.616
18 FMA (default) 13.078 18 FMA (default) 13.108
19 MAIVE (WAIVE) 13.843 19 MAIVE (WAIVE) 13.798
20 mean (default) 17.689 20 mean (default) 17.719
21 pcurve (default) 20.975 21 pcurve (default) 20.975

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average interval score is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of interval score values on the corresponding outcome scale.
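A minimal sketch of how such a score can be computed, assuming the standard Gneiting–Raftery interval score (whether the simulation uses exactly this variant is an assumption):

```python
def interval_score(lower, upper, truth, alpha=0.05):
    """Interval score for a (1 - alpha) confidence interval:
    the interval's width, plus a penalty of 2/alpha per unit by
    which the true value falls outside the interval. Lower is better."""
    score = upper - lower
    if truth < lower:
        score += (2 / alpha) * (lower - truth)
    elif truth > upper:
        score += (2 / alpha) * (truth - upper)
    return score

# Covered true value: score equals the width.
print(interval_score(0.1, 0.5, truth=0.3))  # 0.4
# Missed true value: width plus a steep miss penalty.
print(interval_score(0.1, 0.5, truth=0.6))  # about 4.4
```

Note how a narrow interval that misses the truth can score far worse than a wide interval that covers it, which is exactly the width-coverage trade-off the measure is designed to capture.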

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 RoBMA (PSMA) 0.759 1 RoBMA (PSMA) 0.756
2 AK (AK2) 0.754 2 SM (4PSM) 0.717
3 SM (4PSM) 0.722 3 AK (AK2) 0.716
4 SM (3PSM) 0.688 4 puniform (star) 0.688
5 puniform (star) 0.688 5 SM (3PSM) 0.682
6 MAIVE (default) 0.632 6 MAIVE (default) 0.632
7 EK (default) 0.611 7 EK (default) 0.611
8 PETPEESE (default) 0.599 8 PETPEESE (default) 0.599
9 MAIVE (WAIVE) 0.588 9 MAIVE (WAIVE) 0.588
10 PET (default) 0.588 10 PET (default) 0.588
11 AK (AK1) 0.526 11 AK (AK1) 0.526
12 WAAPWLS (default) 0.523 12 WAAPWLS (default) 0.523
13 trimfill (default) 0.485 13 trimfill (default) 0.484
14 WILS (default) 0.479 14 WILS (default) 0.479
15 puniform (default) 0.478 15 puniform (default) 0.478
16 PEESE (default) 0.467 16 PEESE (default) 0.467
17 WLS (default) 0.393 17 WLS (default) 0.393
18 RMA (default) 0.358 18 RMA (default) 0.358
19 FMA (default) 0.288 19 FMA (default) 0.288
20 mean (default) 0.148 20 mean (default) 0.148
21 pcurve (default) NaN 21 pcurve (default) NaN

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.
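As an illustration (a hypothetical sketch, not the simulation code), coverage is a simple proportion across runs:

```python
import numpy as np

def ci_coverage(lowers, uppers, truth):
    """Proportion of simulation runs whose confidence interval
    contains the true effect."""
    lowers = np.asarray(lowers, dtype=float)
    uppers = np.asarray(uppers, dtype=float)
    return float(np.mean((lowers <= truth) & (truth <= uppers)))

# Three hypothetical runs; two of the intervals contain truth = 0.3:
print(ci_coverage([0.1, 0.2, 0.4], [0.5, 0.3, 0.6], truth=0.3))
```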

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Mean Rank Rank Method Mean Rank
1 FMA (default) 2.363 1 FMA (default) 2.314
2 WLS (default) 3.983 2 WLS (default) 3.936
3 WILS (default) 5.682 3 WILS (default) 5.666
4 mean (default) 6.830 4 mean (default) 6.827
5 PEESE (default) 6.983 5 PEESE (default) 6.973
6 WAAPWLS (default) 7.619 6 WAAPWLS (default) 7.577
7 trimfill (default) 7.803 7 trimfill (default) 7.736
8 RMA (default) 7.847 8 RMA (default) 7.811
9 AK (AK1) 8.273 9 AK (AK1) 8.274
10 PETPEESE (default) 9.147 10 PETPEESE (default) 9.185
11 puniform (default) 11.190 11 puniform (default) 11.220
12 RoBMA (PSMA) 11.511 12 RoBMA (PSMA) 11.596
13 SM (3PSM) 12.882 13 SM (3PSM) 12.978
14 PET (default) 13.290 14 PET (default) 13.388
15 puniform (star) 13.714 15 puniform (star) 13.885
16 EK (default) 14.536 16 AK (AK2) 14.575
17 AK (AK2) 15.143 17 EK (default) 14.639
18 SM (4PSM) 15.780 18 SM (4PSM) 15.808
19 MAIVE (default) 16.612 19 MAIVE (default) 16.717
20 MAIVE (WAIVE) 18.395 20 MAIVE (WAIVE) 18.480
21 pcurve (default) 20.975 21 pcurve (default) 20.975

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average CI width is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of CI width values on the corresponding outcome scale.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Log Value Rank Method Log Value
1 RoBMA (PSMA) 3.072 1 RoBMA (PSMA) 2.925
2 AK (AK2) 1.862 2 puniform (default) 1.707
3 puniform (default) 1.712 3 PETPEESE (default) 1.411
4 PETPEESE (default) 1.411 4 PET (default) 1.402
5 PET (default) 1.402 5 EK (default) 1.401
6 EK (default) 1.401 6 AK (AK2) 1.346
7 MAIVE (default) 1.258 7 MAIVE (default) 1.258
8 SM (3PSM) 1.058 8 puniform (star) 1.057
9 puniform (star) 1.057 9 SM (3PSM) 1.047
10 SM (4PSM) 0.857 10 SM (4PSM) 0.891
11 AK (AK1) 0.816 11 AK (AK1) 0.814
12 WILS (default) 0.780 12 WILS (default) 0.780
13 trimfill (default) 0.710 13 trimfill (default) 0.710
14 WAAPWLS (default) 0.694 14 WAAPWLS (default) 0.694
15 MAIVE (WAIVE) 0.658 15 MAIVE (WAIVE) 0.658
16 RMA (default) 0.601 16 RMA (default) 0.601
17 PEESE (default) 0.592 17 PEESE (default) 0.592
18 WLS (default) 0.520 18 WLS (default) 0.520
19 FMA (default) 0.267 19 FMA (default) 0.267
20 mean (default) 0.143 20 mean (default) 0.143
21 pcurve (default) NaN 21 pcurve (default) NaN

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Log Value Rank Method Log Value
1 PETPEESE (default) -4.416 1 PETPEESE (default) -4.416
2 PET (default) -4.295 2 PET (default) -4.295
3 EK (default) -4.295 3 EK (default) -4.295
4 WAAPWLS (default) -3.646 4 AK (AK2) -4.176
5 PEESE (default) -3.433 5 WAAPWLS (default) -3.646
6 puniform (default) -3.371 6 PEESE (default) -3.433
7 RoBMA (PSMA) -2.917 7 puniform (default) -3.370
8 MAIVE (default) -2.868 8 RoBMA (PSMA) -2.912
9 SM (3PSM) -2.841 9 SM (3PSM) -2.882
10 trimfill (default) -2.827 10 MAIVE (default) -2.868
11 WLS (default) -2.818 11 trimfill (default) -2.827
12 AK (AK2) -2.697 12 WLS (default) -2.818
13 AK (AK1) -2.628 13 AK (AK1) -2.629
14 puniform (star) -2.613 14 puniform (star) -2.613
15 WILS (default) -2.580 15 WILS (default) -2.580
16 RMA (default) -2.441 16 RMA (default) -2.441
17 FMA (default) -2.367 17 FMA (default) -2.367
18 SM (4PSM) -1.917 18 SM (4PSM) -2.168
19 mean (default) -1.692 19 mean (default) -1.692
20 MAIVE (WAIVE) -1.366 20 MAIVE (WAIVE) -1.366
21 pcurve (default) NaN 21 pcurve (default) NaN

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.
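Analogously, the negative likelihood ratio compares the two non-rejection rates; an illustrative sketch with hypothetical inputs:

```python
import math

def log_negative_lr(power, type1_error):
    """Log negative likelihood ratio:
    log( P(non-significant | effect) / P(non-significant | no effect) ),
    i.e., log((1 - power) / (1 - type I error rate)). Below 0 is informative."""
    return math.log((1 - power) / (1 - type1_error))

# A method with 80% power and a 5% type I error rate:
print(log_negative_lr(0.80, 0.05))  # about -1.56
```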

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 RoBMA (PSMA) 0.124 1 RoBMA (PSMA) 0.129
2 AK (AK2) 0.150 2 MAIVE (WAIVE) 0.220
3 MAIVE (WAIVE) 0.220 3 SM (4PSM) 0.284
4 SM (4PSM) 0.279 4 PET (default) 0.293
5 PET (default) 0.293 5 EK (default) 0.294
6 EK (default) 0.294 6 AK (AK2) 0.307
7 PETPEESE (default) 0.311 7 PETPEESE (default) 0.311
8 SM (3PSM) 0.318 8 SM (3PSM) 0.323
9 puniform (star) 0.327 9 puniform (star) 0.327
10 MAIVE (default) 0.328 10 MAIVE (default) 0.328
11 WILS (default) 0.404 11 WILS (default) 0.404
12 puniform (default) 0.618 12 puniform (default) 0.618
13 WAAPWLS (default) 0.650 13 WAAPWLS (default) 0.650
14 PEESE (default) 0.664 14 PEESE (default) 0.664
15 trimfill (default) 0.706 15 trimfill (default) 0.706
16 AK (AK1) 0.731 16 AK (AK1) 0.731
17 WLS (default) 0.762 17 WLS (default) 0.762
18 RMA (default) 0.782 18 RMA (default) 0.782
19 FMA (default) 0.883 19 FMA (default) 0.883
20 mean (default) 0.931 20 mean (default) 0.931
21 pcurve (default) NaN 21 pcurve (default) NaN

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.
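Both the type I error rate and power are rejection rates, computed from the same quantity under different true states; a minimal hypothetical sketch:

```python
import numpy as np

def rejection_rate(p_values, alpha=0.05):
    """Proportion of simulation runs that reject the null hypothesis.
    Under a true null effect this is the type I error rate; under a
    true non-null effect it is the power."""
    return float(np.mean(np.asarray(p_values, dtype=float) < alpha))

# Four hypothetical p-values; two fall below alpha = 0.05:
print(rejection_rate([0.01, 0.20, 0.04, 0.70]))  # 0.5
```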

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 mean (default) 0.996 1 mean (default) 0.996
2 FMA (default) 0.991 2 FMA (default) 0.991
3 RMA (default) 0.979 3 RMA (default) 0.979
4 WLS (default) 0.978 4 WLS (default) 0.978
5 AK (AK1) 0.974 5 AK (AK1) 0.974
6 trimfill (default) 0.971 6 trimfill (default) 0.971
7 PEESE (default) 0.953 7 PEESE (default) 0.953
8 WAAPWLS (default) 0.937 8 WAAPWLS (default) 0.937
9 puniform (default) 0.929 9 puniform (default) 0.929
10 PETPEESE (default) 0.886 10 PETPEESE (default) 0.886
11 EK (default) 0.865 11 AK (AK2) 0.870
12 PET (default) 0.865 12 EK (default) 0.865
13 WILS (default) 0.839 13 PET (default) 0.865
14 SM (3PSM) 0.789 14 WILS (default) 0.839
15 AK (AK2) 0.777 15 SM (3PSM) 0.798
16 puniform (star) 0.766 16 puniform (star) 0.766
17 MAIVE (default) 0.743 17 MAIVE (default) 0.743
18 SM (4PSM) 0.711 18 SM (4PSM) 0.732
19 RoBMA (PSMA) 0.665 19 RoBMA (PSMA) 0.669
20 MAIVE (WAIVE) 0.475 20 MAIVE (WAIVE) 0.475
21 pcurve (default) NaN 21 pcurve (default) NaN

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

By-Condition Performance (Conditional on Method Convergence)

The results below are conditional on method convergence. Note that the methods might differ in convergence rate and are therefore not compared on the same data sets.

Raincloud plot showing convergence rates across different methods

Raincloud plot showing RMSE (Root Mean Square Error) across different methods

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average RMSE is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of RMSE values on the corresponding outcome scale.
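The way RMSE combines bias and variability can be made concrete with a short sketch (illustrative Python, not the simulation code; inputs are hypothetical). RMSE squared equals squared bias plus the population-variance form of the empirical SE:

```python
import numpy as np

def rmse(estimates, truth):
    """Root mean square error of the meta-analytic estimates
    around the true effect, across simulation runs."""
    e = np.asarray(estimates, dtype=float)
    return float(np.sqrt(np.mean((e - truth) ** 2)))

est, truth = [0.2, 0.4, 0.6], 0.3
bias = np.mean(est) - truth          # 0.1
variance = np.var(est)               # n denominator (ddof = 0)
# Decomposition: RMSE^2 = bias^2 + variance
print(rmse(est, truth) ** 2, bias ** 2 + variance)
```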

Raincloud plot showing bias across different methods

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0. Methods are compared using condition-wise ranks. Direct comparison using the average bias is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of bias values on the corresponding outcome scale.

Raincloud plot showing empirical standard error across different methods

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance. Methods are compared using condition-wise ranks. Direct comparison using the average empirical standard error is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of empirical standard error values on the corresponding outcome scale.

Raincloud plot showing interval score across different methods

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average interval score is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of interval score values on the corresponding outcome scale.

Raincloud plot showing 95% confidence interval coverage across different methods

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Raincloud plot showing 95% confidence interval width across different methods

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average 95% CI width is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of 95% CI width values on the corresponding outcome scale.

Raincloud plot showing positive likelihood ratio across different methods

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Raincloud plot showing negative likelihood ratio across different methods

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.

Raincloud plot showing Type I Error rates across different methods

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Raincloud plot showing statistical power across different methods

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

By-Condition Performance (Replacement in Case of Non-Convergence)

The results below incorporate method replacement to handle non-convergence. If a method fails to converge, its results are replaced with the results from a simpler method (e.g., random-effects meta-analysis without publication bias adjustment). This emulates what a data analyst may do in practice in case a method does not converge. However, note that these results do not correspond to “pure” method performance as they might combine multiple different methods. See Method Replacement Strategy for details of the method replacement specification.

Raincloud plot showing convergence rates across different methods

Raincloud plot showing RMSE (Root Mean Square Error) across different methods

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average RMSE is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of RMSE values on the corresponding outcome scale.

Raincloud plot showing bias across different methods

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0. Methods are compared using condition-wise ranks. Direct comparison using the average bias is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of bias values on the corresponding outcome scale.

Raincloud plot showing empirical standard error across different methods

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance. Methods are compared using condition-wise ranks. Direct comparison using the average empirical standard error is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of empirical standard error values on the corresponding outcome scale.

Raincloud plot showing interval score across different methods

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average interval score is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of interval score values on the corresponding outcome scale.

Raincloud plot showing 95% confidence interval coverage across different methods

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Raincloud plot showing 95% confidence interval width across different methods

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average 95% CI width is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of 95% CI width values on the corresponding outcome scale.

Raincloud plot showing positive likelihood ratio across different methods

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Raincloud plot showing negative likelihood ratio across different methods

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.

Raincloud plot showing Type I Error rates across different methods

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Raincloud plot showing statistical power across different methods

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

Subset: Publication Bias Absent

These results are based on Stanley (2017), Alinaghi (2018), Bom (2019), and Carter (2019) data-generating mechanisms with a total of 522 conditions.

Average Performance

Method performance measures are aggregated across all simulated conditions to provide an overall impression of method performance. However, keep in mind that a method with a high overall ranking is not necessarily the “best” method for a particular application. To select a suitable method for your application, consider also non-aggregated performance measures in conditions most relevant to your application.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Mean Rank Rank Method Mean Rank
1 RoBMA (PSMA) 5.810 1 RoBMA (PSMA) 5.885
2 AK (AK1) 6.239 2 AK (AK1) 6.406
3 FMA (default) 6.902 3 FMA (default) 6.933
4 WLS (default) 6.916 4 WLS (default) 6.946
5 WAAPWLS (default) 7.517 5 WAAPWLS (default) 7.531
6 RMA (default) 7.546 6 RMA (default) 7.540
7 SM (3PSM) 7.885 7 SM (3PSM) 7.918
8 trimfill (default) 9.050 8 trimfill (default) 9.102
9 puniform (star) 9.862 9 puniform (star) 9.906
10 PEESE (default) 10.140 10 PEESE (default) 10.157
11 WILS (default) 10.921 11 WILS (default) 10.939
12 PETPEESE (default) 11.391 12 PETPEESE (default) 11.433
13 SM (4PSM) 11.489 13 SM (4PSM) 11.531
14 mean (default) 12.013 14 mean (default) 12.048
15 AK (AK2) 13.000 15 AK (AK2) 12.262
16 EK (default) 13.875 16 EK (default) 13.902
17 PET (default) 14.023 17 PET (default) 14.048
18 MAIVE (default) 14.356 18 MAIVE (default) 14.418
19 pcurve (default) 15.628 19 pcurve (default) 15.649
20 puniform (default) 15.956 20 puniform (default) 15.973
21 MAIVE (WAIVE) 17.969 21 MAIVE (WAIVE) 17.964

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average RMSE is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of RMSE values on the corresponding outcome scale.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Mean Rank Rank Method Mean Rank
1 AK (AK1) 7.730 1 AK (AK1) 7.898
2 WAAPWLS (default) 7.962 2 WAAPWLS (default) 8.073
3 WLS (default) 8.354 3 WLS (default) 8.454
4 FMA (default) 8.383 4 FMA (default) 8.483
5 SM (3PSM) 8.556 5 SM (3PSM) 8.550
6 PEESE (default) 8.580 6 PEESE (default) 8.663
7 RMA (default) 9.203 7 RMA (default) 9.236
8 PETPEESE (default) 9.238 8 PETPEESE (default) 9.347
9 MAIVE (default) 9.659 9 MAIVE (default) 9.715
10 puniform (star) 10.010 10 puniform (star) 10.052
11 SM (4PSM) 10.149 11 SM (4PSM) 10.209
12 PET (default) 10.282 12 PET (default) 10.387
13 EK (default) 10.303 13 EK (default) 10.414
14 RoBMA (PSMA) 10.638 14 RoBMA (PSMA) 10.711
15 mean (default) 11.989 15 AK (AK2) 11.163
16 AK (AK2) 12.626 16 mean (default) 12.084
17 MAIVE (WAIVE) 13.954 17 MAIVE (WAIVE) 13.994
18 trimfill (default) 14.224 18 trimfill (default) 14.335
19 WILS (default) 14.391 19 WILS (default) 14.429
20 puniform (default) 15.646 20 puniform (default) 15.649
21 pcurve (default) 17.036 21 pcurve (default) 17.067

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0. Methods are compared using condition-wise ranks. Direct comparison using the average bias is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of bias values on the corresponding outcome scale.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Mean Rank Rank Method Mean Rank
1 RMA (default) 3.640 1 RMA (default) 3.527
2 WLS (default) 5.487 2 WLS (default) 5.433
3 FMA (default) 5.492 3 FMA (default) 5.439
4 AK (AK1) 6.153 4 RoBMA (PSMA) 6.264
5 RoBMA (PSMA) 6.280 5 AK (AK1) 6.490
6 trimfill (default) 7.669 6 trimfill (default) 7.634
7 mean (default) 8.613 7 WAAPWLS (default) 8.600
8 WAAPWLS (default) 8.642 8 mean (default) 8.603
9 SM (3PSM) 9.778 9 SM (3PSM) 9.833
10 pcurve (default) 10.490 10 pcurve (default) 10.475
11 PEESE (default) 11.203 11 PEESE (default) 11.276
12 puniform (star) 11.747 12 puniform (star) 11.808
13 WILS (default) 11.912 13 WILS (default) 11.927
14 puniform (default) 12.121 14 puniform (default) 12.121
15 PETPEESE (default) 13.268 15 PETPEESE (default) 13.360
16 SM (4PSM) 13.632 16 SM (4PSM) 13.661
17 AK (AK2) 14.310 17 AK (AK2) 13.906
18 MAIVE (default) 15.875 18 MAIVE (default) 15.927
19 EK (default) 16.291 19 EK (default) 16.307
20 PET (default) 16.406 20 PET (default) 16.420
21 MAIVE (WAIVE) 19.523 21 MAIVE (WAIVE) 19.521

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance. Methods are compared using condition-wise ranks. Direct comparison using the average empirical standard error is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of empirical standard error values on the corresponding outcome scale.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Mean Rank Rank Method Mean Rank
1 RoBMA (PSMA) 4.705 1 RoBMA (PSMA) 4.799
2 AK (AK1) 5.703 2 AK (AK1) 5.814
3 SM (3PSM) 6.337 3 SM (3PSM) 6.421
4 RMA (default) 7.577 4 RMA (default) 7.577
5 puniform (star) 7.669 5 puniform (star) 7.801
6 WAAPWLS (default) 7.824 6 WAAPWLS (default) 7.904
7 WLS (default) 8.456 7 WLS (default) 8.506
8 SM (4PSM) 8.881 8 SM (4PSM) 8.605
9 trimfill (default) 9.534 9 trimfill (default) 9.603
10 PEESE (default) 10.412 10 PEESE (default) 10.448
11 AK (AK2) 11.431 11 AK (AK2) 10.726
12 PETPEESE (default) 11.569 12 PETPEESE (default) 11.632
13 MAIVE (default) 12.232 13 MAIVE (default) 12.280
14 FMA (default) 12.322 14 FMA (default) 12.360
15 EK (default) 13.048 15 EK (default) 13.065
16 WILS (default) 13.603 16 WILS (default) 13.653
17 mean (default) 13.895 17 mean (default) 13.960
18 PET (default) 14.044 18 PET (default) 14.050
19 MAIVE (WAIVE) 15.088 19 MAIVE (WAIVE) 15.092
20 puniform (default) 15.128 20 puniform (default) 15.161
21 pcurve (default) 20.981 21 pcurve (default) 20.981

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average interval score is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of interval score values on the corresponding outcome scale.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 RoBMA (PSMA) 0.888 1 RoBMA (PSMA) 0.888
2 AK (AK2) 0.879 2 AK (AK2) 0.879
3 SM (4PSM) 0.860 3 SM (4PSM) 0.856
4 MAIVE (default) 0.835 4 MAIVE (default) 0.835
5 puniform (star) 0.832 5 puniform (star) 0.832
6 SM (3PSM) 0.831 6 SM (3PSM) 0.830
7 AK (AK1) 0.791 7 AK (AK1) 0.790
8 MAIVE (WAIVE) 0.776 8 MAIVE (WAIVE) 0.776
9 WAAPWLS (default) 0.711 9 WAAPWLS (default) 0.711
10 EK (default) 0.706 10 EK (default) 0.706
11 PETPEESE (default) 0.695 11 PETPEESE (default) 0.695
12 PET (default) 0.689 12 PET (default) 0.689
13 RMA (default) 0.675 13 RMA (default) 0.675
14 trimfill (default) 0.673 14 trimfill (default) 0.673
15 PEESE (default) 0.656 15 PEESE (default) 0.656
16 WLS (default) 0.619 16 WLS (default) 0.619
17 WILS (default) 0.557 17 WILS (default) 0.557
18 mean (default) 0.505 18 mean (default) 0.505
19 puniform (default) 0.497 19 puniform (default) 0.499
20 FMA (default) 0.461 20 FMA (default) 0.461
21 pcurve (default) NaN 21 pcurve (default) NaN

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Mean Rank Rank Method Mean Rank
1 FMA (default) 2.420 1 FMA (default) 2.404
2 WILS (default) 3.289 2 WILS (default) 3.262
3 WLS (default) 3.669 3 WLS (default) 3.636
4 WAAPWLS (default) 7.347 4 WAAPWLS (default) 7.308
5 RMA (default) 7.485 5 RMA (default) 7.423
6 PEESE (default) 7.490 6 PEESE (default) 7.469
7 trimfill (default) 7.715 7 trimfill (default) 7.615
8 mean (default) 7.944 8 mean (default) 7.948
9 RoBMA (PSMA) 9.308 9 RoBMA (PSMA) 9.351
10 PETPEESE (default) 9.565 10 PETPEESE (default) 9.611
11 AK (AK1) 10.140 11 AK (AK1) 10.142
12 SM (3PSM) 11.646 12 SM (3PSM) 11.661
13 puniform (default) 11.910 13 puniform (default) 11.875
14 puniform (star) 13.538 14 puniform (star) 13.695
15 PET (default) 13.981 15 PET (default) 14.067
16 SM (4PSM) 14.820 16 SM (4PSM) 14.678
17 EK (default) 15.186 17 EK (default) 15.289
18 AK (AK2) 15.479 18 AK (AK2) 15.398
19 MAIVE (default) 17.661 19 MAIVE (default) 17.724
20 MAIVE (WAIVE) 18.866 20 MAIVE (WAIVE) 18.898
21 pcurve (default) 20.981 21 pcurve (default) 20.981

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average CI width is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of CI width values on the corresponding outcome scale.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Log Value Rank Method Log Value
1 RoBMA (PSMA) 5.048 1 RoBMA (PSMA) 4.967
2 AK (AK2) 2.632 2 AK (AK2) 2.489
3 MAIVE (default) 2.263 3 MAIVE (default) 2.263
4 MAIVE (WAIVE) 2.079 4 MAIVE (WAIVE) 2.079
5 AK (AK1) 2.066 5 AK (AK1) 2.038
6 puniform (star) 1.896 6 SM (3PSM) 1.903
7 SM (3PSM) 1.894 7 puniform (star) 1.896
8 RMA (default) 1.842 8 RMA (default) 1.842
9 SM (4PSM) 1.794 9 SM (4PSM) 1.814
10 PETPEESE (default) 1.790 10 PETPEESE (default) 1.790
11 EK (default) 1.759 11 EK (default) 1.759
12 PET (default) 1.758 12 PET (default) 1.758
13 WAAPWLS (default) 1.480 13 WAAPWLS (default) 1.480
14 PEESE (default) 1.380 14 PEESE (default) 1.380
15 trimfill (default) 1.375 15 trimfill (default) 1.375
16 WLS (default) 1.367 16 WLS (default) 1.367
17 mean (default) 1.222 17 mean (default) 1.222
18 puniform (default) 1.095 18 WILS (default) 1.065
19 WILS (default) 1.065 19 puniform (default) 1.063
20 FMA (default) 1.006 20 FMA (default) 1.006
21 pcurve (default) NaN 21 pcurve (default) NaN

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Log Value Rank Method Log Value
1 SM (3PSM) -5.093 1 AK (AK2) -5.601
2 PETPEESE (default) -5.073 2 SM (3PSM) -5.111
3 EK (default) -4.924 3 PETPEESE (default) -5.073
4 PET (default) -4.924 4 EK (default) -4.924
5 WAAPWLS (default) -4.886 5 PET (default) -4.924
6 PEESE (default) -4.863 6 WAAPWLS (default) -4.886
7 WLS (default) -4.799 7 PEESE (default) -4.863
8 trimfill (default) -4.763 8 WLS (default) -4.799
9 AK (AK1) -4.662 9 trimfill (default) -4.764
10 RMA (default) -4.572 10 AK (AK1) -4.670
11 FMA (default) -4.531 11 RMA (default) -4.572
12 MAIVE (default) -4.517 12 FMA (default) -4.531
13 puniform (star) -4.477 13 MAIVE (default) -4.517
14 mean (default) -4.232 14 puniform (star) -4.477
15 RoBMA (PSMA) -4.216 15 SM (4PSM) -4.377
16 SM (4PSM) -4.170 16 mean (default) -4.232
17 AK (AK2) -4.051 17 RoBMA (PSMA) -4.227
18 WILS (default) -4.013 18 WILS (default) -4.013
19 puniform (default) -3.380 19 puniform (default) -3.388
20 MAIVE (WAIVE) -1.933 20 MAIVE (WAIVE) -1.933
21 pcurve (default) NaN 21 pcurve (default) NaN

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 RoBMA (PSMA) 0.053 1 RoBMA (PSMA) 0.054
2 AK (AK2) 0.071 2 MAIVE (WAIVE) 0.077
3 MAIVE (WAIVE) 0.077 3 AK (AK2) 0.090
4 MAIVE (default) 0.118 4 MAIVE (default) 0.118
5 SM (4PSM) 0.167 5 SM (4PSM) 0.167
6 PET (default) 0.172 6 PET (default) 0.172
7 EK (default) 0.172 7 EK (default) 0.172
8 PETPEESE (default) 0.176 8 PETPEESE (default) 0.176
9 SM (3PSM) 0.183 9 SM (3PSM) 0.183
10 puniform (star) 0.216 10 puniform (star) 0.216
11 WAAPWLS (default) 0.233 11 WAAPWLS (default) 0.233
12 AK (AK1) 0.236 12 AK (AK1) 0.236
13 RMA (default) 0.255 13 RMA (default) 0.255
14 PEESE (default) 0.276 14 PEESE (default) 0.276
15 WLS (default) 0.296 15 WLS (default) 0.296
16 trimfill (default) 0.310 16 trimfill (default) 0.310
17 WILS (default) 0.361 17 WILS (default) 0.361
18 mean (default) 0.430 18 mean (default) 0.430
19 FMA (default) 0.518 19 FMA (default) 0.518
20 puniform (default) 0.587 20 puniform (default) 0.582
21 pcurve (default) NaN 21 pcurve (default) NaN

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 FMA (default) 0.984 1 FMA (default) 0.984
2 mean (default) 0.978 2 mean (default) 0.978
3 WLS (default) 0.971 3 WLS (default) 0.971
4 RMA (default) 0.963 4 RMA (default) 0.963
5 AK (AK1) 0.960 5 puniform (default) 0.960
6 puniform (default) 0.959 6 AK (AK1) 0.960
7 trimfill (default) 0.953 7 trimfill (default) 0.953
8 PEESE (default) 0.952 8 PEESE (default) 0.952
9 WAAPWLS (default) 0.927 9 WAAPWLS (default) 0.927
10 WILS (default) 0.918 10 WILS (default) 0.918
11 SM (3PSM) 0.911 11 AK (AK2) 0.915
12 PETPEESE (default) 0.910 12 SM (3PSM) 0.914
13 puniform (star) 0.898 13 PETPEESE (default) 0.910
14 EK (default) 0.889 14 puniform (star) 0.898
15 PET (default) 0.889 15 EK (default) 0.889
16 AK (AK2) 0.883 16 PET (default) 0.889
17 MAIVE (default) 0.857 17 SM (4PSM) 0.858
18 SM (4PSM) 0.843 18 MAIVE (default) 0.857
19 RoBMA (PSMA) 0.784 19 RoBMA (PSMA) 0.785
20 MAIVE (WAIVE) 0.547 20 MAIVE (WAIVE) 0.547
21 pcurve (default) NaN 21 pcurve (default) NaN

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

By-Condition Performance (Conditional on Method Convergence)

The results below are conditional on method convergence. Note that the methods might differ in convergence rate and are therefore not compared on the same data sets.

Raincloud plot showing convergence rates across different methods

Raincloud plot showing RMSE (Root Mean Square Error) across different methods

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average RMSE is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of RMSE values on the corresponding outcome scale.

Raincloud plot showing bias across different methods

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0. Methods are compared using condition-wise ranks. Direct comparison using the average bias is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of bias values on the corresponding outcome scale.

Raincloud plot showing bias across different methods

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance. Methods are compared using condition-wise ranks. Direct comparison using the empirical standard error is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of bias values on the corresponding outcome scale.

Raincloud plot showing 95% confidence interval width across different methods

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the interval score is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of bias values on the corresponding outcome scale.

Raincloud plot showing 95% confidence interval coverage across different methods

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Raincloud plot showing 95% confidence interval width across different methods

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average 95% CI width is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of 95% CI width values on the corresponding outcome scale.

Raincloud plot showing positive likelihood ratio across different methods

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Raincloud plot showing negative likelihood ratio across different methods

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.

Raincloud plot showing Type I Error rates across different methods

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Raincloud plot showing statistical power across different methods

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

By-Condition Performance (Replacement in Case of Non-Convergence)

The results below incorporate method replacement to handle non-convergence. If a method fails to converge, its results are replaced with the results from a simpler method (e.g., random-effects meta-analysis without publication bias adjustment). This emulates what a data analyst may do in practice in case a method does not converge. However, note that these results do not correspond to “pure” method performance as they might combine multiple different methods. See Method Replacement Strategy for details of the method replacement specification.

Raincloud plot showing convergence rates across different methods

Raincloud plot showing RMSE (Root Mean Square Error) across different methods

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average RMSE is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of RMSE values on the corresponding outcome scale.

Raincloud plot showing bias across different methods

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0. Methods are compared using condition-wise ranks. Direct comparison using the average bias is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of bias values on the corresponding outcome scale.

Raincloud plot showing bias across different methods

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance. Methods are compared using condition-wise ranks. Direct comparison using the empirical standard error is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of bias values on the corresponding outcome scale.

Raincloud plot showing 95% confidence interval width across different methods

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the interval score is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of bias values on the corresponding outcome scale.

Raincloud plot showing 95% confidence interval coverage across different methods

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Raincloud plot showing 95% confidence interval width across different methods

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average 95% CI width is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of 95% CI width values on the corresponding outcome scale.

Raincloud plot showing positive likelihood ratio across different methods

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Raincloud plot showing negative likelihood ratio across different methods

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.

Raincloud plot showing Type I Error rates across different methods

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Raincloud plot showing statistical power across different methods

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

Session Info

This report was compiled on Mon Mar 16 19:18:06 2026 (UTC) using the following computational environment

## R version 4.5.3 (2026-03-11)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## time zone: UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] scales_1.4.0                   ggdist_3.3.3                  
## [3] ggplot2_4.0.2                  PublicationBiasBenchmark_0.2.0
## 
## loaded via a namespace (and not attached):
##  [1] generics_0.1.4       sandwich_3.1-1       sass_0.4.10         
##  [4] xml2_1.5.2           stringi_1.8.7        lattice_0.22-9      
##  [7] httpcode_0.3.0       digest_0.6.39        magrittr_2.0.4      
## [10] evaluate_1.0.5       grid_4.5.3           RColorBrewer_1.1-3  
## [13] fastmap_1.2.0        jsonlite_2.0.0       crul_1.6.0          
## [16] urltools_1.7.3.1     httr_1.4.8           purrr_1.2.1         
## [19] viridisLite_0.4.3    textshaping_1.0.5    jquerylib_0.1.4     
## [22] Rdpack_2.6.6         cli_3.6.5            rlang_1.1.7         
## [25] triebeard_0.4.1      rbibutils_2.4.1      withr_3.0.2         
## [28] cachem_1.1.0         yaml_2.3.12          otel_0.2.0          
## [31] tools_4.5.3          memoise_2.0.1        kableExtra_1.4.0    
## [34] curl_7.0.0           vctrs_0.7.1          R6_2.6.1            
## [37] clubSandwich_0.6.2   zoo_1.8-15           lifecycle_1.0.5     
## [40] stringr_1.6.0        fs_1.6.7             htmlwidgets_1.6.4   
## [43] ragg_1.5.1           pkgconfig_2.0.3      desc_1.4.3          
## [46] osfr_0.2.9           pkgdown_2.2.0        bslib_0.10.0        
## [49] pillar_1.11.1        gtable_0.3.6         Rcpp_1.1.1          
## [52] glue_1.8.0           systemfonts_1.3.2    xfun_0.56           
## [55] tibble_3.3.1         rstudioapi_0.18.0    knitr_1.51          
## [58] farver_2.1.2         htmltools_0.5.9      labeling_0.4.3      
## [61] svglite_2.2.2        rmarkdown_2.30       compiler_4.5.3      
## [64] S7_0.2.1             distributional_0.6.0