
Complete Results

These results are based on the Stanley (2017) data-generating mechanism with a total of 324 conditions.

Average Performance

Method performance measures are aggregated across all simulated conditions to provide an overall impression of method performance. However, keep in mind that a method with a high overall ranking is not necessarily the “best” method for a particular application. To select a suitable method, also consider the non-aggregated performance measures in the conditions most relevant to your application.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Mean Rank Rank Method Mean Rank
1 RoBMA (PSMA) 4.975 1 RoBMA (PSMA) 5.293
2 AK (AK1) 6.765 2 AK (AK1) 6.707
3 SM (3PSM) 7.191 3 SM (3PSM) 7.105
4 FMA (default) 8.185 4 FMA (default) 8.191
5 WLS (default) 8.198 5 WLS (default) 8.204
6 WAAPWLS (default) 8.324 6 WAAPWLS (default) 8.380
7 puniform (star) 9.414 7 puniform (star) 9.364
8 SM (4PSM) 9.448 8 SM (4PSM) 9.435
9 WILS (default) 9.688 9 WILS (default) 9.627
10 RMA (default) 10.012 10 RMA (default) 10.031
11 trimfill (default) 10.065 11 trimfill (default) 10.093
12 PEESE (default) 10.367 12 PEESE (default) 10.398
13 PETPEESE (default) 10.401 12 PETPEESE (default) 10.398
14 EK (default) 11.491 14 AK (AK2) 11.429
15 PET (default) 11.549 15 EK (default) 11.481
16 AK (AK2) 11.636 16 PET (default) 11.540
17 pcurve (default) 12.568 17 pcurve (default) 12.577
18 puniform (default) 13.244 18 puniform (default) 13.219
19 mean (default) 13.886 19 mean (default) 13.935

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average RMSE is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of RMSE values on the corresponding outcome scale.
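The relationship between RMSE, bias, and empirical SE can be sketched in plain Python (an illustration with made-up numbers, not the simulation code): with the population (1/n) standard deviation, RMSE² decomposes exactly into bias² plus empirical SE².

```python
import math

def bias(estimates, true_effect):
    # average signed deviation of the estimates from the true effect
    return sum(e - true_effect for e in estimates) / len(estimates)

def empirical_se(estimates):
    # population (1/n) standard deviation of the estimates across runs
    mean = sum(estimates) / len(estimates)
    return math.sqrt(sum((e - mean) ** 2 for e in estimates) / len(estimates))

def rmse(estimates, true_effect):
    # square root of the average squared deviation from the true effect
    return math.sqrt(sum((e - true_effect) ** 2 for e in estimates) / len(estimates))

estimates = [0.35, 0.42, 0.28, 0.51, 0.39]  # hypothetical meta-analytic estimates
true_effect = 0.30

# RMSE combines bias and empirical SE: RMSE^2 = bias^2 + empirical SE^2
lhs = rmse(estimates, true_effect) ** 2
rhs = bias(estimates, true_effect) ** 2 + empirical_se(estimates) ** 2
assert abs(lhs - rhs) < 1e-12
```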

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Mean Rank Rank Method Mean Rank
1 SM (3PSM) 6.451 1 SM (3PSM) 6.386
2 SM (4PSM) 6.880 2 SM (4PSM) 6.941
3 AK (AK1) 7.562 3 AK (AK1) 7.583
4 RoBMA (PSMA) 7.580 4 RoBMA (PSMA) 7.765
5 puniform (star) 8.340 5 puniform (star) 8.324
6 PETPEESE (default) 8.750 6 PETPEESE (default) 8.784
7 EK (default) 8.941 7 EK (default) 8.954
8 PET (default) 8.954 8 PET (default) 8.966
9 WAAPWLS (default) 9.059 9 WAAPWLS (default) 9.154
10 PEESE (default) 9.583 10 PEESE (default) 9.636
11 WLS (default) 10.105 11 WLS (default) 10.207
12 FMA (default) 10.120 12 FMA (default) 10.222
13 WILS (default) 11.210 13 AK (AK2) 10.590
14 AK (AK2) 11.414 14 WILS (default) 11.213
15 puniform (default) 11.864 15 puniform (default) 11.870
16 RMA (default) 11.867 16 RMA (default) 11.895
17 trimfill (default) 12.164 17 trimfill (default) 12.244
18 pcurve (default) 12.806 18 pcurve (default) 12.840
19 mean (default) 14.086 19 mean (default) 14.160

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0. Methods are compared using condition-wise ranks. Direct comparison using the average bias is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of bias values on the corresponding outcome scale.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Mean Rank Rank Method Mean Rank
1 RMA (default) 3.380 1 RMA (default) 3.238
2 WLS (default) 3.975 2 WLS (default) 3.821
3 FMA (default) 3.981 3 FMA (default) 3.827
4 AK (AK1) 5.694 4 AK (AK1) 5.833
5 WAAPWLS (default) 7.241 5 WAAPWLS (default) 7.167
6 trimfill (default) 7.386 6 trimfill (default) 7.343
7 RoBMA (PSMA) 7.664 7 RoBMA (PSMA) 7.994
8 mean (default) 8.235 8 mean (default) 8.157
9 SM (3PSM) 9.802 9 SM (3PSM) 9.793
10 pcurve (default) 11.000 10 pcurve (default) 10.920
11 PEESE (default) 11.299 11 PEESE (default) 11.293
12 WILS (default) 11.627 12 WILS (default) 11.611
13 puniform (default) 12.009 13 puniform (default) 11.951
14 puniform (star) 12.043 14 puniform (star) 12.046
15 AK (AK2) 12.917 15 AK (AK2) 13.318
16 SM (4PSM) 13.485 16 SM (4PSM) 13.457
17 PETPEESE (default) 14.052 17 PETPEESE (default) 14.068
18 EK (default) 15.840 18 EK (default) 15.815
19 PET (default) 15.886 19 PET (default) 15.864

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance. Methods are compared using condition-wise ranks. Direct comparison using the average empirical standard error is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of empirical standard error values on the corresponding outcome scale.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Mean Rank Rank Method Mean Rank
1 RoBMA (PSMA) 4.000 1 RoBMA (PSMA) 4.296
2 SM (3PSM) 5.327 2 SM (3PSM) 5.435
3 puniform (star) 6.123 3 puniform (star) 6.160
4 AK (AK1) 6.269 4 AK (AK1) 6.210
5 SM (4PSM) 7.309 5 SM (4PSM) 6.994
6 WAAPWLS (default) 9.037 6 WAAPWLS (default) 9.182
7 WLS (default) 9.741 7 WLS (default) 9.796
8 RMA (default) 9.802 8 RMA (default) 9.824
9 EK (default) 9.920 9 EK (default) 9.870
10 trimfill (default) 10.235 10 trimfill (default) 10.302
11 PETPEESE (default) 10.389 11 AK (AK2) 10.367
12 PEESE (default) 10.438 12 PETPEESE (default) 10.392
13 AK (AK2) 10.701 13 PEESE (default) 10.500
14 PET (default) 10.997 14 PET (default) 10.938
15 puniform (default) 11.704 15 puniform (default) 11.719
16 WILS (default) 11.926 16 WILS (default) 11.867
17 FMA (default) 12.056 17 FMA (default) 12.065
18 mean (default) 14.457 18 mean (default) 14.509
19 pcurve (default) 19.000 19 pcurve (default) 19.000

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average interval score is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of interval score values on the corresponding outcome scale.
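A common definition of this measure, and presumably the one used here, is the Gneiting–Raftery interval score for a central (1 − α) interval: the interval width plus a penalty of (2/α) times the distance by which the interval misses the true value. A minimal sketch:

```python
def interval_score(lower, upper, true_value, alpha=0.05):
    # width term, plus a penalty scaled by 2/alpha whenever the true
    # value falls outside the interval (Gneiting-Raftery interval score)
    score = upper - lower
    if true_value < lower:
        score += (2 / alpha) * (lower - true_value)
    elif true_value > upper:
        score += (2 / alpha) * (true_value - upper)
    return score

# an interval that covers the truth is scored by its width alone;
# missing the truth adds a penalty on top of the width
assert interval_score(0.0, 0.4, 0.2) < interval_score(0.0, 0.4, 0.5)
```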

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 RoBMA (PSMA) 0.902 1 RoBMA (PSMA) 0.896
2 SM (4PSM) 0.845 2 SM (4PSM) 0.833
3 AK (AK2) 0.801 3 AK (AK2) 0.783
4 puniform (star) 0.765 4 puniform (star) 0.765
5 SM (3PSM) 0.743 5 SM (3PSM) 0.736
6 EK (default) 0.715 6 EK (default) 0.715
7 PET (default) 0.688 7 PET (default) 0.688
8 PETPEESE (default) 0.681 8 PETPEESE (default) 0.681
9 AK (AK1) 0.608 9 AK (AK1) 0.607
10 puniform (default) 0.541 10 puniform (default) 0.542
11 PEESE (default) 0.524 11 PEESE (default) 0.524
12 WAAPWLS (default) 0.510 12 WAAPWLS (default) 0.510
13 trimfill (default) 0.497 13 trimfill (default) 0.497
14 WILS (default) 0.494 14 WILS (default) 0.494
15 RMA (default) 0.493 15 RMA (default) 0.493
16 WLS (default) 0.481 16 WLS (default) 0.481
17 FMA (default) 0.380 17 FMA (default) 0.380
18 mean (default) 0.366 18 mean (default) 0.366
19 pcurve (default) NaN 19 pcurve (default) NaN

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Mean Rank Rank Method Mean Rank
1 FMA (default) 2.373 1 FMA (default) 2.346
2 WILS (default) 3.188 2 WILS (default) 3.139
3 WLS (default) 3.969 3 WLS (default) 3.904
4 WAAPWLS (default) 5.972 4 WAAPWLS (default) 5.883
5 trimfill (default) 7.105 5 trimfill (default) 7.015
6 RMA (default) 7.148 6 RMA (default) 7.049
7 mean (default) 8.312 7 mean (default) 8.309
8 RoBMA (PSMA) 9.043 8 RoBMA (PSMA) 9.037
9 PEESE (default) 9.074 9 PEESE (default) 9.096
10 AK (AK1) 9.191 10 AK (AK1) 9.201
11 SM (3PSM) 10.608 11 SM (3PSM) 10.682
12 puniform (default) 11.438 12 puniform (default) 11.404
13 PETPEESE (default) 11.469 13 PETPEESE (default) 11.515
14 puniform (star) 12.509 14 puniform (star) 12.688
15 AK (AK2) 13.698 15 AK (AK2) 13.864
16 SM (4PSM) 14.080 16 SM (4PSM) 13.898
17 PET (default) 14.991 17 PET (default) 15.056
18 EK (default) 16.259 18 EK (default) 16.343
19 pcurve (default) 19.000 19 pcurve (default) 19.000

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average CI width is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of CI width values on the corresponding outcome scale.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Log Value Rank Method Log Value
1 RoBMA (PSMA) 5.355 1 RoBMA (PSMA) 4.807
2 AK (AK2) 2.939 2 AK (AK2) 2.293
3 SM (4PSM) 2.219 3 SM (4PSM) 2.238
4 puniform (star) 2.163 4 puniform (star) 2.163
5 SM (3PSM) 1.998 5 SM (3PSM) 1.984
6 EK (default) 1.840 6 EK (default) 1.840
7 PET (default) 1.838 7 PET (default) 1.838
8 PETPEESE (default) 1.834 8 PETPEESE (default) 1.834
9 puniform (default) 1.728 9 puniform (default) 1.667
10 AK (AK1) 1.493 10 AK (AK1) 1.445
11 WILS (default) 1.370 11 WILS (default) 1.370
12 RMA (default) 1.161 12 RMA (default) 1.161
13 PEESE (default) 1.109 13 PEESE (default) 1.109
14 WAAPWLS (default) 1.056 14 WAAPWLS (default) 1.056
15 WLS (default) 1.019 15 WLS (default) 1.019
16 trimfill (default) 0.953 16 trimfill (default) 0.953
17 mean (default) 0.849 17 mean (default) 0.849
18 FMA (default) 0.800 18 FMA (default) 0.800
19 pcurve (default) NaN 19 pcurve (default) NaN

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.
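As a sketch (illustrative numbers, not from the study), the positive likelihood ratio is the probability of a significant result under the alternative (power) divided by the probability of a significant result under the null (type I error rate):

```python
import math

def log_positive_lr(power, type1_error):
    # LR+ = P(significant | H1) / P(significant | H0)
    return math.log(power / type1_error)

# e.g. 80% power at a 5% type I error rate: LR+ = 16, log LR+ > 0,
# so a significant result substantially raises the odds of H1
assert log_positive_lr(0.80, 0.05) > 0
```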

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Log Value Rank Method Log Value
1 SM (3PSM) -4.898 1 AK (AK2) -6.215
2 RoBMA (PSMA) -4.722 2 SM (3PSM) -4.947
3 PETPEESE (default) -4.721 3 SM (4PSM) -4.828
4 AK (AK2) -4.622 4 RoBMA (PSMA) -4.737
5 EK (default) -4.570 5 PETPEESE (default) -4.721
6 PET (default) -4.570 6 EK (default) -4.570
7 SM (4PSM) -4.438 7 PET (default) -4.570
8 puniform (star) -4.287 8 puniform (star) -4.287
9 PEESE (default) -3.817 9 PEESE (default) -3.817
10 WILS (default) -3.661 10 WILS (default) -3.661
11 puniform (default) -3.597 11 puniform (default) -3.603
12 trimfill (default) -3.502 12 trimfill (default) -3.502
13 AK (AK1) -3.458 13 AK (AK1) -3.461
14 WLS (default) -3.393 14 WLS (default) -3.393
15 RMA (default) -3.312 15 RMA (default) -3.312
16 FMA (default) -3.220 16 FMA (default) -3.220
17 WAAPWLS (default) -3.095 17 WAAPWLS (default) -3.095
18 mean (default) -2.700 18 mean (default) -2.700
19 pcurve (default) NaN 19 pcurve (default) NaN

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 RoBMA (PSMA) 0.022 1 RoBMA (PSMA) 0.031
2 AK (AK2) 0.082 2 SM (4PSM) 0.125
3 SM (4PSM) 0.112 3 AK (AK2) 0.156
4 PET (default) 0.237 4 PET (default) 0.237
5 EK (default) 0.237 5 EK (default) 0.237
6 puniform (star) 0.242 6 puniform (star) 0.242
7 PETPEESE (default) 0.269 7 PETPEESE (default) 0.269
8 SM (3PSM) 0.282 8 SM (3PSM) 0.288
9 WILS (default) 0.373 9 WILS (default) 0.373
10 PEESE (default) 0.541 10 PEESE (default) 0.541
11 puniform (default) 0.544 11 puniform (default) 0.542
12 AK (AK1) 0.556 12 AK (AK1) 0.557
13 WAAPWLS (default) 0.573 13 WAAPWLS (default) 0.573
14 RMA (default) 0.603 14 RMA (default) 0.603
15 WLS (default) 0.612 15 WLS (default) 0.612
16 trimfill (default) 0.615 16 trimfill (default) 0.615
17 mean (default) 0.688 17 mean (default) 0.688
18 FMA (default) 0.720 18 FMA (default) 0.720
19 pcurve (default) NaN 19 pcurve (default) NaN

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 FMA (default) 0.990 1 FMA (default) 0.990
2 WLS (default) 0.981 2 WLS (default) 0.981
3 RMA (default) 0.980 3 RMA (default) 0.980
4 trimfill (default) 0.979 4 trimfill (default) 0.979
5 AK (AK1) 0.970 5 AK (AK2) 0.977
6 mean (default) 0.965 6 AK (AK1) 0.970
7 WAAPWLS (default) 0.956 7 mean (default) 0.965
8 PEESE (default) 0.951 8 WAAPWLS (default) 0.956
9 AK (AK2) 0.949 9 PEESE (default) 0.951
10 SM (3PSM) 0.936 10 SM (3PSM) 0.945
11 puniform (default) 0.911 11 puniform (default) 0.913
12 WILS (default) 0.898 12 SM (4PSM) 0.900
13 PETPEESE (default) 0.884 13 WILS (default) 0.898
14 puniform (star) 0.876 14 PETPEESE (default) 0.884
15 SM (4PSM) 0.869 15 puniform (star) 0.876
16 EK (default) 0.852 16 EK (default) 0.852
17 PET (default) 0.851 17 PET (default) 0.851
18 RoBMA (PSMA) 0.832 18 RoBMA (PSMA) 0.834
19 pcurve (default) NaN 19 pcurve (default) NaN

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

By-Condition Performance (Conditional on Method Convergence)

The results below are conditional on method convergence. Note that the methods might differ in convergence rate and are therefore not compared on the same data sets.

Raincloud plot showing convergence rates across different methods

Raincloud plot showing RMSE (Root Mean Square Error) across different methods

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average RMSE is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of RMSE values on the corresponding outcome scale.

Raincloud plot showing bias across different methods

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0. Methods are compared using condition-wise ranks. Direct comparison using the average bias is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of bias values on the corresponding outcome scale.

Raincloud plot showing empirical standard error across different methods

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance. Methods are compared using condition-wise ranks. Direct comparison using the empirical standard error is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of empirical standard error values on the corresponding outcome scale.

Raincloud plot showing interval scores across different methods

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the interval score is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of interval score values on the corresponding outcome scale.

Raincloud plot showing 95% confidence interval coverage across different methods

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Raincloud plot showing 95% confidence interval width across different methods

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average 95% CI width is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of 95% CI width values on the corresponding outcome scale.

Raincloud plot showing positive likelihood ratio across different methods

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Raincloud plot showing negative likelihood ratio across different methods

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.

Raincloud plot showing Type I Error rates across different methods

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Raincloud plot showing statistical power across different methods

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

By-Condition Performance (Replacement in Case of Non-Convergence)

The results below incorporate method replacement to handle non-convergence. If a method fails to converge, its results are replaced with the results from a simpler method (e.g., random-effects meta-analysis without publication bias adjustment). This emulates what a data analyst may do in practice in case a method does not converge. However, note that these results do not correspond to “pure” method performance as they might combine multiple different methods. See Method Replacement Strategy for details of the method replacement specification.
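The replacement logic can be sketched as a simple fallback chain. The function and method names below are hypothetical illustrations, not the study's actual implementation; see Method Replacement Strategy for the real specification.

```python
def estimate_with_fallback(data, methods):
    """Return the result of the first method in an ordered list that converges.

    `methods` is an ordered list of fitting functions (hypothetical API);
    each returns a result dict, or None on non-convergence.
    """
    for fit in methods:
        result = fit(data)
        if result is not None:  # treat None as non-convergence
            return result
    return None  # no method in the chain converged

# hypothetical fitters: the bias-adjusted method fails to converge,
# so its result is replaced by the simpler random-effects model
def fit_selection_model(data):
    return None  # pretend the selection model did not converge

def fit_random_effects(data):
    return {"method": "RMA", "estimate": sum(data) / len(data)}

result = estimate_with_fallback([0.2, 0.4, 0.3],
                                [fit_selection_model, fit_random_effects])
assert result["method"] == "RMA"
```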

Raincloud plot showing convergence rates across different methods

Raincloud plot showing RMSE (Root Mean Square Error) across different methods

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average RMSE is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of RMSE values on the corresponding outcome scale.

Raincloud plot showing bias across different methods

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0. Methods are compared using condition-wise ranks. Direct comparison using the average bias is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of bias values on the corresponding outcome scale.

Raincloud plot showing empirical standard error across different methods

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance. Methods are compared using condition-wise ranks. Direct comparison using the empirical standard error is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of empirical standard error values on the corresponding outcome scale.

Raincloud plot showing interval scores across different methods

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the interval score is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of interval score values on the corresponding outcome scale.

Raincloud plot showing 95% confidence interval coverage across different methods

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Raincloud plot showing 95% confidence interval width across different methods

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method. Methods are compared using condition-wise ranks. Direct comparison using the average 95% CI width is not possible because the data-generating mechanisms differ in the outcome scale. See the DGM-specific results (or subresults) to see the distribution of 95% CI width values on the corresponding outcome scale.

Raincloud plot showing positive likelihood ratio across different methods

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Raincloud plot showing negative likelihood ratio across different methods

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.

Raincloud plot showing Type I Error rates across different methods

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Raincloud plot showing statistical power across different methods

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

Subset: Standardized Mean Difference Effect Sizes

These results are based on the Stanley (2017) data-generating mechanism with a total of 1 condition.

Average Performance

Method performance measures are aggregated across all simulated conditions to provide an overall impression of method performance. However, keep in mind that a method with a high overall ranking is not necessarily the “best” method for a particular application. To select a suitable method, also consider the non-aggregated performance measures in the conditions most relevant to your application.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 RoBMA (PSMA) 0.057 1 RoBMA (PSMA) 0.057
2 AK (AK2) 0.067 2 AK (AK2) 0.083
3 WILS (default) 0.095 3 WILS (default) 0.095
4 PEESE (default) 0.105 4 PEESE (default) 0.105
5 WAAPWLS (default) 0.108 5 WAAPWLS (default) 0.108
6 PETPEESE (default) 0.109 6 PETPEESE (default) 0.109
7 trimfill (default) 0.111 7 trimfill (default) 0.111
8 FMA (default) 0.112 8 FMA (default) 0.112
8 WLS (default) 0.112 8 WLS (default) 0.112
10 EK (default) 0.119 10 EK (default) 0.119
11 PET (default) 0.119 11 PET (default) 0.119
12 RMA (default) 0.123 12 RMA (default) 0.123
13 SM (3PSM) 0.130 13 SM (3PSM) 0.126
14 mean (default) 0.146 14 mean (default) 0.146
15 AK (AK1) 0.163 15 AK (AK1) 0.158
16 SM (4PSM) 0.202 16 SM (4PSM) 0.203
17 pcurve (default) 0.467 17 pcurve (default) 0.420
18 puniform (default) 0.648 18 puniform (default) 0.545
19 puniform (star) 90.749 19 puniform (star) 90.749

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 RoBMA (PSMA) 0.006 1 RoBMA (PSMA) 0.006
2 WILS (default) 0.007 2 WILS (default) 0.007
3 AK (AK2) 0.010 3 AK (AK2) 0.011
4 PET (default) 0.015 4 PET (default) 0.015
5 EK (default) 0.015 5 EK (default) 0.015
6 SM (4PSM) -0.023 6 SM (4PSM) -0.022
7 SM (3PSM) 0.023 7 SM (3PSM) 0.025
8 PETPEESE (default) 0.036 8 PETPEESE (default) 0.036
9 PEESE (default) 0.057 9 PEESE (default) 0.057
10 AK (AK1) 0.060 10 AK (AK1) 0.061
11 trimfill (default) 0.069 11 trimfill (default) 0.069
12 WAAPWLS (default) 0.076 12 WAAPWLS (default) 0.076
13 FMA (default) 0.085 13 FMA (default) 0.085
13 WLS (default) 0.085 13 WLS (default) 0.085
15 puniform (default) 0.088 15 RMA (default) 0.103
16 RMA (default) 0.103 16 puniform (default) 0.103
17 mean (default) 0.125 17 mean (default) 0.125
18 pcurve (default) 0.322 18 pcurve (default) 0.260
19 puniform (star) -9.762 19 puniform (star) -9.762

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 RMA (default) 0.038 1 RMA (default) 0.038
2 mean (default) 0.042 2 mean (default) 0.042
3 FMA (default) 0.043 3 FMA (default) 0.043
3 WLS (default) 0.043 3 WLS (default) 0.043
5 RoBMA (PSMA) 0.045 5 RoBMA (PSMA) 0.045
6 trimfill (default) 0.046 6 trimfill (default) 0.046
7 WAAPWLS (default) 0.047 7 WAAPWLS (default) 0.047
8 PEESE (default) 0.058 8 PEESE (default) 0.058
9 AK (AK2) 0.059 9 WILS (default) 0.063
10 WILS (default) 0.063 10 AK (AK2) 0.076
11 PETPEESE (default) 0.079 11 PETPEESE (default) 0.079
12 EK (default) 0.095 12 EK (default) 0.095
13 PET (default) 0.095 13 PET (default) 0.095
14 SM (3PSM) 0.111 14 SM (3PSM) 0.107
15 AK (AK1) 0.113 15 AK (AK1) 0.109
16 SM (4PSM) 0.200 16 SM (4PSM) 0.201
17 pcurve (default) 0.290 17 pcurve (default) 0.286
18 puniform (default) 0.550 18 puniform (default) 0.448
19 puniform (star) 89.867 19 puniform (star) 89.867

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 RoBMA (PSMA) 0.271 1 RoBMA (PSMA) 0.271
2 AK (AK2) 0.508 2 AK (AK2) 0.641
3 puniform (star) 0.914 3 SM (4PSM) 0.891
4 EK (default) 1.027 4 puniform (star) 0.914
5 SM (3PSM) 1.058 5 EK (default) 1.027
6 PET (default) 1.097 6 SM (3PSM) 1.042
7 SM (4PSM) 1.140 7 PET (default) 1.097
8 PETPEESE (default) 1.536 8 PETPEESE (default) 1.536
9 WILS (default) 1.658 9 WILS (default) 1.658
10 PEESE (default) 1.834 10 PEESE (default) 1.834
11 WAAPWLS (default) 2.206 11 WAAPWLS (default) 2.206
12 trimfill (default) 2.263 12 trimfill (default) 2.263
13 WLS (default) 2.545 13 WLS (default) 2.545
14 RMA (default) 2.805 14 RMA (default) 2.805
15 FMA (default) 3.110 15 FMA (default) 3.110
16 AK (AK1) 3.353 16 AK (AK1) 3.157
17 puniform (default) 4.061 17 puniform (default) 3.890
18 mean (default) 4.067 18 mean (default) 4.067
19 pcurve (default) NaN 19 pcurve (default) NaN

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 RoBMA (PSMA) 0.950 1 RoBMA (PSMA) 0.950
2 SM (4PSM) 0.924 2 SM (4PSM) 0.911
3 AK (AK2) 0.902 3 AK (AK2) 0.904
4 puniform (star) 0.830 4 puniform (star) 0.830
5 SM (3PSM) 0.807 5 SM (3PSM) 0.802
6 EK (default) 0.744 6 EK (default) 0.744
7 PETPEESE (default) 0.718 7 PETPEESE (default) 0.718
8 PET (default) 0.716 8 PET (default) 0.716
9 AK (AK1) 0.666 9 AK (AK1) 0.665
10 PEESE (default) 0.573 10 PEESE (default) 0.573
11 WAAPWLS (default) 0.571 11 WAAPWLS (default) 0.571
12 trimfill (default) 0.550 12 trimfill (default) 0.550
13 RMA (default) 0.546 13 RMA (default) 0.546
14 WLS (default) 0.537 14 WLS (default) 0.537
15 puniform (default) 0.531 15 puniform (default) 0.532
16 WILS (default) 0.524 16 WILS (default) 0.524
17 FMA (default) 0.414 17 FMA (default) 0.414
18 mean (default) 0.376 18 mean (default) 0.376
19 pcurve (default) NaN 19 pcurve (default) NaN

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 FMA (default) 0.079 1 FMA (default) 0.079
2 mean (default) 0.105 2 mean (default) 0.105
3 WILS (default) 0.130 3 WILS (default) 0.130
4 WLS (default) 0.134 4 WLS (default) 0.134
5 WAAPWLS (default) 0.149 5 WAAPWLS (default) 0.149
6 trimfill (default) 0.167 6 trimfill (default) 0.167
7 PEESE (default) 0.171 7 PEESE (default) 0.171
8 RMA (default) 0.173 8 RMA (default) 0.173
9 RoBMA (PSMA) 0.191 9 RoBMA (PSMA) 0.191
10 PETPEESE (default) 0.235 10 PETPEESE (default) 0.235
11 AK (AK2) 0.255 11 puniform (star) 0.278
12 puniform (star) 0.278 12 PET (default) 0.300
13 PET (default) 0.300 13 SM (3PSM) 0.321
14 SM (3PSM) 0.366 14 EK (default) 0.367
15 EK (default) 0.367 15 puniform (default) 0.453
16 puniform (default) 0.588 16 AK (AK2) 0.464
17 SM (4PSM) 0.992 17 SM (4PSM) 0.655
18 AK (AK1) 1.982 18 AK (AK1) 1.781
19 pcurve (default) NaN 19 pcurve (default) NaN

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Log Value Rank Method Log Value
1 RoBMA (PSMA) 5.324 1 RoBMA (PSMA) 5.324
2 AK (AK2) 3.027 2 AK (AK2) 2.486
3 SM (4PSM) 2.431 3 SM (4PSM) 2.441
4 puniform (star) 2.007 4 puniform (star) 2.007
5 SM (3PSM) 1.871 5 SM (3PSM) 1.852
6 PET (default) 1.818 6 PET (default) 1.818
7 EK (default) 1.818 7 EK (default) 1.818
8 PETPEESE (default) 1.677 8 PETPEESE (default) 1.677
9 AK (AK1) 1.339 9 AK (AK1) 1.313
10 WILS (default) 1.130 10 WILS (default) 1.130
11 RMA (default) 1.060 11 RMA (default) 1.060
12 PEESE (default) 0.977 12 PEESE (default) 0.977
13 puniform (default) 0.962 13 puniform (default) 0.966
14 WAAPWLS (default) 0.946 14 WAAPWLS (default) 0.946
15 WLS (default) 0.882 15 WLS (default) 0.882
16 trimfill (default) 0.825 16 trimfill (default) 0.825
17 mean (default) 0.651 17 mean (default) 0.651
18 FMA (default) 0.566 18 FMA (default) 0.566
19 pcurve (default) NaN 19 pcurve (default) NaN

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Log Value Rank Method Log Value
1 RoBMA (PSMA) -5.344 1 AK (AK2) -6.371
2 PETPEESE (default) -5.031 2 RoBMA (PSMA) -5.344
3 AK (AK2) -4.939 3 SM (4PSM) -5.289
4 PET (default) -4.925 4 PETPEESE (default) -5.031
5 EK (default) -4.925 5 SM (3PSM) -4.949
6 SM (3PSM) -4.896 6 PET (default) -4.925
7 SM (4PSM) -4.804 7 EK (default) -4.925
8 puniform (star) -4.121 8 puniform (star) -4.121
9 puniform (default) -3.885 9 puniform (default) -3.892
10 PEESE (default) -3.691 10 PEESE (default) -3.691
11 WAAPWLS (default) -3.494 11 WAAPWLS (default) -3.494
12 trimfill (default) -3.454 12 trimfill (default) -3.454
13 AK (AK1) -3.453 13 AK (AK1) -3.453
14 WLS (default) -3.274 14 WLS (default) -3.274
15 WILS (default) -3.233 15 WILS (default) -3.233
16 RMA (default) -3.200 16 RMA (default) -3.200
17 FMA (default) -3.049 17 FMA (default) -3.049
18 mean (default) -3.038 18 mean (default) -3.038
19 pcurve (default) NaN 19 pcurve (default) NaN

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.
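Given a method's power and type I error rate, both (log) likelihood ratios reduce to simple quotients. A sketch with hypothetical rates (80% power, 5% type I error; not figures from the tables above):

```python
import math

def log_likelihood_ratios(power, type_i_error):
    """Log positive and negative likelihood ratios of a significance test.
    log PLR = log(power / type I error); useful methods have log PLR > 0.
    log NLR = log((1 - power) / (1 - type I error)); useful methods have
    log NLR < 0."""
    return (math.log(power / type_i_error),
            math.log((1.0 - power) / (1.0 - type_i_error)))

log_plr, log_nlr = log_likelihood_ratios(power=0.80, type_i_error=0.05)
# log_plr ≈ 2.77: a significant result strongly favours the alternative.
# log_nlr ≈ -1.56: a non-significant result favours the null.
```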

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 RoBMA (PSMA) 0.025 1 RoBMA (PSMA) 0.025
2 AK (AK2) 0.073 2 SM (4PSM) 0.109
3 SM (4PSM) 0.095 3 AK (AK2) 0.136
4 PET (default) 0.259 4 PET (default) 0.259
5 EK (default) 0.259 5 EK (default) 0.259
6 puniform (star) 0.268 6 puniform (star) 0.268
7 PETPEESE (default) 0.296 7 PETPEESE (default) 0.296
8 SM (3PSM) 0.308 8 SM (3PSM) 0.314
9 WILS (default) 0.410 9 WILS (default) 0.410
10 PEESE (default) 0.568 10 PEESE (default) 0.568
11 AK (AK1) 0.579 11 AK (AK1) 0.579
12 WAAPWLS (default) 0.590 12 WAAPWLS (default) 0.590
13 puniform (default) 0.615 13 puniform (default) 0.612
14 RMA (default) 0.623 14 RMA (default) 0.623
15 WLS (default) 0.634 15 WLS (default) 0.634
16 trimfill (default) 0.638 16 trimfill (default) 0.638
17 mean (default) 0.723 17 mean (default) 0.723
18 FMA (default) 0.753 18 FMA (default) 0.753
19 pcurve (default) NaN 19 pcurve (default) NaN

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 mean (default) 0.999 1 mean (default) 0.999
2 FMA (default) 0.999 2 FMA (default) 0.999
3 puniform (default) 0.997 3 puniform (default) 0.997
4 AK (AK1) 0.990 4 AK (AK1) 0.990
5 WLS (default) 0.989 5 WLS (default) 0.989
6 RMA (default) 0.989 6 RMA (default) 0.989
7 trimfill (default) 0.988 7 trimfill (default) 0.988
8 WAAPWLS (default) 0.972 8 AK (AK2) 0.983
9 PEESE (default) 0.969 9 WAAPWLS (default) 0.972
10 AK (AK2) 0.968 10 PEESE (default) 0.969
11 SM (3PSM) 0.960 11 SM (3PSM) 0.965
12 PETPEESE (default) 0.931 12 SM (4PSM) 0.938
13 EK (default) 0.905 13 PETPEESE (default) 0.931
13 PET (default) 0.905 14 EK (default) 0.905
15 SM (4PSM) 0.903 14 PET (default) 0.905
16 WILS (default) 0.901 16 WILS (default) 0.901
17 RoBMA (PSMA) 0.898 17 RoBMA (PSMA) 0.898
18 puniform (star) 0.886 18 puniform (star) 0.886
19 pcurve (default) NaN 19 pcurve (default) NaN

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.
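Type I error and power are the same rejection proportion computed under different data-generating hypotheses. A minimal sketch with invented p-values (not from the simulation):

```python
alpha = 0.05  # nominal significance level

# Invented p-values for illustration only.
p_null = [0.30, 0.02, 0.80, 0.45, 0.60]   # runs simulated under the null
p_alt  = [0.01, 0.04, 0.20, 0.003, 0.06]  # runs simulated under the alternative

# Type I error: rejection rate when the null is true (1 of 5 runs).
type_i_error = sum(p < alpha for p in p_null) / len(p_null)  # 0.2

# Power: rejection rate when the alternative is true (3 of 5 runs).
power = sum(p < alpha for p in p_alt) / len(p_alt)           # 0.6
```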

By-Condition Performance (Conditional on Method Convergence)

The results below are conditional on method convergence. Note that the methods might differ in convergence rate and are therefore not compared on the same data sets.

Raincloud plot showing convergence rates across different methods

Raincloud plot showing RMSE (Root Mean Square Error) across different methods

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method. Values larger than 0.5 are visualized as 0.5.

Raincloud plot showing bias across different methods

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0. Values lower than -0.5 or larger than 0.5 are visualized as -0.5 and 0.5 respectively.

Raincloud plot showing empirical SE across different methods

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance. Values larger than 0.5 are visualized as 0.5.

Raincloud plot showing interval score across different methods

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method. Values larger than 100 are visualized as 100.

Raincloud plot showing 95% confidence interval coverage across different methods

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Raincloud plot showing 95% confidence interval width across different methods

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method.

Raincloud plot showing positive likelihood ratio across different methods

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Raincloud plot showing negative likelihood ratio across different methods

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.

Raincloud plot showing Type I Error rates across different methods

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Raincloud plot showing statistical power across different methods

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

By-Condition Performance (Replacement in Case of Non-Convergence)

The results below incorporate method replacement to handle non-convergence. If a method fails to converge, its results are replaced with the results from a simpler method (e.g., random-effects meta-analysis without publication bias adjustment). This emulates what a data analyst might do in practice when a method does not converge. However, note that these results do not correspond to “pure” method performance, as they might combine multiple different methods. See Method Replacement Strategy for details of the method replacement specification.

Raincloud plot showing convergence rates across different methods

Raincloud plot showing RMSE (Root Mean Square Error) across different methods

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method. Values larger than 0.5 are visualized as 0.5.

Raincloud plot showing bias across different methods

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0. Values lower than -0.5 or larger than 0.5 are visualized as -0.5 and 0.5 respectively.

Raincloud plot showing empirical SE across different methods

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance. Values larger than 0.5 are visualized as 0.5.

Raincloud plot showing interval score across different methods

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method. Values larger than 100 are visualized as 100.

Raincloud plot showing 95% confidence interval coverage across different methods

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Raincloud plot showing 95% confidence interval width across different methods

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method.

Raincloud plot showing positive likelihood ratio across different methods

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Raincloud plot showing negative likelihood ratio across different methods

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.

Raincloud plot showing Type I Error rates across different methods

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Raincloud plot showing statistical power across different methods

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

Subset: Log Odds Ratio Effect Sizes

These results are based on the Stanley (2017) data-generating mechanism with a total of 1 condition.

Average Performance

Method performance measures are aggregated across all simulated conditions to provide an overall impression of method performance. However, keep in mind that a method with a high overall ranking is not necessarily the “best” method for a particular application. To select a suitable method for your application, consider also non-aggregated performance measures in conditions most relevant to your application.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 RoBMA (PSMA) 0.206 1 RoBMA (PSMA) 0.238
2 EK (default) 0.263 2 EK (default) 0.263
3 PET (default) 0.264 3 PET (default) 0.264
4 AK (AK2) 0.265 4 SM (3PSM) 0.273
5 WILS (default) 0.274 5 WILS (default) 0.274
6 SM (3PSM) 0.274 6 PETPEESE (default) 0.297
7 PETPEESE (default) 0.297 7 SM (4PSM) 0.308
8 PEESE (default) 0.313 8 PEESE (default) 0.313
9 SM (4PSM) 0.316 9 trimfill (default) 0.348
10 trimfill (default) 0.348 10 WAAPWLS (default) 0.352
11 WAAPWLS (default) 0.352 11 FMA (default) 0.372
12 FMA (default) 0.372 12 WLS (default) 0.372
13 WLS (default) 0.372 13 RMA (default) 0.391
14 RMA (default) 0.391 14 mean (default) 0.501
15 mean (default) 0.501 15 AK (AK2) 1.061
16 pcurve (default) 1.293 16 pcurve (default) 1.127
17 AK (AK1) 1.581 17 puniform (default) 1.339
18 puniform (default) 1.713 18 AK (AK1) 1.429
19 puniform (star) 157.696 19 puniform (star) 157.696

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method.
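RMSE decomposes into bias and empirical SE: squared RMSE equals squared bias plus the variance of the estimates (exactly so when both use the same n denominator). A sketch with invented estimates (not values from this study):

```python
import math

true_effect = 0.2
estimates = [0.15, 0.30, 0.22, 0.05, 0.28]  # invented meta-analytic estimates
n = len(estimates)

# Bias: average signed deviation from the true effect.
bias = sum(e - true_effect for e in estimates) / n

# RMSE: root of the average squared deviation from the true effect.
rmse = math.sqrt(sum((e - true_effect) ** 2 for e in estimates) / n)

# With the n (rather than n - 1) denominator for the SD,
# rmse**2 == bias**2 + sd**2 holds exactly.
mean_est = sum(estimates) / n
sd = math.sqrt(sum((e - mean_est) ** 2 for e in estimates) / n)
assert abs(rmse ** 2 - (bias ** 2 + sd ** 2)) < 1e-12
```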

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 SM (4PSM) 0.160 1 puniform (default) -0.069
2 EK (default) 0.171 2 SM (4PSM) 0.166
3 PET (default) 0.171 3 EK (default) 0.171
4 RoBMA (PSMA) 0.184 4 PET (default) 0.171
5 SM (3PSM) 0.214 5 RoBMA (PSMA) 0.204
6 PETPEESE (default) 0.220 6 PETPEESE (default) 0.220
7 AK (AK2) 0.231 7 SM (3PSM) 0.222
8 puniform (default) -0.240 8 WILS (default) 0.240
9 WILS (default) 0.240 9 AK (AK1) 0.253
10 AK (AK1) 0.248 10 AK (AK2) 0.259
11 PEESE (default) 0.284 11 PEESE (default) 0.284
12 trimfill (default) 0.329 12 trimfill (default) 0.329
13 WAAPWLS (default) 0.335 13 WAAPWLS (default) 0.335
14 FMA (default) 0.356 14 FMA (default) 0.356
15 WLS (default) 0.356 15 WLS (default) 0.356
16 RMA (default) 0.375 16 RMA (default) 0.375
17 mean (default) 0.464 17 mean (default) 0.464
18 pcurve (default) 0.933 18 pcurve (default) 0.698
19 puniform (star) -3.661 19 puniform (star) -3.661

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 FMA (default) 0.058 1 FMA (default) 0.058
2 WLS (default) 0.058 2 WLS (default) 0.058
3 trimfill (default) 0.060 3 trimfill (default) 0.060
4 RMA (default) 0.061 4 RMA (default) 0.061
5 RoBMA (PSMA) 0.064 5 WAAPWLS (default) 0.067
6 WAAPWLS (default) 0.067 6 WILS (default) 0.075
7 WILS (default) 0.075 7 PEESE (default) 0.088
8 AK (AK2) 0.075 8 RoBMA (PSMA) 0.091
9 PEESE (default) 0.088 9 SM (3PSM) 0.095
10 SM (3PSM) 0.100 10 mean (default) 0.113
11 mean (default) 0.113 11 PETPEESE (default) 0.134
12 PETPEESE (default) 0.134 12 EK (default) 0.141
13 EK (default) 0.141 13 PET (default) 0.142
14 PET (default) 0.142 14 SM (4PSM) 0.170
15 SM (4PSM) 0.179 15 pcurve (default) 0.843
16 pcurve (default) 0.812 16 AK (AK2) 0.861
17 AK (AK1) 1.373 17 puniform (default) 1.175
18 puniform (default) 1.516 18 AK (AK1) 1.222
19 puniform (star) 156.858 19 puniform (star) 156.858

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 EK (default) 3.581 1 EK (default) 3.581
2 PET (default) 3.689 2 PET (default) 3.689
3 RoBMA (PSMA) 3.887 3 RoBMA (PSMA) 4.717
4 SM (4PSM) 4.723 4 SM (4PSM) 4.768
5 SM (3PSM) 5.619 5 SM (3PSM) 5.735
6 puniform (star) 5.831 6 puniform (star) 5.831
7 AK (AK2) 5.959 7 puniform (default) 6.525
8 PETPEESE (default) 6.582 8 PETPEESE (default) 6.582
9 WILS (default) 7.531 9 WILS (default) 7.531
10 PEESE (default) 7.667 10 PEESE (default) 7.667
11 puniform (default) 7.877 11 trimfill (default) 9.544
12 trimfill (default) 9.543 12 WAAPWLS (default) 9.694
13 WAAPWLS (default) 9.694 13 FMA (default) 10.660
14 FMA (default) 10.660 14 WLS (default) 10.776
15 WLS (default) 10.776 15 RMA (default) 10.804
16 RMA (default) 10.804 16 AK (AK2) 12.194
17 mean (default) 13.014 17 mean (default) 13.014
18 AK (AK1) 44.040 18 AK (AK1) 32.961
19 pcurve (default) NaN 19 pcurve (default) NaN

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 RoBMA (PSMA) 0.661 1 RoBMA (PSMA) 0.625
2 puniform (default) 0.592 2 puniform (default) 0.588
3 EK (default) 0.574 3 EK (default) 0.574
4 PET (default) 0.548 4 PET (default) 0.548
5 PETPEESE (default) 0.496 5 PETPEESE (default) 0.496
6 SM (4PSM) 0.454 6 SM (4PSM) 0.443
7 puniform (star) 0.440 7 puniform (star) 0.440
8 SM (3PSM) 0.423 8 SM (3PSM) 0.407
9 AK (AK2) 0.394 9 WILS (default) 0.343
10 WILS (default) 0.343 10 AK (AK1) 0.319
11 AK (AK1) 0.321 11 mean (default) 0.314
12 mean (default) 0.314 12 AK (AK2) 0.297
13 PEESE (default) 0.277 13 PEESE (default) 0.277
14 trimfill (default) 0.231 14 trimfill (default) 0.231
15 RMA (default) 0.226 15 RMA (default) 0.226
16 FMA (default) 0.213 16 FMA (default) 0.213
17 WAAPWLS (default) 0.205 17 WAAPWLS (default) 0.205
18 WLS (default) 0.200 18 WLS (default) 0.200
19 pcurve (default) NaN 19 pcurve (default) NaN

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 WILS (default) 0.221 1 WILS (default) 0.221
2 WLS (default) 0.235 2 WLS (default) 0.235
3 WAAPWLS (default) 0.249 3 WAAPWLS (default) 0.249
4 FMA (default) 0.250 4 FMA (default) 0.250
5 RoBMA (PSMA) 0.271 5 RoBMA (PSMA) 0.271
6 trimfill (default) 0.274 6 trimfill (default) 0.274
7 RMA (default) 0.298 7 RMA (default) 0.298
8 PEESE (default) 0.299 8 PEESE (default) 0.299
9 AK (AK2) 0.322 9 SM (3PSM) 0.332
10 SM (3PSM) 0.344 10 puniform (star) 0.407
11 puniform (star) 0.407 11 PETPEESE (default) 0.457
12 PETPEESE (default) 0.457 12 SM (4PSM) 0.503
13 PET (default) 0.536 13 PET (default) 0.536
14 SM (4PSM) 0.569 14 EK (default) 0.658
15 EK (default) 0.658 15 mean (default) 1.505
16 mean (default) 1.505 16 puniform (default) 2.751
17 puniform (default) 4.072 17 AK (AK2) 5.859
18 AK (AK1) 38.095 18 AK (AK1) 26.987
19 pcurve (default) NaN 19 pcurve (default) NaN

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Log Value Rank Method Log Value
1 RoBMA (PSMA) 5.471 1 puniform (default) 4.298
2 puniform (default) 4.600 2 RoBMA (PSMA) 2.869
3 puniform (star) 2.747 3 puniform (star) 2.747
4 AK (AK2) 2.708 4 SM (3PSM) 2.476
5 SM (3PSM) 2.473 5 PETPEESE (default) 2.425
6 PETPEESE (default) 2.425 6 WILS (default) 2.268
7 WILS (default) 2.268 7 AK (AK1) 1.940
8 AK (AK1) 2.070 8 EK (default) 1.922
9 EK (default) 1.922 9 PET (default) 1.913
10 PET (default) 1.913 10 AK (AK2) 1.782
11 FMA (default) 1.678 11 FMA (default) 1.678
12 PEESE (default) 1.605 12 PEESE (default) 1.605
13 mean (default) 1.593 13 mean (default) 1.593
14 RMA (default) 1.539 14 RMA (default) 1.539
15 WLS (default) 1.530 15 WLS (default) 1.530
16 WAAPWLS (default) 1.469 16 SM (4PSM) 1.478
17 trimfill (default) 1.432 17 WAAPWLS (default) 1.469
18 SM (4PSM) 1.425 18 trimfill (default) 1.432
19 pcurve (default) NaN 19 pcurve (default) NaN

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Log Value Rank Method Log Value
1 WILS (default) -5.267 1 AK (AK2) -5.801
2 puniform (star) -4.910 2 WILS (default) -5.267
3 SM (3PSM) -4.904 3 SM (3PSM) -4.940
4 PEESE (default) -4.288 4 puniform (star) -4.910
5 FMA (default) -3.859 5 PEESE (default) -4.288
6 WLS (default) -3.837 6 FMA (default) -3.859
7 AK (AK2) -3.785 7 WLS (default) -3.837
8 RMA (default) -3.730 8 RMA (default) -3.730
9 trimfill (default) -3.682 9 trimfill (default) -3.682
10 PETPEESE (default) -3.556 10 PETPEESE (default) -3.556
11 AK (AK1) -3.477 11 AK (AK1) -3.488
12 EK (default) -3.238 12 EK (default) -3.238
13 PET (default) -3.236 13 PET (default) -3.236
14 SM (4PSM) -3.063 14 SM (4PSM) -3.100
15 puniform (default) -2.518 15 puniform (default) -2.517
16 RoBMA (PSMA) -2.388 16 RoBMA (PSMA) -2.460
17 WAAPWLS (default) -1.596 17 WAAPWLS (default) -1.596
18 mean (default) -1.432 18 mean (default) -1.432
19 pcurve (default) NaN 19 pcurve (default) NaN

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 RoBMA (PSMA) 0.003 1 puniform (default) 0.017
2 puniform (default) 0.009 2 puniform (star) 0.053
3 puniform (star) 0.053 3 PETPEESE (default) 0.072
4 PETPEESE (default) 0.072 4 PET (default) 0.075
5 PET (default) 0.075 5 EK (default) 0.076
6 EK (default) 0.076 6 RoBMA (PSMA) 0.080
7 SM (3PSM) 0.089 7 SM (3PSM) 0.096
8 WILS (default) 0.099 8 WILS (default) 0.099
9 AK (AK2) 0.130 9 SM (4PSM) 0.238
10 SM (4PSM) 0.237 10 AK (AK2) 0.257
11 PEESE (default) 0.334 11 PEESE (default) 0.334
12 AK (AK1) 0.385 12 AK (AK1) 0.387
13 mean (default) 0.420 13 mean (default) 0.420
14 trimfill (default) 0.444 14 trimfill (default) 0.444
15 WAAPWLS (default) 0.445 15 WAAPWLS (default) 0.445
16 WLS (default) 0.446 16 WLS (default) 0.446
17 RMA (default) 0.450 17 RMA (default) 0.450
18 FMA (default) 0.468 18 FMA (default) 0.468
19 pcurve (default) NaN 19 pcurve (default) NaN

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Conditional on Convergence
Replacement if Non-Convergence
Rank Method Value Rank Method Value
1 FMA (default) 0.958 1 FMA (default) 0.958
2 WLS (default) 0.950 2 AK (AK2) 0.958
3 RMA (default) 0.948 3 WLS (default) 0.950
4 trimfill (default) 0.943 4 RMA (default) 0.948
5 WAAPWLS (default) 0.896 5 trimfill (default) 0.943
6 AK (AK1) 0.893 6 WAAPWLS (default) 0.896
7 WILS (default) 0.888 7 AK (AK1) 0.894
8 AK (AK2) 0.886 8 WILS (default) 0.888
9 PEESE (default) 0.882 9 PEESE (default) 0.882
10 SM (3PSM) 0.850 10 SM (3PSM) 0.871
11 puniform (star) 0.842 11 puniform (star) 0.842
12 mean (default) 0.835 12 mean (default) 0.835
13 SM (4PSM) 0.742 13 SM (4PSM) 0.759
14 PETPEESE (default) 0.708 14 PETPEESE (default) 0.708
15 EK (default) 0.654 15 EK (default) 0.654
16 PET (default) 0.652 16 PET (default) 0.652
17 puniform (default) 0.591 17 puniform (default) 0.597
18 RoBMA (PSMA) 0.584 18 RoBMA (PSMA) 0.596
19 pcurve (default) NaN 19 pcurve (default) NaN

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

By-Condition Performance (Conditional on Method Convergence)

The results below are conditional on method convergence. Note that the methods might differ in convergence rate and are therefore not compared on the same data sets.

Raincloud plot showing convergence rates across different methods

Raincloud plot showing RMSE (Root Mean Square Error) across different methods

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method. Values larger than 0.5 are visualized as 0.5.

Raincloud plot showing bias across different methods

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0. Values lower than -0.5 or larger than 0.5 are visualized as -0.5 and 0.5 respectively.

Raincloud plot showing empirical SE across different methods

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance. Values larger than 0.5 are visualized as 0.5.

Raincloud plot showing interval score across different methods

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method. Values larger than 100 are visualized as 100.

Raincloud plot showing 95% confidence interval coverage across different methods

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Raincloud plot showing 95% confidence interval width across different methods

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method.

Raincloud plot showing positive likelihood ratio across different methods

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Raincloud plot showing negative likelihood ratio across different methods

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.

Raincloud plot showing Type I Error rates across different methods

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Raincloud plot showing statistical power across different methods

The power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. A higher power indicates a better method.

By-Condition Performance (Replacement in Case of Non-Convergence)

Raincloud plot showing convergence rates across different methods

Raincloud plot showing RMSE (Root Mean Square Error) across different methods

RMSE (Root Mean Square Error) is an overall summary measure of estimation performance that combines bias and empirical SE. RMSE is the square root of the average squared difference between the meta-analytic estimate and the true effect across simulation runs. A lower RMSE indicates a better method. Values larger than 0.5 are visualized as 0.5.

Raincloud plot showing bias across different methods

Bias is the average difference between the meta-analytic estimate and the true effect across simulation runs. Ideally, this value should be close to 0. Values lower than -0.5 or larger than 0.5 are visualized as -0.5 and 0.5 respectively.

Raincloud plot showing empirical SE across different methods

The empirical SE is the standard deviation of the meta-analytic estimate across simulation runs. A lower empirical SE indicates less variability and better method performance. Values larger than 0.5 are visualized as 0.5.

Raincloud plot showing interval score across different methods

The interval score measures the accuracy of a confidence interval by combining its width and coverage. It penalizes intervals that are too wide or that fail to include the true value. A lower interval score indicates a better method. Values larger than 100 are visualized as 100.

Raincloud plot showing 95% confidence interval coverage across different methods

95% CI coverage is the proportion of simulation runs in which the 95% confidence interval contained the true effect. Ideally, this value should be close to the nominal level of 95%.

Raincloud plot showing 95% confidence interval width across different methods

95% CI width is the average length of the 95% confidence interval for the true effect. A lower average 95% CI length indicates a better method.

Raincloud plot showing positive likelihood ratio across different methods

The positive likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a positive likelihood ratio greater than 1 (or a log positive likelihood ratio greater than 0). A higher (log) positive likelihood ratio indicates a better method.

Raincloud plot showing negative likelihood ratio across different methods

The negative likelihood ratio is an overall summary measure of hypothesis testing performance that combines power and type I error rate. It indicates how much a non-significant test result changes the odds of the alternative hypothesis versus the null hypothesis. A useful method has a negative likelihood ratio less than 1 (or a log negative likelihood ratio less than 0). A lower (log) negative likelihood ratio indicates a better method.
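Both likelihood ratios follow directly from power and the type I error rate; a sketch with illustrative values (not from the benchmark):

```r
# Illustrative operating characteristics of a hypothetical method:
power       <- 0.80  # P(significant | alternative true)
type1_error <- 0.05  # P(significant | null true)

# Positive LR: how much a significant result raises the odds of H1 vs H0.
plr <- power / type1_error              # 0.80 / 0.05 = 16
# Negative LR: how much a non-significant result lowers those odds.
nlr <- (1 - power) / (1 - type1_error)  # 0.20 / 0.95, roughly 0.21
```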

Raincloud plot showing Type I Error rates across different methods

The type I error rate is the proportion of simulation runs in which the null hypothesis of no effect was incorrectly rejected when it was true. Ideally, this value should be close to the nominal level of 5%.

Raincloud plot showing statistical power across different methods

Power is the proportion of simulation runs in which the null hypothesis of no effect was correctly rejected when the alternative hypothesis was true. Higher power indicates a better method.
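Type I error rate and power are the same rejection-rate computation applied to runs simulated under the null and under the alternative, respectively; a sketch with hypothetical p-values:

```r
# Hypothetical p-values from runs simulated under H0 and under H1:
p_null <- c(0.030, 0.200, 0.450, 0.700, 0.900)
p_alt  <- c(0.001, 0.010, 0.030, 0.080, 0.002)

type1_error <- mean(p_null < 0.05)  # proportion of false rejections: 1/5
power       <- mean(p_alt  < 0.05)  # proportion of correct rejections: 4/5
```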

Session Info

This report was compiled on Thu Oct 23 14:05:31 2025 (UTC) using the following computational environment:

## R version 4.5.1 (2025-06-13)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## time zone: UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] scales_1.4.0                   ggdist_3.3.3                  
## [3] ggplot2_4.0.0                  PublicationBiasBenchmark_0.1.0
## 
## loaded via a namespace (and not attached):
##  [1] generics_0.1.4       sandwich_3.1-1       sass_0.4.10         
##  [4] xml2_1.4.0           stringi_1.8.7        lattice_0.22-7      
##  [7] httpcode_0.3.0       digest_0.6.37        magrittr_2.0.4      
## [10] evaluate_1.0.5       grid_4.5.1           RColorBrewer_1.1-3  
## [13] fastmap_1.2.0        jsonlite_2.0.0       crul_1.6.0          
## [16] urltools_1.7.3.1     httr_1.4.7           purrr_1.1.0         
## [19] viridisLite_0.4.2    textshaping_1.0.4    jquerylib_0.1.4     
## [22] Rdpack_2.6.4         cli_3.6.5            rlang_1.1.6         
## [25] triebeard_0.4.1      rbibutils_2.3        withr_3.0.2         
## [28] cachem_1.1.0         yaml_2.3.10          tools_4.5.1         
## [31] memoise_2.0.1        kableExtra_1.4.0     curl_7.0.0          
## [34] vctrs_0.6.5          R6_2.6.1             clubSandwich_0.6.1  
## [37] zoo_1.8-14           lifecycle_1.0.4      stringr_1.5.2       
## [40] fs_1.6.6             htmlwidgets_1.6.4    ragg_1.5.0          
## [43] pkgconfig_2.0.3      desc_1.4.3           osfr_0.2.9          
## [46] pkgdown_2.1.3        bslib_0.9.0          pillar_1.11.1       
## [49] gtable_0.3.6         Rcpp_1.1.0           glue_1.8.0          
## [52] systemfonts_1.3.1    xfun_0.53            tibble_3.3.0        
## [55] rstudioapi_0.17.1    knitr_1.50           farver_2.1.2        
## [58] htmltools_0.5.8.1    labeling_0.4.3       svglite_2.2.2       
## [61] rmarkdown_2.30       compiler_4.5.1       S7_0.2.0            
## [64] distributional_0.5.0