Figure 5 shows the normalized PIM values across the three SNP selection thresholds. F1 score consistently emerges as the most influential criterion, with its importance growing as more SNPs are included.
The PIM values (Figure 5) quantify the relative contribution of each selection metric to the final model, yet they appear to be point estimates from a single train/validation split. How stable are the PIM rankings across different random splits, bootstrap samples, or subsamples of the training data? If the dominance of F1 score as the top-weighted metric is sensitive to the specific data partition, the reproducibility of MIXER-selected feature sets in independent applications could suffer. Have you characterized the variance of the PIM estimates, and does the ranking of metrics remain consistent?
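To be concrete about the kind of check we have in mind, a minimal sketch follows. This is not MIXER's actual pipeline: since the PIM computation is not described here, scikit-learn's permutation_importance on a random forest stands in for it, and `X`, `y`, and the metric names are synthetic placeholders. The idea is simply to repeat the train/validation split, recompute the importances, and summarize rank agreement with Kendall's tau.

```python
# Hypothetical sketch, not MIXER's actual code: the PIM computation is not
# described here, so sklearn's permutation_importance on a random forest
# stands in for it, and X, y, and metric_names are synthetic placeholders.
import numpy as np
from scipy.stats import kendalltau
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                  # stand-in predictors
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500) > 0).astype(int)
metric_names = ["F1", "AUC", "MCC", "Precision", "Recall"]  # placeholders

pims = []
for seed in range(50):                         # 50 random train/val splits
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y
    )
    model = RandomForestClassifier(n_estimators=200, random_state=seed)
    model.fit(X_tr, y_tr)
    pim = permutation_importance(
        model, X_val, y_val, n_repeats=20, random_state=seed
    ).importances_mean
    pims.append(pim)

# Pairwise Kendall's tau between splits: values near 1 => stable rankings
taus = [
    kendalltau(pims[i], pims[j])[0]
    for i in range(len(pims))
    for j in range(i + 1, len(pims))
]
print(f"mean pairwise Kendall tau: {np.mean(taus):.2f} +/- {np.std(taus):.2f}")

# How often does the same metric come out on top across splits?
top = [metric_names[int(np.argmax(p))] for p in pims]
print("fraction of splits where F1 ranks first:",
      np.mean([t == "F1" for t in top]))
```

Reporting the mean and spread of the pairwise tau values, along with the fraction of resamples in which F1 score ranks first, would directly address the stability concern raised above.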