What are your thoughts on the possibility of homology-based data leakage (e.g. https://www.biorxiv.org/content/10.1101/2025.01.22.634321v1.full)?
Given the hierarchical relationships generated by evolution, many genes will share some degree of relatedness both among the "phylogenetically diverse species" and with the held-out species. For example, a deeply conserved metazoan gene with little sequence diversity will be a nonindependent data point; its sequence will be pseudoreplicated (possibly its expression too), allowing information leakage between pretraining and validation sets.
It seems likely that data leakage will happen to varying degrees depending on how conserved each gene is. If this is happening to a substantial degree (i.e., the model is learning high-copy-number, homologous, or otherwise pseudoreplicated sequences best), then it wouldn't be surprising that performance scales with evolutionary distance from humans; the amount of shared homology would predict model performance. It also wouldn't be surprising that TF-Metazoa outperforms other models; by including more pseudoreplicated sequences, it provides more opportunities for data leakage and thus for overfitting.
Is there a convincing way to show that this isn't the case?
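One check that might help, sketched here under stated assumptions: group all genes across species into homology clusters and split at the cluster level, so that no cluster contributes to both the pretraining and validation sets. The names below (`homology_aware_split`, `leaked_clusters`, and the toy gene and cluster IDs) are hypothetical, and the cluster assignments are assumed to come from a separate homology-clustering step that is not shown.

```python
import random
from collections import defaultdict


def homology_aware_split(gene_to_cluster, test_fraction=0.2, seed=0):
    """Split genes so that no homology cluster spans both train and test.

    gene_to_cluster: dict mapping gene ID -> homology cluster ID
    (assumed to be precomputed by an external clustering step).
    """
    # Group genes by their homology cluster.
    clusters = defaultdict(list)
    for gene, cluster in gene_to_cluster.items():
        clusters[cluster].append(gene)

    # Shuffle cluster IDs reproducibly, then assign whole clusters.
    cluster_ids = sorted(clusters)
    random.Random(seed).shuffle(cluster_ids)

    n_total = len(gene_to_cluster)
    train, test = [], []
    for cid in cluster_ids:
        # Fill the test set first, one whole cluster at a time.
        target = test if len(test) < test_fraction * n_total else train
        target.extend(clusters[cid])
    return train, test


def leaked_clusters(train, test, gene_to_cluster):
    """Return the cluster IDs that appear on both sides of a split."""
    train_clusters = {gene_to_cluster[g] for g in train}
    test_clusters = {gene_to_cluster[g] for g in test}
    return train_clusters & test_clusters


# Toy, made-up example: three clusters of orthologs from two species each.
toy_clusters = {
    "human_A": "A", "mouse_A": "A",
    "human_B": "B", "fly_B": "B",
    "worm_C": "C", "fly_C": "C",
}
train_genes, test_genes = homology_aware_split(toy_clusters, test_fraction=0.3)
```

If performance on a leak-free split like this dropped sharply relative to the original split, that would point to homology-driven leakage; if it held up, that would be reassuring. Reporting performance separately for genes with and without close homologs in the pretraining set would be a complementary check.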