Each of these issues is evident in MULTI-evolve: the engineering success is real, but the source of performance is that mutational effects are sufficiently additive for the proteins and mutations considered, not that the neural network has learned epistatic synergies.
Thank you for putting together this properly benchmarked analysis. The lack of a purely additive model in the original work is a significant omission. To give credit to the original authors, their biggest success seems to be the PLM ensemble; I suppose some claim of 'synergistic epistasis' could be made there, since the mutants the ensemble proposed do seem to be genuinely beneficial. However, this clearly does not extend to the MLP the authors trained, which is where the claims of capturing epistasis are actually made.
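For anyone unfamiliar with what a purely additive model means here, a minimal sketch follows. It assumes mutational effects are measured on a log scale (so that independent effects sum), and the function names, mutation labels, and numbers are all made up for illustration; none of them come from MULTI-evolve or the analysis above.

```python
from typing import Dict, Iterable


def additive_prediction(single_effects: Dict[str, float],
                        mutations: Iterable[str]) -> float:
    """Predict a multi-mutant's effect as the sum of its single-mutant effects."""
    return sum(single_effects[m] for m in mutations)


def epistasis(observed: float,
              single_effects: Dict[str, float],
              mutations: Iterable[str]) -> float:
    """Deviation of the observed effect from the additive expectation.

    Positive values mean the combination does better than additivity predicts.
    """
    return observed - additive_prediction(single_effects, mutations)


# Toy single-mutant effects (hypothetical labels and values, log scale).
single_effects = {"A10V": 0.5, "G25S": 0.25, "K40R": -0.125}

double_mutant = ["A10V", "G25S"]
expected = additive_prediction(single_effects, double_mutant)

# Hypothetical measured effect for the double mutant.
observed = 1.0
deviation = epistasis(observed, single_effects, double_mutant)

print(f"additive expectation: {expected}, deviation from additivity: {deviation}")
```

If a learned model's predictions for multi-mutants track `additive_prediction` closely, the data alone cannot show that it has captured anything beyond additivity, which is why this baseline matters.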