These results, paired with the runtimes, are really quite impressive! But the contrast between the results for KF distances and those for RF distances is interesting, and seems worth unpacking.
In particular, it's notable that at greater tree sizes the RF distances for PF+FastME seem to converge with those for FastME, exceeding those seen for IQTree/FastTree, with the difference increasing along with tree size.
As you say, RF is just the number of bipartitions that differ between two trees, whereas KF considers differences in both topology and branch length. You find that PF+FastME consistently infers trees with KF distances lower than or equivalent to those of IQTree and FastTree. But as tree size increases, RF distances for PF+FastME increase at a high rate, exceeding those of FastTree and IQTree starting at relatively small trees (~20 tips).
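To make the distinction between the two metrics concrete, here is a minimal sketch in Python. It is a toy illustration only, not your evaluation code: the two five-tip trees and their branch lengths are made up, the function names `rf_distance` and `kf_distance` are hypothetical, and each tree is assumed to be represented as a mapping from a bipartition (written as the smaller side of the split) to the length of the branch that induces it.

```python
import math

def rf_distance(splits_a, splits_b):
    """Robinson-Foulds: count bipartitions found in exactly one of the two trees."""
    return len(set(splits_a) ^ set(splits_b))

def kf_distance(splits_a, splits_b):
    """Kuhner-Felsenstein (branch score): Euclidean distance between branch
    lengths, treating a bipartition absent from a tree as length 0."""
    all_splits = set(splits_a) | set(splits_b)
    return math.sqrt(sum(
        (splits_a.get(s, 0.0) - splits_b.get(s, 0.0)) ** 2 for s in all_splits
    ))

# Terminal branches, identical in both toy trees (made-up lengths).
tips = {frozenset({t}): 0.1 for t in "ABCDE"}

# Reference tree: internal splits AB|CDE and DE|ABC, the latter on a short branch.
reference = {**tips, frozenset({"A", "B"}): 0.10, frozenset({"D", "E"}): 0.02}

# Inferred tree: same AB|CDE split, but a short CD|ABE branch replaces DE|ABC.
inferred = {**tips, frozenset({"A", "B"}): 0.10, frozenset({"C", "D"}): 0.01}

print("RF:", rf_distance(reference, inferred))            # RF: 2
print("KF:", round(kf_distance(reference, inferred), 4))  # KF: 0.0224
```

In this toy case a single misplaced short internal branch contributes the full two units to RF but only about 0.02 to KF, since KF weights each disagreement by branch length. Whether something like this plays a role in your results is only a guess on my part.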
Together, these results suggest that PF+FastME estimates branch lengths well. This is maybe expected but a great thing to see, since PF is effectively trained to infer the evolutionary distances that FastME uses to infer branch lengths! However, despite accurately inferring branch lengths, PF+FastME seems to make more topological errors in the larger trees than the other methods do.
Do you have any intuition as to why this discrepancy arises? Or any thoughts on how you might modify the model/model architecture to better account for and mitigate this effect?