In summary, we find striking mathematical similarity among animal ‘song’ vocalizations and human musical sounds regarding the stability of CES, along with some well-defined differences across evolutionary distant taxa that evolved singing behavior independently (i.e. anurans, birds, primates)
I don't think you have yet shown enough to make this claim. The paper would greatly benefit from the use of null models. Given that all the data used in the paper share informational structure (i.e., they're detectable in some way as "songs" to humans), are the relationships you identify actually surprising? It's very hard to gauge this without a null, either empirical or simulated, to compare against.
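To make the suggestion concrete, here is a minimal sketch of one possible simulated null, not a prescription. It assumes each recording is available as a 1-D NumPy array and that your CES estimator can be wrapped as a function returning a single number per recording. The names `block_shuffle`, `null_p_value`, and `stat_fn`, and the choice of block permutation as the surrogate, are my own illustrative assumptions; a different surrogate (or an empirical null built from non-song sounds) may suit your data better.

```python
import numpy as np

def block_shuffle(signal, block_len, rng):
    """Surrogate signal: permute fixed-length blocks, destroying long-range order."""
    n_blocks = len(signal) // block_len
    blocks = signal[: n_blocks * block_len].reshape(n_blocks, block_len)
    return blocks[rng.permutation(n_blocks)].ravel()

def null_p_value(signal, stat_fn, block_len=256, n_surrogates=999, seed=0):
    """Compare an observed statistic with its distribution over shuffled surrogates.

    stat_fn is a hypothetical stand-in for the CES estimator (array -> float).
    """
    rng = np.random.default_rng(seed)
    observed = stat_fn(signal)
    null = np.array([
        stat_fn(block_shuffle(signal, block_len, rng))
        for _ in range(n_surrogates)
    ])
    # Two-sided empirical p-value with the usual +1 correction.
    center = null.mean()
    n_extreme = np.sum(np.abs(null - center) >= abs(observed - center))
    p_value = (n_extreme + 1) / (n_surrogates + 1)
    return observed, null, p_value
```

Reporting the observed value against the surrogate distribution for each taxon would show whether the claimed similarity exceeds what any temporally structured signal would produce.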
Similarly, it's impossible to know whether the comparisons are fair without a deeper examination of how parameter choices (e.g., window sampling size) may differentially affect the CES estimation across song types. Even if you find significant similarity between human and animal song types compared to a null, how will you know that this isn't a product of bias in your CES estimation?
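A simple sensitivity sweep along the following lines would address this. It is only a sketch under assumptions: `estimator` is a hypothetical stand-in for your CES estimation applied to a single window, and non-overlapping windows are my simplification, not necessarily your sampling scheme.

```python
import numpy as np

def window_sensitivity(signal, estimator, window_sizes):
    """Apply `estimator` to non-overlapping windows of each size.

    Returns {window_size: (mean, std)} of the per-window estimates.
    Window sizes longer than the signal are skipped.
    """
    results = {}
    for w in window_sizes:
        n_windows = len(signal) // w
        if n_windows == 0:
            continue
        windows = signal[: n_windows * w].reshape(n_windows, w)
        values = np.array([estimator(win) for win in windows])
        results[w] = (float(values.mean()), float(values.std()))
    return results
```

Running this per song type and plotting the estimates against window size would reveal whether the estimate drifts differently for, say, anuran calls than for human music, which is the bias I am concerned about.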