This is awesome! The impact and relevance of your work are incredibly clear, all data and code are on GitHub (with a very robust README), and you candidly express the limitations of the predictive models (e.g. inaccuracy when predicting oxygen tolerance for certain genera or phyla, which calls for follow-up on the relationship between amino acids and metabolic niches, and the models’ limited precision, which may make them less helpful for cultured microorganisms). I’m looking forward to trying this out myself!
Summary: In this manuscript, Barnum et al. created computational models, favoring simple logistic regression models, that can predict the oxygen tolerance and the ideal temperature, salinity and pH conditions of microbes from novel taxonomic families, requiring only an unannotated, and potentially incomplete, genome from the user.
The authors leveraged empirical data for 15.5k+ microbes and curated the dataset to omit microbes that did not have multiple measured values for a growth phenotype, had only a small difference between the minimum and maximum values tested for the phenotype (<1.5 pH units, <10 °C, or <1.5% NaCl unless salinity was <0.5%), or had fewer than 4 total measurements recorded. Haloarchaea with a salinity optimum <3.7% were also excluded. Finally, the dataset was balanced to reduce taxonomic bias.
The authors then measured correlations between DNA and protein sequence features and oxygen tolerance, temperature, salinity and pH conditions (expressed as Spearman’s rank correlation coefficient, ρ). No correlations were identified between the tested DNA sequence features and the four physicochemical conditions, but numerous correlations were identified between protein sequence features and those conditions. For example, a negative correlation between oxygen tolerance and cysteine frequency was revealed (ρ = -0.49).
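For readers less familiar with the statistic, a minimal sketch of a Spearman rank correlation between cysteine frequency and oxygen tolerance might look like the following. The function names, the toy protein strings and the 0/1 oxygen labels are all made up for illustration; this is not the authors’ code or data.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def cysteine_frequency(protein):
    """Fraction of residues in a protein string that are cysteine (C)."""
    return protein.count("C") / len(protein)


# Hypothetical toy data: (protein sequence, oxygen tolerance as 0 or 1).
toy = [("MCCKCLAC", 0), ("MKCLACGA", 0), ("MKALAGCA", 1), ("MKTLAGAA", 1)]
cys = [cysteine_frequency(seq) for seq, _ in toy]
oxygen = [label for _, label in toy]
rho = spearman(cys, oxygen)
```

In this toy set the cysteine-rich sequences carry the anaerobe label, so `rho` comes out negative, which is the same direction as the correlation reported in the manuscript.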
Estimators were then evaluated on their ability to accurately predict the four physicochemical conditions based on nine different sets of features, and the authors found that amino acid features alone were sufficient for accurate prediction. Three models were then selected for each condition (optimum, minimum and maximum value predictions). When the selected models were tested with family-level holdouts, predictions were less accurate, although performance remained consistent with the training and cross-validation data. The models also predicted extreme growth conditions less accurately (e.g. salinity > 15% or pH > 5).
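To make the idea of a family-level holdout concrete, here is a minimal sketch of a split in which every genome from a given taxonomic family lands entirely in either the training or the test set. The function name, the `test_fraction` and `seed` parameters and the example labels are assumptions for illustration, not the authors’ implementation.

```python
import random


def family_holdout_split(families, test_fraction=0.2, seed=0):
    """Split genome indices so that no family appears in both sets.

    families: one family label per genome, in genome order.
    Returns (train_indices, test_indices).
    """
    unique = sorted(set(families))
    rng = random.Random(seed)
    rng.shuffle(unique)
    # Hold out at least one whole family.
    n_test = max(1, round(len(unique) * test_fraction))
    test_families = set(unique[:n_test])
    train_idx = [i for i, f in enumerate(families) if f not in test_families]
    test_idx = [i for i, f in enumerate(families) if f in test_families]
    return train_idx, test_idx


# Hypothetical labels for eight genomes.
labels = [
    "Bacillaceae", "Bacillaceae", "Enterobacteriaceae", "Enterobacteriaceae",
    "Halobacteriaceae", "Pseudomonadaceae", "Pseudomonadaceae",
    "Methanobacteriaceae",
]
train_idx, test_idx = family_holdout_split(labels, test_fraction=0.4)
```

Splitting by family rather than by genome means the test set only contains families the model has never seen, which is what makes this a stricter check than ordinary cross-validation.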
To test the models’ vulnerability to phylogenetic bias, the selected models were compared to models where the prediction was a random value or the average value of the closest relatives. As expected, the chosen models considerably outperformed the models strongly influenced by phylogeny.
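The two baselines described above can be sketched roughly as follows. This sketch approximates “closest relatives” by a shared taxonomic group label and falls back to the overall mean for unseen groups; both choices are assumptions made here for illustration, and the manuscript should be consulted for how relatives were actually defined.

```python
import random
from statistics import mean


def rmse(predicted, observed):
    """Root-mean-square error between two equal-length sequences."""
    return mean((p - o) ** 2 for p, o in zip(predicted, observed)) ** 0.5


def random_baseline(train_values, n, seed=0):
    """Predict by drawing values at random from the training set."""
    rng = random.Random(seed)
    return [rng.choice(train_values) for _ in range(n)]


def relatives_baseline(train, query_groups):
    """Predict the mean training value of genomes in the same group.

    train: list of (group_label, value) pairs.
    query_groups: group label of each genome to predict.
    Groups absent from training fall back to the overall mean (an assumption).
    """
    by_group = {}
    for group, value in train:
        by_group.setdefault(group, []).append(value)
    overall = mean(value for _, value in train)
    return [mean(by_group[g]) if g in by_group else overall for g in query_groups]


# Hypothetical optimum temperatures (°C) with made-up group labels.
train = [("A", 30.0), ("A", 40.0), ("B", 80.0)]
predictions = relatives_baseline(train, ["A", "B", "C"])
```

A model that only matches the relatives baseline is effectively reading off phylogeny, so the comparison is informative only when the chosen model has a clearly lower error than both baselines on the same held-out genomes.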
To test the models’ vulnerability to genome completeness, protein and genome sequences were subsampled to 10-100% completeness for 20 different species in each condition range and evaluated for prediction accuracy. The selected models showed negligible differences between 10% and 100% genome completeness for oxygen tolerance, temperature and salinity, whereas pH prediction was more strongly affected by genome completeness.
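A rough sketch of this kind of completeness test is shown below: whole proteins are randomly subsampled and the amino acid frequencies are recomputed at each completeness level. Sampling whole proteins, the function names and the toy sequences are assumptions for illustration; the authors’ subsampling procedure may differ.

```python
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_frequencies(proteins):
    """Relative frequency of each standard amino acid across all proteins."""
    counts = Counter()
    for protein in proteins:
        counts.update(aa for aa in protein if aa in AMINO_ACIDS)
    total = sum(counts.values())
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def subsample_proteins(proteins, completeness, seed=0):
    """Keep a random fraction of proteins to mimic an incomplete genome."""
    rng = random.Random(seed)
    k = max(1, round(len(proteins) * completeness))
    return rng.sample(proteins, k)


# Hypothetical toy proteome of ten short sequences.
proteome = [
    "MKTAYIAKQR", "MCCKCLACGA", "MSDEELLKKA", "MKALAGCAHH", "MPPWYFGNNQ",
    "MKVLIDTSGR", "MEEDDKRRLL", "MHHTTSSVVI", "MGGAALLPPF", "MQQNNWWYYC",
]
full = amino_acid_frequencies(proteome)
half = amino_acid_frequencies(subsample_proteins(proteome, 0.5))
```

Because composition features are averages over many proteins, they tend to change little under subsampling once enough sequence remains, which fits the small effect of completeness reported for most conditions.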
The selected models were then used to predict the ideal growth conditions of 85k+ bacteria and archaea. As expected, many of the uncultivated species were predicted to grow in more extreme conditions. The ideal growth conditions of 3.3k+ metagenomes were also predicted and compared to the conditions of the environment from which the samples were derived. Predicted growth conditions mostly aligned with the organisms’ habitats, but the authors found that predictions for individual genomes can deviate from the conditions of the source environment.