R0:
Reviewer #1: The manuscript as reviewed meets PLOS Global Public Health publication requirements; the author(s) clearly presented the study background, methods, results, discussion, and conclusion. My comments and revision requests are minor formatting and suggested input. No ethics concerns at this time.

Reviewer #2: This is a well-written paper with clear methodology. From the perspective of data science applied to public health, this manuscript does a great job of clearly discussing and defining its methodology, which follows current best practices. Correcting for class imbalance was a good choice, given the low prevalence of EC in the survey population. Applying SMOTE to the training set only prevented data leakage and is the current best practice. Using such a large variety of machine learning models creates a challenge in describing each model well enough within one manuscript, and the author did a good job of balancing that challenge.
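A minimal sketch of the leakage-safe pattern this comment praises, assuming an imblearn pipeline on simulated data mirroring the study's size and ~4.4% prevalence (this is an illustration, not the authors' actual code):

```python
# SMOTE inside a pipeline: resampling happens only when fit() is called,
# so synthetic cases are generated from training data alone.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2334, weights=[0.956], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),            # applied during fit() only
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)                       # synthetic cases drawn from the training set only
y_pred = pipe.predict(X_test)                    # the test set is never resampled
```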
I only have a few minor suggestions to clarify the methodology of the manuscript:
Please specify upfront how many observations were used in training and testing, and how many positive EC outcomes were included in the testing set. With such a low prevalence of a positive outcome in a relatively small set of observations, it is worth mentioning that there are perhaps only 10-20 positive outcomes being predicted in the test set. In the absence of weighting, the characteristics of those few positive outcomes in the test set may be biasing the predictors, and this is worth mentioning.
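A back-of-envelope check of this point, using the manuscript's own figures (n = 2,334, 4.4% prevalence, 80/20 split); the split itself is simulated, not the authors' actual split:

```python
# With ~103 positives overall, a stratified 80/20 split leaves ~21 in the test set.
import numpy as np
from sklearn.model_selection import train_test_split

n = 2334
y = np.zeros(n, dtype=int)
y[: round(n * 0.044)] = 1                        # ~103 EC users in total

y_train, y_test = train_test_split(y, test_size=0.2, stratify=y, random_state=0)
print(y_train.sum(), y_test.sum())               # roughly 82 vs 21 positive cases
```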
Please discuss how the initial 38 variables were selected from the survey. If there was an initial expert judgment on inclusion into the variable set for feature selection, that should be mentioned.
Cluster design was mentioned in the PMA survey. This indicates that the survey includes survey weights of some kind. Please discuss whether those weights were addressed in the machine learning methods, or defend why they were not included in the model design. Survey weights can be included in machine learning models to make the predictors more representative of the population of interest.
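A minimal sketch of one way survey weights could enter the models, assuming the scikit-learn convention of per-observation weights at fit time; the weight vector below is a simulated stand-in, not the actual PMA design weights:

```python
# Passing design weights via sample_weight makes the fitted model
# approximate the population the survey was designed to represent.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=2334, weights=[0.956], random_state=0)
w = np.random.default_rng(0).uniform(0.5, 2.0, size=len(y))  # placeholder survey weights

clf = GradientBoostingClassifier(random_state=0)
clf.fit(X, y, sample_weight=w)                   # weighted fit reflects the survey design
```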
In the discussion, please address the impact of low precision, given that there were many false positives compared to true positives. While it is mentioned, low-precision prediction models carry consequences in public health (e.g., loss of trust), and this characteristic of the findings could be discussed more.
Consider including a SHAP dependence plot: potential interactions (e.g., knowledge and ad exposure) are discussed without showing evidence, and a dependence plot would provide it.
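A minimal sketch of the suggested plot on toy data; "heard_emergency" comes from the manuscript, while "media_exposure" is a hypothetical stand-in for the ad-exposure variable:

```python
# SHAP dependence plot: x-axis is the feature of interest, point colour is
# the interacting feature, which visualizes the claimed interaction.
import numpy as np
import pandas as pd
import shap
import xgboost as xgb

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "heard_emergency": rng.integers(0, 2, 500),
    "media_exposure": rng.integers(0, 2, 500),
    "age": rng.integers(15, 50, 500),
})
y = ((X["heard_emergency"] == 1) & (rng.random(500) < 0.4)).astype(int)

model = xgb.XGBClassifier(n_estimators=50).fit(X, y)
shap_values = shap.TreeExplainer(model).shap_values(X)
shap.dependence_plot("heard_emergency", shap_values, X,
                     interaction_index="media_exposure")  # colours by the interacting feature
```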
Consider explicitly discussing the limitation of using cross-sectional survey data for prediction, where proxies stood in for quantitative evidence (e.g., ad exposure as a proxy for perceptions).
Overall, great work, timely, and well constructed.

Reviewer #3: SEE Word document attached with clear table.
Manuscript Number: PGPH-D-25-01837
Review report
This manuscript demonstrates a significant strength in its application of advanced machine learning and Explainable AI (XAI) to address the critical public health challenge of low emergency contraceptive (EC) use in Ethiopia. By rigorously testing multiple models and using SMOTE to handle severe class imbalance, it identifies key modifiable predictors, primarily EC awareness and media exposure, rather than static socioeconomic factors. The use of SHAP values transforms complex model outputs into actionable insights, revealing that knowledge gaps are the primary barrier. This approach provides a powerful, data-driven blueprint for designing targeted interventions, such as tailored media campaigns and improved health counselling, to effectively increase EC uptake and reduce unintended pregnancies. However, the following points may need to be considered, so as to improve the quality of the paper.
Topic/subtopic, issue, and suggestion:

- Title ("Predicting Utilization of Emergency Contraceptive Usage in Ethiopia and Identifying Its Predictors Using Machine Learning"). Issue: redundancy; "Utilization" and "Usage" mean the same thing. Suggestion: "Predicting the Utilization of Emergency Contraception in Ethiopia and Identifying Its Predictors Using Machine Learning."
- Affiliation. Issue: inconsistent institution name; page 1 says "College of Medicine Health Science" while the first page of the manuscript says "College of Health Science". Suggestion: use a consistent affiliation name.
- Abstract. Issue: "Traditional analyses have struggled to identify complex predictors." Suggestion (for flow): "Traditional statistical analyses have struggled to…"
- Abstract. Issue: "with SMOTE used to address class imbalance"; grammar: this is a dependent clause and should be connected to the previous sentence. Suggestion: "..., and SMOTE was used to address class imbalance."
- Abstract. Issue: "Findings highlight that knowledge gaps, not poverty or access, are key barriers to EC use."; clarity: "access" is vague; be more specific. Suggestion: "...not poverty or physical access barriers, are key."
- Introduction. Issue: page 3, "moderate's". Suggestion: change to "moderates" ("the way the education level moderates religion-based stigma").
- Introduction. Issue: "drives excessive maternal mortality rates of over 500 deaths per 100,000 live births, drives poverty cycles, constrains girls' and women's educational and economic opportunities, and overwhelms poor healthcare infrastructures."; the word "drives" is used twice in close succession. Suggestion: "...contributes to high maternal mortality rates of over 500 deaths per 100,000 live births, perpetuates cycles of poverty, constrains..."
- Introduction. Issue: "is a central preventive intervention". Suggestion: "is a crucial preventive intervention".
- Introduction. Issue: "the use of EC remains embarrassingly low"; "embarrassingly" is subjective and informal. Suggestion: "...remains critically low."
- Introduction. Issue: "tempts women to shun services"; poor word choice. Suggestion: "...pressures women to shun services."
- Introduction. Issue: "woefully underserved"; informal. Suggestion: "...significantly underserved."
- Introduction. Issue: "yield the predictive resolution necessary"; "resolution" is unusual in this context. Suggestion: "...yield the predictive accuracy necessary".
- Introduction. Issue: "vastness tests for fairness"; the phrase is unclear and likely an error. Suggestion: correct the phrase for clarity.
- Methods (data source and inclusion criteria). Issue: the criteria for selecting the 2,334 women from the larger PMA sample of 8,943 are not explicitly stated. Was it a complete case analysis? This needs clarification, as it affects the generalizability of the findings. Suggestion: clarify whether sampling was done or it was a complete case analysis.
- Methods. Issue: "The dataset demonstrates low overall missing data prevalence"; "prevalence" is for disease outbreaks. Suggestion: "The missing data were minimal overall."
- Methods. Issue: "offering robust classifier building while preserving real performance measurement." Suggestion: "...facilitating the development of robust classifiers while preserving a realistic assessment of performance."
- Results. Issue: "nailing 17 true positives"; informal word choice. Suggestion: "...correctly identifying 17 true positives..."
- Results. Issue: "It manages this recall strength at the expense of precision, though, which sits at approximately 11%."; "sits at" is informal. Suggestion: "It achieves this high recall at the expense of precision, which was approximately 11%."
- Results. Issue: "The most influential positive feature was 'heard_emergency', indicating awareness of emergency services has the greatest influence..."; add "which". Suggestion: "The most influential positive feature was 'heard_emergency', which indicates that awareness of emergency contraception has the greatest influence..."
- Results. Issue: "This resonates with core assumptions of health behavior theories like the Health Belief Model, which posit perceived knowledge as a harbinger of action."; "harbinger" is misused. Suggestion: "...which posit knowledge as a prerequisite for action."
- Results. Issue: page 18, "radio-implemented". Suggestion: change to "radio-delivered" or "radio-based".
- Results. Issue: "Even positive, this reflects continued systemic disincentives documented elsewhere"; unclear, and "even" is not the correct word. Suggestion: "Although positively associated, this factor reflects..."
- Results. Issue: "all the sources of blunting the effect of being in contact with the health system."; grammatically incorrect and unclear. Suggestion: "...all of which blunt the effect of health system contact."
- Results. Issue: "One of the thoughtful discoveries of SHAP values was the sizeable negative impact"; "thoughtful" is incorrect. Suggestion: "A notable discovery from the SHAP analysis was..."
- Results. Issue: "Isolated use of SMOTE in the training set"; "isolated" is the wrong word. Suggestion: "Applying SMOTE exclusively to the training set".
- Results. Issue: "It shifted the ML model from being a prediction device to an analysis tool, not just deciding which features were significant, but the size and sign of their effects, and significantly, potential interactions"; unclear because of non-parallel verbs. Suggestion: "It transformed the ML model from a prediction device into an analytical tool, revealing not only which features were significant but also the magnitude and direction of their effects, as well as potential interactions."
- Results. Issue: "Simulation by counterfactual SHAP analysis suggests a hypothetical 30% increase in EC knowledge might boost utilization by approximately 12.7%, a valuable public health gain."; the sentence needs a clearer explanation. Suggestion: "Counterfactual simulation using SHAP values (e.g., calculating the mean impact of increasing the 'heard_emergency' feature value) suggested that a 30% increase in EC knowledge could potentially increase utilization by approximately 12.7%, representing a valuable public health gain."
- Results. Issue: "Geographic ML modeling over the geographic data would also potentially be able to further optimize resource deployment"; repetition: "geographic" is used twice. Suggestion: rewrite the sentence for clarity.
- Results. Issue: "the implied vulnerability evidenced by the 'forced pregnancy' variable (despite missing data concerns) underscore"; subject-verb disagreement. Suggestion: use "underscores".
- Methods (model selection justification). Issue: the list of eight algorithms is comprehensive, but the justification for simpler models like Naive Bayes is weak. Suggestion: justify the inclusion of Naive Bayes; were these simpler models included as benchmarks?
- Methods (evaluation metrics). Issue: AUC-ROC is emphasized, but for imbalanced problems the F1-score or precision-recall AUC may be better. Suggestion: consider using the F1-score or precision, since the data are not balanced, or justify the use of AUC-ROC.
- Methods (model performance presentation). Issue: the focus on Logistic Regression is unclear, since Gradient Boosting achieved a higher AUC-ROC (0.85). Suggestion: consider Gradient Boosting, or explain the rationale (e.g., performance vs. interpretability).
- Results (confusion matrix analysis, Figure 3). Issue: the analysis states precision is "approximately 11%." Based on the described confusion matrix (TP=17, FP=138), precision is 17 / (17+138) = 11.0%. This is a critical weakness of the model that deserves more emphasis: it means ~89% of the people predicted to be EC users were actually non-users, which has huge implications for the cost and efficiency of any intervention based on this model. Suggestion: discuss this trade-off explicitly, e.g., "The model's high recall (85%) comes at the cost of low precision (11%), resulting in a high false positive rate. This suggests the model is well-suited as a screening tool where identifying most true cases is prioritized over resource efficiency, but would require secondary screening or low-cost interventions to target the large number of false positives."
- Discussion (addressing limitations more forcefully). Issue: underreporting of EC is likely a major issue. Suggestion: add "A key limitation is the potential for significant underreporting of EC use due to social desirability bias and stigma..."
- Conclusion. Issue: "myth-busting"; informal word choice. Suggestion: "myth-dispelling".
- Conclusion. Issue: "stock guarantees of EC"; not clear. Suggestion: consider writing "guaranteed EC stock availability".
- Conclusion. Issue: "This research provides an ethical and evidence-based blueprint to accelerate gains in reducing maternal mortality and advancing reproductive autonomy in Ethiopia and similar settings."; awkward phrasing. Suggestion: consider rephrasing as "...blueprint to reduce maternal mortality and advance..."

Reviewer #4: This manuscript applies machine learning (ML) and explainable AI (XAI) methods to predict emergency contraceptive (EC) use among women in Ethiopia, using data from the 2023 PMA survey. The authors compare eight algorithms, address severe class imbalance with SMOTE, and use SHAP values to interpret predictors. They find that awareness of EC is the strongest predictor, followed by media exposure and health facility discussions, while demographic variables show limited predictive value.
However, the results as currently presented are unreliable. Major inconsistencies in reported performance metrics (e.g., contradictory precision values, implausible Naive Bayes results, inflated accuracy) call into question the validity of the analyses. In addition, the small number of EC users makes the modeling unstable, and subgroup analyses are not feasible with this dataset. These issues, combined with over-interpretation of SHAP as causal, limit both the methodological credibility and substantive contribution of the paper.
Contradictory precision results: The performance metrics are inconsistent. Table 4 shows Logistic Regression with SMOTE achieving precision = 0.72 and recall = 0.85, yet the confusion matrix description reports precision at only ~11%. These cannot both be correct. This discrepancy raises questions about the accuracy of the reported results and must be clarified.
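The arithmetic behind this concern, using the counts stated in the manuscript's confusion-matrix description:

```python
# The described confusion matrix fixes precision at ~0.11,
# which cannot coexist with the 0.72 reported in Table 4.
tp, fp = 17, 138
print(f"precision = {tp / (tp + fp):.3f}")       # 0.110
```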
Inflated accuracy: The reported accuracy of 0.95 for Logistic Regression with SMOTE appears implausibly high given the extreme class imbalance (4.4% EC use). Accuracy is not an informative measure in this context, and such values raise concerns about potential data leakage or overly optimistic validation. The authors should confirm that the outcome variable or proxy features were not inadvertently included in the predictors.
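A minimal sketch of the imbalance-appropriate reporting this comment implies, on simulated data at the study's prevalence: the majority-class baseline already scores ~0.956 accuracy, so average precision (PR-AUC) and the precision-recall curve are far more informative:

```python
# Accuracy vs PR-AUC under 4.4% prevalence.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import PrecisionRecallDisplay, average_precision_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2334, weights=[0.956], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
print("baseline accuracy:", 1 - y_te.mean())            # ~0.956 by always predicting "no EC use"
print("PR-AUC:", average_precision_score(y_te, probs))  # sensitive to minority-class performance
PrecisionRecallDisplay.from_predictions(y_te, probs)    # the precision-recall curve reviewers request
plt.show()
```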
Over-interpretation of SHAP: The SHAP analysis is framed in causal terms (e.g., a 30% increase in knowledge leading to a 12.7% increase in use). SHAP values describe associations within the model, not causal effects. The manuscript should temper these statements and present SHAP findings as indicators of relative predictive importance, not intervention outcomes.
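A minimal sketch of the associational framing asked for here: mean absolute SHAP values rank relative predictive importance without any claim about intervention effects. The model is a toy and the column names are hypothetical:

```python
# Global SHAP importance = mean |SHAP| per feature: a ranking, not an effect size.
import numpy as np
import pandas as pd
import shap
import xgboost as xgb

rng = np.random.default_rng(1)
X = pd.DataFrame(rng.random((400, 3)), columns=["knowledge", "media", "age"])
y = (X["knowledge"] + 0.2 * rng.random(400) > 0.8).astype(int)

sv = shap.TreeExplainer(xgb.XGBClassifier(n_estimators=30).fit(X, y)).shap_values(X)
rank = pd.Series(np.abs(sv).mean(axis=0), index=X.columns).sort_values(ascending=False)
print(rank)                                      # relative predictive importance, not causal effects
```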
Implausible Naive Bayes results: Naive Bayes is reported as having an accuracy of only 0.06 pre-SMOTE. Given that 95% of the sample did not use EC, even a trivial majority-class classifier would achieve ~95% accuracy. Such a result suggests an error in coding or reporting that must be checked.
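The trivial baseline this comment invokes, on simulated data at the study's prevalence:

```python
# Always predicting the majority class ("no EC use") yields ~0.956 accuracy,
# so any reported accuracy of 0.06 signals a coding or reporting error.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier

X, y = make_classification(n_samples=2334, weights=[0.956], random_state=0)
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
print(baseline.score(X, y))                      # ~0.956
```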
Small minority class vs. model complexity: Only 103 EC users were present in the dataset. Training and tuning eight algorithms with hyperparameter searches on such a small minority class risks overfitting and unstable results, even with SMOTE. This limitation should be acknowledged explicitly, with emphasis on the need for validation on independent samples.
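A minimal sketch of the stability check this limitation calls for, assuming repeated stratified cross-validation on simulated data:

```python
# With ~103 positives, recall swings noticeably across resampled splits;
# repeated stratified CV makes that instability visible.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=2334, weights=[0.956], random_state=0)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="recall")
print(f"recall = {scores.mean():.2f} +/- {scores.std():.2f}")  # the spread quantifies instability
```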
Subgroup analysis claims: The manuscript claims fairness testing across subgroups (rural/urban, religion, age), but no results are presented. With so few EC users, subgroup analyses would be underpowered and unreliable. It would be more appropriate to note this limitation rather than imply subgroup robustness.
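A quick feasibility check behind this point: counting EC users per subgroup shows how thin the strata are. Column names ("residence", "ec_use") are hypothetical, and the data are simulated at the survey's 4.4% prevalence:

```python
# Positives per stratum: ~50 overall, ~10 in a 20% test split,
# far too few for stable subgroup (fairness) metrics.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "residence": rng.choice(["urban", "rural"], 2334),
    "ec_use": (rng.random(2334) < 0.044).astype(int),
})
print(df.groupby("residence")["ec_use"].agg(["sum", "size"]))
```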
Causality issue: The manuscript repeatedly interprets predictive associations as though they were causal effects. For example, SHAP values are used to suggest that increasing knowledge by 30% would increase EC use by 12.7%. Since the data are cross-sectional and observational, such statements are not justified. Machine learning models in this setting can identify predictive patterns, but they cannot establish causal relationships between predictors and outcomes. This overreach is particularly concerning because it could mislead policymakers or practitioners into believing the study provides evidence of causal effects.

Reviewer #5:

Summary: This study investigates the underuse of emergency contraception in Ethiopia using a machine learning framework. Strengths include the application of multiple algorithms, careful handling of class imbalance, and the use of Explainable AI to interpret model outputs. The paper is generally well-structured, and the methodological workflow is presented clearly. At the same time, the results are presented in a way that overstates the model's practical utility while giving insufficient attention to the precision-recall trade-off. The manuscript should be revised to consistently acknowledge the low precision across the abstract, results, and discussion, and to provide a clear justification for the relevance of a high-recall, low-precision model in this public health context. The limitation posed by the small number of positive cases in the validation set should also be explicitly discussed. Addressing these points is necessary to strengthen the scientific validity of the work.

Specific comments:
1. Title: It should be shortened to remove redundancy, since "Utilization" and "Usage" mean the same thing.
2. Abstract: I think something key was missed. The authors state a recall of 0.85 without mentioning the precision. I see that Figure 3 (page 20) shows the precision is approximately 11%. My understanding of this is that for every 100 women the model flags as likely EC non-users who need intervention, 89 of them are false alarms. An abstract must present a balanced view of performance.
3. Methods (about the data): A sample size of 2,334 with a 4.4% prevalence means you only have ~103 positive cases (EC users). After an 80/20 train-test split, your test set contains only ~21 positive cases. This number is critically small and raises serious questions about the stability and generalizability of your reported performance metrics; a different random split could yield vastly different results. I suggest that such a major limitation be addressed upfront in the limitations section and acknowledged in the methods section.
4. Data balancing: I like the write-up of this section.
5. Evaluation metrics: The text states the test set has 18.7% EC users, but the abstract and data balancing section state the overall prevalence is 4.4%. Please clarify this discrepancy. Is 18.7% a typo, or did the stratified split result in a test set with a much higher prevalence than the overall dataset? This needs to be consistent. Could you also add the precision-recall plots, since you state that they were tracked?
6. Results:
- In Table 4, the columns are "F1" and "Score". This seems like a typo; it should likely be a single column, "F1 Score". Please correct.
- Lastly, I think it would be good to acknowledge the weaknesses of SMOTE.

Reviewer #6: The title of the article is: Predicting Utilization of Emergency Contraceptive Usage in Ethiopia and Identifying Its Predictors Using Machine Learning.
The author explains that traditional analyses have struggled to identify complex predictors and therefore used machine learning (ML) and Explainable AI (XAI) to improve the prediction and interpretability of Emergency Contraceptive (EC) use. The paper can be published with the following corrections, some of which are extremely important, particularly from a methodological perspective. Category, authors' contribution, and comments:

- Objectives. Authors' contribution: the primary objectives are twofold: one, to predict the likelihood of EC use with far greater accuracy than conventional regression techniques; two, to identify the key modifiable socio-behavioural predictors (e.g., self-efficacy, mass media exposure, provider perception, and women's autonomy) through XAI methods like SHAP values to yield interpretability and actionable insights. Comments: the first objective can be modified, as "far greater" is a vague statement. Measuring accuracy is an indicator for choosing between models; the objective should instead focus on why conventional regression techniques are a problem in this study. The second objective reads like the motivation of the study and should be written as a clear sentence; identifying predictors "to yield interpretability and actionable insights" is subjective, and these objectives seem ambiguous.
- Methodological view. Authors' contribution (page 5): "Methodologically, it represents a new contribution by rigorously testing the performance of eight alternative ML classifiers and developing an optimized analytical pipeline specifically designed to handle skewed healthcare datasets prevalent in rare outcomes like EC use. Theoretically, it applies the Socio-Ecological Model (SEM) framework to hierarchically analyze predictors at levels of individual (knowledge, attitudes), interpersonal (partner communication, family influence), community (stigma norms, access), and policy (health system factors), providing an integrated explanation for the interrelating influences on EC behavior." Comments: this is not a methodological contribution. Moreover, the author claims a theoretical contribution, but it is merely an exploratory analysis of the data.
- Methodology. Authors' contribution (page 4): "In contrast to conventional statistical approaches, ML algorithms, such as random forests, gradient boosting machines (e.g., XGBoost), and neural networks, can particularly identify complex, high-dimensional patterns within diverse data sets, properly manage missing data, and produce personalized risk predictions with improved accuracy." Comments: the author mentions conventional statistical techniques several times, yet the report directly presents only the performance of the ML models. My suggestion is to first run the analysis using traditional or conventional methods and then compare with the ML techniques; this is very important (a minimal sketch of such a comparison appears after this list).
- Outcome variable. Authors' contribution (page 8): "The outcome of interest is EC Usage, a binary measure of whether emergency contraception was used in the last 12 months. This is the dependent variable for analysis." Comments: redundant, as the outcome of interest was already stated at the beginning.
- Missing data. Authors' contribution: "For handling missingness in our data, a stratified approach based on missingness mechanisms and rates was followed," and so on. Comments: the author used many approaches, and it is difficult to keep track; it would be better to explain them step by step, with the pros and cons of each process, and to explain why this approach is best for this study.
- Variables (page 12). Comments: there are many categories under one variable, and some categories have very few observations. Justify their necessity; perhaps also show some cross-tabulation results and report the p-values.
- Research gap. Authors' contribution (page 19): "The research goes beyond the correlational limitations of previous studies by utilizing predictive analytics to identify the modifiable factors and approximate their hypothetical effects." Comments: what is meant by "correlational limitations"? Moreover, throughout the report, previous studies are not discussed in comparison with the authors' current approach; add some recent references and explain the research gap. Machine learning techniques are not new, so it is necessary to state how they contribute novelty to this study.
- General. Comments: throughout the report there is a lack of synchronization and coherence between sentences; moreover, spacing is not maintained in the references, table titles, etc.
- Abstract. Authors' contribution: (1) "SMOTE and SHAP"; (2) "Conversely, recent reproductive events such as unintended pregnancy were linked to non-use. Static demographic factors showed poor predictive value. Findings highlight that knowledge gaps, not poverty or access, are key barriers to EC use. Tailored media campaigns and routine health counseling could enhance EC uptake. ML and XAI offer powerful tools for guiding targeted reproductive health interventions." Comments: (1) it is not mentioned what these stand for; (2) the message of these sentences is not coherent. I think the author should have the whole paper checked by a native English reviewer.
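A minimal sketch of the comparison requested in the Methodology row above, on simulated data: fit a conventional logistic regression first, then report the ML model against it on the same held-out split:

```python
# Conventional baseline vs ML model on one shared split.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2334, weights=[0.956], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

for name, model in [("conventional logistic regression", LogisticRegression(max_iter=1000)),
                    ("gradient boosting", GradientBoostingClassifier(random_state=0))]:
    auc = roc_auc_score(y_te, model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
    print(f"{name}: AUC-ROC = {auc:.2f}")
```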
R1:
Reviewer #4: I appreciate the authors' thoughtful revisions and detailed responses. Several of my earlier comments were addressed—specifically, the correction of Naive Bayes reporting errors, improved acknowledgment of sample size limitations, and removal of unsupported subgroup analyses. These are welcome improvements. However, key concerns about the internal consistency of results, causal interpretation of SHAP analyses, and overextension of policy recommendations remain unresolved.
First, while the outdated "11% precision" text has been removed, the confusion matrix values (TP=102, FP=180, FN=18) still do not correspond to the reported performance metrics. With these numbers, precision would equal roughly 0.36, not the 0.72 cited in Table 4. This suggests an ongoing internal inconsistency between the descriptive counts and the summary metrics. The lack of alignment raises continuing doubts about the reliability of the reported model performance.
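The consistency check underlying this paragraph, derived directly from the reported counts:

```python
# Metrics implied by the confusion matrix (TP=102, FP=180, FN=18).
def metrics_from_counts(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

p, r = metrics_from_counts(tp=102, fp=180, fn=18)
print(f"precision = {p:.2f}")                    # 0.36, not the 0.72 cited in Table 4
print(f"recall    = {r:.2f}")                    # 0.85, which does match the reported recall
```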
Second, the manuscript still places heavy emphasis on accuracy values approaching 0.92–0.95 despite a highly imbalanced outcome (4.4% EC use). Although the authors state that AUC-ROC and recall were prioritized, the presentation continues to foreground accuracy, which is misleading in this context. No calibration or uncertainty measures (e.g., Brier score, calibration curve) have been added, leaving the reader without a sense of how well the predicted probabilities reflect actual risk.
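A minimal sketch of the calibration evidence requested here, on simulated data at the study's prevalence: a Brier score plus a calibration (reliability) curve shows whether predicted probabilities track observed risk:

```python
# Brier score and calibration curve for a fitted classifier.
import matplotlib.pyplot as plt
from sklearn.calibration import CalibrationDisplay
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2334, weights=[0.956], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
print("Brier score:", brier_score_loss(y_te, probs))         # mean squared error of probabilities
CalibrationDisplay.from_predictions(y_te, probs, n_bins=10)  # predicted vs observed risk
plt.show()
```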
Third, although the authors softened their language, the interpretation of SHAP values remains quasi-causal. The new statement, "counterfactual simulation using SHAP values … suggested that a 30% increase in EC knowledge could potentially increase utilization by approximately 12.7%," still presents SHAP outputs as if they represent real-world intervention effects. SHAP analysis identifies predictive associations within a model; it does not estimate the causal impact of changing a feature in the population. Likewise, subsequent phrases such as "integrating a predictive risk-scoring tool can help identify women at high risk" and "geographic machine learning modeling can optimize resource deployment" continue to frame the model as a validated operational tool. These remain prescriptive policy claims that move beyond what a cross-sectional, unvalidated predictive study can substantiate.
Finally, while the tone of the manuscript has improved, the discussion still reads as policy advocacy rather than analytical interpretation. Phrases like "representing a valuable public health gain" and "can help optimize resource deployment" give the impression of proven effectiveness rather than exploratory modeling. A clearer distinction between predictive insights and causal or operational evidence is necessary for the study to maintain methodological integrity.