Hypothesis

223 Matching Annotations

wu-msds-capstones.github.io

Unveiling the Drivers of Clean Air

110
1. sgberhault 10 Aug 2024
  
  in Public
  
  Air
  
  Your link to the right up here is broken
2. sgberhault 10 Aug 2024
  
  in Public
  
  Contents
  
  Your contents list here has way too much in it. Break it into bigger chunks and use subheadings.
3. sgberhault 10 Aug 2024
  
  in Public
  
  On October 30, 1948, the Donora High School Football team played through a dense smog to complete the game with hundreds of fans in the audience, despite very poor visibility.
  
  Citation?
4. sgberhault 10 Aug 2024
  
  in Public
  
  (Jacobs, Burgess, Abbott)
  
  Ah, here it is. Given that you are pulling from the story for an entire paragraph, I'd lead with some reference to this source.
  
  "In their book/article published in XXXX, Jacobs, Burgess, and Abbott tell the tale of...."
5. sgberhault 10 Aug 2024
  
  in Public
  
  The quality of air we breathe has direct impacts on our health. We must understand the factors that contribute to poor air quality and how we individually and collectively contribute to these changes. Until we can visualize the impact we have on our atmosphere, we will continue behavior that negatively impacts the air around us.
  
  Also a short paragraph.
6. sgberhault 10 Aug 2024
  
  in Public
  
  This event, known as the Donora Smog of 1948, prompted the country into taking a closer look at the negative impacts of air pollution. Widespread debate surrounding the event led to the first legislation aimed at regulating the air quality within the United States, ushering in a new era of tracking, combatting, and reversing the ill effects of poor air quality.
  
  Two sentences isn't much of a paragraph.
7. sgberhault 10 Aug 2024
  
  in Public
  
  changes
  
  factors
8. sgberhault 10 Aug 2024
  
  in Public
  
  NumPy and Pandas
  
  the NumPy and Pandas libraries
9. sgberhault 10 Aug 2024
  
  in Public
  
  Matplotlib and Seaborn for visualization, and Time Series forecasting algorithms such as Prophet and SARIMAX.
  
  This is not a complete sentence.
10. sgberhault 10 Aug 2024
  
  in Public
  
  We will address data inconsistencies, missing values and ensure that data is in a tidy format.
  
  This is not a paragraph
11. sgberhault 10 Aug 2024
  
  in Public
  
  We may need to normalize or standardize data if necessary and create new features through aggregation to enhance the model’s performance.
  
  Also not a paragraph
12. sgberhault 10 Aug 2024
  
  in Public
  
  p
  
  capitalized?
13. sgberhault 10 Aug 2024
  
  in Public
  
  Here’s a breakdown of its components:
  
  Is this supposed to be above the bullet points? Either way, I think those bullet points need a better intro.
14. sgberhault 10 Aug 2024
  
  in Public
  
  Metrics to Evaluate Machine Model Performance
  
  Any section needs to be introduced by text.
15. sgberhault 10 Aug 2024
  
  in Public
  
  Akaike Information Criteria (AIC)
  
  I don't think a reader has any idea what this is initially, so this chapter heading is kinda meaningless.
16. sgberhault 10 Aug 2024
  
  in Public
  
  Technique/Metric Description Purpose/Formula Scenario: Cancer prediction
  
  I don't think this table is useful in this current location. As a table, it should just be used as a reference and put at the end of the document. It is a nice summary table, for sure, but it doesn't belong smack in the middle of your paper.
  
  As far as a reader knowing what you are referring to when you use one of these terms, some you can probably safely assume you can use without explanation, and others you should bake the explanation into your text when you introduce it.
17. sgberhault 10 Aug 2024
  
  in Public
  
  Machine Learning AQI Time Series
  
  Text should introduce every section.
18. sgberhault 10 Aug 2024
  
  in Public
  
  Used to measure of a statistical model, it quantifies:
  
  Not a complete sentence
19. sgberhault 10 Aug 2024
  
  in Public
  
  Data Explaination
  
  Why is this part of the ML AQI Time Series chapter? Your chapter/heading hierarchy is extremely confusing in general.
20. sgberhault 10 Aug 2024
  
  in Public
  
  The Akaike Information Criterion (AIC) is a measure used to compare different statistical models. It helps in model selection by balancing the goodness of fit and the complexity of the model. Here’s how to interpret the AIC value:
  
  This feels more like how this section should be starting.
21. sgberhault 10 Aug 2024
  
  in Public
  
  The files were given daily on a county wide basis, separated into different files by year.
  
  So what did you collect?
22. sgberhault 10 Aug 2024
  
  in Public
  
  Indoors, high humidity can trap air, leading to the growth of mold and harmful bacteria.
  
  This feels outside the scope of what you are doing though, correct?
23. sgberhault 10 Aug 2024
  
  in Public
  
  Air Quality Data:
  
  These sections are too small for their own sections. Just make them their own paragraphs.
  
  EDIT: Actually, some of the later ones are more reasonable. Think about how you can balance between them though. Can you add to some to make it more reasonable as a section? Or remove from others? Maybe bullet points with a bolded starting line would be more appropriate?
24. sgberhault 10 Aug 2024
  
  in Public
  
  calculated
  
  aggregated
25. sgberhault 10 Aug 2024
  
  in Public
  
  Carbon Monoxide
  
  Carbon Monoxide (CO)
26. sgberhault 10 Aug 2024
  
  in Public
  
  Only motorbus data was used, which may not be reflective of cities with other large methods of public transportation, such as the New York subway system.
  
  It also seems to leave out what I'd guess is probably easily the most significant transit factor: cars and trucks?
27. sgberhault 10 Aug 2024
  
  in Public
  
  is updated as of
  
  was last updated on
28. sgberhault 10 Aug 2024
  
  in Public
  
  relevant columns were selected and renamed, reducing the information being brought into our initial SQL database.
  
  Just selecting and renaming wouldn't reduce the information, unless you are trying to say that you didn't bring in anything else.
29. sgberhault 10 Aug 2024
  
  in Public
  
  and imported
  
  remove
30. sgberhault 10 Aug 2024
  
  in Public
  
  The first dimension table is the dates table, a serialized list of dates from January 1st, 2015 to December 31st, 2022.
  
  You should explain why you did this. Otherwise breaking it out into a table of essentially 1 data column seems pointless. I'm pretty sure I recall the reason why, and it is a decent reason, but that is not apparent here.
31. sgberhault 10 Aug 2024
  
  in Public
  
  as well as the population and population density
  
  that is not shown in your ERD
32. sgberhault 10 Aug 2024
  
  in Public
  
  Understanding the context of a specified line requires joining the table back to the fact table, and joining the location and date tables to that as well.
  
  Ok, but I'm pretty sure this totally undid any of the space saving measures you gained with putting dates in their own table. Because you are including a massive number of duplicate items in your main table. You could have just left them separate and still joined by truncating the date to a year and matching that + location
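A minimal sketch of the year-plus-location join suggested above, assuming hypothetical table and column names (`daily`, `annual`, `location_id`) and toy values. Only the 729-bus figure appears in the quoted sample rows; the rest is made up for illustration.

```python
import pandas as pd

# Hypothetical daily fact table: one row per location per day.
daily = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-04", "2015-06-01", "2016-01-04"]),
    "location_id": [1, 1, 1],
    "aqi": [86.5, 70.0, 64.0],
})

# Hypothetical annual table: one row per location per year.
annual = pd.DataFrame({
    "year": [2015, 2016],
    "location_id": [1, 1],
    "num_busses": [729, 735],
})

# Truncate each date to its year, then match on year + location.
daily["year"] = daily["date"].dt.year
joined = daily.merge(
    annual,
    on=["year", "location_id"],
    how="left",
    validate="many_to_one",  # fail loudly if the annual table has duplicate keys
)
print(joined)
```

The annual values stay stored once per year and location, and are only repeated at query time.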
33. sgberhault 10 Aug 2024
  
  in Public
  
  Finally, constraints have been added to limit unusual or impossible data.
  
  Should probably describe these, since they aren't apparent in the ERD at all.
34. sgberhault 10 Aug 2024
  
  in Public
  
  Figure 1.
  
  Reference these properly in Quarto. (It will also make your life easier)
35. sgberhault 10 Aug 2024
  
  in Public
  
  ERD Diagram
  
  You need a much more comprehensive caption here.
36. sgberhault 10 Aug 2024
  
  in Public
  
  Exploratory Data Analysis
  
  EDA is what you do to narrow down what actual analysis you want or need to do to answer your question. It probably should not be shown here unless mandatory for understanding a later piece of analysis.
37. sgberhault 10 Aug 2024
  
  in Public
  
  Dataframe Shape The DataFrame contains 147039 rows and 44 columns.
  
  Wat? Why is this here in this form?
38. sgberhault 10 Aug 2024
  
  in Public
  
  Exploring Oregon State By filtering our Dataframe for Oregon state, our DataFrame contains 2922 rows.
  
  Yeah, that's not a section, nor should it be. Mistake with #?
39. sgberhault 10 Aug 2024
  
  in Public
  
  Features Engineering Date Column Preprocessing:
  
  This is a paper, not notes of what was done. You need to explain these and describe what was done. A flowchart might also be very useful.
40. sgberhault 10 Aug 2024
  
  in Public
  
  Sweetviz Data Report Done! Use 'show' commands to display/save. [100%] 00:01 -> (00:00 left) {"model_id":"0e8836738d0b492e92ad430e32f1e8d7","version_major":2,"version_minor":0,"quarto_mimetype":"application/vnd.jupyter.widget-view+json"} Report SWEETVIZ_REPORT.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files. We have generated a complete statistical report confirming the quality of EDA steps.
  
  Again, no need for this to be here. It contributes nothing to your story. Or if it does, you need to do a MUCH better job of making that clear. You can mention and link it in an appendix if you want.
41. sgberhault 10 Aug 2024
  
  in Public
  
  - Set tsmode=True when creating the ProfileReport - Ensure our DataFrame is sorted or specify the sortby parameter - Time Series Feature Identification
  
  Mangled formatting.
42. sgberhault 10 Aug 2024
  
  in Public
  
  Advanced Exploratory Data Analysis
  
  Probably also shouldn't be here, though it depends on what you mean by this.
43. sgberhault 10 Aug 2024
  
  in Public
  
  - Histograms are replaced with line plots - Feature details include new autocorrelation and partial autocorrelation plots - Two additional warnings may appear: NON STATIONARY and SEASONAL
  
  Mangled formatting
44. sgberhault 10 Aug 2024
  
  in Public
  
  These methods allowed us to thoroughly evaluate key data quality aspects, including: Class balance in categorical variables Presence and distribution of missing values (NaN) Feature distributions and correlations Potential time-series characteristics
  
  Ok, but I haven't seen you talk about any of these yet. So what use then were they toward answering your overall question?
45. sgberhault 10 Aug 2024
  
  in Public
  
  Time Series Visualization: CO, Wind and AQI
  
  Why is this a chapter? How is it contributing? Like it might be useful information for your question, but a chapter all by itself?
46. sgberhault 10 Aug 2024
  
  in Public
  
  NO2 (nitrogen dioxide) is an important air pollutant. Here’s a concise overview of it: - Reddish-brown gas with a pungent odor - Part of a group of pollutants known as nitrogen oxides (NOx) SO2 (sulfur dioxide) is an important air pollutant. Here’s a concise overview of SO2 as a pollutant: Colorless gas with a sharp, pungent odor Highly soluble in water Ozone (O₃) as a pollutant is a complex topic, as it can be both beneficial and harmful depending on its location in the atmosphere. Here’s a concise overview of ozone as a ground-level pollutant: Colorless to pale blue gas with a distinctive smell Highly reactive molecule composed of three oxygen atoms
  
  Again, wasn't all of this covered in the background?
47. sgberhault 10 Aug 2024
  
  in Public
  
  CO pollutant refers to carbon monoxide, which is a colorless, odorless, and tasteless gas that can be harmful to human health and the environment.
  
  Should have already established this in your background.
48. sgberhault 10 Aug 2024
  
  in Public
  
  Primarily produced by incomplete combustion of carbon-containing fuels Major sources include vehicle exhaust, industrial processes, and some natural sources like volcanoes Slightly less dense than air Highly flammable
  
  mangled formatting I think
49. sgberhault 10 Aug 2024
  
  in Public
  
  <Figure size 1000x1800 with 0 Axes>
  
  Figure appears before reference and explanation in text.
  
  Also, figure isn't actually a figure and has no caption.
  
  Also, plot is WAY too big for writeup
50. sgberhault 10 Aug 2024
  
  in Public
  
  <Figure size 1500x2000 with 0 Axes>
  
  Same issues as above figure:
  - Not explained in text
  - Not an actual figure with caption and reference
  - Way too large for the format
51. sgberhault 10 Aug 2024
  
  in Public
  
  we must
  
  You must? That is the only possible approach?
52. sgberhault 10 Aug 2024
  
  in Public
  
  We finally completed the exploratory data analysis.
  
  And you seemingly concluded nothing from it? Why should a reader care about this?
53. sgberhault 10 Aug 2024
  
  in Public
  
  147039
  
  No. You do not include raw tabulated output like this in a publication. The columns aren't even labeled, so a reader has no idea what they are looking at. If it is worth showing a reader, then you render it properly, make sure everything is labeled, insert it as a table with a caption and reference and discuss it in the text.
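As a rough illustration of "render it properly", the sketch below shows a Quarto Python cell that produces a labeled, captioned table. It assumes the row and column counts quoted in annotation 37; the label `tbl-shape`, the caption text, and the column names are hypothetical. In a `.qmd` file the cell header would be written `{python}`, and the text would refer to the table as `@tbl-shape`.

```python
#| label: tbl-shape
#| tbl-cap: "Size of the combined dataset before filtering."

import pandas as pd

# Counts quoted in the annotated text; the row and column labels are illustrative.
shape_summary = pd.DataFrame(
    {"Count": [147039, 44]},
    index=pd.Index(["Rows", "Columns"], name="Dimension"),
)
shape_summary
```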
54. sgberhault 10 Aug 2024
  
  in Public
  
  Ultimately, we want to see which variables have the greatest impact on AQI
  
  The AQI is defined in terms of some of these, correct? If so, those should probably not be included.
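To make the circularity concrete: the EPA computes a sub-index for each pollutant by linear interpolation between breakpoints, and reports the largest sub-index as the AQI. The sketch below assumes the PM2.5 breakpoints from the EPA table in use before the 2024 revision; the function name and the ozone value are hypothetical.

```python
# PM2.5 24-hour breakpoints (µg/m³) from the EPA table in use before the 2024
# revision; treat them as illustrative and check the current table before use.
PM25_BREAKPOINTS = [
    # (C_low, C_high, I_low, I_high)
    (0.0, 12.0, 0, 50),
    (12.1, 35.4, 51, 100),
    (35.5, 55.4, 101, 150),
    (55.5, 150.4, 151, 200),
]


def sub_index(concentration, breakpoints):
    """Linearly interpolate a pollutant concentration onto the AQI scale."""
    for c_low, c_high, i_low, i_high in breakpoints:
        if c_low <= concentration <= c_high:
            slope = (i_high - i_low) / (c_high - c_low)
            return round(slope * (concentration - c_low) + i_low)
    raise ValueError("concentration outside the breakpoints listed here")


# The reported AQI is the largest sub-index across pollutants.
sub_indices = {"pm25": sub_index(20.5, PM25_BREAKPOINTS), "o3": 42}  # o3 value is made up
aqi = max(sub_indices.values())
print(sub_indices, aqi)
```

Because the AQI is a deterministic function of these concentrations, a model given them as features is largely relearning this formula.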
55. sgberhault 10 Aug 2024
  
  in Public
  
  First, missing data must be addressed.
  
  This wasn't addressed as any of your earlier pre-processing?
56. sgberhault 10 Aug 2024
  
  in Public
  
  date
  
  Now I have even less idea of what this is showing me
57. sgberhault 10 Aug 2024
  
  in Public
  
  Since AQI is the dependent variable being measured, all rows without AQI data are dropped. Certain cities have very little data and will be dropped out of necessity.
  
  Ok. How little is very little data? Why is it necessary?
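One way to make "very little data" explicit is to state a minimum row count and report which cities fall below it. This is a sketch on toy data: the cutoff `MIN_ROWS` is a hypothetical value, and only the Kansas City count of 241 comes from the table quoted in annotation 61.

```python
import pandas as pd

# Toy stand-in for the project's table; only the Kansas City count (241)
# comes from the annotated text, the Phoenix count is made up.
df = pd.DataFrame({
    "city": ["Kansas City"] * 241 + ["Phoenix"] * 2900,
    "aqi": 50.0,
})

MIN_ROWS = 1000  # hypothetical cutoff; the write-up should state and justify its own

rows_per_city = df.groupby("city")["aqi"].count()
kept_cities = rows_per_city[rows_per_city >= MIN_ROWS].index
filtered = df[df["city"].isin(kept_cities)]

print(rows_per_city.sort_values())
print(f"Dropped: {sorted(set(rows_per_city.index) - set(kept_cities))}")
```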
58. sgberhault 10 Aug 2024
  
  in Public
  
  The data collected has separate information for the city of New York City. NYC is divided into five boroughs, each within its own county. These values are grouped and averaged out to make NYC have the same amount of datapoints as every other city.
  
  Are other suburbs of major cities not counted separately? It seems like this could be a tricky thing to be fair about. And counties kinda already split things in an unambiguous way?
59. sgberhault 10 Aug 2024
  
  in Public
  
  date state county city population density \
  
  Pretty sure this output should absolutely be removed.
60. sgberhault 10 Aug 2024
  
  in Public
  
  figure 24324
  
  I missed the other 24 thousand 300 somewhere....
  
  Also, the thing below is a table, and should be referenced and captioned as such.
61. sgberhault 10 Aug 2024
  
  in Public
  
  Kansas City 241
  
  I think the count is largely unnecessary to show here, but what is up with Kansas City? And why is it not discussed when it is seemingly the only take-away I get from this table?
62. sgberhault 10 Aug 2024
  
  in Public
  
  To perform a ML prediction algorithm, the predicted variable (AQI) must be discrete.
  
  That doesn't seem correct. You can do all manner of regression algorithms with machine learning. No need to make this into a classification problem unless your SPECIFIC algorithm requires it. In which case you should discuss why you are using that specific algorithm.
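For comparison, here is a minimal regression sketch that predicts a continuous AQI-like value directly, with no binning. The data is synthetic and the two features are made up; `RandomForestRegressor` is used only as one example of a regression-capable model, not as the authors' method.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data: two made-up features and a continuous AQI-like target.
X = rng.uniform(0, 1, size=(500, 2))
y = 40 + 60 * X[:, 0] + 20 * X[:, 1] + rng.normal(0, 5, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# A regressor predicts the numeric value directly, so no binning is needed.
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

mae = mean_absolute_error(y_test, model.predict(X_test))
print(f"Mean absolute error on held-out data: {mae:.2f}")
```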
63. sgberhault 10 Aug 2024
  
  in Public
  
  The bins chosen are: 0-30 31-40 41-50 51-60 61-70 71-80 81-90 91-100 101-150 151+
  
  How were these chosen?
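Whatever the rationale turns out to be, the binning itself can be written so the edges are explicit. The sketch below applies the ten quoted bins with `pandas.cut` to the AQI values from the sample rows quoted in annotation 64. How fractional values such as 30.5 are assigned is an assumption of this sketch (each bin is treated as open on the left and closed on the right).

```python
import pandas as pd

# Edges and labels for the ten bins quoted above.
edges = [0, 30, 40, 50, 60, 70, 80, 90, 100, 150, float("inf")]
labels = ["0-30", "31-40", "41-50", "51-60", "61-70",
          "71-80", "81-90", "91-100", "101-150", "151+"]

# AQI values taken from the sample rows quoted elsewhere in these annotations.
aqi = pd.Series([86.50, 118.50, 45.25, 42.75, 49.25])

# right=True makes each bin (lower, upper]; include_lowest keeps an AQI of exactly 0.
aqi_discrete = pd.cut(aqi, bins=edges, labels=labels, right=True, include_lowest=True)
print(aqi_discrete.tolist())
```

The uneven widths (30, then 10s, then 50) are exactly what a reader will want justified; the EPA's own category boundaries would be one defensible alternative.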
64. sgberhault 10 Aug 2024
  
  in Public
  
  index | date | state | county | city | population | density | aqi | temp | pressure | humidity | ... | pm100 | pm25 | num_busses | revenue | operating_expense | passenger_trips | operating_hours | passenger_miles | operating_miles | aqi_discrete
  0 | 2015-01-04 | Arizona | Maricopa | Phoenix | 4064275 | 1198.9 | 86.50 | 41.458333 | 972.860814 | 60.739583 | ... | 20.709822 | 20.505426 | 729.0 | 47024975.0 | 2.256208e+08 | 55497019.0 | 2228182.0 | 2.190928e+08 | 28371107.0 | 81-90
  1 | 2015-01-04 | California | Los Angeles | Los Angeles | 11922389 | 3184.7 | 118.50 | 50.281250 | 1010.666650 | 58.177084 | ... | 24.954331 | 21.400000 | 2259.0 | 273158938.0 | 1.056348e+09 | 367104774.0 | 7938548.0 | 1.448619e+09 | 84041668.0 | 101-150
  7 | 2015-01-04 | District Of Columbia | District of Columbia | Washington | 5116378 | 4235.7 | 45.25 | 41.149479 | 1015.388525 | 61.075000 | ... | 13.750000 | 11.088055 | 1394.0 | 149657899.0 | 6.453259e+08 | 139353079.0 | 4115200.0 | 4.293390e+08 | 39643319.0 | 41-50
  17 | 2015-01-04 | Massachusetts | Suffolk | Boston | 4328315 | 5319.0 | 42.75 | 33.458332 | 1018.536500 | 58.234375 | ... | 6.000000 | 7.244791 | 800.0 | 96572664.0 | 4.080501e+08 | 122496729.0 | 2231562.0 | 3.162285e+08 | 22115804.0 | 41-50
  18 | 2015-01-04 | Michigan | Wayne | Detroit | 3725908 | 1772.2 | 49.25 | 30.203125 | 994.260400 | 72.671876 | ... | 18.250000 | 9.394618 | 432.0 | 31303313.0 | 1.729056e+08 | 33078462.0 | 1225079.0 | 1.696881e+08 | 17705665.0 | 41-50
  5 rows × 25 columns
  
  What even am I looking at here?
65. sgberhault 10 Aug 2024
  
  in Public
  
  Definition 1
  
  The interactivity of the below is neat, but you need to talk about it!
66. sgberhault 10 Aug 2024
  
  in Public
  
  The following tools are used: Train Test Split One Hot Encoder Transformer Pipeline Standard Scaler
  
  For what purposes?
67. sgberhault 10 Aug 2024
  
  in Public
  
  Feature selection is done on the data.
  
  How? And what are the raw results?
68. sgberhault 10 Aug 2024
  
  in Public
  
  Carbon Monoxide Nitrogen Dioxide Ozone PM10 PM2.5
  
  Aren't all of these literally part of the definition of AQI?
69. sgberhault 10 Aug 2024
  
  in Public
  
  K nearest neighbors Tree model Random Forest model Logistic Regression Naive Bayes
  
  This is essentially the equivalent of EDA in ML. A reader doesn't care about all of the attempts that didn't go as well unless something critical was shown in that case. Just move straight to the best and discuss what it implies.
70. sgberhault 10 Aug 2024
  
  in Public
  
  'city_Los Angeles', 'city_Phoenix', 'city_Portland
  
  Just these three?
71. sgberhault 10 Aug 2024
  
  in Public
  
  Definition 2
  
  I'm confused why these are just labeled as Definitions?
72. sgberhault 10 Aug 2024
  
  in Public
  
  AirQuality Confusion Matrix 1
  
  What model is the above even for?? How is a reader supposed to interpret this?
73. sgberhault 10 Aug 2024
  
  in Public
  
  Pipeline
  
  Not explained in the text, as far as I can understand.
74. sgberhault 10 Aug 2024
  
  in Public
  
  the model.
  
  WHICH?
75. sgberhault 10 Aug 2024
  
  in Public
  
  A randomized search is run with 100 iterations.
  
  Like actual just random values for these parameters each time?
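For reference, this is roughly how scikit-learn's `RandomizedSearchCV` behaves: each iteration draws one combination at random from the distributions or lists supplied, and scores it by cross-validation. The search space, data, and small `n_iter` below are assumptions for a quick sketch, not the authors' settings.

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))             # synthetic features
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic labels

# Hypothetical search space: a list is sampled uniformly, a scipy
# distribution is sampled through its rvs() method.
param_distributions = {
    "n_estimators": randint(10, 50),
    "max_depth": [None, 3, 5, 10],
    "min_samples_leaf": randint(1, 10),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,        # the annotated text reports 100; kept small so the sketch runs quickly
    cv=3,
    random_state=0,  # fixes which random combinations are drawn
)
search.fit(X, y)

print(search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.2f}")
```

So yes, the values are random draws, but from ranges the author chooses, which is why the write-up should state those ranges.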
76. sgberhault 10 Aug 2024
  
  in Public
  
  {'memory': None, 'steps': [('aqi_transformer', ColumnTransformer(transformers=[('categories', OneHotEncoder(handle_unknown='infrequent_if_exist', min_frequency=5, sparse_output=False), ['city']), ('scaled_air_quality', StandardScaler(), ['temp', 'humidity', 'co', 'no2', 'o3', 'pm100', 'pm25'])], verbose_feature_names_out=False)), ('RF_model', RandomForestClassifier())], 'verbose': False, 'aqi_transformer': ColumnTransformer(transformers=[('categories', OneHotEncoder(handle_unknown='infrequent_if_exist', min_frequency=5, sparse_output=False), ['city']), ('scaled_air_quality', StandardScaler(), ['temp', 'humidity', 'co', 'no2', 'o3', 'pm100', 'pm25'])], verbose_feature_names_out=False), 'RF_model': RandomForestClassifier(), 'aqi_transformer__n_jobs': None, 'aqi_transformer__remainder': 'drop', 'aqi_transformer__sparse_threshold': 0.3, 'aqi_transformer__transformer_weights': None, 'aqi_transformer__transformers': [('categories', OneHotEncoder(handle_unknown='infrequent_if_exist', min_frequency=5, sparse_output=False), ['city']), ('scaled_air_quality', StandardScaler(), ['temp', 'humidity', 'co', 'no2', 'o3', 'pm100', 'pm25'])], 'aqi_transformer__verbose': False, 'aqi_transformer__verbose_feature_names_out': False, 'aqi_transformer__categories': OneHotEncoder(handle_unknown='infrequent_if_exist', min_frequency=5, sparse_output=False), 'aqi_transformer__scaled_air_quality': StandardScaler(), 'aqi_transformer__categories__categories': 'auto', 'aqi_transformer__categories__drop': None, 'aqi_transformer__categories__dtype': numpy.float64, 'aqi_transformer__categories__feature_name_combiner': 'concat', 'aqi_transformer__categories__handle_unknown': 'infrequent_if_exist', 'aqi_transformer__categories__max_categories': None, 'aqi_transformer__categories__min_frequency': 5, 'aqi_transformer__categories__sparse_output': False, 'aqi_transformer__scaled_air_quality__copy': True, 'aqi_transformer__scaled_air_quality__with_mean': True, 
'aqi_transformer__scaled_air_quality__with_std': True, 'RF_model__bootstrap': True, 'RF_model__ccp_alpha': 0.0, 'RF_model__class_weight': None, 'RF_model__criterion': 'gini', 'RF_model__max_depth': None, 'RF_model__max_features': 'sqrt', 'RF_model__max_leaf_nodes': None, 'RF_model__max_samples': None, 'RF_model__min_impurity_decrease': 0.0, 'RF_model__min_samples_leaf': 1, 'RF_model__min_samples_split': 2, 'RF_model__min_weight_fraction_leaf': 0.0, 'RF_model__monotonic_cst': None, 'RF_model__n_estimators': 100, 'RF_model__n_jobs': None, 'RF_model__oob_score': False, 'RF_model__random_state': None, 'RF_model__verbose': 0, 'RF_model__warm_start': False}
  
  Definitely don't show this!
77. sgberhault 10 Aug 2024
  
  in Public
  
  Definition 3 0.6312159709618875
  
  What does this mean?
78. sgberhault 10 Aug 2024
  
  in Public
  
  Figure
  
  It is a table, not a figure.
79. sgberhault 10 Aug 2024
  
  in Public
  
  0.5265748745864021
  
  Comment on this if you are going to show it.
80. sgberhault 10 Aug 2024
  
  in Public
  
  Decomposing the Time Series With Additive Method
  
  Is this supposed to be more of a subheading?
81. sgberhault 10 Aug 2024
  
  in Public
  
  AirQuality Confusion Matrix 2
  
  Captions need to be more detailed and discuss what a reader should take away from an image.
82. sgberhault 10 Aug 2024
  
  in Public
  
  decomposed
  
  composed
83. sgberhault 10 Aug 2024
  
  in Public
  
  By the help of statsmodel package we can break the time series into its seasonal pattern and trends. This will helps us to understand the data clearly and will help us to make more sense of the data.
  
  Ok, so how did you go about doing that?
84. sgberhault 10 Aug 2024
  
  in Public
  
  three
  
  You literally JUST told me there were 2...
85. sgberhault 10 Aug 2024
  
  in Public
  
  There are
  
  The above image is an unlabeled figure with no caption that is discussed nowhere in the text (at least so far). All of those things are problematic.
86. sgberhault 10 Aug 2024
  
  in Public
  
  If you have an increasing trend, you still see roughly the same size peaks and troughs throughout the time series. This is often seen in indexed time series where the absolute value is growing but changes stay relative.
  
  But is that what you are seeing here? It is confusing if you are talking in the abstract or about your specific data.
  
  Also, why do this? What are your takeaways? This feels like it exists in isolation.
87. sgberhault 10 Aug 2024
  
  in Public
  
  attempts to compute the optimum values of hyperparameters.
  
  Say how it works!
88. sgberhault 10 Aug 2024
  
  in Public
  
  grid search method
  
  This isn't code, so it shouldn't be in monospace. Underline or italicize it if you want to set it apart, or put it in quotes.
89. sgberhault 10 Aug 2024
  
  in Public
  
  ARIMA(0, 0, 0)x(0, 0, 0, 12) - AIC:969.5419650946665 ARIMA(0, 0, 0)x(0, 0, 1, 12) - AIC:799.0140026908043 ARIMA(0, 0, 0)x(0, 1, 0, 12) - AIC:701.7072455506197 ARIMA(0, 0, 0)x(0, 1, 1, 12) - AIC:568.3211239351035 ARIMA(0, 0, 0)x(1, 0, 0, 12) - AIC:708.2727189545345 ARIMA(0, 0, 0)x(1, 0, 1, 12) - AIC:660.9171130206936 ARIMA(0, 0, 0)x(1, 1, 0, 12) - AIC:596.1563221105039 ARIMA(0, 0, 0)x(1, 1, 1, 12) - AIC:571.8620221843147 ARIMA(0, 0, 1)x(0, 0, 0, 12) - AIC:888.4893265461405 ARIMA(0, 0, 1)x(0, 0, 1, 12) - AIC:754.7451219152275 ARIMA(0, 0, 1)x(0, 1, 0, 12) - AIC:695.0468020327725 ARIMA(0, 0, 1)x(0, 1, 1, 12) - AIC:563.3526496700842 ARIMA(0, 0, 1)x(1, 0, 0, 12) - AIC:708.3487691701486 ARIMA(0, 0, 1)x(1, 0, 1, 12) - AIC:655.8968840891383 ARIMA(0, 0, 1)x(1, 1, 0, 12) - AIC:598.1490374699148 ARIMA(0, 0, 1)x(1, 1, 1, 12) - AIC:566.3367865157978 ARIMA(0, 1, 0)x(0, 0, 0, 12) - AIC:769.1876196189784 ARIMA(0, 1, 0)x(0, 0, 1, 12) - AIC:681.4253047727481 ARIMA(0, 1, 0)x(0, 1, 0, 12) - AIC:740.3973501203114 ARIMA(0, 1, 0)x(0, 1, 1, 12) - AIC:606.0067883430007 ARIMA(0, 1, 0)x(1, 0, 0, 12) - AIC:688.9276375883021 ARIMA(0, 1, 0)x(1, 0, 1, 12) - AIC:683.2372837276466 ARIMA(0, 1, 0)x(1, 1, 0, 12) - AIC:637.9760649104885 ARIMA(0, 1, 0)x(1, 1, 1, 12) - AIC:607.9989487123431 ARIMA(0, 1, 1)x(0, 0, 0, 12) - AIC:717.0512101206406 ARIMA(0, 1, 1)x(0, 0, 1, 12) - AIC:636.373429528529 ARIMA(0, 1, 1)x(0, 1, 0, 12) - AIC:692.512410906277 ARIMA(0, 1, 1)x(0, 1, 1, 12) - AIC:559.6920424480529 ARIMA(0, 1, 1)x(1, 0, 0, 12) - AIC:650.5293595230056 ARIMA(0, 1, 1)x(1, 0, 1, 12) - AIC:638.1908637932411 ARIMA(0, 1, 1)x(1, 1, 0, 12) - AIC:594.940391452659 ARIMA(0, 1, 1)x(1, 1, 1, 12) - AIC:562.5484300875305 ARIMA(1, 0, 0)x(0, 0, 0, 12) - AIC:775.150570595756 ARIMA(1, 0, 0)x(0, 0, 1, 12) - AIC:688.1982167211085 ARIMA(1, 0, 0)x(0, 1, 0, 12) - AIC:702.425519762607 ARIMA(1, 0, 0)x(0, 1, 1, 12) - AIC:570.1689904036024 ARIMA(1, 0, 0)x(1, 0, 0, 12) - AIC:688.2931195730088 ARIMA(1, 0, 0)x(1, 0, 1, 12) - 
AIC:662.6749372683774 ARIMA(1, 0, 0)x(1, 1, 0, 12) - AIC:590.7883988000217 ARIMA(1, 0, 0)x(1, 1, 1, 12) - AIC:573.825547011459 ARIMA(1, 0, 1)x(0, 0, 0, 12) - AIC:725.2611476282008 ARIMA(1, 0, 1)x(0, 0, 1, 12) - AIC:644.4595774810737 ARIMA(1, 0, 1)x(0, 1, 0, 12) - AIC:696.6355146715679 ARIMA(1, 0, 1)x(0, 1, 1, 12) - AIC:565.337721591011 ARIMA(1, 0, 1)x(1, 0, 0, 12) - AIC:651.3742765976529 ARIMA(1, 0, 1)x(1, 0, 1, 12) - AIC:657.7255114881699 ARIMA(1, 0, 1)x(1, 1, 0, 12) - AIC:592.7702867201957 ARIMA(1, 0, 1)x(1, 1, 1, 12) - AIC:567.3861300859227 ARIMA(1, 1, 0)x(0, 0, 0, 12) - AIC:750.4532664961456 ARIMA(1, 1, 0)x(0, 0, 1, 12) - AIC:665.693748389872 ARIMA(1, 1, 0)x(0, 1, 0, 12) - AIC:720.7807876037391 ARIMA(1, 1, 0)x(0, 1, 1, 12) - AIC:588.6301637485213 ARIMA(1, 1, 0)x(1, 0, 0, 12) - AIC:665.7141239363682 ARIMA(1, 1, 0)x(1, 0, 1, 12) - AIC:667.6890275833365 ARIMA(1, 1, 0)x(1, 1, 0, 12) - AIC:611.4437482645567 ARIMA(1, 1, 0)x(1, 1, 1, 12) - AIC:590.6185673644065 ARIMA(1, 1, 1)x(0, 0, 0, 12) - AIC:717.3211552781574 ARIMA(1, 1, 1)x(0, 0, 1, 12) - AIC:636.7110296932944 ARIMA(1, 1, 1)x(0, 1, 0, 12) - AIC:693.1696490581699 ARIMA(1, 1, 1)x(0, 1, 1, 12) - AIC:561.5301944999834 ARIMA(1, 1, 1)x(1, 0, 0, 12) - AIC:643.9735168529521 ARIMA(1, 1, 1)x(1, 0, 1, 12) - AIC:638.640931561371 ARIMA(1, 1, 1)x(1, 1, 0, 12) - AIC:588.5992832053371 ARIMA(1, 1, 1)x(1, 1, 1, 12) - AIC:564.5468753697722
  
  This should not be shown.
90. sgberhault 10 Aug 2024
  
  in Public
  
  Summary of SARIMAX Print the summary which includes AIC
  
  Why all these other headings? This is still part of the above?
91. sgberhault 10 Aug 2024
  
  in Public
  
  ==============================================================================
                   coef    std err          z      P>|z|      [0.025      0.975]
  ------------------------------------------------------------------------------
  ar.L1          0.0483      0.306      0.158      0.875      -0.551       0.648
  ma.L1         -1.0000    924.523     -0.001      0.999   -1813.031    1811.031
  ma.S.L12      -1.0000   2355.498     -0.000      1.000   -4617.692    4615.692
  sigma2       134.1503   3.35e+05      0.000      1.000   -6.57e+05    6.57e+05
  ==============================================================================
  
  Yup, that won't mean a thing to most readers (myself included) unless you explain it.
92. sgberhault 10 Aug 2024
  
  in Public
  
  How Fit the SARIMAX model
  
  There is no "how to" here at all.
93. sgberhault 10 Aug 2024
  
  in Public
  
  Plot Diag
  
  reference the figure correctly and explain it in the text!
  
  Also, give it an actual meaningful caption.
94. sgberhault 10 Aug 2024
  
  in Public
  
  Rigorous validation is paramount to establishing the model’s reliability and practical application. To ensure the model’s generalizability, we will employ a train-test split.
  
  Why is this just being mentioned after all of the ML stuff has happened?
95. sgberhault 10 Aug 2024
  
  in Public
  
  The AIC value is: 561.5301944999834
  
  Which tells us what?
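On its own, an AIC value tells a reader very little; it is only meaningful relative to other models fit to the same data, with lower being better (AIC = 2k - 2 ln L, where k is the number of parameters and L the maximized likelihood). The sketch below compares three values copied from the grid-search output quoted in annotation 89, all with the same differencing orders. The relative-likelihood formula exp(-delta / 2) is a standard interpretation aid, not something the source uses.

```python
import math

# AIC values copied from the grid-search output quoted in annotation 89.
candidates = {
    "ARIMA(0, 1, 1)x(0, 1, 1, 12)": 559.6920424480529,
    "ARIMA(1, 1, 1)x(0, 1, 1, 12)": 561.5301944999834,
    "ARIMA(1, 1, 0)x(0, 1, 1, 12)": 588.6301637485213,
}

best_name = min(candidates, key=candidates.get)
best_aic = candidates[best_name]

for name, aic in sorted(candidates.items(), key=lambda item: item[1]):
    delta = aic - best_aic
    # exp(-delta / 2) is the relative likelihood of a model versus the best one.
    rel = math.exp(-delta / 2)
    print(f"{name}: AIC={aic:.2f}, delta={delta:.2f}, relative likelihood={rel:.3g}")
```

Among these three, the reported value of 561.53 is not the lowest, which may be worth a sentence of explanation in the write-up.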
96. sgberhault 10 Aug 2024
  
  in Public
  
  Start date of the data: 2015-01-31 00:00:00 End date of the data: 2022-12-31 00:00:00
  
  ??
97. sgberhault 10 Aug 2024
  
  in Public
  
  To facilitate c
  
  The above graphic is a figure. Treat it as such!
  
  Also, use dashed lines for one of the entries so that we can see that they are actually perfectly overlapping and not just take your word for it.
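A small matplotlib sketch of the dashed-line suggestion, using synthetic values in which the two series are deliberately identical. The colors, line widths, and axis labels are illustrative choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(24)
observed = 50 + 8 * np.sin(2 * np.pi * x / 12)  # synthetic values
predicted = observed.copy()                      # deliberately identical

fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(x, observed, color="tab:blue", linewidth=3, label="Observed")
# A dashed, thinner line on top lets the solid line show through the gaps.
ax.plot(x, predicted, color="tab:orange", linewidth=1.5, linestyle="--", label="Predicted")
ax.set_xlabel("Month")
ax.set_ylabel("AQI")
ax.legend()
fig.tight_layout()
```

In a Quarto cell, adding `#| label: fig-...` and `#| fig-cap: "..."` options at the top would also give the figure a caption and a referenceable number.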
98. sgberhault 10 Aug 2024
  
  in Public
  
  The Mean Squared Error of our forecasts is 1.41
  
  units? What do you conclude from this?
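On the units question: a mean squared error is in squared units of the target, so its square root (RMSE) is usually easier to interpret. A quick worked value, assuming the quoted 1.41 is in squared AQI points:

```python
import math

mse = 1.41             # value quoted in the annotated text, in squared target units
rmse = math.sqrt(mse)  # back in the units of the target itself
print(f"RMSE = {rmse:.2f}")
```

Stating the RMSE next to the typical range of the series would let a reader judge whether the error is small.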
99. sgberhault 10 Aug 2024
  
  in Public
  
  Forecasting Future Values As we conclude our modeling process, we generate predictions for the next 7 data points: Model Information: The result variable contains our fitted model’s details. Forecasting Method: We use the .get_forecast() method on our model results. Prediction Generation: This method analyzes observed patterns in our data to project future values. Output: We obtain forecasts for the next 7 time points, representing predicted air quality levels. This step transforms our analytical work into actionable insights for air quality management.
  
  I don't understand what you are trying to say or do here. You have already done some of this above (I think) so I'd guess this is a summary, except that some of this I'm pretty sure I haven't seen?
100. sgberhault 10 Aug 2024
  
  in Public
  
  Our plot
  
  Reference the figure number! And stick a caption on it!
  
  Is this plot for Portland? That isn't apparent anywhere that I can see either.
101. sgberhault 10 Aug 2024
  
  in Public
  
  Interpreting the Forecast Plot
  
  unnecessary
102. sgberhault 10 Aug 2024
  
  in Public
  
  Represents the actual, historical air quality measurements Provides a baseline for comparing our predictions Forecasted Values (Orange Line) Depicts the future air quality levels predicted by our SARIMAX Time Series Model Allows us to visualize potential trends and patterns in air quality Confidence Interval (Shaded Region) The shaded area around the forecast line represents the 95% Confidence Interval (CI) Indicates the range within which we can be 95% confident that the true future values will fall Wider intervals suggest greater uncertainty in the prediction
  
  This is a publication. Use complete sentences.
103. sgberhault 10 Aug 2024
  
  in Public
  
  exasperated by the dry heat and lack of rainfall
  
  Is this the actual cause? You showed some seasonality, I'm not sure these causes were showcased.
104. sgberhault 10 Aug 2024
  
  in Public
  
  we have landed on these specific recommendations.
  
  Ok, let me just say that at this point, after reading through all your above analysis, I have NO IDEA what your recommendations are going to be. Which probably tells me that you did a poor job of actually showcasing your proof for each of these recommendations.
  
  I haven't read what they are yet, but for every recommendation you make, I should be able to go back to a specific section or figure and see the exact reason why you would make that recommendation. If that is not the case, then you are either making unfounded recommendations, or you are not communicating what your analysis was for clearly enough.
105. sgberhault 10 Aug 2024
  
  in Public
  
  As climate change raises temperatures and water sources dry up, wildfire season will continue to get worse over time.
  
  Agreed, how would you interpret your data in that light? Can you see evidence of that? Is the effect more pronounced in cities near lots of national forest? Otherwise you are just conjecturing.
106. sgberhault 10 Aug 2024
  
  in Public
  
  Weather conditions Wind speed and direction Temperature fluctuations Humidity levels Atmospheric pressure Solar radiation intensity
  
  Significantly affected? I thought you only saw a few of these at best as being significant contributors.
107. sgberhault 10 Aug 2024
  
  in Public
  
  That leaves us with three criteria gasses and all particulate matter.
  
  But again, these are just part of the definition of AQI aren't they? So of course they have a large impact?
108. sgberhault 10 Aug 2024
  
  in Public
  
  The largest source of carbon monoxide, nitrogen dioxide, and ozone is the cars, trucks, and other vehicles we use daily (Environmental Protection Agency). We can lower our reliance on personal vehicles by utilizing public transportation, carpooling, walking, biking, increasing work from home to lower commutes when available, and overall be more considerate about if driving a car is necessary.
  
  Did you see evidence of this? You had bus data. Did cities with less traffic show decreases in these values?
109. sgberhault 10 Aug 2024
  
  in Public
  
  Algorithm Dependence. This is the reliability of forecasts which are inherently tied to the chosen predictive algorithms. Different models may yield varying results, emphasizing the importance of algorithm selection and validation.
  
  So how did you choose your algorithms with this in mind?
110. sgberhault 10 Aug 2024
  
  in Public
  
  Industrial manufacturing processes and agriculture are significant polluters of the environment. We should invest in the research of more environmentally friendly manufacturing methods, working with materials that require less combustion, or are recyclable.
  
  Agreed, but I'm not sure you could see from your research if this was what was playing a large role?

Annotators

sgberhault

URL

wu-msds-capstones.github.io/Air-Quality-Index/
hypothes.is hypothes.is

Hypothesis

90
1. sgberhault 10 Aug 2024
  
  in Public
  
  Air
  
  Your link to the right up here is broken
2. sgberhault 10 Aug 2024
  
  in Public
  
  Contents
  
  Your contents here has way too much stuff. Break it into bigger chunks and use subheadings
3. sgberhault 10 Aug 2024
  
  in Public
  
  On October 30, 1948, the Donora High School Football team played through a dense smog to complete the game with hundreds of fans in the audience, despite very poor visibility.
  
  Citation?
4. sgberhault 10 Aug 2024
  
  in Public
  
  (Jacobs, Burgess, Abbott)
  
  Ah, here it is. Given that you are pulling from the story for an entire paragraph, I'd lead with some reference to this source.
  
  "In their book/article published in XXXX, Jacobs, Burgess, and Abbott tell the tale of...."
5. sgberhault 10 Aug 2024
  
  in Public
  
  This event, known as the Donora Smog of 1948, prompted the country into taking a closer look at the negative impacts of air pollution. Widespread debate surrounding the event led to the first legislation aimed at regulating the air quality within the United States, ushering in a new era of tracking, combatting, and reversing the ill effects of poor air quality.
  
  Two sentences isn't much of a paragraph.
6. sgberhault 10 Aug 2024
  
  in Public
  
  changes
  
  factors
7. sgberhault 10 Aug 2024
  
  in Public
  
  The quality of air we breathe has direct impacts on our health. We must understand the factors that contribute to poor air quality and how we individually and collectively contribute to these changes. Until we can visualize the impact we have on our atmosphere, we will continue behavior that negatively impacts the air around us.
  
  Also a short paragraph.
8. sgberhault 10 Aug 2024
  
  in Public
  
  NumPy and Pandas
  
  the NumPy and Pandas libraries
9. sgberhault 10 Aug 2024
  
  in Public
  
  Matplotlib and Seaborn for visualization, and Time Series forecasting algorithms such as Prophet and SARIMAX.
  
  This is not a complete sentence.
10. sgberhault 10 Aug 2024
  
  in Public
  
  We will address data inconsistencies, missing values and ensure that data is in a tidy format.
  
  This is not a paragraph
11. sgberhault 10 Aug 2024
  
  in Public
  
  We may need to normalize or standardize data if necessary and create new features through aggregation to enhance the model’s performance.
  
  Also not a paragraph
12. sgberhault 10 Aug 2024
  
  in Public
  
  the above section
  
  Reference the section. Section numbering helps with this
13. sgberhault 10 Aug 2024
  
  in Public
  
  p
  
  capitalized?
14. sgberhault 10 Aug 2024
  
  in Public
  
  Here’s a breakdown of its components:
  
  Is this supposed to be above the bullet points? Either way, I think those bullet points need a better intro.
15. sgberhault 10 Aug 2024
  
  in Public
  
  Metrics to Evaluate Machine Model Performance
  
  Any section needs to be introduced by text.
16. sgberhault 10 Aug 2024
  
  in Public
  
  Technique/Metric Description Purpose/Formula Scenario: Cancer prediction
  
  I don't think this table is useful in this current location. As a table, it should be just used as a reference and put at the end of the document. It is a nice summary table, for sure, but it doesn't belong smack in the middle of your paper.
  
  As far as a reader knowing what you are referring to when you use one of these terms, some you can probably safely assume you can use without explanation, and others you should bake the explanation into your text when you introduce it.
17. sgberhault 10 Aug 2024
  
  in Public
  
  Akaike Information Criteria (AIC)
  
  I don't think a reader has any idea what this is initially, so this chapter heading is kinda meaningless.
18. sgberhault 10 Aug 2024
  
  in Public
  
  Machine Learning AQI Time Series
  
  Text should introduce every section.
19. sgberhault 10 Aug 2024
  
  in Public
  
  Used to measure of a statistical model, it quantifies:
  
  Not a complete sentence
20. sgberhault 10 Aug 2024
  
  in Public
  
  Data Explaination
  
  Why is this part of the ML AQI Time Series chapter? Your chapter/heading hierarchy is extremely confusing in general.
21. sgberhault 10 Aug 2024
  
  in Public
  
  The Akaike Information Criterion (AIC) is a measure used to compare different statistical models. It helps in model selection by balancing the goodness of fit and the complexity of the model. Here’s how to interpret the AIC value:
  
  This feels more like how this section should be starting.
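
If it helps the introduction, the standard definition is AIC = 2k - 2 ln(L), where k is the number of estimated parameters and L is the maximized likelihood. A minimal sketch with made-up numbers (the two models and their log-likelihoods are hypothetical):

```python
def aic(num_params: int, log_likelihood: float) -> float:
    """Akaike Information Criterion: 2k - 2*ln(L)."""
    return 2 * num_params - 2 * log_likelihood


# Hypothetical fits: model B has two more parameters and a slightly better fit.
aic_a = aic(num_params=3, log_likelihood=-100.0)
aic_b = aic(num_params=5, log_likelihood=-99.5)

# Lower AIC is preferred; here the small gain in fit does not pay for the extra parameters.
best = "A" if aic_a < aic_b else "B"
print(aic_a, aic_b, best)
```

A short worked comparison like this shows the reader the fit-versus-complexity trade-off before the term appears in a heading.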
22. sgberhault 10 Aug 2024
  
  in Public
  
  The files were given daily on a county wide basis, separated into different files by year.
  
  So what did you collect?
23. sgberhault 10 Aug 2024
  
  in Public
  
  Indoors, high humidity can trap air, leading to the growth of mold and harmful bacteria.
  
  This feels outside the scope of what you are doing though correct?
24. sgberhault 10 Aug 2024
  
  in Public
  
  calculated
  
  aggregated
25. sgberhault 10 Aug 2024
  
  in Public
  
  Air Quality Data:
  
  These sections are too small for their own sections. Just make them their own paragraphs.
  
  EDIT: Actually, some of the later ones are more reasonable. Think about how you can balance between them though. Can you add to some to make it more reasonable as a section? Or remove from others? Maybe bullet points with a bolded starting line would be more appropriate?
26. sgberhault 10 Aug 2024
  
  in Public
  
  Carbon Monoxide
  
  Carbon Monoxide (CO)
27. sgberhault 10 Aug 2024
  
  in Public
  
  Only motorbus data was used, which may not be reflective of cities with other large methods of public transportation, such as the New York subway system.
  
  It also seems to leave out what I'd guess is probably easily the most significant transit factor: cars and trucks?
28. sgberhault 10 Aug 2024
  
  in Public
  
  is updated as of
  
  was last updated on
29. sgberhault 10 Aug 2024
  
  in Public
  
  relevant columns were selected and renamed, reducing the information being brought into our initial SQL database.
  
  Just selecting and renaming wouldn't reduce the information, unless you are trying to say that you didn't bring in anything else.
30. sgberhault 10 Aug 2024
  
  in Public
  
  and imported
  
  remove
31. sgberhault 10 Aug 2024
  
  in Public
  
  The first dimension table is the dates table, a serialized list of dates from January 1st, 2015 to December 31st, 2022.
  
  You should explain why you did this. Otherwise breaking it out into a table of essentially 1 data column seems pointless. I'm pretty sure I recall the reason why, and it is a decent reason, but that is not apparent here.
32. sgberhault 10 Aug 2024
  
  in Public
  
  as well as the population and population density
  
  that is not shown in your ERD
33. sgberhault 10 Aug 2024
  
  in Public
  
  Understanding the context of a specified line requires joining the table back to the fact table, and joining the location and date tables to that as well.
  
  Ok, but I'm pretty sure this totally undid any of the space saving measures you gained with putting dates in their own table. Because you are including a massive number of duplicate items in your main table. You could have just left them separate and still joined by truncating the date to a year and matching that + location
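
A minimal sketch of the alternative suggested above, written in pandas for brevity (the same idea applies in SQL). The table contents and the `num_busses` values are made up; only the idea of truncating the date to a year and matching on year plus location comes from the comment.

```python
import pandas as pd

# Hypothetical daily fact table and yearly transit table (made-up values).
daily = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-04", "2015-06-01", "2016-01-04"]),
    "city": ["Portland", "Portland", "Portland"],
    "aqi": [42.0, 55.0, 38.0],
})
yearly = pd.DataFrame({
    "year": [2015, 2016],
    "city": ["Portland", "Portland"],
    "num_busses": [600, 620],
})

# Truncate each date to its year, then match on year + location.
daily["year"] = daily["date"].dt.year
joined = daily.merge(yearly, on=["year", "city"], how="left")
print(joined)
```

The yearly values are stored once per city and year and are only repeated at query time, rather than being duplicated in the main table.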
34. sgberhault 10 Aug 2024
  
  in Public
  
  Finally, constraints have been added to limit unusual or impossible data.
  
  Should probably describe these, since they aren't apparent in the ERD at all.
35. sgberhault 10 Aug 2024
  
  in Public
  
  Figure 1.
  
  Reference these properly in Quarto. (It will also make your life easier)
36. sgberhault 10 Aug 2024
  
  in Public
  
  ERD Diagram
  
  You need a much more comprehensive caption here.
37. sgberhault 10 Aug 2024
  
  in Public
  
  Exploratory Data Analysis
  
  EDA is what you do to narrow down what actual analysis you want or need to do to answer your question. It probably should not be shown here unless mandatory for understanding a later piece of analysis.
38. sgberhault 10 Aug 2024
  
  in Public
  
  Dataframe Shape The DataFrame contains 147039 rows and 44 columns.
  
  Wat? Why is this here in this form?
39. sgberhault 10 Aug 2024
  
  in Public
  
  Exploring Oregon State By filtering our Dataframe for Oregon state, our DataFrame contains 2922 rows.
  
  Yeah, that's not a section, nor should it be. Mistake with #?
40. sgberhault 10 Aug 2024
  
  in Public
  
  Features Engineering Date Column Preprocessing:
  
  This is a paper, not notes of what was done. You need to explain these and describe what was done. A flowchart might also be very useful.
41. sgberhault 10 Aug 2024
  
  in Public
  
  Sweetviz Data Report Done! Use 'show' commands to display/save. [100%] 00:01 -> (00:00 left) {"model_id":"0e8836738d0b492e92ad430e32f1e8d7","version_major":2,"version_minor":0,"quarto_mimetype":"application/vnd.jupyter.widget-view+json"} Report SWEETVIZ_REPORT.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files. We have generated a complete statistical report confirming the quality of EDA steps.
  
  Again, no need for this to be here. It contributes nothing to your story. Or if it does, you need to do a MUCH better job of making that clear. You can mention and link it in an appendix if you want.
42. sgberhault 10 Aug 2024
  
  in Public
  
  - Set tsmode=True when creating the ProfileReport - Ensure our DataFrame is sorted or specify the sortby parameter - Time Series Feature Identification
  
  Mangled formatting.
43. sgberhault 10 Aug 2024
  
  in Public
  
  - Histograms are replaced with line plots - Feature details include new autocorrelation and partial autocorrelation plots - Two additional warnings may appear: NON STATIONARY and SEASONAL
  
  Mangled formatting
44. sgberhault 10 Aug 2024
  
  in Public
  
  Advanced Exploratory Data Analysis
  
  Probably also shouldn't be here, though it depends on what you mean by this.
45. sgberhault 10 Aug 2024
  
  in Public
  
  These methods allowed us to thoroughly evaluate key data quality aspects, including: Class balance in categorical variables Presence and distribution of missing values (NaN) Feature distributions and correlations Potential time-series characteristics
  
  Ok, but I haven't seen you talk about any of these yet. So what use then were they toward answering your overall question?
46. sgberhault 10 Aug 2024
  
  in Public
  
  Time Series Visualization: CO, Wind and AQI
  
  Why is this a chapter? How is it contributing? Like it might be useful information for your question, but a chapter all by itself?
47. sgberhault 10 Aug 2024
  
  in Public
  
  CO pollutant refers to carbon monoxide, which is a colorless, odorless, and tasteless gas that can be harmful to human health and the environment.
  
  Should have already established this in your background.
48. sgberhault 10 Aug 2024
  
  in Public
  
  Primarily produced by incomplete combustion of carbon-containing fuels Major sources include vehicle exhaust, industrial processes, and some natural sources like volcanoes Slightly less dense than air Highly flammable
  
  mangled formatting I think
49. sgberhault 10 Aug 2024
  
  in Public
  
  First, missing data must be addressed.
  
  This wasn't addressed as any of your earlier pre-processing?
50. sgberhault 10 Aug 2024
  
  in Public
  
  <Figure size 1000x1800 with 0 Axes>
  
  Figure appears before reference and explanation in text.
  
  Also, figure isn't actually a figure and has no caption.
  
  Also, plot is WAY too big for writeup
51. sgberhault 10 Aug 2024
  
  in Public
  
  NO2 (nitrogen dioxide) is an important air pollutant. Here’s a concise overview of it: - Reddish-brown gas with a pungent odor - Part of a group of pollutants known as nitrogen oxides (NOx) SO2 (sulfur dioxide) is an important air pollutant. Here’s a concise overview of SO2 as a pollutant: Colorless gas with a sharp, pungent odor Highly soluble in water Ozone (O₃) as a pollutant is a complex topic, as it can be both beneficial and harmful depending on its location in the atmosphere. Here’s a concise overview of ozone as a ground-level pollutant: Colorless to pale blue gas with a distinctive smell Highly reactive molecule composed of three oxygen atoms
  
  Again, wasn't all of this covered in the background?
52. sgberhault 10 Aug 2024
  
  in Public
  
  <Figure size 1500x2000 with 0 Axes>
  
  Same issues as above figure:
  - Not explained in text
  - Not an actual figure with caption and reference
  - Way too large for the format
53. sgberhault 10 Aug 2024
  
  in Public
  
  We finally completed the exploratory data analysis.
  
  And you seemingly concluded nothing from it? Why should a reader care about this?
54. sgberhault 10 Aug 2024
  
  in Public
  
  Ultimately, we want to see which variables have the greatest impact on AQI
  
  The AQI is defined in terms of some of these correct? So those should probably not be included?
55. sgberhault 10 Aug 2024
  
  in Public
  
  we must
  
  You must? That is the only possible approach?
56. sgberhault 10 Aug 2024
  
  in Public
  
  Kansas City 241
  
  I think the count is largely unnecessary to show here, but what is up with Kansas City? And why is it not discussed when it is seemingly the only take-away I get from this table?
57. sgberhault 10 Aug 2024
  
  in Public
  
  147039
  
  No. You do not include raw tabulated output like this in a publication. The columns aren't even labeled, so a reader has no idea what they are looking at. If it is worth showing a reader, then you render it properly, make sure everything is labeled, insert it as a table with a caption and reference and discuss it in the text.
58. sgberhault 10 Aug 2024
  
  in Public
  
  date
  
  Now I have even less idea of what this is showing me
59. sgberhault 10 Aug 2024
  
  in Public
  
  Since AQI is the dependent variable being measured, all rows without AQI data are dropped. Certain cities have very little data and will be dropped out of necessity.
  
  Ok. How little is very little data? Why is it necessary?
60. sgberhault 10 Aug 2024
  
  in Public
  
  The data collected has separate information for the city of New York City. NYC is divided into five boroughs, each within its own county. These values are grouped and averaged out to make NYC have the same amount of datapoints as every other city.
  
  Are other suburbs of major cities not counted separately? It seems like this could be a tricky thing to be fair about. And counties kinda already split things in an unambiguous way?
61. sgberhault 10 Aug 2024
  
  in Public
  
  date state county city population density \
  
  Pretty sure this output should absolutely be removed.
62. sgberhault 10 Aug 2024
  
  in Public
  
  The following tools are used: Train Test Split One Hot Encoder Transformer Pipeline Standard Scaler
  
  For what purposes?
63. sgberhault 10 Aug 2024
  
  in Public
  
  figure 24324
  
  I missed the other 24 thousand 300 somewhere....
  
  Also, the thing below is a table, and should be referenced and captioned as such.
64. sgberhault 10 Aug 2024
  
  in Public
  
  To perform a ML prediction algorithm, the predicted variable (AQI) must be discrete.
  
  That doesn't seem correct. You can do all manner of regression algorithms with machine learning. No need to make this into a classification problem unless your SPECIFIC algorithm requires it. In which case you should discuss why you are using that specific algorithm.
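
To illustrate the point, here is a minimal sketch of a regression model predicting a continuous AQI-like value directly, with no binning. The data is synthetic and the choice of `RandomForestRegressor` is only an example, not a claim about what the report should use.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: two features and a continuous AQI-like target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))
y = 40 + 30 * X[:, 0] + 10 * X[:, 1] + rng.normal(scale=2.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A regressor predicts the continuous value directly; no discretization is needed.
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(predictions[:5])
```

If the classification framing is kept, the text should say which algorithm requires it and why.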
65. sgberhault 10 Aug 2024
  
  in Public
  
  The bins chosen are: 0-30 31-40 41-50 51-60 61-70 71-80 81-90 91-100 101-150 151+
  
  How were these chosen?
66. sgberhault 10 Aug 2024
  
  in Public
  
  |    | date | state | county | city | population | density | aqi | temp | pressure | humidity | ... | pm100 | pm25 | num_busses | revenue | operating_expense | passenger_trips | operating_hours | passenger_miles | operating_miles | aqi_discrete |
  |---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
  | 0 | 2015-01-04 | Arizona | Maricopa | Phoenix | 4064275 | 1198.9 | 86.50 | 41.458333 | 972.860814 | 60.739583 | ... | 20.709822 | 20.505426 | 729.0 | 47024975.0 | 2.256208e+08 | 55497019.0 | 2228182.0 | 2.190928e+08 | 28371107.0 | 81-90 |
  | 1 | 2015-01-04 | California | Los Angeles | Los Angeles | 11922389 | 3184.7 | 118.50 | 50.281250 | 1010.666650 | 58.177084 | ... | 24.954331 | 21.400000 | 2259.0 | 273158938.0 | 1.056348e+09 | 367104774.0 | 7938548.0 | 1.448619e+09 | 84041668.0 | 101-150 |
  | 7 | 2015-01-04 | District Of Columbia | District of Columbia | Washington | 5116378 | 4235.7 | 45.25 | 41.149479 | 1015.388525 | 61.075000 | ... | 13.750000 | 11.088055 | 1394.0 | 149657899.0 | 6.453259e+08 | 139353079.0 | 4115200.0 | 4.293390e+08 | 39643319.0 | 41-50 |
  | 17 | 2015-01-04 | Massachusetts | Suffolk | Boston | 4328315 | 5319.0 | 42.75 | 33.458332 | 1018.536500 | 58.234375 | ... | 6.000000 | 7.244791 | 800.0 | 96572664.0 | 4.080501e+08 | 122496729.0 | 2231562.0 | 3.162285e+08 | 22115804.0 | 41-50 |
  | 18 | 2015-01-04 | Michigan | Wayne | Detroit | 3725908 | 1772.2 | 49.25 | 30.203125 | 994.260400 | 72.671876 | ... | 18.250000 | 9.394618 | 432.0 | 31303313.0 | 1.729056e+08 | 33078462.0 | 1225079.0 | 1.696881e+08 | 17705665.0 | 41-50 |

  5 rows × 25 columns
  
  What even am I looking at here?
67. sgberhault 10 Aug 2024
  
  in Public
  
  Definition 1
  
  The interactivity of the below is neat, but you need to talk about it!
68. sgberhault 10 Aug 2024
  
  in Public
  
  Definition 2
  
  I'm confused why these are just labeled as Definitions?
69. sgberhault 10 Aug 2024
  
  in Public
  
  Feature selection is done on the data.
  
  How? And what are the raw results?
70. sgberhault 10 Aug 2024
  
  in Public
  
  Carbon Monoxide Nitrogen Dioxide Ozone PM10 PM2.5
  
  Aren't all of these literally part of the definition of AQI?
71. sgberhault 10 Aug 2024
  
  in Public
  
  'city_Los Angeles', 'city_Phoenix', 'city_Portland
  
  Why these three?
72. sgberhault 10 Aug 2024
  
  in Public
  
  K nearest neighbors Tree model Random Forest model Logistic Regression Naive Bayes
  
  This is essentially the equivalent of EDA in ML. A reader doesn't care about all of the attempts that didn't go as well unless something critical was shown in that case. Just move straight to the best and discuss what it implies.
73. sgberhault 10 Aug 2024
  
  in Public
  
  AirQuality Confusion Matrix 1
  
  What model is the above even for?? How is a reader supposed to interpret this?
74. sgberhault 10 Aug 2024
  
  in Public
  
  A randomized search is run with 100 iterations.
  
  Like, actually just random values for these parameters each time?
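
For context, a randomized search does draw each candidate at random from the search space it is given. The sketch below uses scikit-learn's `ParameterSampler`, which is the sampling mechanism behind `RandomizedSearchCV`. The search space is hypothetical, and whether the report used scikit-learn's randomized search is an assumption.

```python
from sklearn.model_selection import ParameterSampler

# Hypothetical search space for a random forest (values are illustrative).
param_distributions = {
    "n_estimators": [50, 100, 200, 400],
    "max_depth": [None, 5, 10, 20],
    "min_samples_leaf": [1, 2, 5],
}

# Each iteration draws one random combination from the space above.
candidates = list(ParameterSampler(param_distributions, n_iter=5, random_state=0))
for params in candidates:
    print(params)
```

The text should state the search space and how each candidate was scored, since "100 iterations" alone does not tell the reader what was searched.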
75. sgberhault 10 Aug 2024
  
  in Public
  
  Pipeline
  
  Not explained in the text, as far as I can understand.
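
A short sketch could accompany the explanation. The step names (`aqi_transformer`, `categories`, `scaled_air_quality`, `RF_model`) follow the parameter dump quoted in a later annotation. The data is made up, the column list is shortened, and the encoder options are simplified to `handle_unknown="ignore"`, so this is an approximation and not the report's exact configuration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up dataset; labels mimic the binned AQI classes.
data = pd.DataFrame({
    "city": ["Portland", "Phoenix", "Portland", "Phoenix", "Portland", "Phoenix"],
    "temp": [41.0, 60.0, 45.0, 75.0, 50.0, 80.0],
    "humidity": [70.0, 30.0, 65.0, 25.0, 60.0, 20.0],
    "aqi_discrete": ["0-30", "51-60", "0-30", "51-60", "0-30", "51-60"],
})
features = ["city", "temp", "humidity"]

# One-hot encode the categorical column and standardize the numeric ones.
aqi_transformer = ColumnTransformer(
    transformers=[
        ("categories", OneHotEncoder(handle_unknown="ignore"), ["city"]),
        ("scaled_air_quality", StandardScaler(), ["temp", "humidity"]),
    ]
)

# Chaining the steps lets preprocessing and the model be fit in one call.
pipeline = Pipeline([
    ("aqi_transformer", aqi_transformer),
    ("RF_model", RandomForestClassifier(n_estimators=10, random_state=0)),
])
pipeline.fit(data[features], data["aqi_discrete"])
predicted_bins = pipeline.predict(data[features])
print(predicted_bins)
```

One sentence per step on why it is there (encoding city names, putting numeric features on a common scale, then classifying) would answer the question for the reader.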
76. sgberhault 10 Aug 2024
  
  in Public
  
  the model.
  
  WHICH?
77. sgberhault 10 Aug 2024
  
  in Public
  
  {'memory': None, 'steps': [('aqi_transformer', ColumnTransformer(transformers=[('categories', OneHotEncoder(handle_unknown='infrequent_if_exist', min_frequency=5, sparse_output=False), ['city']), ('scaled_air_quality', StandardScaler(), ['temp', 'humidity', 'co', 'no2', 'o3', 'pm100', 'pm25'])], verbose_feature_names_out=False)), ('RF_model', RandomForestClassifier())], 'verbose': False, 'aqi_transformer': ColumnTransformer(transformers=[('categories', OneHotEncoder(handle_unknown='infrequent_if_exist', min_frequency=5, sparse_output=False), ['city']), ('scaled_air_quality', StandardScaler(), ['temp', 'humidity', 'co', 'no2', 'o3', 'pm100', 'pm25'])], verbose_feature_names_out=False), 'RF_model': RandomForestClassifier(), 'aqi_transformer__n_jobs': None, 'aqi_transformer__remainder': 'drop', 'aqi_transformer__sparse_threshold': 0.3, 'aqi_transformer__transformer_weights': None, 'aqi_transformer__transformers': [('categories', OneHotEncoder(handle_unknown='infrequent_if_exist', min_frequency=5, sparse_output=False), ['city']), ('scaled_air_quality', StandardScaler(), ['temp', 'humidity', 'co', 'no2', 'o3', 'pm100', 'pm25'])], 'aqi_transformer__verbose': False, 'aqi_transformer__verbose_feature_names_out': False, 'aqi_transformer__categories': OneHotEncoder(handle_unknown='infrequent_if_exist', min_frequency=5, sparse_output=False), 'aqi_transformer__scaled_air_quality': StandardScaler(), 'aqi_transformer__categories__categories': 'auto', 'aqi_transformer__categories__drop': None, 'aqi_transformer__categories__dtype': numpy.float64, 'aqi_transformer__categories__feature_name_combiner': 'concat', 'aqi_transformer__categories__handle_unknown': 'infrequent_if_exist', 'aqi_transformer__categories__max_categories': None, 'aqi_transformer__categories__min_frequency': 5, 'aqi_transformer__categories__sparse_output': False, 'aqi_transformer__scaled_air_quality__copy': True, 'aqi_transformer__scaled_air_quality__with_mean': True, 
'aqi_transformer__scaled_air_quality__with_std': True, 'RF_model__bootstrap': True, 'RF_model__ccp_alpha': 0.0, 'RF_model__class_weight': None, 'RF_model__criterion': 'gini', 'RF_model__max_depth': None, 'RF_model__max_features': 'sqrt', 'RF_model__max_leaf_nodes': None, 'RF_model__max_samples': None, 'RF_model__min_impurity_decrease': 0.0, 'RF_model__min_samples_leaf': 1, 'RF_model__min_samples_split': 2, 'RF_model__min_weight_fraction_leaf': 0.0, 'RF_model__monotonic_cst': None, 'RF_model__n_estimators': 100, 'RF_model__n_jobs': None, 'RF_model__oob_score': False, 'RF_model__random_state': None, 'RF_model__verbose': 0, 'RF_model__warm_start': False}
  
  Definitely don't show this!
78. sgberhault 10 Aug 2024
  
  in Public
  
  Definition 3 0.6312159709618875
  
  What does this mean?
79. sgberhault 10 Aug 2024
  
  in Public
  
  decomposed
  
  composed
80. sgberhault 10 Aug 2024
  
  in Public
  
  Figure
  
  It is a table, not a figure.
81. sgberhault 10 Aug 2024
  
  in Public
  
  0.5265748745864021
  
  Comment on this if you are going to show it.
82. sgberhault 10 Aug 2024
  
  in Public
  
  AirQuality Confusion Matrix 2
  
  Captions need to be more detailed and discuss what a reader should take away from an image.
83. sgberhault 10 Aug 2024
  
  in Public
  
  By the help of statsmodel package we can break the time series into its seasonal pattern and trends. This will helps us to understand the data clearly and will help us to make more sense of the data.
  
  Ok, so how did you go about doing that?
84. sgberhault 10 Aug 2024
  
  in Public
  
  Decomposing the Time Series With Additive Method
  
  Is this supposed to be a much lower-level subheading?
85. sgberhault 10 Aug 2024
  
  in Public
  
  There are
  
  The above image is an unlabeled figure with no caption that is discussed nowhere in the text (at least so far). All of those things are problematic.
86. sgberhault 10 Aug 2024
  
  in Public
  
  three
  
  You literally JUST told me there were 2...
87. sgberhault 10 Aug 2024
  
  in Public
  
  If you have an increasing trend, you still see roughly the same size peaks and troughs throughout the time series. This is often seen in indexed time series where the absolute value is growing but changes stay relative.
  
  But is that what you are seeing here? It is confusing if you are talking in the abstract or about your specific data.
  
  Also, why do this? What are your takeaways? This feels like it exists in isolation.
88. sgberhault 10 Aug 2024
  
  in Public
  
  grid search method
  
  This isn't code, so it shouldn't be in monospace. Underline or italicize it if you want to set it apart, or put it in quotes.
89. sgberhault 10 Aug 2024
  
  in Public
  
  How Fit the SARIMAX model
  
  There is no "how to" here at all.
90. sgberhault 10 Aug 2024
  
  in Public
  
  attempts to compute the optimum values of hyperparameters.
  
  Say how it works!

Annotators

sgberhault

URL

hypothes.is/welcome/a48845882e08510a