223 Matching Annotations
  1. Last 7 days
    1. Air

      Your link to the right up here is broken

    2. Contents

      Your contents here has way too much stuff. Break it into bigger chunks and use subheadings

    3. On October 30, 1948, the Donora High School Football team played through a dense smog to complete the game with hundreds of fans in the audience, despite very poor visibility.

      Citation?

    4. (Jacobs, Burgess, Abbott)

      Ah, here it is. Given that you are pulling from the story for an entire paragraph, I'd lead with some reference to this source.

      "In their book/article published in XXXX, Jacobs, Burgess, and Abbott tell the tale of...."

    5. The quality of air we breathe has direct impacts on our health. We must understand the factors that contribute to poor air quality and how we individually and collectively contribute to these changes. Until we can visualize the impact we have on our atmosphere, we will continue behavior that negatively impacts the air around us.

      Also a short paragraph.

    6. This event, known as the Donora Smog of 1948, prompted the country into taking a closer look at the negative impacts of air pollution. Widespread debate surrounding the event led to the first legislation aimed at regulating the air quality within the United States, ushering in a new era of tracking, combatting, and reversing the ill effects of poor air quality.

      Two sentences isn't much of a paragraph.

    7. changes

      factors

    8. NumPy and Pandas

      the NumPy and Pandas libraries

    9. Matplotlib and Seaborn for visualization, and Time Series forecasting algorithms such as Prophet and SARIMAX.

      This is not a complete sentence.

    10. We will address data inconsistencies, missing values and ensure that data is in a tidy format.

      This is not a paragraph

    11. We may need to normalize or standardize data if necessary and create new features through aggregation to enhance the model’s performance.

      Also not a paragraph

    12. p

      capitalized?

    13. Here’s a breakdown of its components:

      Is this supposed to be above the bullet points? Either way, I think those bullet points need a better intro.

    14. Metrics to Evaluate Machine Model Performance

      Any section needs to be introduced by text.

    15. Akaike Information Criteria (AIC)

      I don't think a reader has any idea what this is initially, so this chapter heading is kinda meaningless.

    16. Technique/Metric Description Purpose/Formula Scenario: Cancer prediction

      I don't think this table is useful in this current location. As a table, it should just be used as a reference and put at the end of the document. It is a nice summary table, for sure, but it doesn't belong smack in the middle of your paper.

      As far as a reader knowing what you are referring to when you use one of these terms, some you can probably safely assume you can use without explanation, and others you should bake the explanation into your text when you introduce it.

    17. Machine Learning AQI Time Series

      Text should introduce every section.

    18. Used to measure of a statistical model, it quantifies:

      Not a complete sentence

    19. Data Explaination

      Why is this part of the ML AQI Time Series chapter? Your chapter/heading hierarchy is extremely confusing in general.

    20. The Akaike Information Criterion (AIC) is a measure used to compare different statistical models. It helps in model selection by balancing the goodness of fit and the complexity of the model. Here’s how to interpret the AIC value:

      This feels more like how this section should be starting.
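
For reference, the standard definition is short enough to state outright. Here $k$ is the number of estimated parameters and $\hat{L}$ is the maximized likelihood of the fitted model:

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L}
```

Lower is better, and only differences between AIC values for models fit to the same data are meaningful. A single AIC value on its own says very little.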

    21. The files were given daily on a county wide basis, separated into different files by year.

      So what did you collect?

    22. Indoors, high humidity can trap air, leading to the growth of mold and harmful bacteria.

      This feels outside the scope of what you are doing though, correct?

    23. Air Quality Data:

      These sections are too small for their own sections. Just make them their own paragraphs.

      EDIT: Actually, some of the later ones are more reasonable. Think about how you can balance between them though. Can you add to some to make it more reasonable as a section? Or remove from others? Maybe bullet points with a bolded starting line would be more appropriate?

    24. calculated

      aggregated

    25. Carbon Monoxide

      Carbon Monoxide (CO)

    26. Only motorbus data was used, which may not be reflective of cities with other large methods of public transportation, such as the New York subway system.

      It also seems to leave out what I'd guess is probably easily the most significant transit factor: cars and trucks?

    27. is updated as of

      was last updated on

    28. relevant columns were selected and renamed, reducing the information being brought into our initial SQL database.

      Just selecting and renaming wouldn't reduce the information, unless you are trying to say that you didn't bring in anything else.

    29. and imported

      remove

    30. The first dimension table is the dates table, a serialized list of dates from January 1st, 2015 to December 31st, 2022.

      You should explain why you did this. Otherwise breaking it out into a table of essentially 1 data column seems pointless. I'm pretty sure I recall the reason why, and it is a decent reason, but that is not apparent here.

    31. as well as the population and population density

      that is not shown in your ERD

    32. Understanding the context of a specified line requires joining the table back to the fact table, and joining the location and date tables to that as well.

      Ok, but I'm pretty sure this totally undid any of the space savings you gained by putting dates in their own table, because you are including a massive number of duplicate items in your main table. You could have just left them separate and still joined by truncating the date to a year and matching that plus location.
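
Something like the following sketch is what I have in mind. The table names (`yearly_transit`, `daily_air`) and the layout are hypothetical stand-ins, not your actual schema; the values are taken from the rows printed later in your own output:

```python
from datetime import date

# Hypothetical yearly table keyed by (year, city): one row per city per year.
yearly_transit = {
    (2015, "Phoenix"): {"num_busses": 729},
}

# Hypothetical daily fact rows; no yearly values are copied onto them.
daily_air = [
    {"date": date(2015, 1, 4), "city": "Phoenix", "aqi": 86.5},
    {"date": date(2015, 1, 4), "city": "Detroit", "aqi": 49.25},
]

def join_daily_to_yearly(daily_rows, yearly_table):
    """Attach yearly values to daily rows by truncating each date to its year."""
    joined = []
    for row in daily_rows:
        key = (row["date"].year, row["city"])
        # Left join: rows without a yearly match keep only their own columns.
        joined.append({**row, **yearly_table.get(key, {})})
    return joined

joined_rows = join_daily_to_yearly(daily_air, yearly_transit)
```

In SQL the same idea is a join on the year extracted from the date plus the location key, so the yearly values are stored once per city per year instead of once per day.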

    33. Finally, constraints have been added to limit unusual or impossible data.

      Should probably describe these, since they aren't apparent in the ERD at all.

    34. Figure 1.

      Reference these properly in Quarto. (It will also make your life easier)

    35. ERD Diagram

      You need a much more comprehensive caption here.

    36. Exploratory Data Analysis

      EDA is what you do to narrow down what actual analysis you want or need to do to answer your question. It probably should not be shown here unless mandatory for understanding a later piece of analysis.

    37. Dataframe Shape The DataFrame contains 147039 rows and 44 columns.

      Wat? Why is this here in this form?

    38. Exploring Oregon State By filtering our Dataframe for Oregon state, our DataFrame contains 2922 rows.

      Yeah, that's not a section, nor should it be. Mistake with #?

    39. Features Engineering Date Column Preprocessing:

      This is a paper, not notes of what was done. You need to explain these and describe what was done. A flowchart might also be very useful.

    40. Sweetviz Data Report Done! Use 'show' commands to display/save.    [100%]   00:01 -> (00:00 left) {"model_id":"0e8836738d0b492e92ad430e32f1e8d7","version_major":2,"version_minor":0,"quarto_mimetype":"application/vnd.jupyter.widget-view+json"} Report SWEETVIZ_REPORT.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files. We have generated a complete statistical report confirming the quality of EDA steps.

      Again, no need for this to be here. It contributes nothing to your story. Or if it does, you need to do a MUCH better job of making that clear. You can mention and link it in an appendix if you want.

    41. - Set tsmode=True when creating the ProfileReport - Ensure our DataFrame is sorted or specify the sortby parameter - Time Series Feature Identification

      Mangled formatting.

    42. Advanced Exploratory Data Analysis

      Probably also shouldn't be here, though it depends on what you mean by this.

    43. - Histograms are replaced with line plots - Feature details include new autocorrelation and partial autocorrelation plots - Two additional warnings may appear: NON STATIONARY and SEASONAL

      Mangled formatting

    44. These methods allowed us to thoroughly evaluate key data quality aspects, including: Class balance in categorical variables Presence and distribution of missing values (NaN) Feature distributions and correlations Potential time-series characteristics

      Ok, but I haven't seen you talk about any of these yet. So what use then were they toward answering your overall question?

    45. Time Series Visualization: CO, Wind and AQI

      Why is this a chapter? How is it contributing? Like it might be useful information for your question, but a chapter all by itself?

    46. NO2 (nitrogen dioxide) is an important air pollutant. Here’s a concise overview of it: - Reddish-brown gas with a pungent odor - Part of a group of pollutants known as nitrogen oxides (NOx) SO2 (sulfur dioxide) is an important air pollutant. Here’s a concise overview of SO2 as a pollutant: Colorless gas with a sharp, pungent odor Highly soluble in water Ozone (O₃) as a pollutant is a complex topic, as it can be both beneficial and harmful depending on its location in the atmosphere. Here’s a concise overview of ozone as a ground-level pollutant: Colorless to pale blue gas with a distinctive smell Highly reactive molecule composed of three oxygen atoms

      Again, wasn't all of this covered in the background?

    47. CO pollutant refers to carbon monoxide, which is a colorless, odorless, and tasteless gas that can be harmful to human health and the environment.

      Should have already established this in your background.

    48. Primarily produced by incomplete combustion of carbon-containing fuels Major sources include vehicle exhaust, industrial processes, and some natural sources like volcanoes Slightly less dense than air Highly flammable

      mangled formatting I think

    49. <Figure size 1000x1800 with 0 Axes>

      Figure appears before reference and explanation in text.

      Also, figure isn't actually a figure and has no caption.

      Also, plot is WAY too big for writeup

    50. <Figure size 1500x2000 with 0 Axes>

      Same issues as above figure:

      - Not explained in text
      - Not an actual figure with caption and reference
      - Way too large for the format

    51. we must

      You must? That is the only possible approach?

    52. We finally completed the exploratory data analysis.

      And you seemingly concluded nothing from it? Why should a reader care about this?

    53. 147039

      No. You do not include raw tabulated output like this in a publication. The columns aren't even labeled, so a reader has no idea what they are looking at. If it is worth showing a reader, then you render it properly, make sure everything is labeled, insert it as a table with a caption and reference and discuss it in the text.

    54. Ultimately, we want to see which variables have the greatest impact on AQI

      The AQI is defined in terms of some of these, correct? So those should probably not be included?
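
As I understand the EPA definition, each pollutant's concentration $C_p$ is converted to a sub-index by linear interpolation between the concentration breakpoints $BP_{Lo}$ and $BP_{Hi}$ that bracket it (with $I_{Lo}$ and $I_{Hi}$ the index values at those breakpoints), and the reported AQI is the largest sub-index:

```latex
I_p = \frac{I_{Hi} - I_{Lo}}{BP_{Hi} - BP_{Lo}}\,\left(C_p - BP_{Lo}\right) + I_{Lo},
\qquad
\mathrm{AQI} = \max_p I_p
```

If that is right, any pollutant that enters this calculation predicts AQI almost by construction, which is why I question using it as a feature.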

    55. First, missing data must be addressed.

      This wasn't addressed as any of your earlier pre-processing?

    56. date

      Now I have even less idea of what this is showing me

    57. Since AQI is the dependent variable being measured, all rows without AQI data are dropped. Certain cities have very little data and will be dropped out of necessity.

      Ok. How little is very little data? Why is it necessary?

    58. The data collected has separate information for the city of New York City. NYC is divided into five boroughs, each within its own county. These values are grouped and averaged out to make NYC have the same amount of datapoints as every other city.

      Are other suburbs of major cities not counted separately? It seems like this could be a tricky thing to be fair about. And counties kinda already split things in an unambiguous way?

    59. date state county city population density \

      Pretty sure this output should absolutely be removed.

    60. figure 24324

      I missed the other 24 thousand 300 somewhere....

      Also, the thing below is a table, and should be referenced and captioned as such.

    61. Kansas City 241

      I think the count is largely unnecessary to show here, but what is up with Kansas City? And why is it not discussed when it is seemingly the only take-away I get from this table?

    62. To perform a ML prediction algorithm, the predicted variable (AQI) must be discrete.

      That doesn't seem correct. You can do all manner of regression algorithms with machine learning. No need to make this into a classification problem unless your SPECIFIC algorithm requires it. In which case you should discuss why you are using that specific algorithm.
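
As a minimal sketch of what I mean, here is an ordinary least-squares fit that predicts AQI as a continuous value, using only the five `pm25` and `aqi` pairs visible in your printed DataFrame output. It is plain Python for illustration, not a suggestion for your final model:

```python
# pm25 and aqi values copied from the five rows shown in the printed DataFrame.
pm25 = [20.505426, 21.400000, 11.088055, 7.244791, 9.394618]
aqi = [86.50, 118.50, 45.25, 42.75, 49.25]

def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(pm25, aqi)

def predict_aqi(pm25_value):
    """Return a continuous AQI estimate; no binning of the target is needed."""
    return slope * pm25_value + intercept
```

The point is only that the target can stay numeric. Any regression model would do the same job with more features.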

    63. The bins chosen are: 0-30 31-40 41-50 51-60 61-70 71-80 81-90 91-100 101-150 151+

      How were these chosen?
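
Whatever the justification, the mapping itself should at least be unambiguous. Here is a sketch of one way to apply the bins you list, assuming (my assumption, it is not stated in the text) that each bin includes its upper edge, so a value like 30.5 falls into 31-40:

```python
from bisect import bisect_left

# Upper edges and labels taken from the bins listed in the text.
UPPER_EDGES = [30, 40, 50, 60, 70, 80, 90, 100, 150]
LABELS = ["0-30", "31-40", "41-50", "51-60", "61-70",
          "71-80", "81-90", "91-100", "101-150", "151+"]

def aqi_bin(aqi):
    """Return the label of the bin whose range contains `aqi`."""
    # bisect_left counts the edges strictly below `aqi`,
    # so a value equal to an edge stays in the lower bin.
    return LABELS[bisect_left(UPPER_EDGES, aqi)]
```

With this rule, `aqi_bin(86.5)` gives `81-90` and `aqi_bin(118.5)` gives `101-150`, which matches the `aqi_discrete` column in your printed output.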

    64. date state county city population density aqi temp pressure humidity ... pm100 pm25 num_busses revenue operating_expense passenger_trips operating_hours passenger_miles operating_miles aqi_discrete 0 2015-01-04 Arizona Maricopa Phoenix 4064275 1198.9 86.50 41.458333 972.860814 60.739583 ... 20.709822 20.505426 729.0 47024975.0 2.256208e+08 55497019.0 2228182.0 2.190928e+08 28371107.0 81-90 1 2015-01-04 California Los Angeles Los Angeles 11922389 3184.7 118.50 50.281250 1010.666650 58.177084 ... 24.954331 21.400000 2259.0 273158938.0 1.056348e+09 367104774.0 7938548.0 1.448619e+09 84041668.0 101-150 7 2015-01-04 District Of Columbia District of Columbia Washington 5116378 4235.7 45.25 41.149479 1015.388525 61.075000 ... 13.750000 11.088055 1394.0 149657899.0 6.453259e+08 139353079.0 4115200.0 4.293390e+08 39643319.0 41-50 17 2015-01-04 Massachusetts Suffolk Boston 4328315 5319.0 42.75 33.458332 1018.536500 58.234375 ... 6.000000 7.244791 800.0 96572664.0 4.080501e+08 122496729.0 2231562.0 3.162285e+08 22115804.0 41-50 18 2015-01-04 Michigan Wayne Detroit 3725908 1772.2 49.25 30.203125 994.260400 72.671876 ... 18.250000 9.394618 432.0 31303313.0 1.729056e+08 33078462.0 1225079.0 1.696881e+08 17705665.0 41-50 5 rows × 25 columns date state county city population density aqi temp pressure humidity ... pm100 pm25 num_busses revenue operating_expense passenger_trips operating_hours passenger_miles operating_miles aqi_discrete 0 2015-01-04 Arizona Maricopa Phoenix 4064275 1198.9 86.50 41.458333 972.860814 60.739583 ... 20.709822 20.505426 729.0 47024975.0 2.256208e+08 55497019.0 2228182.0 2.190928e+08 28371107.0 81-90 1 2015-01-04 California Los Angeles Los Angeles 11922389 3184.7 118.50 50.281250 1010.666650 58.177084 ... 24.954331 21.400000 2259.0 273158938.0 1.056348e+09 367104774.0 7938548.0 1.448619e+09 84041668.0 101-150 7 2015-01-04 District Of Columbia District of Columbia Washington 5116378 4235.7 45.25 41.149479 1015.388525 61.075000 ... 
13.750000 11.088055 1394.0 149657899.0 6.453259e+08 139353079.0 4115200.0 4.293390e+08 39643319.0 41-50 17 2015-01-04 Massachusetts Suffolk Boston 4328315 5319.0 42.75 33.458332 1018.536500 58.234375 ... 6.000000 7.244791 800.0 96572664.0 4.080501e+08 122496729.0 2231562.0 3.162285e+08 22115804.0 41-50 18 2015-01-04 Michigan Wayne Detroit 3725908 1772.2 49.25 30.203125 994.260400 72.671876 ... 18.250000 9.394618 432.0 31303313.0 1.729056e+08 33078462.0 1225079.0 1.696881e+08 17705665.0 41-50 5 rows × 25 columns

      What even am I looking at here?

    65. Definition 1

      The interactivity of the below is neat, but you need to talk about it!

    66. The following tools are used: Train Test Split One Hot Encoder Transformer Pipeline Standard Scaler

      For what purposes?

    67. Feature selection is done on the data.

      How? And what are the raw results?

    68. Carbon Monoxide Nitrogen Dioxide Ozone PM10 PM2.5

      Aren't all of these literally part of the definition of AQI?

    69. K nearest neighbors Tree model Random Forest model Logistic Regression Naive Bayes

      This is essentially the equivalent of EDA in ML. A reader doesn't care about all of the attempts that didn't go as well unless something critical was shown in that case. Just move straight to the best and discuss what it implies.

    70. 'city_Los Angeles', 'city_Phoenix', 'city_Portland

      These three?

    71. Definition 2

      I'm confused why these are just labeled as Definitions?

    72. AirQuality Confusion Matrix 1

      What model is the above even for?? How is a reader supposed to interpret this?

    73. Pipeline

      Not explained in the text, as far as I can understand.

    74. the model.

      WHICH?

    75. A randomized search is run with 100 iterations.

      As in, actually just random values for these parameters each time?
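
My understanding is that a randomized search draws each candidate setting at random from ranges or lists you specify, scores it, and keeps the best. A plain-Python sketch of that idea follows. The parameter names, candidate values, and the `score` function are hypothetical placeholders, not your actual search space or model:

```python
import random

# Hypothetical search space: each entry is a list of allowed values.
SEARCH_SPACE = {
    "n_estimators": [50, 100, 200, 400],
    "max_depth": [None, 5, 10, 20],
    "min_samples_leaf": [1, 2, 5],
}

def score(params):
    """Placeholder for cross-validated accuracy; replace with a real evaluation."""
    depth = params["max_depth"] or 30  # treat "no limit" as a large depth
    return 1.0 / (1.0 + abs(depth - 10) + params["min_samples_leaf"])

def randomized_search(space, n_iter, seed=0):
    """Sample `n_iter` random settings and return the best one with its score."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        # One independent random draw per parameter.
        params = {name: rng.choice(values) for name, values in space.items()}
        current = score(params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score

best_params, best_score = randomized_search(SEARCH_SPACE, n_iter=100)
```

If that is what was done, the text should state the ranges that were sampled and the metric used to pick the winner.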

    76. {'memory': None, 'steps': [('aqi_transformer', ColumnTransformer(transformers=[('categories', OneHotEncoder(handle_unknown='infrequent_if_exist', min_frequency=5, sparse_output=False), ['city']), ('scaled_air_quality', StandardScaler(), ['temp', 'humidity', 'co', 'no2', 'o3', 'pm100', 'pm25'])], verbose_feature_names_out=False)), ('RF_model', RandomForestClassifier())], 'verbose': False, 'aqi_transformer': ColumnTransformer(transformers=[('categories', OneHotEncoder(handle_unknown='infrequent_if_exist', min_frequency=5, sparse_output=False), ['city']), ('scaled_air_quality', StandardScaler(), ['temp', 'humidity', 'co', 'no2', 'o3', 'pm100', 'pm25'])], verbose_feature_names_out=False), 'RF_model': RandomForestClassifier(), 'aqi_transformer__n_jobs': None, 'aqi_transformer__remainder': 'drop', 'aqi_transformer__sparse_threshold': 0.3, 'aqi_transformer__transformer_weights': None, 'aqi_transformer__transformers': [('categories', OneHotEncoder(handle_unknown='infrequent_if_exist', min_frequency=5, sparse_output=False), ['city']), ('scaled_air_quality', StandardScaler(), ['temp', 'humidity', 'co', 'no2', 'o3', 'pm100', 'pm25'])], 'aqi_transformer__verbose': False, 'aqi_transformer__verbose_feature_names_out': False, 'aqi_transformer__categories': OneHotEncoder(handle_unknown='infrequent_if_exist', min_frequency=5, sparse_output=False), 'aqi_transformer__scaled_air_quality': StandardScaler(), 'aqi_transformer__categories__categories': 'auto', 'aqi_transformer__categories__drop': None, 'aqi_transformer__categories__dtype': numpy.float64, 'aqi_transformer__categories__feature_name_combiner': 'concat', 'aqi_transformer__categories__handle_unknown': 'infrequent_if_exist', 'aqi_transformer__categories__max_categories': None, 'aqi_transformer__categories__min_frequency': 5, 'aqi_transformer__categories__sparse_output': False, 'aqi_transformer__scaled_air_quality__copy': True, 'aqi_transformer__scaled_air_quality__with_mean': True, 
'aqi_transformer__scaled_air_quality__with_std': True, 'RF_model__bootstrap': True, 'RF_model__ccp_alpha': 0.0, 'RF_model__class_weight': None, 'RF_model__criterion': 'gini', 'RF_model__max_depth': None, 'RF_model__max_features': 'sqrt', 'RF_model__max_leaf_nodes': None, 'RF_model__max_samples': None, 'RF_model__min_impurity_decrease': 0.0, 'RF_model__min_samples_leaf': 1, 'RF_model__min_samples_split': 2, 'RF_model__min_weight_fraction_leaf': 0.0, 'RF_model__monotonic_cst': None, 'RF_model__n_estimators': 100, 'RF_model__n_jobs': None, 'RF_model__oob_score': False, 'RF_model__random_state': None, 'RF_model__verbose': 0, 'RF_model__warm_start': False}

      Definitely don't show this!

    77. Definition 3   0.6312159709618875

      What does this mean?

    78. Figure

      It is a table, not a figure.

    79. 0.5265748745864021

      Comment on this if you are going to show it.

    80. Decomposing the Time Series With Additive Method

      Is this supposed to be more of a subheading?

    81. AirQuality Confusion Matrix 2

      Captions need to be more detailed and discuss what a reader should take away from an image.

    82. decomposed

      composed

    83. By the help of statsmodel package we can break the time series into its seasonal pattern and trends. This will helps us to understand the data clearly and will help us to make more sense of the data.

      Ok, so how did you go about doing that?
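
For instance, the classical additive decomposition can be described in three steps: estimate the trend with a centered moving average, average the detrended values by position in the cycle to get the seasonal component, and call what is left the residual. The sketch below is plain Python written from that textbook description. It is not the statsmodels implementation, and the toy series is synthetic:

```python
def additive_decompose(series, period):
    """Split `series` into trend, seasonal, and residual lists (additive model)."""
    n = len(series)
    half = period // 2

    # Step 1: trend from a centered moving average. For an even period the
    # two end points of the window get half weight so the window stays centered.
    trend = [None] * n
    for t in range(half, n - half):
        window = series[t - half : t + half + 1]
        total = sum(window)
        if period % 2 == 0:
            total -= 0.5 * (window[0] + window[-1])
        trend[t] = total / period

    # Step 2: seasonal effect = mean detrended value at each cycle position,
    # shifted so the effects average to zero over one cycle.
    buckets = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            buckets[t % period].append(series[t] - trend[t])
    means = [sum(b) / len(b) for b in buckets]
    offset = sum(means) / period
    seasonal = [means[t % period] - offset for t in range(n)]

    # Step 3: the residual is whatever trend and seasonality do not explain.
    residual = [None if trend[t] is None else series[t] - trend[t] - seasonal[t]
                for t in range(n)]
    return trend, seasonal, residual

# Synthetic example: linear trend plus a repeating 4-step pattern.
pattern = [2.0, -1.0, -3.0, 2.0]
toy_series = [10 + 0.5 * t + pattern[t % 4] for t in range(24)]
trend, seasonal, residual = additive_decompose(toy_series, period=4)
```

Whatever the package does internally, the text should say which model (additive or multiplicative) and which period were used, and why.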

    84. three

      You literally JUST told me there were 2...

    85. There are

      The above image is an unlabeled figure with no caption that is discussed nowhere in the text (at least so far). All of those things are problematic.

    86. If you have an increasing trend, you still see roughly the same size peaks and troughs throughout the time series. This is often seen in indexed time series where the absolute value is growing but changes stay relative.

      But is that what you are seeing here? It is confusing if you are talking in the abstract or about your specific data.

      Also, why do this? What are your takeaways? This feels like it exists in isolation.

    87. attempts to compute the optimum values of hyperparameters.

      Say how it works!
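
From your printed output, it looks like every combination of 0 and 1 for the six order terms was fit with a seasonal period of 12, and the model with the lowest AIC was kept. If so, say that. Here is a sketch of the enumeration, with a handful of AIC values copied (rounded) from your printed log standing in for the actual model fits:

```python
from itertools import product

SEASONAL_PERIOD = 12

# Every (p, d, q) x (P, D, Q) combination with each term in {0, 1}.
orders = list(product([0, 1], repeat=3))
grid = [(order, seasonal + (SEASONAL_PERIOD,))
        for order in orders for seasonal in orders]

# A few AIC values copied (rounded) from the printed log; in the real search
# each candidate in `grid` would be fit and its AIC recorded here.
logged_aic = {
    ((0, 0, 0), (0, 0, 0, 12)): 969.54,
    ((0, 1, 1), (0, 1, 1, 12)): 559.69,
    ((1, 1, 1), (0, 1, 1, 12)): 561.53,
    ((1, 1, 1), (1, 1, 1, 12)): 564.55,
}

# Model selection: keep the candidate with the smallest AIC.
best_candidate = min(logged_aic, key=logged_aic.get)
```

That gives 64 candidate models in total, which is a number worth stating in the text in place of the raw log.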

    88. grid search method

      This isn't code, so it shouldn't be in monospace. Underline or italicize it if you want to set it apart, or put it in quotes.

    89. ARIMA(0, 0, 0)x(0, 0, 0, 12) - AIC:969.5419650946665 ARIMA(0, 0, 0)x(0, 0, 1, 12) - AIC:799.0140026908043 ARIMA(0, 0, 0)x(0, 1, 0, 12) - AIC:701.7072455506197 ARIMA(0, 0, 0)x(0, 1, 1, 12) - AIC:568.3211239351035 ARIMA(0, 0, 0)x(1, 0, 0, 12) - AIC:708.2727189545345 ARIMA(0, 0, 0)x(1, 0, 1, 12) - AIC:660.9171130206936 ARIMA(0, 0, 0)x(1, 1, 0, 12) - AIC:596.1563221105039 ARIMA(0, 0, 0)x(1, 1, 1, 12) - AIC:571.8620221843147 ARIMA(0, 0, 1)x(0, 0, 0, 12) - AIC:888.4893265461405 ARIMA(0, 0, 1)x(0, 0, 1, 12) - AIC:754.7451219152275 ARIMA(0, 0, 1)x(0, 1, 0, 12) - AIC:695.0468020327725 ARIMA(0, 0, 1)x(0, 1, 1, 12) - AIC:563.3526496700842 ARIMA(0, 0, 1)x(1, 0, 0, 12) - AIC:708.3487691701486 ARIMA(0, 0, 1)x(1, 0, 1, 12) - AIC:655.8968840891383 ARIMA(0, 0, 1)x(1, 1, 0, 12) - AIC:598.1490374699148 ARIMA(0, 0, 1)x(1, 1, 1, 12) - AIC:566.3367865157978 ARIMA(0, 1, 0)x(0, 0, 0, 12) - AIC:769.1876196189784 ARIMA(0, 1, 0)x(0, 0, 1, 12) - AIC:681.4253047727481 ARIMA(0, 1, 0)x(0, 1, 0, 12) - AIC:740.3973501203114 ARIMA(0, 1, 0)x(0, 1, 1, 12) - AIC:606.0067883430007 ARIMA(0, 1, 0)x(1, 0, 0, 12) - AIC:688.9276375883021 ARIMA(0, 1, 0)x(1, 0, 1, 12) - AIC:683.2372837276466 ARIMA(0, 1, 0)x(1, 1, 0, 12) - AIC:637.9760649104885 ARIMA(0, 1, 0)x(1, 1, 1, 12) - AIC:607.9989487123431 ARIMA(0, 1, 1)x(0, 0, 0, 12) - AIC:717.0512101206406 ARIMA(0, 1, 1)x(0, 0, 1, 12) - AIC:636.373429528529 ARIMA(0, 1, 1)x(0, 1, 0, 12) - AIC:692.512410906277 ARIMA(0, 1, 1)x(0, 1, 1, 12) - AIC:559.6920424480529 ARIMA(0, 1, 1)x(1, 0, 0, 12) - AIC:650.5293595230056 ARIMA(0, 1, 1)x(1, 0, 1, 12) - AIC:638.1908637932411 ARIMA(0, 1, 1)x(1, 1, 0, 12) - AIC:594.940391452659 ARIMA(0, 1, 1)x(1, 1, 1, 12) - AIC:562.5484300875305 ARIMA(1, 0, 0)x(0, 0, 0, 12) - AIC:775.150570595756 ARIMA(1, 0, 0)x(0, 0, 1, 12) - AIC:688.1982167211085 ARIMA(1, 0, 0)x(0, 1, 0, 12) - AIC:702.425519762607 ARIMA(1, 0, 0)x(0, 1, 1, 12) - AIC:570.1689904036024 ARIMA(1, 0, 0)x(1, 0, 0, 12) - AIC:688.2931195730088 ARIMA(1, 0, 0)x(1, 0, 1, 12) - 
AIC:662.6749372683774 ARIMA(1, 0, 0)x(1, 1, 0, 12) - AIC:590.7883988000217 ARIMA(1, 0, 0)x(1, 1, 1, 12) - AIC:573.825547011459 ARIMA(1, 0, 1)x(0, 0, 0, 12) - AIC:725.2611476282008 ARIMA(1, 0, 1)x(0, 0, 1, 12) - AIC:644.4595774810737 ARIMA(1, 0, 1)x(0, 1, 0, 12) - AIC:696.6355146715679 ARIMA(1, 0, 1)x(0, 1, 1, 12) - AIC:565.337721591011 ARIMA(1, 0, 1)x(1, 0, 0, 12) - AIC:651.3742765976529 ARIMA(1, 0, 1)x(1, 0, 1, 12) - AIC:657.7255114881699 ARIMA(1, 0, 1)x(1, 1, 0, 12) - AIC:592.7702867201957 ARIMA(1, 0, 1)x(1, 1, 1, 12) - AIC:567.3861300859227 ARIMA(1, 1, 0)x(0, 0, 0, 12) - AIC:750.4532664961456 ARIMA(1, 1, 0)x(0, 0, 1, 12) - AIC:665.693748389872 ARIMA(1, 1, 0)x(0, 1, 0, 12) - AIC:720.7807876037391 ARIMA(1, 1, 0)x(0, 1, 1, 12) - AIC:588.6301637485213 ARIMA(1, 1, 0)x(1, 0, 0, 12) - AIC:665.7141239363682 ARIMA(1, 1, 0)x(1, 0, 1, 12) - AIC:667.6890275833365 ARIMA(1, 1, 0)x(1, 1, 0, 12) - AIC:611.4437482645567 ARIMA(1, 1, 0)x(1, 1, 1, 12) - AIC:590.6185673644065 ARIMA(1, 1, 1)x(0, 0, 0, 12) - AIC:717.3211552781574 ARIMA(1, 1, 1)x(0, 0, 1, 12) - AIC:636.7110296932944 ARIMA(1, 1, 1)x(0, 1, 0, 12) - AIC:693.1696490581699 ARIMA(1, 1, 1)x(0, 1, 1, 12) - AIC:561.5301944999834 ARIMA(1, 1, 1)x(1, 0, 0, 12) - AIC:643.9735168529521 ARIMA(1, 1, 1)x(1, 0, 1, 12) - AIC:638.640931561371 ARIMA(1, 1, 1)x(1, 1, 0, 12) - AIC:588.5992832053371 ARIMA(1, 1, 1)x(1, 1, 1, 12) - AIC:564.5468753697722

      This should not be shown.

    90. Summary of SARIMAX Print the summary which includes AIC

      Why all these other headings? This is still part of the above?

    91. ============================================================================== coef std err z P>|z| [0.025 0.975] ------------------------------------------------------------------------------ ar.L1 0.0483 0.306 0.158 0.875 -0.551 0.648 ma.L1 -1.0000 924.523 -0.001 0.999 -1813.031 1811.031 ma.S.L12 -1.0000 2355.498 -0.000 1.000 -4617.692 4615.692 sigma2 134.1503 3.35e+05 0.000 1.000 -6.57e+05 6.57e+05 ==============================================================================

      Yup, that won't mean a thing to most readers (myself included) unless you explain it.

    92. How Fit the SARIMAX model

      There is no "how to" here at all.

    93. Plot Diag

      reference the figure correctly and explain it in the text!

      Also, give it an actual meaningful caption.

    94. Rigorous validation is paramount to establishing the model’s reliability and practical application. To ensure the model’s generalizability, we will employ a train-test split.

      Why is this just being mentioned after all of the ML stuff has happened?

    95. The AIC value is: 561.5301944999834

      Which tells us what?

    96. Start date of the data: 2015-01-31 00:00:00 End date of the data: 2022-12-31 00:00:00

      ??

    97. To facilitate c

      The above graphic is a figure. Treat it as such!

      Also, use dashed lines for one of the entries so that we can see that they are actually perfectly overlapping and don't just have to take your word for it.

    98. The Mean Squared Error of our forecasts is 1.41

      units? What do you conclude from this?
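
Remember that MSE is in squared units of whatever was forecast, so it is hard to read directly. Taking the square root puts the error back on the original scale. A quick sketch (the helper names are mine):

```python
import math

def mean_squared_error(actual, predicted):
    """Average of squared differences; units are the square of the input units."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def root_mean_squared_error(actual, predicted):
    """Square root of the MSE; same units as the input values."""
    return math.sqrt(mean_squared_error(actual, predicted))

# The MSE reported in the text, converted back to the original scale.
reported_rmse = math.sqrt(1.41)
```

For the reported MSE of 1.41 that is an RMSE of about 1.19, in the same units as the forecast series. Say what those units are and whether an error of that size matters.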

    99. Forecasting Future Values As we conclude our modeling process, we generate predictions for the next 7 data points: Model Information: The result variable contains our fitted model’s details. Forecasting Method: We use the .get_forecast() method on our model results. Prediction Generation: This method analyzes observed patterns in our data to project future values. Output: We obtain forecasts for the next 7 time points, representing predicted air quality levels. This step transforms our analytical work into actionable insights for air quality management.

      I don't understand what you are trying to say or do here. You have already done some of this above (I think) so I'd guess this is a summary, except that some of this I'm pretty sure I haven't seen?

    100. Our plot

      Reference the figure number! And stick a caption on it!

      Is this plot for Portland? That isn't apparent anywhere that I can see either.

    101. Interpreting the Forecast Plot

      unnecessary

    102. Represents the actual, historical air quality measurements Provides a baseline for comparing our predictions Forecasted Values (Orange Line) Depicts the future air quality levels predicted by our SARIMAX Time Series Model Allows us to visualize potential trends and patterns in air quality Confidence Interval (Shaded Region) The shaded area around the forecast line represents the 95% Confidence Interval (CI) Indicates the range within which we can be 95% confident that the true future values will fall Wider intervals suggest greater uncertainty in the prediction

      This is a publication. Use complete sentences.

    103. exasperated by the dry heat and lack of rainfall

      Is this the actual cause? You showed some seasonality, I'm not sure these causes were showcased.

    104. we have landed on these specific recommendations.

      Ok, let me just say that at this point, after reading through all your above analysis, I have NO IDEA what your recommendations are going to be. Which probably tells me that you did a poor job of actually showcasing your proof for each of these recommendations.

      I haven't read what they are yet, but for every recommendation you make, I should be able to go back to a specific section or figure and see the exact reason why you would make that recommendation. If that is not the case, then you are either making unfounded recommendations, or you are not communicating clearly enough what your analysis was for.

    105. As climate change raises temperatures and water sources dry up, wildfire season will continue to get worse over time.

      Agreed, how would you interpret your data in that light? Can you see evidence of that? Is the effect more pronounced in cities near lots of national forest? Otherwise you are just conjecturing.

    106. Weather conditions Wind speed and direction Temperature fluctuations Humidity levels Atmospheric pressure Solar radiation intensity

      Significantly affected? I thought you only saw a few of these at best as being significant contributors.

    107. That leaves us with three criteria gasses and all particulate matter.

      But again, these are just part of the definition of AQI, aren't they? So of course they have a large impact?

    108. The largest source of carbon monoxide, nitrogen dioxide, and ozone is the cars, trucks, and other vehicles we use daily (Environmental Protection Agency). We can lower our reliance on personal vehicles by utilizing public transportation, carpooling, walking, biking, increasing work from home to lower commutes when available, and overall be more considerate about if driving a car is necessary.

      Did you see evidence of this? You had bus data. Did cities with less traffic show decreases in these values?

    109. Algorithm Dependence. This is the reliability of forecasts which are inherently tied to the chosen predictive algorithms. Different models may yield varying results, emphasizing the importance of algorithm selection and validation.

      So how did you choose your algorithms with this in mind?

    110. Industrial manufacturing processes and agriculture are significant polluters of the environment. We should invest in the research of more environmentally friendly manufacturing methods, working with materials that require less combustion, or are recyclable.

      Agreed, but I'm not sure you could see from your research if this was what was playing a large role?

    1. Air

      Your link to the right up here is broken

    2. Contents

      Your contents here has way too much stuff. Break it into bigger chunks and use subheadings

    3. On October 30, 1948, the Donora High School Football team played through a dense smog to complete the game with hundreds of fans in the audience, despite very poor visibility.

      Citation?

    4. (Jacobs, Burgess, Abbott)

      Ah, here it is. Given that you are pulling from the story for an entire paragraph, I'd lead with some reference to this source.

      "In their book/article published in XXXX, Jacobs, Burgess, and Abbott tell the tale of...."

    5. This event, known as the Donora Smog of 1948, prompted the country into taking a closer look at the negative impacts of air pollution. Widespread debate surrounding the event led to the first legislation aimed at regulating the air quality within the United States, ushering in a new era of tracking, combatting, and reversing the ill effects of poor air quality.

      Two sentences isn't much of a paragraph.

    6. changes

      factors

    7. The quality of air we breathe has direct impacts on our health. We must understand the factors that contribute to poor air quality and how we individually and collectively contribute to these changes. Until we can visualize the impact we have on our atmosphere, we will continue behavior that negatively impacts the air around us.

      Also a short paragraph.

    8. NumPy and Pandas

      the NumPy and Pandas libraries

    9. Matplotlib and Seaborn for visualization, and Time Series forecasting algorithms such as Prophet and SARIMAX.

      This is not a complete sentence.

    10. We will address data inconsistencies, missing values and ensure that data is in a tidy format.

      This is not a paragraph

    11. We may need to normalize or standardize data if necessary and create new features through aggregation to enhance the model’s performance.

      Also not a paragraph

    12. the above section

      Reference the section. Section numbering helps with this

    13. p

      capitalized?

    14. Here’s a breakdown of its components:

      Is this supposed to be above the bullet points? Either way, I think those bullet points need a better intro.

    15. Metrics to Evaluate Machine Model Performance

      Any section needs to be introduced by text.

    16. Technique/Metric Description Purpose/Formula Scenario: Cancer prediction

      I don't think this table is useful in its current location. As a table, it should just be used as a reference and put at the end of the document. It is a nice summary table, for sure, but it doesn't belong smack in the middle of your paper.

      As far as a reader knowing what you are referring to when you use one of these terms, some you can probably safely assume you can use without explanation, and others you should bake the explanation into your text when you introduce it.

    17. Akaike Information Criteria (AIC)

      I don't think a reader has any idea what this is initially, so this chapter heading is kinda meaningless.

    18. Machine Learning AQI Time Series

      Text should introduce every section.

    19. Used to measure of a statistical model, it quantifies:

      Not a complete sentence

    20. Data Explaination

      Why is this part of the ML AQI Time Series chapter? Your chapter/heading hierarchy is extremely confusing in general.

    21. The Akaike Information Criterion (AIC) is a measure used to compare different statistical models. It helps in model selection by balancing the goodness of fit and the complexity of the model. Here’s how to interpret the AIC value:

      This feels more like how this section should be starting.
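
      For reference, the standard definition is short enough to state in that opening. Writing k for the number of estimated parameters and \hat{L} for the maximized value of the likelihood (symbols chosen here, not taken from the paper):

$$
\mathrm{AIC} = 2k - 2\ln\hat{L}
$$

      Among models fit to the same data, the one with the lowest AIC is preferred; the 2k term is what penalizes added complexity.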

    22. The files were given daily on a county wide basis, separated into different files by year.

      So what did you collect?

    23. Indoors, high humidity can trap air, leading to the growth of mold and harmful bacteria.

      This feels outside the scope of what you are doing, though, correct?

    24. calculated

      aggregated

    25. Air Quality Data:

      These sections are too small for their own sections. Just make them their own paragraphs.

      EDIT: Actually, some of the later ones are more reasonable. Think about how you can balance between them though. Can you add to some to make it more reasonable as a section? Or remove from others? Maybe bullet points with a bolded starting line would be more appropriate?

    26. Carbon Monoxide

      Carbon Monoxide (CO)

    27. Only motorbus data was used, which may not be reflective of cities with other large methods of public transportation, such as the New York subway system.

      It also seems to leave out what I'd guess is probably easily the most significant transit factor: cars and trucks?

    28. is updated as of

      was last updated on

    29. relevant columns were selected and renamed, reducing the information being brought into our initial SQL database.

      Just selecting and renaming wouldn't reduce the information, unless you are trying to say that you didn't bring in anything else.

    30. and imported

      remove

    31. The first dimension table is the dates table, a serialized list of dates from January 1st, 2015 to December 31st, 2022.

      You should explain why you did this. Otherwise breaking it out into a table of essentially 1 data column seems pointless. I'm pretty sure I recall the reason why, and it is a decent reason, but that is not apparent here.

    32. as well as the population and population density

      that is not shown in your ERD

    33. Understanding the context of a specified line requires joining the table back to the fact table, and joining the location and date tables to that as well.

      Ok, but I'm pretty sure this totally undid any of the space saving measures you gained with putting dates in their own table. Because you are including a massive number of duplicate items in your main table. You could have just left them separate and still joined by truncating the date to a year and matching that + location
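
      A minimal sketch of the year-truncation join suggested above, using SQLite only so that the example is self-contained. The table and column names (`air_quality`, `transit`, `location_id`, `year`) and all row values are made up for illustration and are not taken from the paper's schema.

```python
import sqlite3

# In-memory database with two hypothetical tables:
# daily air-quality rows and yearly transit rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE air_quality (location_id INTEGER, date TEXT, aqi REAL);
    CREATE TABLE transit (location_id INTEGER, year INTEGER, num_busses INTEGER);
    INSERT INTO air_quality VALUES (1, '2015-01-04', 86.5), (1, '2016-01-03', 70.0);
    INSERT INTO transit VALUES (1, 2015, 729), (1, 2016, 750);
""")

# Join each daily row to the yearly row for the same location by
# truncating the date to its year, so yearly values are stored only once.
rows = conn.execute("""
    SELECT a.date, a.aqi, t.num_busses
    FROM air_quality AS a
    JOIN transit AS t
      ON t.location_id = a.location_id
     AND t.year = CAST(strftime('%Y', a.date) AS INTEGER)
    ORDER BY a.date
""").fetchall()

for row in rows:
    print(row)
```

      In PostgreSQL the same condition could be written with `EXTRACT(YEAR FROM a.date)` instead of `strftime`.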

    34. Finally, constraints have been added to limit unusual or impossible data.

      Should probably describe these, since they aren't apparent in the ERD at all.

    35. Figure 1.

      Reference these properly in Quarto. (It will also make your life easier)

    36. ERD Diagram

      You need a much more comprehensive caption here.

    37. Exploratory Data Analysis

      EDA is what you do to narrow down what actual analysis you want or need to do to answer your question. It probably should not be shown here unless mandatory for understanding a later piece of analysis.

    38. Dataframe Shape The DataFrame contains 147039 rows and 44 columns.

      Wat? Why is this here in this form?

    39. Exploring Oregon State By filtering our Dataframe for Oregon state, our DataFrame contains 2922 rows.

      Yeah, that's not a section, nor should it be. Mistake with #?

    40. Features Engineering Date Column Preprocessing:

      This is a paper, not notes of what was done. You need to explain these and describe what was done. A flowchart might also be very useful.

    41. Sweetviz Data Report Done! Use 'show' commands to display/save.    [100%]   00:01 -> (00:00 left) {"model_id":"0e8836738d0b492e92ad430e32f1e8d7","version_major":2,"version_minor":0,"quarto_mimetype":"application/vnd.jupyter.widget-view+json"} Report SWEETVIZ_REPORT.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files. We have generated a complete statistical report confirming the quality of EDA steps.

      Again, no need for this to be here. It contributes nothing to your story. Or if it does, you need to do a MUCH better job of making that clear. You can mention and link it in an appendix if you want.

    42. - Set tsmode=True when creating the ProfileReport - Ensure our DataFrame is sorted or specify the sortby parameter - Time Series Feature Identification

      Mangled formatting.

    43. - Histograms are replaced with line plots - Feature details include new autocorrelation and partial autocorrelation plots - Two additional warnings may appear: NON STATIONARY and SEASONAL

      Mangled formatting

    44. Advanced Exploratory Data Analysis

      Probably also shouldn't be here, though it depends on what you mean by this.

    45. These methods allowed us to thoroughly evaluate key data quality aspects, including: Class balance in categorical variables Presence and distribution of missing values (NaN) Feature distributions and correlations Potential time-series characteristics

      Ok, but I haven't seen you talk about any of these yet. So what use then were they toward answering your overall question?

    46. Time Series Visualization: CO, Wind and AQI

      Why is this a chapter? How is it contributing? Like it might be useful information for your question, but a chapter all by itself?

    47. CO pollutant refers to carbon monoxide, which is a colorless, odorless, and tasteless gas that can be harmful to human health and the environment.

      Should have already established this in your background.

    48. Primarily produced by incomplete combustion of carbon-containing fuels Major sources include vehicle exhaust, industrial processes, and some natural sources like volcanoes Slightly less dense than air Highly flammable

      mangled formatting I think

    49. First, missing data must be addressed.

      This wasn't addressed as any of your earlier pre-processing?

    50. <Figure size 1000x1800 with 0 Axes>

      Figure appears before reference and explanation in text.

      Also, figure isn't actually a figure and has no caption.

      Also, plot is WAY too big for writeup

    51. NO2 (nitrogen dioxide) is an important air pollutant. Here’s a concise overview of it: - Reddish-brown gas with a pungent odor - Part of a group of pollutants known as nitrogen oxides (NOx) SO2 (sulfur dioxide) is an important air pollutant. Here’s a concise overview of SO2 as a pollutant: Colorless gas with a sharp, pungent odor Highly soluble in water Ozone (O₃) as a pollutant is a complex topic, as it can be both beneficial and harmful depending on its location in the atmosphere. Here’s a concise overview of ozone as a ground-level pollutant: Colorless to pale blue gas with a distinctive smell Highly reactive molecule composed of three oxygen atoms

      Again, wasn't all of this covered in the background?

    52. <Figure size 1500x2000 with 0 Axes>

      Same issues as above figure:

      - Not explained in text
      - Not an actual figure with caption and reference
      - Way too large for the format

    53. We finally completed the exploratory data analysis.

      And you seemingly concluded nothing from it? Why should a reader care about this?

    54. Ultimately, we want to see which variables have the greatest impact on AQI

      The AQI is defined in terms of some of these correct? So those should probably not be included?

    55. we must

      You must? That is the only possible approach?

    56. Kansas City 241

      I think the count is largely unnecessary to show here, but what is up with Kansas City? And why is it not discussed when it is seemingly the only take-away I get from this table?

    57. 147039

      No. You do not include raw tabulated output like this in a publication. The columns aren't even labeled, so a reader has no idea what they are looking at. If it is worth showing a reader, then you render it properly, make sure everything is labeled, insert it as a table with a caption and reference and discuss it in the text.

    58. date

      Now I have even less idea of what this is showing me

    59. Since AQI is the dependent variable being measured, all rows without AQI data are dropped. Certain cities have very little data and will be dropped out of necessity.

      Ok. How little is very little data? Why is it necessary?

    60. The data collected has separate information for the city of New York City. NYC is divided into five boroughs, each within its own county. These values are grouped and averaged out to make NYC have the same amount of datapoints as every other city.

      Are other suburbs of major cities not counted separately? It seems like this could be a tricky thing to be fair about. And counties kinda already split things in an unambiguous way?

    61. date state county city population density \

      Pretty sure this output should absolutely be removed.

    62. The following tools are used: Train Test Split One Hot Encoder Transformer Pipeline Standard Scaler

      For what purposes?

    63. figure 24324

      I missed the other 24 thousand 300 somewhere....

      Also, the thing below is a table, and should be referenced and captioned as such.

    64. To perform a ML prediction algorithm, the predicted variable (AQI) must be discrete.

      That doesn't seem correct. You can do all manner of regression algorithms with machine learning. No need to make this into a classification problem unless your SPECIFIC algorithm requires it. In which case you should discuss why you are using that specific algorithm.

    65. The bins chosen are: 0-30 31-40 41-50 51-60 61-70 71-80 81-90 91-100 101-150 151+

      How were these chosen?
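
      Whatever the rationale, it would help to state the edge handling explicitly. A minimal sketch, assuming the bins are right-closed (so 30 falls in 0-30 and 30.5 falls in 31-40); the function name `aqi_bin` is hypothetical:

```python
from bisect import bisect_left

# Upper edges of the quoted bins; anything above 150 falls into "151+".
UPPER_EDGES = [30, 40, 50, 60, 70, 80, 90, 100, 150]
LABELS = ["0-30", "31-40", "41-50", "51-60", "61-70",
          "71-80", "81-90", "91-100", "101-150", "151+"]


def aqi_bin(aqi: float) -> str:
    """Return the label of the bin containing `aqi` (assumed right-closed)."""
    if aqi < 0:
        raise ValueError("AQI cannot be negative")
    # bisect_left finds the first upper edge that is >= aqi.
    return LABELS[bisect_left(UPPER_EDGES, aqi)]


for value in (30, 30.5, 86.5, 118.5, 151):
    print(value, aqi_bin(value))
```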

    66.

        | | date | state | county | city | population | density | aqi | temp | pressure | humidity | ... | pm100 | pm25 | num_busses | revenue | operating_expense | passenger_trips | operating_hours | passenger_miles | operating_miles | aqi_discrete |
        |---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
        | 0 | 2015-01-04 | Arizona | Maricopa | Phoenix | 4064275 | 1198.9 | 86.50 | 41.458333 | 972.860814 | 60.739583 | ... | 20.709822 | 20.505426 | 729.0 | 47024975.0 | 2.256208e+08 | 55497019.0 | 2228182.0 | 2.190928e+08 | 28371107.0 | 81-90 |
        | 1 | 2015-01-04 | California | Los Angeles | Los Angeles | 11922389 | 3184.7 | 118.50 | 50.281250 | 1010.666650 | 58.177084 | ... | 24.954331 | 21.400000 | 2259.0 | 273158938.0 | 1.056348e+09 | 367104774.0 | 7938548.0 | 1.448619e+09 | 84041668.0 | 101-150 |
        | 7 | 2015-01-04 | District Of Columbia | District of Columbia | Washington | 5116378 | 4235.7 | 45.25 | 41.149479 | 1015.388525 | 61.075000 | ... | 13.750000 | 11.088055 | 1394.0 | 149657899.0 | 6.453259e+08 | 139353079.0 | 4115200.0 | 4.293390e+08 | 39643319.0 | 41-50 |
        | 17 | 2015-01-04 | Massachusetts | Suffolk | Boston | 4328315 | 5319.0 | 42.75 | 33.458332 | 1018.536500 | 58.234375 | ... | 6.000000 | 7.244791 | 800.0 | 96572664.0 | 4.080501e+08 | 122496729.0 | 2231562.0 | 3.162285e+08 | 22115804.0 | 41-50 |
        | 18 | 2015-01-04 | Michigan | Wayne | Detroit | 3725908 | 1772.2 | 49.25 | 30.203125 | 994.260400 | 72.671876 | ... | 18.250000 | 9.394618 | 432.0 | 31303313.0 | 1.729056e+08 | 33078462.0 | 1225079.0 | 1.696881e+08 | 17705665.0 | 41-50 |

        5 rows × 25 columns

      What even am I looking at here?

    67. Definition 1

      The interactivity of the below is neat, but you need to talk about it!

    68. Definition 2

      I'm confused why these are just labeled as Definitions?

    69. Feature selection is done on the data.

      How? And what are the raw results?

    70. Carbon Monoxide Nitrogen Dioxide Ozone PM10 PM2.5

      Aren't all of these literally part of the definition of AQI?

    71. 'city_Los Angeles', 'city_Phoenix', 'city_Portland

      Only these three?

    72. K nearest neighbors Tree model Random Forest model Logistic Regression Naive Bayes

      This is essentially the equivalent of EDA in ML. A reader doesn't care about all of the attempts that didn't go as well unless something critical was shown in that case. Just move straight to the best and discuss what it implies.

    73. AirQuality Confusion Matrix 1

      What model is the above even for?? How is a reader supposed to interpret this?

    74. A randomized search is run with 100 iterations.

      Like actual just random values for these parameters each time?
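
      For comparison, this is roughly what a randomized search does in general: each iteration draws one value per hyperparameter at random, scores that combination, and keeps the best one seen. A minimal sketch with a hypothetical search space and a toy scoring function standing in for cross-validated accuracy; it is not the paper's code or scikit-learn's implementation:

```python
import random


def randomized_search(param_space, score, n_iter, seed=0):
    """Sample `n_iter` random combinations and return the best (params, score)."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        # Draw one candidate value per hyperparameter, independently.
        params = {name: rng.choice(values) for name, values in param_space.items()}
        current = score(params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score


# Hypothetical search space for a random forest.
SPACE = {"n_estimators": [50, 100, 200, 400], "max_depth": [None, 5, 10, 20]}


def toy_score(params):
    """Stand-in for a cross-validated score; highest (0) at 200 trees, depth 10."""
    penalty = 0 if params["max_depth"] == 10 else 1
    return -abs(params["n_estimators"] - 200) - penalty


best_params, best_score = randomized_search(SPACE, toy_score, n_iter=100)
print(best_params, best_score)
```

      If this is what was run, the paper should say which ranges or distributions the values were drawn from.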

    75. Pipeline

      Not explained in the text, as far as I can tell.

    76. the model.

      WHICH?

    77. {'memory': None, 'steps': [('aqi_transformer', ColumnTransformer(transformers=[('categories', OneHotEncoder(handle_unknown='infrequent_if_exist', min_frequency=5, sparse_output=False), ['city']), ('scaled_air_quality', StandardScaler(), ['temp', 'humidity', 'co', 'no2', 'o3', 'pm100', 'pm25'])], verbose_feature_names_out=False)), ('RF_model', RandomForestClassifier())], 'verbose': False, 'aqi_transformer': ColumnTransformer(transformers=[('categories', OneHotEncoder(handle_unknown='infrequent_if_exist', min_frequency=5, sparse_output=False), ['city']), ('scaled_air_quality', StandardScaler(), ['temp', 'humidity', 'co', 'no2', 'o3', 'pm100', 'pm25'])], verbose_feature_names_out=False), 'RF_model': RandomForestClassifier(), 'aqi_transformer__n_jobs': None, 'aqi_transformer__remainder': 'drop', 'aqi_transformer__sparse_threshold': 0.3, 'aqi_transformer__transformer_weights': None, 'aqi_transformer__transformers': [('categories', OneHotEncoder(handle_unknown='infrequent_if_exist', min_frequency=5, sparse_output=False), ['city']), ('scaled_air_quality', StandardScaler(), ['temp', 'humidity', 'co', 'no2', 'o3', 'pm100', 'pm25'])], 'aqi_transformer__verbose': False, 'aqi_transformer__verbose_feature_names_out': False, 'aqi_transformer__categories': OneHotEncoder(handle_unknown='infrequent_if_exist', min_frequency=5, sparse_output=False), 'aqi_transformer__scaled_air_quality': StandardScaler(), 'aqi_transformer__categories__categories': 'auto', 'aqi_transformer__categories__drop': None, 'aqi_transformer__categories__dtype': numpy.float64, 'aqi_transformer__categories__feature_name_combiner': 'concat', 'aqi_transformer__categories__handle_unknown': 'infrequent_if_exist', 'aqi_transformer__categories__max_categories': None, 'aqi_transformer__categories__min_frequency': 5, 'aqi_transformer__categories__sparse_output': False, 'aqi_transformer__scaled_air_quality__copy': True, 'aqi_transformer__scaled_air_quality__with_mean': True, 
'aqi_transformer__scaled_air_quality__with_std': True, 'RF_model__bootstrap': True, 'RF_model__ccp_alpha': 0.0, 'RF_model__class_weight': None, 'RF_model__criterion': 'gini', 'RF_model__max_depth': None, 'RF_model__max_features': 'sqrt', 'RF_model__max_leaf_nodes': None, 'RF_model__max_samples': None, 'RF_model__min_impurity_decrease': 0.0, 'RF_model__min_samples_leaf': 1, 'RF_model__min_samples_split': 2, 'RF_model__min_weight_fraction_leaf': 0.0, 'RF_model__monotonic_cst': None, 'RF_model__n_estimators': 100, 'RF_model__n_jobs': None, 'RF_model__oob_score': False, 'RF_model__random_state': None, 'RF_model__verbose': 0, 'RF_model__warm_start': False}

      Definitely don't show this!

    78. Definition 3   0.6312159709618875

      What does this mean?

    79. decomposed

      composed

    80. Figure

      It is a table, not a figure.

    81. 0.5265748745864021

      Comment on this if you are going to show it.

    82. AirQuality Confusion Matrix 2

      Captions need to be more detailed and discuss what a reader should take away from an image.

    83. By the help of statsmodel package we can break the time series into its seasonal pattern and trends. This will helps us to understand the data clearly and will help us to make more sense of the data.

      Ok, so how did you go about doing that?

    84. Decomposing the Time Series With Additive Method

      Is this supposed to be much more of a subheading?

    85. There are

      The above image is an unlabeled figure with no caption that is discussed nowhere in the text (at least so far). All of those things are problematic.

    86. three

      You literally JUST told me there were 2...

    87. If you have an increasing trend, you still see roughly the same size peaks and troughs throughout the time series. This is often seen in indexed time series where the absolute value is growing but changes stay relative.

      But is that what you are seeing here? It is unclear whether you are talking in the abstract or about your specific data.

      Also, why do this? What are your takeaways? This feels like it exists in isolation.

    88. grid search method

      This isn't code, so it shouldn't be in monospace. Underline or italicize it if you want to set it apart, or put it in quotes.

    89. How Fit the SARIMAX model

      There is no "how to" here at all.

    90. attempts to compute the optimum values of hyperparameters.

      Say how it works!
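
      A sentence or two on the mechanics would do. In general, a grid search fits one model per combination of candidate values and keeps the combination with the best score (for SARIMAX, typically the lowest AIC). A minimal sketch, where `fit_and_aic` is a hypothetical callback, the seasonal period of 12 is an assumption, and `toy_aic` is a stand-in so the example runs without data:

```python
from itertools import product


def grid_search(orders, seasonal_orders, fit_and_aic):
    """Try every (order, seasonal_order) pair and return the lowest-AIC result."""
    results = []
    for order, seasonal_order in product(orders, seasonal_orders):
        try:
            aic = fit_and_aic(order, seasonal_order)
        except ValueError:
            # Some combinations may fail to fit; skip them.
            continue
        results.append((aic, order, seasonal_order))
    if not results:
        raise RuntimeError("no combination could be fitted")
    return min(results, key=lambda result: result[0])


# Candidate (p, d, q) values of 0 or 1, and seasonal (P, D, Q, s) with s = 12 (assumed).
pdq = list(product(range(2), repeat=3))
seasonal_pdq = [(P, D, Q, 12) for P, D, Q in pdq]


def toy_aic(order, seasonal_order):
    """Stand-in for fitting a model and reading its AIC; lowest at (1, 1, 0) x (0, 0, 0, 12)."""
    p, d, q = order
    P, D, Q, _ = seasonal_order
    return 1000 + 10 * abs(p - 1) + 2 * abs(d - 1) + 5 * q + 3 * (P + D + Q)


best_aic, best_order, best_seasonal = grid_search(pdq, seasonal_pdq, toy_aic)
print(best_aic, best_order, best_seasonal)
```

      With real data, `fit_and_aic` could wrap something like fitting a statsmodels `SARIMAX` model with the given `order` and `seasonal_order` and returning the fitted result's `aic` attribute.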