4 Matching Annotations
  1. Jul 2025
    1. date %in% (holidays_2013$date - 1L) ~ "day before holiday", date %in% (holidays_2013$date + 1L) ~ "day after holiday",

      The exercise told you to focus on holidays, not on the days before or after them. You're always overthinking!
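
      A minimal sketch of the two labelling approaches, for comparison. The holiday dates, the "regular" label, and the vector `dates` are made up for illustration; the real `holidays_2013` table may differ. Base R `ifelse()` is used here instead of the quoted `case_when()` call so the example needs no packages.

```r
# Hypothetical holiday table; the real holidays_2013 may contain other dates
holidays_2013 <- data.frame(
  holiday = c("Independence Day", "Christmas Day"),
  date    = as.Date(c("2013-07-04", "2013-12-25"))
)

dates <- as.Date(c("2013-07-03", "2013-07-04", "2013-07-05", "2013-12-25"))

# Holiday-only labelling, as the comment suggests
holiday_only <- ifelse(dates %in% holidays_2013$date, "holiday", "regular")

# Labelling that also marks adjacent days, mirroring the quoted conditions
with_neighbors <- ifelse(
  dates %in% holidays_2013$date, "holiday",
  ifelse(dates %in% (holidays_2013$date - 1L), "day before holiday",
  ifelse(dates %in% (holidays_2013$date + 1L), "day after holiday", "regular"))
)

data.frame(dates, holiday_only, with_neighbors)
```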

  2. Jun 2025
    1. I used a list, not a character vector, since the class of an object can have multiple values. For example, the class of the time_hour column is POSIXct, POSIXt.

      You think of everything, man!
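
      A small sketch of why a list fits here. The data frame `df` is a made-up stand-in for the flights data, with only two columns.

```r
# Toy data frame with one date-time column and one numeric column
df <- data.frame(
  time_hour = as.POSIXct("2013-01-01 05:00:00", tz = "UTC"),
  dep_delay = 2
)

# lapply() always returns a list, so each column can carry
# a class vector of any length
col_classes <- lapply(df, class)

col_classes$time_hour  # two values: "POSIXct" "POSIXt"
col_classes$dep_delay  # one value:  "numeric"
```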

  3. May 2025
    1. Whether the mean is the best summary depends on what you are using it for :-), i.e. your objective.
      • Whether the mean is the best summary metric depends on the distribution of the variable.
      • If the variable is roughly normally distributed, the mean is a good summary metric; otherwise, the median is usually better.
      • We can use the function shapiro.test() to test a variable for normality.
      • Since shapiro.test() accepts at most 5000 non-missing values, we can take the first 5000 elements of gss_cat$tvhours to perform this test.

      gss_cat$tvhours %>% head(5000) %>% shapiro.test()

      Result: p-value < 0.01, so we reject the hypothesis that tvhours is normally distributed. The mean is therefore not a good summary metric for this variable, and the median is the better choice.
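
      A hedged variation on the test above: head(5000) takes the first 5000 rows, which need not be representative, so one option is to draw a random sample instead. The helper `normality_p()` and the simulated vector `skewed` are hypothetical names, and the data are simulated rather than taken from gss_cat.

```r
# Hypothetical helper: Shapiro-Wilk p-value on a random sample of at most max_n values
normality_p <- function(x, max_n = 5000, seed = 1) {
  x <- x[!is.na(x)]            # shapiro.test() only counts non-missing values
  if (length(x) > max_n) {
    set.seed(seed)             # make the sample reproducible
    x <- sample(x, max_n)
  }
  shapiro.test(x)$p.value
}

# Simulated right-skewed data standing in for gss_cat$tvhours (not real survey values)
set.seed(42)
skewed <- rexp(20000, rate = 1 / 3)

p <- normality_p(skewed)
p < 0.01                       # a small p-value rejects normality

# For right-skewed data the mean is pulled above the median
c(mean = mean(skewed), median = median(skewed))
```

      With the real data, the call would be normality_p(gss_cat$tvhours), assuming the forcats package is loaded.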

  4. Mar 2025
    1. The question does not define a way to measure on-time record, so I will consider two metrics: proportion of flights not delayed or cancelled, and mean arrival delay.

      This is just making a simple problem complicated.

      The worst on-time plane is simply the plane with the largest average arrival delay.
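
      A minimal sketch of the mean-arrival-delay approach. The data frame `flights_toy` is a made-up stand-in for the flights table, with invented values, and base R is used so the example runs without packages.

```r
# Toy stand-in for the flights table (values are invented for illustration)
flights_toy <- data.frame(
  tailnum   = c("N1", "N1", "N2", "N2", "N3", NA),
  arr_delay = c(10, 30, 5, NA, 120, 50)
)

# Mean arrival delay per plane; flights with a missing tail number are
# dropped by tapply(), and missing delays are ignored via na.rm = TRUE
mean_delay <- tapply(flights_toy$arr_delay, flights_toy$tailnum, mean, na.rm = TRUE)

# Plane with the largest mean arrival delay
worst_plane <- names(which.max(mean_delay))
worst_plane
```

      In this toy data the "worst" plane has only one recorded flight, so in practice it may be worth checking how many flights each plane has before ranking.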