63 Matching Annotations
  1. Jun 2019
  2. May 2019
    1. Explore the distribution

      I wonder if this includes identifying outliers that may be data errors. For example, x = 0 for 8 diamonds, y = 0 for 7, and z = 0 for 20. In addition, z = 31.8 for one diamond, and y = 31.8 and y = 58.9 for one each. I believe all of these are data errors.

    2. mutate(id = row_number())

      I believe the plots can be generated without id:

      diamonds %>%
        select(x, y, z) %>%
        gather(variable, value) %>%
        ggplot(aes(x = value)) +
        geom_density() +
        geom_rug() +
        facet_grid(vars(variable))
      
    1. For each plane, count the number of flights before the first delay of greater than 1 hour.

      I think this is a lovely use of logicals and cumsum. But I believe that it omits planes whose first flight is delayed by more than an hour. There are 234 of these:

      (zero <- flights %>%
         filter(!is.na(dep_delay)) %>%
         arrange(tailnum, month, day) %>%
         group_by(tailnum) %>%
         mutate(delay_gt1hr = dep_delay > 60) %>%
         filter(row_number() == 1, delay_gt1hr) %>%
         select(tailnum) %>%
         mutate(n = 0)
      )
      

      Then bind_rows could concatenate these to the tibble produced by the posted solution.

      This is a "two-part" solution. Is there a "one-part" solution?
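
      One possible "one-part" approach, sketched here in base R on made-up data (the `toy` data frame and the helper `flights_before` are hypothetical, not part of the posted solution): count the rows where the running total of long delays is still zero. A plane whose first flight is delayed then gets a count of 0 instead of being dropped. The same expression inside a grouped summarise() should behave likewise, though I have not checked it against the posted solution.

```r
# Toy data (hypothetical): one row per flight, already in time order within each plane.
toy <- data.frame(
  tailnum   = c("A", "A", "A", "B", "B", "C", "C"),
  dep_delay = c(5, 70, 10, 90, 0, 3, 12)
)

# cumsum(d > 60) stays at 0 until the first delay above 60 minutes,
# so counting the zeros counts the flights before that delay.
flights_before <- function(d) sum(cumsum(d > 60) < 1)

# One count per plane, returned as a named integer vector.
n_before <- vapply(split(toy$dep_delay, toy$tailnum), flights_before, integer(1))
n_before
```

      Note that a plane that is never delayed by more than an hour (C here) is counted with all of its flights, which may or may not be what the exercise intends.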

    2. We will calculate this ranking in two parts

      Or, the two parts could be combined:

      rank <- flights %>%
         group_by(dest) %>%
         mutate(n_carriers = n_distinct(carrier)) %>%
         filter(n_carriers > 1) %>%
         group_by(carrier) %>%
         summarize(n_dest = n_distinct(dest)) %>%
         arrange(desc(n_dest))
      
  3. Apr 2019
    1. select(tailnum, on_time, arr_time, arr_delay) %>%

      If you wish, this could be eliminated since the only columns that survive are those in the summarise.

  4. Mar 2019
    1. Unusually fast flights are those flights with the smallest standardized values.

      I believe the problem asks for "fast flights per destination". If so, this code produces that list, including ties. Sadly, it uses slice.

      fast <- flights %>%
        filter(!is.na(air_time)) %>%
        group_by(dest) %>%
        mutate(z = (air_time - mean(air_time)) / sd(air_time)) %>%
        slice(which(z == min(z))) %>%
        select(origin, dest, month, day, carrier, flight, air_time, z) %>%
        arrange(z)
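
      If avoiding slice is the goal, a grouped filter appears to do the same job, since filter evaluates its condition within each group. A minimal sketch on made-up data (the `toy` tibble is hypothetical; only dplyr is assumed to be installed):

```r
library(dplyr)

# Toy data (hypothetical): two destinations, with a tie for fastest at "X".
toy <- tibble(
  dest     = c("X", "X", "X", "Y", "Y", "Y"),
  air_time = c(100, 100, 130, 50, 60, 70)
)

fast_toy <- toy %>%
  group_by(dest) %>%
  mutate(z = (air_time - mean(air_time)) / sd(air_time)) %>%
  filter(z == min(z)) %>%   # evaluated within each dest, so ties are kept
  ungroup() %>%
  arrange(z)

fast_toy
```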
      
    2. ggplot(standardized_flights, aes(x = air_time_standard)) + geom_density()

      This is a nice plot, but the image below is not it. It appears to have been cut-and-pasted from a previous plot.

    3. standardized_flights

      Somewhat more succinctly:

      standardized_flights <- flights %>%
        filter(!is.na(air_time)) %>%
        group_by(dest,origin) %>%
        mutate(air_time_standard = (air_time - mean(air_time)) / sd(air_time)) %>%
        select(origin, dest, month, day, carrier, flight, air_time, air_time_standard)
      
  5. Feb 2019
    1. min_rank() and dense_rank()

      English is inherently ambiguous; only math (vectors, subscripts) and algorithms are not. But if you switched to those, you'd lose your reader.

      Here's a slight rephrasing where rank refers to a component of, say, min_rank(x) and value a component of x.

      For each set of tied values the min_rank() function assigns a rank equal to the number of values less than that tied value plus one. In contrast, the dense_rank() function assigns a rank equal to the number of distinct values less than that tied value plus one.
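
      The rephrasing can be checked on a small vector. This is only a sketch in base R: the two counting functions implement the wording above directly, and rank(..., ties.method = "min") and match() are used as stand-ins for min_rank() and dense_rank(), assuming x has no NAs. It is not dplyr's own implementation.

```r
x <- c(10, 20, 20, 30)

# Counting definitions taken from the rephrasing.
min_rank_by_count   <- vapply(x, function(v) sum(x < v) + 1L, integer(1))
dense_rank_by_count <- vapply(x, function(v) length(unique(x[x < v])) + 1L, integer(1))

# Base-R stand-ins for comparison (assumption: no NAs in x).
min_rank_base   <- rank(x, ties.method = "min")
dense_rank_base <- match(x, sort(unique(x)))

min_rank_by_count    # 1 2 2 4
dense_rank_by_count  # 1 2 2 3
```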

    3. The row_number() function is confusingly named. It can create ranks for any column.

      I might omit this, since the explanation of row_number() that you give below is accurate. Also, tibble prints its own row numbers, and row_number(x) appears to be "the row number after x is sorted", which I don't find confusing.

    4. na.rm = TRUE

      This is probably intended to be an argument of mean. It's not necessary, however, since neither dep_delay nor dep_delay_lag has any NAs. Also, could you provide an interpretation of delay_diff? For example, what does it mean that JFK's is negative?

    5. flight

      Another definition of flight could be the quad (carrier,flight,origin,dest) which, say, might fly daily. Here's a solution with this definition.

      dest_delay <- flights %>%
        group_by(dest) %>%
        filter(arr_delay > 0) %>%
        summarize(dest_ad = sum(arr_delay, na.rm = TRUE), dest_ct = n())
      flight_delay <- flights %>%
        group_by(carrier, flight, origin, dest) %>%
        filter(arr_delay > 0) %>%
        summarize(flight_ad = sum(arr_delay, na.rm = TRUE), flight_ct = n())
      prop <- as_tibble(merge(dest_delay, flight_delay, by = "dest")) %>%
        mutate(prop = flight_ad / dest_ad) %>%
        select(carrier, flight, origin, dest, dest_ct, flight_ct, prop)
      
    6. Exercise 5.7.4

      In general selecting just a few columns (often the mutate variables) makes the result clearer since "irrelevant" cols are eliminated.

    7. However, there are many planes that have never flown an on-time flight.

      Since this minimum rank group has an on_time of 0.0, I believe that this tibble is all planes which have never flown an on-time flight.

    8. arr_delay > 0

      (arr_delay > 0)

      Of course, operator precedence "does the right thing", but because R will coerce types to give an expression like lgl & dbl a value, the redundant parentheses keep the not-so-careful reader (me, for example) from "thinking the wrong thing."

    9. cancelled

      In this dataset the presence of arr_delay implies the presence of arr_time so the boolean cancelled could be eliminated. Of course, the reader hasn't verified this relation so keeping it in makes sense. The definition of cancelled as "having an arrival delay" and "not having an arrival time" seems a bit odd.

    10. They operate within each group rather than over the entire data frame

      Do all functions on grouped tibbles operate only per group and not the entire tibble? E.g.

      flights %>% 
        group_by(month,day) %>%
        arrange(desc(dep_delay))
      

      Will the arrange function order both by group and by the entire tibble?
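
      A quick check on made-up data suggests the answer is no for arrange: by default it seems to ignore the grouping and sort the whole tibble, and only sorts within groups when .by_group = TRUE is passed. A minimal sketch (the `toy` tibble is hypothetical; only dplyr is assumed):

```r
library(dplyr)

# Toy data (hypothetical): two groups with interleaved values.
toy <- tibble(
  g = c("a", "b", "a", "b"),
  v = c(1, 4, 3, 2)
)

# Default: grouping is ignored, rows are sorted over the entire tibble.
whole <- toy %>%
  group_by(g) %>%
  arrange(desc(v))

# With .by_group = TRUE: sorted by the grouping variable first, then by desc(v).
by_group <- toy %>%
  group_by(g) %>%
  arrange(desc(v), .by_group = TRUE)

whole$v     # 4 3 2 1
by_group$v  # 3 1 4 2
```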

    11. not_cancelled

      Hadley defines this as

      not_cancelled <- flights %>% 
        filter(!is.na(dep_delay), !is.na(arr_delay))
      

      But in

      flights %>% 
        filter(is.na(dep_time) | is.na(arr_time) | is.na(dep_delay) | is.na(arr_delay)) %>% 
        select(sched_arr_time, arr_time, arr_delay, sched_dep_time, dep_time, dep_delay)
      

      the first 6 flights have an NA only in arr_delay. Isn't it likely that these flights were not cancelled and simply have a missing data entry? Why isn't

      not_cancelled <- flights %>%
        filter(!is.na(dep_time), !is.na(arr_time))
      

      a "better" definition?

    12. Exercise 5.5.4

      Another solution. Now that we know we can "trust" dep_delay we could simply

      md <- flights %>%
        select(carrier, flight, origin, dest, sched_dep_time, dep_time, dep_delay) %>%
        arrange(min_rank(desc(dep_delay)))
      

      There are no ties in the top 10.

    13. Except for flights daylight savings started (March 10) or ended (November 3)

      I understand the point of the paragraph but this sentence is a bit unclear.