63 Matching Annotations
  1. Jun 2019
    1. >

      Omit.

    2. Exercise 7.4.1

      Many thanks for this lovely example. I hadn't understood geom_bar's NA behavior with factor variables until your explanation.

  2. May 2019
    1. Presumably there is a premium for a 1 carat diamond

      Is this "buyer's psychology?" The opposite of $4.99 being cheaper than $5.00?

    2. number of diamonds in each carat range

      Nice! Your intuition seems correct. But how do you account for the large number of 1.01 carat diamonds?

    3. print(n = 30)

      print(n = Inf) will print all rows of a tibble.
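
      For example (a minimal sketch; this particular count() pipeline is just an illustration):

      diamonds %>%
        count(cut, clarity) %>%
        print(n = Inf)  # n = Inf prints every row instead of the default 10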

    4. seem

      see

    5. There are no diamonds with a price of $1,500

      More precisely, there's a $90 gap: no diamonds have a price in the closed interval [1455, 1545].
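
      A quick way to check the claimed gap (assuming the tidyverse is loaded, as in the other snippets here):

      diamonds %>%
        filter(between(price, 1455, 1545)) %>%
        nrow()  # 0 if the gap is exactly as described above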

    6. Explore the distribution

      I wonder if this includes identifying outliers for possible data errors. E.g. x = 0 for 8 diamonds and z = 0 for 20. Also z = 31.8 for one. y = 0 for 7 diamonds and y = 31.8 and y = 58.9 for one each. I believe all are data errors.
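
      A sketch of the kind of filter that surfaces them (the thresholds come from the values listed above):

      diamonds %>%
        filter(x == 0 | y == 0 | z == 0 | y > 20 | z > 20) %>%
        select(carat, price, x, y, z)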

    7. mutate(id = row_number())

      I believe the plots can be generated without id:

      diamonds %>%
        select(x, y, z) %>%
        gather(variable, value) %>%
        ggplot(aes(x = value)) +
        geom_density() +
        geom_rug() +
        facet_grid(vars(variable))
      
    8. For each plane, count the number of flights before the first delay of greater than 1 hour.

      I think this is a lovely use of logicals and cumsum. But I believe that it omits planes whose first flight is delayed by more than an hour. There are 234 of these:

      (zero <- flights %>%
         filter(!is.na(dep_delay)) %>%
         arrange(tailnum, month, day) %>%
         group_by(tailnum) %>%
         mutate(delay_gt1hr = dep_delay > 60) %>%
         # keep each plane's first flight (in month, day order), and only if it was delayed > 1 hour
         filter(row_number() == 1, delay_gt1hr) %>%
         select(tailnum) %>%
         mutate(n = 0)   # these planes have 0 flights before their first long delay
      )
      

      Then bind_rows could concatenate these to the tibble produced by the posted solution.
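
      For concreteness, a minimal sketch of that concatenation, where counts stands for the tibble produced by the posted solution (one row per tailnum with its count n):

      bind_rows(counts, zero) %>%
        arrange(tailnum)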

      This is a "two-part" solution. Is there a "one-part" solution?

    9. year

      Since the year column contains only 2013, you could arrange chronologically without this argument.
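
      For example (a sketch; the exact sort keys in the posted solution may differ):

      flights %>%
        arrange(month, day)  # every flight is from 2013, so the year column adds nothing to the sort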

    10. We will calculate this ranking in two parts

      Or, the two parts could be combined:

      rank <- flights %>%
         # keep only flights to destinations served by more than one carrier
         group_by(dest) %>%
         mutate(n_carriers = n_distinct(carrier)) %>%
         filter(n_carriers > 1) %>%
         # then count the distinct destinations each carrier serves
         group_by(carrier) %>%
         summarize(n_dest = n_distinct(dest)) %>%
         arrange(desc(n_dest))
      
    11. width = 100

      Here width = Inf will print all columns.

    12. print(width = 120)

      I believe that print(width = Inf) will reliably print all selected columns.
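
      For instance (with nycflights13 loaded, as elsewhere on this page):

      flights %>%
        print(width = Inf)  # width = Inf overrides the console width, so no columns are hidden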

    13. below than the

      below the

    14. observations each

      observations of each

  3. Apr 2019
    1. I

      Omit.

    2. select(tailnum, on_time, arr_time, arr_delay) %>%

      If you wish, this could be eliminated since the only columns that survive are those in the summarise.

  4. Mar 2019
    1. were

      Omit

    2. ungroup() %>%

      I believe you don't need ungroup().

    3. an standard deviation. That

      "and standard deviation. The following"

    4. Unusually fast flights are those flights with the smallest standardized values.

      I believe the problem asks for "fast flights per destination". If so, this code produces that list, including ties. Sadly, it uses slice.

      fast <- flights %>%
        filter(!is.na(air_time)) %>%
        group_by(dest) %>%
        # standardize air_time within each destination
        mutate(z = (air_time - mean(air_time)) / sd(air_time)) %>%
        # keep the fastest flight(s) per destination, ties included
        slice(which(z == min(z))) %>%
        select(origin, dest, month, day, carrier, flight, air_time, z) %>%
        arrange(z)
      
    5. ggplot(standardized_flights, aes(x = air_time_standard)) + geom_density()

      This is a nice plot, but the image below is not it. It appears to have been cut-and-pasted from a previous plot.

    6. standardized_flights

      Somewhat more succinctly:

      standardized_flights <- flights %>%
        filter(!is.na(air_time)) %>%
        group_by(dest, origin) %>%
        mutate(air_time_standard = (air_time - mean(air_time)) / sd(air_time)) %>%
        select(origin, dest, month, day, carrier, flight, air_time, air_time_standard)
      
  5. Feb 2019
    1. in 10 flights

      Omit, since you've already filtered.

    2. min_rank() and dense_rank()

      English is inherently ambiguous while only math (vectors, subscripts) and algorithms are not. But then you'd lose your reader.

      Here's a slight rephrasing where rank refers to a component of, say, min_rank(x) and value a component of x.

      For each set of tied values the min_rank() function assigns a rank equal to the number of values less than that tied value plus one. In contrast, the dense_rank() function assigns a rank equal to the number of distinct values less than that tied value plus one.

    3. min_rank() and dense_rank()

      English is inherently ambiguous while only math (vectors, subscripts) and algorithms are not. But then you'd lose your reader.

      Here's a slight rephrasing where rank refers to a component of, say, min_rank(x) and value a component of x.

      For each set of tied values the min_rank() function assigns a rank equal to the number of values less than that tied value plus one. The dense_rank() function assigns a rank equal to the number of distinct values less than that tied value plus one.
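
      A tiny illustration of both rules, using a made-up vector x:

      x <- c(10, 20, 20, 30)
      min_rank(x)    # 1 2 2 4  (three values are less than 30, so 30 gets rank 3 + 1 = 4)
      dense_rank(x)  # 1 2 2 3  (two distinct values are less than 30, so it gets rank 2 + 1 = 3)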

    4. ranking handles

      the ranking functions handle

    5. function

      functions

    6. roughly

      Is it "rough" or "exact"?

    7. an element

      a vector

    8. The row_number() function is confusingly named. It can create ranks for any column.

      I might omit this since you give an accurate explanation of row_number() below. Also, tibbles add their own row numbers, and row_number(x) appears to be "the row number after x is sorted," which I don't find confusing.
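
      A quick check of that reading:

      x <- c(30, 10, 20)
      row_number(x)  # 3 1 2 : the row each value would occupy after sort(x)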

    9. missing values

      ties

    10. na.rm = TRUE

      This is probably intended to be an argument of mean(). It's not necessary, however, since neither dep_delay nor dep_delay_lag has any NAs. Also, could you provide an interpretation of delay_diff? For example, what does it mean that JFK's is negative?

    11. year,

      Could be eliminated since the dataset is for one specific year.

    12. flight

      Another definition of flight could be the quad (carrier,flight,origin,dest) which, say, might fly daily. Here's a solution with this definition.

      # total positive arrival delay (and count of delayed arrivals) per destination
      dest_delay <- flights %>%
        group_by(dest) %>%
        filter(arr_delay > 0) %>%
        summarize(dest_ad = sum(arr_delay, na.rm = TRUE), dest_ct = n())
      # the same per "flight", defined as the quad (carrier, flight, origin, dest)
      flight_delay <- flights %>%
        group_by(carrier, flight, origin, dest) %>%
        filter(arr_delay > 0) %>%
        summarize(flight_ad = sum(arr_delay, na.rm = TRUE), flight_ct = n())
      # each flight's share of its destination's total delay
      prop <- as_tibble(merge(dest_delay, flight_delay, by = "dest")) %>%
        mutate(prop = flight_ad / dest_ad) %>%
        select(carrier, flight, origin, dest, dest_ct, flight_ct, prop)
      
    13. Exercise 5.7.4

      In general, selecting just a few columns (often the mutate variables) makes the result clearer, since "irrelevant" columns are eliminated.

    14. !is.na(arr_delay),

      This could be omitted but keeping it in redundantly may add clarity.

    15. only

      Omit

    16. However, there are many planes that have never flown an on-time flight.

      Since this minimum rank group has an on_time of 0.0, I believe that this tibble is all planes which have never flown an on-time flight.

    17. <=

      ==

      I believe min_rank cannot be 0.
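
      A quick illustration:

      min_rank(c(5, 1, 1))  # 3 1 1 : the smallest rank min_rank() assigns is 1, never 0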

    18. arr_delay > 0

      (arr_delay > 0)

      Of course precedence "does the right thing" but since R will use coercion to give lgl & dbl a value, putting in the redundant parentheses prevents the not-careful reader (me, for example) from "thinking the wrong thing."

    19. cancelled

      In this dataset the presence of arr_delay implies the presence of arr_time so the boolean cancelled could be eliminated. Of course, the reader hasn't verified this relation so keeping it in makes sense. The definition of cancelled as "having an arrival delay" and "not having an arrival time" seems a bit odd.

    20. They operate within each group rather than over the entire data frame

      Do all functions on grouped tibbles operate only per group and not the entire tibble? E.g.

      flights %>%
        group_by(month, day) %>%
        arrange(desc(dep_delay))
      

      Will the arrange function order both by group and by the entire tibble?

    21. There are more sophisticated ways to do this analysis

      Perhaps, but this is a lovely example of the power of R.

    22. , which calculates

      ". atan(y,x) returns"

      My worry is that the reader would assume that atan(x,y) returns that angle.

    23. not_cancelled

      Hadley defines this as

      not_cancelled <- flights %>% 
        filter(!is.na(dep_delay), !is.na(arr_delay))
      

      But in

      flights %>%
        filter(is.na(dep_time) | is.na(arr_time) | is.na(dep_delay) | is.na(arr_delay)) %>%
        select(sched_arr_time, arr_time, arr_delay, sched_dep_time, dep_time, dep_delay)
      

      the first 6 flights have an NA only in arr_delay. Aren't these likely not to have been cancelled and simply have a missing data entry? Why isn't

      not_cancelled <- flights %>%
        filter(!is.na(dep_time), !is.na(arr_time))
      

      a "better" definition?

    24. and the passenger arrived at the same time

      Omit.

    25. Exercise 5.5.4

      Another solution. Now that we know we can "trust" dep_delay, we could simply:

      md <- flights %>%
        select(carrier, flight, origin, dest, sched_dep_time, dep_time, dep_delay) %>%
        arrange(min_rank(desc(dep_delay)))
      

      There are no ties in the top 10.

    26. Daylight

      If you wish, you could change this to "Eastern Daylight" to agree with its previous reference in that sentence.

    27. Except for flights daylight savings started (March 10) or ended (November 3)

      I understand the point of the paragraph but this sentence is a bit unclear.

    28. (to handle the .

      It's unclear to me what is intended here.

    29. Exercise 5.5.2

      A nice analysis, showing the "messiness" of actual datasets.

    30. select(flights, one_of(vars))

      Note that

      select(flights, vars)
      

      produces the same result.

    31. /

      *

    32. Saturday

      December

    33. month == 7, month == 8, month == 9

      Replace , by |.
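
      That is (commas in filter() combine conditions with AND, so the three tests must be joined with | instead):

      filter(flights, month == 7 | month == 8 | month == 9)
      # or, equivalently:
      filter(flights, month %in% c(7, 8, 9))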

    34. between()

      Interestingly, between() microbenchmarks at about 143 microseconds on my box, while the equivalent >= and <= comparison comes in at about 2.3!
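
      The comparison was presumably run with something like this sketch (the vector and cutoffs here are made up; timings vary by machine):

      library(microbenchmark)
      x <- runif(1e4)
      microbenchmark(
        between   = dplyr::between(x, 0.25, 0.75),
        operators = x >= 0.25 & x <= 0.75
      )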

    35. What other variables are missing?

      Another interpretation:

      colnames(flights)[colSums(is.na(flights)) > 0]
      
    36. dep_time %% 2400 <= 600

      Elegant, at a 2x microbenchmark cost.

    37. is preferred

      Do numerical operators execute faster than, say, %in%?

    38. were

      Omit