430 Matching Annotations
  1. Jul 2020
    1. In a foreign key relationship

      Some of the exercises for this subsection are a little premature. For some readers (like me), it is difficult to understand exactly what's going on without first being introduced to the basics of keys.

    1. Second, I should check whether all values for a (country, year) are missing or whether it is possible for only some columns to be missing.

      This part doesn't seem to do what it says it does, although I might just be interpreting everything wrong. But it seems that you are trying to check whether every entry for a particular "country, year" combination would be missing (eg no data collected for that region that year). However, the section is described as checking if multiple columns have missing values (as compared to only one column for that observation having a missing value). At least, perhaps consider clarifying this section.

    2. stocks <- tibble( year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016), qtr = c( 1, 2, 3, 4, 2, 3, 4), return = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66) ) stocks %>% pivot_wider(names_from = year, values_from = return, values_fill = 0) #> # A tibble: 4 x 3 #> qtr `2015` `2016` #> <dbl> <dbl> <dbl> #> 1 1 1.88 0 #> 2 2 0.59 0.92 #> 3 3 0.35 0.17 #> 4 4 NA 2.66

      duplicated example I think

  2. Jun 2020
    1. The fastest flight is the one with the average ground speed,

      Should read:

      The fastest flight is the one with the fastest average ground speed,

      The sentence as currently worded describes the flight with the average ground speed, not the fastest.

    1. numeric_cols <- vector("logical", length(df)) # test whether each column is numeric for (i in seq_along(df)) { numeric_cols[[i]] <- is.numeric(df[[i]]) } # find the indexes of the numeric columns idxs <- which(numeric_cols)

      Is there a reason not to use

      idxs <- which(sapply(df, is.numeric)) ?
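A quick base R check suggests the two approaches pick out the same columns. The data frame `df` below is made up for illustration; `vapply()` is shown as a type-stable alternative to `sapply()`.

```r
# Hypothetical data frame with two numeric columns and one character column
df <- data.frame(a = 1:3, b = letters[1:3], c = c(1.5, 2, 3))

# Loop version, as in the quoted solution
numeric_cols <- vector("logical", length(df))
for (i in seq_along(df)) {
  numeric_cols[[i]] <- is.numeric(df[[i]])
}
idxs_loop <- which(numeric_cols)

# One-line versions; these keep the column names on the result
idxs_sapply <- which(sapply(df, is.numeric))
idxs_vapply <- which(vapply(df, is.numeric, logical(1)))

print(idxs_loop)
print(idxs_sapply)
```

The only visible difference is that the one-liners return a named vector; the loop may have been kept simply because the chapter is about loops.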

  3. May 2020
    1. Exercise 5.3.3

      Your solution works, but we have not been taught the mutate command yet. Given the commands we have already been taught, you can get the same results from: arrange(flights, desc(distance / air_time))

    2. because the value of the missing TRUE or FALSE, x

      I find this hard to follow. Why not phrase it as the next example? NA | TRUE is TRUE because anything or TRUE is always TRUE
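The rule can be checked directly in base R: the result is known only when the missing value could not change it.

```r
# NA stands for an unknown TRUE or FALSE
or_true   <- NA | TRUE    # TRUE either way, so the result is TRUE
or_false  <- NA | FALSE   # depends on the unknown value, so NA
and_false <- NA & FALSE   # FALSE either way, so the result is FALSE
and_true  <- NA & TRUE    # depends on the unknown value, so NA

print(c(or_true, or_false, and_false, and_true))
```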

    3. mean(on_time)

      If we were to calculate the proportion of flights not delayed or cancelled, we would need to adjust for NA values of the cancelled flights:

      summarise(n = n(), on_time = sum(on_time, na.rm=TRUE) / n)

    1. The words that have seven letters or more are

      if we need to view/extract later the words that have at least 7 characters we can:

      str_view( stringr::words, "........*", match = T)

      adding "*" after the eighth "." states that the last "." can be repeated zero or more times.

      or

      str_view(stringr::words, ".{7,}", match = T)

      in both cases we match the words that have at least 7 characters, not only up to the seventh letter
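A small base R sketch of the difference, using a made-up word list rather than `stringr::words`: `.{7}` stops after seven characters, while `.{7,}` extends the match to the end of the word.

```r
# Hypothetical sample words
words <- c("a", "able", "ability", "absolute")

# Matched text for each pattern (non-matching words are dropped)
first_seven   <- regmatches(words, regexpr(".{7}", words))
seven_or_more <- regmatches(words, regexpr(".{7,}", words))

print(first_seven)
print(seven_or_more)

# Which words have at least seven characters
long_words <- words[nchar(words) >= 7]
```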

    2. Words that contain only consonants

      An alternative: str_view(words, "[aeiou]", match = FALSE)

      Or, using str_subset: str_subset(words, "[aeiou]", negate = TRUE)

    1. foreign

      If one adds "%>% distinct(dest)" to the expression, one sees that there are four airports that are not in the FAA list: BQN, SJU, STT, and PSE. Three of these are in Puerto Rico, one (STT) is in US Virgin Islands.

    1. ggplot(data = diamonds) + geom_pointrange( mapping = aes(x = cut, y = depth), stat = "summary", fun.ymin = min, fun.ymax = max, fun.y = median )

      Could not generate the same stat_summary( ) plot with that code, did a little research and stack overflow suggested two solutions: use geom_line( )

      ggplot(data = diamonds, mapping = aes(x = cut, y = depth)) + geom_line() + stat_summary(fun.y = "median", geom = "point", size = 3)

      or, reduce the amount of data by grouping it

      data = diamonds %>% group_by(cut) %>% summarise(min = min(depth), max = max(depth), median = median(depth))

      ggplot(data, aes(x = cut, y = median, ymin = min, ymax = max)) + geom_linerange() + geom_pointrange()

      Source: https://stackoverflow.com/questions/41850568/r-ggplot2-pointrange-example

    2. width controls the amount of vertical displacement, and height controls the amount of horizontal displacement.

      I think you flipped these labels. width is horizontal and height is vertical

      It would also be helpful to emphasize that unless height and/or width are explicitly set to zero, there will be jitter. When I first used geom_jitter, I did not realize this.

  4. Apr 2020
    1. They may be combining different flights?

      It seems to refer to diverted flights. The original BTS data also has diverted airport information, including a variable DivArrDelay with the following description: "Difference in minutes between scheduled and actual arrival time for a diverted flight reaching scheduled destination. The ArrDelay column remains NULL for all diverted flights." I browsed this data quickly and, indeed, those missing arr_delay observations have values in the DivArrDelay variable.

    2. There is one remaining issue. Midnight is represented by 2400, which would correspond to 1440 minutes since midnight, but it should correspond to 0. After converting all the times to minutes after midnight, x %% 1440 will convert 1440 to zero while keeping all the other times the same.

      %% 1440 is brilliant
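A minimal base R sketch of the conversion described in the quoted passage. The helper name `time_to_minutes` is an assumption for illustration, not necessarily the name used in the solutions.

```r
# Convert HHMM-style times to minutes since midnight.
# The final %% 1440 maps 2400 (1440 minutes) back to 0 and leaves
# every other value unchanged.
time_to_minutes <- function(x) {
  (x %/% 100 * 60 + x %% 100) %% 1440
}

minutes <- time_to_minutes(c(1, 517, 1230, 2359, 2400))
print(minutes)
```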

    1. function that directly calculates the number of unique values in a vector.

      n_distinct is a faster and more concise equivalent of length(unique(x)).

  5. Mar 2020
    1. Note

      Is anyone getting the above plot? even when I copy paste the code the boxplots are horizontal instead of vertical. tidyverse has the same plot as I do https://ggplot2.tidyverse.org/reference/geom_boxplot.html

      If I understand the aes correctly, group should make carat the "categorical" variable and plot vertically as seen here.

      If I replace y = price with y = depth it plots vertically as expected.

      Can anyone help me understand what's happening?

  6. Feb 2020
    1. Tidy the simple tibble below. Do you need to spread or gather it? What are the variables?

      Thank you for writing/maintaining r4ds! Great resource.

      I delivered this example in a lecture introducing tidy data, and the solution to have three variables doesn't make sense to me. Why is this not only two variables (sex and pregnant)? Sure, I could argue that count is a variable on this data, but it doesn't seem as natural.

  7. Jan 2020
    1. As that example shows,

      Using geom_count() with position ='jitter' may help a bit with overplotting:

      ggplot(data = mpg) +
        geom_count(mapping = aes(x = cty, y = hwy, color = class), position = "jitter")
    2. The default stat of geom_bar() is stat_bin(). The geom_bar() function only expects an x variable. The stat, stat_bin(), preprocesses input data by counting the number of observations for each value of x. The y aesthetic uses the values of these counts.

      The ggplot2 v3.2.1 R documentation states "geom_bar() uses stat_count() by default". Was ggplot2 updated since this answer was published? My understanding is stat_count() is used for discrete x data and stat_bin() for continuous x data.

    1. Write a function that turns (e.g.) a vector c("a", "b", "c") into the string "a, b, and c". Think carefully about what it should do if given a vector of length 0, 1, or 2.

      Can we replace the whole code with: ifelse(length(x)==1, x, str_c(str_c(x[-length(x)], collapse = ", ")," and ",str_c(x[length(x)], collapse = ",")))

      of course in the form of a function
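For comparison, here is a base R sketch that treats each length explicitly. The function name `str_commasep` and the choice to return an empty string for a zero-length input are assumptions. Note that the one-liner above would not put a comma before "and", and it has no obvious case for a vector of length 0.

```r
# Join a character vector as "a, b, and c", handling short inputs separately
str_commasep <- function(x) {
  n <- length(x)
  if (n == 0) {
    ""                                  # assumption: empty input gives ""
  } else if (n == 1) {
    x
  } else if (n == 2) {
    paste(x, collapse = " and ")        # no comma for two items
  } else {
    paste0(paste(x[-n], collapse = ", "), ", and ", x[n])
  }
}

print(str_commasep(c("a", "b", "c")))
```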

  8. Dec 2019
    1. What is meant by “48 hours over the course of the year”? This could mean two days, a span of 48 contiguous hours, or 48 hours that are not necessarily contiguous hours. I will find 48 not-necessarily contiguous hours. That definition makes better use of the methods introduced in this section and chapter.

      Thank you for the book! Could you provide a hint as to how you might go about with this if 48 hours here meant any contiguous 48 hours? My guess is using a windowing function, but I couldn't figure it out.
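One possible hint for the contiguous case is a rolling sum over hourly totals. The sketch below uses base R, made-up data, and a window of 3 instead of 48. It assumes there is exactly one row per hour with no gaps, which would need checking against the real data. `rolling_sum` is a hypothetical helper.

```r
# Sum of every run of k consecutive values, via differences of cumulative sums
rolling_sum <- function(x, k) {
  cs <- c(0, cumsum(x))
  cs[(k + 1):length(cs)] - cs[1:(length(cs) - k)]
}

# Hypothetical total delay for six consecutive hours
hourly_delay <- c(5, 1, 7, 3, 9, 2)

window_totals <- rolling_sum(hourly_delay, k = 3)
worst_start <- which.max(window_totals)  # index of the first hour of the worst window

print(window_totals)
print(worst_start)
```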

  9. Nov 2019
  10. Oct 2019
    1. A warning is provided since often, but not always,

      Sentence syntax unclear. Maybe this:

      If a warning is provided often, but not always, there may be a bug in the code.

    2. min_rank()

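      min_rank() ranks the values by size. Tied values get the same rank, but each one still takes up a position, so the next distinct value skips ahead. Missing values are not ranked.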

    1. hwy vs. cyl

      I think common notation is dependent variable (Y) versus independent variable (X), so if you're asking for a y = cyl and x = hwy plot, then I'd say that's cyl vs. hwy (cyl as a function of hwy), not vice versa as written.

  11. Sep 2019
    1. Unusually fast flights are those flights with the smallest standardized values.

      Question: Why would it be unusual that flights with air time on the left side of the distribution would be the fastest?

    1. Exercise 14.4.2.2

      Exercise 14.4.2.2 Q 2 - this misses words that are at the end of the sentence with a period after them. I use this: str_extract(sentences, "[A-Za-z]+ing[ .]")

    1. we could use map() followed by flatten_dbl(),

      The way this is written could be somewhat confusing to a reader, in my opinion, although the code makes the order of the functions clearer.

      Suggestion:

      If we wanted a numeric vector, we could call map() and then flatten_dbl() on the result,

  12. Aug 2019
    1. x = "Highway MPG Relative to Class Average", y = "Engine Displacement"

      Hi. Is "Highway MPG Relative to Class Average" the name of the y-axis? And should the name of the x-axis be "Engine Displacement", since displ is mapped to the x-axis? Thanks :)

    1. Neither NaN nor Inf are not numbers, and so they aren’t even numbers

      Edit suggestion,

      Neither NaN nor Inf is a number

      Or

      Both NaN and Inf are not even numbers.

      If accepted, then "and so they aren’t even numbers" may be deleted as it becomes superfluous.

    2. This is not the same as what you See the value of looking at the value of

      Could you please check the sentence? There seems to be some ambiguity.

    1. Okay, I’m not sure what’s going on in this data.

      Looks like a New York issue.

      > filter(flights, !is.na(dep_delay), is.na(arr_delay)) %>%
      +   count(origin)
      # A tibble: 3 x 2
        origin     n
        <chr>  <int>
      1 EWR      469
      2 JFK      337
      3 LGA      369
      
    1. (!(x %% 3) && !(x %% 5))

      Hi,

      Please can you explain to me how to read this line of code and what it means. I'm having difficulty understanding it. Individually I do understand what each symbol means but put together as it is, I'm unable to. Moreover, it looks more efficient than my effort, as shown below.

      Thank you.

      fizzbuzz <- function(n) {
        x <- n
        if ((x %% 3 == 0) && (x %% 5 == 0)) {
          print("fizzbuzz")
        } else {
          if (x %% 3 == 0) {
            print("fizz")
          } else {
            if (x %% 5 == 0) {
              print("buzz")
            } else {
              print(x)
            }
          }
        }
      }
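On reading `(!(x %% 3) && !(x %% 5))`: `x %% 3` is the remainder after dividing by 3, which is 0 when `x` is divisible by 3. In a logical context 0 is treated as FALSE and any other number as TRUE, so `!(x %% 3)` is TRUE exactly when the remainder is 0. A short base R sketch, where `is_fizzbuzz` is a made-up helper name:

```r
x <- 15
remainder <- x %% 3          # 0, because 15 is divisible by 3
negated   <- !(x %% 3)       # TRUE, because !0 is TRUE

# The compact test and the explicit comparison agree
is_fizzbuzz <- function(x) !(x %% 3) && !(x %% 5)
is_fizzbuzz_explicit <- function(x) (x %% 3 == 0) && (x %% 5 == 0)

print(c(is_fizzbuzz(15), is_fizzbuzz(9), is_fizzbuzz(10)))
```

The explicit `== 0` form in the comment above does the same thing and is arguably easier to read.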

    1. function(.data)

      Hi Jeffrey. First of all, I really appreciate these exercise solutions! I need your help: why did you name the argument ".data"? Is it any different from "data"? When I used "data", the results were the same as with ".data". I tried to find a difference between them but couldn't. Please let me know: is there really no difference?

    1. What does it mean for a flight to have a missing tailnum

      This seems to be a tad long so I apologise in advance. This is a whole new field for me and I would really like to understand.

      Could it be AA and MQ use different values to represent tailnum?

      filter( planes, tailnum == 0 )

      A tibble: 0 x 9

      length(is.na( planes$tailnum ))

      [1] 3322

      nrow(planes)

      [1] 3322

      filter( flights, tailnum == 0 )

      A tibble: 0 x 19

      length(is.na( flights$tailnum ) )

      [1] 336776

      nrow( flights )

      [1] 336776

      Yet the anti_join() as shown in your code shows clearly that there are some tailnum values in flights that are not represented in the planes dataset. How could that be? The one explanation I could come up with is that the two datasets used different tailnum values, so I tried to investigate for AA and MQ.

      tailnum_flights <- flights %>% filter( carrier == 'AA'| carrier == 'MQ' ) %>% select ( carrier, tailnum )

      tailnum_planes <- planes %>% select( tailnum )

      tailnum_planes %in% tailnum_flights

      [1] FALSE

      So, it looks like the tailnum values are not missing for the ten airlines but are represented with values different in the two datasets (flights and planes).

      What are your thoughts? Thank you.
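Two details of the checks above may be worth a second look. `length(is.na(x))` returns the length of the vector, not the number of missing values; `sum(is.na(x))` counts the NAs. And `%in%` is meant for vectors of values, so applying it to two data frames appears to compare whole columns rather than individual tail numbers. A base R sketch on made-up tail numbers:

```r
# Hypothetical tail numbers, two of them missing
tailnum <- c("N101", NA, "N303", NA)

n_values  <- length(is.na(tailnum))  # length of the vector, regardless of NAs
n_missing <- sum(is.na(tailnum))     # number of NAs

# Comparing the values themselves (vectors, not data frames)
flights_tailnum <- c("N101", "N999", NA)
planes_tailnum  <- c("N101", "N303")

in_planes <- flights_tailnum %in% planes_tailnum
not_in_planes <- setdiff(flights_tailnum, planes_tailnum)

print(n_values)
print(n_missing)
print(not_in_planes)
```

With the real data, the equivalent would be to compare `flights$tailnum` against `planes$tailnum` directly.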

  13. Jul 2019
    1. Words that end with “-ed” but not ending in “-eed”

      This worked for me as well:

      str_view(stringr::words, ".*[^e]ed$", match = TRUE)

    1. avg_dest_delays <- flights %>% group_by(dest) %>% # arrival delay NA's are cancelled flights summarise(delay = mean(arr_delay, na.rm = TRUE)) %>% inner_join(airports, by = c(dest = "faa"))

      Please, could you explain to me why if I pipe this directly to ggplot the colour aesthetics is not applied. See code below, it's basically a replication of yours but with ggplot directly piped to avg_dest_delays:

      avg_dest_delays <- flights %>% group_by( dest ) %>% summarise( delay = mean( arr_delay, na.rm = TRUE )) %>% inner_join(airports, by = c( dest = "faa" ) ) %>% ggplot( aes(lon, lat, colour = delay ) ) + borders("state" ) + geom_point( ) + coord_quickmap( )

      Thanks

    2. join

      This also seems to be the case where the by = argument is not used in a code. In that case, it seems, the semi_join() will give outputs only where the rows for both datasets correctly match, for example:

      fueleconomy::vehicles %>% semi_join(fueleconomy::common)

      produces the same output as:

      fueleconomy::vehicles %>% semi_join(fueleconomy::common, by = c("make", "model"))

      Or is that a coincidence?

      But, fueleconomy::vehicles %>% semi_join(fueleconomy::common, by = "make")

      will produce a different output as R will match only by "make" in this example.

    3. There are few planes older than 30 years, so I combine them into a single category.

      The code in the solution book totally dropped the data for planes age > 25. How might we combine them into a single row? I didn’t think it could be done without first defining a second tibble that contains all the planes age > 25, then merging it with the first tibble that contains the data for planes age <= 25, before carrying out the summarise actions on them after applying the group_by = age argument.

      older_plane_cohorts <- inner_join( flights, select( planes, tailnum, plane_year = year ), by = "tailnum" ) %>% mutate(age = year - plane_year) %>% filter(!is.na(age)) %>% mutate(age = pmin(46, age) - pmin( 25, age ) ) %>% filter( age != 0 )

      Then I got stuck. I can’t figure out how to proceed after that. And frankly, I can't say with any certainty whether my argument is sensible. And I’m not even sure it’s possible to combine the 17 rows into a single row. Your help will be appreciated.

      Thank you in advance and for your help so far. Truly appreciated.

    4. mutate(age = pmin(25, age))

      I think you used this line of code to limit the selection to planes not older than 25 years. But in the text above, you stated, "There are few planes older than 30 years, so we combine them into a single category." So, I was expecting the selection to be age <= 30, or using your notation pmin(30, age) and not pmin(25, age). Perhaps an edit of the text may be required unless I'm wrong in my supposition?
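For what it's worth, `pmin(25, age)` does not drop any rows: it replaces every age above 25 with 25, so older planes are folded into the "25" group. A base R sketch on made-up ages:

```r
# Hypothetical plane ages in years
age <- c(3, 12, 25, 31, 46)

# Element-wise minimum of 25 and each age: values above 25 become 25
capped <- pmin(25, age)

print(capped)
print(table(capped))  # the "25" group now holds three planes
```

Whether the cap should be 25 or 30 is a separate question about the wording of the text.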

    5. If we needed a unique identifier for our analysis, could add a surrogate key.

      Hi, I hope this doesn't come across as nitpicking:

      If we needed a unique identifier for our analysis, we could add a surrogate key, perhaps?

    1. It looks like it is possible for certain variables to missing for (country, years).

      "It looks like it is possible for certain variables to missing for (country, years)." Edit suggestion: It looks like it is possible that certain variables are missing for (country, years).

    1. ggplot

      Please, can you explain to me what I am getting wrong in the code below, and especially the error message, which I have tried googling with no success.

      I tried to see if instead of count to show, in my practice exercise, the average price per carat group using the code below:

      group_by( diamonds, carat ) %>% summarise( avg_price = mean(price ) ) %>% ggplot( ) + mapping = aes( color = cut_width(carat, 5 ), x = avg_price ) + geom_freqpoly( )

      The code returns the following error:

      Error in group_by(diamonds, carat) %>% summarise(avg_price = mean(price)) %>% : could not find function "+<-"

      I've googled the error without success. So my confusion is in two parts:

      1. What is the right code to show the average price per carat type
      2. What does the above error mean?

      Thanks in advance
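On the second question: because `mapping = aes(...)` sits outside `ggplot()`, R appears to read `... + mapping = aes(...) + geom_freqpoly()` as an assignment whose left-hand side is a call to `+`. It then looks for a replacement function named `+<-`, which does not exist. The same message can be reproduced in base R without ggplot2; `x` and `y` are made-up names.

```r
x <- 1
y <- 2

# "x + y = 3" is parsed as an assignment to the call `x + y`,
# so R searches for a replacement function called `+<-`
msg <- tryCatch(
  eval(parse(text = "x + y = 3")),
  error = function(e) conditionMessage(e)
)

print(msg)
```

Moving the mapping inside the call, as in `ggplot(mapping = aes(...))`, avoids this reading of the code.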

    2. visualization

      Does anyone else notice that the R4DS uses British spelling for visualisation, but the Solution textbook uses the American spelling. It's a bit weird when one switches directly from the text book to the solution. I'm sure not many people notice the difference. I probably do because I generally write Brit English. By the way, I am not censuring, just expressing my thought. I am really grateful to the author for making this available.

  14. Jun 2019
    1. In dep_time, midnight is represented by 2400, not 0.

      Recommendation: For new users of R, how one determines the representation of midnight as 2400 should be explained. At this point in the book, a reader would not have been taught how to do the proper exploration to make this determination alone, so it would seem that explaining how the author of the solutions book acquired that knowledge would be most beneficial.

    1. (date - 1L) %in% holidays_2013$date ~ "day before holiday", (date + 1L) %in% holidays_2013$date ~ "day after holiday",

      I think you mixed up these two.

      (date + 1L) %in% holidays_2013$date ~ "day before holiday",
      (date - 1L) %in% holidays_2013$date ~ "day after holiday",
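A small base R check of the direction: if `date + 1` is a holiday, then `date` is the day before it. The sketch uses a single made-up holiday and nested `ifelse()` in place of the `case_when()` from the solution.

```r
# Hypothetical holiday list with one entry
holidays <- as.Date("2013-12-25")
date <- as.Date(c("2013-12-24", "2013-12-25", "2013-12-26"))

label <- ifelse(
  date %in% holidays, "holiday",
  ifelse(
    (date + 1L) %in% holidays, "day before holiday",   # tomorrow is a holiday
    ifelse(
      (date - 1L) %in% holidays, "day after holiday",  # yesterday was a holiday
      "regular day"
    )
  )
)

print(data.frame(date, label))
```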
      
  15. May 2019
    1. Explore the distribution

      I wonder if this includes identifying outliers for possible data errors. E.g. x = 0 for 8 diamonds and z = 0 for 20. Also z = 31.8 for one. y = 0 for 7 diamonds and y = 31.8 and y = 58.9 for one each. I believe all are data errors.

    2. mutate(id = row_number())

      I believe the plots can be generated without id:

      diamonds %>%
        select(x, y, z) %>%
        gather(variable, value) %>%
        ggplot(aes(x = value)) +
        geom_density() +
        geom_rug() +
        facet_grid(vars(variable))
      
    1. bottles <- function(i) { if (i > 2) { bottles <- str_c(i - 1, " bottles") } else if (i == 2) { bottles <- "1 bottle" } else { bottles <- "no more bottles" } bottles }

      I think this should read:

      
      bottles <- function(i) {
        if (i > 1) {
          bottles <- str_c(i , " bottles")
        } else if (i == 1) {
          bottles <- "1 bottle"
        } else {
          bottles <- "no more bottles"
        }
        bottles
      }
      

      Otherwise you get no bottles of beer twice in the final output

    1. For each plane, count the number of flights before the first delay of greater than 1 hour.

      I think this is a lovely use of logicals and cumsum. But I believe that it omits planes whose first flight is delayed by more than an hour. There are 234 of these:

      (zero  <- flights %>%
         filter(!is.na(dep_delay)) %>%
         arrange(tailnum,month,day) %>%
         group_by(tailnum) %>%
         mutate(delay_gt1hr = dep_delay > 60) %>%
         filter(row_number() == 1,delay_gt1hr) %>%
         select(tailnum) %>%
         mutate(n = 0)
      )
      

      Then bind_rows could concatenate these to the tibble produced by the posted solution.

      This is a "two-part" solution. Is there a "one-part" solution?
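One possible one-part approach is to count inside a summary rather than filtering first, so planes with zero qualifying flights keep a row. The sketch below uses base R and made-up data. In the dplyr pipeline the analogous step would be something like `summarise(n = sum(cumsum(dep_delay > 60) < 1))`, which is an untested suggestion.

```r
# Hypothetical flights, already ordered by time within each plane
delays <- data.frame(
  tailnum   = c("A", "A", "A", "B", "B"),
  dep_delay = c(5, 70, 10, 90, 0)
)

# cumsum() stays at 0 until the first delay over 60 minutes,
# so counting the zeros gives the number of flights before it
flights_before_delay <- function(dep_delay) {
  sum(cumsum(dep_delay > 60) < 1)
}

n_before <- tapply(delays$dep_delay, delays$tailnum, flights_before_delay)
print(n_before)  # plane B is kept, with a count of 0
```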

    2. We will calculate this ranking in two parts

      Or, the two parts could be combined:

      rank <- flights %>%
         group_by(dest) %>%
         mutate(n_carriers = n_distinct(carrier)) %>%
         filter(n_carriers > 1) %>%
         group_by(carrier) %>%
         summarize(n_dest = n_distinct(dest)) %>%
         arrange(desc(n_dest))
      
  16. Apr 2019
    1. Surprisingly, it appears that depth (z) is always smaller than length (x) or width (y). Length is less than width in more than half the observations, the opposite of expectations. I don’t know what’s going on. If this was not a widely used da

      Diamond Dimensions

      1. If you look at this picture, it seems that depth is always smaller than either width or length. Maybe it is easier to plug it into a ring or other jewelry...
      2. Also you said "1. length is less than width, otherwise the width would be called the length". actually, it is the other way round.
    2. geom_histogram(binwidth = 1, center = 0) + geom_bar()

      I do not understand why geom_bar is used in addition to geom_histogram. It creates two layers which are essentially the same, right?

    1. # … with 1 more row

      You might want to check this last row of the tibble that's not displayed. Saturday actually has the shortest delays of any day of the week. You'll see it right away if you plot it:

      flights_dt %>% mutate(wday = wday(dep_time, label = TRUE)) %>% group_by(wday) %>% summarize(ave_dep_delay = mean(dep_delay, na.rm = TRUE)) %>% ggplot(aes(x = wday, y = ave_dep_delay)) + geom_bar(stat = "identity")

      flights_dt %>% mutate(wday = wday(dep_time, label = TRUE)) %>% group_by(wday) %>% summarize(ave_arr_delay = mean(arr_delay, na.rm = TRUE)) %>% ggplot(aes(x = wday, y = ave_arr_delay)) + geom_bar(stat = "identity")

    1. skewness <- function(x, na.rm = FALSE) { n <- length(x) m <- mean(x, na.rm = na.rm) v <- var(x, na.rm = na.rm) (sum(x - m)^3 / (n - 2)) / v^(3 / 2) }

      This function always returns 0. This is because sum(x - m) (evaluated before being raised to the power 3) will always be 0. Resolved with additional parentheses:

      skewness <- function(x, na.rm = FALSE) { n <- length(x) m <- mean(x, na.rm = na.rm) v <- var(x, na.rm = na.rm) (sum((x - m)^3) / (n - 2)) / v^(3 / 2) }

      I appreciate there are several possible formulas for skewness which may not match this one.
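The effect of the missing parentheses can be seen on a small made-up vector: deviations from the mean always sum to zero (up to floating-point error), so cubing that sum gives zero, whereas summing the cubed deviations does not.

```r
x <- c(1, 2, 3, 10)
m <- mean(x)  # 4

cube_of_sum  <- sum(x - m)^3    # (-3 - 2 - 1 + 6)^3
sum_of_cubes <- sum((x - m)^3)  # -27 - 8 - 1 + 216

print(cube_of_sum)
print(sum_of_cubes)
```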

    1. select(tailnum, on_time, arr_time, arr_delay) %>%

      If you wish, this could be eliminated since the only columns that survive are those in the summarise.

  17. Mar 2019
    1. arrange(flights, distance / air_time * 60)

      Do we need a descending order to put the fastest flights first?

      arrange(flights, desc(distance / air_time * 60))

      See also:

      flights %>% mutate(speed = distance / air_time * 60) %>% select(tailnum, distance, air_time, speed, dep_time) %>% arrange(desc(speed))

    2. Unusually fast flights are those flights with the smallest standardized values.

      I believe the problem asks for "fast flights per destination". If so, this code produces that list, including ties. Sadly, it uses slice.

      fast <- flights %>%
        filter(!is.na(air_time)) %>%
        group_by(dest) %>%
        mutate(z = (air_time - mean(air_time)) / sd(air_time)) %>%
        slice(which(z == min(z))) %>%
        select(origin,dest,month,day,carrier,flight,air_time,z) %>%
        arrange(z)
      
    3. ggplot(standardized_flights, aes(x = air_time_standard)) + geom_density()

      This is a nice plot, but the image below is not it. It appears to have been cut-and-pasted from a previous plot.

    4. standardized_flights

      Somewhat more succinctly:

      standardized_flights <- flights %>%
        filter(!is.na(air_time)) %>%
        group_by(dest,origin) %>%
        mutate(air_time_standard = (air_time - mean(air_time)) / sd(air_time)) %>%
        select(origin,dest,month,day,carrier,flight,air_time,air_time_standard)
      
  18. Feb 2019
    1. min_rank() and dense_rank()

      English is inherently ambiguous while only math (vectors, subscripts) and algorithms are not. But then you'd lose your reader.

      Here's a slight rephrasing where rank refers to a component of, say, min_rank(x) and value a component of x.

      For each set of tied values the min_rank() function assigns a rank equal to the number of values less than that tied value plus one. In contrast, the dense_rank() function assigns a rank equal to the number of distinct values less than that tied value plus one.
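The two definitions above can be written out literally in base R. This is a sketch on a made-up vector, not how dplyr implements the functions.

```r
x <- c(10, 20, 20, 30)

# min_rank: number of values less than this value, plus one
min_rank_manual <- sapply(x, function(v) sum(x < v) + 1)

# dense_rank: number of distinct values less than this value, plus one
dense_rank_manual <- sapply(x, function(v) sum(unique(x) < v) + 1)

print(min_rank_manual)    # the rank after the tie skips ahead
print(dense_rank_manual)  # no gap after the tie

# Base R's rank() with ties.method = "min" follows the first definition
base_min_rank <- rank(x, ties.method = "min")
```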

    3. The row_number() function is confusingly named. It can create ranks for any column.

      I might omit this since you give an explanation of row_number() below which is accurate. Also, tibble adds its own row numbers, and row_number(x) appears to be "the row number after x is sorted", which I don't find confusing.

    4. na.rm = TRUE

      This probably is intended to be an argument of mean. It's not necessary, however, since neither dep_delay nor dep_delay_lag have any NAs. Also, could you provide an interpretation of delay_diff? For example, what does it mean that JFK's is negative?

    5. flight

      Another definition of flight could be the quad (carrier,flight,origin,dest) which, say, might fly daily. Here's a solution with this definition.

      dest_delay <- flights %>%
        group_by(dest) %>%
        filter(arr_delay > 0) %>%
        summarize(dest_ad = sum(arr_delay,na.rm=T),dest_ct = n())
      flight_delay <- flights %>%
        group_by(carrier,flight,origin,dest) %>%
        filter(arr_delay > 0) %>%
        summarize(flight_ad = sum(arr_delay,na.rm=T),flight_ct = n())
      prop <- as_tibble(merge(dest_delay,flight_delay,by="dest")) %>%
        mutate(prop = flight_ad / dest_ad) %>%
        select(carrier,flight,origin,dest,dest_ct,flight_ct,prop)
      
    6. Exercise 5.7.4

      In general selecting just a few columns (often the mutate variables) makes the result clearer since "irrelevant" cols are eliminated.

    7. However, there are many planes that have never flown an on-time flight.

      Since this minimum rank group has an on_time of 0.0, I believe that this tibble is all planes which have never flown an on-time flight.

    8. arr_delay > 0

      (arr_delay > 0)

      Of course precedence "does the right thing" but since R will use coercion to give lgl & dbl a value, putting in the redundant parentheses prevents the not-careful reader (me, for example) from "thinking the wrong thing."

    9. cancelled

      In this dataset the presence of arr_delay implies the presence of arr_time so the boolean cancelled could be eliminated. Of course, the reader hasn't verified this relation so keeping it in makes sense. The definition of cancelled as "having an arrival delay" and "not having an arrival time" seems a bit odd.

    10. They operate within each group rather than over the entire data frame

      Do all functions on grouped tibbles operate only per group and not the entire tibble? E.g.

      flights %>% 
        group_by(month,day) %>%
        arrange(desc(dep_delay))
      

      Will the arrange function order both by group and by the entire tibble?

    11. not_cancelled

      Hadley defines this as

      not_cancelled <- flights %>% 
        filter(!is.na(dep_delay), !is.na(arr_delay))
      

      But in

      flights %>% 
        filter(is.na(dep_time) | is.na(arr_time) | is.na(dep_delay) | is.na(arr_delay)) %>% 
        select(sched_arr_time,arr_time,arr_delay,sched_dep_time,dep_time,dep_delay)
      

      the first 6 flights have an NA only in arr_delay. Aren't these likely not to have been cancelled and simply have a missing data entry? Why isn't

      not_cancelled <- flights %>%
        filter(!is.na(dep_time), !is.na(arr_time))
      

      a "better" definition?