  1. Jul 2020
    1. stocks <- tibble(
         year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
         qtr = c(1, 2, 3, 4, 2, 3, 4),
         return = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)
       )
       stocks %>% pivot_wider(names_from = year, values_from = return, values_fill = 0)
       #> # A tibble: 4 x 3
       #>     qtr `2015` `2016`
       #>   <dbl>  <dbl>  <dbl>
       #> 1     1   1.88   0
       #> 2     2   0.59   0.92
       #> 3     3   0.35   0.17
       #> 4     4  NA      2.66

      duplicated example I think

    2. complete()

      delete?

    3. set

      delete?

    4. [^ex-12.2.2]

      are all the bracketed parts supposed to be links?

  2. Jun 2020
    1. flights_delayed3

      slice_max is not found

    2. The fastest flight is the one with the average ground speed,

      Should read:

      The fastest flight is the one with the fastest average ground speed,

      The sentence as currently worded describes the flight with the average ground speed, not the fastest.

    1. numeric_cols <- vector("logical", length(df))
       # test whether each column is numeric
       for (i in seq_along(df)) {
         numeric_cols[[i]] <- is.numeric(df[[i]])
       }
       # find the indexes of the numeric columns
       idxs <- which(numeric_cols)

      Is there a reason not to use

      idxs <- which(sapply(df, is.numeric)) ?
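
      (Another sketch of the same idea with the tidyverse tools the book teaches; df stands for whichever data frame the exercise uses, and select(where(...)) assumes dplyr >= 1.0.0:)

      library(purrr)
      library(dplyr)

      idxs <- which(map_lgl(df, is.numeric))        # indexes of the numeric columns
      numeric_df <- select(df, where(is.numeric))   # or keep the numeric columns directly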

  3. May 2020
    1. What other variables are missing?

      summary(is.na(flights)) will give the NA counts for all variables

    2. >

      should be >= ("at least" means 30 or more)

    3. Exercise 5.3.3

      Your solution works, but we have not been taught the mutate command yet. Given the commands we have already been taught, you can get the same results from: arrange(flights, desc(distance / air_time))

    4. ground speed

      This worked for me arrange(flights, air_time/distance)

    5. because the value of the missing TRUE or FALSE, x

      I find this hard to follow. Why not phrase it as the next example? NA | TRUE is TRUE because anything or TRUE is always TRUE

    6. mean(on_time)

      If we were to calculate the proportion of flights not delayed or cancelled, we would need to adjust for NA values of the cancelled flights:

      summarise(n = n(), on_time = sum(on_time, na.rm=TRUE) / n)
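
      (A self-contained sketch of that adjustment, using nycflights13; the on_time definition and the grouping by tailnum are my guesses at the exercise's context, not the book's exact code:)

      library(dplyr)
      library(nycflights13)

      flights %>%
        mutate(on_time = !is.na(arr_time) & (arr_delay <= 0)) %>%
        group_by(tailnum) %>%
        summarise(n = n(), on_time = sum(on_time, na.rm = TRUE) / n)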

    1. The words that have seven letters or more are

      if we need to view/extract later the words that have at least 7 characters we can:

      str_view( stringr::words, "........*", match = T)

      adding "*" after the eighth "." states that the last "." can be repeated zero or more times.

      or

      str_view(stringr::words, ".{7,}", match = T)

      in both cases we match the words that have at least 7 characters, not only up to the seventh letter

    2. Words that contain only consonants

      An alternative: str_view(words, "[aeiou]", match = FALSE)

      Or, using str_subset: str_subset(words, "[aeiou]", negate = TRUE)

    1. foreign

      If one adds "%>% distinct(dest)" to the expression, one sees that there are four airports that are not in the FAA list: BQN, SJU, STT, and PSE. Three of these are in Puerto Rico; one (STT) is in the US Virgin Islands.
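
      (A quick way to see those four, sketched with dplyr and nycflights13; this is just an illustration of the check described above:)

      library(dplyr)
      library(nycflights13)

      # destinations flown to that have no match in the FAA airports table
      flights %>%
        distinct(dest) %>%
        anti_join(airports, by = c("dest" = "faa"))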

    2. precipitation

      Perhaps the association with visibility is even stronger.

    1. ggplot(data = diamonds) +
         geom_pointrange(
           mapping = aes(x = cut, y = depth),
           stat = "summary",
           fun.ymin = min,
           fun.ymax = max,
           fun.y = median
         )

      I could not generate the same stat_summary() plot with that code. After a little research, Stack Overflow suggested two solutions. The first uses geom_line():

      ggplot(data = diamonds, mapping = aes(x = cut, y = depth)) +
        geom_line() +
        stat_summary(fun.y = "median", geom = "point", size = 3)

      The second reduces the amount of data by grouping it first:

      data <- diamonds %>%
        group_by(cut) %>%
        summarise(min = min(depth), max = max(depth), median = median(depth))

      ggplot(data, aes(x = cut, y = median, ymin = min, ymax = max)) +
        geom_linerange() +
        geom_pointrange()

      Source: https://stackoverflow.com/questions/41850568/r-ggplot2-pointrange-example
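
      (A likely cause, for what it's worth: ggplot2 3.3.0 renamed the summary-function arguments, so fun.y/fun.ymin/fun.ymax became fun/fun.min/fun.max. A sketch of the book's code under that assumption:)

      library(ggplot2)

      # same plot as the annotated code, but with the post-3.3.0 argument names
      ggplot(data = diamonds) +
        geom_pointrange(
          mapping = aes(x = cut, y = depth),
          stat = "summary",
          fun.min = min,
          fun.max = max,
          fun = median
        )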

    2. width controls the amount of vertical displacement, and height controls the amount of horizontal displacement.

      I think you flipped these labels. width is horizontal and height is vertical

      It would also be helpful to emphasize that unless height and/or width are explicitly set to zero, there will be jitter. When I first used geom_jitter(), I did not realize this.

  4. Apr 2020
    1. combination

      is a combination

    2. contained

      contained in

    3. a particular trip by aircraft from a particular

      the sentence is not complete.

    4. few

      flew

    5. They may be combining different flights?

      It seems to refer to diverted flights. The original BTS data also has diverted airport information, including a variable DivArrDelay with the following description: "Difference in minutes between scheduled and actual arrival time for a diverted flight reaching scheduled destination. The ArrDelay column remains NULL for all diverted flights." I browsed this data quickly and, indeed, those missing arr_delay observations have values in the DivArrDelay variable.

    6. is column

      is a column.

    7. There is one remaining issue. Midnight is represented by 2400, which would correspond to 1440 minutes since midnight, but it should correspond to 0. After converting all the times to minutes after midnight, x %% 1440 will convert 1440 to zero while keeping all the other times the same.

      %% 1440 is brilliant
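
      (For readers who want the whole conversion in one place, here is a minimal sketch of the idea being praised; the function name is illustrative, not necessarily the one used in the book:)

      # convert an HHMM time to minutes since midnight, mapping 2400 to 0
      time2mins <- function(x) {
        (x %/% 100 * 60 + x %% 100) %% 1440
      }
      time2mins(c(30, 2400, 1230))
      #> [1]  30   0 750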

    1. function that directly calculates the number of unique values in a vector.

      n_distinct is a faster and more concise equivalent of length(unique(x)).
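
      (For example, both of these count the unique destinations in nycflights13; this is only an illustration of the equivalence:)

      library(dplyr)
      library(nycflights13)

      n_distinct(flights$dest)
      length(unique(flights$dest))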

  5. Mar 2020
    1. Note

      Is anyone getting the above plot? even when I copy paste the code the boxplots are horizontal instead of vertical. tidyverse has the same plot as I do https://ggplot2.tidyverse.org/reference/geom_boxplot.html

      If I understand the aes correctly, group should make carat the "categorical" variable and plot vertically as seen here.

      If I replace y = price with y = depth it plots vertically as expected.

      Can anyone help me understand what's happening?

    1. What does the one_of() function do? Why might it be helpful in conjunction with this vector?

      Retired Selection Helpers

      one_of() is retired in favour of the more precise any_of() and all_of() selectors.

      https://www.rdocumentation.org/packages/tidyselect/versions/1.0.0/topics/one_of
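
      (A small sketch of the difference between the newer selectors, using nycflights13 for illustration; the column names are just examples:)

      library(dplyr)
      library(nycflights13)

      vars <- c("year", "month", "day", "dep_delay", "arr_delay")

      # all_of() errors if any requested name is missing from the data;
      # any_of() silently skips names that are missing
      flights %>% select(all_of(vars))
      flights %>% select(any_of(c(vars, "not_a_column")))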

    2. will

      Change "will the difference" to "is the difference" or "will be the difference"

    1. Why are gather() and spread() not perfectly symmetrical? Carefully consider the following example:

      Shouldn't we update these functions to the newer pivot_longer() and pivot_wider()? (check ?gather, and also https://r4ds.had.co.nz/tidy-data.html#exercises-24 )
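
      (For reference, a tiny side-by-side of an older verb and its newer equivalent; the data are made up for illustration and are not the book's example:)

      library(dplyr)
      library(tidyr)
      library(tibble)

      stocks <- tibble(
        year   = c(2015, 2015, 2016, 2016),
        half   = c(1, 2, 1, 2),
        return = c(1.88, 0.59, 0.92, 0.17)
      )

      # spread() and pivot_wider() reshape the table the same way here
      stocks %>% spread(key = year, value = return)
      stocks %>% pivot_wider(names_from = year, values_from = return)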

  6. Feb 2020
    1. Tidy the simple tibble below. Do you need to spread or gather it? What are the variables?

      Thank you for writing/maintaining r4ds! Great resource.

      I delivered this example in a lecture introducing tidy data, and the solution to have three variables doesn't make sense to me. Why is this not only two variables (sex and pregnant)? Sure, I could argue that count is a variable on this data, but it doesn't seem as natural.

    1. Exercise 3.7.5

      This exercise makes more sense AFTER introducing Position adjustments in part 3.8.

    2. na.rm:

      from the documentation: If FALSE, the default, missing values are removed with a warning. If TRUE, missing values are silently removed.

  7. Jan 2020
    1. As that example shows,

      Using geom_count() with position ='jitter' may help a bit with overplotting:

      ggplot(data = mpg) +
        geom_count(mapping = aes(x = cty, y = hwy, color = class), position = "jitter")

    2. stat_violin()

      The ggplot2 v3.2.1 R documentation shows geom_violin() paired with stat_ydensity()

    3. The default stat of geom_bar() is stat_bin(). The geom_bar() function only expects an x variable. The stat, stat_bin(), preprocesses input data by counting the number of observations for each value of x. The y aesthetic uses the values of these counts.

      The ggplot2 v3.2.1 R documentation states "geom_bar() uses stat_count() by default". Was ggplot2 updated since this answer was published? My understanding is stat_count() is used for discrete x data and stat_bin() for continuous x data.

    4. The following list contains the categorical variables

      missing the variable manufacturer, which is also a categorical (<chr>) variable.

    1. Write a function that turns (e.g.) a vector c("a", "b", "c") into the string "a, b, and c". Think carefully about what it should do if given a vector of length 0, 1, or 2.

      Can we replace the whole code with the following (written as a function, of course)?

      ifelse(
        length(x) == 1,
        x,
        str_c(str_c(x[-length(x)], collapse = ", "), " and ", str_c(x[length(x)], collapse = ","))
      )

  8. Dec 2019
    1. What is meant by “48 hours over the course of the year”? This could mean two days, a span of 48 contiguous hours, or 48 hours that are not necessarily contiguous hours. I will find 48 not-necessarily contiguous hours. That definition makes better use of the methods introduced in this section and chapter.

      Thank you for the book! Could you provide a hint as to how you might go about with this if 48 hours here meant any contiguous 48 hours? My guess is using a windowing function, but I couldn't figure it out.

  9. Nov 2019
    1. the an

      Either it is "the" or "an", but not both.

    2. with with

      "with" written 2 times.

    3. geom_bin()

      this should be "stat_bin()"

    4. no plots

      I think this should be "no points" (i. e. no points on the scatter plot).

  10. Oct 2019
    1. )

      Typo.

    2. A warning is provided since often, but not always,

      Sentence syntax unclear. Maybe this:

      If a warning is provided often, but not always, there may be a bug in the code.

    3. vectors recycles

      vectors, R recycles

    4. min_rank()

      min_rank() orders the values by size; tied values receive the same rank, but each value still takes up a position (so the ranks after a tie skip ahead), and missing values are not ranked.
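
      (A tiny illustration of that behaviour; the values are made up:)

      library(dplyr)

      x <- c(10, 20, 20, 30, NA)
      min_rank(x)    # 1 2 2 4 NA: the tie takes two positions, so rank 3 is skipped
      dense_rank(x)  # 1 2 2 3 NA: no gaps after ties, missing values get NA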

    1. how these parameters affects

      affect

    2. use already

      already use

    3. The benefits encoding

      Missing "of" between "benefits" and "encoding"

    4. hwy vs. cyl

      I think common notation is dependent variable (Y) versus independent variable (X), so if you're asking for a y = cyl and x = hwy plot, then I'd say that's cyl vs. hwy (cyl as a function of hwy), not vice versa as written.

  11. Sep 2019
    1. Unusually fast flights are those flights with the smallest standardized values.

      Question Why would it be unusual that flights with air time on the left side of the distribution would be the fastest?

    1. Exercise 14.4.2.2

      Exercise 14.4.2.2 Q2: this misses words that are at the end of the sentence with a period after them. I use this: str_extract(sentences, "[A-Za-z]+ing[ .]")

    1. we could use map() followed by flatten_dbl(),

      The way this is written could be somewhat confusing to a reader, in my opinion, although the code makes the order of the functions clearer.

      Suggestion:

      If we wanted a numeric vector, we could combine map() with flatten_dbl(),

    2. like so,

      Edit suggestion.

      as shown:

  12. Aug 2019
    1. x = "Highway MPG Relative to Class Average", y = "Engine Displacement"

      Hi. Shouldn't "Highway MPG Relative to Class Average" be the name of the y-axis, and "Engine Displacement" the name of the x-axis, since we mapped displ to x? Thanks :)

    1. at

      Minor mistake: a not at

    2. not

      "Not" repeated. Delete?

    3. Neither NaN nor Inf are not numbers, and so they aren’t even numbers

      Edit suggestion,

      Neither NaN nor Inf is a number

      Or

      Both NaN and Inf are not even numbers.

      If accepted, then "and so they aren’t even numbers" may be deleted as it becomes superfluous.

    4. This is not the same as what you See the value of looking at the value of

      Could you please check the sentence? There seems to be some ambiguity.

    1. Okay, I’m not sure what’s going on in this data.

      Looks like a New York issue.

      > filter(flights, !is.na(dep_delay), is.na(arr_delay)) %>%
      +   count(origin)
      # A tibble: 3 x 2
        origin     n
        <chr>  <int>
      1 EWR      469
      2 JFK      337
      3 LGA      369
      
    1. is

      redundant. delete?

    2. (!(x %% 3) && !(x %% 5))

      Hi,

      Please can you explain to me how to read this line of code and what it means. I'm having difficulty understanding it. Individually I do understand what each symbol means but put together as it is, I'm unable to. Moreover, it looks more efficient than my effort, as shown below.

      Thank you.

      fizzbuzz <- function(n) {

      x <- n

      if(( x %% 3 == 0 ) && ( x %% 5 == 0 )) {

      print("fizzbuzz")

      } else {

      if(x %% 3 == 0) {

      print("fizz")
      

      } else {

      if(x %% 5 == 0) {

      print("buzz")

      } else { print(x)

      } } }

      }

    1. function(.data)

      Hi Jeffrey. First of all, I really appreciate these exercise solutions! And I need your help: why did you write ".data"? Is there any difference from using "data"? When I coded "data", the results were the same as with ".data". I tried to find a difference between them but couldn't. Please let me know, is there really no difference?

    1. What does it mean for a flight to have a missing tailnum

      This seems to be a tad long so I apologise in advance. This is a whole new field for me and I would really like to understand.

      Could it be AA and MQ use different values to represent tailnum?

      filter( planes, tailnum == 0 )

      A tibble: 0 x 9

      length(is.na( planes$tailnum ))

      [1] 3322

      nrow(planes)

      [1] 3322

      filter( flights, tailnum == 0 )

      A tibble: 0 x 19

      length(is.na( flights$tailnum ) )

      [1] 336776

      nrow( flights )

      [1] 336776

      Yet, the anti_join() shown in your code clearly shows that there are some tailnum values in flights that are not represented in the planes dataset. How could that be? The one explanation I could come up with is that the two datasets use different tailnum values, so I tried to investigate for AA and MQ.

      tailnum_flights <- flights %>% filter( carrier == 'AA'| carrier == 'MQ' ) %>% select ( carrier, tailnum )

      tailnum_planes <- planes %>% select( tailnum )

      tailnum_planes %in% tailnum_flights

      [1] FALSE

      So, it looks like the tailnum values are not missing for the ten airlines but are represented with different values in the two datasets (flights and planes).

      What are your thoughts? Thank you.

  13. Jul 2019
    1. In the full English language, no

      Erm, full stop omitted after no.

      And thanks for the link. It is an interesting read.

    2. Words that end with “-ed” but not ending in “-eed”

      This worked for me as well:

      str_view(stringr::words, ".*[^e]ed$", match = TRUE)

    3. "ab$^$sfas"

      Please, could you explain why you included this in the code? I replicated the answer without it. Thanks.

    4. str_view(words, "([[:letter:]]).*\\1", match = TRUE)

      Does this work? E.g. "achieve" does not have a matching PAIR of letters.

    1. avg_dest_delays <- flights %>%
         group_by(dest) %>%
         # arrival delay NA's are cancelled flights
         summarise(delay = mean(arr_delay, na.rm = TRUE)) %>%
         inner_join(airports, by = c(dest = "faa"))

      Please, could you explain to me why, if I pipe this directly to ggplot, the colour aesthetic is not applied? See the code below; it's basically a replication of yours but with ggplot piped directly onto avg_dest_delays:

      avg_dest_delays <- flights %>%
        group_by(dest) %>%
        summarise(delay = mean(arr_delay, na.rm = TRUE)) %>%
        inner_join(airports, by = c(dest = "faa")) %>%
        ggplot(aes(lon, lat, colour = delay)) +
        borders("state") +
        geom_point() +
        coord_quickmap()

      Thanks

    2. any any

      repetition?

    3. join

      This also seems to be the case where the by = argument is not used in a code. In that case, it seems, the semi_join() will give outputs only where the rows for both datasets correctly match, for example:

      fueleconomy::vehicles %>% semi_join(fueleconomy::common)

      produces the same output as:

      fueleconomy::vehicles %>% semi_join(fueleconomy::common, by = c("make", "model"))

      Or is that a coincidence?

      But fueleconomy::vehicles %>% semi_join(fueleconomy::common, by = "make")

      will produce a different output, as R will match only by "make" in this example.

    4. arr_delay

      The code doesn't affect the output but I thought you might mean, sum( !is.na( dep_delay ))

    5. There are few planes older than 30 years, so I combine them into a single category.

      The code in the solution book completely dropped the data for planes with age > 25. How might we combine them into a single row? I didn't think it could be done without first defining a second tibble containing all the planes with age > 25, then merging it with the first tibble containing the planes with age <= 25, before carrying out the summarise steps after grouping by age.

      older_plane_cohorts <- inner_join(flights,
                                        select(planes, tailnum, plane_year = year),
                                        by = "tailnum") %>%
        mutate(age = year - plane_year) %>%
        filter(!is.na(age)) %>%
        mutate(age = pmin(46, age) - pmin(25, age)) %>%
        filter(age != 0)

      Then I got stuck. I can't figure out how to proceed after that. And frankly, I can't say with any certainty whether my argument is sensible. I'm not even sure it's possible to combine the 17 rows into a single row. Your help will be appreciated.

      Thank you in advance and for your help so far. Truly appreciated.

    6. mutate(age = pmin(25, age))

      I think you used this line of code to limit the selection to planes not older than 25 years. But in the text above, you stated, "There are few planes older than 30 years, so we combine them into a single category." So, I was expecting the selection to be age <= 30, or using your notation pmin(30, age) and not pmin(25, age). Perhaps an edit of the text may be required unless I'm wrong in my supposition?

    7. This however, this default

      Edit suggestion: However, this default...

    8. If we needed a unique identifier for our analysis, could add a surrogate key.

      Hi, I hope this doesn't come across as nitpicking:

      If we needed a unique identifier for our analysis, we could add a surrogate key, perhaps?

    9. faa$airports

      Shouldn't this be "airports$faa" since airports is a data frame and faa is a variable?

    1. It looks like it is possible for certain variables to missing for (country, years).

      "It looks like it is possible for certain variables to missing for (country, years)." Edit suggestion: It looks like it is possible that certain variables are missing for (country, years).

    2. is

      delete?

    1. ggplot

      Please, can you explain to me what I am getting wrong in the code below, and especially the error message, which I have tried googling without success.

      In my practice exercise I tried to show, instead of a count, the average price per carat group, using the code below:

      group_by( diamonds, carat ) %>% summarise( avg_price = mean(price ) ) %>% ggplot( ) + mapping = aes( color = cut_width(carat, 5 ), x = avg_price ) + geom_freqpoly( )

      The code returns the following error:

      Error in group_by(diamonds, carat) %>% summarise(avg_price = mean(price)) %>% : could not find function "+<-"

      I've googled the error without success. So my confusion is in two parts:

      1. What is the right code to show the average price per carat type
      2. What does the above error mean?

      Thanks in advance

    2. visualization

      Does anyone else notice that the R4DS uses British spelling for visualisation, but the Solution textbook uses the American spelling. It's a bit weird when one switches directly from the text book to the solution. I'm sure not many people notice the difference. I probably do because I generally write Brit English. By the way, I am not censuring, just expressing my thought. I am really grateful to the author for making this available.

    3. there spikes in

      Omission of are? Perhaps, you meant there are?

    4. the these

      Typo. Either the or these, preferably these in my opinion.

    1. n

      Can anyone please explain the n in this code? Thanks in advance.

    2. sin()

      cos( )

    3. sin)

      sin( )

    4. head(arrange(fastest_flights, desc(mph)))

      why not just use: arrange(flights, desc(distance / air_time))

  14. Jun 2019
    1. In dep_time, midnight is represented by 2400, not 0.

      Recommendation: for new users of R, it should be explained how one determines that midnight is represented as 2400. At this point in the book, a reader would not yet have been taught how to do the exploration needed to make this determination alone, so explaining how the author of the solutions book acquired that knowledge would be most beneficial.

    1. (date - 1L) %in% holidays_2013$date ~ "day before holiday", (date + 1L) %in% holidays_2013$date ~ "day after holiday",

      I think you mixed up these two.

      (date + 1L) %in% holidays_2013$date ~ "day before holiday",
      (date - 1L) %in% holidays_2013$date ~ "day after holiday",
      
    2. (α+βx)

      should be $$\beta \log x$$

    1. the more that pre-allocation will outperform appending.

      Based on the output, it seems to me that appending outperforms pre-allocation. I ran the code myself and got the same results. However, another R notebook gave the opposite results.

    1. >

      Omit.

    2. Exercise 7.4.1

      Many thanks for this lovely example. I hadn't understood geom_bar's NA behavior with factor variables until your explanation.

  15. May 2019
    1. Presumably there is a premium for a 1 carat diamond

      Is this "buyer's psychology?" The opposite of $4.99 being cheaper than $5.00?

    2. number of diamonds in each carat range

      Nice! Your intuition seems correct. But how do you account for the large number of 1.01 carat diamonds?

    3. print(n = 30)

      print(n = Inf) will print all rows of a tibble.

    4. seem

      see

    5. There are no diamonds with a price of $1,500

      More precisely: there's a $90 gap: the closed interval [1455,1545].

    6. Explore the distribution

      I wonder if this includes identifying outliers for possible data errors. E.g. x = 0 for 8 diamonds and z = 0 for 20. Also z = 31.8 for one. y = 0 for 7 diamonds and y = 31.8 and y = 58.9 for one each. I believe all are data errors.
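
      (A quick filter along those lines; the thresholds are just the values mentioned above, not anything from the book:)

      library(ggplot2)  # for the diamonds data
      library(dplyr)

      # rows with a zero dimension or an implausibly large y or z
      diamonds %>%
        filter(x == 0 | y == 0 | z == 0 | y > 30 | z > 30) %>%
        select(carat, price, x, y, z)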

    7. mutate(id = row_number())

      I believe the plots can be generated without id:

      diamonds %>%
        select(x, y, z) %>%
        gather(variable, value) %>%
        ggplot(aes(x = value)) +
        geom_density() +
        geom_rug() +
        facet_grid(vars(variable))
      
    8. will

      Is it deliberate or an error to put "will" and "is" together?

    1. bottles <- function(i) {
         if (i > 2) {
           bottles <- str_c(i - 1, " bottles")
         } else if (i == 2) {
           bottles <- "1 bottle"
         } else {
           bottles <- "no more bottles"
         }
         bottles
       }

      I think this should read:

      
      bottles <- function(i) {
        if (i > 1) {
          bottles <- str_c(i , " bottles")
        } else if (i == 1) {
          bottles <- "1 bottle"
        } else {
          bottles <- "no more bottles"
        }
        bottles
      }
      

      Otherwise you get no bottles of beer twice in the final output

    2. map_lglg

      typo

    3. is.factor(diamonds$color)

      repetition?

    4. mean(X[i, ])

      I think you forgot to change mean(X[i, ]) to mean(X[, i]) when calculating the column means.

    5. df[[i]] <- read_csv(files[[i]])

      I think it should be corrected this way: for (fname in ...) { df[[fname]] <- bind_rows()

    1. For each plane, count the number of flights before the first delay of greater than 1 hour.

      I think this is a lovely use of logicals and cumsum. But I believe that it omits planes whose first flight is delayed by more than an hour. There are 234 of these:

      (zero  <- flights %>%
         filter(!is.na(dep_delay)) %>%
         arrange(tailnum,month,day) %>%
         group_by(tailnum) %>%
         mutate(delay_gt1hr = dep_delay > 60) %>%
         filter(row_number() == 1,delay_gt1hr) %>%
         select(tailnum) %>%
         mutate(n = 0)
      )
      

      Then bind_rows could concatenate these to the tibble produced by the posted solution.

      This is a "two-part" solution. Is there a "one-part" solution?

    2. year

      Since the year column contains only 2013, you could arrange chronologically without this argument.

    3. We will calculate this ranking in two parts

      Or, the two parts could be combined:

      rank <- flights %>%
         group_by(dest) %>%
         mutate(n_carriers = n_distinct(carrier)) %>%
         filter(n_carriers > 1) %>%
         group_by(carrier) %>%
         summarize(n_dest = n_distinct(dest)) %>%
         arrange(desc(n_dest))
      
    4. width = 100

      Here width = Inf will print all columns.

    5. print(width = 120)

      I believe that print(width = Inf) will reliably print all selected columns.

    6. below than the

      below the

    7. observations each

      observations of each

    1. #> Warning: Computation failed in `stat_binhex()`:

      Why are these two plots not displayed properly? Only an empty background is shown.

  16. Apr 2019
    1. Surprisingly, it appears that depth (z) is always smaller than length (x) or width (y). Length is less than width in more than half the observations, the opposite of expectations. I don’t know what’s going on. If this was not a widely used da

      Diamond Dimensions

      1. If you look at this picture, it seems that depth is always smaller than either width or length. Maybe it is easier to plug it into a ring or other jewelry...
      2. Also, you said "1. length is less than width, otherwise the width would be called the length". Actually, it is the other way round.
    2. geom_histogram(binwidth = 1, center = 0) + geom_bar()

      I do not understand why geom_bar() is used in addition to geom_histogram(). It creates two layers which are essentially the same, right?

    1. # … with 1 more row

      You might want to check this last row of the tibble that's not displayed. Saturday actually has the shortest delays of any day of the week. You'll see it right away if you plot it:

      flights_dt %>%
        mutate(wday = wday(dep_time, label = TRUE)) %>%
        group_by(wday) %>%
        summarize(ave_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
        ggplot(aes(x = wday, y = ave_dep_delay)) +
        geom_bar(stat = "identity")

      flights_dt %>%
        mutate(wday = wday(dep_time, label = TRUE)) %>%
        group_by(wday) %>%
        summarize(ave_arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
        ggplot(aes(x = wday, y = ave_arr_delay)) +
        geom_bar(stat = "identity")

    2. %%

      Shouldn't this be %/%?

    1. skewness <- function(x, na.rm = FALSE) {
         n <- length(x)
         m <- mean(x, na.rm = na.rm)
         v <- var(x, na.rm = na.rm)
         (sum(x - m)^3 / (n - 2)) / v^(3 / 2)
       }

      This function always returns 0. This is because sum(x - m) (computed before being raised to the power 3) will always be 0. Resolved with additional parentheses:

      skewness <- function(x, na.rm = FALSE) {
        n <- length(x)
        m <- mean(x, na.rm = na.rm)
        v <- var(x, na.rm = na.rm)
        (sum((x - m)^3) / (n - 2)) / v^(3 / 2)
      }

      I appreciate there are several possible formulas for skewness which may not match this one.
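
      (A quick, illustrative check of the parenthesisation point; the numbers are made up:)

      x <- c(1, 2, 3, 4, 100)
      m <- mean(x)
      sum(x - m)^3     # essentially 0: the deviations sum to zero before cubing
      sum((x - m)^3)   # a large positive value, as intended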

    1. I

      Omit.

    2. select(tailnum, on_time, arr_time, arr_delay) %>%

      If you wish, this could be eliminated since the only columns that survive are those in the summarise.

    3. flights

      flights at departure

    4. the

      omit

  17. Mar 2019
    1. arrange(flights, distance / air_time * 60)

      Do we need a descending order to put the fastest flights first? arrange(flights, desc(distance / air_time * 60))

      See also:

      flights %>%
        mutate(speed = distance / air_time * 60) %>%
        select(tailnum, distance, air_time, speed, dep_time) %>%
        arrange(desc(speed))

    2. Look at the number of cancelled flights per day. Is there a pattern?

      This part is missing

    3. benificial

      beneficial

    4. were

      omit

    5. ungroup() %>%

      I believe you don't need ungroup().

    6. an standard deviation. That

      "and standard deviation. The following"

    7. sinpi()

      Typo. I suspect you mean cospi(), because "sinpi(x)" had just been introduced

    8. Unusually fast flights are those flights with the smallest standardized values.

      I believe the problem asks for "fast flights per destination". If so, this code produces that list, including ties. Sadly, it uses slice.

      fast <- flights %>%
        filter(!is.na(air_time)) %>%
        group_by(dest) %>%
        mutate(z = (air_time - mean(air_time)) / sd(air_time)) %>%
        slice(which(z == min(z))) %>%
      select(origin,dest,month,day,carrier,flight,air_time,z) %>%
        arrange(z)
      
    9. ggplot(standardized_flights, aes(x = air_time_standard)) + geom_density()

      This is a nice plot, but the image below is not it. It appears to have been cut-and-pasted from a previous plot.

    10. standardized_flights

      Somewhat more succinctly:

      standardized_flights <- flights %>%
        filter(!is.na(air_time)) %>%
        group_by(dest,origin) %>%
        mutate(air_time_standard = (air_time - mean(air_time)) / sd(air_time)) %>%
      select(origin,dest,month,day,carrier,flight,air_time,air_time_standard)
      
    1. filter(n > 100)

      This should be filter(n >= 100): "at least 100 flights" implies that 100 should be included as the least possible value of number of flights.

    1. Both R markdown files and can be knit R markdown documents can be knit.

      This part does not make sense . Needs to be reviewed.

  18. Feb 2019
    1. in 10 flights

      Omit, since you've already filtered.

    2. min_rank() and dense_rank()

      English is inherently ambiguous while only math (vectors, subscripts) and algorithms are not. But then you'd lose your reader.

      Here's a slight rephrasing where rank refers to a component of, say, min_rank(x) and value a component of x.

      For each set of tied values the min_rank() function assigns a rank equal to the number of values less than that tied value plus one. In contrast, the dense_rank() function assigns a rank equal to the number of distinct values less than that tied value plus one.

    3. min_rank() and dense_rank()

      English is inherently ambiguous while only math (vectors, subscripts) and algorithms are not. But then you'd lose your reader.

      Here's a slight rephrasing where rank refers to a component of, say, min_rank(x) and value a component of x.

      For each set of tied values the min_rank() function assigns a rank equal to the number of values less than that tied value plus one. The dense_rank() function assigns a rank equal to the number of distinct values less than that tied value plus one.

    4. ranking handles

      the ranking functions handle

    5. function

      functions

    6. roughly

      Is it "rough" or "exact"?

    7. an element

      a vector

    8. The row_number() function is confusingly named. It can create ranks for any column.

      I might omit this since you give an explanation of row_number() below which is accurate. Also, tibble adds its own row numbers, and row_number(x) appears to be "the row number after x is sorted", which I don't find confusing.

    9. missing values

      ties

    10. na.rm = TRUE

      This probably is intended to be an argument of mean. It's not necessary, however, since neither dep_delay nor dep_delay_lag have any NAs. Also, could you provide an interpretation of delay_diff? For example, what does it mean that JFK's is negative?

    11. year,

      Could be eliminated since the dataset is for one specific year.

    12. flight

      Another definition of flight could be the quad (carrier,flight,origin,dest) which, say, might fly daily. Here's a solution with this definition.

      dest_delay <- flights %>%
        group_by(dest) %>%
        filter(arr_delay > 0) %>%
        summarize(dest_ad = sum(arr_delay,na.rm=T),dest_ct = n())
      flight_delay <- flights %>%
        group_by(carrier,flight,origin,dest) %>%
        filter(arr_delay > 0) %>%
        summarize(flight_ad = sum(arr_delay,na.rm=T),flight_ct = n())
      prop <- as_tibble(merge(dest_delay,flight_delay,by="dest")) %>%
        mutate(prop = flight_ad / dest_ad) %>%
      select(carrier,flight,origin,dest,dest_ct,flight_ct,prop)
      
    13. Exercise 5.7.4

      In general selecting just a few columns (often the mutate variables) makes the result clearer since "irrelevant" cols are eliminated.

    14. !is.na(arr_delay),

      This could be omitted but keeping it in redundantly may add clarity.

    15. only

      Omit

    16. However, there are many planes that have never flown an on-time flight.

      Since this minimum rank group has an on_time of 0.0, I believe that this tibble is all planes which have never flown an on-time flight.

    17. <=

      ==

      I believe min_rank cannot be 0.

    18. arr_delay > 0

      (arr_delay > 0)

      Of course precedence "does the right thing" but since R will use coercion to give lgl & dbl a value, putting in the redundant parentheses prevents the not-careful reader (me, for example) from "thinking the wrong thing."

    19. cancelled

      In this dataset the presence of arr_delay implies the presence of arr_time so the boolean cancelled could be eliminated. Of course, the reader hasn't verified this relation so keeping it in makes sense. The definition of cancelled as "having an arrival delay" and "not having an arrival time" seems a bit odd.
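
      (A quick way to test that implication, sketched with nycflights13; no particular count is assumed here:)

      library(dplyr)
      library(nycflights13)

      # if this is zero, a non-missing arr_delay indeed implies a non-missing arr_time
      flights %>%
        filter(!is.na(arr_delay), is.na(arr_time)) %>%
        nrow()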

    20. They operate within each group rather than over the entire data frame

      Do all functions on grouped tibbles operate only per group and not the entire tibble? E.g.

      flights %>% 
        group_by(month,day) %>%
        arrange(desc(dep_delay))
      

      Will the arrange function order both by group and by the entire tibble?
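
      (As far as I can tell from the dplyr documentation, arrange() is the exception: it ignores grouping unless .by_group = TRUE. A small sketch:)

      library(dplyr)
      library(nycflights13)

      flights %>%
        group_by(month, day) %>%
        arrange(desc(dep_delay))                    # sorted over the whole tibble

      flights %>%
        group_by(month, day) %>%
        arrange(desc(dep_delay), .by_group = TRUE)  # sorted within each (month, day) group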

    21. There are more sophisticated ways to do this analysis

      Perhaps, but this is a lovely example of the power of R.

    22. , which calculates

      ". atan2(y, x) returns"

      My worry is that the reader would assume that atan2(x, y) returns that angle.

    23. not_cancelled

      Hadley defines this as

      not_cancelled <- flights %>% 
        filter(!is.na(dep_delay), !is.na(arr_delay))
      

      But in

      flights %>% 
        filter(is.na(dep_time) | is.na(arr_time) | is.na(dep_delay) | is.na(arr_delay)) %>% 
      select(sched_arr_time,arr_time,arr_delay,sched_dep_time,dep_time,dep_delay)
      

      the first 6 flights have an NA only in arr_delay. Aren't these likely not to have been cancelled and simply have a missing data entry? Why isn't

      not_cancelled <- flights %>%
      filter(!is.na(dep_time), !is.na(arr_time))
      

      a "better" definition?
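
      (One way to compare the two candidate definitions is simply to count the rows each keeps; no particular counts are assumed here:)

      library(dplyr)
      library(nycflights13)

      flights %>% filter(!is.na(dep_delay), !is.na(arr_delay)) %>% nrow()
      flights %>% filter(!is.na(dep_time), !is.na(arr_time)) %>% nrow()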

    24. and the passenger arrived at the same time

      Omit.

    25. Exercise 5.5.4

      Another solution. Now that we know we can "trust" dep_delay we could simply

      md <- flights %>%
      select(carrier,flight,origin,dest,sched_dep_time,dep_time,dep_delay) %>%
      arrange(min_rank(desc(dep_delay)))
      

      There are no ties in the top 10.

    26. Daylight

      If you wish, you could change this to "Eastern Daylight" to agree with its previous reference in that sentence.

    27. Except for flights daylight savings started (March 10) or ended (November 3)

      I understand the point of the paragraph but this sentence is a bit unclear.

    1. sum(cases)

      na.rm = TRUE must be inside the sum() parentheses for the graph below to appear; otherwise it will just be a blank graph.

      By the way thanks so much for these solutions, they are extremely helpful and teach things which are not taught in the book!