426 Matching Annotations
  1. Jan 2024
    1. Midnight is represented by 2400, which would correspond to 1440 minutes since midnight, but it should correspond to 0.

      This actually doesn't work, because sometimes 1440 means 0 and sometimes it means 2400, depending on whether the delay crosses over from one day to the next.
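
      A small sketch of the ambiguity, assuming a conversion helper along the lines of the solutions' time2mins (reconstructed here, so treat it as hypothetical):

      # convert HHMM to minutes since midnight; %% 1440 maps 2400 to 0
      time2mins <- function(x) {
        (x %/% 100 * 60 + x %% 100) %% 1440
      }
      time2mins(2400)  # 0: fine when 2400 means the start of the listed day
      # but if a flight scheduled at 2350 departs at 0010 the next day,
      # differencing minutes-since-the-same-midnight gives -1430 instead of +20:
      time2mins(10) - time2mins(2350)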

  2. Dec 2023
  3. Oct 2023
  4. May 2023
    1. +

      I don't believe this + character serves any purpose. It allows one or more spaces between the matching words of groups 1 and 2. But why would there ever be more than one space?
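
      For what it's worth, a small illustration (with hypothetical input) of when the + would matter: a single-space pattern fails on an accidental double space, while the + version still matches.

      library(stringr)
      str_match("one  two", "(\\w+) (\\w+)")   # no match: exactly one space required
      str_match("one  two", "(\\w+) +(\\w+)")  # matches: one or more spaces allowed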

  5. Feb 2023
  6. Jan 2023
  7. Jul 2022
    1. #> Warning: 2 parsing failures.
       #> row col  expected  actual         file
       #> 1   -- 2 columns 3 columns literal data
       #> 2   -- 2 columns 3 columns literal data
       #> # A tibble: 2 x 2
       #>       a     b
       #>   <dbl> <dbl>
       #> 1     1     2
       #> 2     4     5

      Behavior differs - the last values in each row are concatenated, so I get:

      a     b
      1    23
      4    56

      Same general pattern for the other answers.

  8. Jun 2022
    1. The code would have had to check if the departure time is less than the scheduled departure time plus departure delay (in minutes).

      This error is addressed and corrected in 16.4.2 - maybe worth noting for readers who want to check out the solution.

    1. cancelled_prop = mean(cancelled),

      How about using proportions or percentages? Even the plots would be more revealing, with less cluttered points.

      cancelled_prop = cancelled_num / flights_num,

      or, even better:

      cancelled_prop = (cancelled_num / flights_num) * 100,

      ggplot(cancelled_per_day, aes(x = cancelled_prop, y = avg_dep_delay)) +
        geom_point(color = 'blue') +
        geom_smooth(se = FALSE)

      ggplot(cancelled_per_day, aes(x = cancelled_prop, y = avg_arr_delay)) +
        geom_point(color = 'red') +
        geom_smooth(se = FALSE, color = 'red')

    1. "2003-01-01"

      I think this should be Feb 1 2003 (2003-02-01), based on: date_custom <- c("Day 01 Mon 02 Year 03", "Day 03 Mon 01 Year 01")
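
      A quick check with readr supports this, assuming the exercise's format string with %m (month) in place of %M (minutes):

      readr::parse_date(c("Day 01 Mon 02 Year 03", "Day 03 Mon 01 Year 01"),
                        format = "Day %d Mon %m Year %y")
      #> [1] "2003-02-01" "2001-01-03"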

    2. read_csv("a,b\n\"1")

      behavior is different now:

      > read_csv("a,b\n\"1")
      Rows: 0 Columns: 2
      -- Column specification ---------------------------------------------------------------
      Delimiter: ","
      chr (2): a, b

      i Use spec() to retrieve the full column specification for this data.
      i Specify the column types or set show_col_types = FALSE to quiet this message.

      # A tibble: 0 x 2
      # ... with 2 variables: a <chr>, b <chr>

    3. that value is dropped.

      As above, behavior has changed and final two values now combined to "34".

    4. #> Warning: 2 parsing failures.

      read_csv behavior has changed, warning is now: "Warning message: One or more parsing issues, see problems() for details"

    5. the two functions have the exact same arguments:

      This is because read_csv() and read_tsv() are special cases of the more general read_delim().

    1. The n_extra argument determines the number of extra columns to print information for.

      According to ?tibble::print.tbl we should use max_extra_cols now: "max_extra_cols Number of extra columns to print abbreviated information for, if the width is too small for the entire tibble. If NULL, the max_extra_cols option is used. The previously defined n_extra argument is soft-deprecated."

  9. May 2022
    1. (air_time_delay)

      This should be air_time_delay_pct not air_time_delay. As it is, the tibble below shows the same results as above so we can't see the data sorted to show highest percent first.

    2. These are a few ways to select columns

      Another way to select these columns is by excluding all other columns (either by name as below or column number):

      select(flights, -(year:day), -sched_dep_time, -sched_arr_time, -(carrier:time_hour))

      This isn't a good coding approach, but as an example it does demonstrate how to exclude columns, which was one of the techniques used in this section of the text.

  10. Apr 2022
  11. Feb 2022
    1. And the hours of TV doesn’t look that surprising to me.

      The tvhours variable is the hours watched per day. It's unlikely that respondents actually watch TV 24 hours a day. It could be a misunderstanding of the question, or they could be answering figuratively (e.g. "oh, I watch TV 24/7").

  12. Jan 2022
  13. Dec 2021
    1. no relationship or only a small negative relationship

      This is not quite true; it just appears that way due to the choice of scale. If you calculate the correlation coefficient between displ and hwy for each class, it is smaller than 0.5 only for 2-seaters and minivans. The p-value for a linear model is likewise larger than 0.05 only for 2-seaters and minivans.
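
      A minimal sketch of that check, using the mpg data from the chapter:

      library(dplyr)
      library(ggplot2)  # for the mpg data set
      # correlation of displ and hwy within each class
      mpg %>%
        group_by(class) %>%
        summarise(r = cor(displ, hwy), n = n())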

    1. better model

      We can add a new variable: price per carat. And we will see that it drops just before a "nice" carat weight and then sharply rises. So we can somehow try to include it in the model. Table and depth seem to influence price too.
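
      A sketch of the price-per-carat idea (the variable name is my own):

      library(dplyr)
      library(ggplot2)
      diamonds %>%
        mutate(price_per_carat = price / carat) %>%
        ggplot(aes(carat, price_per_carat)) +
        geom_boxplot(aes(group = cut_width(carat, 0.1))) +  # bin by carat weight
        coord_cartesian(xlim = c(0, 3))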

    2. 4 to 2

      I think this is a typo. It should be vice-versa "from 2 to 4" based on the code snippet below, shouldn't it?

  14. Nov 2021
    1. discrepancies

      In all cases of discrepancy, the error is exactly 24 hours. It means that the flight was postponed to the next day, but dep_time erroneously gives the same day.
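
      A sketch of how one might verify that, assuming the chapter's flights_dt (dep_time and sched_dep_time parsed as date-times, dep_delay in minutes):

      library(dplyr)
      flights_dt %>%
        mutate(diff = as.numeric(dep_time - (sched_dep_time + dep_delay * 60),
                                 units = "secs")) %>%
        filter(diff != 0) %>%
        count(diff)
      # the only nonzero difference that appears is a full day (86400 seconds)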

    2. doesn’t appear to much difference

      Another, completely different approach is to calculate distribution parameters for each day (I used quartiles) and plot them against time.

      The median declines very slowly over the year. Although the decline is slow, a linear regression shows a statistically significant trend.
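
      A sketch of that approach (the names are my own; flights is from nycflights13):

      library(dplyr)
      library(ggplot2)
      library(lubridate)
      library(nycflights13)
      daily_quartiles <- flights %>%
        filter(!is.na(dep_delay)) %>%
        group_by(date = make_date(year, month, day)) %>%
        summarise(q1 = quantile(dep_delay, 0.25),
                  med = median(dep_delay),
                  q3 = quantile(dep_delay, 0.75))
      ggplot(daily_quartiles, aes(date, med)) +
        geom_line() +
        geom_smooth(method = "lm")  # the slow decline over the year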

    1. replacements <- c("A" = "a", "B" = "b", "C" = "c", "D" = "d", "E" = "e", "F" = "f", "G" = "g", "H" = "h", "I" = "i", "J" = "j", "K" = "k", "L" = "l", "M" = "m", "N" = "n", "O" = "o", "P" = "p", "Q" = "q", "R" = "r", "S" = "s", "T" = "t", "U" = "u", "V" = "v", "W" = "w", "X" = "x", "Y" = "y", "Z" = "z")

      An easy way to do it without typing the entire alphabet:

      alphabet <- letters

      names(alphabet) <- LETTERS
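
      Or, equivalently, in one line, then applied as the replacement table:

      library(stringr)
      alphabet <- setNames(letters, LETTERS)
      str_replace_all("ABC and DEF", alphabet)
      #> [1] "abc and def"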

    2. Strictly speaking, this code replaces each forward slash with a single backslash; the printed result only looks like a double backslash because print() escapes backslashes in its output. writeLines() shows the single backslash, so this is R's string escaping at work rather than a bug.

    1. hard to say

      I can only add that in most (about 2/3) of these 48 hours, either the wind speed, the wind gust, or the visibility was worse than its respective mean.

    2. flights

      I would also filter out cancelled flights with filter(!is.na(arr_time)). Flights missing a tail number are filtered out automatically by count().

    3. What weather conditions

      Funny enough, the condition that makes it most likely to see a delay is normal atmospheric pressure!

      weather_delay_dep <- flights %>%
        select(dep_delay, origin, time_hour) %>%
        filter(!is.na(dep_delay)) %>%
        inner_join(weather) %>%
        filter(!is.na(pressure))

      pressure_delay <- weather_delay_dep %>%
        mutate(pressure = round(pressure)) %>%
        group_by(pressure) %>%
        summarise(delay = median(dep_delay))

      pressure_delay %>%
        ggplot(aes(x = pressure, y = delay)) +
        geom_line() +
        geom_point()

      You can see it even before calculating median delay for each pressure value (I use median rather than mean because delay distributions are skewed):

      weather_delay_dep %>%
        ggplot(aes(x = pressure, y = dep_delay)) +
        geom_point() +
        geom_smooth()

    1. year > 1995

      What happened in 1995? Global number of cases increased by two orders of magnitude.
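
      A quick way to see the jump, assuming the chapter's tidied WHO data (here called who_tidy, with columns country, year, cases):

      library(dplyr)
      library(ggplot2)
      who_tidy %>%
        group_by(year) %>%
        summarise(cases = sum(cases)) %>%
        ggplot(aes(year, cases)) +
        geom_line()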

    2. summarize

      Also, some countries have missing age groups in some years.

    3. for years prior to the existence of the country

      There are other missing years, not related to the existence of the country. If you group by country followed by summarise(years = unique(year)), you will see that Albania has data for 1995 and 1997 but no data for 1996, Algeria has data for 1997 and 1999 but not for 1998, etc.
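
      A sketch of that check, listing the gaps inside each country's observed range (same who_tidy assumption as in the note above):

      library(dplyr)
      who_tidy %>%
        distinct(country, year) %>%
        group_by(country) %>%
        summarise(gaps = list(setdiff(seq(min(year), max(year)), year))) %>%
        filter(lengths(gaps) > 0)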

    1. First

      The dataset description says that it describes round diamonds so two dimensions should be almost identical. A good starting point could be plotting frequency distributions of all three dimensions together:

      dimensions <- diamonds %>%
        pivot_longer(cols = c(x, y, z), names_to = "dimension", values_to = "value")

      dim_short <- dimensions %>% filter(value <= 10)

      ggplot(data = dim_short, mapping = aes(x = value, colour = dimension)) +
        geom_freqpoly(binwidth = 0.01)

      Lines for x and y are practically identical, while z is shifted to the left. It does not prove that these dimensions are equal for each diamond but is a good indicator.

  15. Oct 2021
  16. Sep 2021
    1. "No answer", "Other", "Don't know", "Not applicable", "No denomination"

      "Refused"

      Why isn't it mentioned as well?

    1. that

      just that

    2. th

      the

    3. (123) 456-7890

      I don't get this match when passing the respective code to the console.

    4. (123) 456-7890

      I don't get this match when passing the respective code to the console.

    5. (123) 456-7890

      I don't get this match when passing the respective code to the console. Any ideas why? What does the s* do?

    6. )

      no closing parenthesis here.

    7. at last

      at least

  17. Aug 2021
    1. So the most important column is arr_delay, which indicates the amount of delay in arrival.

      In the context of defining which flights are cancelled, shouldn't the dep_delay variable be considered more important here? As you go on to say, just because a flight has arr_delay as NA doesn't necessarily mean the flight was cancelled. Whereas every flight that has NA for dep_delay also has NA for arr_delay; these are the flights that have been cancelled.
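
      A quick check of that claim with dplyr and nycflights13:

      library(dplyr)
      library(nycflights13)
      flights %>%
        filter(is.na(dep_delay)) %>%
        summarise(all_arr_missing = all(is.na(arr_delay)))
      # expect TRUE: a flight that never departed cannot have an arrival delay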

    1. The variable cty, city highway miles per gallon, is a continuous variable.

      Minor mistake:

      The variable cty, city miles per gallon, is a continuous variable.

    2. ggplot(mpg, aes(x = displ, y = hwy)) + geom_point(size = 4, color = "white") + geom_point(aes(colour = drv))

      Another solution, one that avoids plotting the points twice, would be the following:

      ggplot(mpg, aes(x = displ, y = hwy, fill = drv)) +
        geom_point(color = "white", shape = 21, size = 3, stroke = 2)

    1. m <- mean(x, na.rm = TRUE)

      It should rather be as follows to properly pass on the parameter na.rm: m <- mean(x, na.rm = na.rm)

  18. Jul 2021
  19. Jun 2021
    1. each element of the character vector nchar

      I guess the character vector should be string instead of nchar.

    2. [y == -Inf] <- 0 y[y == Inf]

      How does this work? How come the evaluation of the conditional y == (-)Inf happens in the index of y?

    1. str_replace_all("past/present/future", "/", "\\\\")

      Why are there two backslashes replacing a single forward-slash?

    2. The \\1 pattern is called a backreference. It matches whatever the first group matched. This allows the pattern to match a repeating pair of letters without having to specify exactly what pair letters is being repeated.

      I wonder if this was written early in the editing of the text. The main text already describes this. If this is here as a reminder, maybe it could be worded differently so that it doesn't seem as if it was written early in the editing of the text and sort of forgotten here?

    3. [A-Za-z][A-Za-z]

      Could this notation be explained in the main text so that when we try to solve the problem we have the tool at our disposal?

    1. ggplot(data = diamonds) + geom_bar(aes(x = cut, y = ..count.. / sum(..count..), fill = color))

      from ?after_stat

      after_stat() replaces the old approaches of using either stat() or surrounding the variable names with ...

      so the (new) solution would be:

      ggplot(data = diamonds) +
        geom_bar(mapping = aes(x = cut, y = after_stat(count / max(count)), fill = color))

    2. such st

      such as

    3. lists

      Either the following table lists or the following tables list.

    4. plots

      the plots

    5. class

      class needs to be set in code font

    6. values

      the values

    7. color

      colors

    8. only facets

      facets only on one variable.

    9. within

      with

    10. on the x-axis.

      in the column dimension.

    11. of drv on the y-axis.

      in the row dimension.

    12. facet

      facets

    13. combinations

      the combinations

    1. What weather conditions make it more likely to see a delay?

      Is there a way of doing this more systematically through z-scores?

      flights2 <- flights %>%
        left_join(weather) %>%
        pivot_longer(temp:visib, names_to = "weather_condition", values_to = "weather_value") %>%
        group_by(weather_condition) %>%
        # z-score: note the parentheses around the numerator
        mutate(weather_value = (weather_value - mean(weather_value, na.rm = TRUE)) /
                 sd(weather_value, na.rm = TRUE))

      flights2 %>%
        ggplot(aes(weather_value, dep_delay, colour = weather_condition)) +
        geom_point()

      I should warn that, although I tried this, it takes a long time to render. (In my first attempt the points also weren't colored: I had mapped fill instead of colour, and the default point shape only responds to colour.)

    2. The following diagram shows the relations between the Batting, Pitching, and Fielding tables.

      Is there a reason why Batting in particular is in the middle? Or could any of the other two have been placed there instead?

  20. May 2021
    1. we could a FizzBuzz function

      omission of write?

      Suggested edit: we could write a FizzBuzz function

    1. mean delay of a flight for all values of the previous flight

      It's not really clear to me why taking the means here makes sense... I'll come back to this and maybe later it'll click, but so far I've re-read this three times on three different days and it still doesn't click... Maybe there's another way of phrasing why this is being done the way it is?

    2. We could also use the | operator. However, the | does not scale to many choices. Even with only three choices, it is quite verbose.

      Could we use & too? Could that be written here?

    3. std()

      I think the correct name of the function is sd()

    1. es

      Could this answer include when I would use it, which is what the question asks? I wasn't sure what to answer because I couldn't think of a scenario where mutate() wouldn't do the job just as well.

    1. Since the histogram bins have already been calculated, it is unaffected.

      What is the antecedent of the word "it"??

    2. affect

      affects?

    3. It’s usually better to use the categorical variable with a larger number of categories or the longer labels on the y axis.

      I thought it is better to use X axis for variables of larger number of categories.

    1. Finding all plurals cannot be correctly accomplished with regular expressions alone. Finding plural words would at least require morphological information about words in the language. See WordNet for a resource that would do that. However, identifying words that end in an “s” and with more than three characters, in order to remove “as”, “is”, “gas”, etc., is a reasonable heuristic.

      I agree with the statement and used that as a basis for my answer.

      sent_with_words_end_s <- str_subset(sentences, "\\b[A-Za-z]{3,}s\\b")
      # focus on only those sentences that meet the specified criteria

      words_end_s <- str_extract(sent_with_words_end_s, "\\b[A-Za-z]{3,}s\\b")
      # words ending in s (contains both plural words like "planks" and non-plural words like "sickness")

      str_view(words_end_s, "\\b[A-Za-z]+[^s]s$")
      # keeps only words that end in s but not in ss

    1. factor(1)

      Even though this looks similar to the time in which we did "group = 1", this seems different enough to warrant an explanation? At least I don't really understand how this works. On a hunch, I tried replacing "factor(1)" with "identity", and I got the same graph. How did that work? I tried looking up factor in the documentation, and it clarified what that meant, but I still am not sure how that relates to ggplot.

    1. Since is always good practice to have clear

      Typo: it omitted.

      Edit suggestion: Since (it) is always good practice to have clear

  21. Apr 2021
    1. mutate(cut = if_else(runif(n()) < 0.1, NA_character_, as.character(cut)))

      Can you explain this code to me? I've looked up the if_else function but I do not understand this code.

    1. (arr_delay <= 0))

      Why did you use filter arr_delay <= 0 and not arr_delay > 0 when we are looking for the plane with the worst on-time record? This sounds counterintuitive to me, what am I misunderstanding? Thank you.

    2. this delay will not have those affects plans

      For better clarity, change "this delay will not have those affects plans nor does it affect the total time spent traveling." to "this delay will not affect those plans, nor would it affect the total time spent traveling."

    1. No exercises

      Thanks ever so much for this amazing set of solutions. I learned as much from this as from the R4DS textbook. I just made a donation to the kākāpō recovery (as suggested by the r4ds online version) and would be happy to make a similar donation to any charity of your choice. Just let me know what cause you'd like me to support. Thanks again for your awesome work!

    1. geom_bar(width = 1)
      1. Please can you explain to me why you included the argument width = 1 for geom_bar? Without it, the pie doesn't look different. I believe you must have specified it for a reason?

      2. This was my attempt to answer the question. I'm not totally sure if the resulting plot makes much sense. Please take a look and let me know what you think. Thank you.

      ggplot(diamonds, mapping = aes(x = cut, fill = color)) + geom_bar() + coord_polar()

    2. such

      Typo. I think should read, such as, not such.

    3. ..count.. / sum(..count..

      Please, can you explain why you have dots in the code? Thanks.

    1. "Terms of US Presdients", subtitle = "Roosevelth

      Nothing important, but a couple of typos here: "Presdients" and "Roosevelth". The latter should be Eisenhower (as in the y-axis).

    1. I use arrange() and slice() to select the largest twenty diamonds

      I guess an alternative would be to drop "arrange()" and use the slice_max() function instead? That is, slice_max(carat, n = 20, with_ties = FALSE).

  22. Feb 2021
    1. The lyrics for Ten in the Bed are

      We're asked to convert the lyrics to a function that can be generalised to any number of people in the bed. Isn't the following closer to answering the question fully:

      library(english)

      ten_bed <- function(x) {
        lyric_1 <- "There were "
        lyric_2 <- " in the bed\nAnd the little one said, “Roll over! Roll over!”\nSo they all rolled over and one fell out\n"
        number <- as.character(as.english(x:1))
        for (i in number) {
          if (i != "one") {
            cat(lyric_1)
            cat(i)
            cat(lyric_2)
            cat("\n")
          } else {
            cat("There was one in the bed\nAnd the little one said, “Alone at last!” “Good Night!”")
          }
        }
      }

      ten_bed(10)

    1. Exercise 16.3.7

      Confirm my hypothesis that flights that departed in minutes 20-30 and 50-60 are more likely to have departed early than on time or late. Hint: create a binary variable that tells you whether or not a flight departed early.
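
      A sketch of the suggested check (a binary "early" indicator, summarised by departure minute; dplyr and nycflights13):

      library(dplyr)
      library(ggplot2)
      library(nycflights13)
      flights %>%
        filter(!is.na(dep_delay)) %>%
        mutate(minute = dep_time %% 100,   # minute within the hour
               early = dep_delay < 0) %>%
        group_by(minute) %>%
        summarise(prop_early = mean(early)) %>%
        ggplot(aes(minute, prop_early)) +
        geom_line()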

  23. Jan 2021
    1. flights %>%
         # sort in increasing order
         select(tailnum, year, month, day, dep_delay) %>%
         filter(!is.na(dep_delay)) %>%
         arrange(tailnum, year, month, day) %>%
         group_by(tailnum) %>%
         # cumulative number of flights delayed over one hour
         mutate(cumulative_hr_delays = cumsum(dep_delay > 60)) %>%
         # count the number of flights == 0
         summarise(total_flights = sum(cumulative_hr_delays < 1)) %>%
         arrange(total_flights)

      Here you counted the opposite, meaning the flights AFTER the first delay of greater than 60 minutes. The code should have looked something like this:

      flights %>%
        select(tailnum, year, month, day, dep_delay) %>%
        filter(!is.na(dep_delay)) %>%
        arrange(tailnum, year, month, day) %>%
        group_by(tailnum) %>%
        filter(dep_delay <= 60) %>%
        summarise(total_flights = n()) %>%
        arrange(desc(total_flights))

    2. NULL

      NA

    3. ArrDelay

      this should be arr_delay

    4. not_cancelled %>% group_by(tailnum) %>% tally()

      To match the first expression, this should be group_by(dest)

    5. being arriving

      Should be just "arriving"

    1. str_split(x, ", +(and +)?")[[1]]

      optionally drop + (both): str_split(x, ", (and )?")[[1]]

    2. what a pattern

      what pattern

    3. colour_match

      Maybe the intention here was colour_match2 (with the knock-on impact on the str_view_all(...) such that only the 2 relevant strings are shown, not 3).

  24. Dec 2020
    1. Pitching

      Exercise 13.3.3 (final sub-task):

      The Pitching table is mislabeled as Fielding in the diagram (though it is clear from the listed PKs)

    2. Master

      Exercise 13.3.3: the Questions have you compare 3 data sets, one of which is called "People", while in the Answers the same one is called "Master" (verified in R: identical(Master, People) == TRUE). Posted 2020-12-28.

    1. However, vars is not a column in flights, as is the case, then select will use the value the value of the , and select those columns.

      Several problems in this sentence. I think this is missing "if" before "vars"; "the value" is duplicated; "of the ," should probably be something like "of the variable vars".

    2. th

      Should be "the"

    3. the value

      typo - duplicate "the value"

    1. str_subset(words, "([A-Za-z][A-Za-z]).*\\1")

      this also seems to work:

      str_subset(words, "(.)(.).*(\\1\\2)")

      Again, thank you for your time. Really helpful solutions.

    1. var(1:10)
       #> [1] 9.17
       variance(1:10)
       #> [1] 9.17

      Concordance with var() breaks down if you introduce an NA to the vector. As written, the function will return NA if there is an NA in the vector, no matter what na.rm is set to.

      The commenter above asked if you need to add na.rm to the sum function. In that case it will return a number, but not the same as var(), because length() is still counting the NA, which can't be overridden. E.g. if you have x <- c(1:10, NA), variance(x) will return 8.25, but var(x, na.rm = TRUE) will return 9.166667.

      To fix concordance with NA values in vectors, I wrote the function as such:

      variance <- function(x, na.rm = TRUE) {
          if (na.rm) {
              # drop NAs before n is computed, so length() counts only observed values
              x <- x[!is.na(x)]
          }
          n <- length(x)
          m <- mean(x)
          sq_err <- (x - m)^2
          sum(sq_err) / (n - 1)
      }
      

      Then if you use x <- c(1:10, NA) both variance(x, na.rm = TRUE) and var(x, na.rm = TRUE) return 9.166667 and both variance(x, na.rm = FALSE) and var(x, na.rm = FALSE) return NA.

  25. Nov 2020
    1. air_time_delay

      air_time_delay_pct

    2. flight

      two "flight"

    3. I think this alternative answers Exercise 5.7.8 with fewer lines of code. Note though that the answers don't coincide, i.e., I don't have any tailnum with zero flights before the first 1-hour delay:

       flights %>%
         filter(!is.na(arr_delay)) %>%
         arrange(tailnum, arr_delay) %>%
         group_by(tailnum) %>%
         filter(cumall(arr_delay < 60)) %>%  # keep rows until this condition is first false
         summarise(n = n())
      
    4. at least

      less than

    5. 5.6.1

      I was under the impression that we need to know how to code these 3 bullet points. Instead, I see only theoretical-philosophical answers.

    1. scale_colour_viridis()

      It might be scale_colour_viridis_b() or scale_colour_viridis_c(). When I use the form scale_colour_viridis(), it says the function is not available.

    2. There seems to be a stronger relationship between visibility and delay. Delays are higher when visibility is less than 2 miles.

      When I cut the interval into more than 40 slices, the graph shows fluctuations in the interval [0, 2]. When n = 50, it confused me that the delay goes down and then back up in [1.2, 1.4]. And you may get a different graph in the short intervals near zero when using a different n.
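
      A sketch of that sensitivity check (the join keys and the choice of cut_interval() are my guesses at the chapter's setup):

      library(dplyr)
      library(ggplot2)
      library(nycflights13)
      flights %>%
        inner_join(weather, by = c("origin", "time_hour")) %>%
        filter(!is.na(dep_delay), !is.na(visib)) %>%
        group_by(vis = cut_interval(visib, n = 50)) %>%  # try different n here
        summarise(dep_delay = mean(dep_delay)) %>%
        ggplot(aes(vis, dep_delay)) +
        geom_point()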

    3. datamodelr

      when installing the package datamodelr using RStudio version 1.3.1093 and R version 4.0.3, I get the message: "Warning in install.packages : package ‘datamodelr’ is not available for this version of R"

    1. stocks <- tibble(
         year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
         qtr = c(1, 2, 3, 4, 2, 3, 4),
         return = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)
       )
       stocks %>% pivot_wider(names_from = year, values_from = return, values_fill = 0)
       #> # A tibble: 4 x 3
       #>     qtr `2015` `2016`
       #>   <dbl>  <dbl>  <dbl>
       #> 1     1   1.88   0
       #> 2     2   0.59   0.92
       #> 3     3   0.35   0.17
       #> 4     4  NA      2.66
       stocks <- tibble(
         year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
         qtr = c(1, 2, 3, 4, 2, 3, 4),
         return = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)
       )
       stocks %>% pivot_wider(names_from = year, values_from = return, values_fill = 0)
       #> # A tibble: 4 x 3
       #>     qtr `2015` `2016`
       #>   <dbl>  <dbl>  <dbl>
       #> 1     1   1.88   0
       #> 2     2   0.59   0.92
       #> 3     3   0.35   0.17
       #> 4     4  NA      2.66

      Are these duplicated?

    2. names_ptype

      should be names_ptypes?

    1. Mon %M

      should be %m, since %m is about the month and %M is about minutes. Clearly it should be months here.

  26. Oct 2020
    1. th

      a typo, it should be "the" or "this"

    2. dep_delay_diff = dep_delay - dep_time_min + sched_dep_time_min

      Shouldn't it be dep_delay_diff = dep_delay - dep_time_min - sched_dep_time_min?

    1. Exercise 12.2.3

      table2 %>%
        pivot_wider(names_from = type, values_from = count) %>%
        group_by(year, country) %>%
        summarise(cases = sum(cases)) %>%
        ggplot(aes(x = year, y = cases, colour = country)) +
        geom_line()

    1. Why is there months() but no dmonths()

      I think this question should be rephrased. There is certainly a dmonths().

      dmonths(1)
      #> [1] "2629800s (~4.35 weeks)"
      months(1)
      #> [1] "1m 0d 0H 0M 0S"

      Maybe dmonths() was added since the initial publication of this section? Whether it makes sense to use it is another question, and you cover that in the description. Perhaps it should ask, "What is the difference in output between months(1) and dmonths(1)? Is it logical to ever use dmonths()?" I'm sure someone will come up with a creative use!

  27. Sep 2020
    1. minor breaks that use thinner lines to distinguish them

      in the code, you put panel.grid.minor = element_blank(). I'm having a hard time understanding how putting a seemingly NULL element in the minor grid can make it thinner but still exist.

    2. As the question noted, this is because the subcompact car class includes both small cheap cars, and sports cars with large engines.

      In the text, one of the subtitles notes that "2seater" indicates sports cars, and that they are an exception to the negative trend due to their light weight (despite large engines). (I don't think this exercise was meant to refer to subcompact cars.) Are some cars double-listed as 2seater and subcompact?

    1. ouptut

      output

    2. This contrasts with R markdown files, which show their output inside the console

      My RStudio R Markdown files also display output inside the editor. Not sure if this is due to an update

    3. Alt

      Shift

    1. if we use unnest() instead of unnest(.drop = TRUE)

      my suspicion that this exercise is outdated has now increased, since I get the warning "The .drop argument of unnest() is deprecated as of tidyr 1.0.0. All list-columns are now preserved." whenever I try to use .drop, and the output is the same as in the provided answer whether I include it or not

    2. How can you interpret the coefficients of the quadratic? Hint you might want to transform year so that it has mean zero.)

      Could you please explain this part too?

      i.e., why did we transform year to have mean 0? My graph and model coefficient outputs look the same whether I include the - median(year) or not...

      and is the resulting equation something like Intercept = coef[1]x^2 + coef[2]x?

    1. Try pointrange with mean and standard error of the mean (sd / sqrt(n)).

      this seems to be a typo since this sentence appears twice in the text

    2. regression standard error,

      higher regression standard error

    3. 14%

      I get a number similar to this (but the opposite sign) if I do 2 ^ rmse, but a completely different number if I do the same summary calculation with resid instead of lresid. Why is it mathematically valid to back-transform a manipulated log residual, i.e. one that has been through sqrt() and similar functions, instead of back-transforming the residuals (to either a ratio or actual distance) first?

      Why is this method with lresid better than using resid?

      Also below, what are the units for 23-31? Percent? Dollars?

    4. 40

      Could you please show how you got this number? I tried to follow Exercise 24.2.2: I figured that 0.5 must be equal to r^a1, so I calculated that if a1 = 3.2, then the residual must correspond to 0.81 dollars. Obviously, that doesn't make sense, so I'm wondering what I did wrong.

      Even trying to back transform the residual by exponentiating 2 ^ 0.5 yields 1.4, an also seemingly wrong number.

      When I look at individual observations in the data and plug them into (y1/y0) = (x1/x0)^a1, I get 5.78 for a1. Also, individual observations line up with yours, which is that predictions seem to be ~$40 below the actual price.

      It would be helpful if I knew the coefficients of the linear model, but when I use coef(mod), out comes like 10 coefficients, which seems unhelpful.

      Update: I think I calculated resid wrong. I tried to back-transform it using 2 ^ lresid, just like we did with lpred and lcarat, but I couldn't seem to get it... Is it because resid is calculated from a log minus a log? Thus lresid actually represents the distance of the data from the model in a way that, because of log rules, comes out to be resid = price / pred...

      Using this interpretation, which is my most confident one so far, I agree with the statement that a residual of +2 means the price was 4x lower than expected, but a residual of +/-0.5 would still not come out to be 40. +40% or -30%, maybe, but not +-40.

    5. log

      logb

    6. y

      x

    1. a[1] = a1 and a[3] = a3, any other values of a[1] and a[3] where a[1] + a[3] == (a1 + a3)

      not sure what this means, is there any way you could provide a visualization? or how could you avoid/address this problem?

    2. you

      delete

    1. I would expect length to always be less than width, otherwise the length would be called the width.

      Why??

    2. ggplot(diamonds, aes(x = carat, y = price)) + geom_hex() + facet_wrap(~cut, ncol = 1)

      Needs library("hexbin") to work

    1. Exercise 5.7.8

      We noticed that there are observations with no record of delay exceeding 1 hour!

      filter(mean(cumulative_hr_delays) == 0) %>%
        summarise(n = n())

      Results 677 "tailnum" without any record of delay greater than 1 hour!

      For example, "N103US" has 46 flights under 1(one) hour. However, 46 is the total number of flights of "N103US", and none of these flights exceeded 1 hour. So unless I'm wrong, we shouldn't include it (the aforementioned "N103US") as having 46 flights before the first flight that exceeds one hour, because flight number 47 (your next flight) may not exceed an hour either (considering extending the search to the year 2014 and so on)! It may never have delay of more than an hour... and if the scope is restricted to only 2013 the reasoning is the same, we should exclude the 677 flights without any delay more than one hour from the result presented.


    1. 325

      where did you get this number? the mean and median time for the first function divided by the second is closer to 80 than 325...

    2. s

      is "lengths()" a function? could not find it in help

  28. Aug 2020
    1. 3

      5

    2. TRUE

      na.rm = na.rm ?

    3. Exercise 19.4.3

      jrnlod, your answer to this question is particularly great, thank you!!

      I added

      if (x == 0) { print(x) }

      as I don't think 0 should be a fizz/buzz. Is there a less manual way to do this?

    1. <, ==

      should these parentheses contain operations like &, |, !, and "xor" rather than the logical comparisons like < and == ?

    1. Confirm

      this question is difficult to understand because it seems to employ circular logic: obviously flights that leave early are caused by flights that leave early. Maybe it means to say that the hypothesis tests whether flights departing in minutes 20-30 and 50-60 are mostly early flights and not delayed ones.

    2. Explain your findings

      is this exercise unfinished? there is no explanation for the results here. I tried filtering out the ones with differences divisible by 60 (%% 60 == 0) to account for discrepancies due to timezone but that didn't seem to help...

    3. We forgot to account for this

      should mention whether this was our mistake in analysis or a data entry mistake that we can fix during analysis. it isn't clear in the text

    1. summarise(diamonds, mean(x > y), mean(x > z), mean(y > z))

      Can you help me translate the command? I don't quite understand what the result of mean( x > y) represents.

  29. Jul 2020
    1. "[A-ZAa-z]+")

      [A-Za-z]+

    2. case insensitive flag

      can you show what this is? it isn't listed under ?str_view

    3. by dry

      this result is irrelevant to the question

    4. Exercise 14.5.2, question 2: We have not learned 'unlist' yet; we learn that only in Ch 21.

      I got a list and ordered it using the following code, but it takes only the first word in each sentence (I guess), and str_extract_all doesn't work at all. Any other ideas?

      tibble(word = str_extract(sentences, boundary("word"))) %>%
        mutate(word = str_to_lower(word)) %>%
        count(word, sort = TRUE) %>%
        head(5)
      #> # A tibble: 5 x 2
      #>   word      n
      #>   <chr> <int>
      #> 1 the     262
      #> 2 a        72
      #> 3 he       24
      #> 4 we       13
      #> 5 it       12

    5. Exercise 12.4.3.2. We learn str_split only in the next section. My rather inelegant solution was:

       has_apo <- sentences %>% str_subset("'")
       has_ap_sep <- tibble(sentence = has_apo) %>%
         extract(sentence, c("before", "apo", "after"), "([A-Za-z]*)(')([A-Za-z]+)", remove = FALSE)

      cf https://en.wiktionary.org/wiki/Category:English_contractions There are some contractions where the apostrophe appears as the first character, so use * not + in the first [A-Za-z] expression

    1. use

      us

    2. running that expression that there are only four airports in this lis

      running that expression shows that? not sure what this is referring to

    3. Instead of using year, month, day, hour, you can join on only 'origin' and 'time_hour'

    4. Then I select the 48 observations

      how come the output is grouped even though you specified ungroup()? does arrange() implicitly discard duplicate results if the var to be arranged on is a result of summarizing? does this have something to do with the "summarize regrouping output" message, and if so why does summarize() have an effect on subsequent functions? I'm not sure this was mentioned elsewhere in the book.

    5. hours

      For this data, we only have data from the year 2013. So why do we always group by year in the exercises? Isn't that redundant/useless?

    6. . S

      , s

    7. The result shows that this is in fact not a primary key, as n > 1 is very high. Why?

      One of the variables is named 'n', so first rename that variable, then check:

      babynames %>%
        rename(no = n) %>%
        count(year, sex, name, prop) %>%
        filter(n > 1) %>%
        nrow()
      #> [1] 0

    8. flight_weather

      since the text says that inner join is almost never used for analysis, could you also explain when it would be used? for example, is the reason you're using it in all the exercises because it is a convenient way to drop NA observations?

    9. so I truncate

      I truncate

    10. used used

      used

    11. more

      more than one sex?

    12. In a foreign key relationship

      some of the exercises for this subsection are a little premature... for some readers (like me) it is difficult to understand exactly what's going on without first being introduced to basic content knowledge about keys.

    1. Second, I should check whether all values for a (country, year) are missing or whether it is possible for only some columns to be missing.

      This part doesn't seem to do what it says it does, although I might just be interpreting everything wrong. But it seems that you are trying to check whether every entry for a particular "country, year" combination would be missing (eg no data collected for that region that year). However, the section is described as checking if multiple columns have missing values (as compared to only one column for that observation having a missing value). At least, perhaps consider clarifying this section.

    2. For example, this will fill in the missing values of the long data frame with 0

      delete