Hypothesis

63 Matching Annotations

Jun 2019
jrnold.github.io

7 Exploratory Data Analysis | R for Data Science: Exercise Solutions

2
1. cfgauss 04 Jun 2019
  
  in Public
  
  >
  
  Omit.
2. cfgauss 04 Jun 2019
  
  in Public
  
  Exercise 7.4.1
  
  Many thanks for this lovely example. I hadn't understood geom_bar's NA behavior with factor variables until your explanation.
Visit annotations in context

Annotators

cfgauss

URL

jrnold.github.io/r4ds-exercise-solutions/exploratory-data-analysis.html
May 2019
jrnold.github.io

7 Exploratory Data Analysis | R for Data Science: Exercise Solutions

7
1. cfgauss 30 May 2019
  
  in Public
  
  Presumably there is a premium for a 1 carat diamond
  
  Is this "buyer's psychology?" The opposite of $4.99 being cheaper than $5.00?
2. cfgauss 30 May 2019
  
  in Public
  
  number of diamonds in each carat range
  
  Nice! Your intuition seems correct. But how do you account for the large number of 1.01 carat diamonds?
3. cfgauss 30 May 2019
  
  in Public
  
  print(n = 30)
  
  print(n = Inf) will print all rows of a tibble.
4. cfgauss 30 May 2019
  
  in Public
  
  seem
  
  see
5. cfgauss 30 May 2019
  
  in Public
  
  There are no diamonds with a price of $1,500
  
  More precisely: there's a $90 gap: the closed interval [1455,1545].
6. cfgauss 30 May 2019
  
  in Public
  
  Explore the distribution
  
  I wonder if this includes identifying outliers for possible data errors. E.g. x = 0 for 8 diamonds and z = 0 for 20. Also z = 31.8 for one. y = 0 for 7 diamonds and y = 31.8 and y = 58.9 for one each. I believe all are data errors.
7. cfgauss 29 May 2019
  
  in Public
  
  mutate(id = row_number())
  
  I believe the plots can be generated without id:
  
  diamonds %>% select(x, y, z) %>% gather(variable, value) %>% ggplot(aes(x = value)) + geom_density() + geom_rug() + facet_grid(vars(variable))
jrnold.github.io

5 Data transformation | R for Data Science: Exercise Solutions

7
1. cfgauss 23 May 2019
  
  in Public
  
  For each plane, count the number of flights before the first delay of greater than 1 hour.
  
  I think this is a lovely use of logicals and cumsum. But I believe that it omits planes whose first flight is delayed by more than an hour. There are 234 of these:
  
  (zero <- flights %>% filter(!is.na(dep_delay)) %>% arrange(tailnum,month,day) %>% group_by(tailnum) %>% mutate(delay_gt1hr = dep_delay > 60) %>% filter(row_number() == 1,delay_gt1hr) %>% select(tailnum) %>% mutate(n = 0) )
  
  Then bind_rows could concatenate these to the tibble produced by the posted solution.
  
  This is a "two-part" solution. Is there a "one-part" solution?
2. cfgauss 22 May 2019
  
  in Public
  
  year
  
  Since the year column contains only 2013, you could arrange chronologically without this argument.
3. cfgauss 20 May 2019
  
  in Public
  
  We will calculate this ranking in two parts
  
  Or, the two parts could be combined:
  
  rank <- flights %>% group_by(dest) %>% mutate(n_carriers = n_distinct(carrier)) %>% filter(n_carriers > 1) %>% group_by(carrier) %>% summarize(n_dest = n_distinct(dest)) %>% arrange(desc(n_dest))
4. cfgauss 20 May 2019
  
  in Public
  
  width = 100
  
  Here width = Inf will print all columns.
5. cfgauss 14 May 2019
  
  in Public
  
  print(width = 120)
  
  I believe that print(width = Inf) will reliably print all selected columns.
6. cfgauss 14 May 2019
  
  in Public
  
  below than the
  
  below the
7. cfgauss 14 May 2019
  
  in Public
  
  observations each
  
  observations of each
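On the "one-part" question in annotation 1 above (counting flights before the first delay of more than one hour): one possible sketch, not the posted solution, is to summarise directly instead of filtering first, so planes whose first flight is already delayed keep a row with a count of 0. The `toy` tibble and its values are hypothetical stand-ins for `flights`, assumed to be sorted chronologically within each `tailnum`.

```r
library(dplyr)

# Hypothetical toy data, already in chronological order within each plane
toy <- tibble(
  tailnum   = c("A", "A", "A", "B", "B", "C"),
  dep_delay = c(5, 90, 10, 120, 0, 15)
)

before_first_delay <- toy %>%
  group_by(tailnum) %>%
  # cumsum() of the logical stays at 0 until the first delay over 60 minutes,
  # so counting the zeros counts the flights before that delay
  summarise(n = sum(cumsum(dep_delay > 60) < 1))

before_first_delay
```

Because `summarise()` returns one row per group, plane "B" (delayed on its first flight) appears with `n = 0` rather than being dropped. With the real data, rows with a missing `dep_delay` would presumably need to be filtered out first, as in the annotation.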
Visit annotations in context

Annotators

cfgauss

URL

jrnold.github.io/r4ds-exercise-solutions/transform.html
Apr 2019
jrnold.github.io

5 Data transformation | R for Data Science: Exercise Solutions

2
1. cfgauss 15 Apr 2019
  
  in Public
  
  I
  
  Omit.
2. cfgauss 15 Apr 2019
  
  in Public
  
  select(tailnum, on_time, arr_time, arr_delay) %>%
  
  If you wish, this could be eliminated since the only columns that survive are those in the summarise.
Mar 2019
jrnold.github.io

5 Data transformation | R for Data Science: Exercise Solutions

6
1. cfgauss 12 Mar 2019
  
  in Public
  
  were
  
  omit
2. cfgauss 11 Mar 2019
  
  in Public
  
  ungroup() %>%
  
  I believe you don't need ungroup().
3. cfgauss 10 Mar 2019
  
  in Public
  
  an standard deviation. That
  
  "and standard deviation. The following"
4. cfgauss 03 Mar 2019
  
  in Public
  
  Unusually fast flights are those flights with the smallest standardized values.
  
  I believe the problem asks for "fast flights per destination". If so, this code produces that list, including ties. Sadly, it uses slice.
  
  fast <- flights %>% filter(!is.na(air_time)) %>% group_by(dest) %>% mutate(z = (air_time - mean(air_time)) / sd(air_time)) %>% slice(which(z == min(z))) %>% select(origin,dest,month,day,carrier,flight,air_time,z) %>% arrange(z)
5. cfgauss 03 Mar 2019
  
  in Public
  
  ggplot(standardized_flights, aes(x = air_time_standard)) + geom_density()
  
  This is a nice plot, but the image below is not it. It appears to have been cut-and-pasted from a previous plot.
6. cfgauss 03 Mar 2019
  
  in Public
  
  standardized_flights
  
  Somewhat more succinctly:
  
  standardized_flights <- flights %>% filter(!is.na(air_time)) %>% group_by(dest,origin) %>% mutate(air_time_standard = (air_time - mean(air_time)) / sd(air_time)) %>% select(origin,dest,month,day,carrier,flight,air_time,air_time_standard)
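Annotation 4 above regrets using `slice()` to keep the fastest flights per destination. As a hedged alternative, assuming ties should still be kept, a grouped `filter(z == min(z))` selects the same rows. The `toy` tibble below is hypothetical data standing in for `flights`.

```r
library(dplyr)

# Hypothetical toy data: destination "X" has a tie for the shortest air time
toy <- tibble(
  dest     = c("X", "X", "X", "Y", "Y", "Y"),
  air_time = c(100, 100, 130, 50, 60, 70)
)

fastest <- toy %>%
  group_by(dest) %>%
  # standardize air_time within each destination
  mutate(z = (air_time - mean(air_time)) / sd(air_time)) %>%
  # keep every row that attains the group minimum, ties included
  filter(z == min(z)) %>%
  ungroup()

fastest
```

Both tied rows for "X" survive, along with the single fastest row for "Y".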
Feb 2019
jrnold.github.io

5 Data transformation | R for Data Science: Exercise Solutions

38
1. cfgauss 25 Feb 2019
  
  in Public
  
  in 10 flights
  
  Omit, since you've already filtered.
2. cfgauss 25 Feb 2019
  
  in Public
  
  min_rank() and dense_rank()
  
  English is inherently ambiguous while only math (vectors, subscripts) and algorithms are not. But then you'd lose your reader.
  
  Here's a slight rephrasing where rank refers to a component of, say, min_rank(x) and value a component of x.
  
  For each set of tied values the min_rank() function assigns a rank equal to the number of values less than that tied value plus one. In contrast, the dense_rank() function assigns a rank equal to the number of distinct values less than that tied value plus one.
3. cfgauss 25 Feb 2019
  
  in Public
  
  min_rank() and dense_rank()
  
  English is inherently ambiguous while only math (vectors, subscripts) and algorithms are not. But then you'd lose your reader.
  
  Here's a slight rephrasing where rank refers to a component of, say, min_rank(x) and value a component of x.
  
  For each set of tied values the min_rank() function assigns a rank equal to the number of values less than that tied value plus one. The dense_rank() function assigns a rank equal to the number of distinct values less than that tied value plus one.
4. cfgauss 25 Feb 2019
  
  in Public
  
  ranking handles
  
  the ranking functions handle
5. cfgauss 25 Feb 2019
  
  in Public
  
  function
  
  functions
6. cfgauss 25 Feb 2019
  
  in Public
  
  roughly
  
  Is it "rough" or "exact"?
7. cfgauss 25 Feb 2019
  
  in Public
  
  an element
  
  a vector
8. cfgauss 25 Feb 2019
  
  in Public
  
  The row_number() function is confusingly named. It can create ranks for any column.
  
  I might omit this since you give an explanation of row_number() below which is accurate. Also, tibble adds its own row numbers, and row_number(x) appears to be "the row number after x is sorted", which I don't find confusing.
9. cfgauss 25 Feb 2019
  
  in Public
  
  missing values
  
  ties
10. cfgauss 24 Feb 2019
  
  in Public
  
  na.rm = TRUE
  
  This probably is intended to be an argument of mean. It's not necessary, however, since neither dep_delay nor dep_delay_lag have any NAs. Also, could you provide an interpretation of delay_diff? For example, what does it mean that JFK's is negative?
11. cfgauss 23 Feb 2019
  
  in Public
  
  year,
  
  Could be eliminated since the dataset is for one specific year.
12. cfgauss 23 Feb 2019
  
  in Public
  
  flight
  
  Another definition of flight could be the quad (carrier,flight,origin,dest) which, say, might fly daily. Here's a solution with this definition.
  
  dest_delay <- flights %>% group_by(dest) %>% filter(arr_delay > 0) %>% summarize(dest_ad = sum(arr_delay, na.rm = TRUE), dest_ct = n())
  flight_delay <- flights %>% group_by(carrier, flight, origin, dest) %>% filter(arr_delay > 0) %>% summarize(flight_ad = sum(arr_delay, na.rm = TRUE), flight_ct = n())
  prop <- as_tibble(merge(dest_delay, flight_delay, by = "dest")) %>% mutate(prop = flight_ad / dest_ad) %>% select(carrier, flight, origin, dest, dest_ct, flight_ct, prop)
13. cfgauss 23 Feb 2019
  
  in Public
  
  Exercise 5.7.4
  
  In general selecting just a few columns (often the mutate variables) makes the result clearer since "irrelevant" cols are eliminated.
14. cfgauss 23 Feb 2019
  
  in Public
  
  !is.na(arr_delay),
  
  This could be omitted but keeping it in redundantly may add clarity.
15. cfgauss 23 Feb 2019
  
  in Public
  
  only
  
  Omit
16. cfgauss 23 Feb 2019
  
  in Public
  
  However, there are many planes that have never flown an on-time flight.
  
  Since this minimum rank group has an on_time of 0.0, I believe that this tibble is all planes which have never flown an on-time flight.
17. cfgauss 23 Feb 2019
  
  in Public
  
  <=
  
  ==
  
  I believe min_rank cannot be 0.
18. cfgauss 23 Feb 2019
  
  in Public
  
  arr_delay > 0
  
  (arr_delay > 0)
  
  Of course precedence "does the right thing" but since R will use coercion to give lgl & dbl a value, putting in the redundant parentheses prevents the not-careful reader (me, for example) from "thinking the wrong thing."
19. cfgauss 23 Feb 2019
  
  in Public
  
  cancelled
  
  In this dataset the presence of arr_delay implies the presence of arr_time so the boolean cancelled could be eliminated. Of course, the reader hasn't verified this relation so keeping it in makes sense. The definition of cancelled as "having an arrival delay" and "not having an arrival time" seems a bit odd.
20. cfgauss 23 Feb 2019
  
  in Public
  
  They operate within each group rather than over the entire data frame
  
  Do all functions on grouped tibbles operate only per group and not the entire tibble? E.g.
  
  flights %>% group_by(month,day) %>% arrange(desc(dep_delay))
  
  Will the arrange function order both by group and by the entire tibble?
21. cfgauss 22 Feb 2019
  
  in Public
  
  There are more sophisticated ways to do this analysis
  
  Perhaps, but this is a lovely example of the power of R.
22. cfgauss 21 Feb 2019
  
  in Public
  
  , which calculates
  
  ". atan2(y, x) returns"
  
  My worry is that the reader would assume that atan2(x, y) returns that angle.
23. cfgauss 20 Feb 2019
  
  in Public
  
  not_cancelled
  
  Hadley defines this as
  
  not_cancelled <- flights %>% filter(!is.na(dep_delay), !is.na(arr_delay))
  
  But in
  
  flights %>% filter(is.na(dep_time) | is.na(arr_time) | is.na(dep_delay) | is.na(arr_delay)) %>% select(sched_arr_time,arr_time,arr_delay,sched_dep_time,dep_time,dep_delay)
  
  the first 6 flights have an NA only in arr_delay. Aren't these likely not to have been cancelled and simply have a missing data entry? Why isn't
  
  not_cancelled <- flights %>% filter(!is.na(dep_time), !is.na(arr_time))
  
  a "better" definition?
24. cfgauss 19 Feb 2019
  
  in Public
  
  and the passenger arrived at the same time
  
  Omit.
25. cfgauss 19 Feb 2019
  
  in Public
  
  Exercise 5.5.4
  
  Another solution. Now that we know we can "trust" dep_delay we could simply
  
  md <- flights %>% select(carrier,flight,origin,dest,sched_dep_time,dep_time,dep_delay) %>% arrange(min_rank(desc(dep_delay)))
  
  There are no ties in the top 10.
26. cfgauss 19 Feb 2019
  
  in Public
  
  Daylight
  
  If you wish, you could change this to "Eastern Daylight" to agree with its previous reference in that sentence.
27. cfgauss 19 Feb 2019
  
  in Public
  
  Except for flights daylight savings started (March 10) or ended (November 3)
  
  I understand the point of the paragraph but this sentence is a bit unclear.
28. cfgauss 16 Feb 2019
  
  in Public
  
  (to handle the .
  
  It's unclear to me what is intended here.
29. cfgauss 10 Feb 2019
  
  in Public
  
  Exercise 5.5.2
  
  A nice analysis, showing the "messiness" of actual datasets.
30. cfgauss 08 Feb 2019
  
  in Public
  
  select(flights, one_of(vars))
  
  Note that
  
  select(flights,vars)
  
  produces the same result.
31. cfgauss 07 Feb 2019
  
  in Public
  
  /
  
  *
32. cfgauss 07 Feb 2019
  
  in Public
  
  Saturday
  
  December
33. cfgauss 06 Feb 2019
  
  in Public
  
  month == 7, month == 8, month == 9
  
  Replace , by |.
34. cfgauss 06 Feb 2019
  
  in Public
  
  between()
  
  Interestingly, between() microbenchmarks at 143 microseconds, while the >= and <= version comes in at 2.3 on my box!
35. cfgauss 05 Feb 2019
  
  in Public
  
  What other variables are missing?
  
  Another interpretation:
  
  colnames(flights)[colSums(is.na(flights)) > 0]
36. cfgauss 05 Feb 2019
  
  in Public
  
  dep_time %% 2400 <= 600
  
  Elegant, at a 2x microbenchmark cost.
37. cfgauss 03 Feb 2019
  
  in Public
  
  is preferred
  
  Do numerical operators execute faster than, say, %in%?
38. cfgauss 03 Feb 2019
  
  in Public
  
  were
  
  Omit
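The rephrasing of min_rank() and dense_rank() in annotations 2 and 3 above can be checked on a small vector. This is only an illustrative sketch: the vector `x` and the `by_hand_*` helpers are made up for the comparison, and they spell out the two definitions ("number of values less than, plus one" versus "number of distinct values less than, plus one").

```r
library(dplyr)

# Hypothetical example vector with one pair of tied values
x <- c(10, 20, 20, 30)

# Ranks as computed by dplyr
mr <- min_rank(x)
dr <- dense_rank(x)

# The same definitions written out by hand:
# min_rank:   1 + number of values strictly less than the value
# dense_rank: 1 + number of distinct values strictly less than the value
by_hand_min   <- sapply(x, function(v) sum(x < v) + 1)
by_hand_dense <- sapply(x, function(v) length(unique(x[x < v])) + 1)

all(mr == by_hand_min)
all(dr == by_hand_dense)
```

Here the value 30 gets min rank 4 (three values are smaller) but dense rank 3 (only two distinct values are smaller), which is the distinction the annotation draws.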
jrnold.github.io

4 Workflow: basics | R for Data Science: Exercise Solutions

1
1. cfgauss 03 Feb 2019
  
  in Public
  
  It looks like a typo, dota instead of data.
  
  There was no typo in the 2/2/19 version of r4ds.
Visit annotations in context

Annotators

cfgauss

URL

jrnold.github.io/r4ds-exercise-solutions/workflow-basics.html