- Jun 2019
-
jrnold.github.io jrnold.github.io
-
>
Omit.
-
Exercise 7.4.1
Many thanks for this lovely example. I hadn't understood geom_bar's NA behavior with factor variables until your explanation.
-
- May 2019
-
jrnold.github.io jrnold.github.io
-
Presumably there is a premium for a 1 carat diamond
Is this "buyer's psychology?" The opposite of $4.99 being cheaper than $5.00?
-
number of diamonds in each carat range
Nice! Your intuition seems correct. But how do you account for the large number of 1.01 carat diamonds?
-
print(n = 30)
print(n = Inf) will print all rows of a tibble.
-
seem
see
-
There are no diamonds with a price of $1,500
More precisely: there's a $90 gap: the closed interval [1455,1545].
-
Explore the distribution
I wonder if this includes identifying outliers for possible data errors. E.g. x = 0 for 8 diamonds and z = 0 for 20. Also z = 31.8 for one. y = 0 for 7 diamonds and y = 31.8 and y = 58.9 for one each. I believe all are data errors.
-
mutate(id = row_number())
I believe the plots can be generated without id:
diamonds %>%
  select(x, y, z) %>%
  gather(variable, value) %>%
  ggplot(aes(x = value)) +
  geom_density() +
  geom_rug() +
  facet_grid(vars(variable))
-
-
jrnold.github.io jrnold.github.io
-
For each plane, count the number of flights before the first delay of greater than 1 hour.
I think this is a lovely use of logicals and cumsum. But I believe that it omits planes whose first flight is delayed by more than an hour. There are 234 of these:
(zero <- flights %>%
  filter(!is.na(dep_delay)) %>%
  arrange(tailnum, month, day) %>%
  group_by(tailnum) %>%
  mutate(delay_gt1hr = dep_delay > 60) %>%
  filter(row_number() == 1, delay_gt1hr) %>%
  select(tailnum) %>%
  mutate(n = 0))
Then bind_rows could concatenate these to the tibble produced by the posted solution. This is a "two-part" solution. Is there a "one-part" solution?
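On the "one-part" question, one possibility is to count, within each plane, the flights for which the running total of long delays is still zero. This is only a sketch on toy data: the toy_flights values are made up, and the rows are assumed to be in chronological order within each tailnum already.

```r
library(dplyr)

# Toy stand-in for nycflights13::flights (hypothetical values), assumed to be
# sorted chronologically within each plane.
toy_flights <- tibble(
  tailnum   = c("A", "A", "A", "B", "B", "C", "C"),
  dep_delay = c(5, 70, 10, 90, 0, 3, 12)
)

before_first_long_delay <- toy_flights %>%
  filter(!is.na(dep_delay)) %>%
  group_by(tailnum) %>%
  # cumsum() stays at 0 until the first delay over 60 minutes, so counting
  # the zeros counts the flights before that delay (possibly none).
  summarise(n = sum(cumsum(dep_delay > 60) < 1))

print(before_first_long_delay)
```

Plane B, whose first flight is delayed by more than an hour, gets n = 0 here rather than being dropped, so no bind_rows step is needed.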
-
year
Since the year column contains only 2013, you could arrange chronologically without this argument.
-
We will calculate this ranking in two parts
Or, the two parts could be combined:
rank <- flights %>%
  group_by(dest) %>%
  mutate(n_carriers = n_distinct(carrier)) %>%
  filter(n_carriers > 1) %>%
  group_by(carrier) %>%
  summarize(n_dest = n_distinct(dest)) %>%
  arrange(desc(n_dest))
-
width = 100
Here width = Inf will print all columns.
-
print(width = 120)
I believe that print(width = Inf) will reliably print all selected columns.
-
below than the
below the
-
observations each
observations of each
-
- Apr 2019
-
jrnold.github.io jrnold.github.io
-
I
Omit.
-
select(tailnum, on_time, arr_time, arr_delay) %>%
If you wish, this could be eliminated since the only columns that survive are those in the summarise.
-
- Mar 2019
-
jrnold.github.io jrnold.github.io
-
were
omit
-
ungroup() %>%
I believe you don't need ungroup().
-
an standard deviation. That
"and standard deviation. The following"
-
Unusually fast flights are those flights with the smallest standardized values.
I believe the problem asks for "fast flights per destination". If so, this code produces that list, including ties. Sadly, it uses slice.
fast <- flights %>%
  filter(!is.na(air_time)) %>%
  group_by(dest) %>%
  mutate(z = (air_time - mean(air_time)) / sd(air_time)) %>%
  slice(which(z == min(z))) %>%
  select(origin, dest, month, day, carrier, flight, air_time, z) %>%
  arrange(z)
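If slice is the sticking point, a grouped filter(z == min(z)) appears to do the same job and also keeps ties. A minimal sketch on toy data (the toy tibble and its values are invented for illustration):

```r
library(dplyr)

# Toy data: destination X has two flights tied for the shortest air time.
toy <- tibble(
  dest     = c("X", "X", "X", "X", "Y", "Y", "Y"),
  air_time = c(100, 100, 120, 140, 50, 60, 70)
)

fastest <- toy %>%
  group_by(dest) %>%
  mutate(z = (air_time - mean(air_time)) / sd(air_time)) %>%
  # Within each destination, keep every row whose z equals the group minimum.
  filter(z == min(z)) %>%
  ungroup()

print(fastest)
```

Both tied flights to X survive, along with the single fastest flight to Y.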
-
ggplot(standardized_flights, aes(x = air_time_standard)) + geom_density()
This is a nice plot, but the image below is not it. It appears to have been cut-and-pasted from a previous plot.
-
standardized_flights
Somewhat more succinctly:
standardized_flights <- flights %>%
  filter(!is.na(air_time)) %>%
  group_by(dest, origin) %>%
  mutate(air_time_standard = (air_time - mean(air_time)) / sd(air_time)) %>%
  select(origin, dest, month, day, carrier, flight, air_time, air_time_standard)
-
- Feb 2019
-
jrnold.github.io jrnold.github.io
-
in 10 flights
Omit, since you've already filtered.
-
min_rank() and dense_rank()
English is inherently ambiguous; only math (vectors, subscripts) and algorithms are not. But then you'd lose your reader.
Here's a slight rephrasing where rank refers to a component of, say, min_rank(x) and value a component of x.
For each set of tied values the min_rank() function assigns a rank equal to the number of values less than that tied value plus one. In contrast, the dense_rank() function assigns a rank equal to the number of distinct values less than that tied value plus one.
-
ranking handles
the ranking functions handle
-
function
functions
-
roughly
Is it "rough" or "exact"?
-
an element
a vector
-
The row_number() function is confusingly named. It can create ranks for any column.
I might omit this since you give an explanation of row_number below which is accurate. Also, tibble adds its own row numbers and row_number(x) appears to be "the row number after x is sorted", which I don't find confusing.
-
missing values
ties
-
na.rm = TRUE
This probably is intended to be an argument of mean. It's not necessary, however, since neither dep_delay nor dep_delay_lag has any NAs. Also, could you provide an interpretation of delay_diff? For example, what does it mean that JFK's is negative?
-
year,
Could be eliminated since the dataset is for one specific year.
-
flight
Another definition of flight could be the quad (carrier,flight,origin,dest) which, say, might fly daily. Here's a solution with this definition.
dest_delay <- flights %>%
  group_by(dest) %>%
  filter(arr_delay > 0) %>%
  summarize(dest_ad = sum(arr_delay, na.rm = TRUE), dest_ct = n())
flight_delay <- flights %>%
  group_by(carrier, flight, origin, dest) %>%
  filter(arr_delay > 0) %>%
  summarize(flight_ad = sum(arr_delay, na.rm = TRUE), flight_ct = n())
prop <- as_tibble(merge(dest_delay, flight_delay, by = "dest")) %>%
  mutate(prop = flight_ad / dest_ad) %>%
  select(carrier, flight, origin, dest, dest_ct, flight_ct, prop)
-
Exercise 5.7.4
In general selecting just a few columns (often the mutate variables) makes the result clearer since "irrelevant" cols are eliminated.
-
!is.na(arr_delay),
This could be omitted but keeping it in redundantly may add clarity.
-
only
Omit
-
However, there are many planes that have never flown an on-time flight.
Since this minimum rank group has an on_time of 0.0, I believe that this tibble is all planes which have never flown an on-time flight.
-
<=
==
I believe min_rank cannot be 0.
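A quick check on a toy vector (arbitrary values) seems to bear this out: the smallest value always gets rank 1 from both min_rank() and dense_rank(). So, assuming the comparison in question has the form min_rank(x) <= 1, it selects the same rows as == 1. The by-hand versions below only illustrate the "number of values less than, plus one" wording; they are not how dplyr computes ranks.

```r
library(dplyr)

# Toy vector with a tie (arbitrary values, for illustration only).
x <- c(10, 20, 20, 40)

ranks_min   <- min_rank(x)    # ties share the smallest rank; a gap follows
ranks_dense <- dense_rank(x)  # ties share a rank; no gap follows

# By-hand versions of the verbal definitions (not dplyr's implementation).
by_hand_min   <- sapply(x, function(v) sum(x < v) + 1)
by_hand_dense <- sapply(x, function(v) length(unique(x[x < v])) + 1)

print(rbind(ranks_min, ranks_dense))
```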
-
arr_delay > 0
(arr_delay > 0)
Of course precedence "does the right thing" but since R will use coercion to give lgl & dbl a value, putting in the redundant parentheses prevents the not-careful reader (me, for example) from "thinking the wrong thing."
-
cancelled
In this dataset the presence of arr_delay implies the presence of arr_time so the boolean cancelled could be eliminated. Of course, the reader hasn't verified this relation so keeping it in makes sense. The definition of cancelled as "having an arrival delay" and "not having an arrival time" seems a bit odd.
-
They operate within each group rather than over the entire data frame
Do all functions on grouped tibbles operate only per group and not the entire tibble? E.g.
flights %>% group_by(month,day) %>% arrange(desc(dep_delay))
Will the arrange function order both by group and by the entire tibble?
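As far as I can tell, arrange() is the exception: it ignores grouping unless .by_group = TRUE is passed, while mutate(), filter(), and summarise() work per group. A small sketch on made-up data (the toy tibble is hypothetical):

```r
library(dplyr)

# Made-up data with two groups.
toy <- tibble(
  g     = c("a", "a", "b", "b"),
  delay = c(1, 9, 5, 3)
)

# Default: the grouping is ignored and the whole tibble is sorted.
ignoring_groups <- toy %>%
  group_by(g) %>%
  arrange(desc(delay))

# With .by_group = TRUE: sorted by group first, then by delay within group.
within_groups <- toy %>%
  group_by(g) %>%
  arrange(desc(delay), .by_group = TRUE)

print(ignoring_groups)
print(within_groups)
```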
-
There are more sophisticated ways to do this analysis
Perhaps, but this is a lovely example of the power of R.
-
, which calculates
". atan(y,x) returns"
My worry is that the reader would assume that atan(x,y) returns that angle.
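For what it's worth, in base R the two-argument arctangent is spelled atan2(y, x), with the y coordinate first; atan() itself takes a single argument. A tiny sketch of the argument order, using two points chosen only for illustration:

```r
# atan2(y, x): the y coordinate comes first.
angle_up    <- atan2(1, 0)  # point (x = 0, y = 1), straight up
angle_right <- atan2(0, 1)  # point (x = 1, y = 0), along the positive x-axis

print(c(up = angle_up, right = angle_right))
```

Swapping the arguments swaps the two answers, which is the confusion worth guarding against.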
-
not_cancelled
Hadley defines this as
not_cancelled <- flights %>% filter(!is.na(dep_delay), !is.na(arr_delay))
But in
flights %>% filter(is.na(dep_time) | is.na(arr_time) | is.na(dep_delay) | is.na(arr_delay)) %>% select(sched_arr_time,arr_time,arr_delay,sched_dep_time,dep_time,dep_delay)
the first 6 flights have an NA only in arr_delay. Aren't these likely not to have been cancelled and simply have a missing data entry? Why isn't
not_cancelled <- flights %>% filter(!is.na(dep_time), !is.na(arr_time))
a "better" definition?
-
and the passenger arrived at the same time
Omit.
-
Exercise 5.5.4
Another solution. Now that we know we can "trust" dep_delay we could simply
md <- flights %>%
  select(carrier, flight, origin, dest, sched_dep_time, dep_time, dep_delay) %>%
  arrange(min_rank(desc(dep_delay)))
There are no ties in the top 10.
-
Daylight
If you wish, you could change this to "Eastern Daylight" to agree with its previous reference in that sentence.
-
Except for flights daylight savings started (March 10) or ended (November 3)
I understand the point of the paragraph but this sentence is a bit unclear.
-
(to handle the .
It's unclear to me what is intended here.
-
Exercise 5.5.2
A nice analysis, showing the "messiness" of actual datasets.
-
select(flights, one_of(vars))
Note that select(flights,vars) produces the same result.
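One caveat, as I understand it: a bare name inside select() is looked up among the columns first, so the shortcut changes meaning if the data ever has a column called vars. Wrapping the vector in all_of() avoids that. A sketch with a made-up tibble (toy and its column named vars are hypothetical, chosen to force the clash):

```r
library(dplyr)

# Made-up tibble that happens to contain a column named "vars".
toy  <- tibble(a = 1:2, b = 3:4, vars = 5:6)
vars <- c("a", "b")

by_column_name     <- select(toy, vars)          # picks the column called vars
by_external_vector <- select(toy, all_of(vars))  # picks columns a and b

print(names(by_column_name))
print(names(by_external_vector))
```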
-
/
*
-
Saturday
December
-
month == 7, month == 8, month == 9
Replace , by |.
-
between()
Interestingly between microbenchmarks to 143 microseconds while >=, <= is 2.3 on my box!
-
What other variables are missing?
Another interpretation:
colnames(flights)[colSums(is.na(flights)) > 0]
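To see how that one-liner works, here is the same expression on a small made-up data frame (toy is hypothetical; its column names are borrowed from flights only for familiarity):

```r
# Made-up data frame: two columns contain NA, one does not.
toy <- data.frame(
  dep_time  = c(517, NA, 542),
  carrier   = c("UA", "AA", "B6"),
  arr_delay = c(11, NA, NA)
)

na_counts    <- colSums(is.na(toy))          # number of NAs in each column
cols_with_na <- colnames(toy)[na_counts > 0] # keep names with at least one NA

print(na_counts)
print(cols_with_na)
```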
-
dep_time %% 2400 <= 600
Elegant, at a 2x microbenchmark cost.
-
is preferred
Do numerical operators execute faster than, say, %in%?
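One way to find out on your own machine is to time both forms on the same vector. This is only a rough sketch using base system.time(); the simulated month vector, the 20 repetitions, and the variable names are my assumptions, and the timings will differ between machines, so none are quoted here.

```r
set.seed(1)

# Simulated month column (hypothetical data, one million values).
month <- sample(1:12, 1e6, replace = TRUE)

# Both forms should select exactly the same elements.
with_in  <- month %in% c(7, 8, 9)
with_cmp <- month >= 7 & month <= 9
stopifnot(identical(with_in, with_cmp))

# Rough timings over 20 repetitions each.
t_in  <- system.time(for (i in 1:20) month %in% c(7, 8, 9))
t_cmp <- system.time(for (i in 1:20) month >= 7 & month <= 9)

elapsed <- c(in_operator = t_in[["elapsed"]], comparison = t_cmp[["elapsed"]])
print(elapsed)
```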
-
were
Omit
-
-
jrnold.github.io jrnold.github.io
-
It looks like a typo, dota instead of data.
There was no typo in the 2/2/19 version of r4ds.
-