>
Omit.
>
Omit.
Exercise 7.4.1
Many thanks for this lovely example. I hadn't understood geom_bar's NA behavior with factor variables until your explanation.
Presumably there is a premium for a 1 carat diamond
Is this "buyer's psychology"? The opposite of $4.99 seeming much cheaper than $5.00?
number of diamonds in each carat range
Nice! Your intuition seems correct. But how do you account for the large number of 1.01 carat diamonds?
print(n = 30)
print(n = Inf) will print all rows of a tibble.
seem
see
There are no diamonds with a price of $1,500
More precisely: there's a $90 gap: the closed interval [1455,1545].
Explore the distribution
I wonder if this includes identifying outliers for possible data errors. E.g. x = 0 for 8 diamonds and z = 0 for 20. Also z = 31.8 for one. y = 0 for 7 diamonds and y = 31.8 and y = 58.9 for one each. I believe all are data errors.
mutate(id = row_number())
I believe the plots can be generated without id:
diamonds %>%
  select(x, y, z) %>%
  gather(variable, value) %>%
  ggplot(aes(x = value)) +
  geom_density() +
  geom_rug() +
  facet_grid(vars(variable))
For each plane, count the number of flights before the first delay of greater than 1 hour.
I think this is a lovely use of logicals and cumsum. But I believe that it omits planes whose first flight is delayed by more than an hour. There are 234 of these:
(zero <- flights %>%
  filter(!is.na(dep_delay)) %>%
  arrange(tailnum, month, day) %>%
  group_by(tailnum) %>%
  mutate(delay_gt1hr = dep_delay > 60) %>%
  filter(row_number() == 1, delay_gt1hr) %>%
  select(tailnum) %>%
  mutate(n = 0)
)
Then bind_rows could concatenate these to the tibble produced by the posted solution.
This is a "two-part" solution. Is there a "one-part" solution?
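One possible "one-part" version, as a sketch only: summarise the cumsum directly, so a plane whose first flight is delayed by more than an hour gets a count of 0 instead of dropping out. The toy tibble and the names toy and before_first_delay are made up for illustration; with flights you would keep the filter and arrange steps in front.

```r
library(dplyr)

# Made-up delays (minutes), already in chronological order within each plane.
toy <- tibble(
  tailnum   = c("A", "A", "A", "B", "B", "C", "C"),
  dep_delay = c(10, 70, 5, 90, 0, 5, 10)
)

before_first_delay <- toy %>%
  group_by(tailnum) %>%
  # cumsum() stays at 0 until the first delay over 60 minutes,
  # so counting the zeros counts the flights before that delay.
  summarise(n = sum(cumsum(dep_delay > 60) < 1))

before_first_delay$n  # 1 0 2
```

Plane B, whose first flight is delayed 90 minutes, is kept with n = 0, so no bind_rows step is needed.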
year
Since the year column contains only 2013, you could arrange chronologically without this argument.
We will calculate this ranking in two parts
Or, the two parts could be combined:
rank <- flights %>%
  group_by(dest) %>%
  mutate(n_carriers = n_distinct(carrier)) %>%
  filter(n_carriers > 1) %>%
  group_by(carrier) %>%
  summarize(n_dest = n_distinct(dest)) %>%
  arrange(desc(n_dest))
width = 100
Here width = Inf will print all columns.
print(width = 120)
I believe that print(width = Inf) will reliably print all selected columns.
below than the
below the
observations each
observations of each
I
Omit.
select(tailnum, on_time, arr_time, arr_delay) %>%
If you wish, this could be eliminated since the only columns that survive are those in the summarise.
were
omit
ungroup() %>%
I believe you don't need ungroup().
an standard deviation. That
"and standard deviation. The following"
Unusually fast flights are those flights with the smallest standardized values.
I believe the problem asks for "fast flights per destination". If so, this code produces that list, including ties. Sadly, it uses slice.
fast <- flights %>%
  filter(!is.na(air_time)) %>%
  group_by(dest) %>%
  mutate(z = (air_time - mean(air_time)) / sd(air_time)) %>%
  slice(which(z == min(z))) %>%
  select(origin, dest, month, day, carrier, flight, air_time, z) %>%
  arrange(z)
ggplot(standardized_flights, aes(x = air_time_standard)) + geom_density()
This is a nice plot, but the image below is not it. It appears to have been cut-and-pasted from a previous plot.
standardized_flights
Somewhat more succinctly:
standardized_flights <- flights %>%
  filter(!is.na(air_time)) %>%
  group_by(dest, origin) %>%
  mutate(air_time_standard = (air_time - mean(air_time)) / sd(air_time)) %>%
  select(origin, dest, month, day, carrier, flight, air_time, air_time_standard)
in 10 flights
Omit, since you've already filtered.
min_rank() and dense_rank()
English is inherently ambiguous; only math (vectors, subscripts) and algorithms are not. But with those you'd lose your reader.
Here's a slight rephrasing where rank refers to a component of, say, min_rank(x) and value to a component of x.
For each set of tied values the min_rank() function assigns a rank equal to the number of values less than that tied value plus one. In contrast, the dense_rank() function assigns a rank equal to the number of distinct values less than that tied value plus one.
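A small check of that rephrasing, as a sketch: the vector x and the helpers min_by_hand and dense_by_hand are made up here, and they simply spell out "number of (distinct) values less than the value, plus one" so it can be compared with what dplyr returns.

```r
library(dplyr)

# Made-up vector with one tie (the two 20s).
x <- c(10, 20, 20, 30)

# min_rank as described: number of values strictly less than each value, plus one.
min_by_hand <- sapply(x, function(v) sum(x < v) + 1)

# dense_rank as described: number of distinct values strictly less, plus one.
dense_by_hand <- sapply(x, function(v) sum(unique(x) < v) + 1)

min_rank(x)    # 1 2 2 4
dense_rank(x)  # 1 2 2 3
```

The two differ only after the tie: min_rank() skips rank 3, dense_rank() does not.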
ranking handles
the ranking functions handle
function
functions
roughly
Is it "rough" or "exact"?
an element
a vector
The row_number() function is confusingly named. It can create ranks for any column.
I might omit this since you give an explanation of row_number below which is accurate. Also, tibble adds its own row numbers, and row_number(x) appears to be "the row number after x is sorted", which I don't find confusing.
missing values
ties
na.rm = TRUE
This probably is intended to be an argument of mean. It's not necessary, however, since neither dep_delay nor dep_delay_lag has any NAs. Also, could you provide an interpretation of delay_diff? For example, what does it mean that JFK's is negative?
year,
Could be eliminated since the dataset is for one specific year.
flight
Another definition of flight could be the quad (carrier,flight,origin,dest) which, say, might fly daily. Here's a solution with this definition.
dest_delay <- flights %>%
  group_by(dest) %>%
  filter(arr_delay > 0) %>%
  summarize(dest_ad = sum(arr_delay, na.rm = TRUE), dest_ct = n())
flight_delay <- flights %>%
  group_by(carrier, flight, origin, dest) %>%
  filter(arr_delay > 0) %>%
  summarize(flight_ad = sum(arr_delay, na.rm = TRUE), flight_ct = n())
prop <- as_tibble(merge(dest_delay, flight_delay, by = "dest")) %>%
  mutate(prop = flight_ad / dest_ad) %>%
  select(carrier, flight, origin, dest, dest_ct, flight_ct, prop)
Exercise 5.7.4
In general, selecting just a few columns (often the mutate variables) makes the result clearer since "irrelevant" columns are eliminated.
!is.na(arr_delay),
This could be omitted but keeping it in redundantly may add clarity.
only
Omit
However, there are many planes that have never flown an on-time flight.
Since this minimum rank group has an on_time of 0.0, I believe that this tibble is all planes which have never flown an on-time flight.
<=
==
I believe min_rank cannot be 0.
arr_delay > 0
(arr_delay > 0)
Of course precedence "does the right thing" but since R will use coercion to give lgl & dbl a value, putting in the redundant parentheses prevents the not-careful reader (me, for example) from "thinking the wrong thing."
cancelled
In this dataset the presence of arr_delay implies the presence of arr_time so the boolean cancelled could be eliminated. Of course, the reader hasn't verified this relation so keeping it in makes sense. The definition of cancelled as "having an arrival delay" and "not having an arrival time" seems a bit odd.
They operate within each group rather than over the entire data frame
Do all functions on grouped tibbles operate only per group and not the entire tibble? E.g.
flights %>%
  group_by(month, day) %>%
  arrange(desc(dep_delay))
Will the arrange function order both by group and by the entire tibble?
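As far as I know, arrange() is an exception: by default it ignores grouping and sorts the whole tibble, and it sorts within groups only when .by_group = TRUE is given. A sketch on a made-up tibble (df, grouped, ignore_groups and by_group are illustrative names):

```r
library(dplyr)

# Made-up data: two groups, "a" and "b".
df <- tibble(g = c("b", "a", "b", "a"), v = c(1, 4, 3, 2))
grouped <- df %>% group_by(g)

# Default: grouping is ignored, the whole tibble is sorted by v.
ignore_groups <- grouped %>% arrange(desc(v))
ignore_groups$v  # 4 3 2 1

# With .by_group = TRUE: sorted by g first, then by v within each group.
by_group <- grouped %>% arrange(desc(v), .by_group = TRUE)
by_group$g  # "a" "a" "b" "b"
by_group$v  # 4 2 3 1
```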
There are more sophisticated ways to do this analysis
Perhaps, but this is a lovely example of the power of R.
, which calculates
". atan2(y, x) returns"
My worry is that the reader would assume that atan2(x, y) returns that angle.
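For what it's worth, the two-argument arctangent in base R is atan2(y, x), with y first; atan() itself takes a single argument. A quick check of the argument order:

```r
# atan2(y, x): the y coordinate comes first.
atan2(1, 0)  # pi / 2: the point (x = 0, y = 1) is on the positive y-axis
atan2(0, 1)  # 0: the point (x = 1, y = 0) is on the positive x-axis
```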
not_cancelled
Hadley defines this as
not_cancelled <- flights %>%
  filter(!is.na(dep_delay), !is.na(arr_delay))
But in
flights %>%
  filter(is.na(dep_time) | is.na(arr_time) | is.na(dep_delay) | is.na(arr_delay)) %>%
  select(sched_arr_time, arr_time, arr_delay, sched_dep_time, dep_time, dep_delay)
the first 6 flights have an NA only in arr_delay. Aren't these likely not to have been cancelled and simply to have a missing data entry? Why isn't
not_cancelled <- flights %>%
  filter(!is.na(dep_time), !is.na(arr_time))
a "better" definition?
and the passenger arrived at the same time
Omit.
Exercise 5.5.4
Another solution. Now that we know we can "trust" dep_delay we could simply
md <- flights %>%
  select(carrier, flight, origin, dest, sched_dep_time, dep_time, dep_delay) %>%
  arrange(min_rank(desc(dep_delay)))
There are no ties in the top 10.
Daylight
If you wish, you could change this to "Eastern Daylight" to agree with its previous reference in that sentence.
Except for flights daylight savings started (March 10) or ended (November 3)
I understand the point of the paragraph but this sentence is a bit unclear.
(to handle the .
It's unclear to me what is intended here.
Exercise 5.5.2
A nice analysis, showing the "messiness" of actual datasets.
select(flights, one_of(vars))
Note that select(flights, vars) produces the same result.
/
*
Saturday
December
month == 7, month == 8, month == 9
Replace , by |.
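The reason, sketched on a made-up tibble (months, and_rows and or_rows are illustrative names): filter() combines comma-separated conditions with "and", so no row can satisfy all three equalities at once.

```r
library(dplyr)

# Made-up data: one row per month, June through October.
months <- tibble(month = 6:10)

# Commas mean AND: no month equals 7, 8 and 9 at the same time.
and_rows <- months %>% filter(month == 7, month == 8, month == 9)
nrow(and_rows)  # 0

# | means OR: the three summer months are kept.
or_rows <- months %>% filter(month == 7 | month == 8 | month == 9)
or_rows$month  # 7 8 9
```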
between()
Interestingly, between microbenchmarks at 143 microseconds while >=, <= takes 2.3 on my box!
What other variables are missing?
Another interpretation:
colnames(flights)[colSums(is.na(flights)) > 0]
dep_time %% 2400 <= 600
Elegant, at a 2x microbenchmark cost.
is preferred
Do numerical operators execute faster than, say, %in%?
were
Omit
It looks like a typo, dota instead of data.
There was no typo in the 2/2/19 version of r4ds.