There are more sophisticated ways to do this analysis
Perhaps, but this is a lovely example of the power of R.
", which calculates"
"atan(y, x) returns"
My worry is that the reader would assume that atan(x,y) returns that angle.
not_cancelled
Hadley defines this as
not_cancelled <- flights %>%
  filter(!is.na(dep_delay), !is.na(arr_delay))
But in
flights %>%
  filter(is.na(dep_time) | is.na(arr_time) | is.na(dep_delay) | is.na(arr_delay)) %>%
  select(sched_arr_time, arr_time, arr_delay, sched_dep_time, dep_time, dep_delay)
the first 6 flights have an NA only in arr_delay. Aren't these flights likely not cancelled at all, just missing a data entry? Why isn't
not_cancelled <- flights %>%
  filter(!is.na(dep_time), !is.na(arr_time))
a "better" definition?
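One way to test this empirically (a sketch assuming dplyr and nycflights13 are loaded) is to count the flights that have both times recorded but no arr_delay:

```r
library(dplyr)
library(nycflights13)

# Flights that departed and arrived, yet have no arr_delay recorded --
# kept by the alternative definition but dropped by Hadley's.
flights %>%
  filter(!is.na(dep_time), !is.na(arr_time), is.na(arr_delay)) %>%
  nrow()
```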
and the passenger arrived at the same time
Omit.
Exercise 5.5.4
Another solution: now that we know we can "trust" dep_delay, we could simply run
md <- flights %>%
  select(carrier, flight, origin, dest, sched_dep_time, dep_time, dep_delay) %>%
  arrange(min_rank(desc(dep_delay)))
There are no ties in the top 10.
Daylight
If you wish, you could change this to "Eastern Daylight" to agree with its previous reference in that sentence.
Except for flights daylight savings started (March 10) or ended (November 3)
I understand the point of the paragraph but this sentence is a bit unclear.
(to handle the .
It's unclear to me what is intended here.
Exercise 5.5.2
A nice analysis, showing the "messiness" of actual datasets.
select(flights, one_of(vars))
Note that
select(flights, vars)
produces the same result.
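As a side note (this reflects newer tidyselect versions, not the solutions text itself): passing a bare character vector now triggers an ambiguity warning, and all_of() is the explicit spelling:

```r
library(dplyr)
library(nycflights13)

vars <- c("year", "month", "day", "dep_delay", "arr_delay")
select(flights, one_of(vars))   # as in the solutions
select(flights, vars)           # same result, but ambiguous lookup
select(flights, all_of(vars))   # explicit tidyselect equivalent
```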
/
*
Saturday
December
month == 7, month == 8, month == 9
Replace the commas with |.
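For concreteness, both corrected forms below should return the same rows (a sketch assuming dplyr and nycflights13):

```r
library(dplyr)
library(nycflights13)

filter(flights, month == 7 | month == 8 | month == 9)
filter(flights, month %in% c(7, 8, 9))  # equivalent and more compact
```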
between()
Interestingly, between() microbenchmarks at 143 microseconds while >=, <= comes in at 2.3 on my box!
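The comparison can be reproduced roughly like this (a sketch; assumes the microbenchmark package is installed, and timings will vary by machine):

```r
library(microbenchmark)
library(dplyr)
library(nycflights13)

microbenchmark(
  between = filter(flights, between(month, 7, 9)),
  compare = filter(flights, month >= 7, month <= 9),
  times = 50
)
```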
What other variables are missing?
Another interpretation:
colnames(flights)[colSums(is.na(flights)) > 0]
dep_time %% 2400 <= 600
Elegant, at a 2x microbenchmark cost.
is preferred
Do numerical operators execute faster than, say, %in%?
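One way to answer that question empirically (a hypothetical benchmark sketch; the vector x and candidate values are made up):

```r
library(microbenchmark)

x <- sample(1:12, 1e6, replace = TRUE)
microbenchmark(
  logical = x == 11 | x == 12,  # numerical/logical operators
  set     = x %in% c(11, 12)    # set membership
)
```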
were
Omit.
sum(cases)
na.rm = TRUE must go inside the sum() parentheses for the graph below to appear; otherwise it will just be blank.
By the way, thanks so much for these solutions; they are extremely helpful and teach things that are not taught in the book!
It looks like a typo, dota instead of data.
There was no typo in the 2/2/19 version of r4ds.
sum_to_one <- function(x, na.rm = FALSE) { x / sum(x, na.rm = na.rm) }
Since the sum of x is the same across the input, couldn't you make the code less repetitive by assigning it to an intermediate variable?
sum_to_one <- function(x, na.rm = FALSE) {
  y <- sum(x, na.rm = na.rm)
  x / y
}
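A quick self-contained check of the refactored helper (toy input chosen for illustration):

```r
sum_to_one <- function(x, na.rm = FALSE) {
  y <- sum(x, na.rm = na.rm)  # compute the total once
  x / y
}

sum_to_one(c(1, 3, 4))
#> [1] 0.125 0.375 0.500
```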
dep_delay
This should be arr_delay not dep_delay
flights, dep_time %% 2400 <= 600)
This modulo operation takes care of the maximum value.
There is one remaining issue. Midnight is represented by 2400, which would correspond to 1440 minutes since midnight, but it should correspond to 0. After converting all the times to minutes after midnight, x %% 1440 will convert 1440 to zero while keeping all the other times the same. Now we will put it all together. The following code creates a new data frame flights_times with columns dep_time_mins and sched_dep_time_mins. These columns convert dep_time and sched_dep_time, respectively, to minutes since midnight.
flights_times <- mutate(flights,
  dep_time_mins = (dep_time %/% 100 * 60 + dep_time %% 100) %% 1440,
  sched_dep_time_mins = (sched_dep_time %/% 100 * 60 + sched_dep_time %% 100) %% 1440
)
This little trick for computing the variable is very nice and worth studying carefully.
arrange(flights, distance / air_time * 60)
arrange() can also take a new variable generated by an expression and sort by it.
c(600, 1200, 2400) %% 2400
This is the modulo operation, similar to taking a remainder but not quite the same; for details see https://baike.baidu.com/item/%E5%8F%96%E6%A8%A1%E8%BF%90%E7%AE%97/10739384?fr=aladdin
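Concretely, the operation wraps the maximum value back to zero while leaving smaller values unchanged:

```r
c(600, 1200, 2400) %% 2400
#> [1]  600 1200    0
```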
filter(flights, between(month, 7, 9))
This is nice; it can be used when converting a continuous variable into a categorical one.
desc(is.na(dep_time)), dep_time)
Sorting by two variables: the first produces a logical variable (TRUE/FALSE). Because missing values yield TRUE, they are sorted to the front, and then the rows are ordered by the second variable, dep_time.
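A tiny toy example of the trick (hypothetical data, not from flights):

```r
library(dplyr)

df <- tibble(dep_time = c(517, NA, 533, NA))
# desc(is.na(...)) puts the TRUE (missing) rows first,
# then dep_time orders the remaining rows.
arrange(df, desc(is.na(dep_time)), dep_time)
# the two NA rows come first, followed by 517 and 533
```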
horizontal
vertical
geom_point
This should be "geom_jitter"
scales
axes
position_dodge()
position = "dodge2"
changing
slightly changing
height = 0.8 and width = 0.8
height = 0.4 and width = 0.4, because the randomness is applied in both the negative and positive directions.