- Jul 2020
-
jrnold.github.io jrnold.github.io
-
stocks <- tibble(
  year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr = c(1, 2, 3, 4, 2, 3, 4),
  return = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)
)
stocks %>%
  pivot_wider(names_from = year, values_from = return, values_fill = 0)
#> # A tibble: 4 x 3
#>     qtr `2015` `2016`
#>   <dbl>  <dbl>  <dbl>
#> 1     1   1.88   0
#> 2     2   0.59   0.92
#> 3     3   0.35   0.17
#> 4     4  NA      2.66
duplicated example I think
-
complete()
delete?
-
set
delete?
-
[^ex-12.2.2]
are all the bracketed parts supposed to be links?
-
- Jun 2020
-
jrnold.github.io jrnold.github.io
-
flights_delayed3
slice_max is not found
-
The fastest flight is the one with the average ground speed,
Should read:
The fastest flight is the one with the fastest average ground speed,
The sentence as currently worded describes the flight with the average ground speed, not the fastest.
-
-
jrnold.github.io jrnold.github.io
-
numeric_cols <- vector("logical", length(df))
# test whether each column is numeric
for (i in seq_along(df)) {
  numeric_cols[[i]] <- is.numeric(df[[i]])
}
# find the indexes of the numeric columns
idxs <- which(numeric_cols)
Is there a reason not to use
idxs <- which(sapply(df, is.numeric)) ?
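A minimal sketch of the comparison raised above, on a small made-up data frame (`df` here is hypothetical, not the one from the solution). The `vapply()` line is an extra, type-stable variant added for illustration only.

```r
# Hypothetical data frame for illustration only
df <- data.frame(a = 1:3, b = c("x", "y", "z"), c = c(0.5, 1.5, 2.5),
                 stringsAsFactors = FALSE)

# Loop version, as in the quoted solution
numeric_cols <- vector("logical", length(df))
for (i in seq_along(df)) {
  numeric_cols[[i]] <- is.numeric(df[[i]])
}
idxs_loop <- which(numeric_cols)

# One-line versions; these keep the column names on the result
idxs_sapply <- which(sapply(df, is.numeric))
idxs_vapply <- which(vapply(df, is.numeric, logical(1)))

idxs_loop
idxs_sapply
```

Both approaches select the same positions; the only visible difference in this sketch is that the one-line versions return a named vector.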
-
-
jrnold.github.io jrnold.github.io
-
NA
NA and NaN. Note that 0.286 is equal to 2 out of 7.
-
in
"an" ?
-
integer values
numeric values not limited to integers?
-
- May 2020
-
jrnold.github.io jrnold.github.io
-
What other variables are missing?
summary(is.na(flights)) will give NAs of all variables
-
>
should be >= (at least meaning 30 or more)
-
Exercise 5.3.3
Your solution works, but we have not been taught the mutate command yet. Given the commands we have already been taught, you can get the same results from: arrange(flights, desc(distance / air_time))
-
ground speed
This worked for me arrange(flights, air_time/distance)
-
because the value of the missing TRUE or FALSE, x
I find this hard to follow. Why not phrase it as the next example? NA | TRUE is TRUE because anything or TRUE is always TRUE
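The rule suggested above can be checked directly at the console. A small base R sketch: the result is known whenever one side alone decides it, and is `NA` otherwise.

```r
# Logical operators with a missing value
NA | TRUE    # TRUE: the answer is TRUE whatever the missing value is
NA | FALSE   # NA: the answer depends on the missing value
NA & FALSE   # FALSE: the answer is FALSE whatever the missing value is
NA & TRUE    # NA: the answer depends on the missing value
```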
-
mean(on_time)
If we were to calculate the proportion of flights not delayed or cancelled, we would need to adjust for NA values of the cancelled flights:
summarise(n = n(), on_time = sum(on_time, na.rm=TRUE) / n)
-
-
jrnold.github.io jrnold.github.io
-
Exercise 10.2
Exercise 10.5.2
-
Exercise 10.1
Exercise 10.5.1
-
on
only
-
-
jrnold.github.io jrnold.github.io
-
The words that have seven letters or more are
if we need to view/extract later the words that have at least 7 characters we can:
str_view( stringr::words, "........*", match = T)
adding "*" after the eighth "." states that the last "." can be repeated zero or more times.
or
str_view(stringr::words, ".{7,}", match = T)
in both cases we match the words that have at least 7 characters, not only up to the seventh letter
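A rough base R sketch of the same idea, using `grepl()` and `regmatches()` instead of `str_view()` so that it runs without stringr. The word list is made up for illustration; the exercise itself uses `stringr::words`.

```r
# Hypothetical word list for illustration only
words <- c("a", "ability", "about", "absolute", "account", "act")

# Both patterns flag words with at least seven characters
long_dots  <- grepl("........*", words)   # seven dots, then "." zero or more times
long_brace <- grepl(".{7,}", words)

# What is actually matched differs from a plain seven-dot pattern
first_seven <- regmatches(words, regexpr(".{7}", words))
whole_word  <- regmatches(words, regexpr(".{7,}", words))

words[long_brace]
first_seven
whole_word
```

With `.{7}` the match stops at the seventh letter, while `.{7,}` extends to the end of the word.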
-
Words that contain only consonants
An alternative: str_view(words, "[aeiou]", match = FALSE)
Or, using str_subset: str_subset(words, "[aeiou]", negate = TRUE)
-
-
jrnold.github.io jrnold.github.io
-
foreign
If one adds "%>% distinct(dest)" to the expression, one sees that there are four airports that are not in the FAA list: BQN, SJU, STT, and PSE. Three of these are in Puerto Rico, one (STT) is in US Virgin Islands.
-
precipitation
Perhaps the association with visibility is even stronger.
-
-
jrnold.github.io jrnold.github.io
-
ggplot(data = diamonds) + geom_pointrange( mapping = aes(x = cut, y = depth), stat = "summary", fun.ymin = min, fun.ymax = max, fun.y = median )
Could not generate the same stat_summary( ) plot with that code, did a little research and stack overflow suggested two solutions: use geom_line( )
ggplot(data = diamonds, mapping = aes(x = cut, y = depth)) + geom_line() + stat_summary(fun.y = "median", geom = "point", size = 3)
or, reduce the amount of data by grouping it
data = diamonds %>% group_by(cut) %>% summarise(min = min(depth), max = max(depth), median = median(depth))
ggplot(data, aes(x = cut, y = median, ymin = min, ymax = max)) + geom_linerange() + geom_pointrange()
Source: https://stackoverflow.com/questions/41850568/r-ggplot2-pointrange-example
-
width controls the amount of vertical displacement, and height controls the amount of horizontal displacement.
I think you flipped these labels. width is horizontal and height is vertical
It would also be helpful to emphasize that unless height and/or width are explicitly set to zero, there will be jitter. When I first used geom_jitter, I did not realize this.
-
- Apr 2020
-
jrnold.github.io jrnold.github.io
-
combination
is a combination
-
contained
contained in
-
a particular trip by aircraft from a particular
the sentence is not complete.
-
few
flew
-
They may be combining different flights?
It seems to refer to diverted flights. The original BTS data also has diverted airport information, including a variable DivArrDelay with the following description: "Difference in minutes between scheduled and actual arrival time for a diverted flight reaching scheduled destination. The ArrDelay column remains NULL for all diverted flights." I browsed this data quickly and, indeed, those missing arr_delay observations have values in the DivArrDelay variable.
-
is column
is a column.
-
There is one remaining issue. Midnight is represented by 2400, which would correspond to 1440 minutes since midnight, but it should correspond to 0. After converting all the times to minutes after midnight, x %% 1440 will convert 1440 to zero while keeping all the other times the same.
%% 1440 is brilliant
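A small sketch of the trick, assuming times are stored as HHMM integers as in `dep_time`. The helper name `time_to_minutes` is made up for this example.

```r
# Convert clock times written as HHMM integers to minutes after midnight.
# The final %% 1440 maps 2400 (1440 minutes) back to 0 and leaves
# every other time unchanged.
time_to_minutes <- function(x) {
  (x %/% 100 * 60 + x %% 100) %% 1440
}

time_to_minutes(c(1, 1530, 2359, 2400))
# 1 930 1439 0
```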
-
-
jrnold.github.io jrnold.github.io
-
5.7
Did you mean 6.5?
-
library("viridis")
What is viridis library?
-
-
jrnold.github.io jrnold.github.io
-
function that directly calculates the number of unique values in a vector.
n_distinct is a faster and more concise equivalent of length(unique(x)).
-
-
jrnold.github.io jrnold.github.io
-
December
Pretty sure December has 31 days.
-
-
jrnold.github.io jrnold.github.io
-
model
manufacturer is missing (it is a categorical variable as well)
-
- Mar 2020
-
jrnold.github.io jrnold.github.io
-
cyl
How is cylinder a continuous variable?
-
-
jrnold.github.io jrnold.github.io
-
Note
Is anyone getting the above plot? even when I copy paste the code the boxplots are horizontal instead of vertical. tidyverse has the same plot as I do https://ggplot2.tidyverse.org/reference/geom_boxplot.html
If I understand the aes correctly, group should make carat the "categorical" variable and plot vertically as seen here.
If I replace y = price with y = depth it plots vertically as expected.
Can anyone help me understand what's happening?
-
-
jrnold.github.io jrnold.github.io
-
What does the one_of() function do? Why might it be helpful in conjunction with this vector?
Retired Selection Helpers
one_of() is retired in favour of the more precise any_of() and all_of() selectors.
https://www.rdocumentation.org/packages/tidyselect/versions/1.0.0/topics/one_of
-
will
Change "will the difference" to "is the difference" or "will be the difference"
-
-
jrnold.github.io jrnold.github.io
-
Why are gather() and spread() not perfectly symmetrical? Carefully consider the following example:
Shouldn't we update these functions to the newer pivot_longer and pivot_wider? (check ?gather, and also https://r4ds.had.co.nz/tidy-data.html#exercises-24 )
-
- Feb 2020
-
jrnold.github.io jrnold.github.io
-
Tidy the simple tibble below. Do you need to spread or gather it? What are the variables?
Thank you for writing/maintaining r4ds! Great resource.
I delivered this example in a lecture introducing tidy data, and the solution to have three variables doesn't make sense to me. Why is this not only two variables (sex and pregnant)? Sure, I could argue that count is a variable on this data, but it doesn't seem as natural.
-
-
jrnold.github.io jrnold.github.io
-
Exercise 3.7.5
This exercise makes more sense AFTER introducing Position adjustments in part 3.8.
-
na.rm:
from the documentation: If FALSE, the default, missing values are removed with a warning. If TRUE, missing values are silently removed.
-
- Jan 2020
-
jrnold.github.io jrnold.github.io
-
As that example shows,
Using geom_count() with position ='jitter' may help a bit with overplotting:
ggplot(data = mpg) + geom_count(mapping = aes(x = cty, y = hwy, color = class), position = "jitter")
-
stat_violin()
The ggplot2 v3.2.1 R documentation shows geom_violin() paired with stat_ydensity()
-
The default stat of geom_bar() is stat_bin(). The geom_bar() function only expects an x variable. The stat, stat_bin(), preprocesses input data by counting the number of observations for each value of x. The y aesthetic uses the values of these counts.
The ggplot2 v3.2.1 R documentation states "geom_bar() uses stat_count() by default". Was ggplot2 updated since this answer was published? My understanding is stat_count() is used for discrete x data and stat_bin() for continuous x data.
-
The following list contains the categorical variables
The variable manufacturer is missing; it is also a categorical (<chr>) variable.
-
-
jrnold.github.io jrnold.github.io
-
Write a function that turns (e.g.) a vector c("a", "b", "c") into the string "a, b, and c". Think carefully about what it should do if given a vector of length 0, 1, or 2.
Can we replace the whole code with: ifelse(length(x)==1, x, str_c(str_c(x[-length(x)], collapse = ", ")," and ",str_c(x[length(x)], collapse = ",")))
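For comparison, a base R sketch that spells out the four cases the exercise asks about (length 0, 1, 2, and longer). The function name `str_commasep` and the choice to return an empty string for length 0 are assumptions made for this example, not the solution's own code.

```r
# Hypothetical helper; base R only
str_commasep <- function(x) {
  n <- length(x)
  if (n == 0) {
    ""
  } else if (n == 1) {
    x
  } else if (n == 2) {
    paste(x[[1]], "and", x[[2]])
  } else {
    # every element except the last, each followed by a comma
    head_part <- paste0(x[-n], ",", collapse = " ")
    paste(head_part, "and", x[[n]])
  }
}

str_commasep(c("a", "b", "c"))   # "a, b, and c"
str_commasep(c("a", "b"))        # "a and b"
```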
of course in the form of a function
-
-
jrnold.github.io jrnold.github.io
-
how how
Repeated word
-
-
jrnold.github.io jrnold.github.io
-
Distribution of last digit
Why do we do that? Just to see if each digit appears, or what? Thanks
-
- Dec 2019
-
jrnold.github.io jrnold.github.io
-
What is meant by “48 hours over the course of the year”? This could mean two days, a span of 48 contiguous hours, or 48 hours that are not necessarily contiguous hours. I will find 48 not-necessarily contiguous hours. That definition makes better use of the methods introduced in this section and chapter.
Thank you for the book! Could you provide a hint as to how you might go about with this if 48 hours here meant any contiguous 48 hours? My guess is using a windowing function, but I couldn't figure it out.
-
- Nov 2019
-
jrnold.github.io jrnold.github.io
-
the an
Either it is "the" or "an", but not both.
-
with with
"with" written 2 times.
-
geom_bin()
this should be "stat_bin()"
-
no plots
I think this should be "no points" (i. e. no points on the scatter plot).
-
- Oct 2019
-
jrnold.github.io jrnold.github.io
-
)
Typo.
-
A warning is provided since often, but not always,
Sentence syntax unclear. Maybe this:
If a warning is provided often, but not always, there may be a bug in the code.
-
vectors recycles
vectors, R recycles
-
min_rank()
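min_rank numbers the data in order of size; tied values get the same rank, but each value still takes up a position (so the next rank is skipped), and missing values are not ranked.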
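The tie and missing-value behaviour can be sketched in base R. The two expressions below are assumed to mirror `dplyr::min_rank()` and `dplyr::dense_rank()`; the vector `x` is made up for illustration.

```r
x <- c(10, 20, 20, 30, NA)   # hypothetical values with a tie and a missing value

# "min" ranking: ties share the smallest rank and the next rank is skipped
min_rank_base <- rank(x, ties.method = "min", na.last = "keep")

# dense ranking: ties share a rank and no rank is skipped
dense_rank_base <- match(x, sort(unique(x)))

min_rank_base    # 1 2 2 4 NA
dense_rank_base  # 1 2 2 3 NA
```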
-
-
jrnold.github.io jrnold.github.io
-
how these parameters affects
affect
-
use already
already use
-
The benefits encoding
Missing "of" between "benefits" and "encoding"
-
hwy vs. cyl
I think common notation is dependent variable (Y) versus independent variable (X), so if you're asking for a y = cyl and x = hwy plot, then I'd say that's cyl vs. hwy (cyl as a function of hwy), not vice versa as written.
-
- Sep 2019
-
jrnold.github.io jrnold.github.io
-
Unusually fast flights are those flights with the smallest standardized values.
Question Why would it be unusual that flights with air time on the left side of the distribution would be the fastest?
-
-
jrnold.github.io jrnold.github.io
-
the method used to
Is there not something missing here?
-
-
jrnold.github.io jrnold.github.io
-
not tail
Replace by "not ALL tail"?
-
of the both the origin and
Should be "of both the origin and..." :)
-
-
jrnold.github.io jrnold.github.io
-
Exercise 14.4.2.2
Exercise 14.4.2.2 Q 2 - this misses words that are at the end of the sentence with a period after them. I use this: str_extract(sentences, "[A-Za-z]+ing[ .]")
-
-
jrnold.github.io jrnold.github.io
-
we could use map() followed by flatten_dbl(),
The way this is written could be somewhat confusing to a reader, in my opinion, although the code makes the order of the functions clearer..
Suggestion:
If we wanted a numeric vector, we could combine the map() followed with the flatten_dbl(),
-
like so,
Edit suggestion.
like shown:
-
- Aug 2019
-
jrnold.github.io jrnold.github.io
-
x = "Highway MPG Relative to Class Average", y = "Engine Displacement"
Hi. is "Highway MPG Relative to Class Average" the name of Y-axis? and the name of X-axis is "Engine Displacement" because we put in X-axis with displ. thanks:)
-
-
jrnold.github.io jrnold.github.io
-
at
Minor mistake: a not at
-
not
"Not" repeated. Delete?
-
Neither NaN nor Inf are not numbers, and so they aren’t even numbers
Edit suggestion,
Neither NaN nor Inf is a number
Or
Both NaN and Inf are not even numbers.
If accepted, then "and so they aren’t even numbers" may be deleted as it becomes superfluous.
-
This is not the same as what you See the value of looking at the value of
Could you please check the sentence? There seems to be some ambiguity.
-
-
jrnold.github.io jrnold.github.io
-
Okay, I’m not sure what’s going on in this data.
-
-
jrnold.github.io jrnold.github.io
-
is
redundant. delete?
-
(!(x %% 3) && !(x %% 5))
Hi,
Please can you explain to me how to read this line of code and what it means. I'm having difficulty understanding it. Individually I do understand what each symbol means but put together as it is, I'm unable to. Moreover, it looks more efficient than my effort, as shown below.
Thank you.
fizzbuzz <- function(n) {
  x <- n
  if ((x %% 3 == 0) && (x %% 5 == 0)) {
    print("fizzbuzz")
  } else {
    if (x %% 3 == 0) {
      print("fizz")
    } else {
      if (x %% 5 == 0) {
        print("buzz")
      } else {
        print(x)
      }
    }
  }
}
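A short sketch of how the quoted condition can be read. `x %% 3` is 0 when `x` is divisible by 3; in a logical context 0 is treated as FALSE and any other number as TRUE, so `!(x %% 3)` is TRUE exactly when `x` is divisible by 3.

```r
x <- 15
x %% 3                     # 0
!(x %% 3)                  # TRUE, because !0 is TRUE
!(x %% 3) && !(x %% 5)     # TRUE: divisible by both 3 and 5

x <- 9
x %% 5                     # 4
!(x %% 3) && !(x %% 5)     # FALSE, because !4 is FALSE

# The short form agrees with the explicit comparison for 1 to 30
short    <- sapply(1:30, function(x) !(x %% 3) && !(x %% 5))
explicit <- sapply(1:30, function(x) (x %% 3 == 0) && (x %% 5 == 0))
identical(short, explicit)
```

The explicit `== 0` form in the comment above tests the same thing and is arguably easier to read.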
-
-
jrnold.github.io jrnold.github.io
-
function(.data)
Hi Jeffrey. First of all, I really appreciate these exercise solutions! I need your help: why did you name the argument ".data"? Is there any difference from "data"? When I used "data", the results were the same as with ".data". I tried to find a difference between them but couldn't. Please let me know: is there no difference?
-
-
jrnold.github.io jrnold.github.io
-
is that
Change word order perhaps? that is
-
which as a
which has
-
if
Minor mistake. Should read if it does not
-
-
jrnold.github.io jrnold.github.io
-
What does it mean for a flight to have a missing tailnum
This seems to be a tad long so I apologise in advance. This is a whole new field for me and I would really like to understand.
Could it be AA and MQ use different values to represent tailnum?
filter( planes, tailnum == 0 )
A tibble: 0 x 9
length(is.na( planes$tailnum ))
[1] 3322
nrow(planes)
[1] 3322
filter( flights, tailnum == 0 )
A tibble: 0 x 19
length(is.na( flights$tailnum ) )
[1] 336776
nrow( flights )
[1] 336776
Yet the anti_join() as shown in your code shows clearly that there are some tailnum values in flights that are not represented in the planes dataset. How could that be? The one explanation I could come up with is that the two datasets used different tailnum values, so I tried to investigate for AA and MQ.
tailnum_flights <- flights %>% filter( carrier == 'AA'| carrier == 'MQ' ) %>% select ( carrier, tailnum )
tailnum_planes <- planes %>% select( tailnum )
tailnum_planes %in% tailnum_flights
[1] FALSE
So, it looks like the tailnum values are not missing for the ten airlines but are represented with values different in the two datasets (flights and planes).
What are your thoughts? Thank you.
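One thing worth checking in the code above: `length(is.na(x))` always equals the length of `x`, because `is.na()` returns one TRUE or FALSE per element, so it cannot reveal how many values are missing. A minimal base R sketch with made-up tail numbers:

```r
# Hypothetical vector with two missing values
tailnum <- c("N101AA", NA, "N202MQ", NA, "N303UA")

length(is.na(tailnum))   # 5: one logical per element, missing or not
sum(is.na(tailnum))      # 2: the number of missing values

# %in% compares vectors, so pass columns rather than whole data frames
planes_tailnum <- c("N101AA", "N303UA")   # hypothetical lookup column
in_planes <- tailnum %in% planes_tailnum
in_planes
```

With the real data this would presumably be written as `flights$tailnum %in% planes$tailnum`, comparing the two columns directly.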
-
- Jul 2019
-
jrnold.github.io jrnold.github.io
-
a is missing
-
-
jrnold.github.io jrnold.github.io
-
In the full English language, no
Erm, full stop omitted after no.
And thanks, for the link. It is an interesting read.
-
Words that end with “-ed” but not ending in “-eed”
This worked for me as well:
str_view(stringr::words, ".*[^e]ed$", match = TRUE)
-
"ab$^$sfas"
Please, could you explain why you included this in the code? I replicated the answer without it. Thanks.
-
str_view(words, "([[:letter:]]).*\\1", match = TRUE)
Does this work? E.g. "achieve" does not have a matching PAIR of letters.
-
-
jrnold.github.io jrnold.github.io
-
avg_dest_delays <- flights %>% group_by(dest) %>% # arrival delay NA's are cancelled flights summarise(delay = mean(arr_delay, na.rm = TRUE)) %>% inner_join(airports, by = c(dest = "faa"))
Please, could you explain to me why if I pipe this directly to ggplot the colour aesthetics is not applied. See code below, it's basically a replication of yours but with ggplot directly piped to avg_dest_delays:
avg_dest_delays <- flights %>% group_by( dest ) %>% summarise( delay = mean( arr_delay, na.rm = TRUE )) %>% inner_join(airports, by = c( dest = "faa" ) ) %>% ggplot( aes(lon, lat, colour = delay ) ) + borders("state" ) + geom_point( ) + coord_quickmap( )
Thanks
-
any any
repetition?
-
join
This also seems to be the case where the by = argument is not used in a code. In that case, it seems, the semi_join() will give outputs only where the rows for both datasets correctly match, for example:
fueleconomy::vehicles %>% semi_join(fueleconomy::common)
produces the same output as:
fueleconomy::vehicles %>% semi_join(fueleconomy::common, by = c("make", "model"))
Or is that a coincidence?
But fueleconomy::vehicles %>% semi_join(fueleconomy::common, by = "make")
will produce a different output as R will match only by "make" in this example.
-
arr_delay
The code doesn't affect the output but I thought you might mean, sum( !is.na( dep_delay ))
-
There are few planes older than 30 years, so I combine them into a single category.
The code in the solution book totally dropped the data for planes age > 25. How might we combine them into a single row? I didn’t think it could be done without first defining a second tibble that contains all the planes age >25, then merge it with the first tibble that contains the data for planes age <= 25, before carrying out the summarise actions on them after applying the group_by = age argument. older_plane_cohorts <- inner_join( flights, select( planes, tailnum, plane_year = year ), by = "tailnum" ) %>% mutate(age = year - plane_year) %>% filter(!is.na(age)) %>% mutate(age = pmin(46, age) - pmin( 25, age ) ) %>% filter( age != 0 )
Then I got stuck. I can’t figure out how to proceed after that. And frankly, I can’t say with any certainty if my argument is sensible. And I’m not even sure it’s possible to combine the 17 rows into a single row. Your help will be appreciated.
Thank you in advance and for your help so far. Truly appreciated.
-
mutate(age = pmin(25, age))
I think you used this line of code to limit the selection to planes not older than 25 years. But in the text above, you stated, "There are few planes older than 30 years, so we combine them into a single category." So, I was expecting the selection to be age <= 30, or using your notation pmin(30, age) and not pmin(25, age). Perhaps an edit of the text may be required unless I'm wrong in my supposition?
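A small sketch of what `pmin()` does here, with made-up ages. It does not drop any rows: it replaces every value above the cap with the cap, so all older planes end up in a single group.

```r
age <- c(3, 12, 25, 31, 46)   # hypothetical plane ages

capped <- pmin(25, age)
capped          # 3 12 25 25 25

# After capping, the three oldest planes share one category
table(capped)
```

Changing the cap to 30, as the text's wording suggests, would only require `pmin(30, age)`.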
-
This however, this default
Edit suggestion: However, this default...
-
If we needed a unique identifier for our analysis, could add a surrogate key.
Hi, I hope this doesn't come across as nitpicking:
If we needed a unique identifier for our analysis, we could add a surrogate key, perhaps?
-
faa$airports
Shouldn't this be "airports$faa" since airports is a data frame and faa is a variable?
-
-
jrnold.github.io jrnold.github.io
-
It looks like it is possible for certain variables to missing for (country, years).
"It looks like it is possible for certain variables to missing for (country, years)." Edit suggestion: It looks like it is possible that certain variables are missing for (country, years).
-
is
delete?
-
-
jrnold.github.io jrnold.github.io
-
run
run:
-
Using $
Using $,
-
-
jrnold.github.io jrnold.github.io
-
ggplot
Please, can you explain to me what I am getting wrong in the below code and especially the error message, which I have tried googling to no success.
I tried to see if instead of count to show, in my practice exercise, the average price per carat group using the code below:
group_by( diamonds, carat ) %>% summarise( avg_price = mean(price ) ) %>% ggplot( ) + mapping = aes( color = cut_width(carat, 5 ), x = avg_price ) + geom_freqpoly( )
The code returns the following error:
Error in group_by(diamonds, carat) %>% summarise(avg_price = mean(price)) %>% : could not find function "+<-"
I've googled the error without success. So my confusion is in two parts:
- What is the right code to show the average price per carat type
- What does the above error mean?
Thanks in advance
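On the second question, the error can be reproduced without ggplot2. Writing `something + name = value` makes R read the whole left-hand side as the target of an assignment, so it looks for a replacement function called `+<-`, which does not exist. A base R sketch (the variable names are arbitrary):

```r
x <- 1

# `x + y = 3` is parsed as an assignment to the expression `x + y`
msg <- tryCatch(
  eval(parse(text = "x + y = 3")),
  error = function(e) conditionMessage(e)
)
msg
```

In the plot code above, this suggests that `mapping = aes(...)` belongs inside the `ggplot()` call (or inside the geom) rather than after a `+`.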
-
visualization
Does anyone else notice that the R4DS uses British spelling for visualisation, but the Solution textbook uses the American spelling. It's a bit weird when one switches directly from the text book to the solution. I'm sure not many people notice the difference. I probably do because I generally write Brit English. By the way, I am not censuring, just expressing my thought. I am really grateful to the author for making this available.
-
there spikes in
Omission of are? Perhaps, you meant there are?
-
the these
Typo. Either the or these, preferably these in my opinion.
-
-
jrnold.github.io jrnold.github.io
-
n
Can anyone please explain the n in this code? Thanks in advance.
-
sin()
cos( )
-
sin)
sin( )
-
head(arrange(fastest_flights, desc(mph)))
why not just use: arrange(flights, desc(distance / air_time))
-
-
jrnold.github.io jrnold.github.io
-
Year
Class
-
- Jun 2019
-
jrnold.github.io jrnold.github.io
-
In dep_time, midnight is represented by 2400, not 0.
Recommendation: For new users of R, how one determines the representation of midnight as 2400 should be explained. At this point in the book, a reader would not have been taught how to do the proper exploration to make this determination alone, so it would seem that explaining how the author of the solutions book acquired that knowledge would be most beneficial.
-
-
jrnold.github.io jrnold.github.io
-
(date - 1L) %in% holidays_2013$date ~ "day before holiday", (date + 1L) %in% holidays_2013$date ~ "day after holiday",
I think you mixed up these two.
(date + 1L) %in% holidays_2013$date ~ "day before holiday", (date - 1L) %in% holidays_2013$date ~ "day after holiday",
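The direction of the offsets can be checked with a single holiday. A base R sketch using 4 July 2013; the nested `ifelse()` stands in for the solution's `case_when()`, and the variable names are made up.

```r
holiday <- as.Date("2013-07-04")
dates <- holiday + c(-1, 0, 1)   # 3 July, 4 July, 5 July

# If tomorrow (date + 1) is a holiday, today is the day before it
label <- ifelse(dates %in% holiday, "holiday",
         ifelse((dates + 1) %in% holiday, "day before holiday",
         ifelse((dates - 1) %in% holiday, "day after holiday", "other")))

data.frame(date = dates, label = label)
```

This agrees with the swapped version proposed in the comment: `date + 1` identifies the day before a holiday and `date - 1` the day after.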
-
(α+βx)
should be $$\beta \log x$$
-
-
jrnold.github.io jrnold.github.io
-
the more that pre-allocation will outperform appending.
Based on the output, it seems to me that appending outperforms pre-allocation. I ran the code myself and got the same results. However, another R notebook gave the opposite results.
-
-
jrnold.github.io jrnold.github.io
-
>
Omit.
-
Exercise 7.4.1
Many thanks for this lovely example. I hadn't understood geom_bar's NA behavior with factor variables until your explanation.
-
- May 2019
-
jrnold.github.io jrnold.github.io
-
Presumably there is a premium for a 1 carat diamond
Is this "buyer's psychology?" The opposite of $4.99 being cheaper than $5.00?
-
number of diamonds in each carat range
Nice! Your intuition seems correct. But how do you account for the large number of 1.01 carat diamonds?
-
print(n = 30)
print(n = Inf) will print all rows of a tibble.
-
seem
see
-
There are no diamonds with a price of $1,500
More precisely: there's a $90 gap: the closed interval [1455,1545].
-
Explore the distribution
I wonder if this includes identifying outliers for possible data errors. E.g. x = 0 for 8 diamonds and z = 0 for 20. Also z = 31.8 for one. y = 0 for 7 diamonds and y = 31.8 and y = 58.9 for one each. I believe all are data errors.
-
mutate(id = row_number())
I believe the plots can be generated without id:
diamonds %>% select(x, y, z) %>% gather(variable, value) %>% ggplot(aes(x = value)) + geom_density() + geom_rug() + facet_grid(vars(variable))
-
will
Is putting "will" and "is" together deliberate or an error?
-
-
jrnold.github.io jrnold.github.io
-
bottles <- function(i) { if (i > 2) { bottles <- str_c(i - 1, " bottles") } else if (i == 2) { bottles <- "1 bottle" } else { bottles <- "no more bottles" } bottles }
I think this should read:
bottles <- function(i) { if (i > 1) { bottles <- str_c(i , " bottles") } else if (i == 1) { bottles <- "1 bottle" } else { bottles <- "no more bottles" } bottles }
Otherwise you get no bottles of beer twice in the final output
-
map_lglg
typo
-
is.factor(diamonds$color)
repetition?
-
mean(X[i, ])
I think you forgot to change mean(X[i, ]) to mean(X[, i]) when calculating column means.
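A minimal sketch of the column-means loop with the indexing the comment proposes, on a made-up matrix, checked against `colMeans()`.

```r
X <- matrix(1:6, nrow = 2)   # hypothetical 2 x 3 matrix

col_means <- vector("numeric", ncol(X))
for (i in seq_len(ncol(X))) {
  # X[, i] is column i; X[i, ] would be row i
  col_means[[i]] <- mean(X[, i])
}

col_means   # 1.5 3.5 5.5
```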
-
df[[i]] <- read_csv(files[[i]])
I think it should be corrected this way: for (fname in ...) { df[[fname]] <- bind_rows()
-
-
jrnold.github.io jrnold.github.io
-
For each plane, count the number of flights before the first delay of greater than 1 hour.
I think this is a lovely use of logicals and cumsum. But I believe that it omits planes whose first flight is delayed by more than an hour. There are 234 of these:
(zero <- flights %>% filter(!is.na(dep_delay)) %>% arrange(tailnum,month,day) %>% group_by(tailnum) %>% mutate(delay_gt1hr = dep_delay > 60) %>% filter(row_number() == 1,delay_gt1hr) %>% select(tailnum) %>% mutate(n = 0) )
Then bind_rows could concatenate these to the tibble produced by the posted solution. This is a "two-part" solution. Is there a "one-part" solution?
-
year
Since the year column contains only 2013, you could arrange chronologically without this argument.
-
We will calculate this ranking in two parts
Or, the two parts could be combined:
rank <- flights %>% group_by(dest) %>% mutate(n_carriers = n_distinct(carrier)) %>% filter(n_carriers > 1) %>% group_by(carrier) %>% summarize(n_dest = n_distinct(dest)) %>% arrange(desc(n_dest))
-
width = 100
Here width = Inf will print all columns.
-
print(width = 120)
I believe that print(width = Inf) will reliably print all selected columns.
-
below than the
below the
-
observations each
observations of each
-
-
jrnold.github.io jrnold.github.io
-
color
colour
change from "color" to "colour" to be consistent within the paragraph
-
mpg
drv
-
mpg
drv
-
mpg
drv
-
-
jrnold.github.io jrnold.github.io
-
five
'five' is an error.
-
-
jrnold.github.io jrnold.github.io
-
#> Warning: Computation failed in `stat_binhex()`:
Why are these two plots not displayed properly? Only an empty background is shown.
-
- Apr 2019
-
jrnold.github.io jrnold.github.io
-
Surprisingly, it appears that depth (z) is always smaller than length (x) or width (y). Length is less than width in more than half the observations, the opposite of expectations. I don’t know what’s going on. If this was not a widely used da
- If you look at this picture, it seems that depth is always smaller than either width or length. Maybe it is easier to plug it into a ring or other jewelry...
- Also you said "1. length is less than width, otherwise the width would be called the length". actually, it is the other way round.
-
geom_histogram(binwidth = 1, center = 0) + geom_bar()
I do not understand why geom_bar is used in addition to geom_histogram. It creates two layers which are essentially the same, right?
-
-
jrnold.github.io jrnold.github.io
-
# … with 1 more row
You might want to check this last row of the tibble that's not displayed. Saturday actually has the shortest delays of any day of the week. You'll see it right away if you plot it:
flights_dt %>% mutate(wday = wday(dep_time, label = TRUE)) %>% group_by(wday) %>% summarize(ave_dep_delay = mean(dep_delay, na.rm = TRUE)) %>% ggplot(aes(x = wday, y = ave_dep_delay)) + geom_bar(stat = "identity")
flights_dt %>% mutate(wday = wday(dep_time, label = TRUE)) %>% group_by(wday) %>% summarize(ave_arr_delay = mean(arr_delay, na.rm = TRUE)) %>% ggplot(aes(x = wday, y = ave_arr_delay)) + geom_bar(stat = "identity")
-
%%
Shouldn't this be %/%?
-
-
jrnold.github.io jrnold.github.io
-
skewness <- function(x, na.rm = FALSE) { n <- length(x) m <- mean(x, na.rm = na.rm) v <- var(x, na.rm = na.rm) (sum(x - m)^3 / (n - 2)) / v^(3 / 2) }
This function always returns 0. This is because sum(x - m), which is evaluated before being raised to the power 3, will always be 0. Resolved with additional parentheses:
skewness <- function(x, na.rm = FALSE) { n <- length(x) m <- mean(x, na.rm = na.rm) v <- var(x, na.rm = na.rm) (sum((x - m)^3) / (n - 2)) / v^(3 / 2) }
I appreciate there are several possible formulas for skewness which may not match this one.
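The difference between the two versions is easy to see on a tiny sample. A sketch with made-up data; the `(n - 2)` divisor is kept as in the quoted code, and the function names are introduced only for this comparison.

```r
x <- c(1, 2, 3, 10)   # hypothetical right-skewed sample
m <- mean(x)          # 4

sum(x - m)^3      # 0: deviations from the mean cancel before cubing
sum((x - m)^3)    # 180: each deviation is cubed first, then summed

skewness_orig <- function(x, na.rm = FALSE) {
  n <- length(x)
  m <- mean(x, na.rm = na.rm)
  v <- var(x, na.rm = na.rm)
  (sum(x - m)^3 / (n - 2)) / v^(3 / 2)
}

skewness_fixed <- function(x, na.rm = FALSE) {
  n <- length(x)
  m <- mean(x, na.rm = na.rm)
  v <- var(x, na.rm = na.rm)
  (sum((x - m)^3) / (n - 2)) / v^(3 / 2)
}

skewness_orig(x)    # 0
skewness_fixed(x)   # positive for this sample
```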
-
-
jrnold.github.io jrnold.github.io
-
I
Omit.
-
select(tailnum, on_time, arr_time, arr_delay) %>%
If you wish, this could be eliminated since the only columns that survive are those in the summarise.
-
flights
flights at departure
-
the
omit
-
-
jrnold.github.io jrnold.github.io
-
?print.tbl_df
?print.tbl instead of ?print.tbl_df (archived)
-
-
jrnold.github.io jrnold.github.io
-
.
or
mtcars %>% map_dbl(mean)
-
-
jrnold.github.io jrnold.github.io
-
you
You would do You would use
-
- Mar 2019
-
jrnold.github.io jrnold.github.io
-
arrange(flights, distance / air_time * 60)
Do we need a descending order to put the fastest flights first? arrange(flights, desc(distance / air_time * 60)) See also: flights %>% mutate(speed = distance / air_time * 60) %>% select(tailnum, distance, air_time, speed, dep_time) %>% arrange(desc(speed))
-
Look at the number of cancelled flights per day. Is there a pattern?
This part is missing
-
benificial
beneficial
-
were
omit
-
ungroup() %>%
I believe you don't need ungroup().
-
an standard deviation. That
"and standard deviation. The following"
-
sinpi()
Typo. I suspect you mean cospi(), because "sinpi(x)" had just been introduced
-
Unusually fast flights are those flights with the smallest standardized values.
I believe the problem asks for "fast flights per destination". If so, this code produces that list, including ties. Sadly, it uses slice.
fast <- flights %>% filter(!is.na(air_time)) %>% group_by(dest) %>% mutate(z = (air_time - mean(air_time)) / sd(air_time)) %>% slice(which(z == min(z))) %>% select(origin,dest,month,day,carrier,flight,air_time,z) %>% arrange(z)
-
ggplot(standardized_flights, aes(x = air_time_standard)) + geom_density()
This is a nice plot, but the image below is not it. It appears to have been cut-and-pasted from a previous plot.
-
standardized_flights
Somewhat more succinctly:
standardized_flights <- flights %>% filter(!is.na(air_time)) %>% group_by(dest,origin) %>% mutate(air_time_standard = (air_time - mean(air_time)) / sd(air_time)) %>% select(origin,dest,month,day,carrier,flight,air_time,air_time_standard)
-
-
jrnold.github.io jrnold.github.io
-
geom_hist
geom_histogram
-
preprocess input
preprocesses input...
-
-
jrnold.github.io jrnold.github.io
-
scatterplot visualize
"to" is missing
-
-
jrnold.github.io jrnold.github.io
-
mod4
mod3
-
-
jrnold.github.io jrnold.github.io
-
filter(n > 100)
This should be filter(n >= 100): "at least 100 flights" implies that 100 should be included as the least possible value of number of flights.
-
-
jrnold.github.io jrnold.github.io
-
Both R markdown files and can be knit R markdown documents can be knit.
This part does not make sense . Needs to be reviewed.
-
- Feb 2019
-
jrnold.github.io jrnold.github.io
-
in 10 flights
Omit, since you've already filtered.
-
min_rank() and dense_rank()
English is inherently ambiguous while only math (vectors, subscripts) and algorithms are not. But then you'd lose your reader.
Here's a slight rephrasing where rank refers to a component of, say, min_rank(x) and value a component of x.
For each set of tied values the min_rank() function assigns a rank equal to the number of values less than that tied value plus one. In contrast, the dense_rank() function assigns a rank equal to the number of distinct values less than that tied value plus one.
-
ranking handles
the ranking functions handle
-
function
functions
-
roughly
Is it "rough" or "exact"?
-
an element
a vector
-
The row_number() function is confusingly named. It can create ranks for any column.
I might omit this since you give an explanation of row_number() below which is accurate. Also, tibble adds its own row numbers, and row_number(x) appears to be "the row number after x is sorted", which I don't find confusing.
-
missing values
ties
-
na.rm = TRUE
This probably is intended to be an argument of mean(). It's not necessary, however, since neither dep_delay nor dep_delay_lag has any NAs. Also, could you provide an interpretation of delay_diff? For example, what does it mean that JFK's is negative?
-
year,
Could be eliminated since the dataset is for one specific year.
-
flight
Another definition of flight could be the quad (carrier,flight,origin,dest) which, say, might fly daily. Here's a solution with this definition.
dest_delay <- flights %>% group_by(dest) %>% filter(arr_delay > 0) %>% summarize(dest_ad = sum(arr_delay,na.rm=T),dest_ct = n()) flight_delay <- flights %>% group_by(carrier,flight,origin,dest) %>% filter(arr_delay > 0) %>% summarize(flight_ad = sum(arr_delay,na.rm=T),flight_ct = n()) prop <- as_tibble(merge(dest_delay,flight_delay,by="dest")) %>% mutate(prop = flight_ad / dest_ad) %>% select(carrier,flight,origin,dest,dest_ct,flight_ct,prop)
-
Exercise 5.7.4
In general selecting just a few columns (often the mutate variables) makes the result clearer since "irrelevant" cols are eliminated.
-
!is.na(arr_delay),
This could be omitted but keeping it in redundantly may add clarity.
-
only
Omit
-
However, there are many planes that have never flown an on-time flight.
Since this minimum rank group has an on_time of 0.0, I believe that this tibble is all planes which have never flown an on-time flight.
-
<=
==
I believe min_rank cannot be 0.
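A quick check of that claim, as a sketch on a made-up vector with one tie (assumes dplyr is installed):

```r
library(dplyr)

# Toy values with one tie; not taken from the flights data
x <- c(10, 20, 20, 30)

mr <- min_rank(x)    # ties share the smallest rank; the next rank is skipped
dr <- dense_rank(x)  # ties share a rank; no ranks are skipped

print(mr)
print(dr)
print(min(mr))       # the smallest rank that min_rank() assigns here
```

The smallest rank is 1, so a condition of the form min_rank(...) <= 1 selects the same rows as min_rank(...) == 1.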
-
arr_delay > 0
(arr_delay > 0)
Of course precedence "does the right thing" but since R will use coercion to give lgl & dbl a value, putting in the redundant parentheses prevents the not-careful reader (me, for example) from "thinking the wrong thing."
-
cancelled
In this dataset the presence of arr_delay implies the presence of arr_time so the boolean cancelled could be eliminated. Of course, the reader hasn't verified this relation so keeping it in makes sense. The definition of cancelled as "having an arrival delay" and "not having an arrival time" seems a bit odd.
-
They operate within each group rather than over the entire data frame
Do all functions on grouped tibbles operate only per group and not the entire tibble? E.g.
flights %>% group_by(month,day) %>% arrange(desc(dep_delay))
Will the arrange function order both by group and by the entire tibble?
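A small sketch that may help answer this. It uses a made-up tibble rather than flights and assumes dplyr is installed. By default arrange() ignores the grouping and sorts the whole tibble; with .by_group = TRUE it sorts by the grouping variables first.

```r
library(dplyr)

# Toy data: two groups, values chosen only for illustration
df <- tibble(g = c("a", "a", "b", "b"), x = c(1, 4, 3, 2))

# Default: grouping is ignored, rows are sorted over the entire tibble
whole <- df %>% group_by(g) %>% arrange(desc(x))

# .by_group = TRUE: sort by g first, then by desc(x) within each group
within <- df %>% group_by(g) %>% arrange(desc(x), .by_group = TRUE)

print(whole$x)
print(within$x)
```

So arrange() is an exception: verbs such as mutate(), filter(), and summarise() work per group, while arrange() does so only when asked.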
-
There are more sophisticated ways to do this analysis
Perhaps, but this is a lovely example of the power of R.
-
, which calculates
". atan2(y, x) returns"
My worry is that the reader would assume that atan2(x, y) returns that angle.
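To illustrate the argument order: in base R the two-argument arctangent is atan2(y, x), with y first. A minimal sketch using the two axis directions:

```r
# Point (x = 0, y = 1) lies on the positive y-axis
up <- atan2(1, 0)     # y first, then x

# Point (x = 1, y = 0) lies on the positive x-axis
right <- atan2(0, 1)

print(up)
print(right)
```

Swapping the arguments swaps these two results, which is exactly the mistake a reader could make.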
-
not_cancelled
Hadley defines this as
not_cancelled <- flights %>% filter(!is.na(dep_delay), !is.na(arr_delay))
But in
flights %>% filter(is.na(dep_time) | is.na(arr_time) | is.na(dep_delay) | is.na(arr_delay)) %>% select(sched_arr_time,arr_time,arr_delay,sched_dep_time,dep_time,dep_delay)
the first 6 flights have an NA only in arr_delay. Aren't these likely not to have been cancelled and simply have a missing data entry? Why isn't
not_cancelled <- flights %>% filter(!is.na(dep_time), !is.na(arr_time))
a "better" definition?
-
and the passenger arrived at the same time
Omit.
-
Exercise 5.5.4
Another solution. Now that we know we can "trust" dep_delay we could simply
md <- flights %>% select(carrier,flight,origin,dest,sched_dep_time,dep_time,dep_delay) %>% arrange(min_rank(desc(dep_delay)))
There are no ties in the top 10.
-
Daylight
If you wish, you could change this to "Eastern Daylight" to agree with its previous reference in that sentence.
-
Except for flights daylight savings started (March 10) or ended (November 3)
I understand the point of the paragraph but this sentence is a bit unclear.
-
-
jrnold.github.io jrnold.github.io
-
sum(cases)
na.rm = TRUE must be passed inside the sum() call for the graph below to appear; otherwise the graph will be blank.
By the way, thanks so much for these solutions. They are extremely helpful and teach things which are not taught in the book!
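The effect of na.rm can be seen in a minimal base-R sketch. The vector cases below is made up and only stands in for the real column:

```r
# Hypothetical case counts with one missing value
cases <- c(1, NA, 3)

with_na    <- sum(cases)                # a single NA makes the total NA
without_na <- sum(cases, na.rm = TRUE)  # NA is dropped before summing

print(with_na)
print(without_na)
```

When a total is NA there is nothing to plot for that point, which would explain an empty graph.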
-