- Dec 2021
-
jrnold.github.io jrnold.github.io
-
no relationship or only a small negative relationship
This is not quite true; it only appears so because of the choice of scale. If you calculate the correlation coefficient between displ and hwy within each class, its absolute value is smaller than 0.5 only for 2-seaters and minivans. The p-value of the linear model is likewise larger than 0.05 only for 2-seaters and minivans.
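A minimal sketch of the per-class check, assuming the mpg dataset from ggplot2. The name class_fit is only illustrative. For a simple linear regression of hwy on displ, the slope's p-value equals the p-value of the Pearson correlation test, so cor.test() is used for both numbers here.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

# Correlation between displ and hwy within each class, with its p-value.
class_fit <- mpg %>%
  group_by(class) %>%
  summarise(
    n = n(),
    r = cor(displ, hwy),
    p_value = cor.test(displ, hwy)$p.value,
    .groups = "drop"
  ) %>%
  arrange(abs(r))  # weakest relationships first

print(class_fit)
```

Classes with few cars (2-seaters have only a handful of rows) give very uncertain estimates, so the p-values there should be read with care.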
-
better model
We can add a new variable: price per carat. It drops just before each "nice" carat weight and then rises sharply, so we can try to include it in the model in some form. Table and depth seem to influence price too.
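A rough sketch of how price per carat could be inspected, assuming the diamonds dataset from ggplot2. The names price_per_carat, ppc_by_carat, and ppc_plot are my own, and the cutoff at 2.5 carats is an arbitrary choice to keep the sparse tail out of the plot.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Median price per carat at each carat weight (carat is recorded to 0.01).
ppc_by_carat <- diamonds %>%
  mutate(price_per_carat = price / carat) %>%
  filter(carat <= 2.5) %>%
  group_by(carat) %>%
  summarise(
    n = n(),
    median_ppc = median(price_per_carat),
    .groups = "drop"
  )

# Dashed lines mark a few "nice" weights where a jump might be expected.
ppc_plot <- ggplot(ppc_by_carat, aes(x = carat, y = median_ppc)) +
  geom_line() +
  geom_vline(xintercept = c(0.5, 1, 1.5, 2), linetype = "dashed")

print(ppc_plot)
```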
-
- Nov 2021
-
discrepancies
In all cases of discrepancy, the error is exactly 24 hours. This means the flight was postponed until the next day, but dep_time erroneously gives the same day.
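The size of the discrepancy can be checked with something like the following. This is a sketch, assuming the nycflights13 flights table and lubridate; make_datetime_hhmm is a hypothetical helper that builds a date-time from the HHMM integers, always on the scheduled date.

```r
library(dplyr)
library(lubridate)
library(nycflights13)

# Hypothetical helper: date-time from year/month/day plus an HHMM integer.
make_datetime_hhmm <- function(year, month, day, hhmm) {
  make_datetime(year, month, day, hhmm %/% 100, hhmm %% 100)
}

discrepancies <- flights %>%
  filter(!is.na(dep_time), !is.na(dep_delay)) %>%
  mutate(
    dep_dt      = make_datetime_hhmm(year, month, day, dep_time),
    sched_dt    = make_datetime_hhmm(year, month, day, sched_dep_time),
    expected_dt = sched_dt + minutes(dep_delay),
    # Difference between recorded and expected departure, in hours.
    diff_hours  = as.numeric(difftime(dep_dt, expected_dt, units = "hours"))
  ) %>%
  filter(diff_hours != 0)

# Tabulate the distinct discrepancy sizes.
count(discrepancies, diff_hours)
```

If the table shows only whole-day values, the mismatch comes from the date being attached to the wrong day rather than from the clock times themselves.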
-
doesn’t appear to much difference
Another, completely different approach is to calculate distribution parameters for each day (I used quartiles) and plot them against time.
The median declines very slowly over the year. Although the decline is slow, linear regression analysis shows that the trend is statistically significant.
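A sketch of the quartile approach. The comment does not say which variable was summarised, so this assumes it is the departure time of day in hours (dep_hour, a name I made up); daily_quartiles and trend are illustrative names as well.

```r
library(dplyr)
library(tidyr)
library(ggplot2)
library(lubridate)
library(nycflights13)

# Quartiles of the departure time of day (in hours) for each calendar day.
daily_quartiles <- flights %>%
  filter(!is.na(dep_time)) %>%
  mutate(
    date = make_date(year, month, day),
    dep_hour = dep_time %/% 100 + (dep_time %% 100) / 60
  ) %>%
  group_by(date) %>%
  summarise(
    q25    = quantile(dep_hour, 0.25, names = FALSE),
    median = median(dep_hour),
    q75    = quantile(dep_hour, 0.75, names = FALSE),
    .groups = "drop"
  )

# One line per quartile, plotted against the date.
quartile_plot <- daily_quartiles %>%
  pivot_longer(-date, names_to = "quartile", values_to = "dep_hour") %>%
  ggplot(aes(x = date, y = dep_hour, colour = quartile)) +
  geom_line()
print(quartile_plot)

# Linear trend of the daily median over the year.
trend <- lm(median ~ as.numeric(date), data = daily_quartiles)
summary(trend)$coefficients
```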
-
replacements <- c("A" = "a", "B" = "b", "C" = "c", "D" = "d", "E" = "e", "F" = "f", "G" = "g", "H" = "h", "I" = "i", "J" = "j", "K" = "k", "L" = "l", "M" = "m", "N" = "n", "O" = "o", "P" = "p", "Q" = "q", "R" = "r", "S" = "s", "T" = "t", "U" = "u", "V" = "v", "W" = "w", "X" = "x", "Y" = "y", "Z" = "z")
An easy way to do it without typing the entire alphabet:
alphabet <- letters
names(alphabet) <- LETTERS
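For illustration, here is how the named vector could be used, assuming it is passed to stringr's str_replace_all() as in the quoted line. setNames() does the same as the two lines above in one step, and the input string is made up.

```r
library(stringr)

# Named vector: names are the patterns (upper case), values the replacements.
alphabet <- setNames(letters, LETTERS)

lowered <- str_replace_all("Hello WORLD", alphabet)
print(lowered)

# For plain case conversion, base R's tolower() gives the same result here.
identical(lowered, tolower("Hello WORLD"))
```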
-
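Strictly speaking, the printed result of this code shows forward slashes replaced with double backslashes. This is probably not a bug in R: print() displays a single backslash in escaped form as \\, while writeLines() shows the string as it really is, with single backslashes.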
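A quick way to check what the string really contains, assuming the code in question is a str_replace_all() call like the one below (the input string is only an example). In the replacement, a backslash is itself an escape character, so "\\\\" in R source stands for one literal backslash in the output.

```r
library(stringr)

x <- "past/present/future"
replaced <- str_replace_all(x, "/", "\\\\")

print(replaced)       # escaped display: "past\\present\\future"
writeLines(replaced)  # actual content:  past\present\future

# Same length as the input, so each "/" became exactly one character.
nchar(replaced) == nchar(x)
```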
-
hard to say
I can only add that in most (2/3) of these 48 hours, wind speed, wind gust, or visibility was worse than the mean wind speed or mean visibility, respectively.
-
flights
I would also filter out cancelled flights with filter(!is.na(arr_time)). Flights missing a tail number are then dropped as well, because they all have a missing arr_time; count() on its own would keep NA as a separate group.
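A small sketch of that filter, assuming the nycflights13 flights table. The threshold of 100 flights and the names completed and planes_gte100 are assumptions taken from the exercise being discussed, not from this comment.

```r
library(dplyr)
library(nycflights13)

# Keep only flights that actually arrived (drops cancelled flights).
completed <- flights %>%
  filter(!is.na(arr_time))

# Number of flights per tail number, keeping planes with at least 100.
planes_gte100 <- completed %>%
  count(tailnum) %>%
  filter(n >= 100)

# For comparison: without the filter, count() keeps NA as its own group.
flights %>%
  count(tailnum) %>%
  filter(is.na(tailnum))
```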
-
What weather conditions
Funny enough, the condition that makes it most likely to see a delay is normal atmospheric pressure!
# Join departure delays to the hourly weather at the origin airport
weather_delay_dep <- flights %>%
  select(dep_delay, origin, time_hour) %>%
  filter(!is.na(dep_delay)) %>%
  inner_join(weather, by = c("origin", "time_hour")) %>%
  filter(!is.na(pressure))
# Median delay for each pressure value, rounded to a whole number
pressure_delay <- weather_delay_dep %>%
  mutate(pressure = round(pressure)) %>%
  group_by(pressure) %>%
  summarise(delay = median(dep_delay))
pressure_delay %>%
  ggplot(aes(x = pressure, y = delay)) +
  geom_line() +
  geom_point()
You can see it even before calculating median delay for each pressure value (I use median rather than mean because delay distributions are skewed):
weather_delay_dep %>% ggplot(aes(x = pressure, y = dep_delay)) + geom_point() + geom_smooth()
-
year > 1995
What happened in 1995? The global number of cases increased by two orders of magnitude.
-
summarize
Also, some countries have missing age groups in some years.
-
for years prior to the existence of the country
There are other missing years that are not related to the existence of the country. If you group by country and then call summarise(years = unique(year)), you will see that Albania has data for 1995 and 1997 but no data for 1996, Algeria has data for 1997 and 1999 but not for 1998, etc.
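One possible way to list such gaps directly, assuming the who dataset from tidyr and that rows with missing counts have already been dropped, as in the tidying step under discussion. The names observed_years, year_gaps, and gap are illustrative. distinct() is used instead of summarise() because recent dplyr versions warn when summarise() returns more than one row per group.

```r
library(dplyr)
library(tidyr)

# Country-year pairs that have at least one non-missing case count.
observed_years <- who %>%
  pivot_longer(
    cols = new_sp_m014:newrel_f65,
    names_to = "key",
    values_to = "cases",
    values_drop_na = TRUE
  ) %>%
  distinct(country, year)

# A gap is a jump of more than one year between consecutive observed years.
year_gaps <- observed_years %>%
  group_by(country) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(gap = year - lag(year)) %>%
  ungroup() %>%
  filter(gap > 1)

print(year_gaps)
```

Each row gives the first observed year after a gap, so a gap of 2 means exactly one missing year before it.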
-
First
The dataset description says that it describes round diamonds so two dimensions should be almost identical. A good starting point could be plotting frequency distributions of all three dimensions together:
dimensions <- diamonds %>%
  pivot_longer(cols = c(x, y, z), names_to = "dimension", values_to = "value")
dim_short <- dimensions %>%
  filter(value <= 10)
ggplot(data = dim_short, mapping = aes(x = value, colour = dimension)) +
  geom_freqpoly(binwidth = 0.01)
Lines for x and y are practically identical, while z is shifted to the left. It does not prove that these dimensions are equal for each diamond but is a good indicator.
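To look at individual diamonds rather than the overall distributions, x and y can be compared row by row. This is only a sketch, assuming the diamonds dataset from ggplot2; rel_diff is a made-up name, and the 5% cutoff is arbitrary.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Relative difference between x and y for each diamond (zeros excluded).
xy_diff <- diamonds %>%
  filter(x > 0, y > 0) %>%
  mutate(rel_diff = abs(x - y) / pmax(x, y))

summary(xy_diff$rel_diff)

# Diamonds whose x and y differ by more than 5% are worth a closer look.
xy_diff %>%
  filter(rel_diff > 0.05) %>%
  select(carat, cut, x, y, z, rel_diff)
```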
-