Correlation matrix
Maybe not the best dataset for a correlation matrix, but this is a must for the initial analysis of any dataset: getting basic info, descriptive statistics, the distribution of each column, and the correlation matrix.
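For reference, a minimal sketch of that initial pass, assuming the data sits in a DataFrame called data (the file name here is hypothetical):

import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv("wimbledon_2019_final.csv")  # hypothetical file name

data.info()                    # dtypes, non-null counts, memory usage
print(data.describe())         # descriptive statistics of numeric columns

data.hist(figsize=(14, 10), bins=30)  # distribution of each numeric column
plt.tight_layout()
plt.show()

print(data.corr(numeric_only=True))   # correlation matrix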
The median point duration is about 40 seconds, with the three longest points, around 5 minutes each, played in the middle of the game. One fourth of all points ended in less than 25 seconds, and over one fourth lasted more than 50 seconds. With a match duration of nearly 5 hours, it is without a doubt one of the longest, if not the longest, Wimbledon finals.
great summary
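For the record, those numbers are quick to reproduce, assuming point_duration is in seconds and data is the frame read above:

print(data["point_duration"].quantile([0.25, 0.5, 0.75]))  # 25th/50th/75th percentile
print(data["point_duration"].nlargest(3))                  # the three longest points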
maybe avoiding mistakes
yup, they tend to risk less, so you go for the central path. Good catch.
First and second service direction
Thumbs up for this analysis. It's not something that's just sitting in a column as an integer, so it needs some additional work. Great job :)
result of this set (1-6) didn't actually mean that Đoković "gave
And a great conclusion. It's also important to have this kind of insight because we might want to exclude this set from some specific analyses as an outlier.
There are no NaNs in the dataset
There's a fantastic example of how much you can achieve with only a single column, one that the majority of people would just neglect and disregard. If you're acquainted with the Titanic dataset and its ML prediction tasks, check out this notebook, it's a jewel :) https://www.kaggle.com/ccastleberry/titanic-cabin-features/notebook
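To give a taste of the idea, a rough sketch of squeezing features out of the Titanic 'Cabin' column alone (the linked notebook goes much further):

import pandas as pd

titanic = pd.read_csv("train.csv")  # the standard Kaggle Titanic training file

# deck letter is the first character of the cabin code; missing cabins become 'U' (unknown)
titanic["Deck"] = titanic["Cabin"].str[0].fillna("U")

# whether a cabin was recorded at all is informative by itself
titanic["HasCabin"] = titanic["Cabin"].notna().astype(int)

print(titanic.groupby("Deck")["Survived"].mean())  # survival rate per deck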
plt.show()
Great, simple chart giving the temporal dynamics. A further idea could be to involve more features (winners, rallies, forced errors, etc.), maybe combining them into a single metric.
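As a sketch of that combined-metric idea; the column names p1_winners and p1_unforced_errors are hypothetical placeholders for whatever the dataset actually calls them:

import matplotlib.pyplot as plt

# hypothetical per-point indicator columns; a rolling sum gives a momentum-like curve
momentum = (data["p1_winners"] - data["p1_unforced_errors"]).rolling(window=20, min_periods=1).sum()

plt.plot(momentum)
plt.xlabel("point index")
plt.ylabel("winners minus unforced errors (rolling 20-point sum)")
plt.show()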
Federer had 250% more aces than Nole
Nole likes things the hard way :) typical Serb :D But it's a good match to illustrate the "importance" or "relativity" of statistics.
Let's say that the
I like how you handled the outliers. Made reasonable assumptions, documented them and moved on with the analysis.
detected_ballhit_count
good observation; also, 'detected_ballhit_count' is the output of the 'ball_hit' algorithm, so the discrepancy is probably down to the algorithm's accuracy. A good check would be to compare against ball hits inferred from the 'point_description' column.
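A sketch of that cross-check; I'm assuming here that point_description lists one shot description per stroke, comma-separated, so adjust the parsing to the real format:

# assumed format: comma-separated shot descriptions, one per stroke
inferred_hits = data["point_description"].str.count(",") + 1

diff = data["detected_ballhit_count"] - inferred_hits
print(diff.describe())
print((diff != 0).sum(), "points where the algorithm and the description disagree")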
Seems
great remark
Result analysis
great commenting; a small remark that it would probably be easier to read if the results were on a single plot. In addition, you could emphasize the difference, either in absolute points or percentage-wise.
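Something along these lines for the single-plot version; the categories and numbers below are placeholders:

import numpy as np
import matplotlib.pyplot as plt

stats = ["aces", "winners", "double faults"]  # placeholder categories
djokovic = np.array([10, 54, 9])              # placeholder values
federer = np.array([25, 94, 6])

x = np.arange(len(stats))
width = 0.35

fig, ax = plt.subplots()
ax.bar(x - width / 2, djokovic, width, label="Djokovic")
ax.bar(x + width / 2, federer, width, label="Federer")

# annotate the absolute difference above each pair of bars
for i in range(len(stats)):
    d = int(federer[i] - djokovic[i])
    top = max(djokovic[i], federer[i])
    ax.annotate(f"{d:+d}", (x[i], top + 1), ha="center")

ax.set_xticks(x)
ax.set_xticklabels(stats)
ax.legend()
plt.show()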
Federer has Nole beat in every single category
yup, it's considered one of "those" matches where, going by the regular statistics, you would assume things ended up differently.
Reading csv
Concerning the organizational part, there's a great nbextension for generating automatic table of contents, so check it out. https://jupyter-contrib-nbextensions.readthedocs.io/en/latest/nbextensions/toc2/README.html
And a collection of community-contributed unofficial extensions: https://jupyter-contrib-nbextensions.readthedocs.io/en/latest/
A little about this project
nice intro :)
np.linspace
Bins are always a good checkpoint for a discussion, so here it goes :) As with any histogram, we underrepresent the data in order to get a "meaningful" plot, and depending on the bins the "story" can go one way or the other. So it's fine to use a specific bin size to represent something, but it's always a good idea to check that the data "behaves" the same when the bins are smaller. Here specifically, because the data is not that "linear", it would probably be better to distribute the points into equal bins (each containing roughly the same number of points) or to go with domain knowledge and use that (e.g. short rallies 1-4, medium 4-9, and long >9).
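Both alternatives in code, assuming the number of shots per point sits in a column I'll call rally_count (the name is a guess):

import pandas as pd

# equal-count bins: each bin holds roughly the same number of points
equal_count = pd.qcut(data["rally_count"], q=4, duplicates="drop")
print(equal_count.value_counts().sort_index())

# domain-knowledge bins: short (1-4), medium (4-9), long (>9)
domain = pd.cut(data["rally_count"],
                bins=[0, 4, 9, data["rally_count"].max()],
                labels=["short", "medium", "long"])
print(domain.value_counts())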
Hm, at first it seems that Federer had two match points. But we can't be certain; it could just be the same ongoing point. Let's see the indexes of these match points.
Just to emphasize the things already discussed, it's nice to have frequent comments. Great storytelling and thinking out loud.
My guess
Great that you've picked up the storytelling part. It's also nice to see that you mix comments that are facts (like conclusions) with additional thoughts ("my guess"). You get two things out of this: it's much easier to go through the notebook, and you also document your own line of thinking. That will help a lot when/if you return to continue the project later on.
plt.show()
Nice set of plots; it would be cool to include 'unforced errors' (this one is relevant from a domain-knowledge point of view). It's also interesting when you put it into the perspective of who won which set (that can be added as extra info on the plot).
storytelling part: e.g. it's also interesting that, despite the statistics shown here for the 5th set, Djokovic still managed to take the victory
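One way to fold the set winners into the plot, assuming the per-set stats are drawn in a loop over subplots (Djokovic took sets 1, 3 and 5, Federer sets 2 and 4):

import matplotlib.pyplot as plt

set_winners = ["Djokovic", "Federer", "Djokovic", "Federer", "Djokovic"]

fig, axes = plt.subplots(1, 5, figsize=(18, 4), sharey=True)
for set_no, ax in enumerate(axes, start=1):
    # ... draw the per-set stats on ax here ...
    ax.set_title(f"Set {set_no} (won by {set_winners[set_no - 1]})")
plt.show()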
plt.show()
It's a great flow, how you elaborated from services to winners and errors; a few comments/observations on the charts and it's a full house.
Time elapsed for each set
Not the same format as your end data, but I just wanted to give you a glimpse of the power of groupby. It takes some time to get accustomed to the groupby logic (like writing SQL queries), just as with vectorization, as we discussed.
data.groupby(['p1_sets','p2_sets'])['point_duration'].sum().values/60  # total point time per set score, in minutes
a more elaborate read: https://pandas.pydata.org/docs/user_guide/groupby.html
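And a small follow-up: the same aggregation kept as a labelled frame, so the set score travels with the minutes (again assuming point_duration is in seconds):

set_minutes = (
    data.groupby(["p1_sets", "p2_sets"], as_index=False)["point_duration"]
        .sum()
        .rename(columns={"point_duration": "minutes"})
)
set_minutes["minutes"] = (set_minutes["minutes"] / 60).round(1)
print(set_minutes)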