 Jun 2020

nabeelqu.co nabeelqu.co

Most of the discourse is about how AI will “replace” humans. I prefer the Licklider school of thought: human-computer symbiosis. AI will make humans vastly more effective by automating tedious tasks. For example, humans can use text AI such as GPT-3 to generate ideas/boilerplate writing to get around the terror of the blank page, and then simply pick the best ones and refine/iterate on those. (AI Dril, which was based on GPT-2, was an early example of this.) As AI gets better, “assistive creativity” will become a bigger thing, enabling humans to create sophisticated artifacts (including video games!) more easily and better than ever.
Why AI is important for human productivity

 May 2020

www.estimationstats.com www.estimationstats.com

You can create estimation plots here at estimationstats.com, or with the DABEST packages which are available in R, Python, and Matlab.
Estimation plots can be created at estimationstats.com or with the DABEST packages (R, Python, Matlab).

Relative to conventional plots, estimation plots offer five key benefits:
Estimation plots > bars-and-stars or boxplot & P.
They:
 avoid false dichotomy
 display all observed values
 focus on intervention effect size
 visualise estimate precision
 show mean difference distribution

For comparisons between 3 or more groups that typically employ analysis of variance (ANOVA) methods, one can use the Cumming estimation plot, which can be considered a variant of the Gardner-Altman plot.
Cumming estimation plot

Efron developed the bias-corrected and accelerated bootstrap (BCa bootstrap) to account for the skew whilst obtaining the central 95% of the distribution.
The bias-corrected and accelerated bootstrap (BCa bootstrap) deals with skewed sample distributions. However, it must be noted that it "may not give very accurate coverage in a small-sample non-parametric situation" (simply put, take caution with small datasets).

We can calculate the 95% CI of the mean difference by performing bootstrap resampling.
Bootstrap: a simple but powerful technique that creates multiple resamples (with replacement) from a single set of observations, and computes the effect size of interest on each of these resamples. It can be used to determine the 95% CI (Confidence Interval).
We can use bootstrap resampling to obtain a measure of precision and confidence about our estimate. It gives us 2 important benefits:
 Non-parametric statistical analysis: no need to assume a normal distribution of our observations. Thanks to the Central Limit Theorem, the resampling distribution of the effect size will approach normality
 Easy construction of the 95% CI from the resampling distribution. For 1000 bootstrap resamples of the mean difference, the 25th and 975th values of the sorted resamples can be used as the boundaries of the 95% CI.
Bootstrap resampling is cheap to run: computers can easily perform 5000 resamples.
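The resampling procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration on made-up data (the two groups, their sizes and the seed are assumptions), and it uses the plain percentile interval rather than the BCa correction mentioned earlier:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical observations for a control and a test group (skewed on purpose).
control = rng.exponential(scale=2.0, size=30)
test = rng.exponential(scale=3.0, size=30)

n_resamples = 5000
boot_diffs = np.empty(n_resamples)
for i in range(n_resamples):
    # Resample each group with replacement, keeping the original group sizes.
    control_resample = rng.choice(control, size=control.size, replace=True)
    test_resample = rng.choice(test, size=test.size, replace=True)
    boot_diffs[i] = test_resample.mean() - control_resample.mean()

observed_diff = test.mean() - control.mean()
# Percentile interval: the central 95% of the resampling distribution.
ci_low, ci_high = np.percentile(boot_diffs, [2.5, 97.5])
print(f"mean difference = {observed_diff:.2f}, 95% CI [{ci_low:.2f}, {ci_high:.2f}]")
```

For real analyses the DABEST packages mentioned above are probably the safer choice, since they also handle the BCa correction and the plotting.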

Shown above is a Gardner-Altman estimation plot.
The Gardner-Altman estimation plot shows all the relevant information:
 Datapoints are presented as a swarmplot
 Effect size is presented as a bootstrap 95% confidence interval (95% CI) on a separate but aligned axis

Jitter plots avoid overlapping datapoints (i.e. datapoints with the same y-value) by adding a random factor to each point along the orthogonal x-axis.
Jitter plots display all datapoints, but they might not accurately depict the underlying distribution of the data.

Unfortunately, the boxplot still doesn't show all our data.
Boxplots may be better than barplots (they introduce medians, quartiles, minima and maxima), but they still don't show all the information.

The barplot has several shortcomings, even though its use in academic journals is endemic.
Barplots are not the best choice for data visualisation.


mateuszgrzyb.pl mateuszgrzyb.pl

The next variable is "Recency", i.e. how long ago the customer last made a purchase in the store.
To calculate RFM we need the recency value. First, we treat the most recent transaction as "today", then compute how many days ago each transaction took place (the latest transaction of each client is picked with 'min' in the aggregation step):
df['Recency'] = (today - df.InvoiceDate) / np.timedelta64(1, 'D')
Calculating frequency and aggregating the data of each client can be done with the groupby method:
abt = df.groupby(['CustomerID']).agg({'Recency':'min', 'MonetaryValue':'sum', 'InvoiceNo':'count'})
Lastly, we can update the column names and display the RFM data:
abt.rename(columns = {'InvoiceNo':'Frequency'}, inplace = True)
abt = abt[['Recency', 'Frequency', 'MonetaryValue']]
abt.head()
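To see the recency, frequency and monetary value steps work end to end, here is a self-contained run on a tiny made-up table. The six transactions are hypothetical; only the column names follow the notes above, and named aggregation is used so that the rename step is not needed:

```python
import numpy as np
import pandas as pd

# Hypothetical toy transactions using the column names from the notes above.
df = pd.DataFrame({
    'CustomerID': [1, 1, 2, 2, 2, 3],
    'InvoiceNo': ['A1', 'A2', 'B1', 'B2', 'B3', 'C1'],
    'InvoiceDate': pd.to_datetime([
        '2020-01-01', '2020-01-10', '2020-01-03',
        '2020-01-05', '2020-01-08', '2019-12-20',
    ]),
    'MonetaryValue': [10.0, 20.0, 5.0, 5.0, 5.0, 100.0],
})

# Treat the most recent transaction in the data as "today".
today = df.InvoiceDate.max()
# Age of every transaction in days (float).
df['Recency'] = (today - df.InvoiceDate) / np.timedelta64(1, 'D')

# One row per customer: days since last purchase, number of invoices, total spend.
abt = df.groupby('CustomerID').agg(
    Recency=('Recency', 'min'),
    Frequency=('InvoiceNo', 'count'),
    MonetaryValue=('MonetaryValue', 'sum'),
)
print(abt)
```

Customer 3 bought only once and long ago but spent the most, which is exactly the kind of contrast the three RFM dimensions are meant to expose.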

I will evaluate customers using the criteria defined by the RFM method.
RFM is a method used for analyzing customer value:
 Recency – How recently did the customer purchase?
 Frequency – How often do they purchase?
 Monetary Value – How much do they spend?
In the RFM method we usually analyse only the last 12 months, since that is the most relevant data for our products.
We also have other RFM variants:
 RFD – Recency, Frequency, Duration
 RFE – Recency, Frequency, Engagement
 RFMI – Recency, Frequency, Monetary Value – Interactions
 RFMTC – Recency, Frequency, Monetary Value, Time, Churn rate

I remove the missing values in the "CustomerID" variable.
Deleting rows where CustomerID is null:
df = df[~df.CustomerID.isnull()]
Assigning different data types to columns:
df['CustomerID'] = df.CustomerID.astype(int)
Deleting irrelevant columns:
df.drop(['Description', 'StockCode', 'Country'], axis = 1, inplace = True)

I check for missing data.
Checking the % of missing data:
print(str(round(df.isnull().any(axis=1).sum()/df.shape[0]*100,2))+'% of observations contain missing data.')
Sample output:
24.93% of observations contain missing data.

I build a simple data frame with basic information about the dataset.
Building a simple dataframe with a summary of our columns (data types, sum and % of nulls):
summary = pd.DataFrame(df.dtypes, columns=['Dtype'])
summary['Nulls'] = pd.DataFrame(df.isnull().any())
summary['Sum_of_nulls'] = pd.DataFrame(df.isnull().sum())
summary['Per_of_nulls'] = round((df.apply(pd.isnull).mean()*100),2)
summary.Dtype = summary.Dtype.astype(str)
print(summary)
The output:
                      Dtype  Nulls  Sum_of_nulls  Per_of_nulls
InvoiceNo            object  False             0         0.000
StockCode            object  False             0         0.000
Description          object   True          1454         0.270
Quantity              int64  False             0         0.000
InvoiceDate  datetime64[ns]  False             0         0.000
UnitPrice           float64  False             0         0.000
CustomerID          float64   True        135080        24.930
Country              object  False             0         0.000


towardsdatascience.com towardsdatascience.com

For many data scientists, the finished product of a work session is a business analysis. They need to show team members—who oftentimes aren’t technical—how their data became a specific recommendation or insight.
The usual final product in data science is a business analysis, which notebooks are well suited to explaining.


www.datanami.com www.datanami.com

COVID-19 has spurred a shift to analyze things like supply chain disruptions, speech analytics, and filtering out pandemic-related behavior, such as binge shopping, Burtch Works says. Data teams are being asked to create new simulations for the coming recession, to create scorecards to track pandemic-related behavior, and add COVID-19 control variables to marketing attribution models, the company says.
How COVID-19 impacts the activities of data positions

Data scientists and data engineers have built-in job security relative to other positions as businesses transition their operations to rely more heavily on data, data science, and AI. That’s a long-term trend that is not likely to change due to COVID-19, although momentum had started to slow in 2019 as venture capital investments ebbed.

According to a Dice Tech Jobs report released in February, demand for data engineers was up 50% and demand for data scientists was up 32% in 2019 compared to the prior year.
Need for Data Scientist / Engineers in 2019 vs 2018


news.ycombinator.com news.ycombinator.com

Truth be told, we found that most companies we worked with preferred to own the analytical backend.
From the experience of Plotly Team

 Apr 2020

towardsdatascience.com towardsdatascience.com

the limitations of the PPS
Limitations of the PPS:
 Slower than correlation
 Score cannot be interpreted as easily as the correlation (it doesn't tell you anything about the type of relationship). PPS is better for finding patterns and correlation is better for communicating found linear relationships
 You cannot compare the scores for different target variables in a strict math way because they're calculated using different evaluation metrics
 There are some limitations of the components used under the hood
 If you use the PPS for feature selection, you still have to perform forward and backward selection in addition

Although the PPS has many advantages over the correlation, there is some drawback: it takes longer to calculate.
PPS is slower to calculate than the correlation.
 single PPS = 10-500 ms
 whole PPS matrix for 40 columns = 40*40 = 1600 individual calculations = 1-10 minutes

How to use the PPS in your own (Python) project
Using PPS with Python
 Download ppscore:
pip install ppscore
 Calculate the PPS for a given pandas dataframe:
import ppscore as pps
pps.score(df, "feature_column", "target_column")
 Calculate the whole PPS matrix:
pps.matrix(df)

The PPS clearly has some advantages over correlation for finding predictive patterns in the data. However, once the patterns are found, the correlation is still a great way of communicating found linear relationships.
PPS:
 good for finding predictive patterns
 can be used for feature selection
 can be used to detect information leakage between variables
 interpret PPS matrix as a directed graph to find entity structures
Correlation:
 good for communicating found linear relationships

Let’s compare the correlation matrix to the PPS matrix on the Titanic dataset.
Comparing the correlation matrix and the PPS matrix of the Titanic dataset:
Findings about the correlation matrix:
 The correlation matrix is smaller because it doesn't work for categorical data
 The correlation matrix shows a negative correlation between TicketPrice and Class. For the PPS, TicketPrice is a strong predictor of Class (0.9 PPS), but not the other way round, from Class to TicketPrice (a ticket of 5000-10000$ is most likely the highest class, but the highest class itself cannot determine the price)
Findings about the PPS matrix:
 The first row of the matrix tells you that the best univariate predictor of the column Survived is the column Sex (Sex was dropped for correlation)
 TicketID uncovers a hidden pattern as well as its connection with TicketPrice

Let’s use a typical quadratic relationship: the feature x is a uniform variable ranging from -2 to 2 and the target y is the square of x plus some error.
In this scenario:
 we can predict y using x
 we cannot predict x using y as x might be negative or positive (for y=4, x=2 or -2)
 the correlation is 0, both from x to y and from y to x, because the correlation is symmetric (more often, relationships are asymmetric!). However, the PPS from x to y is 0.88 (not 1 because of the error term)
 PPS from y to x is 0 because y cannot predict x if it only knows its own value
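The asymmetry in the quadratic example can be reproduced without the ppscore package. The sketch below is only a stand-in, under these assumptions: `binned_median_score` is a hypothetical helper, a binned-median predictor replaces the Decision Tree, a single train/test split replaces cross-validation, and the error term is uniform in [-0.5, 0.5]. The score is normalised as 1 - MAE_model / MAE_naive, with the median as the naive predictor:

```python
import numpy as np

def binned_median_score(feature, target, n_bins=20):
    """PPS-style score: 1 - MAE_model / MAE_naive, clipped at 0."""
    half = len(feature) // 2
    f_train, f_test = feature[:half], feature[half:]
    t_train, t_test = target[:half], target[half:]

    # Naive baseline: always predict the training median of the target.
    naive_prediction = np.median(t_train)
    mae_naive = np.mean(np.abs(t_test - naive_prediction))

    # Stand-in model: median of the target within equal-width feature bins.
    edges = np.linspace(f_train.min(), f_train.max(), n_bins + 1)
    train_bins = np.clip(np.digitize(f_train, edges) - 1, 0, n_bins - 1)
    test_bins = np.clip(np.digitize(f_test, edges) - 1, 0, n_bins - 1)
    bin_medians = np.full(n_bins, naive_prediction)  # fallback for empty bins
    for b in range(n_bins):
        in_bin = train_bins == b
        if in_bin.any():
            bin_medians[b] = np.median(t_train[in_bin])
    mae_model = np.mean(np.abs(t_test - bin_medians[test_bins]))

    # A model that is no better than the naive baseline scores 0.
    return max(0.0, 1.0 - mae_model / mae_naive)

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=4000)
y = x**2 + rng.uniform(-0.5, 0.5, size=4000)

correlation = np.corrcoef(x, y)[0, 1]
score_x_to_y = binned_median_score(x, y)
score_y_to_x = binned_median_score(y, x)
print(f"correlation  = {correlation:.2f}")
print(f"score x -> y = {score_x_to_y:.2f}")
print(f"score y -> x = {score_y_to_x:.2f}")
```

The correlation should come out close to 0, the x -> y score clearly above 0, and the y -> x score close to 0. The exact numbers will differ from the 0.88 quoted above because the model and the validation scheme are different.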

how do you normalize a score? You define a lower and an upper limit and put the score into perspective.
Normalising a score:
 you need to put a lower and upper limit
 upper limit can be F1 = 1, and a perfect MAE = 0
 lower limit depends on the evaluation metric and your data set. It's the value that a naive predictor achieves

For a classification problem, always predicting the most common class is pretty naive. For a regression problem, always predicting the median value is pretty naive.
What is a naive model:
 predicting the most common class for a classification problem
 predicting the median value for a regression problem

Let’s say we have two columns and want to calculate the predictive power score of A predicting B. In this case, we treat B as our target variable and A as our (only) feature. We can now calculate a cross-validated Decision Tree and calculate a suitable evaluation metric.
If the target (B) variable is:
 numeric: we can use a Decision Tree Regressor and calculate the Mean Absolute Error (MAE)
 categoric: we can use a Decision Tree Classifier and calculate the weighted F1 (or ROC)

More often, relationships are asymmetric
a column with 3 unique values will never be able to perfectly predict another column with 100 unique values. But the opposite might be true

there are many nonlinear relationships that the score simply won’t detect. For example, a sinus wave, a quadratic curve or a mysterious step function. The score will just be 0, saying: “Nothing interesting here”. Also, correlation is only defined for numeric columns.
Correlation:
 doesn't detect non-linear relationships
 doesn't work for categorical values



Creating meticulous tests before exploring the data is a big mistake, and will result in a well-crafted garbage-in, garbage-out pipeline. We need an environment flexible enough to encourage experiments, especially in the initial place.
The overzealous nature of TDD may discourage exploratory data science


towardsdatascience.com towardsdatascience.com

The priorities in building a production machine learning pipeline—the series of steps that take you from raw data to product—are not fundamentally different from those of general software engineering.
 Your pipeline should be reproducible
 Collaborating on your pipeline should be easy
 All code in your pipeline should be testable

A notebook, at a very basic level, is just a bunch of JSON that references blocks of code and the order in which they should be executed. But notebooks prioritize presentation and interactivity at the expense of reproducibility. YAML is the other side of that coin, ignoring presentation in favor of simplicity and reproducibility, which makes it much better for production.
Summary of the article:
Notebook = presentation + interactivity
YAML = simplicity + reproducibility


stats.stackexchange.com stats.stackexchange.com

Repeated measures involves measuring the same cases multiple times. So, if you measured the chips, then did something to them, then measured them again, etc it would be repeated measures. Replication involves running the same study on different subjects but identical conditions. So, if you did the study on n chips, then did it again on another n chips that would be replication.
Difference between repeated measures and replication


github.com github.com

To take full advantage of tabular augmentation for time-series you would perform the techniques in the following order: (1) transforming, (2) interacting, (3) mapping, (4) extracting, and (5) synthesising

process of modular feature engineering and observation engineering while emphasising the order of augmentation to achieve the best predicted outcome from a given information set
Tabular Data Augmentation


news.ycombinator.com news.ycombinator.com

1) Redash and Falcon focus on people that want to do visualizations on top of SQL
2) Superset, Tableau and PowerBI focus on people that want to do visualizations with a UI
3) Metabase and SeekTable focus on people that want to do quick analysis (they are the closest to an Excel replacement)
Comparison of data analysis tools:
1) Redash & Falcon: SQL focus
2) Superset, Tableau & PowerBI: UI workflow
3) Metabase & SeekTable: Excel-like experience


blog.jupyter.org blog.jupyter.org

JupyterLab project, which enables a richer UI including a file browser, text editors, consoles, notebooks, and a rich layout system.
How JupyterLab differs from a traditional notebook


code.visualstudio.com code.visualstudio.com

Open an Anaconda command prompt and run conda create -n myenv python=3.7 pandas jupyter seaborn scikit-learn keras tensorflow
Command to quickly create a new Anaconda environment:
conda create -n myenv python=3.7 pandas jupyter seaborn scikit-learn keras tensorflow


dfrieds.com dfrieds.com

On the job, if you notice poor infrastructure, speak up to your manager early on. Clearly document the problem, and try to incorporate a data engineering, infrastructure, or devops team to help resolve the issue! I'd also encourage you to learn these skills too!
Recommendation to Data Scientists dealing with poor infrastructure

When I worked at Target HQ in 2012, employees would arrive to work early, often around 7am, in order to query the database at a time when few others were doing so. They hoped they’d get database results quicker. Yet, they’d still often wait several hours just to get results.
What poor data infrastructure leads to

The Data Scientist, in many cases, should be called the Data Janitor. ¯_(ツ)_/¯
:D

In regards to quality of data on the job, I’d often compare it to a garbage bag that ripped, had its content spewed all over the ground and your partner has asked you to find a beautiful earring that was accidentally inside.
From my experience, I can only agree

On the job, I’d recommend you document your work well and calculate the monetary value of your analyses based on factors like employee salary, capital investments, opportunity cost, etc. These analyses will come in handy for a promotion/review packet later too.
Factors to keep a track of as a Data Scientist

you as a Data Scientist should try to find a situation to be incredibly valuable on the job! It’s tough to find that from the outskirts of applying to jobs, but internally, you can make inroads supporting stakeholders with evidence for their decisions!
Try showcasing the evidence of why the employer needs you

As a Data Scientist in the org, are you essential to the business? Probably not. The business could go on for a while and survive without you. Sales will still be made, features will still get built, customer support will handle customer concerns, etc.
As a Data Scientist, you are more of a "support" to the overall team

As the resident Data Scientist, you may become easily inundated with requests from multiple teams at once. Be prepared to ask these teams to qualify and defend their requests, and be prepared to say “no” if their needs fall outside the scope of your actual priority queue. I’d recommend utilizing the RICE prioritization technique for projects.
As the only data person, be sure to prioritise the requests

Common questions are: How many users click this button; what % of users that visit a screen click this button; how many users have signed up by region or account type? However, the data needed to answer those questions may not exist! If the data does exist, it’s likely “dirty”: undocumented, tough to find or could be factually inaccurate. It’ll be tough to work with! You could spend hours or days attempting to answer a single question only to discover that you can’t sufficiently answer it for a stakeholder. In machine learning, you may be asked to optimize some process or experience for consumers. However, there’s uncertainty with how much, if at all, the experience can be improved!
Common types of problems you might be working with in the Data Science / Machine Learning industry

In one experience, a fellow researcher spent over a month researching a particular value among our customers through qualitative and quantitative data. She presented a well-written and evidence-backed report. Yet, a few days later, a key head of product outlined a vision for the team and supported it with a claim that was antithetical to the researcher’s findings! Even if a data science project you advocate for is greenlighted, you may be on your own as the rare knowledgeable person to plan and execute it. It’s unlikely leadership will be hands-on to help you research and plan out the project.
Data science leadership is sorely lacking

Because people don’t know what data science does, you may have to support yourself with work in devops, software engineering, data engineering, etc.
You might have multiple roles due to lack of clear "data science" terminology

In 50+ interviews for data related jobs, I’ve been asked about A/B testing, SQL analytics questions, optimizing SQL queries, how to code a game in Python, Logistic Regression, Gradient Boosted Trees, data structures and/or algorithms programming problems!
Data science interviews lack a clarity of scope

seven most common (and at times flagrant) ways that data science has failed to meet expectations in industry
 People don’t know what “data science” does.
 Data science leadership is sorely lacking.
 Data science can’t always be built to specs.
 You’re likely the only “data person.”
 Your impact is tough to measure — data doesn’t always translate to value.
 Data & infrastructure have serious quality problems.
 Data work can be profoundly unethical. Moral courage required.


datascience.stackexchange.com datascience.stackexchange.com

In data science community the performance of the model on the test dataset is one of the most important things people look at. Just look at the competitions on kaggle.com. They are extremely focused on test dataset and the performance of these models is really good.
In data science, the performance of the model on the test dataset is the most important metric for the majority.
It's not always the best measurement, since the best-scoring model can perform very poorly when it receives a different type of dataset

It's basically a look up table, interpolating between known data points. Except, unlike other interpolants like 'linear', 'nearest neighbour' or 'cubic', the underlying functional form is determined to best represent the kind of data you have.
You can describe AI/ML methods as a look up table that adjusts to your data points unlike other interpolants (linear,nearest neighbor or cubic)

 Mar 2020


In the interval scale, there is no true zero point or fixed beginning. Interval scales do not have a true zero even if one of the values carries the name “zero.” For example, with temperature, there is no point on the scale where temperature is absent: zero degrees F does not mean the complete absence of temperature. Since the interval scale has no true zero point, you cannot calculate ratios. For example, it makes no sense to say that the ratio of 90 to 30 degrees F is the same as the ratio of 60 to 20 degrees. A temperature of 20 degrees is not twice as warm as one of 10 degrees.
Interval data:
 show not only order and direction, but also the exact differences between the values
 the distances between each value on the interval scale are meaningful and equal
 no true zero point
 no fixed beginning
 no possibility to calculate ratios (only add and subtract)
 e.g.: temperature in Fahrenheit or Celsius (but not Kelvin) or IQ test

As the interval scales, Ratio scales show us the order and the exact value between the units. However, in contrast with interval scales, Ratio ones have an absolute zero that allows us to perform a huge range of descriptive statistics and inferential statistics. The ratio scales possess a clear definition of zero. Any types of values that can be measured from absolute zero can be measured with a ratio scale. The most popular examples of ratio variables are height and weight. In addition, one individual can be twice as tall as another individual.
Ratio data is like interval data, but with:
 absolute zero
 possibility to calculate ratio (e.g. someone can be twice as tall)
 possibility to not only add and subtract, but multiply and divide values
 e.g.: weight, height, Kelvin scale (50K is twice as hot as 25K)
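A quick arithmetic check of why ratios only make sense on a scale with a true zero: the same two temperatures give a very different "ratio" once they are expressed in Kelvin. The helper name below is made up for the illustration; the conversion itself is the standard formula.

```python
def fahrenheit_to_kelvin(deg_f):
    """Convert degrees Fahrenheit to Kelvin (a ratio scale with a true zero)."""
    return (deg_f - 32.0) * 5.0 / 9.0 + 273.15

# Dividing the raw Fahrenheit readings suggests "three times as warm"...
naive_ratio = 90.0 / 30.0
# ...but on the Kelvin scale the two temperatures differ by only about 12%.
kelvin_ratio = fahrenheit_to_kelvin(90.0) / fahrenheit_to_kelvin(30.0)

print(f"naive Fahrenheit ratio: {naive_ratio:.2f}")
print(f"ratio on Kelvin scale:  {kelvin_ratio:.2f}")
```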


towardsdatascience.com towardsdatascience.com

If you are interested in exploring the dataset used in this article, it can be used straight from S3 with Vaex. See the full Jupyter notebook to find out how to do this.
Example of EDA in Vaex > Jupyter Notebook

Vaex is an open-source DataFrame library which enables the visualisation, exploration, analysis and even machine learning on tabular datasets that are as large as your hard drive. To do this, Vaex employs concepts such as memory mapping, efficient out-of-core algorithms and lazy evaluations.
Vaex: a library for handling datasets as large as your HDD, thanks to:
 memory mapping
 efficient out-of-core algorithms
 lazy evaluations.
All wrapped in a Pandaslike API
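To make "memory mapping" and "out-of-core" a bit more concrete, here is a small sketch using NumPy's memmap. It is not Vaex code and makes no claim about how Vaex is implemented; the file name, array size and chunk size are arbitrary assumptions. The file is mapped rather than loaded, and a statistic is computed chunk by chunk:

```python
import os
import tempfile

import numpy as np

n_values = 1_000_000
path = os.path.join(tempfile.mkdtemp(), "column.dat")

# Write one column to disk once (a stand-in for converting data to a mappable format).
writer = np.memmap(path, dtype=np.float64, mode="w+", shape=(n_values,))
writer[:] = np.arange(n_values, dtype=np.float64)
writer.flush()
del writer

# Opening the file maps it into the address space without reading it all into RAM.
column = np.memmap(path, dtype=np.float64, mode="r", shape=(n_values,))

# Out-of-core style: process the column in chunks, touching one window at a time.
chunk_size = 100_000
total = 0.0
for start in range(0, n_values, chunk_size):
    total += float(column[start:start + chunk_size].sum())
mean_value = total / n_values
print(mean_value)
```

Opening the mapped file is nearly instant regardless of its size, which is the effect the notes describe for Vaex.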

The first step is to convert the data into a memory mappable file format, such as Apache Arrow, Apache Parquet, or HDF5
Before opening data with Vaex, we need to convert it into a memory mappable file format (e.g. Apache Arrow, Apache Parquet or HDF5). This way, 100 GB of data can be loaded in Vaex in 0.052 seconds!
Example of converting CSV > HDF5.

The describe method nicely illustrates the power and efficiency of Vaex: all of these statistics were computed in under 3 minutes on my MacBook Pro (15", 2018, 2.6GHz Intel Core i7, 32GB RAM). Other libraries or methods would require either distributed computing or a cloud instance with over 100GB to perform the same computations.
Possibilities of Vaex

AWS offers instances with Terabytes of RAM. In this case you still have to manage cloud data buckets, wait for data transfer from bucket to instance every time the instance starts, handle compliance issues that come with putting data on the cloud, and deal with all the inconvenience that come with working on a remote machine. Not to mention the costs, which although start low, tend to pile up as time goes on.
AWS as a solution to analyse data too big for RAM (like the 30-50 GB range). In this case, it's still uncomfortable:
 managing cloud data buckets
 waiting for data transfer from bucket to instance every time the instance starts
 handling compliance issues that come with putting data on the cloud
 dealing with remote machines
 costs


towardsdatascience.com towardsdatascience.com

when AUC is 0.5, it means model has no class separation capacity whatsoever.
If AUC = 0.5

ROC is a probability curve and AUC represents degree or measure of separability. It tells how much model is capable of distinguishing between classes.
ROC & AUC

In multiclass model, we can plot N number of AUC ROC Curves for N number classes using One vs ALL methodology. So for Example, If you have three classes named X, Y and Z, you will have one ROC for X classified against Y and Z, another ROC for Y classified against X and Z, and a third one of Z classified against Y and X.
Using AUC ROC curve for multiclass model

When AUC is approximately 0, model is actually reciprocating the classes. It means, model is predicting negative class as a positive class and vice versa
If AUC = 0

AUC near to the 1 which means it has good measure of separability.
If AUC = 1
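The three cases above (AUC near 1, 0.5 and 0) can be checked with a few lines of NumPy. This sketch uses the rank interpretation of AUC, i.e. the probability that a randomly chosen positive gets a higher score than a randomly chosen negative. The function name, labels and scores are made up for the illustration:

```python
import numpy as np

def auc(labels, scores):
    """AUC as P(score of a positive > score of a negative); ties count as half."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    # Compare every positive with every negative.
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (pos.size * neg.size)

labels = [0, 0, 0, 1, 1, 1]
auc_perfect = auc(labels, [0.1, 0.2, 0.3, 0.7, 0.8, 0.9])   # 1.0: classes fully separated
auc_uninformative = auc(labels, [0.5] * 6)                  # 0.5: no separation at all
auc_inverted = auc(labels, [0.9, 0.8, 0.7, 0.3, 0.2, 0.1])  # 0.0: classes swapped
print(auc_perfect, auc_uninformative, auc_inverted)
```

For the multi-class case, the same function can be applied once per class, with that class labelled 1 and all other classes labelled 0 (one vs all).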


www.linkedin.com www.linkedin.com

LR is nothing but the binomial regression with logit link (or probit), one of the numerous GLM cases. As a regression, by itself it doesn't classify anything; it models the conditional (to linear predictor) expected value of the Bernoulli/binomially distributed DV.
Logistic Regression: the ultimate definition (it's not a classification algorithm!)
It's used for classification only once we specify a threshold (e.g. 50%).
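The distinction between the regression and its use as a classifier can be shown in a few lines. The intercept and slope below are hypothetical, not fitted to anything. The first function is the regression part (it returns an expected value, i.e. a probability); the classification only appears when a threshold is applied on top:

```python
import math

def predict_probability(x, intercept=-1.0, slope=2.0):
    """Logistic regression output: inverse logit of the linear predictor."""
    linear_predictor = intercept + slope * x
    return 1.0 / (1.0 + math.exp(-linear_predictor))

def classify(x, threshold=0.5):
    """Classification is a separate decision layered on top of the probability."""
    return int(predict_probability(x) >= threshold)

print(predict_probability(0.5))      # 0.5
print(classify(0.5))                 # 1
print(classify(0.5, threshold=0.6))  # 0
```

The same probability yields different classes depending on the threshold, which is why the threshold belongs to the application and not to the model.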



BOW is often used for Natural Language Processing (NLP) tasks like Text Classification. Its strengths lie in its simplicity: it’s inexpensive to compute, and sometimes simpler is better when positioning or contextual info aren’t relevant
Usefulness of BOW:
 simplicity
 low on computing requirements

Notice that we lose contextual information, e.g. where in the document the word appeared, when we use BOW. It’s like a literal bag-of-words: it only tells you what words occur in the document, not where they occurred
The analogy behind using the term "bag" in the bag-of-words (BOW) model.

Here’s my preferred way of doing it, which uses Keras’s Tokenizer class
Keras's Tokenizer class: Victor's preferred way of implementing BOW in Python
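The loss of positional information is easy to see in a plain-Python sketch. This deliberately uses collections.Counter from the standard library instead of the Keras Tokenizer mentioned above, and the helper name and the two sentences are made up for the illustration:

```python
from collections import Counter

def bag_of_words(text):
    """Count how often each lowercase, whitespace-separated token occurs."""
    tokens = text.lower().split()
    return Counter(tokens)

bow_a = bag_of_words("the dog bit the man")
bow_b = bag_of_words("the man bit the dog")

print(bow_a)
# Same words, opposite meaning, identical bag: word order is gone.
print(bow_a == bow_b)  # True
```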


martinfowler.com martinfowler.com

process that takes input data through a series of transformation stages, producing data as output
Data pipeline



Softmax turns arbitrary real values into probabilities
Softmax function:
 outputs of the function are in range [0,1] and add up to 1. Hence, they form a probability distribution
 the calculation involves e (the mathematical constant) and operates on n numbers: $$s(x_i) = \frac{e^{x_i}}{\sum_{j=1}^n e^{x_j}}$$
 the bigger the value, the higher its probability
 lets us answer classification questions with probabilities, which are more useful than simpler answers (e.g. binary yes/no)
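A small NumPy sketch of the formula above, with arbitrary example inputs. Subtracting the maximum before exponentiating is a common numerical-stability trick that is not part of the notes; it does not change the result because the factor cancels in the ratio:

```python
import numpy as np

def softmax(x):
    """s(x_i) = exp(x_i) / sum_j exp(x_j), computed in a numerically stable way."""
    x = np.asarray(x, dtype=float)
    shifted = x - x.max()  # avoids overflow for large inputs
    exps = np.exp(shifted)
    return exps / exps.sum()

probs = softmax([-1.0, 0.0, 3.0, 5.0])
print(probs)        # every value lies in [0, 1]; larger inputs get larger probabilities
print(probs.sum())  # adds up to 1 (within floating-point error)
```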


www.linkedin.com www.linkedin.com

1. Logistic regression IS a binomial regression (with logit link), a special case of the Generalized Linear Model. It doesn't classify anything *unless a threshold for the probability is set*. Classification is just its application.
2. Stepwise regression is by no means a regression. It's a (flawed) method of variable selection.
3. OLS is a method of estimation (among others: GLS, TLS, (RE)ML, PQL, etc.), NOT a regression.
4. Ridge, LASSO: these are methods of regularization, NOT regressions.
5. There are tens of models for the regression analysis. You mention mainly linear and logistic; it's just the GLM! Learn the others too (link in a comment). STOP with the "17 types of regression every DS should know". BTW, there're 270+ statistical tests. Not just t, chi2 & Wilcoxon
5 clarifications of common misconceptions shared in data science cheatsheets on LinkedIn


news.ycombinator.com news.ycombinator.com

A 400-sized probability sample (a small random sample from the whole population) is often better than a millions-sized administrative sample (of the kind you can download from gov sites). The reason is that an arbitrary sample (as opposed to a random one) is very likely to be biased, and, if large enough, a confidence interval (which actually doesn't really make sense except for probability samples) will be so narrow that, because of the bias, it will actually rarely, if ever, include the true value we are trying to estimate. On the other hand, the small, random sample will be very likely to include the true value in its (wider) confidence interval
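The argument can be illustrated with a small simulation. Everything in it is an assumption made for the sketch: a synthetic normal population, a made-up selection rule for the "administrative" sample (only values above 45 get recorded), and a normal-approximation 95% CI for the mean:

```python
import numpy as np

rng = np.random.default_rng(1)
population = rng.normal(loc=50.0, scale=10.0, size=200_000)
true_mean = population.mean()

def mean_ci(sample):
    """Normal-approximation 95% confidence interval for the mean."""
    half_width = 1.96 * sample.std(ddof=1) / np.sqrt(sample.size)
    return sample.mean() - half_width, sample.mean() + half_width

# Huge but biased sample: hypothetical rule, only values above 45 are recorded.
biased_sample = population[population > 45.0]
biased_low, biased_high = mean_ci(biased_sample)
biased_covers = biased_low <= true_mean <= biased_high

# Small probability samples of 400, repeated to see how often the CI covers the truth.
n_repeats = 200
hits = 0
for _ in range(n_repeats):
    sample = rng.choice(population, size=400, replace=False)
    low, high = mean_ci(sample)
    hits += low <= true_mean <= high
coverage = hits / n_repeats

print(f"biased sample: n={biased_sample.size}, CI=[{biased_low:.2f}, {biased_high:.2f}], covers truth: {biased_covers}")
print(f"random samples of 400: CI covered the true mean in {coverage:.0%} of repeats")
```

The biased interval is very narrow and sits well away from the true mean, while the wide intervals from the small random samples should cover it in roughly 95% of repeats.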


www.linkedin.com www.linkedin.com

An exploratory plot is all about you getting to know the data. An explanatory graphic, on the other hand, is about telling a story using that data to a specific audience.
Exploratory vs Explanatory plot

 Dec 2018

www.tandfonline.com www.tandfonline.com

ggplot2 and dplyr are clear intellectual contributions because they provide tools (grammars) for visualization and data manipulation, respectively. The tools make the tasks radically easier by providing clear organizing principles coupled with effective code. Users often remark on the ease of manipulating data with dplyr and it is natural to wonder if perhaps the task itself is trivial. We claim it is not. Many probability challenges become dramatically easier, once you strike upon the “right” notation. In both cases, what feels like a matter of notation or syntax is really more about exploiting the “right” abstraction.
This is a fascinating example to me of both an important intellectual contribution that is embodied in the code itself, and also how extremely important work can be invisible or taken for granted because of how seamlessly or easily it becomes integrated into other research.

This article is one of many commentaries on the Donoho "50 Years of Data Science" paper.
