78 Matching Annotations
  1. Jun 2020
    1. Most of the discourse is about how AI will “replace” humans. I prefer the Licklider school of thought: human-computer symbiosis. AI will make humans vastly more effective by automating tedious tasks. For example, humans can use text AI such as GPT-3 to generate ideas/boilerplate writing to get around the terror of the blank page, and then simply pick the best ones and refine/iterate on those. (AI Dril, which was based on GPT-2, was an early example of this). As AI gets better, “assistive creativity” will become a bigger thing, enabling humans to create sophisticated artifacts (including video games!) easier and better than ever.

      Why AI is important for human productivity

  2. May 2020
    1. You can create estimation plots here at estimationstats.com, or with the DABEST packages which are available in R, Python, and Matlab.

      You can create estimation plots with:

    2. Relative to conventional plots, estimation plots offer five key benefits:

      Estimation plots > bars-and-stars or boxplot & P.

      They:

      • avoid false dictonomy
      • display all observed values
      • focus on intervention effect size
      • visualise estimate precision
      • show mean difference distribution
    3. For comparisons between 3 or more groups that typically employ analysis of variance (ANOVA) methods, one can use the Cumming estimation plot, which can be considered a variant of the Gardner-Altman plot.

      Cumming estimation plot

    4. Efron developed the bias-corrected and accelerated bootstrap (BCa bootstrap) to account for the skew whilst obtaining the central 95% of the distribution.

      Bias-corrected and accelerated bootstrap (BCa boostrap) deals with skewed sample distributions. However; it must be noted that it "may not give very accurate coverage in a small-sample non-parametric situation" (simply said, take caution with small datasets)

    5. We can calculate the 95% CI of the mean difference by performing bootstrap resampling.

      Bootstrap - simple but powerful technique that creates multiple resamples (with replacement) from a single set of observations, and computes the effect size of interest on each of these resamples. It can be used to determine the 95% CI (Confidence Interval).

      We can use bootstrap resampling to obtain measure of precision and confidence about our estimate. It gives us 2 important benefits:

      1. Non-parametric statistical analysis - no need to assume normal distribution of our observations. Thanks to Central Limit Theorem, the resampling distribution of the effect size will approach normality
      2. Easy construction of the 95% CI from the resampling distribution. For 1000 bootstrap resamples of the mean difference, 25th value and 975th value can be used as boundaries of the 95% CI.

      Bootstrap resampling can be used for such an example:

      Computers can easily perform 5000 resamples:

    6. Shown above is a Gardner-Altman estimation plot.

      Gardner-Altman estimation plot shows all the relevant information:

      1. Datapoints presented as swarmplot
      2. Effect size is presented as bootstrap 95% confidence interval (95% CI) on a seperate but aligned axes
    7. Jitter plots avoid overlapping datapoints (i.e. datapoints with the same y-value) by adding a random factor to each point along the orthogonal x-axes.

      Jitter plots displays all datapoints but it might not accurately depict the underlying distribution of the data:

    8. Unfortunately, the boxplot still doesn't show all our data.

      Boxplots may be better than barplots (they introduce medians, quartiles, minima and maxima), but still doesn't show all the information:

    9. The barplot has several shortcomings, even though its use in academic journals is endemic.

      Barplots are not the best choice for data visualisation:

    1. Kolejna zmienna - "Recency", czyli informacja o tym, jak dawno klient robił zakupy w sklepie.

      To calculate RFM we need recency value. Firstly, we shall specify the most recent transaction as "today" and then to find the latest transaction of a specific client:

      df['Recency'] = (today - df.InvoiceDate)/np.timedelta64(1,'D')
      

      Calculating frequency and aggregating data of each client may be done with the groupby method:

      abt = df.groupby(['CustomerID']).agg({'Recency':'min', 'MonetaryValue':'sum', 'InvoiceNo':'count'})
      

      lastly, we can update the column names and display RFM data:

      abt = df.groupby(['CustomerID']).agg({'Recency':'min', 'MonetaryValue':'sum', 'InvoiceNo':'count'})
      abt.rename(columns = {'InvoiceNo':'Frequency'}, inplace = True)
      abt = abt[['Recency', 'Frequency', 'MonetaryValue']]
      abt.head()
      
    2. Będę oceniać klientów kryteriami, jakie zakłada metoda RFM.

      RFM is a method used for analyzing customer value:

      • Recency – How recently did the customer purchase?
      • Frequency – How often do they purchase?
      • Monetary Value – How much do they spend?

      In the RFM method we usually analyse only the last 12 months since this is the most relevant data of our products.

      We also have other RFM variants:

      • RFD – Recency, Frequency, Duration
      • RFE – Recency, Frequency, Engagement
      • RFM-I – Recency, Frequency, Monetary Value – Interactions
      • RFMTC – Recency, Frequency, Monetary Value, Time, Churn rate
    3. Usuwam brakujące wartości w zmiennej "CustomerID".

      Deleting rows where value is null:

      df = df[~df.CustomerID.isnull()]
      

      Assigning different data types to columns:

      df['CustomerID'] = df.CustomerID.astype(int)
      

      Deleting irrelevant columns:

      df.drop(['Description', 'StockCode', 'Country'], axis = 1, inplace = True)
      
    4. Sprawdzam braki danych.

      Checking the % of missing data:

      print(str(round(df.isnull().any(axis=1).sum()/df.shape[0]*100,2))+'% obserwacji zawiera braki w danych.')
      

      Sample output:

      24.93% obserwacji zawiera braki w danych.
      
    5. Buduję prosty data frame z podstawowymi informacjami o zbiorze.

      Building a simple dataframe with a summary of our columns (data types, sum and % of nulls):

      summary = pd.DataFrame(df.dtypes, columns=['Dtype'])
      summary['Nulls'] = pd.DataFrame(df.isnull().any())
      summary['Sum_of_nulls'] = pd.DataFrame(df.isnull().sum())
      summary['Per_of_nulls'] = round((df.apply(pd.isnull).mean()*100),2)
      summary.Dtype = summary.Dtype.astype(str)
      print(summary)
      

      the output:

                            Dtype  Nulls  Sum_of_nulls  Per_of_nulls
      InvoiceNo            object  False             0         0.000
      StockCode            object  False             0         0.000
      Description          object   True          1454         0.270
      Quantity              int64  False             0         0.000
      InvoiceDate  datetime64[ns]  False             0         0.000
      UnitPrice           float64  False             0         0.000
      CustomerID          float64   True        135080        24.930
      Country              object  False             0         0.000
      
    1. For many data scientists, the finished product of a work session is a business analysis. They need to show team members—who oftentimes aren’t technical—how their data became a specific recommendation or insight.

      Usual final product in Data Science is the business analysis which is perfectly explained with notebooks

    1. COVID-19 has spurred a shift to analyze things like supply chain disruptions, speech analytics, and filtering out pandemic-related behavior, such as binge shopping, Burtch Works says. Data teams are being asked to create new simulations for the coming recession, to create scorecards to track pandemic-related behavior, and add COVID-19 control variables to marketing attribution models, the company says.

      How COVID-19 impacts activities of data positions

    2. Data scientists and data engineers have built-in job security relative to other positions as businesses transition their operations to rely more heavily on data, data science, and AI. That’s a long-term trend that is not likely to change due to COVID-19, although momentum had started to slow in 2019 as venture capital investments ebbed.
    3. According to a Dice Tech Jobs report released in February, demand for data engineers was up 50% and demand for data scientists was up 32% in 2019 compared to the prior year.

      Need for Data Scientist / Engineers in 2019 vs 2018

    1. Truth be told, we found that most companies we worked with preferred to own the analytical backend.

      From the experience of Plotly Team

  3. Apr 2020
    1. the limitations of the PPS

      Limitations of the PPS:

      1. Slower than correlation
      2. Score cannot be interpreted as easily as the correlation (it doesn't tell you anything about the type of relationship). PPS is better for finding patterns and correlation is better for communicating found linear relationships
      3. You cannot compare the scores for different target variables in a strict math way because they're calculated using different evaluation metrics
      4. There are some limitations of the components used underneath the hood
      5. You've to perform forward and backward selection in addition to feature selection
    2. Although the PPS has many advantages over the correlation, there is some drawback: it takes longer to calculate.

      PPS is slower to calculate the correlation.

      • single PPS = 10-500 ms
      • whole PPS matrix for 40 columns = 40*40 = 1600 individual calculations = 1-10 minutes
    3. How to use the PPS in your own (Python) project

      Using PPS with Python

      • Download ppscore: pip install ppscoreshell
      • Calculate the PPS for a given pandas dataframe:
        import ppscore as pps
        pps.score(df, "feature_column", "target_column")
        
      • Calculate the whole PPS matrix:
        pps.matrix(df)
        
    4. The PPS clearly has some advantages over correlation for finding predictive patterns in the data. However, once the patterns are found, the correlation is still a great way of communicating found linear relationships.

      PPS:

      • good for finding predictive patterns
      • can be used for feature selection
      • can be used to detect information leakage between variables
      • interpret PPS matrix as a directed graph to find entity structures Correlation:
      • good for communicating found linear relationships
    5. Let’s compare the correlation matrix to the PPS matrix on the Titanic dataset.

      Comparing correlation matrix and the PPS matrix of the Titanic dataset:

      findings about the correlation matrix:

      1. Correlation matrix is smaller because it doesn't work for categorical data
      2. Correlation matrix shows a negative correlation between TicketPrice and Class. For PPS, it's a strong predictor (0.9 PPS), but not the other way Class to TicketPrice (ticket of 5000-10000$ is most likely the highest class, but the highest class itself cannot determine the price)

      findings about the PPS matrix:

      1. First row of the matrix tells you that the best univariate predictor of the column Survived is the column Sex (Sex was dropped for correlation)
      2. TicketID uncovers a hidden pattern as well as it's connection with the TicketPrice

    6. Let’s use a typical quadratic relationship: the feature x is a uniform variable ranging from -2 to 2 and the target y is the square of x plus some error.

      In this scenario:

      • we can predict y using x
      • we cannot predict x using y as x might be negative or positive (for y=4, x=2 or -2
      • the correlation is 0. Both from x to y and from y to x because the correlation is symmetric (more often relationships are assymetric!). However, the PPS from x to y is 0.88 (not 1 because of existing error)
      • PPS from y to x is 0 because there's no relationship that y can predict if it only knows its own value

    7. how do you normalize a score? You define a lower and an upper limit and put the score into perspective.

      Normalising a score:

      • you need to put a lower and upper limit
      • upper limit can be F1 = 1, and a perfect MAE = 0
      • lower limit depends on the evaluation metric and your data set. It's the value that a naive predictor achieves
    8. For a classification problem, always predicting the most common class is pretty naive. For a regression problem, always predicting the median value is pretty naive.

      What is a naive model:

      • predicting common class for a classification problem
      • predicting median value for a regression problem
    9. Let’s say we have two columns and want to calculate the predictive power score of A predicting B. In this case, we treat B as our target variable and A as our (only) feature. We can now calculate a cross-validated Decision Tree and calculate a suitable evaluation metric.

      If the target (B) variable is:

      • numeric - we can use a Decision Tree Regressor and calculate the Mean Absolute Error (MAE)
      • categoric - we can use a Decision Tree Classifier and calculate the weighted F1 (or ROC)
    10. More often, relationships are asymmetric

      a column with 3 unique values will never be able to perfectly predict another column with 100 unique values. But the opposite might be true

    11. there are many non-linear relationships that the score simply won’t detect. For example, a sinus wave, a quadratic curve or a mysterious step function. The score will just be 0, saying: “Nothing interesting here”. Also, correlation is only defined for numeric columns.

      Correlation:

      • doesn't work with non-linear data
      • doesn't work for categorical values

      Examples:

    1. Creating meticulous tests before exploring the data is a big mistake, and will result in a well-crafted garbage-in, garbage-out pipeline. We need an environment flexible enough to encourage experiments, especially in the initial place.

      Overzealous nature of TDD may discourage from explorable data science

    1. The priorities in building a production machine learning pipeline—the series of steps that take you from raw data to product—are not fundamentally different from those of general software engineering.
      1. Your pipeline should be reproducible
      2. Collaborating on your pipeline should be easy
      3. All code in your pipeline should be testable
    2. A notebook, at a very basic level, is just a bunch of JSON that references blocks of code and the order in which they should be executed.But notebooks prioritize presentation and interactivity at the expense of reproducibility. YAML is the other side of that coin, ignoring presentation in favor of simplicity and reproducibility—making it much better for production.

      Summary of the article:

      Notebook = presentation + interactivity

      YAML = simplicity + reproducibility

    1. Repeated measures involves measuring the same cases multiple times. So, if you measured the chips, then did something to them, then measured them again, etc it would be repeated measures. Replication involves running the same study on different subjects but identical conditions. So, if you did the study on n chips, then did it again on another n chips that would be replication.

      Difference between repeated measures and replication

    1. To take full advantage of tabular augmentation for time-series you would perform the techniques in the following order: (1) transforming, (2) interacting, (3) mapping, (4) extracting, and (5) synthesising
    2. process of modular feature engineering and observation engineering while emphasising the order of augmentation to achieve the best predicted outcome from a given information set

      Tabular Data Augmentation

    1. 1) Redash and Falcon focus on people that want to do visualizations on top of SQL2) Superset, Tableau and PowerBI focus on people that want to do visualizations with a UI3) Metabase and SeekTable focus on people that want to do quick analysis (they are the closest to an Excel replacement)

      Comparison of data analysis tools:

      1) Redash & Falcon - SQL focus

      2) Superset, Tableau & PowerBI - UI workflow

      3) Metabase & SeekTable - Excel like experience

    1. JupyterLab project, which enables a richer UI including a file browser, text editors, consoles, notebooks, and a rich layout system.

      How JupyterLab differs from a traditional notebook

    1. Open an Anaconda command prompt and run conda create -n myenv python=3.7 pandas jupyter seaborn scikit-learn keras tensorflow

      Command to quickly create a new Anaconda environment: conda create -n myenv python=3.7 pandas jupyter seaborn scikit-learn keras tensorflow

    1. On the job, if you notice poor infrastructure, speak up to your manager early on. Clearly document the problem, and try to incorporate a data engineering, infrastructure, or devops team to help resolve the issue! I'd also encourage you to learn these skills too!

      Recommendation to Data Scientists dealing with poor infrastructure

    2. When I worked at Target HQ in 2012, employees would arrive to work early - often around 7am - in order to query the database at a time when few others were doing so. They hoped they’d get database results quicker. Yet, they’d still often wait several hours just to get results.

      What poor data infrastructure leads to

    3. The Data Scientist, in many cases, should be called the Data Janitor. ¯_(ツ)_/¯

      :D

    4. In regards to quality of data on the job, I’d often compare it to a garbage bag that ripped, had its content spewed all over the ground and your partner has asked you to find a beautiful earring that was accidentally inside.

      From my experience, I can only agree

    5. On the job, I’d recommend you document your work well and calculate the monetary value of your analyses based on factors like employee salary, capital investments, opportunity cost, etc. These analyses will come in handy for a promotion/review packet later too.

      Factors to keep a track of as a Data Scientist

    6. you as a Data Scientist should try to find a situation to be incredibly valuable on the job! It’s tough to find that from the outskirts of applying to jobs, but internally, you can make inroads supporting stakeholders with evidence for their decisions!

      Try showcasing the evidence of why the employer needs you

    7. As a Data Scientist in the org, are you essential to the business? Probably not. The business could go on for a while and survive without you. Sales will still be made, features will still get built, customer support will handle customer concerns, etc.

      As a Data Scientist, you are more of a "support" to the overall team

    8. As the resident Data Scientist, you may become easily inundated with requests from multiple teams at once. Be prepared to ask these teams to qualify and defend their requests, and be prepared to say “no” if their needs fall outside the scope of your actual priority queue. I’d recommend utilizing the RICE prioritization technique for projects.

      Being the only data person be sure to prioritise the requests

    9. Common questions are: How many users click this button; what % of users that visit a screen click this button; how many users have signed up by region or account type? However, the data needed to answer those questions may not exist! If the data does exists, it’s likely “dirty” - undocumented, tough to find or could be factually inaccurate. It’ll be tough to work with! You could spend hours or days attempting to answer a single question only to discover that you can’t sufficiently answer it for a stakeholder. In machine learning, you may be asked to optimize some process or experience for consumers. However, there’s uncertainty with how much, if at all, the experience can be improved!

      Common types of problems you might be working with in Data Science / Machine Learning industry

    10. In one experience, a fellow researcher spent over a month researching a particular value among our customers through qualitative and quantitative data. She presented a well-written and evidence-backed report. Yet, a few days later, a key head of product outlined a vision for the team and supported it with a claim that was antithetical to the researcher’s findings! Even if a data science project you advocate for is greenlighted, you may be on your own as the rare knowledgeable person to plan and execute it. It’s unlikely leadership will be hands-on to help you research and plan out the project.

      Data science leadership is sorely lacking

    11. Because people don’t know what data science does, you may have to support yourself with work in devops, software engineering, data engineering, etc.

      You might have multiple roles due to lack of clear "data science" terminology

    12. In 50+ interviews for data related jobs, I’ve been asked about AB testing, SQL analytics questions, optimizing SQL queries, how to code a game in Python, Logistic Regression, Gradient Boosted Trees, data structures and/or algorithms programming problems!

      Data science interviews lack a clarity of scope

    13. seven most common (and at times flagrant) ways that data science has failed to meet expectations in industry
      1. People don’t know what “data science” does.
      2. Data science leadership is sorely lacking.
      3. Data science can’t always be built to specs.
      4. You’re likely the only “data person.”
      5. Your impact is tough to measure — data doesn’t always translate to value.
      6. Data & infrastructure have serious quality problems.
      7. Data work can be profoundly unethical. Moral courage required.
    1. In data science community the performance of the model on the test dataset is one of the most important things people look at. Just look at the competitions on kaggle.com. They are extremely focused on test dataset and the performance of these models is really good.

      In data science, performance of the model on the test model is the most important metric for the majority.

      It's not always the best measurement since the most efficient model can completely misperform while receiving a different type of a dataset

    2. It's basically a look up table, interpolating between known data points. Except, unlike other interpolants like 'linear', 'nearest neighbour' or 'cubic', the underlying functional form is determined to best represent the kind of data you have.

      You can describe AI/ML methods as a look up table that adjusts to your data points unlike other interpolants (linear,nearest neighbor or cubic)

  4. Mar 2020
    1. In the interval scale, there is no true zero point or fixed beginning. They do not have a true zero even if one of the values carry the name “zero.” For example, in the temperature, there is no point where the temperature can be zero. Zero degrees F does not mean the complete absence of temperature. Since the interval scale has no true zero point, you cannot calculate Ratios. For example, there is no any sense the ratio of 90 to 30 degrees F to be the same as the ratio of 60 to 20 degrees. A temperature of 20 degrees is not twice as warm as one of 10 degrees.

      Interval data:

      • show not only order and direction, but also the exact differences between the values
      • the distances between each value on the interval scale are meaningful and equal
      • no true zero point
      • no fixed beginning
      • no possibility to calculate ratios (only add and substract)
      • e.g.: temperature in Fahrenheit or Celsius (but not Kelvin) or IQ test
    2. As the interval scales, Ratio scales show us the order and the exact value between the units. However, in contrast with interval scales, Ratio ones have an absolute zero that allows us to perform a huge range of descriptive statistics and inferential statistics. The ratio scales possess a clear definition of zero. Any types of values that can be measured from absolute zero can be measured with a ratio scale. The most popular examples of ratio variables are height and weight. In addition, one individual can be twice as tall as another individual.

      Ratio data is like interval data, but with:

      • absolute zero
      • possibility to calculate ratio (e.g. someone can be twice as tall)
      • possibility to not only add and subtract, but multiply and divide values
      • e.g.: weight, height, Kelvin scale (50K is 2x hot as 25K)
    1. If you are interested in exploring the dataset used in this article, it can be used straight from S3 with Vaex. See the full Jupyter notebook to find out how to do this.

      Example of EDA in Vaex ---> Jupyter Notebook

    2. Vaex is an open-source DataFrame library which enables the visualisation, exploration, analysis and even machine learning on tabular datasets that are as large as your hard-drive. To do this, Vaex employs concepts such as memory mapping, efficient out-of-core algorithms and lazy evaluations.

      Vaex - library to manage as large datasets as your HDD, thanks to:

      • memory mapping
      • efficient out-of-core algorithms
      • lazy evaluations.

      All wrapped in a Pandas-like API

    3. The first step is to convert the data into a memory mappable file format, such as Apache Arrow, Apache Parquet, or HDF5

      Before opening data with Vaex, we need to convert it into a memory mappable file format (e.g. Apache Arrow, Apache Parquet or HDF5). This way, 100 GB data can be load in Vaex in 0.052 seconds!

      Example of converting CSV ---> HDF5.

    4. The describe method nicely illustrates the power and efficiency of Vaex: all of these statistics were computed in under 3 minutes on my MacBook Pro (15", 2018, 2.6GHz Intel Core i7, 32GB RAM). Other libraries or methods would require either distributed computing or a cloud instance with over 100GB to preform the same computations.

      Possibilities of Vaex

    5. AWS offers instances with Terabytes of RAM. In this case you still have to manage cloud data buckets, wait for data transfer from bucket to instance every time the instance starts, handle compliance issues that come with putting data on the cloud, and deal with all the inconvenience that come with working on a remote machine. Not to mention the costs, which although start low, tend to pile up as time goes on.

      AWS as a solution to analyse data too big for RAM (like 30-50 GB range). In this case, it's still uncomfortable:

      • managing cloud data buckets
      • waiting for data transfer from bucket to instance every time the instance starts
      • handling compliance issues coming by putting data on the cloud
      • dealing with remote machines
      • costs
    1. when AUC is 0.5, it means model has no class separation capacity whatsoever.

      If AUC = 0.5

    2. ROC is a probability curve and AUC represents degree or measure of separability. It tells how much model is capable of distinguishing between classes.

      ROC & AUC

    3. In multi-class model, we can plot N number of AUC ROC Curves for N number classes using One vs ALL methodology. So for Example, If you have three classes named X, Y and Z, you will have one ROC for X classified against Y and Z, another ROC for Y classified against X and Z, and a third one of Z classified against Y and X.

      Using AUC ROC curve for multi-class model

    4. When AUC is approximately 0, model is actually reciprocating the classes. It means, model is predicting negative class as a positive class and vice versa

      If AUC = 0

    5. AUC near to the 1 which means it has good measure of separability.

      If AUC = 1

    1. LR is nothing but the binomial regression with logit link (or probit), one of the numerous GLM cases. As a regression - itself it doesn't classify anything, it models the conditional (to linear predictor) expected value of the Bernoulli/binomially distributed DV.

      Linear Regression - the ultimate definition (it's not a classification algorithm!)

      It's used for classification when we specify a 50% threshold.

    1. BOW is often used for Natural Language Processing (NLP) tasks like Text Classification. Its strengths lie in its simplicity: it’s inexpensive to compute, and sometimes simpler is better when positioning or contextual info aren’t relevant

      Usefulness of BOW:

      • simplicity
      • low on computing requirements
    2. Notice that we lose contextual information, e.g. where in the document the word appeared, when we use BOW. It’s like a literal bag-of-words: it only tells you what words occur in the document, not where they occurred

      The analogy behind using bag term in the bag-of-words (BOW) model.

    3. Here’s my preferred way of doing it, which uses Keras’s Tokenizer class

      Keras's Tokenizer Class - Victor's preferred way of implementing BOW in Python

    1. process that takes input data through a series of transformation stages, producing data as output

      Data pipeline

    1. Softmax turns arbitrary real values into probabilities

      Softmax function -

      • outputs of the function are in range [0,1] and add up to 1. Hence, they form a probability distribution
      • the calcualtion invloves e (mathematical constant) and performs operation on n numbers: $$s(x_i) = \frac{e^{xi}}{\sum{j=1}^n e^{x_j}}$$
      • the bigger the value, the higher its probability
      • lets us answer classification questions with probabilities, which are more useful than simpler answers (e.g. binary yes/no)
    1. 1. Logistic regression IS a binomial regression (with logit link), a special case of the Generalized Linear Model. It doesn't classify anything *unless a threshold for the probability is set*. Classification is just its application. 2. Stepwise regression is by no means a regression. It's a (flawed) method of variable selection. 3. OLS is a method of estimation (among others: GLS, TLS, (RE)ML, PQL, etc.), NOT a regression. 4. Ridge, LASSO - it's a method of regularization, NOT a regression. 5. There are tens of models for the regression analysis. You mention mainly linear and logistic - it's just the GLM! Learn the others too (link in a comment). STOP with the "17 types of regression every DS should know". BTW, there're 270+ statistical tests. Not just t, chi2 & Wilcoxon

      5 clarifications to common misconceptions shared over data science cheatsheets on LinkedIn

    1. 400 sized probability sample (a small random sample from the whole population) is often better than a millions sized administrative sample (of the kind you can download from gov sites). The reason is that an arbitrary sample (as opposed to a random one) is very likely to be biased, and, if large enough, a confidence interval (which actually doesn't really make sense except for probability samples) will be so narrow that, because of the bias, it will actually rarely, if ever, include the true value we are trying to estimate. On the other hand, the small, random sample will be very likely to include the true value in its (wider) confidence interval
    1. An exploratory plot is all about you getting to know the data. An explanatory graphic, on the other hand, is about telling a story using that data to a specific audience.

      Exploratory vs Explanatory plot

  5. Dec 2018
    1. ggplot2 and dplyr are clear intellectual contributions because they provide tools (grammars) for visualization and data manipulation, respectively. The tools make the tasks radically easier by providing clear organizing principles coupled with effective code. Users often remark on the ease of manipulating data with dplyr and it is natural to wonder if perhaps the task itself is trivial. We claim it is not. Many probability challenges become dramatically easier, once you strike upon the “right” notation. In both cases, what feels like a matter of notation or syntax is really a more about exploiting the “right” abstraction.

      This is a fascinating example to me of both an important intellectual contribution that is embodied in the code itself, and also how extremely important work can be invisible or taken for granted because of how seamlessly or easily it becomes integrated into other research,

    2. This article is one of many commentaries on the Donoho "50 Years of Data Science" paper.