 Jan 2020

pubs.aeaweb.org

Suppose the algorithm chooses a tree that splits on education but not on age. Conditional on this tree, the estimated coefficients are consistent. But that does not imply that treatment effects do not also vary by age, as education may well covary with age; on other draws of the data, in fact, the same procedure could have chosen a tree that split on age instead
a caveat

These heterogeneous treatment effects can be used to assign treatments; Misra and Dubé (2016) illustrate this on the problem of price targeting, applying Bayesian regularized methods to a large-scale experiment where prices were randomly assigned
todo: look into the implications for treatment assignment with heterogeneity

Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, and Newey (2016) take care of high-dimensional controls in treatment effect estimation by solving two simultaneous prediction problems, one in the outcome and one in the treatment equation.
this seems similar to my idea of regularizing on only a subset of the variables
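A back-of-envelope version of this two-prediction idea can be sketched on simulated data: regularize both auxiliary predictions, then regress residual on residual. The variable names, penalty levels, and use of sklearn's Lasso here are my own illustration, not the paper's procedure:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
n, p = 500, 50
X = rng.normal(size=(n, p))                   # high-dimensional controls
d = X[:, 0] + rng.normal(size=n)              # treatment, confounded by X[:, 0]
y = 2.0 * d + X[:, 0] + rng.normal(size=n)    # true treatment effect = 2.0

# Two simultaneous prediction problems: y ~ X and d ~ X, both regularized
y_hat = Lasso(alpha=0.05).fit(X, y).predict(X)
d_hat = Lasso(alpha=0.05).fit(X, d).predict(X)

# Residual-on-residual regression recovers the treatment effect
res_y, res_d = y - y_hat, d - d_hat
beta = LinearRegression().fit(res_d.reshape(-1, 1), res_y).coef_[0]
```

A naive single Lasso of y on (d, X) would shrink the coefficient on d itself; predicting out the controls from both equations first avoids that.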

These same techniques applied here result in split-sample instrumental variables (Angrist and Krueger 1995) and “jackknife” instrumental variables
some classical solutions to IV bias are akin to ML solutions

Understood this way, the finite-sample biases in instrumental variables are a consequence of overfitting.
traditional 'finite sample bias of IV' is really overfitting
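A simulation along these lines (my own toy setup, not from the paper): with many instruments and a modest sample, an in-sample first stage overfits and drags 2SLS toward the biased OLS answer, while fitting the first stage on held-out data removes that channel:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(8)
n, k = 200, 50                               # many instruments, modest n
Z = rng.normal(size=(n, k))
u = rng.normal(size=n)                       # unobserved confounder
x = 0.3 * Z[:, 0] + u + rng.normal(size=n)   # only one instrument is relevant
y = 1.0 * x + u + rng.normal(size=n)         # true beta = 1.0

# In-sample first stage overfits: x_hat partly absorbs u,
# pushing the second stage toward the (upward-biased) OLS estimate
x_hat = LinearRegression().fit(Z, x).predict(Z)
beta_2sls = LinearRegression().fit(x_hat.reshape(-1, 1), y).coef_[0]

# Split-sample IV: estimate the first stage on one half,
# predict on the other, so the first stage cannot overfit
h = n // 2
x_hat_oos = LinearRegression().fit(Z[:h], x[:h]).predict(Z[h:])
beta_ssiv = LinearRegression().fit(x_hat_oos.reshape(-1, 1), y[h:]).coef_[0]
```

The split-sample estimate is noisier (half the data, a weak instrument) but does not inherit the mechanical overfitting bias.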

Even when we are interested in a parameter β̂, the tool we use to recover that parameter may contain (often implicitly) a prediction component. Take the case of linear instrumental variables understood as a two-stage procedure: first regress x = γ′z + δ on the instrument z, then regress y = β′x + ε on the fitted values x̂. The first stage is typically handled as an estimation step. But this is effectively a prediction task: only the predictions x̂ enter the second stage; the coefficients in the first stage are merely a means to these fitted values.
first stage of IV is handled as an estimation problem, but really it's a prediction problem!
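The point can be made concrete with a hand-rolled two-stage least squares on simulated data, where the first stage exists only to produce fitted values (all numbers and names here are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 2000
z = rng.normal(size=(n, 1))                  # instrument
u = rng.normal(size=n)                       # unobserved confounder
x = 0.8 * z[:, 0] + u + rng.normal(size=n)   # endogenous regressor
y = 1.5 * x + u + rng.normal(size=n)         # true beta = 1.5

# OLS is biased upward because x is correlated with u
beta_ols = LinearRegression().fit(x.reshape(-1, 1), y).coef_[0]

# Stage 1 is a pure prediction task: only x_hat is carried forward
x_hat = LinearRegression().fit(z, x).predict(z)
# Stage 2: regress y on the fitted values
beta_iv = LinearRegression().fit(x_hat.reshape(-1, 1), y).coef_[0]
```

Nothing in stage 2 uses the first-stage coefficients directly; only the predictions x_hat matter, which is why one could in principle swap in any (careful) ML predictor there.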

Prediction in the Service of Estimation
This is especially relevant to economists across the board, even the ML skeptics

New Data
The first application: constructing variables and meaning from high-dimensional data, especially outcome variables
- satellite images (of energy use, lights, etc.) → economic activity
- cell phone data, Google Street View to measure wealth
- extract similarity of firms from 10-K reports
- even traditional data ... matching individuals in historical censuses

Zhao and Yu (2006), who establish asymptotic model-selection consistency for the LASSO. Besides assuming that the true model is “sparse”—only a few variables are relevant—they also require the “irrepresentable condition” between observables: loosely put, none of the irrelevant covariates can be even moderately related to the set of relevant ones.
Basically unrealistic for microeconomic applications imho

First, it encourages the choice of less complex, but wrong models. Even if the best model uses interactions of number of bathrooms with number of rooms, regularization may lead to a choice of a simpler (but worse) model that uses only number of fireplaces. Second, it can bring with it a cousin of omitted variable bias, where we are typically concerned with correlations between observed variables and unobserved ones. Here, when regularization excludes some variables, even a correlation between observed variables and other observed (but excluded) ones can create bias in the estimated coefficients.
Is this equally a problem for procedures that do not assume sparsity, such as the Ridge model?

the variables are correlated with each other (say the number of rooms of a house and its square footage), then such variables are substitutes in predicting house prices. Similar predictions can be produced using very different variables. Which variables are actually chosen depends on the specific finite sample.
LASSO-chosen variables are unstable because of what we usually call 'multicollinearity.' This presents a problem for making inferences from estimated coefficients.
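A quick way to see this instability (a synthetic sketch with near-duplicate regressors, my own setup): refit LASSO on bootstrap resamples and record which variables survive; the selected set changes from draw to draw:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n = 200
rooms = rng.normal(size=n)
sqft = rooms + 0.1 * rng.normal(size=n)      # near-duplicate of rooms
price = rooms + 0.5 * rng.normal(size=n)     # truth depends only on rooms

supports = set()
for _ in range(50):
    idx = rng.integers(0, n, size=n)         # bootstrap resample
    X = np.column_stack([rooms[idx], sqft[idx]])
    coef = Lasso(alpha=0.05).fit(X, price[idx]).coef_
    supports.add(tuple(coef != 0))           # which variables survived
```

Predictions barely move across these refits; which of the two substitutes carries the coefficient is what flips.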

Through its regularizer, LASSO produces a sparse prediction function, so that many coefficients are zero and are “not used”—in this example, we find that more than half the variables are unused in each run
This is true but they fail to mention that LASSO also shrinks the coefficients on variables that it keeps towards zero (relative to OLS). I think this is commonly misunderstood (from people I've spoken with).
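The shrinkage is easy to verify on simulated data: the coefficient LASSO keeps is noticeably smaller than its OLS counterpart, even while the irrelevant coefficients are zeroed out exactly (illustrative setup, my own parameter choices):

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(4)
n = 500
X = rng.normal(size=(n, 5))
y = 3.0 * X[:, 0] + rng.normal(size=n)   # only the first variable matters

ols = LinearRegression().fit(X, y).coef_
lasso = Lasso(alpha=0.5).fit(X, y).coef_
# lasso[0] is pulled toward zero relative to ols[0] (roughly by alpha,
# via soft-thresholding); lasso[1:] are set to exactly zero
```

So LASSO does two things at once: it selects, and it shrinks what it selects.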

One obvious problem that arises in making such inferences is the lack of standard errors on the coefficients. Even when machine-learning predictors produce familiar output like linear functions, forming these standard errors can be more complicated than it seems at first glance, as they would have to account for the model selection itself. In fact, Leeb and Pötscher (2006, 2008) develop conditions under which it is impossible to obtain (uniformly) consistent estimates of the distribution of model parameters after data-driven selection.
This is a very serious limitation for Economics academic work.

First, econometrics can guide design choices, such as the number of folds or the function class.
How would Econometrics guide us in this?

These choices about how to represent the features will interact with the regularizer and function class: A linear model can reproduce the log base area per room from log base area and log room number easily, while a regression tree would require many splits to do so.
The choice of 'how to represent the features' is consequential ... it's not just 'throw it all in' (kitchen sink approach)
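A toy version of the paper's example (my own simulated numbers): given log base area and log room number, a linear model recovers log area-per-room exactly, while a depth-limited tree can only approximate the same surface with step functions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(9)
area = rng.uniform(50, 300, size=1000)
rooms = rng.integers(1, 8, size=1000).astype(float)
X = np.column_stack([np.log(area), np.log(rooms)])
y = np.log(area / rooms)                     # = log(area) - log(rooms)

# A linear model recovers the target exactly from the two logs
lin_r2 = LinearRegression().fit(X, y).score(X, y)
# A shallow tree approximates the same linear surface with 8 constant pieces
tree_r2 = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y).score(X, y)
```

Swap the representation (feed the tree the ratio directly) and the ranking reverses, which is the point: representation, regularizer, and function class interact.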

Table 2: Some Machine Learning Algorithms
This is a very helpful table!

Picking the prediction function then involves two steps: The first step is, conditional on a level of complexity, to pick the best in-sample loss-minimizing function. The second step is to estimate the optimal level of complexity using empirical tuning (as we saw in cross-validating the depth of the tree).
ML explained while standing on one leg.
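Those two steps can be sketched with scikit-learn, using tree depth as the complexity knob (a minimal illustration on simulated data, not the paper's code):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(5)
X = rng.uniform(-2, 2, size=(400, 1))
y = np.sign(X[:, 0]) + 0.3 * rng.normal(size=400)   # one true split, plus noise

# Step 1 happens inside each fit: the best tree at a given depth.
# Step 2 is the empirical tuning: pick the depth by cross-validation.
search = GridSearchCV(DecisionTreeRegressor(random_state=0),
                      {"max_depth": list(range(1, 11))}, cv=5)
search.fit(X, y)
best_depth = search.best_params_["max_depth"]
```

Since the true function here needs only one split, cross-validation should settle on a shallow tree rather than the deepest one on offer.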

Regularization combines with the observability of prediction quality to allow us to fit flexible functional forms and still find generalizable structure.
But we can't really make statistical inferences about the structure, can we?

This procedure works because prediction quality is observable: both predictions ŷ and outcomes y are observed. Contrast this with parameter estimation, where typically we must rely on assumptions about the data-generating process to ensure consistency.
I'm not clear what the implication they are making here is. Does it in some sense 'not work' with respect to parameter estimation?

In empirical tuning, we create an out-of-sample experiment inside the original sample.
remember that tuning is done within the training sample

Performance of Different Algorithms in Predicting House Values
Any reason they didn't try a Ridge or an Elastic net model here? My instinct is that these will beat LASSO for most Economic applications.
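For what it's worth, a quick synthetic comparison is easy to run; on dense, correlated designs like this toy one, ridge and elastic net are competitive with LASSO (entirely simulated data and my own parameter choices, not the paper's AHS exercise):

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n, p = 2000, 30
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.2 * rng.normal(size=n)     # correlated regressors
beta = np.linspace(1.0, 0.1, p)                   # dense, slowly decaying effects
y = X @ beta + rng.normal(size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
r2 = {name: m.fit(X_tr, y_tr).score(X_te, y_te)
      for name, m in {"lasso": Lasso(alpha=0.1),
                      "ridge": Ridge(alpha=1.0),
                      "enet": ElasticNet(alpha=0.1, l1_ratio=0.5)}.items()}
```

Which penalty wins depends on how sparse the true coefficients are; with dense effects, as many economic applications arguably have, the l2-flavored penalties tend to hold up well.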

We consider 10,000 randomly selected owner-occupied units from the 2011 metropolitan sample of the American Housing Survey. In addition to the values of each unit, we also include 150 variables that contain information about the unit and its location, such as the number of rooms, the base area, and the census region within the United States. To compare different prediction techniques, we evaluate how well each approach predicts (log) unit value on a separate hold-out set of 41,808 units from the same sample. All details on the sample and our empirical exercise can be found in an online appendix available with this paper at http://ejep.org
Seems a useful example for trying/testing/benchmarking. But the link didn't work for me. Can anyone find it? Is it interactive? (This is why I think papers should be html and not pdfs...)

Making sense of complex data such as images and text often involves a prediction preprocessing step.
In using 'new kinds of data' in Economics we often need to do a 'classification step' first

The fundamental insight behind these breakthroughs is as much statistical as computational. Machine intelligence became possible once researchers stopped approaching intelligence tasks procedurally and began tackling them empirically.
I hadn't thought about how this unites the 'statistics to learn stuff' part of ML and the 'build a tool to do a task' part. Wellphrased.

In another category of applications, the key object of interest is actually a parameter β, but the inference procedures (often implicitly) contain a prediction task. For example, the first stage of a linear instrumental variables regression is effectively prediction. The same is true when estimating heterogeneous treatment effects, testing for effects on multiple outcomes in experiments, and flexibly controlling for observed confounders.
This is the most relevant tool for me. Before I learned about ML I often thought about using 'stepwise selection' for such tasks... to find the best set of 'control variables' etc. But without regularisation this seemed problematic.

Machine Learning: An Applied Econometric Approach
Shall we use Hypothesis to have a discussion?
