- Jan 2020
Suppose the algorithm chooses a tree that splits on education but not on age. Conditional on this tree, the estimated coefficients are consistent. But that does not imply that treatment effects do not also vary by age, as education may well covary with age; on other draws of the data, in fact, the same procedure could have chosen a tree that split on age instead
These heterogeneous treatment effects can be used to assign treatments; Misra and Dubé (2016) illustrate this on the problem of price targeting, applying Bayesian regularized methods to a large-scale experiment where prices were randomly assigned.
todo -- look into the implication for treatment assignment with heterogeneity
Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, and Newey (2016) take care of high-dimensional controls in treatment effect estimation by solving two simultaneous prediction problems, one in the outcome and one in the treatment equation.
this seems similar to my idea of regularizing on only a subset of the variables
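To make the "two prediction problems" idea concrete for myself, here is a minimal sketch of a residual-on-residual ("partialling-out") estimate with cross-fitting. It is an assumption-laden toy: I use a simple linear fit on a single control `w` where the paper's reference would use a machine-learning predictor, and the names `fit_line`, `cross_fit_residuals`, `partialling_out` and the data are all made up for illustration.

```python
from statistics import mean

def fit_line(u, v):
    """OLS of v on u with an intercept; returns (intercept, slope)."""
    u_bar, v_bar = mean(u), mean(v)
    sxy = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    sxx = sum((a - u_bar) ** 2 for a in u)
    slope = sxy / sxx
    return v_bar - slope * u_bar, slope

def cross_fit_residuals(w, target, folds):
    """Residuals of `target` after predicting it from `w`.

    Each observation is predicted by a model fit on the other folds only.
    """
    resid = [0.0] * len(w)
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(w)) if i not in held]
        a, b = fit_line([w[i] for i in train], [target[i] for i in train])
        for i in held_out:
            resid[i] = target[i] - (a + b * w[i])
    return resid

def partialling_out(w, d, y, folds):
    """Regress outcome residuals on treatment residuals (no intercept)."""
    d_res = cross_fit_residuals(w, d, folds)  # prediction problem 1: treatment
    y_res = cross_fit_residuals(w, y, folds)  # prediction problem 2: outcome
    return sum(a * b for a, b in zip(d_res, y_res)) / sum(a * a for a in d_res)

# Hypothetical noiseless data: the true treatment coefficient is 2.0.
w = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
d = [0.3, 0.8, 2.4, 2.9, 4.1, 5.3, 5.8, 7.2]
y = [2.0 * di + 3.0 * wi for di, wi in zip(d, w)]
folds = [[0, 2, 4, 6], [1, 3, 5, 7]]
theta_hat = partialling_out(w, d, y, folds)
```

Because the toy outcome is an exact linear function of `d` and `w`, `theta_hat` recovers the coefficient on `d` up to floating-point error; with real data and flexible predictors the estimate would of course be noisy.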
These same techniques applied here result in split-sample instrumental variables (Angrist and Krueger 1995) and “jackknife” instrumental variables
some classical solutions to IV bias are akin to ML solutions
Understood this way, the finite-sample biases in instrumental variables are a consequence of overfitting.
traditional 'finite sample bias of IV' is really overfitting
Even when we are interested in a parameter β̂, the tool we use to recover that parameter may contain (often implicitly) a prediction component. Take the case of linear instrumental variables understood as a two-stage procedure: first regress x = γ′z + δ on the instrument z, then regress y = β′x + ε on the fitted values x̂. The first stage is typically handled as an estimation step. But this is effectively a prediction task: only the predictions x̂ enter the second stage; the coefficients in the first stage are merely a means to these fitted values.
first stage of IV -- handled as an estimation problem, but really it's a prediction problem!
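A minimal sketch of the two-stage procedure with a single instrument, to show that only the fitted values x̂ reach the second stage. The helper names (`fit_line`, `two_stage_least_squares`) and the toy numbers are my own assumptions, not from the paper.

```python
from statistics import mean

def fit_line(u, v):
    """OLS of v on u with an intercept; returns (intercept, slope)."""
    u_bar, v_bar = mean(u), mean(v)
    sxy = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    sxx = sum((a - u_bar) ** 2 for a in u)
    slope = sxy / sxx
    return v_bar - slope * u_bar, slope

def two_stage_least_squares(z, x, y):
    """Single-instrument 2SLS; returns the second-stage slope on x_hat."""
    # First stage: its coefficients matter only through the predictions x_hat.
    a0, a1 = fit_line(z, x)
    x_hat = [a0 + a1 * zi for zi in z]
    # Second stage: regress y on the fitted values, not on x itself.
    _, beta = fit_line(x_hat, y)
    return beta

# Hypothetical toy data, chosen by hand for illustration only.
z = [0.0, 1.0, 2.0, 3.0, 4.0]
x = [0.5, 1.0, 2.5, 3.0, 4.5]
y = [1.0, 2.5, 4.5, 6.5, 8.0]
beta_iv = two_stage_least_squares(z, x, y)
```

With one instrument this slope equals cov(z, y) / cov(z, x), so any other first-stage predictor that produced the same x̂ would give the same answer.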
Prediction in the Service of Estimation
This is especially relevant to economists across the board, even the ML skeptics
The first application: constructing variables and meaning from high-dimensional data, especially outcome variables
- satellite images (of energy use, lights etc) --> economic activity
- cell phone data, Google street view to measure wealth
- extract similarity of firms from 10k reports
- even traditional data .. matching individuals in historical censuses
Zhao and Yu (2006) who establish asymptotic model-selection consistency for the LASSO. Besides assuming that the true model is “sparse”—only a few variables are relevant—they also require the “irrepresentable condition” between observables: loosely put, none of the irrelevant covariates can be even moderately related to the set of relevant ones.
Basically unrealistic for microeconomic applications imho
First, it encourages the choice of less complex, but wrong models. Even if the best model uses interactions of number of bathrooms with number of rooms, regularization may lead to a choice of a simpler (but worse) model that uses only number of fireplaces. Second, it can bring with it a cousin of omitted variable bias, where we are typically concerned with correlations between observed variables and unobserved ones. Here, when regularization excludes some variables, even a correlation between observed variables and other observed (but excluded) ones can create bias in the estimated coefficients.
Is this equally a problem for procedures that do not assume sparsity, such as the Ridge model?
If the variables are correlated with each other (say the number of rooms of a house and its square-footage), then such variables are substitutes in predicting house prices. Similar predictions can be produced using very different variables. Which variables are actually chosen depends on the specific finite sample.
Lasso-chosen variables are unstable because of what we usually call 'multicollinearity.' This presents a problem for making inferences from estimated coefficients.
Through its regularizer, LASSO produces a sparse prediction function, so that many coefficients are zero and are “not used”—in this example, we find that more than half the variables are unused in each run
This is true but they fail to mention that LASSO also shrinks the coefficients on variables that it keeps towards zero (relative to OLS). I think this is commonly misunderstood (from people I've spoken with).
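A small sketch of the shrinkage point. In the special case of an orthonormal design (XᵀX = I) and the objective ½‖y − Xb‖² + λ‖b‖₁, the LASSO solution is the soft-thresholded OLS coefficient, so kept coefficients are pulled toward zero by λ and small ones are set exactly to zero. The function name and the example numbers are assumptions for illustration; with correlated regressors there is no such closed form.

```python
def soft_threshold(b_ols, lam):
    """LASSO coefficient under an orthonormal design, given the OLS coefficient."""
    if b_ols > lam:
        return b_ols - lam   # kept, but shrunk toward zero by lam
    if b_ols < -lam:
        return b_ols + lam   # kept, but shrunk toward zero by lam
    return 0.0               # dropped: |b_ols| <= lam

# Hypothetical OLS coefficients and penalty.
ols_coefs = [3.0, -1.5, 0.4]
lam = 0.5
lasso_coefs = [soft_threshold(b, lam) for b in ols_coefs]
# lasso_coefs == [2.5, -1.0, 0.0]
```

The first two variables are "used", yet their coefficients are smaller in magnitude than the OLS ones, which is why reading LASSO coefficients as OLS-style effect sizes is misleading.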
One obvious problem that arises in making such inferences is the lack of standard errors on the coefficients. Even when machine-learning predictors produce familiar output like linear functions, forming these standard errors can be more complicated than seems at first glance as they would have to account for the model selection itself. In fact, Leeb and Pötscher (2006, 2008) develop conditions under which it is impossible to obtain (uniformly) consistent estimates of the distribution of model parameters after data-driven selection.
This is a very serious limitation for Economics academic work.
First, econometrics can guide design choices, such as the number of folds or the function class.
How would Econometrics guide us in this?
These choices about how to represent the features will interact with the regularizer and function class: A linear model can reproduce the log base area per room from log base area and log room number easily, while a regression tree would require many splits to do so.
The choice of 'how to represent the features' is consequential ... it's not just 'throw it all in' (kitchen sink approach)
Table 2: Some Machine Learning Algorithms
This is a very helpful table!
Picking the prediction function then involves two steps: The first step is, conditional on a level of complexity, to pick the best in-sample loss-minimizing function. The second step is to estimate the optimal level of complexity using empirical tuning (as we saw in cross-validating the depth of the tree).
ML explained while standing on one leg.
Regularization combines with the observability of prediction quality to allow us to fit flexible functional forms and still find generalizable structure.
But we can't really make statistical inferences about the structure, can we?
This procedure works because prediction quality is observable: both predictions ŷ and outcomes y are observed. Contrast this with parameter estimation, where typically we must rely on assumptions about the data-generating process to ensure consistency.
I'm not clear what the implication they are making here is. Does it in some sense 'not work' with respect to parameter estimation?
In empirical tuning, we create an out-of-sample experiment inside the original sample.
remember that tuning is done within the training sample
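A rough sketch of empirical tuning as k-fold cross-validation carried out entirely inside the training sample. As a stand-in for the paper's tree-depth example I tune the penalty of a one-regressor ridge model without intercept; the function names, the penalty grid and the simulated data are all my own assumptions.

```python
import random

def ridge_slope(xs, ys, lam):
    """One-regressor ridge without intercept: sum(x*y) / (sum(x^2) + lam)."""
    return sum(a * b for a, b in zip(xs, ys)) / (sum(a * a for a in xs) + lam)

def kfold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cv_choose_penalty(xs, ys, grid, k=5, seed=0):
    """Return the penalty with the lowest cross-validated MSE, plus all MSEs."""
    folds = kfold_indices(len(xs), k, seed)
    cv_mse = {}
    for lam in grid:
        sq_err, count = 0.0, 0
        for held_out in folds:
            held = set(held_out)
            tr_x = [xs[i] for i in range(len(xs)) if i not in held]
            tr_y = [ys[i] for i in range(len(xs)) if i not in held]
            b = ridge_slope(tr_x, tr_y, lam)          # fit on the other folds
            sq_err += sum((ys[i] - b * xs[i]) ** 2 for i in held_out)
            count += len(held_out)                    # score on the held-out fold
        cv_mse[lam] = sq_err / count
    best = min(grid, key=lambda lam: cv_mse[lam])
    return best, cv_mse

# Simulated training sample (hypothetical); the hold-out set is never touched here.
rng = random.Random(1)
train_x = [rng.gauss(0, 1) for _ in range(60)]
train_y = [2.0 * a + rng.gauss(0, 1) for a in train_x]
grid = [0.0, 1.0, 10.0, 100.0]
best_lam, cv_mse = cv_choose_penalty(train_x, train_y, grid)
```

Once `best_lam` is chosen, the model would be refit on the full training sample and only then evaluated on the separate hold-out set.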
Performance of Different Algorithms in Predicting House Values
Any reason they didn't try a Ridge or an Elastic net model here? My instinct is that these will beat LASSO for most Economic applications.
We consider 10,000 randomly selected owner-occupied units from the 2011 metropolitan sample of the American Housing Survey. In addition to the values of each unit, we also include 150 variables that contain information about the unit and its location, such as the number of rooms, the base area, and the census region within the United States. To compare different prediction techniques, we evaluate how well each approach predicts (log) unit value on a separate hold-out set of 41,808 units from the same sample. All details on the sample and our empirical exercise can be found in an online appendix available with this paper at http://e-jep.org
Seems a useful example for trying/testing/benchmarking. But the link didn't work for me. Can anyone find it? Is it interactive? (This is why I think papers should be html and not pdfs...)
Making sense of complex data such as images and text often involves a prediction pre-processing step.
In using 'new kinds of data' in Economics we often need to do a 'classification step' first
The fundamental insight behind these breakthroughs is as much statistical as computational. Machine intelligence became possible once researchers stopped approaching intelligence tasks procedurally and began tackling them empirically.
I hadn't thought about how this unites the 'statistics to learn stuff' part of ML and the 'build a tool to do a task' part. Well-phrased.
In another category of applications, the key object of interest is actually a parameter β, but the inference procedures (often implicitly) contain a prediction task. For example, the first stage of a linear instrumental variables regression is effectively prediction. The same is true when estimating heterogeneous treatment effects, testing for effects on multiple outcomes in experiments, and flexibly controlling for observed confounders.
This is the most relevant tool for me. Before I learned about ML I often thought about using 'stepwise selection' for such tasks... to find the best set of 'control variables' etc. But without regularisation this seemed problematic.
Machine Learning: An Applied Econometric Approach
Shall we use Hypothesis to have a discussion ?
- May 2019
- Sep 2014
"the scholarly and scientific record is rapidly evolving to become a formalized web of content, in which any node must be able to be linked to any other node, with formal, typed relationships" (2)
- Feb 2014
Chapter 1, The Art of Community We begin the book with a bird’s-eye view of how communities function at a social science level. We cover the underlying nuts and bolts of how people form communities, what keeps them involved, and the basis and opportunities behind these interactions. Chapter 2, Planning Your Community Next we carve out and document a blueprint and strategy for your community and its future growth. Part of this strategy includes the target objectives and goals and how the community can be structured to achieve them. Chapter 3, Communicating Clearly At the heart of community is communication, and great communicators can have a tremendously positive impact. Here we lay down the communications backbone and the best practices associated with using it.
Reading the first 3 chapters of AoC for discussion in #coasespenguin on 2013-02-11.
- Jan 2014
This suggests that peer production will thrive where projects have three characteristics
If thriving is a metric (is it measurable? too subjective?) of success then the 3 characteristics it must have are:
- modularity: divisible into components
- granularity: fine-grained modularity
- integrability: low-cost integration of contributions
I don't dispute that these characteristics are needed, but they are too general to be helpful, so I propose that we look at these three characteristics through the lens of the type of contributor we are seeking to motivate.
How do these characteristics inform what we should focus on to remove barriers to collaboration for each of these contributor-types?
Below I've made up a rough list of lenses. Maybe you have links or references that have already made these classifications better than I have... if so, share them!
Roughly here are the classifications of the types of relationships to open source projects that I commonly see:
core developers: either hired by a company, foundation, or some entity to work on the project. These people care most about integrability.
ecosystem contributors: someone either self-motivated or who receives a reward via some mechanism outside the institution that funds the core developers (e.g. reputation, portfolio for future job prospects, tools and platforms that support a consulting business, etc). These people care most about modularity.
feature-driven contributors: The project is useful out-of-the-box for these people, and rather than build their own tool from scratch they see that it is possible to make the tool work the way they want by merely contributing code, or at least a feature request based on their idea. These people care most about granularity.
The above lenses fit the characteristics outlined in the article, but below are other contributor-types that don't directly care about these characteristics.
the funder: a company, foundation, crowd, or some other funding body that directly funds the core developers to work on the project for hire.
consumer contributors: This class of people might not even be aware that they are contributors, but simply using the project returns direct benefits through logs and other instrumented uses of the tool to generate data that can be used to improve the project.
knowledge-driven contributors: These contributors are most likely closest to the ecosystem contributors, maybe even a sub-species of those, that contribute to documentation and learning the system; they may be less-skilled at coding, but still serve a valuable part of the community even if they are not committing to the core code base.
failure-driven contributors: A primary source of bug reports and may also be any one of the other lenses.
What other lenses might be useful to look through? What characteristics are we missing? How can we reduce barriers to contribution for each of these contributor types?
I feel that there are plenty of motivations... but what barriers exist and what motivations are sufficient for enough people to be willing to surmount those barriers? I think it may be easier to focus on the barriers to make contributing less painful for the already-convinced, than to think about the motivators for those needing to be convinced-- I think the consumer contributors are some of the very best suited to convince the unconvinced; our job should be to remove the barriers for people at each stage of community we are trying to build.
A note to the awesome folks at Hypothes.is who are reading our consumer contributions... given the current state of the hypothes.is project, what class of contributors are you most in need of?
the proposition that diverse motivations animate human beings, and, more importantly, that there exist ranges of human experience in which the presence of monetary rewards is inversely related to the presence of other, social-psychological rewards.
The first analytic move.
common appropriation regimes do not give a complete answer to the sustainability of motivation and organization for the truly open, large-scale nonproprietary peer production projects we see on the Internet.
Towards the end of our last conversation the text following "common appropriation" seemed an interesting place to dive into further for our future discussions.
I have tagged this annotation with "meta" because it is a comment about our discussion and where to continue it rather than an annotation focused on the content itself.
In the future I would be interested in exploring the idea of "annotation types" that can be selectively turned on and off, but for now will handle that with ad hoc tags like "meta".
The following selection from The Yale Law Journal is not paginated and should not be used for citation purposes.
Note that this disclaimer only says the document should not be used for citation purposes, but doesn't say we can't use it for annotation purposes like testing out the Chrome PDF.js + Hypothes.is extension! :)
You can install the extension from the Chrome Web Store with this link:
understanding that when a project of any size is broken up into little pieces, each of which can be performed by an individual in a short amount of time, the motivation to get any given individual to contribute need only be very small.
The second analytic move.
- reading group
- peer production
- to read
- chrome extension
- metric of success
- the problem
- the answer