I have some background in machine learning—which may be an aid or an obstacle in understanding this description of Blumestock’s work—but it’s not obvious to me what “call logs” and “call data” mean here.
The researchers collected survey data on a small sample of the entire population—ok.
They trained a model that consumes “call data” and estimates wealth—but what is “call data” here? The survey data without the wealth data? Or is it cell phone usage data from the cell phone company?
Then they ran the model on data for all customers. This is what makes me think the model was trained on some kind of cell phone data, because that’s the only data available for people outside the survey.
Thinking about it a bit more, and without reading the actual paper (open review right 😛), I think I understand what they did: they used survey data to train a model that predicted wealth from cell phone usage data (truth = wealth data from surveys). Then they tested the resulting model on cell phone usage data on all customers to estimate wealth across the country. Is this accurate?