Why Piecewise-Constant Models?
Goal: Estimate the ATT in this Population
\[
\newcommand{\alert}[1]{\textcolor{red}{#1}}
\newcommand{\model}{\mathcal{M}}
\newcommand{\R}{\mathbb{R}}
\DeclareMathOperator{\E}{E}
\DeclareMathOperator*{\argmin}{argmin}
\DeclareMathOperator{\V}{V}
\DeclareMathOperator{\Var}{\V}
\newcommand{\inblue}[1]{\textcolor{blue}{#1}}
\newcommand{\inred}[1]{\textcolor[RGB]{248,118,109}{#1}}
\newcommand{\ingreen}[1]{\textcolor[RGB]{0,191,196}{#1}}
\newcommand{\ingray}[1]{\textcolor[RGB]{150,150,150}{#1}}
\newcommand{\X}{\mathbf{X}}
\newcommand{\Y}{\mathbf{Y}}
\]
\[
\text{ATT} = \sum_x \ingreen{f_1(x)} \cdot \qty{\ingreen{\mu(1,x)} - \inred{\mu(0,x)}}
\]
We’ll Use This Sample
\[
\widehat{\text{ATT}} = \sum_x \ingreen{\hat f_1(x)} \cdot \qty{\ingreen{\hat\mu(1,x)} - \inred{\hat\mu(0,x)}}
\]
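To make the plug-in formula concrete, here’s a minimal R sketch. The data frame dat with columns x, d, y and the fitted regression function mu.hat are hypothetical placeholders, not objects from these slides.
# a sketch of the plug-in ATT, assuming a data frame `dat` with columns x, d, y
# and a fitted regression function mu.hat(d, x), vectorized in x -- all hypothetical
x.levels <- sort(unique(dat$x))
f1.hat   <- sapply(x.levels, function(v) mean(dat$x[dat$d == 1] == v))  # covariate distribution among the treated
att.hat  <- sum(f1.hat * (mu.hat(1, x.levels) - mu.hat(0, x.levels)))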
Options
\[
\widehat{\text{ATT}} = \sum_x \ingreen{\hat f_1(x)} \cdot \qty{\ingreen{\hat\mu(1,x)} - \inred{\hat\mu(0,x)}}
\qfor
\hat\mu = \argmin_{m \in\model } \frac{1}{n}\sum_{i=1}^n \text{weight}(D_i,X_i) \ \qty{m(D_i,X_i) - Y_i}^2
\]
The Saturated Model
\[
\widehat{\text{ATT}} = \sum_x \ingreen{\hat f_1(x)} \cdot \qty{\ingreen{\hat\mu(1,x)} - \inred{\hat\mu(0,x)}}
\qfor
\hat\mu = \argmin_{m \in\model } \frac{1}{n}\sum_{i=1}^n \text{weight}(D_i,X_i) \ \qty{m(D_i,X_i) - Y_i}^2
\]
It doesn’t matter whether we use weights for this one. Why not?
The model lets us make any prediction at any level \((d,x)\) of the treatment and covariate.
So we’ll make the best prediction at each level in terms of our error criterion.
The least squares thing to do is to use the sample means. \[
\begin{aligned}
\hat \mu(d,x) &= \argmin_{m \in \R} \sum_{i:D_i=d, X_i=x} \qty{m - Y_i}^2 = \frac{\sum_{i:D_i=d, X_i=x} Y_i}{\sum_{i:D_i=d, X_i=x} 1}
\end{aligned}
\]
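A quick sanity check of that claim in R, on a tiny made-up dataset (the numbers are arbitrary):
d <- c(0, 0, 1, 1, 0, 1)
x <- c(1, 1, 1, 1, 2, 2)
y <- c(3, 5, 10, 12, 4, 9)
sat <- lm(y ~ factor(d) * factor(x))  # saturated: one free prediction per (d, x) cell
cbind(fitted(sat), ave(y, d, x))      # identical columns: the fitted values are the cell means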
The Saturated Model
Why is the weighted least squares thing to do the same? \[
\begin{aligned}
\hat \mu(d,x) &= \argmin_{m \in \R} \sum_{i:D_i=d, X_i=x} w(D_i,X_i) \qty{m - Y_i}^2 \overset{?}{=} \frac{\sum_{i:D_i=d, X_i=x} Y_i}{\sum_{i:D_i=d, X_i=x} 1}
\end{aligned}
\]
The weights are constant within each subsample—\(w(D_i,X_i) = w(d,x)\).
So what we’re minimizing to get \(\hat\mu(d,x)\) is unweighted squared error times a constant.
That constant factor doesn’t change the solution.
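Spelling the constant-factor step out in one line:
\[
\sum_{i:D_i=d, X_i=x} w(D_i,X_i) \qty{m - Y_i}^2
= w(d,x) \sum_{i:D_i=d, X_i=x} \qty{m - Y_i}^2
\]
and multiplying the criterion by a positive constant \(w(d,x)\) doesn’t move its minimizer.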
The Saturated Model’s Sampling Distribution
The Interacted Bivariate Linear Model
\[
\widehat{\text{ATT}} = \sum_x \ingreen{\hat f_1(x)} \cdot \qty{\ingreen{\hat\mu(1,x)} - \inred{\hat\mu(0,x)}}
\qfor
\hat\mu = \argmin_{m \in\model } \frac{1}{n}\sum_{i=1}^n \text{weight}(D_i,X_i) \ \qty{m(D_i,X_i) - Y_i}^2
\]
What’s bad about fitting lines without weights?
If we don’t use weights, we’ll fit the data in the wrong place.
i.e., where there are a lot of red (untreated) folks and few green (treated) folks.
which is not what’s relevant for estimating the ATT.
Inverse probability weighting fixes this by telling the line what matters and what doesn’t.
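For reference, a common choice of weights when the estimand is the ATT is the one below. This is the standard textbook form; take it as a reminder rather than a claim about the exact weights used on these slides.
\[
\text{weight}(d,x) = d + (1-d) \cdot \frac{\pi(x)}{1-\pi(x)}
\qfor \pi(x) = P(D=1 \mid X=x)
\]
It leaves treated observations alone and upweights untreated observations where treated folks are common.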
The Interacted Model’s Sampling Distribution
The Additive Bivariate Linear Model
\[
\widehat{\text{ATT}} = \sum_x \ingreen{\hat f_1(x)} \cdot \qty{\ingreen{\hat\mu(1,x)} - \inred{\hat\mu(0,x)}}
\qfor
\hat\mu = \argmin_{m \in\model } \frac{1}{n}\sum_{i=1}^n \text{weight}(D_i,X_i) \ \qty{m(D_i,X_i) - Y_i}^2
\]
What’s worse about fitting parallel lines without weights?
We may not be able to fit the data at all. And our slope will be some mix of …
the slope of the red folks on the left, where they are.
the slope of the green folks on the right, where they are.
The distance between these lines is our estimate of the ATT.
And that distance isn’t a comparison of people on the right to other people on the right; the data on the left pulls it around too.
Inverse probability weighting fixes this by, in effect, ignoring the irrelevant stuff on the left.
The Additive Model’s Sampling Distribution
Piecewise-Constant Models
\[
\widehat{\text{ATT}} = \sum_x \ingreen{\hat f_1(x)} \cdot \qty{\ingreen{\hat\mu(1,x)} - \inred{\hat\mu(0,x)}}
\qfor
\hat\mu = \argmin_{m \in\model } \frac{1}{n}\sum_{i=1}^n \textcolor{purple}{\text{weight}(D_i,X_i)} \ \qty{m(D_i,X_i) - Y_i}^2
\]
Here I’ve fit an interacted piecewise-constant model. This is less susceptible to this problem. \[
\hat\mu(d,x) = \hat\beta_{0,d} + \hat\beta_{1,d} \cdot \mathbf{1}\qty{x > 60} + \hat\beta_{2,d} \cdot \mathbf{1}\qty{x > 70}
\]
That’s sort of like what happens with the saturated model. Why?
Hint. Look at the purple line, which shows the weights for untreated folks.
Piecewise-Constant Models
\[
\widehat{\text{ATT}} = \sum_x \ingreen{\hat f_1(x)} \cdot \qty{\ingreen{\hat\mu(1,x)} - \inred{\hat\mu(0,x)}}
\qfor
\hat\mu = \argmin_{m \in\model } \frac{1}{n}\sum_{i=1}^n \text{weight}(D_i,X_i) \ \qty{m(D_i,X_i) - Y_i}^2
\]
Because it fits locally: what happens on the left doesn’t affect the fit on the right.
The least squares estimator at each point is the average of the observations within the same bin.
The weighted least squares estimator takes a weighted average within each bin.
But the weights don’t vary much within bins.
So the weighted average is similar to the unweighted one.
This is the same as coarsening the data — forgetting age differences within bins …
… then fitting a saturated model. But people do keep their (uncoarsened) weights.
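Here’s a minimal sketch of that coarsen-then-saturate recipe in R, using the breakpoints from the formula above. The data frame df with columns age, d, y and a weight column w are hypothetical placeholders.
# coarsen: forget age differences within bins (breakpoints from the model above)
df$bin <- cut(df$age, breaks = c(-Inf, 60, 70, Inf))
# then fit a model that's saturated in (d, bin); people keep their original (uncoarsened) weights w
fit <- lm(y ~ factor(d) * bin, data = df, weights = w)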
The Piecewise-Constant Model’s Sampling Distribution
How Do I Choose The Number of Bins?
Weight or Go Big. Or Both.
Weighting Works With Estimated Treatment Probabilities, Too
Enough About IPW
Inverse probability weighting is a good idea.
It’s easy to do once you work out what the weights should be.
And you can get valid intervals even with wildly misspecified models.
But it has some limitations.
For one thing, it means your fitted curve is estimand-specific.
If you’ve fit a model to estimate the ATT, you’ve got to refit to estimate the ATE.
That’s not great if you’re sharing your fitted models with other people.
So it’s nice to have a model that works ok with or without inverse probability weighting.
Big piecewise-constant models look promising in our example.
Bias goes down as we increase bin count.
And variance doesn’t really go up.
Conceptually, these let us avoid bias in two ways—even if we’re doing unweighted least squares.
If the inverse probability weights are effectively constant within bins, we’re effectively inverse probability weighting.
If the subpopulation mean \(\mu\) is effectively constant within bins, we’re effectively using a saturated model.
Going Big
Piecewise-constant models fit via unweighted least squares.
Why Doesn’t Going Big Increase Variance (Much)?
\[
\widehat{\text{ATT}}
= \sum_x \ingreen{\hat f_1(x)} \qty{\ingreen{\hat\mu(1,x)} - \inred{\hat\mu(0,x)}}
\]
It seems like it should. Our within-bin subsample mean estimates are all over the place.
What’s happening is that this noise is averaging out. Let’s start somewhere simple.
Suppose we pair up all our observations, average the pairs, and average the averages.
We get the same thing as if we just averaged all the observations without pairing.
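In symbols, assuming \(n\) is even so everyone gets a partner:
\[
\frac{1}{n/2} \sum_{j=1}^{n/2} \frac{Y_{2j-1} + Y_{2j}}{2}
= \frac{1}{n} \sum_{j=1}^{n/2} \qty(Y_{2j-1} + Y_{2j})
= \frac{1}{n} \sum_{i=1}^{n} Y_i
\]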
Why Doesn’t Going Big Increase Variance (Much)?
\[
\widehat{\text{ATT}}
= \sum_x \ingreen{\hat f_1(x)} \qty{\ingreen{\hat\mu(1,x)} - \inred{\hat\mu(0,x)}}
\]
That’s essentially what’s going on with the easy part of the ATT formula.
If we have one bin for each level of \(X\), we get the same answer as if we had just one bin overall. \[
\sum_x \ingreen{\hat f_1(x)} \times \ingreen{\hat\mu(1,x)}
= \sum_{x} \frac{\sum_{i:X_i=x, D_i=1} 1}{\sum_{i:D_i=1} 1} \times \frac{\sum_{i:D_i=1, X_i=x} Y_i}{\sum_{i:D_i=1, X_i=x} 1}
= \frac{\sum_{i:D_i=1} Y_i}{\sum_{i:D_i=1} 1}
\]
Why Doesn’t Going Big Increase Variance (Much)?
Now let’s think about the hard part—the mixed-color one.
After coarsening to bins, we’re just using within-bin subsample means as our estimates \(\inred{\hat\mu(0,x)}\).
So what we really have is a linear combination of subsample means.
\[
\begin{aligned}
\sum_x \ingreen{\hat f_1(x)} \times \inred{\hat\mu(0,x)}
&= \sum_{j} \sum_{x \in \text{bin}_j} \ingreen{\hat f_1(x)} \times \inred{\hat\mu(0,\text{bin}_j)} \\
&= \sum_{j} \hat\alpha_j \ \inred{\hat\mu(0,\text{bin}_j)} \qfor \hat\alpha_j = \sum_{x \in \text{bin}_j} \ingreen{\hat f_1(x)}
\end{aligned}
\]
Why Doesn’t Going Big Increase Variance (Much)?
\[
\sum_x \ingreen{\hat f_1(x)} \inred{\hat\mu(0,x)} = \sum_{j} \hat\alpha_j \ \inred{\hat\mu(0,\text{bin}_j)} \qfor \hat\alpha_j = \sum_{x \in \text{bin}_j} \ingreen{\hat f_1(x)}
\]
We can use the variance formula we worked out earlier in the semester.
With a few approximations and some clever arithmetic, we can get something meaningful.
The variance is proportional to the average, over the green folks, of the ratio of our within-bin-averaged histograms.
\[
\begin{aligned}
\Var\qty{\sum_{j} \hat\alpha_j \ \hat\mu(0,{\text{bin}_j})}
\approx \sum_{j} \hat\alpha_j^2 \ \Var\qty{\hat\mu(0, \text{bin}_j)}
\textcolor{blue}{\approx} \frac{\sigma^2}{n \ P(D=0)}\sum_x \ingreen{f_1(x)} \frac{ \ingreen{\sum_{x' \in \text{bin}(x)} f_1(x')} }{ \inred{\sum_{x' \in \text{bin}(x)} f_0(x')}}
\end{aligned}
\]
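To sketch where the blue approximation comes from, write \(n_{0,j}\) for the number of untreated observations in bin \(j\) and assume, as a simplification, a common within-cell outcome variance \(\sigma^2\). Then
\[
\Var\qty{\hat\mu(0,\text{bin}_j)} \approx \frac{\sigma^2}{n_{0,j}},
\qquad
n_{0,j} \approx n \, P(D=0) \inred{\sum_{x' \in \text{bin}_j} f_0(x')},
\qquad
\hat\alpha_j \approx \ingreen{\sum_{x' \in \text{bin}_j} f_1(x')}
\]
Substituting these into \(\sum_{j} \hat\alpha_j^2 \ \Var\qty{\hat\mu(0,\text{bin}_j)}\) and rewriting the sum over bins as a sum over \(x\) gives the expression above.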
A Picture
\[
\small{
\Var\qty{\sum_x \ingreen{\hat f_1(x)} \times \inred{\hat\mu(0,x)}}
\approx \frac{\sigma^2}{n \ P(D=0)}\sum_x \ingreen{f_1(x)} \textcolor{purple}{\qty{\frac{ \ingreen{\sum_{x' \in \text{bin}(x)} f_1(x')} }{ \inred{\sum_{x' \in \text{bin}(x)} f_0(x')}}}}
}
\]
Above, the purple lines show the ratio of smoothed densities.
For a 3-piece model, we get the solid one. The sum in the variance is 1.50.
For a 6-piece model, we get the dashed one. The sum in the variance is 1.53.
For a 30-piece (saturated) model, we get the dotted one. The sum in the variance is 1.54.
The variance improvement we get by using a smaller model is pretty small.
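For the curious, here’s a sketch of how a sum like this can be computed in R. The densities below are made up, since the lecture’s aren’t written out in these notes, so the numbers won’t match the 1.50 / 1.53 / 1.54 above; the point is only the mechanics.
ages <- 41:70                                              # 30 covariate levels -- hypothetical
f1 <- dnorm(ages, mean = 62, sd = 5); f1 <- f1 / sum(f1)   # made-up treated covariate distribution
f0 <- dnorm(ages, mean = 52, sd = 7); f0 <- f0 / sum(f0)   # made-up untreated covariate distribution
variance.sum <- function(n.bins) {
  bin <- cut(ages, breaks = n.bins)   # equal-width bins
  F1  <- ave(f1, bin, FUN = sum)      # within-bin mass of f1, repeated at each x
  F0  <- ave(f0, bin, FUN = sum)      # within-bin mass of f0, repeated at each x
  sum(f1 * F1 / F0)                   # sum over x of f1(x) times (smoothed f1 / smoothed f0)
}
sapply(c(3, 6, 30), variance.sum)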
An Extreme-Case Picture
\[ \small{
\Var\qty{\sum_x \ingreen{\hat f_1(x)} \times \inred{\hat\mu(0,x)}}
\approx \frac{\sigma^2}{n \ P(D=0)}\sum_x \ingreen{f_1(x)} \textcolor{purple}{\qty{\frac{ \ingreen{\sum_{x' \in \text{bin}(x)} f_1(x')} }{ \inred{\sum_{x' \in \text{bin}(x)} f_0(x')}}}}
}
\]
Here’s what we see when our densities are a lot rougher—there’s more to smooth out.
For a 3-piece model, we get the solid one. The sum in the variance is 1.03.
For a 6-piece model, we get the dashed one. The sum in the variance is 1.04.
For a 30-piece (saturated) model, we get the dotted one. The sum in the variance is 3.52.
The variance improvement we get by using a smaller model is a lot bigger.
This isn’t a free lunch. With rough densities like this, ‘automatic inverse probability weighting’ doesn’t really happen.
This means we are at greater risk of bias. It’s likely to be a problem if we aren’t fitting the subpopulation means well.
Implications
To keep variance small, we want our bins to extend far enough from the peaks and valleys in \(\ingreen{f_1}\) and \(\inred{f_0}\).
Far enough that we’ve got a mix of moderate and extreme values in each bin to smooth out the averages.
But if we’re going to avoid bias, we need to fit the subpopulation means well. How do we do that?
A naive approach is to try to find a ‘just right’ bin size.
If there were one, it’d be easy enough to find.
We could use cross-validation to choose from a range of bin sizes.
But there’s no reason to think there is one size that works everywhere.
In this example, we can get away with bigger bins on the left than we can on the right. Why?
And the options get even more complex in 2D.
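To make the naive approach concrete, here’s a sketch of cross-validating over bin counts. The simulated data below is made up for illustration, not the example from these slides.
set.seed(1)
n   <- 500
sim <- data.frame(x = runif(n, 40, 70))
sim$d <- rbinom(n, 1, plogis((sim$x - 55) / 10))
sim$y <- 1 + 0.1 * sim$x + sim$d * pmax(sim$x - 55, 0) / 10 + rnorm(n)
cv.mse <- function(n.bins, folds = 10) {
  sim$bin <- cut(sim$x, breaks = seq(40, 70, length.out = n.bins + 1), include.lowest = TRUE)
  fold <- sample(rep(1:folds, length.out = n))
  errs <- sapply(1:folds, function(k) {
    fit  <- lm(y ~ factor(d) * bin, data = sim[fold != k, ])
    pred <- predict(fit, newdata = sim[fold == k, ])
    mean((sim$y[fold == k] - pred)^2, na.rm = TRUE)  # na.rm guards against bins empty in the training folds
  })
  mean(errs)
}
sapply(c(3, 6, 10, 15), cv.mse)  # cross-validated mean squared error for each bin count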
A 2D Example
Suppose we’re interested in the health impacts of a chemical leak.
What we’re looking at on the left is a map of the chemical concentration in the air.
Lighter colors on the left mean higher concentrations.
Those are our subpopulation means. But we can’t measure concentration everywhere all the time.
We’ll measure it at a set of randomly chosen locations—the density of those locations is shown on the right.
A 2D Example
Our sample is this set of measurements \((Z_i,X_i,Y_i)\).
\(Z_i\) = concentration at measurement location \(i\).
\(X_i\) = longitude at location \(i\).
\(Y_i\) = latitude at location \(i\).
A 2D Example
We’ll take them at random times, too. In the population we’re sampling from …
There are many observations \((z_j,x_j,y_j)\) at the same location \((x_j,y_j)\).
We’ll sample with replacement. The probability a sample is at a given location is shown on the right.
A 2D Example
There are 2000 observations in our sample.
We want to estimate \(\mu(x,y)=\E[Z_i \mid X_i=x, Y_i=y]\) .
What we see on the right are the predictions \(\hat\mu(x,y)\) of the saturated model.
i.e., the mean of the observations at each location.
Fitting A Saturated Model
Here are the predictions of the saturated model.
i.e., the means of the observed concentrations at each location.
white = zero observations at that location.
It’s ok near the center but bad elsewhere, where we sample less.
Fitting A Plane Doesn’t Work At All
There’s no overall ‘slant’ to the concentration surface, so the best we can do is fit a plane ‘parallel to the page’, i.e., one that predicts constant brightness.
That, if we ignore some plotting artifacts I can’t presently explain, is what we see on the right.
Piecewise-Constant Models
Small piecewise-constant models work better.
But they’re still not resolving detail.
Piecewise-Constant Models
Larger ones resolve detail better, but they’re huge.
On the left, we have \(16 \times 16 = 256\) bins; on the right, \(32 \times 32 = 1024\).
We only have 2000 observations.
Anisotropic Models
We can try using rectangles, rather than squares, as bins.
Both of these have \(32 \times 8 = 256\) bins.
But, fundamentally, there’s no right rectangle-shape either.
We want smaller bins near the center of the data, where we have more observations and more variation in the outcomes.
Anisotropic Models
There’s no reason, statistically speaking, that our bins should be …
all the same shape
all contain the same number of locations
or even be made up of spatially contiguous locations
Any set of pixels can be a bin.
But we need some way to search for good partitions of the pixels into bins.
Trying all possible partitions is computationally infeasible.
And it runs into a major statistical issue we haven’t had time to address satisfactorily—overfitting.
To make things manageable computationally and statistically…
we need to make some restrictions on the set of possible partitions.
People have, for the most part, settled on this: bins are axis-aligned rectangles.
Partitioning as Tree-Search
We can describe any partition like this as the result of a tree of splits, each based on a single covariate.
This aligns the problem naturally with tree-search algorithms from the AI/Optimization tradition in CS.
All we need is a way to evaluate the quality of a partition. For that, we have…
Heuristics, e.g. no small bins.
Cross-validation, e.g. minimize the mean squared error of predictions. This is standard.
Fancier cross-validation, e.g. to minimize error in estimating the CATE.
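To make ‘partitioning as tree-search’ concrete, here’s a minimal sketch using an off-the-shelf regression tree from the rpart package, which greedily searches over axis-aligned splits and prunes using cross-validated squared error. It borrows the sampled.leak names from the code a few slides later, assuming columns x, y, z; it’s an illustration of the idea, not the procedure used for the figures here.
library(rpart)
leak.df <- data.frame(x = sampled.leak$x, y = sampled.leak$y, z = sampled.leak$z)
tree <- rpart(z ~ x + y, data = leak.df, method = "anova", cp = 0.001)  # greedy axis-aligned splits
best.cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]      # tree size with the smallest CV error
tree <- prune(tree, cp = best.cp)
leak.df$z.hat <- predict(tree, newdata = leak.df)  # each leaf is a rectangular bin; predictions are within-bin means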
Fancy Trees
The fancy stuff is newer, but it’s easy to use and getting popular.
It’s implemented in the Generalized Random Forest (grf) package in R.
GRF includes a search procedure meant for estimating the CATE.
And many generalizations, e.g. for Survival Analysis.
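For what that CATE-targeted search looks like in code, here’s a minimal sketch with grf. The covariate matrix X, binary treatment D, and outcome Y are hypothetical placeholders rather than objects from these slides.
library(grf)
# X: covariate matrix, D: binary treatment, Y: outcome -- hypothetical placeholders
forest  <- causal_forest(X = X, Y = Y, W = D)                 # splits chosen to target treatment-effect heterogeneity
tau.hat <- predict(forest)$predictions                        # CATE estimates at the training points
average_treatment_effect(forest, target.sample = "treated")   # the ATT, with a standard error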
Back to the Chemical Leak
library(grf)
# fit a regression forest to the sampled measurements
model = regression_forest(X = cbind(sampled.leak$x, sampled.leak$y), Y = sampled.leak$z)
# predict at every location on the full map
im.hat = leak
im.hat$z = predict(model, newdata = cbind(im.hat$x, im.hat$y))$predictions
plot.image(leak)
plot.image(im.hat)
Here’s what we get with GRF’s default settings.
Left is the population means, right is the predictions from GRF. Not bad!
We’re not using any special features—just cross-validation on squared error.
And, For What It’s Worth, A Bird