QTM 220: Lecture 20

Regression Trees

Why Piecewise-Constant Models?

Goal: Estimate the ATT in this Population

\[ \text{ATT} = \sum_x \ingreen{f_1(x)} \cdot \qty{\ingreen{\mu(1,x)} - \inred{\mu(0,x)}} \]

We’ll Use This Sample

\[ \widehat{\text{ATT}} = \sum_x \ingreen{\hat f_1(x)} \cdot \qty{\ingreen{\hat\mu(1,x)} - \inred{\hat\mu(0,x)}} \]

Options

\[ \widehat{\text{ATT}} = \sum_x \ingreen{\hat f_1(x)} \cdot \qty{\ingreen{\hat\mu(1,x)} - \inred{\hat\mu(0,x)}} \qfor \hat\mu = \argmin_{m \in\model } \frac{1}{n}\sum_{i=1}^n \text{weight}(D_i,X_i) \ \qty{m(D_i,X_i) - Y_i}^2 \]

The Saturated Model

\[ \widehat{\text{ATT}} = \sum_x \ingreen{\hat f_1(x)} \cdot \qty{\ingreen{\hat\mu(1,x)} - \inred{\hat\mu(0,x)}} \qfor \hat\mu = \argmin_{m \in\model } \frac{1}{n}\sum_{i=1}^n \text{weight}(D_i,X_i) \ \qty{m(D_i,X_i) - Y_i}^2 \]

  • It doesn’t matter whether we use weights for this one. Why not?
  • The model lets us make any prediction at any level \((d,x)\) of the treatment and covariate.
  • So we’ll make the best prediction at each level in terms of our error criterion.
  • The least squares thing to do is to use the sample means. \[ \begin{aligned} \hat \mu(d,x) &= \argmin_{m \in \R} \sum_{i:D_i=d, X_i=x}^n \qty{m - Y_i}^2 = \frac{\sum_{i:D_i=d, X_i=x}^n Y_i}{\sum_{i:D_i=d, X_i=x}^n 1} \end{aligned} \]
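
A minimal R sketch of that claim, assuming a hypothetical data frame df with columns D (0/1 treatment), X (a discrete covariate), and Y (the outcome):

# Hypothetical data frame `df` with columns D (0/1), X (discrete), and Y.
mu.hat <- aggregate(Y ~ D + X, data = df, FUN = mean)   # cell-by-cell sample means
# The same fit via least squares on fully interacted dummies:
fit <- lm(Y ~ interaction(D, X) - 1, data = df)         # coef(fit) reproduces the cell means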

The Saturated Model

  • Why is the weighted least squares thing to do the same? \[ \begin{aligned} \hat \mu(d,x) &= \argmin_{m \in \R} \sum_{i:D_i=d, X_i=x}^n w(D_i,X_i) \qty{m - Y_i}^2 \overset{?}{=} \frac{\sum_{i:D_i=d, X_i=x}^n Y_i}{\sum_{i:D_i=d, X_i=x}^n 1} \end{aligned} \]
  • The weights are constant within each subsample—\(w(D_i,X_i) = w(d,x)\).
  • So what we’re minimizing to get \(\hat\mu(d,x)\) is unweighted squared error times a constant.
  • That constant factor doesn’t change the solution.
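
Spelling that out: within the \((d,x)\) cell, \(w(D_i,X_i) = w(d,x)\) is a constant, so it factors out of the objective and leaves the argmin unchanged.

\[ \argmin_{m \in \R} \sum_{i:D_i=d, X_i=x}^n w(D_i,X_i)\qty{m - Y_i}^2 = \argmin_{m \in \R} w(d,x) \sum_{i:D_i=d, X_i=x}^n \qty{m - Y_i}^2 = \argmin_{m \in \R} \sum_{i:D_i=d, X_i=x}^n \qty{m - Y_i}^2 \]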

The Saturated Model’s Sampling Distribution

[Figure panels: Unweighted; Weighted; Unweighted, 4x sample size; Weighted, 4x sample size.]

The Interacted Bivariate Linear Model

\[ \widehat{\text{ATT}} = \sum_x \ingreen{\hat f_1(x)} \cdot \qty{\ingreen{\hat\mu(1,x)} - \inred{\hat\mu(0,x)}} \qfor \hat\mu = \argmin_{m \in\model } \frac{1}{n}\sum_{i=1}^n \text{weight}(D_i,X_i) \ \qty{m(D_i,X_i) - Y_i}^2 \]

  • What’s bad about fitting lines without weights?
  • If we don’t use weights, we’ll fit the data in the wrong place.
    • i.e., where there are a lot of red folks and few green folks.
    • which is not what’s relevant for estimating the ATT.
  • Inverse probability weighting fixes this by telling the line what matters and what doesn’t.

The Interacted Model’s Sampling Distribution

[Figure panels: Unweighted; Weighted; Unweighted, 4x sample size; Weighted, 4x sample size.]

The Additive Bivariate Linear Model

\[ \widehat{\text{ATT}} = \sum_x \ingreen{\hat f_1(x)} \cdot \qty{\ingreen{\hat\mu(1,x)} - \inred{\hat\mu(0,x)}} \qfor \hat\mu = \argmin_{m \in\model } \frac{1}{n}\sum_{i=1}^n \text{weight}(D_i,X_i) \ \qty{m(D_i,X_i) - Y_i}^2 \]

  • What’s worse about fitting parallel lines without weights?
  • We may not be able to fit the data at all. And our slope will be some mix of …
    • the slope of the red folks on the left, where they are.
    • the slope of the green folks on the right, where they are.
  • The distance between these lines is our estimate of the ATT.
  • And that gap isn’t a comparison of green folks on the right to red folks on the right, which is what the ATT calls for.
  • Inverse probability weighting fixes this by, in effect, ignoring the irrelevant stuff on the left.
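
As a sketch of the two fits being contrasted here (the data frame df, its 0/1 treatment column D, and the ATT weight column w are hypothetical names):

# Hypothetical df with columns D (0/1 treatment), X, Y, and ATT weights w.
additive   <- lm(Y ~ D + X, data = df, weights = w)   # parallel lines
interacted <- lm(Y ~ D * X, data = df, weights = w)   # separate intercepts and slopes
coef(additive)["D"]   # the constant gap between the parallel lines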

The Additive Model’s Sampling Distribution

[Figure panels: Unweighted; Weighted; Unweighted, 4x sample size; Weighted, 4x sample size.]

Piecewise-Constant Models

\[ \widehat{\text{ATT}} = \sum_x \ingreen{\hat f_1(x)} \cdot \qty{\ingreen{\hat\mu(1,x)} - \inred{\hat\mu(0,x)}} \qfor \hat\mu = \argmin_{m \in\model } \frac{1}{n}\sum_{i=1}^n \textcolor{purple}{\text{weight}(D_i,X_i)} \ \qty{m(D_i,X_i) - Y_i}^2 \]

  • Here I’ve fit an interacted piecewise-constant model. This is less susceptible to this problem. \[ \hat\mu(d,x) = \hat\beta_{0,d} + \hat\beta_{1,d} \cdot \mathbf{1}\qty{x > 60} + \hat\beta_{2,d} \cdot \mathbf{1}\qty{x > 70} \]

  • That’s sort of like what happens with the saturated model. Why?

  • Hint. Look at the purple line, which shows the weights for untreated folks.

Piecewise-Constant Models

\[ \widehat{\text{ATT}} = \sum_x \ingreen{\hat f_1(x)} \cdot \qty{\ingreen{\hat\mu(1,x)} - \inred{\hat\mu(0,x)}} \qfor \hat\mu = \argmin_{m \in\model } \frac{1}{n}\sum_{i=1}^n \text{weight}(D_i,X_i) \ \qty{m(D_i,X_i) - Y_i}^2 \]

  • Because it fits locally—what happens on the left doesn’t affect the fit on the right.
  • The least squares estimator at each point is the average of the samples within the same bin.
  • The weighted least squares takes a weighted average within each bin.
    • But the weights don’t vary much within bins.
    • So the weighted average is similar to the unweighted one.
  • This is the same as coarsening the data — forgetting age differences within bins …
  • … then fitting a saturated model. But people do keep their (uncoarsened) weights.
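
A minimal sketch of this coarsen-then-saturate view, with hypothetical column names and the breakpoints from the previous slide:

# Hypothetical df with columns D, X (age), and Y; breakpoints match the previous slide.
df$X.bin <- cut(df$X, breaks = c(-Inf, 60, 70, Inf))
mu.hat <- aggregate(Y ~ D + X.bin, data = df, FUN = mean)   # within-bin means, by treatment arm
# The weighted version uses weighted means within bins instead, e.g.
# lm(Y ~ interaction(D, X.bin) - 1, data = df, weights = w) for hypothetical weights w.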

The Piecewise-Constant Model’s Sampling Distribution

[Figure panels: Unweighted; Weighted; Unweighted, 4x sample size; Weighted, 4x sample size.]

How Do I Choose The Number of Bins?

[Figure panels: Two Bins; Four Bins; Eight Bins; Sixteen Bins.]

Weight or Go Big. Or Both.

[Figure panels: Unweighted, 2 bins; Weighted, 2 bins; Unweighted, 16 bins; Weighted, 16 bins.]

Weighting Works With Estimated Treatment Probabilities, Too

[Figure panels: Unweighted, 2 bins; Unweighted, 16 bins; Weighted (logistic regression-estimated weights), 2 bins; Weighted (logistic regression-estimated weights), 16 bins.]
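
Roughly, the recipe behind these panels, sketched with hypothetical names and assuming the usual ATT-style weights (treated folks get weight 1, untreated folks get \(\hat\pi(x)/(1-\hat\pi(x))\), where \(\hat\pi\) is the estimated treatment probability):

# Hypothetical df with columns D (0/1), X, and Y.
pi.hat <- predict(glm(D ~ X, family = binomial, data = df), type = "response")
df$w   <- ifelse(df$D == 1, 1, pi.hat / (1 - pi.hat))   # assumed ATT-style weights
df$X.bin <- cut(df$X, breaks = 16)                      # 16 equal-width bins
fit <- lm(Y ~ interaction(D, X.bin) - 1, data = df, weights = w)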

Enough About IPW

  • Inverse probability weighting is a good idea.
    • It’s easy to do once you work out what the weights should be.
    • And you can get valid intervals even with wildly misspecified models.
  • But it has some limitations.
    • For one thing, it means your fitted curve is estimand-specific.
    • If you’ve fit a model to estimate the ATT, you’ve got to refit to estimate the ATE.
    • That’s not great if you’re sharing your fitted models with other people.
  • So it’s nice to have a model that works ok with or without inverse probability weighting.
  • Big piecewise-constant models look promising in our example.
    • Bias goes down as we increase bin count.
    • And variance doesn’t really go up.
  • Conceptually, these let us avoid bias in two ways—even if we’re doing unweighted least squares.
    • If the inverse probability weights are effectively constant within bins, we’re effectively inverse probability weighting.
    • If the subpopulation mean \(\mu\) is effectively constant within bins, we’re effectively using a saturated model.

Going Big

Piecewise-constant models fit via unweighted least squares.

Why Doesn’t Going Big Increase Variance (Much)?

\[ \widehat{\text{ATT}} = \sum_x \ingreen{\hat f_1(x)} \qty{\ingreen{\hat\mu(1,x)} - \inred{\hat\mu(0,x)}} \]

  • It seems like it should. Our within-bin subsample mean estimates are all over the place.
  • What’s happening is that this noise is averaging out. Let’s start somewhere simple.
    • Suppose we pair up all our observations, average the pairs, and average the averages.
    • We get the same thing as if we just averaged all the observations without pairing.
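
A quick numerical check of the pairing claim, on a toy sample of even size:

set.seed(1)
y <- rnorm(10)                                 # toy sample of even size
pair.means <- colMeans(matrix(y, nrow = 2))    # average within each pair
all.equal(mean(pair.means), mean(y))           # TRUE: averaging the pair averages = averaging everything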

Why Doesn’t Going Big Increase Variance (Much)?

\[ \widehat{\text{ATT}} = \sum_x \ingreen{\hat f_1(x)} \qty{\ingreen{\hat\mu(1,x)} - \inred{\hat\mu(0,x)}} \]

  • That’s essentially what’s going on with the easy part of the ATT formula.
  • If we have a separate bin for each level of \(X\), we get the same thing as if we’d used a single bin outright. \[ \sum_x \ingreen{\hat f_1(x)} \times \ingreen{\hat\mu(1,x)} = \sum_{x} \frac{\sum_{i:X_i=x, D_i=1} 1}{\sum_{i:D_i=1} 1} \times \frac{\sum_{i:D_i=1, X_i=x} Y_i}{\sum_{i:D_i=1, X_i=x} 1} = \frac{\sum_{i:D_i=1} Y_i}{\sum_{i:D_i=1} 1} \]
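
A numerical version of the same identity, assuming a hypothetical data frame df with columns D, X, and Y:

# Hypothetical df with columns D, X, and Y.
treated <- df[df$D == 1, ]
f1.hat  <- table(treated$X) / nrow(treated)        # \hat f_1(x)
mu1.hat <- tapply(treated$Y, treated$X, mean)      # \hat\mu(1, x)
all.equal(sum(f1.hat * mu1.hat), mean(treated$Y))  # TRUE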

Why Doesn’t Going Big Increase Variance (Much)?

  • Now let’s think about the hard part—the mixed-color one.
  • After coarsening to bins, we’re just using within-bin subsample means as our estimates \(\inred{\hat\mu(0,x)}\).
  • So what we really have is a linear combination of subsample means.

\[ \begin{aligned} \sum_x \ingreen{\hat f_1(x)} \times \inred{\hat\mu(0,x)} &= \sum_{j} \sum_{x \in \text{bin}_j} \ingreen{\hat f_1(x)} \times \inred{\hat\mu(0,\text{bin}_j)} \\ &= \sum_{j} \hat\alpha_j \ \inred{\hat\mu(0,\text{bin}_j)} \qfor \hat\alpha_j = \sum_{x \in \text{bin}_j} \ingreen{\hat f_1(x)} \end{aligned} \]

Why Doesn’t Going Big Increase Variance (Much)?

\[ \sum_x \ingreen{\hat f_1(x)} \inred{\hat\mu(0,x)} = \sum_{j} \hat\alpha_j \ \inred{\hat\mu(0,\text{bin}_j)} \qfor \hat\alpha_j = \sum_{x \in \text{bin}_j} \ingreen{\hat f_1(x)} \]

  • We can use the variance formula we worked out earlier in the semester.
    • With a few approximations and some clever arithmetic, we can get something meaningful.
    • The variance is proportional to the average, over green folks, of the ratio of our within-bin-smoothed histograms.

\[ \begin{aligned} \Var\qty{\sum_{j} \hat\alpha_j \ \hat\mu(0,{\text{bin}_j})} \approx \sum_{j} \hat\alpha_j^2 \ \Var\qty{\hat\mu(0, \text{bin}_j)} \textcolor{blue}{\approx} \frac{\sigma^2}{n \ P(D=0)}\sum_x \ingreen{f_1(x)} \frac{ \ingreen{\sum_{x' \in \text{bin}(x)} f_1(x')} }{ \inred{\sum_{x' \in \text{bin}(x)} f_0(x')}} \end{aligned} \]
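
Here's a sketch of how you might compute that last sum; the vectors x, f1, f0 and the bin count n.bins are hypothetical inputs.

# Hypothetical inputs: a grid x, probability mass functions f1 and f0 on it, and a bin count n.bins.
bin <- cut(x, breaks = seq(min(x), max(x), length.out = n.bins + 1), include.lowest = TRUE)
f1.smooth <- ave(f1, bin, FUN = sum)    # sum of f1 over the bin containing each x
f0.smooth <- ave(f0, bin, FUN = sum)    # sum of f0 over the bin containing each x
sum(f1 * f1.smooth / f0.smooth)         # the sum in the variance approximation above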

A Picture

\[ \small{ \Var\qty{\sum_x \ingreen{\hat f_1(x)} \times \inred{\hat\mu(0,x)}} \approx \frac{\sigma^2}{n \ P(D=0)}\sum_x \ingreen{f_1(x)} \textcolor{purple}{\qty{\frac{ \ingreen{\sum_{x' \in \text{bin}(x)} f_1(x')} }{ \inred{\sum_{x' \in \text{bin}(x)} f_0(x')}}}} } \]

  • Above, the purple lines show the ratio of smoothed densities.
    • For a 3-piece model, we get the solid one. The sum in the variance is 1.50.
    • For a 6-piece model, we get the dashed one. The sum in the variance is 1.53.
    • For a 30-piece (saturated) model, we get the dotted one. The sum in the variance is 1.54.
  • The variance improvement we get by using a smaller model is pretty small.

An Extreme-Case Picture

\[ \small{ \Var\qty{\sum_x \ingreen{\hat f_1(x)} \times \inred{\hat\mu(0,x)}} \approx \frac{\sigma^2}{n \ P(D=0)}\sum_x \ingreen{f_1(x)} \textcolor{purple}{\qty{\frac{ \ingreen{\sum_{x' \in \text{bin}(x)} f_1(x')} }{ \inred{\sum_{x' \in \text{bin}(x)} f_0(x')}}}} } \]

  • Here’s what we see when our densities are a lot rougher—there’s more to smooth out.
    • For a 3-piece model, we get the solid one. The sum in the variance is 1.03.
    • For a 6-piece model, we get the dashed one. The sum in the variance is 1.04.
    • For a 30-piece (saturated) model, we get the dotted one. The sum in the variance is 3.52.
  • The variance improvement we get by using a smaller model is a lot bigger.
  • This isn’t a free lunch. With rough densities like this, ‘automatic inverse probability weighting’ doesn’t really happen.
  • This means we are at greater risk of bias. It’s likely to be a problem if we aren’t fitting the subpopulation means well.

Variance Formula Derivation

\[ \small{ \begin{aligned} \Var\qty{\sum_{j} \hat\alpha_j \ \hat\mu(0,{\text{bin}_j})} &\approx \sum_{j} \hat\alpha_j^2 \ \Var\qty{\hat\mu(0, \text{bin}_j)} && \qqtext{with} \\ \hat\alpha_j^2 &\approx \qty{\sum_{x \in \text{bin}_j}\ingreen{ f_1(x) }}^2 && \qqtext{and} \\ \Var\qty{\hat\mu(0, \text{bin}_j)} &\approx \frac{\sigma^2}{N_{D=0, X\in\text{bin}_j}} \approx \frac{\sigma^2}{n \ P(D=0) \ P(X \in \text{bin}_{j} \mid D=0)} = \frac{\sigma^2}{n \ P(D=0)} \frac{1}{\sum_{x \in \text{bin}_j} \inred{f_0(x)}} && \qqtext{so} \ldots \end{aligned} } \]

\[ \small{ \begin{aligned} \Var\qty{\sum_{j} \hat\alpha_j \ \hat\mu(0,{\text{bin}_j})} &\approx \frac{\sigma^2}{n \ P(D=0)} \sum_{j} \frac{\qty{\sum_{x \in \text{bin}_j}\ingreen{ f_1(x) }}^2}{\sum_{x \in \text{bin}_j} \inred{f_0(x)}} && \qqtext{substituting approximations} \\ &= \frac{\sigma^2}{n \ P(D=0)} \sum_{j} \sum_{x' \in \text{bin}_j} \ingreen{f_1(x')} \frac{\sum_{x \in \text{bin}_j}\ingreen{ f_1(x) }}{\sum_{x \in \text{bin}_j} \inred{f_0(x)}} &&\qqtext{since} \qty{\sum_x a_x}^2 = \sum_{x'} a_{x'} \times \sum_{x} a_{x} \\ &= \frac{\sigma^2}{n \ P(D=0)} \sum_{j} \sum_{x'} \ingreen{f_1(x')} 1(x' \in \text{bin}_j) \frac{\sum_{x \in \text{bin}_j}\ingreen{ f_1(x) }}{\sum_{x \in \text{bin}_j} \inred{f_0(x)}} && \qqtext{since } \sum_{x' \in S} a_{x'} = \sum_{x'} a_{x'} 1(x' \in S) \\ \\ &= \frac{\sigma^2}{n \ P(D=0)} \sum_{x'} \ingreen{f_1(x')} \sum_j 1(x' \in \text{bin}_j) \frac{\sum_{x \in \text{bin}_j}\ingreen{ f_1(x) }}{\sum_{x \in \text{bin}_j} \inred{f_0(x)}} && \qqtext{pulling constants out of the inner sum} \\ \\ &= \frac{\sigma^2}{n \ P(D=0)} \sum_{x'} \ingreen{f_1(x')} \frac{\sum_{x \in \text{bin}(x')}\ingreen{ f_1(x) }}{\sum_{x \in \text{bin}(x')} \inred{f_0(x)}} && \qqtext{there's one inner-sum term where the indicator is nonzero} \end{aligned} } \]

Implications

  • To keep variance small, we want our bins to extend far enough from the peaks and valleys in \(\ingreen{f_1}\) and \(\inred{f_0}\).
  • Far enough that we’ve got a mix of moderate and extreme values in each bin to smooth out the averages.
  • But if we’re going to avoid bias, we need to fit the subpopulation means well. How do we do that?
  • A naive approach is to try to find a ‘just right’ bin size.
    • If there were one, it’d be easy enough to find.
    • We could use cross-validation to choose from a range of bin sizes, as in the sketch after this list.
  • But there’s no reason to think there is one size that works everywhere.
  • In this example, we can get away with bigger bins on the left than we can on the right. Why?
  • And the options get even more complex in 2D.
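
Here's the cross-validation sketch referenced above, fitting a piecewise-constant curve of \(Y\) on \(X\) for a single treatment arm (or ignoring \(D\)) for simplicity; the column names are hypothetical.

# k-fold CV of out-of-sample squared error for a piecewise-constant fit with n.bins bins.
cv.mse <- function(n.bins, df, k = 5) {
  folds  <- sample(rep(1:k, length.out = nrow(df)))
  breaks <- seq(min(df$X), max(df$X), length.out = n.bins + 1)
  errs <- sapply(1:k, function(fold) {
    train <- df[folds != fold, ]
    test  <- df[folds == fold, ]
    means <- tapply(train$Y, cut(train$X, breaks, include.lowest = TRUE), mean)
    pred  <- means[as.character(cut(test$X, breaks, include.lowest = TRUE))]
    mean((test$Y - pred)^2, na.rm = TRUE)   # na.rm drops bins that were empty in the training fold
  })
  mean(errs)
}
sapply(c(2, 4, 8, 16), cv.mse, df = df)     # compare candidate bin counts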

A 2D Example

  • Suppose we’re interested in the health impacts of a chemical leak.
    • What we’re looking at on the left is a map of the chemical concentration in the air.
    • Lighter colors on the left mean higher concentrations.
  • Those are our subpopulation means. But we can’t measure concentration everywhere all the time.
  • We’ll measure it at a set of randomly chosen locations—the sampling density is shown on the right.

A 2D Example

  • Our sample is this set of measurements \((Z_i,X_i,Y_i)\).
    • \(Z_i\) = concentration at measurement location \(i\).
    • \(X_i\) = longitude of location \(i\).
    • \(Y_i\) = latitude of location \(i\).

A 2D Example

  • We’ll take them at random times, too. In the population we’re sampling from …
    • There are many observations \((z_j,x_j,y_j)\) at the same location \((x_j,y_j)\).
    • We’ll sample with replacement. The probability a sample is at a given location is shown on the right.

A 2D Example

  • There are 2000 observations in our sample.
  • We want to estimate \(\mu(x,y)=\E[Z_i \mid X_i=x, Y_i=y]\).
  • What we see on the right are the predictions \(\hat\mu(x,y)\) of the saturated model.
  • i.e., the mean of the observations at each location.

Fitting A Saturated Model

  • Here are the predictions of the saturated model.
    • i.e., the means of the observed concentrations at each location.
    • white = zero observations at that location.
  • It’s ok near the center but bad elsewhere, where we sample less.

Fitting A Plane Doesn’t Work At All

  • There’s no ‘slant’, so the best thing we can do is fit a plane ‘parallel to the page’, i.e., constant brightness.
  • That, if we ignore some plotting artifacts I can’t presently explain, is what we see on the right.

Piecewise-Constant Models

  • Small piecewise-constant models work better.
  • But they’re still not resolving detail.

Larger Piecewise-Constant Models

  • Larger ones resolve detail better, but they’re huge.
  • On the left, we have \(16 \times 16 = 256\) bins; on the right, \(32 \times 32 = 1024\).
  • We only have 2000 observations.

Anisotropic Models

  • We can try using rectangles, rather than squares, as bins.
  • Both of these have \(32 \times 8 = 256\) bins.
  • But, fundamentally, there’s no right rectangle-shape either.
  • We want smaller bins near the center of the data, where we have more observations and more variation in the outcomes.

Anisotropic Models

  • There’s no reason, statistically speaking, that our bins should be …
    • all the same shape
    • all the same size, with the same number of locations in each
    • made up of spatially contiguous locations

Any set of pixels can be a bin

  • But we need some way to search for good partitions of the pixels into bins.
  • Trying all possible partitions is computationally infeasible.
  • And it runs into a major statistical issue we haven’t had time to address satisfactorily—overfitting.
  • To make things manageable computationally and statistically…
  • we need to make some restrictions on the set of possible partitions.
  • People have, for the most part, settled on this: bins are axis-aligned rectangles.

Fancy Trees

  • The fancy stuff is newer, but it’s easy to use and getting popular.
  • It’s implemented in the Generalized Random Forest package in R.
  • GRF includes a search procedure meant for estimating the CATE.
  • And many generalizations, e.g. for Survival Analysis.
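
For instance, here's a minimal sketch of the CATE estimation mentioned above, using grf's causal_forest; the input names are hypothetical.

library(grf)
# Hypothetical inputs: a numeric covariate matrix, an outcome vector, and a 0/1 treatment vector.
cf      <- causal_forest(X = covariates, Y = outcome, W = treatment)
tau.hat <- predict(cf)$predictions                        # estimated CATEs at the training points
average_treatment_effect(cf, target.sample = "treated")   # an ATT-style summary with a standard error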

Back to the Chemical Leak

library(grf)
# Fit a regression forest to the sampled concentrations, using (x, y) location as the covariates.
model = regression_forest(X=cbind(sampled.leak$x, sampled.leak$y), Y=sampled.leak$z)
# Predict at every location on the map and store the predictions in a copy of the image.
im.hat = leak; im.hat$z=predict(model, newdata=cbind(im.hat$x, im.hat$y))$predictions
plot.image(leak)    # left: population means
plot.image(im.hat)  # right: forest predictions

  • Here’s what we get with GRF’s default settings.
    • Left is the population means, right is the predictions from GRF. Not bad!
    • We’re not using any special features—just cross-validation on squared error.

And, For What It’s Worth, A Bird

Left: the original image, roughly 5000 pixels. Right: an estimate based on 5000 samples.