Sep 1, 2012

Terence’s Stuff: Multiple Linear Regression, part 2

I really like multiple linear regression (MLR), even though I think that it must be the most widely misused of all statistical methods. There are so many different reasons why we might use it, and there are so many variations on linear least squares, I feel that MLR can be seen as a microcosm of statistics as a whole. At a conference recently I heard a speaker discuss MLRs with 15–20 variables. He spoke of model complexity, of functional forms, of whether or not variables should be selected, and he discussed model (in)stability and resampling techniques for diagnosing and improving models. All without stating a reason for doing MLR!

Why do we run MLRs? Let me reel off a few possible responses before commenting on why I think asking “why” matters. To summarize. To predict. To estimate a parameter. To attempt a causal analysis. To find a model. I hope it is clear that these are different reasons.

If you concede this, then perhaps you will agree that going through the same moves with a data set (y, X) to produce the familiar estimates

              $\hat{\beta} = (X^{\prime} X)^{-1} X^{\prime} y$     and     $\widehat{\mathrm{var}}(\hat{\beta}) = (X^{\prime} X)^{-1} \hat{\sigma}^2$,

and doing all the standard regression diagnostics (the “core” approach) is unlikely to be the right thing in any of these cases. Sharpening the question is just as necessary when considering regression as it is with any other statistical analysis. At the end we will want to assess how well we have answered our question, and in doing so, we’ll go far beyond the standard formulae, in different ways with different questions.
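For readers who want to see the "core" spelled out, here is a minimal sketch of the two formulas above in NumPy. The function name `ols_core`, the simulated data, and the choice of $\hat{\sigma}^2$ as the residual sum of squares divided by n − p are assumptions made for illustration; the column does not specify them.

```python
import numpy as np

def ols_core(X, y):
    """Least-squares estimates from the two textbook formulas.

    Assumes X has full column rank and uses the usual unbiased
    estimate sigma2_hat = RSS / (n - p); that choice is an assumption.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)       # (X'X)^{-1}
    beta_hat = XtX_inv @ X.T @ y           # (X'X)^{-1} X'y
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / (n - p)   # residual variance estimate
    cov_hat = XtX_inv * sigma2_hat         # estimated var(beta_hat)
    return beta_hat, cov_hat, sigma2_hat

# Simulated data, purely illustrative.
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

beta_hat, cov_hat, sigma2_hat = ols_core(X, y)
print(beta_hat)
print(np.sqrt(np.diag(cov_hat)))  # standard errors
```

The explicit inverse is used only to mirror the formulas; in practice a solver such as `np.linalg.lstsq` is numerically preferable. Nothing in this computation knows which of the questions above is being asked, which is the point.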

Think of the world of difference between using a regression model for prediction and using one for estimating a parameter with a causal interpretation, for example, the effect of class size on school children’s test scores. With prediction, we don’t need our relationship to be causal, but we do need to be concerned with the relation between our training and our test set. If we have reason to think that our future test set may differ from our past training set in unknown ways, nothing, including cross-validation, will save us. When estimating the causal parameter, we do need to ask whether the children were randomly assigned to classes of different sizes, and if not, we need to find a way to deal with possible selection bias. If we have not measured suitable covariates on our children, we may not be able to adjust for any bias.
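The selection-bias point can be illustrated with a small simulation. Everything here is hypothetical: the variable `resources` stands in for an unmeasured characteristic that influences both class size and test scores, and the effect sizes are invented. It is a sketch of the phenomenon, not an analysis of any real data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
true_effect = -0.5  # hypothetical causal effect of one extra pupil on the score

# Hypothetical confounder: better-resourced schools have smaller classes
# and, independently of class size, higher scores.
resources = rng.normal(size=n)

# Observational setting: class size depends on resources.
class_size_obs = 25 - 3 * resources + rng.normal(scale=2, size=n)
# Randomized setting: class size is assigned independently of resources.
class_size_rand = 25 + rng.normal(scale=3.6, size=n)

def scores(class_size):
    """Simulated test scores under the assumed data-generating model."""
    return 70 + true_effect * class_size + 4 * resources + rng.normal(scale=5, size=n)

def slope(y, *cols):
    """Coefficient on the first covariate in a least-squares fit with intercept."""
    X = np.column_stack([np.ones(len(y)), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

y_obs = scores(class_size_obs)
y_rand = scores(class_size_rand)

naive = slope(y_obs, class_size_obs)                # confounded estimate
adjusted = slope(y_obs, class_size_obs, resources)  # adjusts for the covariate
randomized = slope(y_rand, class_size_rand)         # random assignment

print(naive, adjusted, randomized)
```

Under these assumptions the unadjusted observational slope overstates the effect, while either adjusting for the covariate or randomizing recovers something close to `true_effect`. If `resources` had not been measured, the adjusted fit would simply not be available.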

What’s my point here? I would like to see multiple regression taught as a series of case studies, each study addressing a sharp question, and focussing on those aspects of the topic that are relevant to that question. Instead, what happens all too often is that writers and instructors distil all uses of multiple linear regression down to the “core” mentioned above, and students come away not having seen the fascinating and important interplay between question, context, data and answer. It’s a “baby and bath-water” problem.

Who does it to my liking? I mentioned Mosteller & Tukey in my last piece on this topic, and once again I’m happy to say that they do a fine job on the different questions that lead us to MLR, with their own colorful terminology, e.g. regression to “set aside the effect of” a variable, to get the variable “out of the way,” or “regression as exclusion.” In their book Mostly Harmless Econometrics: An Empiricist’s Companion, Angrist and Pischke have a very nice chapter 3 entitled “Making Regression Make Sense.” Near the beginning of their book, they say that “the most interesting research in social science is about cause and effect, such as the effect of class size on children’s test scores.”

How do we run regressions? Overwhelmingly, the answer is by using least squares, justified by the Gauss-Markov theorem. In a characteristically brilliant, though at times challenging, 1975 book chapter, “After Gauss-Markov Least Squares, What?” Tukey deconstructs this theorem, and in so doing opens our eyes to the richness of our statistical world, in comparison with the poverty of the “core”. He views his task as “idol management.” After listing the seven “ifs” of the theorem, leading to the conclusion that the best estimate of any individual β or any linear combination of β’s is to be had by “least squares,” Tukey questions each “if” in turn, and uses each “to point a direction in which to move a suitable distance away from our idol.” In the discussion which follows, we meet nonlinear least squares, “minimizing potential” vs “balance of forces”, “indirect and imperfect” measurements, instrumental variables, weighting and misweighting, robustness via iterative reweighting, “insulation” and “transparency”, penalized regression and much more.
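To give one of these directions a concrete shape, here is a minimal sketch of robustness via iterative reweighting. It uses Huber weights with a MAD-based scale, which is a standard textbook choice and an assumption on my part, not a reconstruction of Tukey's own recipe; the function name `huber_irls`, the tuning constant, and the simulated data are likewise illustrative.

```python
import numpy as np

def huber_irls(X, y, c=1.345, n_iter=50, tol=1e-8):
    """Robust linear fit by iteratively reweighted least squares.

    Huber weights: 1 for small scaled residuals, c/|u| for large ones.
    The scale is re-estimated each pass from the median absolute deviation.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # start from ordinary least squares
    for _ in range(n_iter):
        resid = y - X @ beta
        mad = np.median(np.abs(resid - np.median(resid)))
        scale = max(mad / 0.6745, 1e-12)         # guard against a zero scale
        u = np.abs(resid) / scale
        w = np.where(u <= c, 1.0, c / np.maximum(u, 1e-12))
        sw = np.sqrt(w)
        # Weighted least squares via rescaled rows.
        beta_new = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta

# Simulated straight-line data with 10% gross outliers, purely illustrative.
rng = np.random.default_rng(2)
n = 100
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)
y[:10] += 30.0  # contaminate the first ten observations

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_robust = huber_irls(X, y)
print(beta_ols, beta_robust)
```

On data like these the reweighted fit is pulled far less by the contaminated points than ordinary least squares is. Whether that is the behaviour you want depends, once again, on the question being asked.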

The beauty of Tukey’s approach to MLR is that it can be revisited at any time, and applied to other areas. Idol management should always be with us.

 

Tukey believed in “idol management.” That’s not the judges on this TV show…

