May 15, 2014

# Terence’s Stuff: Creativity in Statistics

You probably know the old chestnut: He uses statistics as a drunken man uses lamp-posts, for support rather than illumination. But what do others (non-statisticians, non-applied statisticians) know of how we illuminate, rather than support, or fail to support?

What do we do when we spend days, weeks or months analysing a data set? How do we come up with a range of possible designs for an experiment or observational study? In what ways are creativity and imagination at work in our profession? Not only do I think others have little idea, I think we ourselves are remarkably reticent about it.

Part of this reticence probably stems from a reluctance to admit the subjectivity of much of what we do. There is also a concern about looking at data to decide what to do, before carrying out a frequentist procedure whose post-look operating characteristics will in general be different from the pre-look ones. Transforming the data is a simple case in point.
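To make the pre-look versus post-look distinction concrete, here is a minimal simulation sketch. It rests on assumptions that are not in the column: two groups drawn from the same lognormal distribution (so the null hypothesis is true), a large-sample normal approximation for the two-sample test, and a deliberately crude "look" in which the analyst tries both the raw and the log scale and keeps whichever p-value is smaller. The function names are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean, variance


def two_sample_p(a, b):
    """Two-sided p-value for a difference in means (normal approximation)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = (fmean(a) - fmean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def rejection_rates(reps=2000, n=40, alpha=0.05, seed=1):
    """Compare a pre-specified analysis with a 'look first' analysis.

    Both groups come from the same distribution, so every rejection
    is a false positive.
    """
    rng = random.Random(seed)
    fixed_log = 0   # analysis fixed in advance: always test on the log scale
    adaptive = 0    # analysis chosen after looking: keep the smaller p-value
    for _ in range(reps):
        a = [rng.lognormvariate(0, 1) for _ in range(n)]
        b = [rng.lognormvariate(0, 1) for _ in range(n)]
        p_raw = two_sample_p(a, b)
        p_log = two_sample_p([math.log(x) for x in a],
                             [math.log(x) for x in b])
        fixed_log += p_log < alpha
        adaptive += min(p_raw, p_log) < alpha
    return fixed_log / reps, adaptive / reps


if __name__ == "__main__":
    fixed, adaptive = rejection_rates()
    print(f"pre-specified log-scale test: {fixed:.3f}")
    print(f"scale chosen after looking:   {adaptive:.3f}")
```

By construction the adaptive procedure rejects whenever the pre-specified one does, and sometimes when it does not, so its false-positive rate can only be higher. Real exploratory choices are rarely this mechanical, which is part of why their effect on operating characteristics is hard to quantify.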

This suggests a paradox: the very things we might want to point out to someone as demonstrating our creativity and imagination (“We noticed that the data behave better after this adjustment”) are the same things we might want to suppress, for they could be viewed by someone else as compromising our analysis.

Of course we usually don’t cry foul when someone transforms data. But would we be happy to document all the marginal tables we produced, all the histograms, box-plots, scatter plots, cluster diagrams, PCA or home-made plots we looked at, all the stratifications we considered, all the models we entertained, all the fits and misfits we examined, along with their associated parameter estimates and outliers, as we inched our way towards an analysis we thought appropriate for addressing some question with our data? The process might start out simply, with summarizing, visualizing and carrying out exploratory analyses, but could go much further. When we notice things (a spike here, a wrong slope there) we usually do something about it: for example, discard, truncate or transform data, or modify a model. We might need to think about possible confounders, selection biases, aggregation, possibly relevant missing data, and much more. As all who have done this know, the list could be extended indefinitely, though in any given instance we might try just a few things, quickly (and probably unconsciously) eliminating scores of possible alternatives as we approach our preferred analysis.

In some contexts, such as prediction, where we want an unbiased estimate of the prediction error, these preliminaries may matter a lot, while in others, they may not. Experienced data analysts instinctively know how to avoid over-training, for example, by exploring one part of the available data, and then seeing how their impressions hold up on other parts. They may also do simulations.
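The habit of exploring one part of the data and checking impressions on another can be sketched in a few lines. This is an illustration under assumed conditions, not a recipe from the column: the response and all candidate predictors are pure noise, the "exploration" consists of picking the predictor most correlated with the response on the first half of the data, and the function names are hypothetical.

```python
import random
import statistics


def correlation(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def explore_then_confirm(rng, n=60, p=30):
    """Pick the 'best' predictor on one half, then re-check it on the other."""
    y = [rng.gauss(0, 1) for _ in range(n)]
    # p candidate predictors, each unrelated to y by construction
    predictors = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]
    half = n // 2
    # Exploration: choose the predictor that looks most promising
    best = max(range(p),
               key=lambda j: abs(correlation(predictors[j][:half], y[:half])))
    r_explore = abs(correlation(predictors[best][:half], y[:half]))
    # Confirmation: how does the same predictor fare on untouched data?
    r_confirm = abs(correlation(predictors[best][half:], y[half:]))
    return r_explore, r_confirm


def simulate(reps=200, seed=0):
    """Average the exploration and confirmation correlations over many runs."""
    rng = random.Random(seed)
    results = [explore_then_confirm(rng) for _ in range(reps)]
    mean_explore = statistics.fmean(r for r, _ in results)
    mean_confirm = statistics.fmean(c for _, c in results)
    return mean_explore, mean_confirm


if __name__ == "__main__":
    mean_explore, mean_confirm = simulate()
    print(f"mean |r| on the exploration half:  {mean_explore:.2f}")
    print(f"mean |r| on the confirmation half: {mean_confirm:.2f}")
```

Because nothing here is truly related to the response, the apparent strength of the selected predictor on the exploration half is entirely a product of the search, and it shrinks on the half that was not looked at.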

If we are the consulting or collaborating statistician in a team, it is highly unlikely that all these preliminaries will be documented, and appear in a publication. In my experience we rarely record all of them. Only occasionally do we see this sort of thing discussed in books, Peter Huber’s 2011 monograph Data Analysis being a notable example. When it comes to writing up, we typically describe only the end result. All this brings to mind Peter Medawar’s 1963 essay “Is The Scientific Paper A Fraud?” subtitled “Yes; It Misrepresents Scientific Thought.”

Does any of this matter? I have the impression—to be explored more in a later column—that many non-statisticians (dare I say, data scientists) are unaware of this activity of ours, of the importance we attach to it, and of the satisfaction we get from doing it well. But how can we complain if we conceal our tools, techniques and thought processes from others, and then find that when they are re-discovered, they are not seen to be part of Statistics, but of something else, perhaps Data Science, or Big Data? More importantly, how can we pass on our knowledge and experience in this area, if we don’t talk about it? What should we be doing?

We often say that we want to go beyond cookbook-like recipes for the stylized analysis of data, but this usually means that we want to convey theoretical understanding, not encourage creative cooking. Let’s acknowledge, even emphasize, the role of creativity in data analysis courses, including introductory statistics courses. With the advent of supplements to papers in most journals, including more details of our preliminaries in our write-ups is now straightforward, and many people already do. We should be talking about the creative process, not just when it leads to a novel tool or technique, but for the role it plays in our daily work.

Anyone for a piece of pumpkin pi? This one was made to celebrate Pi Day (March 14 – 3.14, get it?) which is also, coincidentally, Terry’s birthday…

## 2 Comments

• This topic is even more important than your post suggests. Academic datasets are often small and relatively clean. In industry, datasets are often huge and can be very messy. Looking at the overall effort to do an analysis, it is not uncommon for up to 75% of the effort to be expended on finding and cleaning the relevant data (from a large number of corporate data marts/warehouses, which are often very poorly documented), with the actual analysis, once the data are in hand, being 25% (or less) of the effort.

• […] Conclusion: plenty of experience of critiquing papers and running analyses does not seem to prevent a lack of understanding of what a hypothesis test really does and how it goes wrong familywise, sciencewise, or whatever. The recommended solution is to read some philosophy of science. The late, great Peter Lipton is my homeboy on this one. We should pre-specify everything we can, and not allow those around us to get away with vague scientific hypotheses that map to many possible statistical hypotheses (forks in the path). But we must also remember that sometimes you can't pre-specify everything, and that ultimately you are doing a sort-of-deductive process after an inductive guess at an interesting topic and before an inductive set of practical conclusions, so we should acknowledge this creativity and not brush it under the carpet. […]
