Dec 15, 2011

Terence’s Stuff: Looking

 

Me: “Is there any evidence of x effects in the data?”

Them: “No.”

Me: “Have you looked?”

Them: “No.”

I’ve had this conversation many times, where x = batch, temporal, spatial, machine, reagent, operator or other effects, and I expect to have it again. Why? Perhaps people want to reverse Matthew 7:7 into, “Don’t seek, and you won’t find”, as they expect that whatever they do find will be bad. Perhaps they don’t dare look at their data before they formulate their model and address the questions of interest, lest their looking invalidate their answers. (It may, but that may also be right.) I think the style of some of our teaching (statistician as police officer, or keeper of the disciplinary faith) and some (regulatory) practice encourages such a view.

I prefer to go with Yogi Berra, who said, “You can see a lot by just looking.” One of the reasons I’m so keen on looking is that in my data-rich world there are always things to find, and that can be fun. I call them artifacts, though features might be a more positive term (cf. the 1970s quip: “Is it a bug or a feature?”). The question for me is not, “Are there artifacts in my data?” for the answer to this question is invariably, “Yes!” What concerns me is whether they are a major problem.

On a related point, one of the things I’ve noticed over time is that as we get confronted with more and more data, we tend to look at it less and less. It should be the other way around.

Do some people have a problem looking at large data sets, and if so, why? I think the answer is yes, some do, and I offer a few possible reasons. One is that large data sets are frequently produced by complex, multi-step processes, involving technologies that can be a challenge to understand. As a result, like the Little Prince (“Quand le mystère est trop impressionnant, on n’ose pas désobéir”: when a mystery is too overpowering, one dare not disobey), people take such data at face value. Another possibility is a blind faith in numbers: a feeling that if there is a lot of data, the answer that falls out must be overwhelmingly more probable than any of the alternatives, and that no artifact will change the conclusions. My third reason is that we all need to think harder, because simply repeating what we used to do with 10 variables is not an option when we have 10,000 variables. A change in perspective is required. With a small data set we look at all our data, do some analyses, and finish off with further looks. With large data sets the first step is necessarily reduced, so we need a much more thorough third step. That is, our focus needs to be more on looking for things that might change our conclusions, not things that support (or fail to support) our assumptions. Also, we may be unsure what to do if we see problems. Or perhaps there is now so much data that no single set seems to warrant as careful consideration as it might have received in the past, before we move on.

None of these reasons should be entertained. We must work hard to understand our measurement processes. Artifacts are frequently the largest effects. There are some good ways of looking, and of responding when we find something untoward, and we should use them, though more ways will always be welcome. Lastly, there is no reason to become complacent: some large data sets can be very rich indeed, and deserve thorough examination.

How should we seek, and what can we do when we find? In the last decade much use has been made of histograms or qq-plots of test statistics or p-values. These are valuable indicators of the health of an analysis: if your p-value distribution has problems, your analysis has problems. Also useful are negative controls, variables that should be unaffected by your treatments; and positive controls, which are variables that should be affected by treatments in known ways. If your controls don’t behave as expected, then you have a problem, and something needs to be done. Further, most large data sets have other information, sometimes called metadata, part of which might be associated with your final estimates, test statistics or p-values. Your task is to decide wisely which are worth looking at, and then to do so. There are statistical ways to try to deal with known or unknown artifacts, which don’t necessarily require that you understand how they arose. You should seek evidence of their fingerprints, and do something about that. Explanations may come later, as in the “Wednesday effect” in Primo Levi’s Silver or the “method effect” in Lord Rayleigh’s Anomaly Encountered in Determinations of the Density of Nitrogen Gas. Seek, find and fix, perhaps understand.
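To make the first two suggestions concrete, here is a minimal sketch in Python of what looking at p-values from negative controls might involve. It is an illustration under stated assumptions, not a recommended procedure: the function names, the simulated p-values and the bin count are all hypothetical, and the Kolmogorov-Smirnov distance is simply one convenient summary of how far a set of p-values strays from the uniform distribution we would expect of controls in a healthy analysis.

```python
import random


def pvalue_histogram(pvalues, bins=10):
    """Count p-values in equal-width bins on [0, 1]."""
    counts = [0] * bins
    for p in pvalues:
        # min() keeps p == 1.0 inside the last bin.
        counts[min(int(p * bins), bins - 1)] += 1
    return counts


def ks_distance_from_uniform(pvalues):
    """Largest gap between the empirical CDF and the Uniform(0, 1) CDF."""
    ordered = sorted(pvalues)
    n = len(ordered)
    gap = 0.0
    for i, p in enumerate(ordered):
        # The empirical CDF jumps from i/n to (i+1)/n at p; check both sides.
        gap = max(gap, (i + 1) / n - p, p - i / n)
    return gap


def fraction_below(pvalues, alpha=0.05):
    """Share of p-values below alpha; close to alpha for well-behaved controls."""
    return sum(p < alpha for p in pvalues) / len(pvalues)


# Simulated (hypothetical) negative-control p-values.
rng = random.Random(0)
healthy = [rng.random() for _ in range(2000)]      # uniform, as hoped
skewed = [rng.random() ** 2 for _ in range(2000)]  # piled up near zero

for label, pvals in [("healthy", healthy), ("skewed", skewed)]:
    print(label)
    print("  histogram:", pvalue_histogram(pvals))
    print("  KS distance from uniform:", round(ks_distance_from_uniform(pvals), 3))
    print("  fraction below 0.05:", round(fraction_below(pvals), 3))
```

A roughly flat histogram, a small distance and a fraction near 0.05 are what one hopes to see for negative controls. A pile-up near zero, as in the second simulated set, is the kind of fingerprint that says something needs to be done, whether or not we yet understand how it arose.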

“The thing that is important is the thing that is not seen…” says the Little Prince,
sculpted here on his B-612 Asteroid, at the French theme park in Hakone, Japan

1 Comment

  • […] Piled Higher and Deeper, by Jorge Cham. This cartoon, originally published at http://www.phdcomics.com, echoes the sentiments in Terry Speed’s last column. You can read what he has to say about looking—really looking—for solutions, in his column on page 13 of the Jan/Feb 2012 issue, or here. […]
