Apr 1, 2017

Terence’s Stuff: It Exists!

“One of the leading results on Brownian motion is that it exists.” So wrote David Freedman on page 1 of his book on the topic. I recalled this statement several years ago when I was writing in this column about proofs, and it came to mind again recently when some younger colleagues asked me to explain Dirichlet processes (DPs) to them. More precisely, they asked me to explain three-level hierarchical Dirichlet process (HDP) mixture models, something that I’d never seen before, and which are very easy to write down using a plate diagram, though not quite as easy to grasp. Until now, I’ve been a bit blasé about Bayesian nonparametrics, thinking that I could probably reach the end of my career without getting far into it. But I was wrong, for big data has caught up with me.

One of the most exciting things in my world these days is a little machine known as the MinION, which can produce lots of reads, each of hundreds of thousands of base pairs from a single DNA molecule. This machine can be held in your hand (“smaller than your smartphone”), and plugged into your computer: if you’re not careful, it can crash the computer by delivering more data than it can take in. The desktop version, called PromethION, can generate a thousand times more data. Having been involved in DNA sequencing for a while, handheld machines generating 10–20 giga-basepairs of DNA sequence in 48 hours have almost ceased to impress me, but desktop versions generating 12 tera-basepairs of sequence in 48 hours still get my attention. What is even more amazing to me is that the signals from which all this DNA sequence data is derived are single electrical currents measured several thousand times per second. This is the big data that has forced my colleagues to come to terms with these models: not just HDP mixture models, but also convolutional neural networks, and standard and long short-term memory recurrent neural networks, the statistical machinery of deep learning. I have finally been dragged into the twenty-first century.

What’s all this got to do with existence? Wonderfully, perhaps surprisingly, DPs first saw the light of day in 1973 in papers in our own Annals of Statistics, written by Thomas Ferguson and colleagues. And there he had to prove their existence. He did so using facts about Dirichlet distributions to show that Kolmogorov’s consistency conditions for a projective system were satisfied, and so a limit, the DP, should exist. That’s why I got called in. Who among the present generation of statistics students knows about the existence of random processes or projective limits? In 1973 and since then, alternative derivations of the DP were discovered. Some, such as the equivalent Pólya urn scheme, are indirect. Others, such as Ferguson’s use of gamma processes or Jayaram Sethuraman’s stick-breaking representation, are more direct and constructive. It turned out that some restrictions were necessary on the underlying measure spaces for Ferguson’s original existence proofs to work, restrictions which weren’t necessary for some other constructions. So we didn’t need Kolmogorov’s theorem after all.
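For readers who would like to see why the stick-breaking representation counts as constructive, here is a minimal Python sketch of a truncated draw from a DP. The function name `stick_breaking`, the concentration value 2.0, the standard normal base measure and the truncation at 50 atoms are all illustrative assumptions, not anything taken from the papers mentioned above.

```python
import random


def stick_breaking(alpha, base_sampler, num_atoms, rng):
    """Truncated stick-breaking draw from DP(alpha, G0) (illustrative sketch).

    base_sampler(rng) is a hypothetical callable returning one draw from G0.
    Returns the weights, the atoms, and the stick length left over after
    truncation.
    """
    weights, atoms = [], []
    remaining = 1.0  # length of the stick not yet broken off
    for _ in range(num_atoms):
        frac = rng.betavariate(1.0, alpha)  # break off a Beta(1, alpha) fraction
        weights.append(remaining * frac)
        atoms.append(base_sampler(rng))     # location of this atom, drawn from G0
        remaining *= 1.0 - frac
    return weights, atoms, remaining


# Assumed example: alpha = 2, G0 = standard normal, 50 atoms.
rng = random.Random(0)
weights, atoms, leftover = stick_breaking(2.0, lambda r: r.gauss(0.0, 1.0), 50, rng)
```

The weights and the leftover stick sum to one, so the pairs of weights and atoms describe a discrete random probability measure, up to the small mass lost to truncation.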

Why do we need existence proofs anyway? Using DPs to analyse data boils down to simple arithmetic procedures whose behavior doesn’t appear to demand deep existence proofs. If people can use HDP mixture models effectively in practice without ever having thought of the question of existence, who are we to criticize? Louis Bachelier found Brownian motion valuable before Norbert Wiener proved that it exists. I have the impression that physicists have sometimes drawn valid inferences about the world from theory that wasn’t fully grounded until later.
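To give a sense of how simple the arithmetic can be, here is a minimal Python sketch of the Pólya urn scheme mentioned earlier, which generates a sequence of draws whose predictive rule is that of a DP. The function name `polya_urn`, the concentration value 2.0, the standard normal base measure and the sample size are illustrative assumptions.

```python
import random


def polya_urn(alpha, base_sampler, n, rng):
    """Generate n draws by the Polya urn scheme (illustrative sketch).

    base_sampler(rng) is a hypothetical callable returning one draw from G0.
    """
    draws = []
    for i in range(n):
        # With probability alpha / (alpha + i), draw a fresh value from G0;
        # otherwise repeat one of the i earlier draws, chosen uniformly.
        if rng.random() < alpha / (alpha + i):
            draws.append(base_sampler(rng))
        else:
            draws.append(rng.choice(draws))
    return draws


# Assumed example: alpha = 2, G0 = standard normal, 200 draws.
rng = random.Random(1)
draws = polya_urn(2.0, lambda r: r.gauss(0.0, 1.0), 200, rng)
num_clusters = len(set(draws))  # number of distinct values among the draws
```

Nothing here requires knowing that the underlying random measure exists: the procedure only ever compares a uniform random number with a ratio and copies earlier values.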

For most of my career, I have started thinking about the questions and the data that cross my path in traditional terms, including devising graphical displays that suggest a way ahead, using linear models and their generalizations, various forms of multivariate analysis, possibly latent variables, and at times context-specific models based on some scientific or technological background. There’s always been plenty of theory in the background for me to consult if I wished. In 2013, the International Year of Statistics, I attended the London Workshop on the Future of the Statistical Sciences. I felt at home, and I liked the published report. Now, with my scientific collaborators asking questions concerning data off the MinION, I no longer feel at home. I need to go far beyond what I currently know, and I have become deeply conscious of the power of deep learning. Most of the theory is unfamiliar, indeed much is unlike what I have come to think of as theory. I should have been paying closer attention, as none of these things are really new.

One of them may even become Stephen Stigler’s eighth Pillar of Statistical Wisdom.
