# Model-free inference in statistics: how and why

Dimitris Politis is a professor in the Department of Mathematics at the University of California, San Diego. He is one of the IMS Bulletin’s Contributing Editors, and a former Editor (January 2011–December 2013). Here, he writes about his most recent pastime, Model-Free Prediction:

**1. Estimation**

Parametric models served as the cornerstone for the foundation of Statistical Science in the beginning of the 20th century by R.A. Fisher, K. Pearson, J. Neyman, E.S. Pearson, W.S. Gosset (also known as “Student”), etc.; their seminal developments resulted in a complete theory of statistics that could be practically implemented using the technology of the time, i.e., pen and paper (and slide-rule!). While some models are inescapable, e.g. modeling a polling dataset as a sequence of independent Bernoulli random variables, others appear contrived, often invoked solely to make the mathematics work. As a prime example, the ubiquitous—and typically unjustified—assumption of Gaussian data permeates statistics textbooks to this day. Model criticism and diagnostics were subsequently developed as a practical way out.

With the advent of widely accessible powerful computing in the late 1970s, computer-intensive methods such as resampling and cross-validation created a revolution in modern statistics. Using computers, statisticians became able to analyze big datasets for the first time, paving the way towards the ‘big data’ era of the 21st century. But perhaps more important was the realization that the way we do the analysis could/should be changed as well, as practitioners were gradually freed from the limitations of parametric models. For instance, the great success of Efron’s (1979) bootstrap was in providing a complete theory for statistical inference in a nonparametric setting, much like Maximum Likelihood Estimation had done half a century earlier under the restrictive parametric setup.

Nevertheless, there is a further step one may take, i.e., going beyond even nonparametric models. To explain this, let us first focus on regression, i.e., data that are pairs: (*Y*_{1},*X*_{1}), (*Y*_{2},*X*_{2}), … , (*Y _{n},X_{n}*) where

*Y*_{i} is the measured response associated with a regressor value of *X*_{i}. The standard homoscedastic additive model in this situation reads:

*Y*_{i} = *μ*(*X*_{i}) + *ϵ*_{i}    (1)

where the random variables *ϵ*_{i} are assumed to be independent, identically distributed (i.i.d.) from a distribution *F*(·) with mean zero.

• Parametric model: Both *μ*(·) and *F*(·) belong to parametric families of functions, i.e., a setup where the only unknown is a finite-dimensional parameter; a typical example is straight-line regression with Gaussian errors, i.e., *μ*(*x*) = *β*_{0} + *β*_{1}*x* and *F*(·) being *N*(0, *σ*^{2}).

• Semiparametric model: *μ*(·) belongs to a parametric family, whereas *F*(·) does not; instead, it may be assumed that *F*(·) belongs to a smoothness class, e.g., assume that *F*(·) is absolutely continuous.

• Nonparametric model: Neither *μ*(·) nor *F*(·) can be assumed to belong to parametric families of functions.

Despite its nonparametric aspect, even the last option constitutes a model, and can thus be rather restrictive. To see why, note that eq. (1) with i.i.d. errors is not satisfied in many cases of interest, even after allowing for heteroscedasticity of the errors. Nevertheless, it is possible to shun eq. (1) altogether and instead adopt a *model-free* setup that can be described as follows.

• Model-Free Regression:

*– Random design.* The pairs (*Y*_{1},*X*_{1}), (*Y*_{2},*X*_{2}), … , (*Y _{n},X_{n}*) are i.i.d.

*– Deterministic design.* The variables *X*_{1}, … , *X*_{n} are deterministic, and the random variables *Y*_{1}, … , *Y*_{n} are independent with common conditional distribution, i.e., *P*{*Y*_{j} ≤ *y* | *X*_{j} = *x*} = *D*_{x}(*y*) not depending on *j*.

Inference for features, i.e. functionals, of the common conditional distribution *D*_{x}(·) is still possible under some regularity conditions, e.g. smoothness. Arguably, the most important such feature is the conditional mean *E*(*Y* | *X* = *x*), which can be denoted *μ*(*x*). When *μ*(*x*) can be assumed smooth, it can be consistently estimated by a local average and/or local polynomial. Asymptotic normality and/or resampling can then be invoked to construct confidence intervals for *μ*(*x*).
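To make the local-average idea concrete, here is a minimal sketch of a Nadaraya–Watson estimate of *μ*(*x*) in Python. The data-generating function sin(*x*), the noise level, and the bandwidth below are hypothetical choices for illustration, not taken from the article.

```python
import numpy as np

def nadaraya_watson(x0, X, Y, h):
    """Local-average (Nadaraya-Watson) estimate of mu(x0) = E(Y | X = x0),
    using a Gaussian kernel with bandwidth h."""
    w = np.exp(-0.5 * ((X - x0) / h) ** 2)  # kernel weights around x0
    return np.sum(w * Y) / np.sum(w)

# Hypothetical data satisfying Y_i = mu(X_i) + eps_i with mu(x) = sin(x)
rng = np.random.default_rng(0)
n = 500
X = rng.uniform(0, 2 * np.pi, n)
Y = np.sin(X) + rng.normal(0, 0.3, n)

mu_hat = nadaraya_watson(np.pi / 2, X, Y, h=0.3)  # true value: sin(pi/2) = 1
print(round(mu_hat, 2))
```

With more data and a suitably shrinking bandwidth, the estimate converges to the true conditional mean; asymptotic normality or resampling then gives confidence intervals around it.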

**2. Prediction**

Traditionally, the problem of prediction has been approached in a model-based way, i.e., first fit a model such as (1), and then use the fitted model for prediction of a future response *Y*_{f} associated with a regressor value *x*_{f}. Note that even in the absence of model (1), the conditional expectation *μ*(*x*_{f}) = *E*(*Y*_{f} | *X*_{f} = *x*_{f}) is the Mean Squared Error (MSE) optimal predictor of *Y*_{f}. As already mentioned, *μ*(*x*_{f}) can be estimated in a Model-Free way and then used for predicting *Y*_{f}, but a problem remains: how to gauge the accuracy of prediction, i.e., how to construct a prediction—as opposed to confidence—interval.

Interestingly, it is possible to accomplish the goal of point and interval prediction of *Y*_{f} under the Model-Free regression setup in a direct fashion, i.e., without the intermediate step of model-fitting; this is achieved via the **Model-Free Prediction Principle** expounded upon in Politis (2015). Model-Free Prediction restores the emphasis on observable quantities, i.e., current and future data, as opposed to unobservable model parameters and estimates thereof. In this sense, the Model-Free Prediction Principle is concordant with Bruno de Finetti’s statistical philosophy. Notably, being able to predict the response *Y*_{f} associated with the regressor *X*_{f} taking on *any* possible value (say *x*_{f}) seems to inadvertently also achieve the main goal of modeling, i.e., trying to relate how *Y* depends on *X*. In so doing, the solution to interesting estimation problems is obtained as a by-product, e.g. inference on features of *D*_{x}(·) such as its mean *μ*(*x*). In other words, just as prediction can be treated as a by-product of model-fitting, key estimation problems can be solved as a by-product of being able to perform prediction. Hence, a Model-Free approach to frequentist statistical inference is possible, including prediction and confidence intervals.

**3. The Model-Free Prediction Principle**

Consider the Model-Free regression set-up with a vector of observed responses *Y*_{n} = (*Y*_{1}, … , *Y*_{n})′ that are associated with the vector of regressors *X*_{n} = (*X*_{1}, … , *X*_{n})′. Also consider the enlarged vectors *Y*_{n+1} = (*Y*_{1}, … , *Y*_{n}, *Y*_{n+1})′ and *X*_{n+1} = (*X*_{1}, … , *X*_{n}, *X*_{n+1})′ where (*Y*_{n+1}, *X*_{n+1}) is an alternative notation for (*Y*_{f}, *X*_{f}); recall that *Y*_{f} is yet unobserved, and *X*_{f} will be set equal to the value *x*_{f} of interest. If the *Y*_{i}s were i.i.d. (and not depending on their associated *X*_{i} value), then prediction would be trivial: the MSE-optimal predictor of *Y*_{n+1} is simply given by the common expected value of the *Y*_{i}s, completely disregarding the value of *X*_{n+1}.

In a nutshell, the Model-Free Prediction Principle amounts to using the structure of the problem in order to **find an invertible transformation *H*_{m} that can map the non-i.i.d. vector *Y*_{m} to a vector *ϵ*_{m} = (*ϵ*_{1}, … , *ϵ*_{m})′ that has i.i.d. components**; here *m* could be taken equal to either *n* or *n*+1 as needed. Letting *H*_{m}^{−1} denote the inverse transformation, we have *ϵ*_{m} = *H*_{m}(*Y*_{m}) and *Y*_{m} = *H*_{m}^{−1}(*ϵ*_{m}), i.e.,

$\underline{Y}_m \stackrel{H_m}{\longmapsto} \underline{\epsilon}_m \ \ \mbox{and} \ \ \underline{\epsilon}_m \stackrel{H_m^{-1}}{\longmapsto} \underline{Y}_m .$ (2)

If the practitioner is successful in implementing the Model-Free procedure, i.e., in identifying (and estimating) the transformation *H _{m}* to be used, then the prediction problem is reduced to the trivial one of predicting i.i.d. variables. To see why, note that eq. (2) with $m=n+1$ yields

*Y*_{n+1} = *H*_{n+1}^{−1}(*ϵ*_{n+1}) = *H*_{n+1}^{−1}(*ϵ*_{n}, *ϵ*_{n+1}). But *ϵ*_{n} can be treated as known (and constant) given the data *Y*_{n}; just use eq. (2) with *m* = *n*. Since the unobserved *Y*_{n+1} is just the (*n*+1)^{th} coordinate of the vector *Y*_{n+1}, we have just expressed *Y*_{n+1} as a function of the unobserved *ϵ*_{n+1}. Note that predicting a function, say *g*(·), of an i.i.d. sequence *ϵ*_{1}, … , *ϵ*_{n}, *ϵ*_{n+1} is straightforward because *g*(*ϵ*_{1}), … , *g*(*ϵ*_{n}), *g*(*ϵ*_{n+1}) is simply another sequence of i.i.d. random variables. Hence, the practitioner can use this simple structure to develop point predictors for the future response *Y*_{n+1}.

Prediction intervals can then be immediately constructed by resampling the i.i.d. variables *ϵ*_{1}, … , *ϵ*_{n}; this can be thought of as an extension of the model-based, residual bootstrap of Efron (1979) to Model-Free settings since, if model (1) were to hold true, the residuals from the model could be considered as the outcomes of the requisite transformation *H*_{n}.
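To make the transformation-plus-resampling recipe concrete, here is a hedged sketch for the special case where model (1) happens to hold: the role of *H*_{n} is then played by subtracting a kernel estimate of *μ*(·), so the transformed variables are simply the residuals, and resampling them yields a prediction interval. The sin(*x*) trend, noise level, bandwidth, and nominal 95% level are all hypothetical choices for this example.

```python
import numpy as np

rng = np.random.default_rng(1)

def mu_hat(x0, X, Y, h=0.3):
    # Nadaraya-Watson local average of Y at x0 (Gaussian kernel)
    w = np.exp(-0.5 * ((X - x0) / h) ** 2)
    return np.sum(w * Y) / np.sum(w)

# Hypothetical data satisfying model (1): Y = mu(X) + eps
n = 400
X = rng.uniform(0, 2 * np.pi, n)
Y = np.sin(X) + rng.normal(0, 0.3, n)

# "Transformation" step: map the Y data to (approximately) i.i.d. components;
# under model (1) these are just the regression residuals
eps = Y - np.array([mu_hat(x, X, Y) for x in X])

# Prediction of Y_f at x_f: invert the transformation on resampled epsilons
x_f = np.pi / 2
point_pred = mu_hat(x_f, X, Y)
boot = point_pred + rng.choice(eps, size=5000, replace=True)
lo, hi = np.percentile(boot, [2.5, 97.5])   # nominal 95% prediction interval
print(point_pred, (lo, hi))
```

The Model-Free Prediction Principle generalizes this recipe: any estimable invertible *H*_{n} that produces i.i.d. components can play the role of the residual map, with no model like (1) assumed.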

**4. Time series**

Under regularity conditions, a transformation such as *H*_{m} of the Model-Free Prediction Principle always exists but is not necessarily unique. For example, if the variables (*Y*_{1}, … , *Y*_{m}) have an absolutely continuous joint distribution and no explanatory variables *X* are available, then the Rosenblatt (1952) transformation can map them onto a set of i.i.d. random variables. Nevertheless, estimating the Rosenblatt transformation from data may be infeasible except in special cases. On the other hand, a practitioner may exploit a given structure for the data at hand, e.g., a regression structure, in order to construct a different, case-specific transformation that may be practically estimable from the data.

Recall that the Rosenblatt transformation maps an arbitrary random vector *Y*_{m} = (*Y*_{1}, … , *Y*_{m})′ having absolutely continuous joint distribution onto a random vector *U*_{m} = (*U*_{1}, … , *U*_{m})′ whose entries are i.i.d. Uniform(0,1); this is done via the probability integral transform based on conditional distributions. For *k* > 1 define the conditional distributions *F*_{k}(*y*_{k} | *y*_{k−1}, … , *y*_{1}) = *P*{*Y*_{k} ≤ *y*_{k} | *Y*_{k−1} = *y*_{k−1}, … , *Y*_{1} = *y*_{1}}, and let *F*_{1}(*y*_{1}) = *P*{*Y*_{1} ≤ *y*_{1}}. Then the Rosenblatt transformation amounts to letting *U*_{1} = *F*_{1}(*Y*_{1}), *U*_{2} = *F*_{2}(*Y*_{2} | *Y*_{1}), *U*_{3} = *F*_{3}(*Y*_{3} | *Y*_{2}, *Y*_{1}), …, and *U*_{m} = *F*_{m}(*Y*_{m} | *Y*_{m−1}, … , *Y*_{2}, *Y*_{1}).
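For intuition, the following sketch applies the Rosenblatt transformation in a case where every conditional distribution is available in closed form: a stationary Gaussian AR(1) chain with known parameters (a hypothetical choice for illustration, not from the article). By the Markov property, *F*_{k}(*y* | *y*_{k−1}, … , *y*_{1}) reduces to Φ((*y* − φ*y*_{k−1})/σ), and the resulting *U*_{k} should behave like i.i.d. Uniform(0,1) draws.

```python
import numpy as np
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Hypothetical Gaussian AR(1) chain with KNOWN phi and sigma
phi, sigma = 0.6, 1.0
rng = np.random.default_rng(2)
m = 2000
Y = np.empty(m)
Y[0] = rng.normal(0, sigma / sqrt(1 - phi**2))   # stationary initial value
for k in range(1, m):
    Y[k] = phi * Y[k-1] + rng.normal(0, sigma)

# Rosenblatt transformation: U_1 = F_1(Y_1), U_k = F_k(Y_k | Y_{k-1})
U = np.empty(m)
U[0] = Phi(Y[0] * sqrt(1 - phi**2) / sigma)
U[1:] = [Phi((Y[k] - phi * Y[k-1]) / sigma) for k in range(1, m)]

# The U_k should look like i.i.d. Uniform(0,1): mean ~0.5, lag-1 correlation ~0
print(U.mean(), np.corrcoef(U[:-1], U[1:])[0, 1])
```

The practical difficulty the article points to is precisely that φ and σ, and more generally the *F*_{k}, are unknown and must be estimated.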

The problem is that the distributions *F*_{k} for *k* ≥ 1 are typically unknown and must be estimated (in a continuous fashion) from the *Y*_{n} data at hand. However, unless there is some additional structure, this estimation task may be unreliable or plain infeasible for large *k*. As an extreme example, note that to estimate *F*_{n} we would have only one point (in *n*-dimensional space) to work with. Hence, without additional assumptions, the estimate of *F*_{n} would be a point mass, which is a completely unreliable estimate, and of little use in terms of constructing a probability integral transform due to its discontinuity.

An example of additional structure is the Markov setup. To elaborate, suppose that the data *Y*_{1}, … , *Y*_{n} are a realization of a stationary (and ergodic) Markov chain. In this case, the conditional distributions *F*_{k} for all *k* > 1 are completely determined by the one-step transition distribution, namely *F*_{2}. To see why, note that the Markov assumption implies that *P*{*Y*_{k} ≤ *y*_{k} | *Y*_{k−1} = *y*_{k−1}, … , *Y*_{1} = *y*_{1}} = *P*{*Y*_{k} ≤ *y*_{k} | *Y*_{k−1} = *y*_{k−1}} for *k* > 1. Hence, the practitioner may use kernel smoothing or a related technique on the data pairs {(*Y*_{j}, *Y*_{j+1}) for *j* = 1, … , *n*−1} in order to estimate the common joint distribution of these pairs. In turn, this yields estimates of *F*_{1} and *F*_{2}, and by extension *F*_{k} for *k* > 2, so that the Rosenblatt transformation can be practically implemented as part of the Model-Free Prediction Principle.
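A minimal sketch of this kernel-smoothing step, assuming (hypothetically) that the data come from a Gaussian AR(1) chain: the one-step transition distribution *F*_{2}(*y* | *x*) is estimated by a Nadaraya–Watson weighted average of the indicators 1{*Y*_{j+1} ≤ *y*} over the pairs (*Y*_{j}, *Y*_{j+1}). The parameters and bandwidth are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical stationary Markov data: Gaussian AR(1) with phi = 0.6
phi, n = 0.6, 5000
Y = np.empty(n)
Y[0] = rng.normal(0, 1 / np.sqrt(1 - phi**2))
for j in range(1, n):
    Y[j] = phi * Y[j-1] + rng.normal()

def F2_hat(y, x, h=0.25):
    """Kernel (Nadaraya-Watson) estimate of the one-step transition
    distribution P{Y_{j+1} <= y | Y_j = x}, built from the pairs
    (Y_j, Y_{j+1}), j = 1, ..., n-1."""
    w = np.exp(-0.5 * ((Y[:-1] - x) / h) ** 2)   # kernel weights around x
    return np.sum(w * (Y[1:] <= y)) / np.sum(w)

# For this chain the true conditional median at x = 0 is y = phi * 0 = 0,
# so F2(0 | 0) = 0.5; the kernel estimate should land near that value
print(round(F2_hat(0.0, 0.0), 2))
```

Feeding such an estimated *F*_{2} into the conditional probability integral transforms of the previous section gives a practically implementable, data-driven Rosenblatt transformation for Markov data.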

Further examples of transformations applicable to diverse settings with regression and/or time series data are discussed in Politis (2015).

**References**

[1] Efron, B. (1979). Bootstrap methods: another look at the jackknife. *Ann. Statist.*, vol. 7, pp. 1–26.

[2] Politis, D.N. (2015). *Model-Free Prediction and Regression: A Transformation-Based Approach to Inference,* Springer, New York.

[3] Rosenblatt, M. (1952). Remarks on a multivariate transformation. *Ann. Math. Statist.*, vol. 23, pp. 470–472.
