Last week, on January 18, 2022 to be precise, statistician Sir David Cox passed away. He is famous for his contributions to, among others, survival analysis, stochastic processes and statistical inference.
In a rather unlikely coincidence, he visited the University of Washington to give a lecture on November 6, 2008, while I was there for my PhD internship at the Department of Statistics. Sir David gave an overview of his work using several applied examples, intertwined with lovely British humour. Here are two pages from my old notebook (yes, I keep those) with notes from that lecture.
As you can see, apart from the sketchy descriptions of the examples he gave, I noted a quote:
Problems that cannot be solved non-parametrically should not be solved parametrically unless there is a theory.
It is a remark which, for a social scientist trained in (if not brainwashed into) viewing most of the world as a “general linear reality”1, was quite enlightening. I’m guessing anybody digesting the current empirical literature, at least in the social sciences, will agree that it is rather difficult to see the above principle applied in practice. Which is unfortunate. After all, we would like the model to be an honest representation (including the functional form) of what we know and what we don’t know about the data-generating process.
A piece of Sir David’s writing that I cherish is his comment on Breiman’s (2001) Statistical Modeling: The Two Cultures, which you can read here. While I recommend reading the whole set (Breiman’s paper, all the comments and the rejoinder), it is Sir David’s comment that I feel especially close to, primarily because of the issues he raises. My opinionated exegesis of them is:
- Black-boxing the understanding of the problem at hand for the sake of prediction. Is there a place for substantive thinking in statistical modeling vis-à-vis contemporary machine learning approaches preoccupied with prediction? Yes, there is. I think it is especially evident in the current COVID times, when there is little to no data from which a machine could learn to forecast the development of the pandemic, and it is the more substance-driven models that excel.2
- Data as such are not suspended in a vacuum. Rather, they are the fuel for answering research, business and policy questions. Hence one should start with a question, not with data.
Last but not least, the Cox & Miller (1977) book is right next to Kemeny & Snell (1960) on my shelf as an invaluable classical reference on stochastic processes.
Abbott, A. (1988). Transcending General Linear Reality. Sociological Theory 6(2):169-186.