Aug 28, 2015

What is the core of Data Science?

David Dunson, Arts and Sciences Professor of Statistical Science at Duke University, writes:

What is the core of data science? To address this, I think it is necessary to first touch on the question of what data science is. Certainly there is not one agreed-upon definition of what data science is, exactly. At Duke we had a recent search for an open-rank data science faculty position, and we received extremely disparate applicants, ranging from theoretically focused researchers studying properties of machine learning algorithms to optimization experts to image processors to applied mathematicians interested in large-scale applications in neurosciences and power grid optimization. The PhD fields of these applicants varied extremely widely, including, but not limited to, statistics, computer science, mathematics, electrical engineering and physics. I received candidates with similarly varied backgrounds when I recently advertised for a “Bayesian data science” postdoctoral fellow.

The consensus that we came to in our search, which is also my own view, is that a data scientist is an individual who is driven primarily by the application and uses whatever statistical, computational and algorithmic tools they can come up with to develop new knowledge and insights in that application area. If the field of study is an area of science (e.g., neuroscience, genomics) then the data scientist is a full-fledged scientist in their corresponding area, but instead of collecting new data in their labs they exploit existing large, complex and disparate data sources to obtain new scientific insights.

Given this view of data science, it is not at all surprising that the rise of data science has ended up blurring disciplines and attracting individuals with highly disparate backgrounds in the mathematical sciences (broadly defined). Many view this as a threat to statistics as a discipline. Increasingly, the caricature of a statistician is a reserved, conservatively thinking stickler for foundations and theoretical support, who is so slowed down by their own principles that they study toy algorithms that aren’t useful in real-world, large-scale applications. Meanwhile the hip and cool machine learning types charge ahead in creatively developing wild new algorithms and approaches and diving right into big exciting applications. Then, not surprisingly, the lion’s share of the increasing research dollars associated with data science topics goes to the latter group. These stereotypes, which have a seed of underlying truth to them, should serve as a wake-up call to statisticians to make their work more relevant to modern applications.

The overarching motivation for organizing this conference [the IMS-Microsoft Research workshop on Foundations of Data Science; see the interview with the other organizers here], and for hopefully kick-starting an IMS group focused on this topic, is to bring together leaders in different aspects of data science to move towards establishing the foundations of data science. Classical statistical theory, methods and principles are increasingly not relevant to modern data science problems, and new foundations need to be established, going well beyond statistical theory for large p, small n problems.

There are several directions to take in closing the gulf between statistical foundations and data science practice. One is to have mathematical statisticians become more seriously engaged in understanding why highly successful algorithms, such as deep learning, have such good behavior. This is a type of top-down approach. Another is to become more cognizant of, and seriously engaged in, what successful data scientists are actually doing in terms of the process of obtaining the data to analyze, reducing dimension, doing many analyses, reporting and summarizing the results, etc. The next step is then to attempt to develop a realistic statistical formalism for establishing optimality and other properties, taking into account more of the pipeline, including computational time, storage, etc.

My own view is that the data science revolution has been extremely intellectually stimulating and exciting. In a very short time, it has had the impact of dramatically reducing siloing of data scientists based on their PhD field and department affiliation. For many years, different communities proceeded independently, working on essentially identical problems, but with different notation, publication outlets and perspectives. Mostly these communities were unaware of each other even when working on exactly the same problems. This has shifted dramatically, partly due to the growing tendency to establish interdisciplinary big data and data science centers or institutes. For example, at Duke we have the “Information Initiative at Duke” (IID), which has wonderful dedicated space, a core faculty having PhDs and primary department affiliations in many different fields, a vibrant seminar series, and great cross-talk between research groups at all levels, including undergraduates. I have grown to enjoy the “data seminar”, organized by a topologist and focusing on cool math-y stuff people do with data, more than our regular statistics seminars. I’m more likely to see new intellectually stimulating ideas that will deeply impact my work, while there isn’t as much that surprises me in a usual statistics seminar after 20+ years in the field. This has definitely improved the quality of my work, and I’m hoping efforts such as the Foundations of Data Science conference will similarly stimulate others.

