Futuring Paper – Analytics and Data Science

Co-Conveners:
Catherine Lawson (Geography and Planning), and
Benjamin Shaw (Health Policy, Management, and Behavior)

This paper summarizes our analysis of the forces currently affecting the field of “Analytics and Data Science”, the forces that will shape the field in the future, and the potential implications. The summary is based on a review of trends described in the literature and a content analysis of responses from our colleagues and ourselves to an online survey (n = 22) conducted September 16-23, 2016. It is structured around four main questions:

1) What forces are shaping the disciplines of Analytics and Data Science today (for either the research or the relevant workforce communities)?
We have identified four major forces currently shaping the field: a) advances in the use of computer and telecommunications systems for storing, retrieving, and sending information (i.e., advances in information technology, or IT); b) increases in the amount of data that are collected; c) advances in the application software used to interact with data and IT; and d) increasing concerns about the societal implications of the massive growth in access to and use of data about people’s lives. Each of these forces is described in more detail below.

Advances in IT. Modern computing power is greater than ever and continues to grow. With these advances, we have seen the growth of high-performance computers that process large volumes of data, inexpensive data storage, and easy access to data. Alongside these advances is an increased capacity for connectivity between institutions and agents, which allows for greater efficiency and the potential for enhanced decision making. These technological advances have also given larger segments of the population more convenient access to IT and have driven growth in the job market for IT-related positions.

Increases in data. Along with advances in IT, we have also seen growth in the amount of data that are, or can be, collected, stored, and accessed. That is, we are in the midst of an expansion in the variety of data, which have become extremely cheap to generate and collect through inexpensive storage, sensors, portable smart devices, social media apps, statistical software, multiplayer games, and the Internet of Things. This trend of increasing data has several implications, including:

  • The problem of too much data, but not enough analysts to know what to look for and perform the analytics (i.e., workforce shortage);
  • Increased access to data regarding people’s behavior (e.g., via wearable devices and social media);
  • Continued shortcomings in the completeness and reliability of data (e.g., as pertaining to health and behavior);
  • The emergence of new technology for storing and accessing large datasets, particularly the commercialization of storage services (e.g., cloud storage);
  • The difficulty of gleaning value from inconsistent, incomplete, and contradictory ontologies; and
  • The need to establish standard data specifications that produce machine-readable inputs by working with industry leaders, vendors, and government stakeholders.

Advances in software. Along with advances in IT and growth in the amount of data generated, we are also seeing massive “disruptions” to the ways that data management, search, query processing, and analytics solutions are designed, built, and deployed. The development of new methods and applications for analyzing and using data allows for a deeper understanding of data sets and the information they contain; for example, through new ways of combining diverse data sets, new techniques for data visualization and analysis, new methods for addressing the multi-dimensionality of data, new developments in artificial intelligence, and the increasing popularity of free, open-source software (e.g., R) for data analysis. Additionally, open source and open data make it possible to leverage advancements more rapidly through code-sharing practices (e.g., GitHub) and facilitate the replication and verification of findings.

Societal implications. The current environment for Analytics and Data Science is marked by tremendous and rapid growth, which has led many to raise important questions about how this growth may affect society. For example:

  • Are expectations that “big data” can solve important problems, and improve society, realistic?
  • How can we address concerns regarding security (including privacy) and ethics related to growth in the access to, and the new uses of, data and IT? How can we safeguard against unethical uses of new technology, such as artificial intelligence?
  • What must be done to foster collaborations, especially between the analytics fields (i.e., statistical sciences, computer sciences, and math) and the physical, social (including public health), and medical sciences?
  • With “big data”, how can we manage the risk of identifying spurious “signals” in the “noise”?
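
The last question above can be made concrete with a small simulation. The following is only an illustrative sketch, not part of the survey findings: it assumes purely random data, and the function names, sample sizes, and the threshold of 0.36 (roughly the conventional 5% significance cutoff for a correlation with 30 observations) are our own hypothetical choices. It counts how many meaningless variables appear to be "related" to an outcome simply because so many were examined.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def count_spurious(n_obs=30, n_features=1000, threshold=0.36, seed=0):
    """Count random features whose |correlation| with a random outcome
    exceeds the threshold. All values are noise (an assumption of this
    sketch), so every "hit" is spurious by construction."""
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_obs)]
    hits = 0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_obs)]
        if abs(pearson(feature, outcome)) > threshold:
            hits += 1
    return hits


if __name__ == "__main__":
    # Conventional cutoff: some noise variables still look "significant".
    print("hits at |r| > 0.36:", count_spurious())
    # A much stricter cutoff leaves far fewer false signals.
    print("hits at |r| > 0.70:", count_spurious(threshold=0.70))
```

With a conventional cutoff, a noticeable fraction of the 1,000 noise variables will typically pass the test, which is why analyses that screen many variables generally need stricter thresholds, hold-out data, or replication before a "signal" is trusted.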

2) In ten years, what forces will shape changes in the disciplines of Analytics and Data Science (for either the research or the relevant workforce communities)?
Three themes emerged from our assessment of the forces likely to shape the field of Analytics and Data Science over the next 10 years: a) methodological and technological concerns; b) education and workforce development concerns; and c) wider societal concerns. Each is discussed further below.

Methodological and technological concerns. The evolving globalization of society, and of societal problems, suggests that there will be an increasing need for large data systems that connect various populations, from countries to communities. Relatedly, we will need new statistical and computer science methods to analyze these large datasets and produce robust, useful results in a timely and efficient manner. Part of the answer might lie in the development of smart algorithms that can better find meaningful “needles” within the massive data haystacks of the future. However, we may also need to find better ways to scale up traditional methods and to include mixed methods (i.e., inclusion of qualitative data) in analytics. Ideally, in 10 years we will have examples of best practices for combining heterogeneous data across multiple sources, integrating real-time data, and extracting meaningful inferences; we may also have a better idea of the types of questions that can be answered using advanced data analytics and the types that still require more traditional approaches. Regardless, increased computational speed and power will be critical. Machine-learning techniques will accelerate work on data stewardship, data completeness, and data quality, promising improvements in analysis and outcomes.

Education and workforce development. In the next 10 years, academia will continue to play an important role in helping to address the need for data scientists. In addition to developing bachelor’s and master’s programs that specialize in Data Science, industry-backed fellowships and internships will be needed to provide students with “real world” experience. Moreover, the need for training in the computational physical and social sciences will continue to grow, as will demand in the digital humanities. As such, data science and analytics concepts and processes will need to be a part of upper-level (i.e., master’s or beyond) programs in many more fields. This broad-based growth in academic programs that incorporate training in analytics will better prepare the workforce of tomorrow to leverage the opportunities and address the challenges associated with big data. At the same time, there are concerns that the increase in autonomous interpretation and screening of data could lead to an atrophy of the human skills needed to examine and critique these data.

Societal concerns. As in the current environment, the imagined future of Analytics and Data Science raises important concerns about privacy, ethics, and the need for cross-disciplinary collaborations. For instance, with increases in the amount of data that need to be stored, questions about how we will handle security, privacy, and data “ownership” will become critical. In addition, questions of how individuals will interact with artificial intelligence will need to be addressed. Artificial intelligence may become a big driver of scientific discovery, for example, by identifying relevant data sets, isolating prominent features, suggesting questions to ask, and generating hypotheses, all in an attempt to make society more productive. Such “augmented intelligence” will enhance and scale human expertise, and may be able to assist people with well-defined tasks in a wide range of applications (e.g., the recent use of IBM Watson to help doctors make cancer diagnoses and treatment decisions).

3) What are the implications for the profession (researchers or professional practice)? What new opportunities may be created in the future?
These general trends in the field of Analytics and Data Science have several key implications for the profession, which are highlighted below.

Advances in research/methods. As shown above, the future holds many opportunities and needs for advanced methods for handling new data structures and larger volumes of data. These newer analytic techniques will help us to develop fresh insights that contribute to our understanding of the world and society as a whole.

Need for enhanced training and collaboration. In order to facilitate the development of new methods, we need to train a new generation of data scientists who will be able to take large volumes of heterogeneous data and build the infrastructure required to process them. This means that new academic departments/schools will need to be created to meet what is expected to be a growing demand for workers who are competent in Analytics and Data Science. More broadly, we need to ensure that graduates in other departments are also data literate as part of their information literacy and research competencies. Finally, we must foster environments in which researchers and professional practitioners can interact and challenge each other. Building strong communication channels will help break down barriers and lead to the faster translation of technology to practice.

Other implications for the profession include:

  • The divide between the “haves” and “have-nots” may deepen, since the ability to analyze large data sets requires access to substantial IT and other (e.g., staff) resources.
  • The widespread availability of data (public access to data sets is required by some federal funders and by an increasing number of journals) will present opportunities for researchers to generate pilot analyses without collecting new data.

4) How will future developments and opportunities affect the University at Albany? How might UAlbany respond to these within the strategic planning process?
Recommended responses for UAlbany include:

  • The creation of new academic programs, courses, and internship/training opportunities across campus in an integrated and strategic fashion.
    • Big Data Analytics and Data Science Programs at the undergraduate and graduate level
    • Health Analytics
    • Cross-disciplinary graduate programs and courses
    • Strategic partnerships with organizations that will hire graduates.
    • There may be opportunities for UAlbany in areas where we have made recent inroads, e.g., weather prediction important to local environments; addressing cultural, ethnic, and racial disparities; and development of technologies for RNA-based drugs and diagnostics.
  • Focus on strategic growth in data science in the existing Informatics Program in the College of Engineering and Applied Sciences by immediately expanding components in both undergraduate and graduate (master’s and PhD) programs, and conducting aggressive outreach to other disciplines to encourage domain-data science collaboration
  • Hiring of additional faculty and the utilization of current faculty to help develop new departments and courses
  • Provide seed funding for innovative research in Analytics and Data Science, and encourage faculty members to propose innovative research/teaching programs
  • Encourage and support collaborations among data scientists and other researchers and scholars within the university and SUNY system. This should be done both informally and formally (e.g., internal grant funds for pilot projects or collaborations)
  • Enhance University data storage, high performance computing and visualization hardware, software, and support staff
  • Foster a culture of innovation and entrepreneurship
  • Facilitate other types of research support, e.g., development of MOUs/data access templates, and enhanced privacy, ethics, and security training for research involving big data to help prevent misuse and security breaches.

Strategic Considerations

The emergence of data science in academia can be viewed from several potential operational perspectives. The first could be labeled “data science for the sake of data science”: its expressed mission is advancing basic research, and its audience is limited to like-minded data science researchers promoting their own approaches. This model has already been well established in a number of existing data science programs.

A second model focuses on embedding data scientists in a few forward-thinking disciplines (e.g., data scientists in wastewater recycling). These researchers are primarily focused on a particular domain-based application, with the expressed mission of capturing advancements within their domain. Because they tend to protect their own research trajectories, there are very limited opportunities for sharing and leveraging their particular data science approaches. While these arrangements can produce a few research “spikes”, they are vulnerable to the loss of one or more of these top data science/domain faculty, with a resulting loss in research capability.

A third model is tied tightly to commercial partners who are interested in financial gain, without any particular interest in an academic mission of scholarship or enrichment of the student experience. This model is vulnerable to the vagaries of commercial interest: any change in the corporate mission, or loss of interest under new management, could disrupt the research trajectory. This type of data science is also primarily proprietary, and some data science aspects of the research receive little or no dissemination because they are key to maintaining a commercial advantage.

A final model is an applied data science approach, with “open” labs (e.g., the Albany Visualization and Informatics Laboratory) and a primary mission of leveraging in-house expertise in data science techniques, and applying more general data science techniques, for use in numerous disciplines and research projects. This model requires, in many cases, collaboration with domain experts to guide the data science applications. The open lab operations rely heavily on graduate and undergraduate (even local high school) computer science students, who work on aspects of funded data science projects under the supervision of in-house data scientists. As “interns”, these students provide a flexible workforce while receiving hands-on training and lab experience. The variety of projects provides opportunities for cross-pollination of data science techniques and makes it possible to leverage existing code for new projects. For the most part, the lab uses an “open source/open data” approach, which contributes to the growing body of data science resources available for educational purposes and maximizes opportunities for students to apply what they are learning in other work environments or research opportunities.