We have witnessed explosive growth in the amount of data in the world. Recent studies estimate that the world's information is doubling every two years. Thanks to the rapid development and diffusion of digital information technologies, big data has reached critical mass in every sector of the economy. The rise of multimedia, social media, and the Internet of Things points toward a future of continued growth in data volumes.
In this data-abundant context, major questions arise: How can data be transformed into actionable knowledge? In other words, how can organizations take advantage of unprecedented amounts of data to drive information-based decisions? And how can the predictive power behind big data be unlocked?
The present article will discuss the role of data science in answering these questions. It will present the work of data scientists and their contribution to the development of competitive intelligence. It will further present the R&D tax credit opportunity available for innovative efforts aimed at advancing data science capabilities.
Enacted in 1981, the Federal Research and Development (R&D) Tax Credit allows a credit of up to 13 percent of eligible spending for new and improved products and processes. Qualified research must meet the following four criteria:
- New or improved products, processes, or software
- Technological in nature
- Elimination of uncertainty
- Process of experimentation
Eligible costs include employee wages, cost of supplies, cost of testing, contract research expenses, and costs associated with developing a patent. On December 18, 2015, President Obama signed the bill making the R&D Tax Credit permanent. Beginning in 2016, the credit can be used to offset the Alternative Minimum Tax, and startup businesses can apply it against payroll taxes.
Data science can be defined as the study of the generalizable extraction of knowledge from data. In other words, it is the systematic study of the organization, properties, and analysis of data and its role in inference.
However accurate, this definition fails to grasp the complexity surrounding data science. A more comprehensive understanding should begin with a two-fold approach. Firstly, one must assess the nature of data, its evolution, and the tools available to handle it. Secondly, one must look into the work of data scientists, their necessary skills, and the challenges they must face.
- Volume: Big data refers to volumes of information so massive that they cannot be processed using traditional database and software techniques.
- Velocity: Big data's high velocity refers to the accelerated pace at which data comes in and to the pressing need to analyze and utilize it quickly.
- Variety: Big data comes from a wide variety of sources. Despite the common tendency to associate the word data with numbers, much of the data in the world is non-numeric and unstructured. Simply put, instead of rows and columns filled with numbers, we now have text, video, images, etc.
The following chart shows expected volumes of unstructured and structured data from 2008 to 2015. The increasing prevalence of unstructured data seen in recent years is expected to intensify in the foreseeable future.
Figure 1. Projected growth of unstructured and structured data
Source: Vasant Dhar, Data Science and Prediction, Communications of the ACM, Dec. 2013, Vol. 56, No. 12.
This avalanche of data has called for innovative storage solutions and new processing tools. An iconic example is Apache Hadoop, an open-source software framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. Hadoop doesn't rely on expensive, proprietary hardware and separate systems to store and process data. Instead, it enables distributed parallel processing of huge amounts of data across inexpensive, industry-standard servers that both store and process the data. Its ability to scale out by simply adding servers means that virtually no data set is too big.
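The "simple programming model" at the heart of Hadoop is MapReduce: a map step emits key-value pairs, a shuffle step groups them by key, and a reduce step aggregates each group. The following is a minimal, single-machine Python sketch of that model using the classic word-count example; real Hadoop jobs distribute these same three phases across a cluster.

```python
from collections import defaultdict

def map_phase(lines):
    # Mapper: emit a (word, 1) pair for every word in every input line
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group values by key, as Hadoop does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data needs big tools", "data science uses data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["data"])  # 3
print(counts["big"])   # 2
```

Because each phase operates on independent keys, the mapper and reducer can run in parallel on many machines, which is what lets Hadoop scale to very large data sets.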
Hadoop is currently the de facto standard for storing, processing, and analyzing big data. It can handle all sorts of data from various systems, regardless of native format. It makes all data usable, unveiling hidden relationships and revealing previously unreachable answers. Initially created to store Yahoo's huge quantities of data at low cost, the platform is currently used by major players, such as Google, Microsoft, and IBM.
- Mathematics and Statistics: particularly probability, distributions, hypothesis testing, modeling, and multivariate analysis.
- Computer Science: understanding of how data is internally represented and manipulated by computers, database expertise, knowledge of "text mining" and markup languages, such as XML and its derivatives, and new skills related to cloud computing and artificial intelligence.
- Correlation and Causation: as the minds behind machine learning systems, data scientists must master the distinction between correlation and causation in order to assess the validity and desirability of models.
- Computational Thinking: data scientists must be able to formulate problems in ways that favor the system's chances of making accurate predictions. Good data scientists don't simply address given problems; they choose the problems that bear the most value to the organization. In other words, they understand the business context in which they operate.
- Communication: ability to communicate findings to both business and IT leaders and to influence decision-making.
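The correlation-versus-causation distinction above can be made concrete with a small simulation. In this hypothetical example, two variables (say, ice cream sales and drowning incidents) are both driven by a hidden confounder (temperature): they correlate strongly even though neither causes the other. A model built on this correlation would predict well yet mislead anyone who read it causally.

```python
import random

def pearson(xs, ys):
    # Sample Pearson correlation coefficient, computed from first principles
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(0)
# Hidden confounder: temperature drives both observed variables
temperature = [random.gauss(20, 5) for _ in range(1000)]
ice_cream = [t + random.gauss(0, 1) for t in temperature]
drownings = [t + random.gauss(0, 1) for t in temperature]

r = pearson(ice_cream, drownings)
print(round(r, 2))  # strong correlation despite no causal link
```

Distinguishing such spurious correlations from genuine causal relationships is precisely the judgment a data scientist must exercise when deciding whether a predictive model should inform a business decision.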
A 2011 McKinsey industry report points out that "analyzing large data sets will become a key basis of competition, underpinning new waves of productivity growth, innovation, and consumer surplus".
In fact, the data science revolution is poised to change the face of businesses, promoting a shift from intuition-based decision-making to fact-based decision-making. This means a complete transformation of managers' mind-sets, organizational structures, and existing business models.
In the era of big data, computers have often proven to be better decision makers than humans, offering cost-effectiveness, accuracy, and scalability. Data science unlocks this unprecedented predictive power, which is central to competitive business models.
Competitive intelligence, or the use of data to make strategic business decisions, increasingly determines which agents stand out in an ever more aggressive market. In this context, we may confidently say that data scientists are bound to lead the way in innovation and economic growth.
In October 2012, the Harvard Business Review called data scientist "the sexiest job of the 21st century." However, many have pointed to the scarcity of qualified professionals who combine the various skills data science requires. According to McKinsey, the U.S. is expected to face a shortage of up to 190,000 data scientists by 2020. Leading U.S. universities have launched data science efforts to meet this growing demand and close the talent gap.
Established in July 2012, the Institute for Data Sciences and Engineering at Columbia University aims to advance technologies that unlock the power of global data to solve society's most challenging problems. It offers both a Certification of Professional Achievement in Data Sciences and a Master of Science in Data Science. Efforts are divided into six domains: cybersecurity, financial analytics, foundations of data science, health analytics, new media, and smart cities. The Institute offers funding opportunities designed to support data-centric interdisciplinary research, such as the annual ROADS.
NYC is also home to the ambitious "Applied Sciences NYC" program, which includes the construction of a two million square foot Cornell Tech campus on Roosevelt Island. With the first opening expected for 2017, the campus aims to be the "staging ground for what's next". Data science should undoubtedly be at the heart of this upcoming innovation hub.
In November 2013, New York University, the University of California, Berkeley, and the University of Washington launched a 5-year, $37.8 million, cross-institutional data science effort. With support from the Gordon and Betty Moore Foundation and the Alfred P. Sloan Foundation the program aims to harness the potential of data scientists and big data for basic research and scientific discovery. It hopes to establish models that will dramatically accelerate the data science revolution and capture the full potential of a progressively data-rich world.
In an unprecedented partnership with the State of Ohio, IBM recently launched its first dedicated advanced analytics center in Columbus, OH - the IBM Client Center Analytics Solutions Lab. Home to cutting-edge analytics, big data, and cognitive computing research, the center focuses on responding to IBM customers' analytics challenges.
The company is working hand-in-hand with The Ohio State University to address the need for highly skilled analytics professionals. The efforts include the development of new business and technology curricula at the undergraduate, graduate, and executive education levels. IBM is providing the University with curriculum materials, relevant case studies, and access to software solutions, IBM guest speakers, and faculty awards aimed at accelerating program development.
Other universities involved in data science efforts include Stanford, Northwestern, George Mason, Syracuse, University of California at Irvine, University of San Francisco, University of Virginia, North Carolina State, and Indiana University.
The academic interest in data science is not restricted to the U.S. KPMG recently announced its sponsorship of the first Data Science Summer School in the UK, organized by Pivigo, a London-based recruitment firm. The pioneering project consists of a 5-week intensive course aimed at transforming 100 of Europe's brightest PhD students into data scientists.
According to the International Data Corporation, the big data and analytics technology market totaled $113 billion in 2013 and is expected to grow at a compound annual rate of 11% over the next five years. Recent funding rounds demonstrate investors' increasing interest in the data science business.
California-based Platfora provides software solutions that help customers find their path to fact-based decision-making. Founded in 2011, the company has worked on making the capabilities of Hadoop accessible to ordinary business users. It has recently raised $38 million, increasing its total funding to $65 million.
Hortonworks, a startup spun out of Yahoo, builds on, manages, and implements Apache Hadoop. The company creates its own versions of the open-source platform, designs standard implementation processes, assists integration with other applications, and provides support for IT staff. Its engineers are engaged in innovative efforts to improve performance and capabilities of Hadoop in areas such as cluster operations, security, governance, and integration. Hortonworks recently raised $100 million at a valuation of more than $1 billion as it prepares to go public in 2015.
In the biggest fundraising by a U.S. tech company since last November, the six-year-old Silicon Valley big data startup Cloudera has recently raised $900 million in a private share sale, at a valuation of $4.1 billion. The deal makes Cloudera the largest purveyor of Hadoop by far and promises to increase its international presence. Most of the investment came from Intel, which now holds an 18 percent share of the company. The world's largest chipmaker anticipates that Hadoop will become the biggest application to run on servers based on its chips; hence its interest in working with Cloudera to ensure its products are optimized for Intel hardware.
Innovation is at the heart of the burgeoning data science industry. The following paragraphs exemplify recent efforts to develop new and improved ways to apply data science to the benefit of business strategies.
Pennsylvania-based QlikTech is the provider of the QlikView Business Discovery Platform. This groundbreaking self-service business intelligence software helps organizations make transformative discoveries by enabling users to analyze and search data visualizations, make associations, and uncover insights that lead to better decision-making. Capable of blending data from different sources, including Hadoop-based ones, QlikView gives users real-time visibility into their business environment.
MapR Technologies, provider of a popular distribution of Apache Hadoop, recently announced a partnership with Elasticsearch, an open-source search and analytics solution. The Elasticsearch-Hadoop combination results in a scalable, distributed architecture that enables search and discovery across tremendous amounts of information in real time. This innovative combined solution allows users to ask better questions and get clearer, faster answers.
Also at the forefront of data science innovation is Pivotal, a joint venture of EMC and VMware in which General Electric has a stake. The big data and cloud computing company aims to revolutionize the economics of data science applications. To this end, it has recently introduced its "Big Data Suite", a bundle of big data, cloud-style software, support, and maintenance with simplified pricing.
By providing a competitive, easy way to buy not only Hadoop but all the important layers on top of it, Pivotal wants to establish itself as a one-stop shop. Unlike vendors who gain incremental licensing revenue as data and Hadoop clusters grow, the "Pivotal Business Data Lake" offers unlimited storage, taking such costs off the table.
It is still early days for big data technology in the corporate mainstream. Research firm Gartner estimates the worldwide number of Hadoop paying customers at only about 1,000. However, the recent round of massive investments in data-centric startups underlines the vital importance of this rapidly growing industry, as data science becomes the new basis of competition. Federal R&D tax credits are available to support innovative efforts aimed at creating new and improved tools for fact-based decision-making.
Charles R. Goulding, Attorney/CPA, is the President of R&D Tax Savers.
Jacob Goldman is the VP of Operations at R&D Tax Savers.
Andressa Bonafé is a Tax Analyst with R&D Tax Savers.