The R&D Tax Aspects of Data Science

By Charles R Goulding, Jacob Goldman, and Andressa Bonafé

        We have witnessed explosive growth in the amount of data in the world. Recent studies estimate that the world's information is doubling every two years. Thanks to the rapid development and diffusion of digital information technologies, big data has reached critical mass in every sector of the economy. The rise of multimedia, social media, and the Internet of Things points toward a future of continued growth in data volumes.

        In this data-abundant context, major questions arise. How can data be transformed into actionable knowledge? In other words, how can unprecedented amounts of data be used to enable information-driven decisions? And how can the predictive power behind big data be unlocked?

        The present article will discuss the role of data science in answering these questions. It will present the work of data scientists and their contribution to the development of competitive intelligence. It will further present the R&D tax credit opportunity available for innovative efforts aimed at advancing data science capabilities.

The Research & Development Tax Credit

        Enacted in 1981, the Federal Research and Development (R&D) Tax Credit allows a credit of up to 13 percent of eligible spending for new and improved products and processes. Qualified research must meet the following four criteria:

  • New or improved products, processes, or software
  • Technological in nature
  • Elimination of uncertainty
  • Process of experimentation

        Eligible costs include employee wages, cost of supplies, cost of testing, contract research expenses, and costs associated with developing a patent. On December 18, 2015, President Obama signed the bill making the R&D Tax Credit permanent. Beginning in 2016, the R&D credit can be used to offset the Alternative Minimum Tax, and startup businesses can utilize the credit against payroll taxes.
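
        As a rough illustration of the arithmetic, the sketch below totals the eligible cost categories named above and applies the "up to 13 percent" figure as a ceiling. The dollar amounts are hypothetical, and the function name is our own. An actual credit depends on the calculation method elected and on statutory limits that this sketch ignores.

```python
# Hypothetical figures for illustration only; none come from the article.
eligible_costs = {
    "employee_wages": 400_000,
    "supplies": 50_000,
    "testing": 30_000,
    "contract_research": 100_000,
    "patent_development": 20_000,
}

MAX_CREDIT_RATE = 0.13  # "up to 13 percent", as stated in the article


def credit_upper_bound(costs, rate=MAX_CREDIT_RATE):
    """Return a ceiling on the credit: rate times total eligible spending."""
    return rate * sum(costs.values())


total = sum(eligible_costs.values())
print(f"Total eligible spending: ${total:,.0f}")
print(f"Credit ceiling: ${credit_upper_bound(eligible_costs):,.0f}")
```

        With these assumed inputs, eligible spending totals $600,000, so the ceiling is $78,000.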

Understanding Data Science

        Data science can be defined as the study of the generalizable extraction of knowledge from data. In other words, it is the systematic study of the organization, properties, and analysis of data and its role in inference.

        However accurate, this definition fails to grasp the complexity surrounding data science. A more comprehensive understanding should begin with a two-fold approach. Firstly, one must assess the nature of data, its evolution, and the tools available to handle it. Secondly, one must look into the work of data scientists, their necessary skills, and the challenges they must face.

  1. The Nature of Data
    Gartner defines big data as "high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision-making."

    The three Vs of big data can be briefly summarized as follows:
  • Volume: Big data refers to volumes of information so large that they cannot be processed using traditional database and software techniques.
  • Velocity: Big data's high velocity refers to the accelerated pace at which data arrives and to the pressing need to analyze and use it quickly.
  • Variety: Big data comes from a wide variety of sources. Despite the common tendency to associate the word data with numbers, much of the data in the world is non-numeric and unstructured. Simply put, instead of rows and columns filled with numbers, we now have texts, videos, images, etc.
The following chart features the expected volumes of unstructured and structured data from 2008 to 2015. The increasing prevalence of unstructured data observed in recent years should intensify in the foreseeable future.

Figure 1. Projected growth of unstructured and structured data

Source: Vasant Dhar, Data Science and Prediction, Communications of the ACM, Dec. 2013, Vol. 56, No. 12.

This avalanche of data has called for innovative storage solutions and new processing tools. An iconic example is Apache Hadoop, an open-source software framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. Hadoop doesn't rely on expensive, proprietary hardware or on separate systems to store and process data. Instead, it enables distributed parallel processing of huge amounts of data across inexpensive, industry-standard servers that both store and process the data. Because capacity grows by adding servers, Hadoop clusters can scale to very large data sets.
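
One programming model closely associated with Hadoop is MapReduce. The following single-machine sketch in plain Python illustrates the idea only; it does not use Hadoop's API, and the function names and sample text are our own assumptions. Each chunk is processed independently in a map step, the results are grouped by key, and a reduce step combines them.

```python
from collections import defaultdict


def map_phase(chunk):
    """Emit (word, 1) pairs for one chunk of text, as a mapper would."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key, as a reducer would."""
    return {key: sum(values) for key, values in grouped.items()}


# Each string stands in for a block of data stored on a different server.
chunks = ["big data big returns", "data science and big data"]

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```

Because the map step touches each chunk independently, a real cluster can run it on the servers where the data already resides and move only the much smaller intermediate results.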


Hadoop is currently the de facto standard for storing, processing, and analyzing big data. It can handle all sorts of data from various systems, regardless of their native format. It makes all data usable, thereby unveiling hidden relationships and revealing previously unreachable answers. Initially created to store Yahoo's huge quantities of data at low cost, the platform is currently used by major players, such as Google, Microsoft, and IBM.

  2. The Work of Data Scientists
    Big data has fundamentally transformed the role of data scientists. While traditional database methods were designed to access and summarize data in order to answer a pre-established query, new methods aim to find patterns in massive volumes of data when users lack well-formulated inquiries.

    When data is large and heterogeneous, it is practically impossible to know what queries would lead to interesting and actionable insights. Put simply, in the current world of big data, the challenge facing data scientists is to find interesting patterns that satisfy the data (and not data that satisfies a pattern).

    With this objective in mind, data scientists rely largely on machine learning, that is, artificial intelligence systems that can learn from data and detect subtle structures in information. Suitably designed machine learning algorithms are crucial to the identification of predictive patterns, meaning patterns that are likely to hold up in future cases.
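
    One common way to check whether a pattern is likely to hold up in future cases is to fit a model on part of the data and measure its error on cases it has not seen. The sketch below does this with a simple least-squares line in plain Python. The synthetic data, the 150/50 split, and the function names are assumptions made for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Synthetic data (an assumption): y depends linearly on x, plus noise.
xs = [random.uniform(0, 10) for _ in range(200)]
data = [(x, 3.0 * x + 2.0 + random.gauss(0, 1)) for x in xs]

train, test = data[:150], data[150:]  # hold out 50 unseen cases


def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(points, slope, intercept):
    """Average squared gap between observed and predicted y."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points) / len(points)


slope, intercept = fit_line(train)
train_mse = mean_squared_error(train, slope, intercept)
test_mse = mean_squared_error(test, slope, intercept)
print(f"slope={slope:.2f} intercept={intercept:.2f}")
print(f"train MSE={train_mse:.2f} test MSE={test_mse:.2f}")
```

    A test error close to the training error suggests the fitted pattern generalizes; a much larger test error would suggest the model has merely memorized the training cases.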

    Machine learning is increasingly central to data scientists' quest to extract knowledge from data. It allows the construction of automated decision-making systems built upon predictive accuracy. Proficiency in machine learning, however, requires a broad set of skills, which reflects the complexity of data scientists' work. These skills include:
  • Mathematics and Statistics: particularly probability, distributions, hypothesis testing, modeling, and multivariate analysis.
  • Computer Science: understanding of how data is internally represented and manipulated by computers, database expertise, knowledge of "text mining" and markup languages, such as XML and its derivatives, and new skills related to cloud computing and artificial intelligence.
  • Correlation and Causation: being the minds behind machine learning, data scientists must master the distinction between correlation and causality in order to assess the desirability of models.
  • Computational Thinking: data scientists must be able to formulate problems in ways that favor the system's chances of making accurate predictions. Good data scientists don't simply address given problems, but choose the right problems that bear the most value to the organization. In other words, they understand the context in which they operate.
  • Communication: ability to communicate findings to both business and IT leaders and to influence decision-making.
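
    The distinction between correlation and causation listed above can be illustrated with a small simulation. In this sketch, built entirely on assumed synthetic data, a hidden factor z drives both x and y. The two end up strongly correlated even though neither influences the other, and the correlation largely disappears once z is accounted for.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


# A hidden common cause z drives both x and y; x has no effect on y.
z = [random.gauss(0, 1) for _ in range(2000)]
x = [zi + random.gauss(0, 0.5) for zi in z]
y = [zi + random.gauss(0, 0.5) for zi in z]

raw_corr = pearson(x, y)

# Subtract z's contribution (its coefficient is 1 by construction here),
# then correlate what is left of x and y.
x_resid = [xi - zi for xi, zi in zip(x, z)]
y_resid = [yi - zi for yi, zi in zip(y, z)]
adjusted_corr = pearson(x_resid, y_resid)

print(f"correlation of x and y: {raw_corr:.2f}")
print(f"after accounting for z: {adjusted_corr:.2f}")
```

    A model that treated x as a lever for changing y would fail here, which is why a strong correlation alone says little about whether a model is desirable.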

        Anjul Bhambhri, vice president of big data products at IBM, has described data scientists as "part analysts, part artists". These highly inquisitive professionals are able not only to stare at data and spot trends, but also to learn and bring change to organizations, transforming data into business value.

Data Science and Competitive Intelligence

        A 2011 McKinsey industry report points out that "analyzing large data sets will become a key basis of competition, underpinning new waves of productivity growth, innovation, and consumer surplus".

        In fact, the data science revolution is poised to change the face of businesses, promoting a shift from intuition-based decision-making to fact-based decision-making. This means a complete transformation of managers' mind-sets, organizational structures, and existing business models.

        In the era of big data, computers have in many settings proven better decision makers than humans, offering cost-effectiveness, accuracy, and scalability. Data science unlocks this predictive power, which is central to competitive business models.

        Competitive intelligence, or the use of data to make strategic business decisions, increasingly determines which agents stand out in an ever more aggressive market. In this context, we may confidently say that data scientists are bound to lead the way in innovation and economic growth.

Data Science Academic Programs

        In October 2012, the Harvard Business Review called data scientist "the sexiest job of the 21st century." However, many have pointed to the scarcity of qualified professionals combining the various skills necessary to work with data science. According to McKinsey, the U.S. is expected to have a shortage of 190,000 data scientists by 2020. Important U.S. universities have engaged in data science efforts, as a means to fulfill this growing demand and plug the talent gap that currently exists.

        Established in July 2012, the Institute for Data Sciences and Engineering at Columbia University aims to advance technologies that unlock the power of global data to solve society's most challenging problems. It offers both a Certification of Professional Achievement in Data Sciences and a Master of Science in Data Science. Efforts are divided into six domains: cybersecurity, financial analytics, foundations of data science, health analytics, new media, and smart cities. The Institute offers funding opportunities designed to support data-centric interdisciplinary research, such as the annual ROADS.

        NYC is also home to the ambitious "Applied Sciences NYC" program, which includes the construction of a two million square foot Cornell Tech campus on Roosevelt Island. With the first opening expected for 2017, the campus aims to be the "staging ground for what's next". Data science should undoubtedly be at the heart of this upcoming innovation hub.

        In November 2013, New York University, the University of California, Berkeley, and the University of Washington launched a 5-year, $37.8 million, cross-institutional data science effort. With support from the Gordon and Betty Moore Foundation and the Alfred P. Sloan Foundation, the program aims to harness the potential of data scientists and big data for basic research and scientific discovery. It hopes to establish models that will dramatically accelerate the data science revolution and capture the full potential of a progressively data-rich world.

        In an unprecedented partnership with the State of Ohio, IBM recently launched its first dedicated advanced analytics center in Columbus, OH - the IBM Client Center Analytics Solutions Lab. Home to cutting-edge analytics, big data, and cognitive computing research, the center focuses on responding to IBM customers' analytics challenges.

        The company is working hand-in-hand with Ohio State University to address the need for highly skilled analytics professionals. The efforts include the development of new business and technology curricula at the undergraduate, graduate, and executive education levels. IBM is providing the University with curriculum materials, relevant case studies, and access to software solutions, IBM guest speakers, and faculty awards aimed at accelerating program development.

        Other universities involved in data science efforts include Stanford, Northwestern, George Mason, Syracuse, University of California at Irvine, University of San Francisco, University of Virginia, North Carolina State, and Indiana University.

        The academic interest in data science is not restricted to the U.S. KPMG recently announced its sponsorship of the first Data Science Summer School in the UK, organized by Pivigo, a London-based recruitment firm. The pioneering project consists of a 5-week intensive course aimed at transforming 100 of Europe's brightest PhD students into data scientists.

Big Data, Big Returns

        According to the International Data Corporation, the big data and analytics technology market totaled $113 billion in 2013 and is expected to grow at a compound annual rate of 11 percent over the next five years. Recent funding rounds demonstrate investors' increasing interest in the data science business.
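
        To make the growth figure concrete, the short sketch below compounds the $113 billion base at 11 percent a year for five years. The function name is our own; the two inputs are the figures quoted above.

```python
def project(value, annual_rate, years):
    """Compound a starting value forward at a fixed annual growth rate."""
    return value * (1 + annual_rate) ** years


base_2013 = 113.0  # market size in billions of dollars, from the article
rate = 0.11        # compound annual growth rate, from the article

for year in range(6):
    print(2013 + year, f"${project(base_2013, rate, year):.1f}B")
```

        At that rate the market would reach roughly $190 billion after five years, about 1.7 times its 2013 size.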

        California-based Platfora provides software solutions that help customers find their paths to fact-based decision-making. Founded in 2011, the company has worked on making the capabilities of Hadoop accessible to ordinary business users. It has recently raised $38 million, increasing its total funding to $65 million.

        Hortonworks, a startup spun out of Yahoo, builds on, manages, and implements Apache Hadoop. The company creates its own versions of the open-source platform, designs standard implementation processes, assists integration with other applications, and provides support for IT staff. Its engineers are engaged in innovative efforts to improve performance and capabilities of Hadoop in areas such as cluster operations, security, governance, and integration. Hortonworks recently raised $100 million at a valuation of more than $1 billion as it prepares to go public in 2015.

        In the biggest fundraising by a U.S. tech company since last November, the six-year-old Silicon Valley big data startup Cloudera has recently raised $900 million in a private share sale, at a valuation of $4.1 billion. The deal makes Cloudera the largest purveyor of Hadoop by far and promises to increase its international presence. Most of the investment came from Intel, which now holds an 18 percent share of the company. The world's largest chipmaker anticipates that Hadoop will become the biggest application to run on servers based on its chips, hence its interest in working with Cloudera to make sure its products are optimized to run on Intel hardware.

Data Science Corporate Innovation

        Innovation is at the heart of the burgeoning data science industry. The following paragraphs exemplify recent efforts to develop new and improved ways to apply data science to the benefit of business strategies.

        Pennsylvania-based QlikTech is the provider of the QlikView Business Discovery Platform. This self-service business intelligence software helps organizations make transformative discoveries by enabling users to analyze and search data visualizations, make associations, and uncover insights that lead to better decision-making. Capable of blending data from different sources, including Hadoop-based ones, QlikView gives users real-time visibility of their business environment.

        MapR Technologies, provider of a popular distribution of Apache Hadoop, recently announced a partnership with Elasticsearch, an open-source search and analytics solution. The Elasticsearch-Hadoop combination results in a scalable, distributed architecture that enables search and discovery across tremendous amounts of information in real time. This innovative combined solution allows users to ask better questions and get clearer, faster answers.

        Also at the forefront of data science innovation is Pivotal, a joint venture of EMC and VMware in which General Electric has a stake. The big data and cloud computing company aims to revolutionize the economics of data science applications. To this end, it has recently introduced its "Big Data Suite", a bundle of big data, cloud-style software, support, and maintenance with simplified pricing.

        By providing a competitive, easy way to buy not only Hadoop, but all the important layers on top of it, Pivotal wants to establish itself as a one-stop shop. Unlike vendors who gain incremental licensing revenue from data growth and Hadoop cluster growth, the "Pivotal Business Data Lake" offers unlimited storage, taking such costs off the table.


        It is still early days for big data technology in the corporate mainstream. Research firm Gartner estimates the worldwide number of Hadoop paying customers at only about 1,000. However, the recent round of massive investments in data-centric startups underlines the vital importance of this rapidly growing industry, as data science becomes the new basis of competition. Federal R&D tax credits are available to support innovative efforts aimed at creating new and improved tools for fact-based decision-making.


Charles R Goulding, Attorney/CPA, is the President of R&D Tax Savers.

Jacob Goldman is the VP of Operations at R&D Tax Savers.

Andressa Bonafé is a Tax Analyst with R&D Tax Savers.
