The R&D Tax Credit Aspects of AI Vision Technology

By Charles R Goulding, Andressa Bonafé, and Tricia Genova

There is no doubt about it: it is an increasingly visual world. In a time when everyone has a camera in their pockets, access to visual data has reached unprecedented levels. Recent technological developments promise to use the wealth of visual information currently available to enable groundbreaking applications. Deep-learning techniques are revolutionizing image processing as machine vision moves beyond a rules-based, linear approach and into the new era of artificial intelligence. The present article will discuss the revolutionary power of artificially intelligent vision systems, highlighting current trends and challenges ahead. It will also present the R&D tax credit opportunity available to support companies engaged in visual intelligence innovation.

The Research & Development Tax Credit

Enacted in 1981, the now permanent Federal Research and Development (R&D) Tax Credit allows a credit that typically ranges from 4%-7% of eligible spending for new and improved products and processes. Qualified research must meet the following four criteria:

Must be technological in nature
Must be a component of the taxpayer's business
Must represent R&D in the experimental sense and generally includes all such costs related to the development or improvement of a product or process
Must eliminate uncertainty through a process of experimentation that considers one or more alternatives

Eligible costs include U.S. employee wages, cost of supplies consumed in the R&D process, cost of pre-production testing, U.S. contract research expenses, and certain costs associated with developing a patent.

On December 18, 2015, President Obama signed the PATH Act, making the R&D Tax Credit permanent. Beginning in 2016, the R&D credit can be used to offset the Alternative Minimum Tax for companies with revenue below $50MM and, for the first time, pre-profitable and pre-revenue startup businesses can utilize the credit against up to $250,000 per year in payroll taxes.
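
As a rough illustration of the figures above, the arithmetic can be sketched in Python. The function names, cost categories, and sample amounts are hypothetical; only the 4% and 7% rates and the $250,000 payroll-tax figure come from the text. Actual credit computations depend on the calculation method elected and on the taxpayer's facts, so this is a back-of-the-envelope sketch rather than a tax calculation.

```python
# Back-of-the-envelope sketch; names and sample amounts are hypothetical.
TYPICAL_RATE_LOW = 0.04       # lower end of the typical range cited above
TYPICAL_RATE_HIGH = 0.07      # upper end of the typical range cited above
PAYROLL_OFFSET_CAP = 250_000  # annual payroll-tax offset cited above


def estimate_credit_range(eligible_costs):
    """Return (low, high) credit estimates for a dict of eligible costs."""
    total = sum(eligible_costs.values())
    return total * TYPICAL_RATE_LOW, total * TYPICAL_RATE_HIGH


def payroll_offset(credit):
    """Portion of a credit usable against payroll taxes in one year."""
    return min(credit, PAYROLL_OFFSET_CAP)


# Hypothetical eligible spending for one year, in dollars.
costs = {
    "us_wages": 1_200_000,
    "supplies": 150_000,
    "pre_production_testing": 250_000,
}
low, high = estimate_credit_range(costs)
print(f"Estimated credit: ${low:,.0f} to ${high:,.0f}")
print(f"Usable against payroll taxes: ${payroll_offset(high):,.0f}")
```

With these sample amounts, the estimate falls well under the payroll-tax ceiling, so a qualifying startup could apply the full amount.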

45 Billion Digital Eyes

An August 2017 study by LDV Capital predicts that the number of cameras in the world will at least triple by 2022, adding up to a staggering 45 billion. This massive surge in visual-based technology will come from the emergence of new “camera-hungry” products, such as autonomous cars and augmented reality glasses, as well as from the addition of new functionalities to already widespread devices. For instance, LDV predicts that, in five years, smartphones could have as many as thirteen cameras, which would allow for 360-degree, 3D video making as well as augmented reality images. Smart home products and security systems will also contribute to the increasing number of cameras, which is made possible by a steep drop in unit prices.

Vision-based artificial intelligence (AI) will undoubtedly be a major trend in the new era of all-seeing digital eyes. Access to massive visual data combined with groundbreaking machine-learning techniques will allow algorithms to learn and evolve, becoming the basis of a growing number of innovative AI services.

Neural Networks and Beyond

        For years, images were just too complex for algorithms to reliably work on. The uniqueness of millions of pixels combined into singular patterns seemed to be an unsolvable riddle, which was beyond the capacity of computer-vision technology. However, improvements in processing power combined with access to large amounts of visual data allowed for the emergence of deep-learning techniques, which inaugurated a new era in image processing.

        Inspired by our brains’ interconnected neurons, the mathematical functions of so-called deep neural networks perform remarkably well when working with complex images. The algorithms are able to learn discrete tasks by spotting patterns in large sets of data. For instance, by analyzing thousands of dog photos, they learn to recognize a dog. University of Washington’s MegaFace Project is an interesting example of recent advancements in neural network image processing. When asked to match two images of the same person among 1 million face images, the system achieved 75 percent accuracy in first-time guesses and more than 90 percent when 10 options were allowed.
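
The distinction between a first-time guess and a guess with 10 options corresponds to what is commonly called rank-1 and rank-10 identification accuracy. The sketch below shows one way such a metric could be computed. It is not the MegaFace evaluation code; the function name and the toy similarity scores are assumptions made for illustration.

```python
# Minimal sketch of rank-k identification accuracy; all data are toy values.

def rank_k_accuracy(score_rows, true_indices, k):
    """Fraction of probes whose true match is among the k highest scores.

    score_rows[i][j] is the similarity between probe i and gallery image j;
    true_indices[i] is the gallery index of probe i's correct match.
    """
    hits = 0
    for scores, truth in zip(score_rows, true_indices):
        # Gallery indices sorted from most to least similar.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(score_rows)


scores = [
    [0.9, 0.2, 0.1, 0.4],  # correct match (index 0) is ranked first
    [0.3, 0.5, 0.8, 0.1],  # correct match (index 1) is ranked second
    [0.6, 0.1, 0.2, 0.7],  # correct match (index 2) is ranked third
]
truth = [0, 1, 2]
print(rank_k_accuracy(scores, truth, k=1))
print(rank_k_accuracy(scores, truth, k=2))
```

Allowing more candidates can only raise the score, which is why the 10-option figure is higher than the first-guess figure.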

        In addition to achieving greater accuracy than traditional computer vision, deep-learning techniques offer superior versatility. Their algorithms are less purpose-specific and their frameworks can be reused across various cases. As neural networks grow in complexity and scale, the range of vision-based AI services widens considerably. However, there is still a long way to go on the path towards truly intelligent vision systems. Some specialists believe that neural networks and related techniques will soon be considered small advances when compared to innovation that is yet to come.

The Issue of Generalization
        Despite considerable improvements in image-processing performance, generalization remains an issue to be tackled. In most vision-based AI systems, object identification depends on the angle being portrayed. In other words, there is a recurrent inability to recognize familiar objects seen from unfamiliar angles. Aiming to overcome the shortcomings of existing systems, Geoffrey Hinton, a pioneer of the neural network approach and a Google employee, is now proposing a new mathematical technique called a capsule network. The idea is to make vision systems more human-like, allowing them to “see” not only in two dimensions, as happens in conventional neural networks, but in three. Equipped with a three-dimensional perspective, AI systems would be able to accurately recognize familiar objects from any angle and thus allow for considerably better generalization.

The Issue of Motion
        Despite the remarkable advances made so far, the ability to correctly identify dynamic activities also remains a challenge. Most AI video-processing applications do not interpret actions but rather rely on recognizing objects in static frames. This is the case with the recently launched Google Cloud Video Intelligence, a machine-learning application programming interface designed to detect and classify objects in videos.

        Though enabling major gains in productivity, particularly when it comes to searching through vast libraries of video content, the reliance on static frames is a limitation. Truly intelligent video processing must be able to go beyond the identification of video content and into what is actually happening on screen. Aiming to advance towards this goal, the Massachusetts Institute of Technology and IBM released, in December 2017, the Moments in Time Dataset, a major compilation of videos annotated with details of the activities being performed. This is the most recent of various efforts to increase access to tagged videos, which included Google’s release of 8 million YouTube videos and Facebook’s ongoing Scenes, Actions, and Objects project. Temporal context and transfer learning are two of the main research focuses moving forward.

The Issue of Capabilities
        Over recent years, significant improvements in device capabilities, including computing power, memory capacity, power consumption, image sensor resolution, and optics, have enhanced the performance and cost-effectiveness of computer vision. Even so, further accuracy gains will require enormous amounts of computing resources in both the training and inference stages. Raj Talluri, senior VP of product management for Qualcomm Technologies, points out that “going from 75% to 80% accuracy in a vision-based application could require nothing less than billions of additional math operations.” He further underlines that vision-processing results are dependent on image resolution, which is a particularly crucial aspect in applications designed to detect and classify objects in the distance, such as security cameras. Higher resolution means an increase in the amount of data being processed, stored, and transferred.

        Aiming to overcome existing limitations of capability, Qualcomm Technologies is pioneering a hybrid vision-processing implementation. The idea is to combine classic computer vision algorithms – considered “mature, proven, and optimized for performance and power efficiency” – with accurate and versatile deep-learning techniques. Security cameras, for instance, could rely on computer vision to detect faces or objects and apply deep learning to process only the smaller segment of the image in which the face or object was detected. The hybrid implementation uses about half of the memory bandwidth and requires significantly lower CPU resources than pure deep-learning solutions.
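
The two-stage idea described above can be sketched as follows. This is not Qualcomm's implementation: the brightness threshold stands in for a classic computer vision detector, and `expensive_model` is a placeholder that only counts the pixels it would have to process. All names and the toy frame are assumptions made for illustration.

```python
# Sketch of a hybrid pipeline: a cheap classical step proposes a region, and
# the costly model runs only on that crop. Both steps are stand-ins.

def find_region(image, threshold):
    """Classical stand-in: bounding box of pixels brighter than threshold.

    Returns (top, left, bottom, right) with exclusive bottom/right,
    or None if no pixel exceeds the threshold.
    """
    rows = [r for r, row in enumerate(image) if any(p > threshold for p in row)]
    cols = [c for c in range(len(image[0]))
            if any(row[c] > threshold for row in image)]
    if not rows:
        return None
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1


def crop(image, box):
    """Return the sub-image covered by the bounding box."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


def expensive_model(pixels):
    """Placeholder for a deep-learning classifier; reports pixels processed."""
    return sum(len(row) for row in pixels)


# 6x8 toy frame containing a bright 2x3 "object".
frame = [[0] * 8 for _ in range(6)]
for r in (2, 3):
    for c in (4, 5, 6):
        frame[r][c] = 255

box = find_region(frame, threshold=128)
full_cost = expensive_model(frame)
hybrid_cost = expensive_model(crop(frame, box)) if box else 0
print(box, full_cost, hybrid_cost)
```

In this toy frame the deep-learning stage sees 6 pixels instead of 48, which illustrates where the savings in bandwidth and compute would come from.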

        Innovative compute architecture can also contribute to greater processing performance and power efficiency. For instance, executing deep-learning inferences on a DSP can yield considerable latency reductions in object detection when compared to a CPU. Similarly, edge computing, or running algorithms and neural networks on the device itself, can also help lower latency and bandwidth requirements while offering greater privacy and security as compared to cloud-based implementations.

Public Safety and Surveillance Applications

        Recent technological advances have shed light on the possibility of using advanced software to overcome human limitations and supplement human judgment in public safety and surveillance applications. Artificially intelligent vision systems can enable unprecedented levels of detail and personalization in addition to making security and law-enforcement work considerably more efficient.

        Facial recognition is perhaps the most widespread application of vision-based AI so far. The New York Department of Motor Vehicles recently announced that special facial-recognition technology was used in the arrest of more than 4,000 people charged with identity theft or fraud since 2010. The software, which compares new driver’s license application photos to images in a database, illustrates how law enforcement and public safety can benefit from AI innovation.

        Police body cameras are also a potential field for visual AI applications. Axon, formerly Taser International and headquartered in Scottsdale, Arizona, has signaled its intention to incorporate AI into its products. The largest distributor of police body cameras in the U.S. acquired two AI companies in early 2017, envisioning ambitious new functionalities, which include an automated system for police reports. Axon CEO Rick Smith points out that the ability to process video and audio will allow police to spend more time doing their jobs rather than performing menial tasks, such as note-taking and report-writing.

        Another important supplier of body-worn cameras, Motorola, is working with deep-learning startup Neurala to integrate AI capabilities that will help police officers in their search for objects and persons of interest. These groundbreaking applications are expected to significantly reduce the time and effort necessary to perform recurring tasks, such as finding a missing child or identifying a suspicious object. Based in Boston, Massachusetts, Neurala has developed “at the edge” learning capabilities that enable real-time applications of AI. Its patent-pending technology differs from traditional learning processes that require lengthy training for the AI engine. Built upon an incremental learning logic, the Lifelong Deep Neural Network (L-DNN) improves accuracy and latency and eliminates the risk of “catastrophic forgetting”, the most significant limitation to real-time AI so far.
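
To make the idea of incremental learning more concrete, the sketch below uses a simple nearest-class-mean classifier. It is not Neurala's L-DNN, whose internals are not described here; it only illustrates the general principle that a new class can be added from a few examples without retraining, and therefore without overwriting what was learned earlier. The class name, labels, and two-dimensional feature vectors are hypothetical.

```python
# Sketch of incremental learning with a nearest-class-mean classifier.
# This is NOT Neurala's L-DNN; it only illustrates adding classes over time
# without retraining on, or disturbing, previously learned classes.

class IncrementalClassifier:
    def __init__(self):
        self.sums = {}    # label -> per-dimension running sum of features
        self.counts = {}  # label -> number of examples seen

    def learn(self, label, features):
        """Update only the running statistics of the given label."""
        if label not in self.sums:
            self.sums[label] = [0.0] * len(features)
            self.counts[label] = 0
        self.sums[label] = [s + f for s, f in zip(self.sums[label], features)]
        self.counts[label] += 1

    def predict(self, features):
        """Return the label whose mean feature vector is closest."""
        def squared_distance(label):
            n = self.counts[label]
            return sum((s / n - f) ** 2
                       for s, f in zip(self.sums[label], features))
        return min(self.sums, key=squared_distance)


clf = IncrementalClassifier()
clf.learn("backpack", [1.0, 0.0])
clf.learn("backpack", [0.8, 0.2])
before = clf.predict([0.9, 0.1])

# A new class is added later from a single example.
clf.learn("bicycle", [0.0, 1.0])
after = clf.predict([0.9, 0.1])
print(before, after, clf.predict([0.1, 0.9]))
```

Because learning a new label touches only that label's statistics, the earlier "backpack" prediction is unchanged after "bicycle" is added.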

        Security cameras are yet another promising field of AI innovation. In April, Intel Movidius and Dahua Technology USA, a subsidiary of Chinese Dahua Technology, announced a new line of cameras that will offer advanced video analysis features, including crowd density monitoring, stereoscopic vision, facial recognition, people counting, behavior analysis, and detection of illegally parked vehicles. The groundbreaking solution represents an important step in bringing AI and machine learning into real-world products, particularly due to advances in power efficiency. The Movidius Myriad 2 Vision Processing Unit (VPU) delivers substantial deep neural network processing while requiring less than a single watt of power. This radically low-powered computer vision allows for natively intelligent cameras that do not require cloud-based resources.

        Located in San Mateo, California, computer-vision startup Movidius was acquired by Intel in 2016. The company has worked with Chinese video Internet of Things (IoT) firm Hikvision, which uses the innovative Myriad 2 VPU to run deep neural networks and perform high-accuracy video analytics on the cameras themselves. By processing data on edge devices, the innovative cameras can detect anomalies in real time and thus allow for groundbreaking advances in creating safer communities, better transit hubs, and more efficient operations.

The Smart Home

        Vision-based Internet of Things (IoT) applications are also an important area for innovation. When it comes to the smart home, in particular, artificially intelligent vision systems can offer new levels of security, safety, comfort, and entertainment.

Safety and Security
        Potential safety and security applications include front doors that recognize authorized people and unlock for them while remaining locked for unfamiliar faces. Vision-based alarm systems could similarly distinguish members of the family from unauthorized strangers. AI indoor security cameras could go even further in protecting users and preventing accidents by issuing alerts when an elderly person falls or a child is approaching a hot stove.

        IoT pioneer Nest has recently launched a smart security camera with facial recognition. The Nest Cam IQ identifies people it has been introduced to and sends alerts to the owner’s smartphone in case it sees someone it doesn’t recognize. Potential future applications may include “seeing” how many people are in a room and adjusting temperature and lighting accordingly, and even “noticing” when the fridge is low on certain items and adding them to a grocery list.
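
The recognize-or-alert behavior described above is often built by comparing a numeric face "embedding" against those of enrolled people and treating anything below a similarity threshold as unfamiliar. The sketch below illustrates that general approach. It is not Nest's implementation; the names, the threshold of 0.8, and the three-dimensional embeddings are assumptions, and a real system would obtain embeddings from a trained face-recognition model.

```python
import math

# Sketch of threshold-based face matching; embeddings are made-up toy values.

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def identify(embedding, enrolled, threshold=0.8):
    """Return the best-matching enrolled name, or None for an unfamiliar face."""
    best_name, best_score = None, threshold
    for name, reference in enrolled.items():
        score = cosine_similarity(embedding, reference)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


enrolled = {"alice": [0.9, 0.1, 0.3], "bob": [0.1, 0.8, 0.5]}

familiar = identify([0.88, 0.12, 0.28], enrolled)
stranger = identify([0.3, 0.3, -0.9], enrolled)
print(familiar, stranger)
if stranger is None:
    print("Alert: unrecognized person")
```

The threshold sets the trade-off between false alerts for family members and missed alerts for strangers, so in practice it would need tuning on real data.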

        Various companies around the globe are eager to get ahead in the vision-based smart home market. Amazon has recently unveiled its AWS DeepLens camera, the world’s first deep-learning video camera for developers. Expected to reach the market in April 2018, the innovative camera is seen primarily as a training device that can use the AWS machine-learning service SageMaker to perform tasks such as object and face detection and activity recognition.

Entertainment
        On the entertainment front, vision-based systems could allow for unprecedented, seamless personalization. A TV that “sees” and recognizes who is watching could turn on a tailored interface and even block inappropriate content when a child is in the room. Emotion recognition could take customization even further by adapting content to the user’s mood. Even though this kind of application is still in its early days, various companies are beginning to incorporate vision-based AI into their IoT entertainment products. There is a growing expectation that Apple may include face recognition in the HomePod smart speaker by 2019.

        Located in Boston, Massachusetts, Affectiva has developed technology that allows standard webcams to recognize different human emotions. The emotion measurement solution is largely visual-based; it uses facial cues, gestures, and psychophysical responses to identify the user’s mood. Affectiva points out that existing technology has lots of IQ, but no “emotional intelligence”. The startup envisions, however, that IoT devices will soon become mood-aware, as AI and optical sensors become widespread.

Autonomous Vehicles

As autonomous-driving capabilities grow in number and complexity, the crucial role of visual intelligence technology becomes ever more clear. Cameras are used to monitor the road as well as to collect information on the occupant’s behavior. In addition to mapping and localization, smart cameras estimate distances to surrounding objects, read traffic signs, and detect pedestrians. A much less expensive alternative to LiDAR, cameras are a preeminent feature of the driverless cars being tested so far. By incorporating cameras at all angles, these innovative vehicles can maintain a 360-degree, real-time view of their surroundings. For instance, Tesla’s “Full Self-Driving Hardware” includes eight cameras while Uber’s autonomous vehicles have 20 of these devices.

The work of Palo Alto, California-based Nauto has shed light on the crucial role of visual intelligence in autonomous-driving functionalities. Founded in 2015, the company has developed a groundbreaking, affordable solution to make existing cars safer and smarter. Equipped with a powerful AI engine, Nauto’s bi-directional dashcams are able to detect what is happening on the road ahead and within the vehicle. Through a combination of image recognition, motion sensors, and GPS, the system alerts drivers if there is a problem on the way or a dangerous distraction. According to Nauto, the innovative solution can generate up to a 37 percent reduction in collision incidents.

Wide-Ranging Automation

        A recent report by Markets and Markets forecasts that the machine vision market will grow from $8.12 billion in 2015 to $14.43 billion in 2022, experiencing a compound annual growth rate of 8.15% between 2016 and 2022. Automation is the major driver of this growth, with innovative efforts ranging from manufacturing and healthcare to consumer goods and robotics. The following sections present some outstanding initiatives in these fields.
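
For readers who want to check growth figures like these, the compound annual growth rate (CAGR) formula can be sketched as below. The function name is hypothetical. Note that the report's 8.15% covers 2016 to 2022, while the dollar figures quoted span 2015 to 2022, so the rate computed from the two endpoints is not expected to match it exactly; the 2016 base is not stated in the text and is only backed out here as an implied value.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate over the given number of years."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures quoted above, in billions of dollars.
implied_rate = cagr(8.12, 14.43, years=2022 - 2015)
print(f"Implied 2015-2022 CAGR: {implied_rate:.2%}")

# Working backwards from the 2022 figure at the reported 8.15% rate gives
# the 2016 market size that the report's rate would imply.
implied_2016_base = 14.43 / (1 + 0.0815) ** (2022 - 2016)
print(f"Implied 2016 market size: ${implied_2016_base:.2f} billion")
```

The endpoint-based rate comes out at roughly 8.6 percent per year, slightly above the reported figure because it includes growth during 2015 to 2016.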

        I. Quality Inspection: Based in Los Altos, California, Instrumental specializes in optical detection systems for manufacturing. The company has recently added an AI capability to its products, enabling them to automatically recognize and triage abnormal units in the assembly line. The Monitor technology uses vision-based AI to detect differences between products and learn whether those are actual defects or not. Compared to traditional computer vision approaches, the technology behind Instrumental’s solution requires significantly less training and fewer golden samples or rules. Quality inspection is just one of the many illustrations of how vision-based AI can greatly contribute to more efficient, automated manufacturing.

        II. Medical Diagnosis: Using a database of nearly 130,000 images, researchers at Stanford University trained an algorithm to diagnose skin cancer. Inspired by the need for universal access to healthcare, they built upon an existing algorithm developed by Google to perform object identification. The team used machine-learning techniques to expand these identifiable categories to include melanomas and carcinomas. The algorithm matched the performance of board-certified dermatologists, showing great promise for widespread application. Researchers envision making a smartphone-compatible version of the software that would bring reliable diagnosis to patients’ fingertips.

        III. Robotics: Founded in March 2016, PerceptIn has been at the forefront of visual intelligence. Located in Santa Clara, California, the robotics startup has recently unveiled Ironsides, a full vision system that combines hardware and software for real-time tracking, mapping, and path planning for autonomous robots. The innovative solution can be integrated into virtually any device, enabling a wide array of autonomous applications. As the number of robotized products increases, visual intelligence and vision-based perception systems become a major area for innovation.

        IV. Consumer Goods: Google recently unveiled a new concept for artificially intelligent consumer cameras. Google Clips uses advanced machine-learning algorithms to recognize people and decide when to take photos. Its objective is to allow for hands-free, authentic pictures that capture the “feeling” of each particular moment. The innovative device automatically captures seven-second clips at 15 frames per second, to be later filtered and selected by the user via a mobile app. The algorithm’s success grows with the frequency of use, as it learns to recognize the people and moments that matter to the user.

        V. Virtual Reality and Augmented Reality: Recent advances in real-time image recognition, expanding network bandwidth, and improvements in processing and storage are opening the way for innovative solutions that combine vision-based AI and VR/AR. The retail environment is a particularly promising field for innovation. Founded in 2015, Oak Labs has created a smart dressing-room mirror that allows customers to browse different products, toggle different types of lighting, and even receive suggestions of complementary pieces. Early adopters include Ralph Lauren and Rebecca Minkoff.

Conclusion

Visual intelligence technology brings together three highly disruptive trends: artificial intelligence, computer vision, and analytics. From public safety and surveillance to smart homes and autonomous vehicles, there is a wide range of potential applications for artificially intelligent vision systems. Innovative companies engaged in visual intelligence R&D should take advantage of the tax credit opportunity available to support their efforts and increase their chances of success.