Intel Highlighted Why NVIDIA Tensor Core GPUs Are Great for Inference

It’s not every day that one of the world’s leading tech companies highlights the benefits of your products.

Intel did just that last week, comparing the inference performance of two of their most expensive CPUs to NVIDIA GPUs.

To match the performance of a single mainstream NVIDIA V100 GPU, Intel combined two of its power-hungry, highest-end CPUs, estimated by AnandTech to cost $50,000-$100,000. Intel’s performance comparison also highlighted the clear advantage of NVIDIA T4 GPUs, which are built for inference: compared with a single highest-end CPU, they’re not only faster but also 7x more energy-efficient and an order of magnitude more cost-efficient.

Inference performance is crucial, as AI-powered services are growing exponentially. Intel’s latest Cascade Lake CPUs include new instructions that improve inference, making them the best CPUs for the task. They’re still hardly competitive, however, with NVIDIA’s deep learning-optimized Tensor Core GPUs.

Inference (also known as prediction) is, in simple terms, the “pattern recognition” a neural network does after being trained. It’s where AI models deliver intelligent capabilities in applications, like detecting fraud in financial transactions, conversing in natural language to search the internet, and predicting manufacturing breakdowns before they even happen.

While most AI inference today happens on CPUs, NVIDIA Tensor Core GPUs are rapidly being adopted across the full range of AI models. Tensor Cores, a breakthrough innovation, have transformed NVIDIA GPUs into highly efficient and versatile AI processors. They perform multi-precision calculations at high rates, providing optimal precision for diverse AI models, and are supported automatically in popular AI frameworks.

It’s why a growing list of consumer internet companies — Microsoft, Paypal, Pinterest, Snap and Twitter among them — are adopting GPUs for inference.

Compelling Value of Tensor Core GPUs for Computer Vision

First introduced with the NVIDIA Volta architecture, Tensor Core GPUs are now in their second generation with NVIDIA Turing. Tensor Cores perform extremely efficient AI computations across a full range of precisions — from 16-bit floating point with 32-bit accumulate to 8-bit and even 4-bit integer operations with 32-bit accumulate.

They’re designed to accelerate both AI training and inference, and are easily enabled using automatic mixed precision features in the TensorFlow and PyTorch frameworks. Developers can achieve 3x training speedups by adding just two lines of code to their TensorFlow projects.
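The “two lines of code” refer to TensorFlow’s automatic mixed precision feature. As a hedged sketch (the exact API depends on the TensorFlow version; the calls below reflect the TF 1.x-era interfaces documented at the time):

```python
import os

# Option 1 (NVIDIA's TF 1.x containers): one environment variable switches on
# automatic mixed precision for the whole session.
os.environ["TF_ENABLE_AUTO_MIXED_PRECISION"] = "1"

import tensorflow as tf

# Option 2 (TF 1.14+): wrap an existing optimizer so the graph rewrite inserts
# float16 casts and dynamic loss scaling around unmodified model code.
opt = tf.compat.v1.train.AdamOptimizer(learning_rate=1e-3)
opt = tf.compat.v1.train.experimental.enable_mixed_precision_graph_rewrite(opt)
```

Either route leaves the rest of the training script unchanged; Tensor Cores then handle the FP16 math while FP32 master weights preserve accuracy.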

For computer vision, as the table below shows, comparing processor to processor, the NVIDIA T4 is faster than the highest-end CPU, 7x more power-efficient and far more affordable. The NVIDIA V100, designed for AI training, is 2x faster and more than 2x more energy-efficient than CPUs on inference.

Table 1: Inference on ResNet-50.

| | Two-Socket Intel Xeon 9282 | NVIDIA V100 (Volta) | NVIDIA T4 (Turing) |
| --- | --- | --- | --- |
| ResNet-50 Inference (images/sec) | 7,878 | 7,844 | 4,944 |
| # of Processors | 2 | 1 | 1 |
| Total Processor TDP | 800 W | 350 W | 70 W |
| Energy Efficiency (using TDP) | 10 img/sec/W | 22 img/sec/W | 71 img/sec/W |
| Performance per Processor (images/sec) | 3,939 | 7,844 | 4,944 |
| GPU Performance Advantage | 1.0 (baseline) | 2.0x | 1.3x |
| GPU Energy-Efficiency Advantage | 1.0 (baseline) | 2.3x | 7.2x |

Source: Intel Xeon performance; NVIDIA GPU performance
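The efficiency and advantage rows in Table 1 follow directly from the throughput and TDP figures. A quick sketch of the arithmetic, using only the values from the table above:

```python
# Throughput (images/sec), TDP (watts) and processor counts from Table 1.
systems = {
    "2x Xeon 9282": {"imgs": 7878, "tdp": 800, "procs": 2},
    "V100":         {"imgs": 7844, "tdp": 350, "procs": 1},
    "T4":           {"imgs": 4944, "tdp": 70,  "procs": 1},
}

cpu = systems["2x Xeon 9282"]
cpu_eff = cpu["imgs"] / cpu["tdp"]          # ~10 img/sec/W
cpu_per_proc = cpu["imgs"] / cpu["procs"]   # ~3,939 img/sec per CPU

for name in ("V100", "T4"):
    gpu = systems[name]
    perf_adv = gpu["imgs"] / cpu_per_proc           # per-processor speedup
    eff_adv = (gpu["imgs"] / gpu["tdp"]) / cpu_eff  # energy-efficiency gain
    print(f"{name}: {perf_adv:.1f}x performance, {eff_adv:.1f}x energy efficiency")
# V100: 2.0x performance, 2.3x energy efficiency
# T4: 1.3x performance, 7.2x energy efficiency
```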

Compelling Value of Tensor Core GPUs for Understanding Natural Language

AI has been moving at a frenetic pace. This rapid progress is fueled by teams of AI researchers and data scientists who continue to innovate and create highly accurate and exponentially more complex AI models.

Over four years ago, computer vision was among the first applications where AI, using models like Microsoft’s ResNet-50, reached superhuman accuracy. Today’s advanced models perform even more complex tasks, like understanding language and speech, at similar levels. BERT, a highly complex AI model open-sourced by Google last year, can now understand prose and answer questions with superhuman accuracy.

A measure of the complexity of AI models is the number of parameters they have. Parameters in an AI model are the variables that store information the model has learned. While ResNet-50 has 25 million parameters, BERT has 340 million, a 13x increase.
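The ResNet-50 figure is easy to verify by counting trainable parameters in a stock implementation. A minimal sketch using the Keras application model (exact totals vary slightly between implementations; the BERT-Large count is a published figure, not something this script computes):

```python
import tensorflow as tf

# The ResNet-50 that ships with Keras reports roughly 25.6 million parameters.
resnet50 = tf.keras.applications.ResNet50(weights=None)
print("ResNet-50 parameters:", resnet50.count_params())

# BERT-Large's ~340 million parameters come from its published configuration
# (24 transformer layers, hidden size 1024, 16 attention heads).
```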

On an advanced model like BERT, a single NVIDIA T4 GPU is 59x faster than a dual-socket CPU server and 240x more power-efficient.

Table 2: Inference on BERT. Workload: question-answering inference with a fine-tuned BERT-Large model.

| | Dual Intel Xeon Gold 6240 | NVIDIA T4 (Turing) |
| --- | --- | --- |
| BERT Inference, Question-Answering (sentences/sec) | 2 | 118 |
| Processor TDP | 300 W (150 W x 2) | 70 W |
| Energy Efficiency (using TDP) | 0.007 sentences/sec/W | 1.7 sentences/sec/W |
| GPU Performance Advantage | 1.0 (baseline) | 59x |
| GPU Energy-Efficiency Advantage | 1.0 (baseline) | 240x |

CPU server: Dual-socket Xeon Gold 6240@2.6GHz; 384GB system RAM; FP32 precision; with Intel’s TF Docker container v. 1.13.1. Note: Batch-size 4 results yielded the best CPU score.

GPU results: T4: Dual-socket Xeon Gold 6240@2.6GHz; 384GB system RAM; mixed precision; CUDA 10.1.105; NCCL 2.4.3, cuDNN 7.5.0.56, cuBLAS 10.1.105; NVIDIA driver 418.67; on TensorFlow using automatic mixed precision and XLA compiler; batch-size 4 and sequence length 128 used for all platforms tested. 

Compelling Value of Tensor Core GPUs for Recommender Systems

Another key usage of AI is in recommendation systems, which are used to provide relevant content recommendations on video sharing sites, news feeds on social sites and product recommendations on e-commerce sites.

Neural collaborative filtering, or NCF, is a recommender model that uses prior interactions between users and items to make recommendations. Running inference on the NCF model from the MLPerf 0.5 training benchmark, the NVIDIA T4 delivers 10x more performance and 20x higher energy efficiency than a CPU (a minimal sketch of the NCF architecture follows the table below).

Table 3: Inference on NCF.

| | Single Intel Xeon Gold 6140 | NVIDIA T4 (Turing) |
| --- | --- | --- |
| Recommender Inference Throughput, MovieLens (thousands of samples/sec) | 2,860 | 27,800 |
| Processor TDP | 150 W | 70 W |
| Energy Efficiency (using TDP, thousands of samples/sec/W) | 19 | 397 |
| GPU Performance Advantage | 1.0 (baseline) | 10x |
| GPU Energy-Efficiency Advantage | 1.0 (baseline) | 20x |

CPU server: Single-socket Xeon Gold 6240@2.6GHz; 384GB system RAM; Used Intel Benchmark for NCF on TensorFlow with Intel’s TF Docker container version 1.13.1; FP32 precision. Note: Single-socket CPU config used for CPU tests as it yielded a better score than dual-socket.

GPU results: T4: Single-socket Xeon Gold 6140@2.3GHz; 384GB system RAM; CUDA 10.1.105; NCCL 2.4.3, cuDNN 7.5.0.56, cuBLAS 10.1.105; NVIDIA driver 418.40.04; on TensorFlow using automatic mixed precision and XLA compiler; batch-size: 2,048 for CPU, 1,048,576 for T4; precision: FP32 for CPU, mixed precision for T4. 
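Neural collaborative filtering itself is a compact architecture: user and item IDs are mapped to learned embeddings and fed through a small multilayer perceptron that scores the likelihood of an interaction. The sketch below is a generic NCF-style model in Keras, not the MLPerf reference implementation; the vocabulary sizes and layer widths are illustrative assumptions:

```python
import tensorflow as tf

# Illustrative sizes, roughly MovieLens-20M scale; not the MLPerf reference code.
NUM_USERS, NUM_ITEMS, EMBED_DIM = 138_493, 26_744, 64

user_id = tf.keras.Input(shape=(1,), dtype="int32")
item_id = tf.keras.Input(shape=(1,), dtype="int32")

user_vec = tf.keras.layers.Flatten()(tf.keras.layers.Embedding(NUM_USERS, EMBED_DIM)(user_id))
item_vec = tf.keras.layers.Flatten()(tf.keras.layers.Embedding(NUM_ITEMS, EMBED_DIM)(item_id))

x = tf.keras.layers.Concatenate()([user_vec, item_vec])
for units in (256, 128, 64):                                # MLP tower
    x = tf.keras.layers.Dense(units, activation="relu")(x)
score = tf.keras.layers.Dense(1, activation="sigmoid")(x)   # probability of interaction

ncf = tf.keras.Model(inputs=[user_id, item_id], outputs=score)
ncf.compile(optimizer="adam", loss="binary_crossentropy")
```

Because the model is dominated by embedding lookups and small dense layers, inference throughput comes down to how quickly the hardware can stream large batches through them, which is where the T4’s Tensor Cores pay off.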

Unified Platform for AI Training and Inference

The use of AI models in applications is an iterative process designed to continuously improve their performance. Data scientist teams constantly update their models with new data and algorithms to improve accuracy. These models are then updated in applications by developers.

Updates can happen monthly, weekly or even daily. Having a single platform for both AI training and inference can dramatically simplify and accelerate the process of deploying and updating AI in applications.

NVIDIA’s data center GPU computing platform leads the industry in performance by a large margin for AI training, as demonstrated by the standard AI benchmark, MLPerf. And the NVIDIA platform provides compelling value for inference, as the data presented here attests. That value increases with the growing complexity and progress of modern AI.

To help fuel this rapid progress in AI, NVIDIA works closely with the ecosystem and constantly optimizes software, including key frameworks like TensorFlow, PyTorch and MXNet as well as inference software like TensorRT and TensorRT Inference Server.

NVIDIA also regularly publishes pre-trained AI models for inference and model scripts for training models using your own data. All of this software is freely made available as containers, ready to download and run from NGC, NVIDIA’s hub for GPU-accelerated software.

Get the full story about our comprehensive AI platform.

Plowing AI, Startup Retrofits Tractors with Autonomy

Colin Hurd, Mark Barglof and Quincy Milloy aren’t your conventional tech entrepreneurs. And that’s not just because their agriculture technology startup is based in Ames, Iowa.

Smart Ag is developing autonomy and robotics for tractors in a region more notable for its corn and soybeans than software or silicon.

Hurd, Barglof and Milloy, all Iowa natives, founded Smart Ag in 2015 and landed a total of $6 million in seed funding, $5 million of which came from Stine Seed Farm, an affiliate of Stine Seed Co. Other key investors included Ag Startup Engine, which backs Iowa State University startups.

The company is in widespread pilot tests with its autonomy for tractors and plans to commercialize its technology for row crops by 2020.

Smart Ag is a member of the NVIDIA Inception virtual accelerator, which provides marketing and technology support to AI startups.

A team of two dozen employees has been busy building the company’s GPU-enabled autonomous software and robotic hardware system, which operates tractors that pull grain carts during harvest.

“We aspire to being a software company first, but we’ve had to make a lot of hardware in order to make a vehicle autonomous,” said Milloy.

Wheat from Chaff

Smart Ag primarily works today with traditional row crop (corn and soybean) producers and cereal grain (wheat) producers. During harvest, these farmers use a tractor to pull a grain cart in conjunction with the harvesting machine, or combine, which separates the wheat from the chaff or corn from the stalk. Once the combine’s storage bin is full, the tractor with the grain cart pulls alongside for the combine to unload into the cart.

That’s where autonomous tractors come in.

Farm labor is scarce. In California, 55 percent of farms surveyed said they had experienced labor shortages in 2017, according to a report from the California Farm Bureau Federation.

Smart Ag is developing its autonomous tractor to pull a grain cart, addressing the lack of drivers available for this job.

Harvest Tractor Autonomy

Farmers can retrofit a John Deere 8R Series tractor using the company’s AutoCart system. It provides controllers for steering, acceleration and braking, as well as cameras, radar and wireless connectivity. An NVIDIA Jetson Xavier powers its perception system, fusing Smart Ag’s custom agricultural object detection model with other sensor data to give the tractor awareness of its surroundings.
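As a rough illustration of what fusing camera detections with other sensors can look like, here is a generic, hypothetical sketch; it is not Smart Ag’s actual AutoCart software, and the object classes, radar interface and threshold are all assumptions:

```python
from collections import namedtuple

# Hypothetical types standing in for real perception outputs.
Detection = namedtuple("Detection", "label bearing_deg")      # from the camera model
RadarTrack = namedtuple("RadarTrack", "bearing_deg range_m")  # from radar

SAFE_STOP_DISTANCE_M = 10.0  # illustrative safety threshold

def fuse_and_decide(detections, radar_tracks):
    """Pair each camera detection with the nearest radar track and decide stop/go."""
    for det in detections:
        track = min(radar_tracks, key=lambda t: abs(t.bearing_deg - det.bearing_deg))
        if det.label == "person" and track.range_m < SAFE_STOP_DISTANCE_M:
            return "STOP"
    return "GO"

# Example: a person detected near 5 degrees, radar range 8 meters -> stop the tractor.
print(fuse_and_decide([Detection("person", 5.0)], [RadarTrack(4.0, 8.0)]))
```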

“The NVIDIA Jetson AGX Xavier has greatly increased our perception capabilities — from the ability to process more camera feeds to the fusion of additional sensors —  it has enabled the path to develop and rapidly deploy a robust safety system into the field,” Milloy said.

Customers can use mobile devices and a web browser to access the system to control tractors.

Smart Ag’s team gathered more than 1 million images to train the image recognition system on AWS, tapping into NVIDIA GPUs. The startup’s custom image recognition algorithms allow its autonomous tractor to avoid people and other objects in the field, find the combine for unloading, and return to a semi truck, where the driverless grain cart unloads the grain for final transport to a grain storage facility.

Smart Ag has more than 12 pilot tests under its belt and uses those to gather more data to refine its algorithms. The company plans to expand its test base to roughly 20 systems operating during harvest in 2019 in preparation for its commercial launch in 2020.

“We’ve been training for the past year and a half. The system can get put out today in deployment, but we can always get higher accuracy,” Milloy said.

ACR AI-LAB and NVIDIA Make AI in Hospitals Easy on IT, Accessible to Every Radiologist

For radiology to benefit from AI, there need to be easy, consistent and scalable ways for hospital IT departments to implement the technology. This calls for a return to a service-oriented architecture, where logical components are separated and can each scale individually, and for efficient use of the additional compute power these tools require.

AI is coming from dozens of vendors as well as internal innovation groups, and needs a place within the hospital network to thrive. That’s why NVIDIA and the American College of Radiology (ACR) have published a Hospital AI Reference Architecture Framework. It helps hospitals easily get started with AI initiatives.

A Cookbook to Make AI Easy

The Hospital AI Reference Architecture Framework was published at yesterday’s annual ACR meeting for public comment. This follows the recent launch of the ACR AI-LAB, which aims to standardize and democratize AI in radiology. The ACR AI-LAB uses infrastructure such as NVIDIA GPUs and the NVIDIA Clara AI toolkit, as well as GE Healthcare’s Edison platform, which helps bring AI from research into FDA-cleared smart devices.

The Hospital AI Reference Architecture Framework outlines how hospitals and researchers can easily get started with AI initiatives. It includes descriptions of the steps required to build and deploy AI systems, and provides guidance on the infrastructure needed for each step.

Figure: Hospital AI Reference Architecture Framework

To drive an effective AI program within a healthcare institution, there must first be an understanding of the workflows involved, the compute needs and the data required. That starts with a foundation of easy-to-deploy compute at the edge, enabling better insights from patient data.

Using a transfer client, seed models can be downloaded from a centralized model store. A clinical champion uses an annotation tool to locally create data that can be used for fine-tuning the seed model or training a new model. Then, using the training system with the annotated data, a localized model is instantiated. Finally, an inference engine is used to conduct validation and ultimately inference on data within the institution.
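In code terms, those four workflows reduce to a short loop. The sketch below is purely illustrative structure; every function and name is hypothetical and stands in for real transfer, annotation, training and inference components rather than any actual AI-LAB or Clara API:

```python
# Hypothetical outline of the four hospital AI workflows; all names are illustrative.

def download_seed_model(store_url, name):
    """Workflow 1: the transfer client pulls a seed model from the central model store."""
    print(f"downloading {name} from {store_url}")
    return {"name": name, "weights": None}

def annotate_studies(studies):
    """Workflow 2: a clinical champion labels local studies (human-in-the-loop)."""
    return [{"study": s, "label": None} for s in studies]

def fine_tune(model, annotations, epochs=10):
    """Workflow 3: produce a localized model from the seed model and local labels."""
    print(f"fine-tuning {model['name']} for {epochs} epochs on {len(annotations)} studies")
    return dict(model, name=model["name"] + "-localized")

def run_inference(model, new_studies):
    """Workflow 4: validate, then run inference on data that never leaves the institution."""
    return {s: f"prediction from {model['name']}" for s in new_studies}

seed = download_seed_model("https://modelstore.example", "chest-xray-seed")
labels = annotate_studies(["study-001", "study-002"])
local_model = fine_tune(seed, labels)
print(run_inference(local_model, ["study-003"]))
```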

These four workflows sit atop AI compute infrastructure, which can be accelerated with NVIDIA GPU technology for best performance, alongside storage for models and annotated studies. These workflows tie back into other hospital systems such as PACS, where medical images are archived.

Three Magic Ingredients: Hospital Data, Clinical AI Workflows, AI Computing

Healthcare institutions don’t have to build the systems to deploy AI tools themselves.

This scalable architecture is designed to support and provide computing power to solutions from different sources. GE Healthcare’s Edison platform now uses NVIDIA’s TRT-IS inference capabilities to help AI run in an optimized way within GPU-powered software and medical devices. This integration makes it easier to deliver AI from multiple vendors into clinical workflows — and is the first example of the AI-LAB’s efforts to help hospitals adopt solutions from different vendors.

Together, Edison with TRT-IS offers a ready-made device inferencing platform optimized for GPU-accelerated AI, so models built anywhere can be deployed in an existing healthcare workflow.

Hospitals and researchers are empowered to embrace AI technologies without building their own standalone technology or yielding their data to the cloud, which has privacy implications.

Cracking the Code on Opioid Addiction with Summit Supercomputer

About 10 percent of people who are prescribed opioids experience full-blown addiction, yet there’s no way to know who is most susceptible.

Genomics can likely explain this riddle — by identifying genetic predispositions to addiction — but such research will take enormous calculations on massive datasets.

Researchers at Oak Ridge National Laboratory are using the biggest GPU-powered computing cluster ever built to perform these calculations faster than ever. It’s part of an effort that could one day lead to alternative medications for pain management and help mitigate the opioid addiction crisis.

“We’ve been dreaming about solving these sorts of problems for years,” said Dan Jacobson, chief scientist of computational systems biology at ORNL, one of several labs run by the U.S. Department of Energy and home to Summit, the world’s fastest supercomputer.

Scientists across numerous fields are hailing Summit as a huge leap in computational research capabilities. Its 27,000-plus NVIDIA V100 Tensor Core GPUs deliver orders of magnitude more processing power than was possible just a few years ago.

In the world of biology, this translates to being able to zoom in closer on the molecular level and explore new frontiers. That, in turn, will enable researchers to learn more about how all the components of a cell interact, and to do studies on a population scale.

Among the first such projects will be an effort led by Jacobson to train a machine learning model on genomic data in the hopes of accurately predicting whether a patient is predisposed to opioid addiction.

“Tensor Cores on GPUs on Summit will give us this enormous boost in performance to solve fundamental biological problems we simply couldn’t before,” he said.

Serious Math

Jacobson’s team plans to tap Summit’s prodigious mathematical capabilities by running immense calculations on genetics data that will help establish correlations between that data and the likelihood of addiction.

First, the team will use Summit to look for genetic changes across an entire population, then it’ll write algorithms to search for correlations between those changes.

To do this, the team is working with a Veterans Administration dataset of clinical records for 23 million people going back two decades. It already has assembled a dataset of genomics correlations on 600,000 people. The goal is to build that to about 2 million.

Once they have a large enough set of these correlations, the team can start testing them against two groups: those who have developed opioid addiction and a control group who have been exposed but not become addicted.

Which brings us to the math: Jacobson said the very first calculation would require somewhere on the order of 10 to the 16th power (or 10 quadrillion) comparisons, and that operation would be repeated thousands, possibly even hundreds of thousands, of times.
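To get a feel for where a number that large comes from: an all-against-all comparison of n genetic variants takes roughly n squared over two pairings, so a variant catalog on the order of 100 million entries already lands near 10 quadrillion comparisons. The variant count below is an illustrative assumption, not a figure from the ORNL team:

```python
# Illustrative only: the variant count is assumed for the sake of the arithmetic.
n_variants = 140_000_000                        # ~1.4e8 variants across a population
pairwise = n_variants * (n_variants - 1) // 2   # all-against-all comparisons
print(f"{pairwise:.2e}")                        # ~9.80e+15, on the order of 10**16
```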

Being able to perform such calculations in manageable amounts of time will open the doors to new ways of dealing with the growing opioid addiction crisis.

“We can develop better therapies for addiction, we can develop better therapies for chronic pain, and we can predict which patients will become addicted to opioids and then not give them opioids,” said Jacobson.

More Breakthroughs to Come

The team already has been able to test some of its applications on Summit and has managed to boost performance from 1.8 exaflops to 2.36 exaflops on its algorithm. It’s the fastest science application ever reported, and earned the team a Gordon Bell Prize in 2018. (For reference, one exaflops equals 1 quintillion, or billion billion, operations per second.)

As it continues to refine performance, Jacobson’s team expects to achieve higher levels of accuracy and to get those results faster.

That’s almost hard to imagine, given that Jacobson already said that his team can, in one hour on Summit, complete tasks that would require 35 years on a competing supercomputer, or 12,000 years on a laptop.

Jacobson believes that being able to do his team’s work on opioid addiction on Summit will lead to breakthrough treatments of other conditions, such as Alzheimer’s, dementia, prostate cancer and cardiovascular disease.

“Once we understand the complex genetic architecture underlying addiction, we want to do this really for all clinical disease states that seem to have some sort of genetic underpinning,” said Jacobson. “It’s machines like Summit that give us the ability to do that at scale, so we can now start to answer scientific questions that were literally impossible earlier this year.”

Learn more about Summit and why it plays such a critical role in enabling scientific progress.

By the Book: AI Making Millions of Ancient Japanese Texts More Accessible

Natural disasters aren’t just threats to people and buildings, they can also erase history — by destroying rare archival documents. As a safeguard, scholars in Japan are digitizing the country’s centuries-old paper records, typically by taking a scan or photo of each page.

But while this method preserves the content in digital form, it doesn’t mean researchers will be able to read it. Millions of physical books and documents were written in an obsolete script called Kuzushiji, legible to fewer than 10 percent of Japanese humanities professors.

“We end up with billions of images which will take researchers hundreds of years to look through,” said Tarin Clanuwat, researcher at Japan’s ROIS-DS Center for Open Data in the Humanities. “There is no easy way to access the information contained inside those images yet.”

Extracting the words on each page into machine-readable, searchable form takes an extra step: transcription, which can be done either by hand or through a computer vision method called optical character recognition, or OCR.

Clanuwat and her colleagues are developing a deep learning OCR system to transcribe Kuzushiji writing — used for most Japanese texts from the 8th century to the start of the 20th — into modern Kanji characters.
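KuroNet, the researchers’ model, works on whole page images. As a much simpler illustration of the underlying character-recognition problem, here is a minimal per-character CNN classifier sketch; it is not KuroNet, and the 28x28 input size and 49 classes are assumptions in the style of the public Kuzushiji-49 dataset:

```python
import tensorflow as tf

# Minimal character classifier; NOT the KuroNet page-level model described here.
NUM_CLASSES = 49  # assumed, Kuzushiji-49-style label set

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=5)  # character crops loaded separately
```

The real research problem is much harder than this sketch suggests: full pages contain thousands of possible characters in flowing, connected script, which is why the team works at the document level and why GPUs were essential for training.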

Clanuwat said GPUs are essential for both training and inference of the AI.

“Doing it without GPUs would have been inconceivable,” she said. “GPU not only helps speed up the work, but it makes this research possible.”

Parsing a Forgotten Script

Before the standardization of the Japanese language in 1900 and the advent of modern printing, Kuzushiji was widely used for books and other documents. Though millions of historical texts were written in the cursive script, just a few experts can read it today.

Only a tiny fraction of Kuzushiji texts have been converted to modern scripts — and it’s time-consuming and expensive for an expert to transcribe books by hand. With an AI-powered OCR system, Clanuwat hopes a larger body of work can be made readable and searchable by scholars.

She collaborated on the OCR system with Asanobu Kitamoto from her research organization and Japan’s National Institute of Informatics, and Alex Lamb of the Montreal Institute for Learning Algorithms. Their paper was accepted in 2018 to the Machine Learning for Creativity and Design workshop at the prestigious NeurIPS conference.

Using a labeled dataset of 17th to 19th century books from the National Institute of Japanese Literature, the researchers trained their deep learning model on NVIDIA GPUs, including the TITAN Xp. Training the model took about a week, Clanuwat said, but “would be impossible” to train on CPU.

Kuzushiji has thousands of characters, with many occurring so rarely in datasets that it is difficult for deep learning models to recognize them. Still, the average accuracy of the researchers’ KuroNet document recognition model is 85 percent — outperforming prior models.

The newest version of the neural network can recognize more than 2,000 characters. For easier documents with fewer than 300 character types, accuracy jumps to about 95 percent, Clanuwat said. “One of the hardest documents in our dataset is a dictionary, because it contains many rare and unusual words.”

One challenge the researchers faced was finding training data representative of the long history of Kuzushiji. The script changed over the hundreds of years it was used, while the training data came from the more recent Edo period.

Clanuwat hopes the deep learning model could expand access to Japanese classical literature, historical documents and climatology records to a wider audience.
