What’s a DPU?

Of course, you’re probably already familiar with the Central Processing Unit or CPU. Flexible and responsive, for many years CPUs were the sole programmable element in most computers.

More recently the GPU, or graphics processing unit, has taken a central role. Originally used to deliver rich, real-time graphics, their parallel processing capabilities make them ideal for accelerated computing tasks of all kinds.

That’s made them the key to artificial intelligence, deep learning, and big data analytics applications.

Over the past decade, however, computing has broken out of the boxy confines of PC and servers — with CPUs and GPUs powering sprawling new hyperscale data centers.

These data centers are knit together with a powerful new category of processors. The DPU, or data processing unit, has become the third member of the data-centric accelerated computing model.

“This is going to represent one of the three major pillars of computing going forward,” NVIDIA CEO Jensen Huang said during a talk earlier this month.

“The CPU is for general purpose computing, the GPU is for accelerated computing and the DPU, which moves data around the data center, does data processing.”

So What Makes a DPU Different?

A DPU is a new class of programmable processor: a system on a chip, or SoC, that combines three key elements:

  • An industry-standard, high-performance, software-programmable, multi-core CPU, typically based on the widely used Arm architecture, tightly coupled to the other SoC components

  • A high-performance network interface capable of parsing, processing, and efficiently transferring data at line rate, or the speed of the rest of the network, to GPUs and CPUs

  • A rich set of flexible and programmable acceleration engines that offload and improve application performance for AI and machine learning, security, telecommunications, and storage, among others

All these DPU capabilities are critical to enabling the isolated, bare-metal, cloud-native computing that will define the next generation of cloud-scale computing.

DPUs: Incorporated into SmartNICs

The DPU can be used as a stand-alone embedded processor, but it’s more often incorporated into a SmartNIC, a network interface controller that’s used as a key component in a next-generation server.

Other devices that claim to be DPUs miss significant elements of these three critical capabilities, each of which is fundamental to answering the question: What is a DPU?

For example, some vendors use proprietary processors that don’t benefit from the rich development and application infrastructure offered by the broad Arm CPU ecosystem.

Others claim to have DPUs but make the mistake of focusing solely on the embedded CPU to perform data path processing.

DPUs: A Focus on Data Processing

This isn’t competitive and doesn’t scale, because trying to beat the traditional x86 CPU with a brute force performance attack is a losing battle. If 100 Gigabit/sec packet processing brings an x86 to its knees, why would an embedded CPU perform better?
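A quick back-of-the-envelope calculation makes the point concrete; the figures below are standard Ethernet framing arithmetic, not measurements from any particular system:

```python
# Back-of-the-envelope: why line-rate packet processing overwhelms a CPU core.
# Assumes minimum-size 64-byte Ethernet frames plus 20 bytes of preamble
# and inter-frame gap on the wire (standard Ethernet framing overhead).

link_bps = 100e9                      # 100 Gigabit/sec link
bits_per_frame = (64 + 20) * 8        # 672 bits on the wire per minimum frame
pps = link_bps / bits_per_frame       # ~148.8 million packets per second

core_hz = 3e9                         # a typical 3 GHz x86 core
cycles_per_packet = core_hz / pps     # ~20 cycles to do *everything* per packet

print(f"{pps / 1e6:.1f} Mpps, {cycles_per_packet:.1f} cycles/packet")
```

Roughly 20 clock cycles per packet is not enough to parse headers, look up flows, and copy data, which is why the data path belongs in dedicated acceleration engines rather than on any CPU core, embedded or otherwise.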

Instead the network interface needs to be powerful and flexible enough to handle all network data path processing. The embedded CPU should be used for control path initialization and exception processing, nothing more.

At a minimum, there are 10 capabilities the network data path acceleration engines need to be able to deliver:

  • Data packet parsing, matching, and manipulation to implement an open virtual switch (OVS)
  • RDMA data transport acceleration for Zero Touch RoCE
  • GPU-Direct accelerators to bypass the CPU and feed networked data directly to GPUs (both from storage and from other GPUs)
  • TCP acceleration including RSS, LRO, checksum, etc.
  • Network virtualization for VXLAN and Geneve overlays and VTEP offload
  • Traffic shaping “packet pacing” accelerator to enable multi-media streaming, content distribution networks, and the new 4K/8K Video over IP (RiverMax for ST 2110)
  • Precision timing accelerators for telco Cloud RAN such as 5T for 5G capabilities
  • Crypto acceleration for IPsec and TLS performed inline so all other accelerations remain operational
  • Virtualization support for SR-IOV, VirtIO and para-virtualization
  • Secure Isolation: root of trust, secure boot, secure firmware upgrades, and authenticated containers and application life cycle management

These are just 10 of the acceleration and hardware capabilities that are critical to being able to answer yes to the question: “What is a DPU?”

So what is a DPU? This is a DPU:

Many so-called DPUs focus solely on delivering one or two of these functions.

The worst try to offload the datapath in proprietary processors.

While good for prototyping, this is a fool’s errand, because of the scale, scope, and breadth of the modern data center.

Additional DPU-Related Resources

The post What’s a DPU? appeared first on The Official NVIDIA Blog.

TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC Up to 20x

As with all computing, you’ve got to get your math right to do AI well. Because deep learning is a young field, there’s still a lively debate about which types of math are needed, for both training and inferencing.

In November, we explained the differences among popular formats such as single-, double-, half-, multi- and mixed-precision math used in AI and high-performance computing. Today, the NVIDIA Ampere architecture introduces a new approach for improving training performance on the single-precision models widely used for AI.

TensorFloat-32 (TF32) is the default mode in NVIDIA A100 GPUs for handling the matrix math, also called tensor operations, at the heart of AI and many HPC applications. Users don’t have to change any code to take advantage of its capabilities.

TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on Volta GPUs. Combining TF32 with structured sparsity on the A100 enables performance gains over Volta of up to 20x.

Understanding the New Math

It helps to step back for a second to see how TF32 works and where it fits.

Math formats are like rulers. The number of bits in a format’s exponent determines its range: how large an object it can measure. Its precision — how fine the lines are on the ruler — comes from the number of bits used for its mantissa, the part of a floating-point number after the radix point.

A good format strikes a balance. It should use enough bits to deliver precision without using so many it slows processing and bloats memory.

The chart below shows how TF32 is a hybrid that strikes this balance for tensor operations.

TF32 strikes a balance that delivers performance with range and accuracy.

TF32 uses the same 10-bit mantissa as half-precision (FP16) math, which has been shown to offer more than sufficient margin for the precision requirements of AI workloads. And TF32 adopts the same 8-bit exponent as FP32, so it can support the same numeric range.

The combination makes TF32 a great alternative to FP32 for crunching through single-precision math, specifically the massive multiply-accumulate functions at the heart of deep learning and many HPC apps.

Users don’t have to make any code changes because TF32 only runs inside the A100 GPU. TF32 operates on FP32 inputs and produces results in FP32. Non-tensor operations continue to use FP32.
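To make the format concrete, here is a minimal sketch, in plain Python, of rounding an FP32 value to TF32’s 10-bit mantissa while keeping FP32’s sign bit and 8-bit exponent. The bit manipulation below illustrates the format itself; it is not NVIDIA’s hardware implementation:

```python
import struct

def round_to_tf32(x: float) -> float:
    """Round an FP32 value to TF32 precision: keep the sign bit and the
    8-bit exponent, round the 23-bit FP32 mantissa down to TF32's 10 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest by adding half the dropped range, then clear the
    # low 13 mantissa bits; the carry propagates correctly for normal numbers.
    bits = (bits + (1 << 12)) & ~((1 << 13) - 1)
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# Values already representable in 10 mantissa bits pass through unchanged;
# other values pick up a relative error of at most about 2**-11.
print(round_to_tf32(1.5))         # exactly representable
print(round_to_tf32(3.14159265))  # rounded to roughly 3 decimal digits
```

This mirrors what happens inside the Tensor Core: inputs arrive as FP32, the multiply operates at TF32 precision, and the accumulation and result stay in FP32.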

For maximum performance, the A100 also has enhanced 16-bit math capabilities. It supports both FP16 and bfloat16 (BF16) at double the rate of TF32. Employing Automatic Mixed Precision, users can get 2x higher performance with just a few lines of code.

TF32 is Demonstrating Great Results Today

Compared to FP32, TF32 shows a 6x speedup training BERT, one of today’s most demanding conversational AI models. Application-level results on other AI training and HPC apps that rely on matrix math will vary by workload.

To validate the accuracy of TF32, we used it to train a broad set of AI networks across a wide variety of applications from computer vision to natural language processing to recommender systems. All of them have the same convergence-to-accuracy behavior as FP32.

That’s why NVIDIA is making TF32 the default on its cuDNN library which accelerates key math operations for neural networks. At the same time, NVIDIA is working with the open-source communities that develop AI frameworks to enable TF32 as their default training mode on A100 GPUs, too.

In June, developers will be able to access versions of the TensorFlow and PyTorch frameworks with support for TF32 on NGC, NVIDIA’s catalog of GPU-accelerated software.

“TensorFloat-32 provides a huge out-of-the-box performance increase for AI applications for training and inference while preserving FP32 levels of accuracy,” said Kemal El Moujahid, director of Product Management for TensorFlow.

“We plan to make TensorFloat-32 supported natively in TensorFlow to enable data scientists to benefit from dramatically higher speedups in NVIDIA A100 Tensor Core GPUs without any code changes,” he added.

“Machine learning researchers, data scientists and engineers want to accelerate time to solution,” said a spokesperson for the PyTorch team. “When TF32 is natively integrated into PyTorch, it will enable out-of-the-box acceleration with zero code changes while maintaining accuracy of FP32 when using the NVIDIA Ampere architecture-based GPUs.”

TF32 Accelerates Linear Solvers in HPC

HPC apps called linear solvers — algorithms with repetitive matrix-math calculations — also will benefit from TF32. They are used in a wide range of fields such as earth science, fluid dynamics, healthcare, material science and nuclear energy as well as oil and gas exploration.

Linear solvers using FP32 and FP64 have been in use for more than 30 years. They have demonstrated accuracy on par with FP64-only math across a wide range of matrix types. However, their performance benefits have been limited due to the modest throughput difference between FP32 and FP64 on most processors.

Last year, a fusion reaction study for the International Thermonuclear Experimental Reactor demonstrated that NVIDIA FP16 Tensor Cores and mixed-precision techniques delivered a speedup of 3.5x for such solvers. The same technology used in that study tripled the Summit supercomputer’s performance on the HPL benchmark.

To demonstrate the power and robustness of TF32 for linear system solvers, we ran a variety of tests in the SuiteSparse matrix collection using cuSolver 11.0, which supports TF32 and BF16 in A100 GPUs. The results (see the chart below) show that TF32 provided the highest speed-up (2.1x) and robustness. That’s why it’s the recommended choice in the A100.

In these tests, TF32 delivered the fastest results. Compared to other tensor-core modes, TF32 required fewer iterations and had the fewest fallback cases, times when the solution was not converging well and cuSolver automatically switched to FP64 to complete the calculation.
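The idea behind these mixed-precision solvers can be sketched in a few lines. In the toy example below, low precision is simulated by rounding to three significant decimal digits, and the 2x2 system is invented for illustration (none of this comes from cuSolver): solve approximately in low precision, then repeatedly compute the residual in full precision and solve for a low-precision correction.

```python
def lowp(x):
    """Stand-in for a low-precision (FP16/TF32-like) rounding step:
    keep only ~3 significant decimal digits."""
    return float(f"{x:.3g}")

def solve2(A, b):
    """Direct 2x2 solve of A x = b via Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - b[0] * A[1][0]) / det]

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]

# 1) Approximate solve: round the result to low precision.
x = [lowp(v) for v in solve2(A, b)]

# 2) Iterative refinement: residual in full precision, then a
#    low-precision correction solve, repeated a few times.
for _ in range(3):
    r = [b[i] - (A[i][0] * x[0] + A[i][1] * x[1]) for i in range(2)]
    dx = [lowp(v) for v in solve2(A, r)]
    x = [x[0] + dx[0], x[1] + dx[1]]

# x converges toward the exact solution (1/11, 7/11) even though every
# solve was low precision; each pass recovers a few more digits.
```

Each refinement pass multiplies the remaining error by roughly the rounding error of the low-precision step, which is why a handful of cheap iterations can deliver full-precision answers, and why a mode like TF32 that rounds less aggressively needs fewer iterations and falls back less often.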

Beyond linear solvers, other domains in high performance computing make use of FP32 matrix operations. NVIDIA plans to work with the industry to study the application of TF32 to more use cases that rely on FP32 today.

Where To Go for More Information

To get the big picture on the role of TF32 in our latest GPUs, watch the keynote with NVIDIA founder and CEO Jensen Huang. To learn even more, register for webinars on mixed-precision training or CUDA math libraries or read a detailed article that takes a deep dive into the NVIDIA Ampere architecture.

TF32 is among a cluster of new capabilities in the NVIDIA Ampere architecture, driving AI and HPC performance to new heights. For more details, check out our blogs on:

  • Our support for sparsity, driving up to 50 percent improvements for AI inference.
  • Double-precision Tensor Cores, speeding up HPC simulations and AI up to 2.5x
  • Multi-Instance GPU (MIG), supporting up to 7x in GPU productivity gains.
  • Or, see the web page describing the NVIDIA A100 GPU.



What’s a Recommender System?

Search and you might find.

Spend enough time online, however, and what you want will start finding you just when you need it.

This is what’s driving the internet right now.

They’re called recommender systems, and they’re among the most important applications today.

That’s because there is an explosion of choice and it’s impossible to explore the large number of available options.

If a shopper were to spend just one second each swiping through the two billion products available on one prominent ecommerce site’s mobile app, it would take 65 years — almost an entire lifetime — to go through the entire catalog.

This is one of the major reasons why the Internet is now so personalized; otherwise, it would simply be impossible for the billions of Internet users in the world to connect with the products, services, even expertise — among hundreds of billions of things — that matter to them.

They might be the most human, too. After all, what are you doing when you go to someone for advice? When you’re looking for feedback? You’re asking for a recommendation.

Now, driven by vast quantities of data about the preferences of hundreds of millions of individual users, recommender systems are racing to get better at doing just that.

The internet, of course, already knows a lot of facts: your name, your address, maybe your birthplace. But what recommender systems seek to learn, perhaps better than the people who know you, are your preferences.

Looking to get started with recommender systems? Read more about NVIDIA Merlin, NVIDIA’s application framework for deep recommender systems.

Key to Success of Web’s Most Successful Companies

Recommender systems aren’t a new idea. Jussi Karlgren formulated the idea of a recommender system, or a “digital bookshelf,” in 1990. Over the next two decades researchers at MIT and Bellcore steadily advanced the technique.

The technology really caught the popular imagination starting in 2007, when Netflix — then in the business of renting out DVDs through the mail — kicked off an open competition with a $1 million prize for a collaborative filtering algorithm that could improve on the accuracy of Netflix’s own system by more than 10 percent, a prize that was claimed in 2009.

Over the following decade, such recommender systems would become critical to the success of Internet companies such as Netflix, Amazon, Facebook, Baidu and Alibaba.

Virtuous Data Cycle

And the latest generation of deep-learning powered recommender systems provide marketing magic, giving companies the ability to boost click-through rates by better targeting users who will be interested in what they have to offer.

Now the ability to collect this data, process it, use it to train AI models and deploy those models to help you and others find what you want is among the largest competitive advantages possessed by the biggest internet companies.

It’s driving a virtuous cycle — with the best technology driving better recommendations, recommendations which draw more customers and, ultimately, let these companies afford even better technology.

That’s the business model. So how does this technology work?

Collecting Information

Recommenders work by collecting information — by noting what you ask for — such as what movies you tell your video streaming app you want to see, ratings and reviews you’ve submitted, purchases you’ve made, and other actions you’ve taken in the past.

Perhaps more importantly, they can keep track of choices you’ve made: what you click on and how you navigate. How long you watch a particular movie, for example. Or which ads you click on or which friends you interact with.

All this information is streamed into vast data centers and compiled into complex, multidimensional tables that quickly balloon in size.

They can be hundreds of terabytes large — and they’re growing all the time.

That’s not so much because vast amounts of data are collected from any one individual, but because a little bit of data is collected from so many.

In other words, these tables are sparse — most of the information most of these services have on most of us for most of these categories is zero.

But, collectively these tables contain a great deal of information on the preferences of a large number of people.

And that helps companies make intelligent decisions about what certain types of users might like.

Content Filtering, Collaborative Filtering

While there are a vast number of recommender algorithms and techniques, most fall into one of two broad categories: collaborative filtering and content filtering.

Collaborative filtering helps you find what you like by looking for users who are similar to you.

So while the recommender system may not know anything about your taste in music, if it knows you and another user share similar taste in books, it might recommend a song to you that it knows this other user already likes.

Content filtering, by contrast, works by understanding the underlying features of each product.

So if a recommender sees you liked the movies “You’ve Got Mail” and “Sleepless in Seattle,” it might recommend another movie to you starring Tom Hanks and Meg Ryan, such as “Joe Versus the Volcano.”

Those are extremely simplistic examples, to be sure.
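Still, the collaborative-filtering half of that idea can be sketched in a few lines of Python. The toy ratings table, user names and item names below are invented for illustration; production systems operate on millions of users with far more sophisticated models:

```python
from math import sqrt

# Toy user-item ratings; absent keys mean "not rated" (the table is sparse).
ratings = {
    "alice": {"book_a": 5, "book_b": 3, "song_x": 4},
    "bob":   {"book_a": 5, "book_b": 3},             # same book taste as alice
    "carol": {"book_a": 1, "book_b": 5, "song_y": 5},
}

def cosine(u, v):
    """Cosine similarity between two sparse rating vectors."""
    shared = set(u) & set(v)
    num = sum(u[i] * v[i] for i in shared)
    den = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0

def recommend(user):
    """Suggest items the most similar other user rated that `user` hasn't."""
    sim, nearest = max(
        (cosine(ratings[user], ratings[other]), other)
        for other in ratings if other != user
    )
    return [item for item in ratings[nearest] if item not in ratings[user]]

# Bob's book ratings match Alice's, so he is recommended her song.
print(recommend("bob"))
```

Note that the recommender never looks at what a song *is* — that would be content filtering; it only exploits the overlap between users, which is why a shared taste in books can surface a song.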

Data as a Competitive Advantage

In reality, because these systems capture so much data, from so many people, and are deployed at such an enormous scale, they’re able to drive tens or hundreds of millions of dollars of business with even a small improvement in the system’s recommendations.

A business may not know what any one individual will do, but thanks to the law of large numbers, they know that, say, if an offer is presented to 1 million people, 1 percent will take it.

But while the potential benefits from better recommendation systems are big, so are the challenges.

Successful internet companies, for example, need to process ever more queries, faster, spending vast sums on infrastructure to keep up as the amount of data they process continues to swell.

Companies outside of technology, by contrast, need access to ready-made tools so they don’t have to hire whole teams of data scientists.

If recommenders are going to be used in industries ranging from healthcare to financial services, they’ll need to become more accessible.

GPU Acceleration

This is where GPUs come in.

NVIDIA GPUs, of course, have long been used to accelerate training times for neural networks — sparking the modern AI boom — since their parallel processing capabilities let them blast through data-intensive tasks.

But now, as the amount of data being moved continues to grow, GPUs are being harnessed more extensively. Tools such as RAPIDS, a suite of software libraries for accelerating data science and analytics pipelines, let data scientists get more work done much faster.

And NVIDIA’s just-announced Merlin recommender application framework promises to make GPU-accelerated recommender systems more accessible still, with an end-to-end pipeline for ingesting, training and deploying them.

These systems will be able to take advantage of the new NVIDIA A100 GPU, built on our NVIDIA Ampere architecture, so companies can build recommender systems more quickly and economically than ever.

Our recommendation? If you’re looking to put recommender systems to work, now might be a good time to get started.

Looking to get started with recommender systems? Read more about NVIDIA Merlin, NVIDIA’s application framework for deep recommender systems.

Featured image credit: © Monkey Business – stock.adobe.com.


Double-Precision Tensor Cores Speed High-Performance Computing

What you can see, you can understand.

Simulations help us understand the mysteries of black holes and see how a protein spike on the coronavirus causes COVID-19. They also let designers create everything from sleek cars to jet engines.

But simulations are also among the most demanding computer applications on the planet because they require lots of the most advanced math.

Simulations make numeric models visual with calculations that use a double-precision floating-point format called FP64. Each number in the format takes up 64 bits inside a computer, making it one of the most computationally intensive of the many math formats today’s GPUs support.

As the next big step in our efforts to accelerate high performance computing, the NVIDIA Ampere architecture defines third-generation Tensor Cores that accelerate FP64 math by 2.5x compared to last-generation GPUs.

That means simulations that kept researchers and designers waiting overnight can be viewed in a few hours when run on the latest A100 GPUs.

Science Puts AI in the Loop

The speed gains open a door for combining AI with simulations and experiments, creating a positive-feedback loop that saves time.

First, a simulation creates a dataset that trains an AI model. Then the AI and simulation models run together, feeding off each other’s strengths until the AI model is ready to deliver real-time results through inference. The trained AI model also can take in data from an experiment or sensor, further refining its insights.

Using this technique, AI can define a few areas of interest for conducting high-resolution simulations. By narrowing the field, AI can slash by orders of magnitude the need for thousands of time-consuming simulations. And the simulations that need to be run will run 2.5x faster on an A100 GPU.


With FP64 and other new features, the A100 GPUs based on the NVIDIA Ampere architecture become a flexible platform for simulations, as well as AI inference and training — the entire workflow for modern HPC. That capability will drive developers to migrate simulation codes to the A100.

Users can call new CUDA-X libraries to access FP64 acceleration in the A100. Under the hood, these GPUs are packed with third-generation Tensor Cores that support DMMA, a new mode that accelerates double-precision matrix multiply-accumulate operations.

Accelerating Matrix Math

A single DMMA job uses one computer instruction to replace eight traditional FP64 instructions. As a result, the A100 crunches FP64 math faster than other chips and with less work, saving not only time and power but precious memory and I/O bandwidth as well.

We refer to this new capability as Double-Precision Tensor Cores. It delivers the power of Tensor Cores to HPC applications, accelerating matrix math in full FP64 precision.
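The matrix multiply-accumulate operation at the heart of a Tensor Core has simple reference semantics: D = A*B + C, computed over a small tile as one fused operation. A plain-Python sketch of those semantics (illustrative only; what the hardware actually fixes are the tile sizes and data types):

```python
def mma(A, B, C):
    """Reference semantics of a matrix multiply-accumulate: D = A @ B + C.
    A is m x k, B is k x n, C and the result D are m x n."""
    m, k, n = len(A), len(B), len(B[0])
    return [[C[i][j] + sum(A[i][p] * B[p][j] for p in range(k))
             for j in range(n)] for i in range(m)]

# One DMMA instruction performs a whole tile of this work in FP64;
# a small 2x2 example stands in here for the hardware's 8x8x4 tiles.
D = mma([[1.0, 2.0], [3.0, 4.0]],
        [[5.0, 6.0], [7.0, 8.0]],
        [[1.0, 0.0], [0.0, 1.0]])
print(D)  # [[20.0, 22.0], [43.0, 51.0]]
```

Fusing the multiply and the accumulate is what lets one instruction stand in for many, which is where the time, power and bandwidth savings described above come from.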

Beyond simulations, HPC apps called iterative solvers — algorithms with repetitive matrix-math calculations — will benefit from this new capability. These apps include a wide range of jobs in earth science, fluid dynamics, healthcare, material science and nuclear energy as well as oil and gas exploration.

To serve the world’s most demanding applications, Double-Precision Tensor Cores arrive inside the largest and most powerful GPU we’ve ever made. The A100 also packs more memory and bandwidth than any GPU on the planet.

The third-generation Tensor Cores in the NVIDIA Ampere architecture are beefier than prior versions. They support a larger matrix size — 8x8x4, compared to 4x4x4 for Volta — that lets users tackle tougher problems.

That’s one reason why an A100 with a total of 432 Tensor Cores delivers up to 19.5 FP64 TFLOPS, more than double the performance of a Volta V100.

Where to Go to Learn More

To get the big picture on the role of FP64 in our latest GPUs, watch the keynote with NVIDIA founder and CEO Jensen Huang. To learn more, register for the webinar or read a detailed article that takes a deep dive into the NVIDIA Ampere architecture.

Double-Precision Tensor Cores are among a battery of new capabilities in the NVIDIA Ampere architecture, driving HPC performance as well as AI training and inference to new heights. For more details, check out our blogs on:

  • Multi-Instance GPU (MIG), supporting up to 7x in GPU productivity gains.
  • TensorFloat-32 (TF32), a format speeding up AI training and certain HPC jobs up to 20x.
  • Our support for sparsity, accelerating math throughput 2x for AI inference.
  • Or, see the web page describing the A100 GPU.


How RDMA Became the Fuel for Fast Networks

Two chance encounters propelled remote direct memory access from a good but obscure idea for fast networks into the jet fuel for the world’s most powerful supercomputers.

The lucky breaks launched the fortunes of an Israel-based startup that staked its future on InfiniBand, a network based on RDMA. Later, that startup, Mellanox Technologies, helped steer RDMA into mainstream computing, today’s AI boom and the tech industry’s latest multibillion-dollar merger.

It started in August 2001.

D.K. Panda, a professor at Ohio State University, met Kevin Deierling, vice president of marketing for newly funded Mellanox. Panda explained how his lab’s work on software for the message-passing interface (MPI) was fueling a trend toward assembling clusters of low-cost systems for high performance computing.

The problem was that the fast networks they used, from small companies such as Myrinet and Quadrics, were proprietary and expensive.

It was an easy deal for a startup on a shoestring. For the price of a few of its first InfiniBand cards, some pizza and Jolt cola, grad students got MPI running on Mellanox’s chips. U.S. national labs, also eyeing InfiniBand, pitched in, too. Panda’s team debuted the open source MVAPICH at the Supercomputing 2002 conference.

“People got performance gains over Myrinet and Quadrics, so it had a snowball effect,” said Panda of the software that recently passed a milestone of half a million downloads.

Pouring RDMA on a Really Big Mac

The demos captured the imagination of an ambitious assistant professor of computer science from Virginia Tech. Srinidhi Varadarajan had research groups across the campus lining up to use an inexpensive 80-node cluster he had built, but that was just his warm-up.

An avid follower of the latest tech, Varadarajan set his sights on IBM’s PowerPC 970, a new 2-GHz CPU with leading floating-point performance, backed by IBM’s heritage of solid HPC compilers. Apple was just about to come out with its Power Mac G5 desktop using two of the CPUs.

With funding from the university and a government grant, Varadarajan placed an order with the local Apple sales office for 1,100 of the systems on a Friday. “By Sunday morning we got a call back from Steve Jobs for an in-person meeting on Monday,” he said.

From the start, Varadarajan wanted the Macs linked on InfiniBand. “We wanted the high bandwidth and low latency because HPC apps tend to be latency sensitive, and InfiniBand would put us on the leading edge of solving that problem,” he said.

Mellanox Swarms an InfiniBand Supercomputer

Mellanox gave its new and largest customer first-class support. CTO Michael Kagan helped design the system.

An architect on Kagan’s team, Dror Goldenberg, was having lunch at his desk at Mellanox’s Israel headquarters, near Haifa, when he heard about the project and realized he wanted to be there in person.

He took a flight to Silicon Valley that night and spent the next three weeks camped out in a secret lab at One Infinite Loop — Apple’s Cupertino, Calif., headquarters — where he helped develop the company’s first drivers for the Mac.

Dror Goldenberg with InfiniBand cabling inside the Big Mac, Virginia Tech’s System X, under construction.

“That was the only place we could get access to the Mac G5 prototypes,” said Dror of the famously secretive company.

He worked around the clock with Virginia Tech researchers at Apple and Mellanox engineers back in Israel, getting the company’s network running on a whole new processor architecture.

The next stage brought a challenge of historic proportions. Goldenberg and others took up residence for weeks in a Virginia Tech server room, stacking computers to unheard-of heights in the days when Amazon was mainly known as a rain forest.

“It was crashing, but only at very large scale, and we couldn’t replicate the problem at small scale,” he said. “I was calling people back at headquarters about unusual bit errors. They said it didn’t make sense,” he added.

Energized by a new class of computer science issues, they often worked 22-hour days. They took catnaps in sleeping bags in between laying 2.3 miles of InfiniBand cable in a race to break records on the TOP500 list of the world’s biggest supercomputers.

Linking 1,100 off-the-shelf computers into one, “we stressed the systems to levels no one had seen before,” he said.

“Today, I know when you run such large jobs heat goes up, fans go crazy and traffic patterns on the wire change, creating different frequencies that can cause bit flips,” said Goldenberg, now a seasoned vice president of software architecture with the world’s largest cloud service providers among his customers.

RDMA Fuels 10 Teraflops and Beyond

Varadarajan was “a crazy guy, writing object code even though he was a professor,” recalls Kagan.

Benchmarks plateaued around 9.3 teraflops, just shy of the target that gave the System X its name. “The results were changing almost daily, and we had two days left on the last weekend when a Sunday night run popped out 10.3 teraflops,” Varadarajan recalled.

The $5.2 million system was ranked No. 3 on the TOP500 in November 2003. The fastest system at the time, the Earth Simulator in Japan, had cost an estimated $350 million.

Srinidhi Varadarajan at the controls of Virginia Tech’s System X soon after it was installed.

“That was our debut in HPC,” said Kagan. “Now more than half the systems on the TOP500 use our chips with RDMA, and HPC is a significant portion of our business,” he added.

RDMA Born in November 1993

It was a big break for the startup and the idea of RDMA, first codified in a November 1993 patent by a team of Hewlett-Packard engineers.

Like many advances, the idea behind RDMA was simple, yet fundamental. If you give networked systems a way to access each other’s main memory without interrupting the processor or operating system, you can drive down latency, drive up throughput and simplify life for everybody.

The approach promised a way to eliminate a rat’s nest of back-and-forth traffic between systems that was slowing computers.

RoCE: RDMA Goes Mainstream

After its first success in supercomputers, baking RDMA into mature, mainstream Ethernet networks was a logical next step, but it took hard work and time. The InfiniBand Trade Association defined an initial version of RDMA over Converged Ethernet (RoCE, pronounced “rocky”) in 2010, and today’s more complete version that supports routing in 2014.

Mellanox helped write the spec and rolled RoCE into ConnectX, a family of chips that also support the high data rates of InfiniBand.

“That made them early with 25/50G and 100G Ethernet,” said Bob Wheeler, a networking analyst with The Linley Group. “When hyperscalers moved beyond 10G, Mellanox was the only game in town,” he said, noting OEMs jumped on the bandwagon to serve their enterprise customers.

IBM and Microsoft are vocal in their support for RoCE in their public clouds. And Alibaba recently “sent us a note that the network was great for its big annual Singles Day sale,” said Kagan.

Smart NICs: Chips Off the Old RDMA Block

Now that RDMA is nearly everywhere, it’s becoming invisible.

Today’s networks are increasingly defined by high-level software and virtualization that simplifies operations in part by hiding details of underlying hardware. Under the covers, RDMA is becoming just one of an increasing number of functions, many running on embedded processor cores on smart network-interface cards, aka smart NICs.

IBM details the benefits of RDMA and other networking offloads used on today’s smart NICs.

Today smart NICs enable adaptive routing, handle congestion management and increasingly oversee security. They “are essentially putting a computer in front of the computer,” continuing RDMA’s work of freeing up time on server CPUs, said Kagan.

The next role for tomorrow’s smart networks built on RDMA and RoCE is to accelerate a rising tide of AI applications.

Mellanox’s SHARP protocol already consolidates data from HPC apps distributed across dozens to thousands of servers. In the future, it’s expected to speed AI training by aggregating data from parameter servers such as updated weights from deep-learning models. Panda’s Ohio State lab continues to support such work with RDMA libraries for AI and big-data analytics.

“Today RDMA is an order of magnitude better and richer in its feature set for networking, storage and AI — scaling out to address the exponentially growing amount of data,” said Goldenberg.

Double Date: AI, RDMA Unite NVIDIA, Mellanox

The combination of AI and RDMA drew Mellanox and NVIDIA together into a $6.9 billion merger announced in March 2019. The deal closed on April 27, 2020.

NVIDIA first implemented RDMA in GPUDirect for its Kepler architecture GPUs and CUDA 5.0 software. Last year, it expanded the capability with GPUDirect Storage.

As supercomputers embraced Mellanox’s RDMA-powered InfiniBand networks, they also started adopting NVIDIA GPUs as accelerators. Today the combined companies power many supercomputers including Summit, named the world’s most powerful system in June 2018.

Similarities in data-intensive, performance-hungry HPC and AI tasks led the companies to collaborate on using InfiniBand and RoCE in NVIDIA’s DGX systems for training neural networks.

Now that Mellanox is part of NVIDIA, Kagan sees his career coming full circle.

“My first big design project as an architect was a vector processor, and it was the fastest microprocessor in the world in those days,” he said. “Now I’m with a company designing state-of-the-art vector processors, helping fuse processor and networking technologies together and driving computing to the next level — it’s very exciting.”

“Mellanox has been leading RDMA hardware delivery for nearly 20 years, it’s been a pleasure working with Michael and Dror throughout that time, and we look forward to what is coming next,” said Renato Recio, an IBM Fellow in networking who helped define the InfiniBand standard.

The post How RDMA Became the Fuel for Fast Networks appeared first on The Official NVIDIA Blog.

How We Took Our Biggest Event of the Year Digital – and What We Learned

There isn’t anyone in the world who is going to come out of 2020 without a story to share about the global pandemic.

We’ve got a story to share, too, along with some hard-won lessons about what works, and what doesn’t, in our new socially distanced reality as we shifted our GPU Technology Conference online.

For most of a year we aimed to draw 10,000 attendees to the San Jose Convention Center in late March for GTC. It’s been NVIDIA’s flagship event since 1,500 people gathered for the first conference in 2009.

Over those 11 years, we had created 40 events worldwide. So we were in sight of a target we knew how to hit, in ordinary times.

In the last days of February, with COVID-19 spreading worldwide, the company made the call: Cancel the physical event, create the best possible version of it online and call it GTC Digital.

Time to improvise.

My team had never created a fully digital event at anything close to this scale. We didn’t know what tools we’d need. We had no idea how partners and attendees geared up for a physical event would respond. And we certainly didn’t know that soon we would all be working remotely.

That’s when we learned our first lesson: people with the skills to put an event together offline can succeed online, too. Registrations grew to more than 4x our original goal. By April 3, the online event attracted more than 80,000 views. Attendees logged in from nearly every corner of the planet.

The second lesson: if you’re going to ask people to improvise, you need to keep it simple. When the decision to go digital came down, the GTC team made three big decisions to hit those targets — with just three weeks to scramble — while working from home and taking care of kids locked out of schools amid a global healthcare crisis:

  1. Don’t look back: We chose to unwind the physical conference entirely. We issued cancellations to every supplier and gave full refunds to every attendee and sponsor. It would be a clean break.
  2. Leverage what you have: We looked for platforms and skills we had in place to support a virtual event with a broad, global audience.
  3. Communicate the plan: We alerted all 1,300 speakers, asking if they’d still participate.

It was a recipe for more than a few sleepless nights.

“I cried, I won’t lie,” said Jasmin Dave, who has led operations, sales and marketing for all 40 GTC events around the world over the past decade.

“You put your heart and soul into working on an event for an entire year or even more. When it ends without coming to life as a sold-out show floor with amazing demos and record attendance, it’s hard,” she said.

A Tough Deadline Can Be a Blessing 

That first Monday morning in March, 50 people crammed into a conference room, some sitting on the floor, with dozens more on the phone. All had the same question: What’s GTC Digital?

For starters, there wouldn’t be a show floor, but there would be plenty of content. Several hundred talks, tutorials, panels and hands-on training sessions would be transformed into live online events, podcasts and on-demand recordings of talks with slides.

In some ways, keeping to the event’s deadline was a blessing. “Speakers already scheduled time to create and give talks, so 80 percent got on board for the online event,” said Christina Hicks, who heads up content for GTC. “Ultimately, more than 800 speakers agreed to record and share their talks — we were absolutely thrilled.”

Lessons in Online Economics

We made an early choice that GTC Digital would be free. We weren’t sure how many presenters would participate. And with most of the world sheltering at home, online learning from world-class experts seemed like an ideal distraction and a way to build community.

In hindsight, free had its costs. For live events with limited capacity, a nominal fee could have prevented overbooking and minimized drop-offs.

We invested the time it takes to deliver a quality experience. As it turned out, the online event required an even larger time investment than the physical one.

The team published a guide for recording talks using tools presenters would have at hand or could easily download. The weekend before the event opened, the team edited and processed 150 videos.

Some of the videos were top notch; one even used a green screen. Some benefited from digital polish applied by our A/V crew to silence barking dogs and enhance muffled audio.

An existing GTC On Demand service for storing and sharing videos after live events was repurposed for hosting almost all parts of GTC Digital.

Virtual Events Require Real Infrastructure 

Some parts were painful, like connecting all of the different platforms into one central session catalog tied to the user’s registration status. With just two weeks, staff had to manually enter some registration data, work we ordinarily would have automated.

It was one of several unexpected hurdles the team encountered.

The first week of GTC Digital, internet access slowed to a crawl in an all-day traffic jam of a world working from home. The team had to find backups for speakers and staff in the event they lost their connections.

Service providers felt the surge in demand; one requested a heads-up on sessions that might exceed 2,000 attendees. It was critical to stay in close contact with them and to have alternatives in place.

Live events online require more people and tools than physical events where you can see what’s going on. We used Slack as a real-time back channel with speakers in case anything went awry. NVIDIA’s IT department stepped up, offering to create backups for live sessions in case a primary platform failed.

Working Virtually Doesn’t Make Anything Friction Free 

Some things that are assumed for physical events don’t work for global online audiences, like set lunch breaks.

In hindsight, it didn’t make sense to display live-session times only in the local time of the team in Silicon Valley. The session catalog should calculate and display them in an attendee’s local time.

And we discovered that attendees preferred interacting through the desktop catalog rather than the mobile app.

Some of the biggest challenges were non-technical.

Working from home in a time of crisis made collaboration an order of magnitude harder.

Chance hallway encounters that often resolved problems in the office were now gone. In their place, kids needed to be home-schooled, fed, put to bed and hugged.

A note to global employees from our CEO made it clear: “Prioritize your family. Our work will wait. Let your colleagues know. They will rise to support you.”

They did.

Don’t Be Afraid to Ask for Help

Early on, we took a close inventory of the tools we had at hand. And people stepped up to help across the company and beyond it.

A two-person team proficient with the external service used for NVIDIA’s webinars offered to leverage it to serve up GTC Digital sessions.

Another duo that produces the company’s AI Podcast offered to host panels and interviews with key speakers.

NVIDIA’s Deep Learning Institute (DLI) was in early testing with Kaltura’s Newrow software for hosting live online events. It supports a feature we wanted — breakout rooms for attendees to get one-on-one help from instructors — but now our timetable was extraordinarily tight.

“The team scrambled to pull in the rollout of the platform by six weeks,” said Will Ramey, who oversees the DLI.

Kaltura stepped up, getting dozens of DLI instructors on Newrow and responding rapidly to attendees when they had issues during online sessions. The platform also turned out to be a good fit for “Connect with the Experts” sessions that give attendees one-on-one time with NVIDIA’s engineers.

With virtual classrooms ready, we asked partners for help ensuring attendees the best experience possible. Microsoft Azure provided access to GPU servers so DLI attendees worldwide could get hands-on training in deep learning.

Through it all, the GTC team took time to celebrate accomplishments of the day with wine toasts over Webex.

GTC Digital is still ongoing. New content arrives each Thursday through April 23.

So far, the numbers are solid. As of this week, the event had:

  • 45,000+ registered attendees
  • 300+ recorded talks
  • 148 research posters
  • 86 hours of DLI instructor-led training
  • 21 webinars
  • 39 Connect with the Experts live Q&A sessions
  • 9 demos
  • 6 podcasts

Tens of thousands have attended live sessions or downloaded on-demand content, nearly 60 percent of them from outside North America. Almost all of the DLI training and “Connect with the Experts” sessions were sold out.

The team is still sorting through feedback. Anecdotally, a partner said, “we should all be taking notes on how NVIDIA is doing this.”

One speaker offered the team thanks “for finding grounds to organize the event even in the midst of the times we’re facing!”

In sessions I attended, people expressed gratitude for something positive they could do while sheltering at home.

Some said they got the professional development they had made time for on their calendars, but didn’t expect during the pandemic. Students who couldn’t afford to travel to Silicon Valley thanked us for the virtual experience.

“It was the right thing to do,” said Dave. “Our theme of keeping it simple and executing well paid off.”

Looking forward, the company believes it may be on to something. With the surprising success of our online experience and some important lessons learned, we’ll be ready to create more digital events. There’s little doubt that’s the trend we’ll be seeing in the near future.

The post How We Took Our Biggest Event of the Year Digital – and What We Learned appeared first on The Official NVIDIA Blog.

What Is Active Learning?

Reading one book on a particular subject won’t make you an expert. Nor will reading multiple books containing similar material. Truly mastering a skill or area of knowledge requires lots of information coming from a diversity of sources.

The same is true for autonomous driving and other AI-powered technologies.

The deep neural networks responsible for self-driving functions require exhaustive training, both in situations they’re likely to encounter during daily trips and in unusual ones they’ll hopefully never come across. The key to success is making sure they’re trained on the right data.

What’s the right data? Situations that are new or uncertain. No repeating the same scenarios over and over.

Active learning is a training data selection method for machine learning that automatically finds this diverse data. It builds better datasets in a fraction of the time it would take for humans to curate.

It works by employing a trained model to go through collected data, flagging frames it’s having trouble recognizing. These frames are then labeled by humans and added to the training data, increasing the model’s accuracy in situations like perceiving objects in tough conditions.
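As a rough illustration, one selection round might look like the sketch below. The model, frame records, confidence scores and threshold are all invented stand-ins, not NVIDIA's pipeline:

```python
# Minimal sketch of one active-learning round. confidence() stands in for a
# trained DNN's top-class probability on a frame; all names are illustrative.
def confidence(frame):
    return frame["score"]

def select_for_labeling(unlabeled_frames, threshold=0.6):
    # Flag the frames the model is least sure about -- these are the
    # valuable, "hard" examples worth sending to human annotators.
    return [f for f in unlabeled_frames if confidence(f) < threshold]

pool = [
    {"id": "highway_clear", "score": 0.98},
    {"id": "cyclist_night", "score": 0.41},
    {"id": "stilts_crossing", "score": 0.22},
]

hard_frames = select_for_labeling(pool)
# Human annotators label hard_frames, then the frames join the training
# set and the model is retrained -- closing the loop.
```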

Finding the Needle in the Data Haystack

The amount of data needed to train an autonomous vehicle is enormous. Experts at RAND estimate that vehicles need 11 billion miles of driving to perform just 20 percent better than a human. This translates to more than 500 years of nonstop driving in the real world with a fleet of 100 cars.
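A quick back-of-the-envelope check shows how those two figures relate. The average fleet speed is our assumption (the article doesn't state one); around 25 mph makes the numbers agree:

```python
# Sanity-checking the RAND-derived figures: 11 billion miles spread across a
# 100-car fleet driving nonstop. avg_mph is an assumed mixed city/highway pace.
miles_needed = 11e9
fleet_size = 100
avg_mph = 25  # assumption, not from the article

hours_per_car = miles_needed / fleet_size / avg_mph
years_nonstop = hours_per_car / (24 * 365)  # about 500 years, matching the article
```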

And not just any driving data will do. Effective training data must contain diverse and challenging conditions to ensure the car can drive safely.

If humans were to annotate this validation data to find these scenarios, the 100-car fleet driving just eight hours a day would require more than 1 million labelers to manage frames from all the cameras on the vehicle — a gargantuan effort. In addition to the labor cost, the compute and storage resources needed to train DNNs on this data would be infeasible.

The combination of data annotation and curation poses a major challenge to autonomous vehicle development. By applying AI to this process, it’s possible to cut down on the time and cost spent on training, while also increasing the accuracy of the networks.

Why Active Learning

There are three common methods to selecting autonomous driving DNN training data. Random sampling extracts frames from a pool of data at uniform intervals, capturing the most common scenarios but likely leaving out rare patterns.

Metadata-based sampling uses basic tags (for example, rain, night) to select data, making it easy to find commonly encountered difficult situations, but missing unique frames that aren’t easily classified, like a tractor trailer or man on stilts crossing the road.

Caption: Not all data is created equal. Example of a common highway scene (top left) vs. some unusual driving scenarios (top right: cyclist doing a wheelie at night; bottom left: truck towing a trailer towing a quad; bottom right: pedestrian on jumping stilts).

Finally, manual curation uses metadata tags combined with visual browsing by human annotators — a time-consuming task that can be error-prone and difficult to scale.

Active learning makes it possible to automate the selection process while choosing valuable data points. It starts by training a dedicated DNN on already-labeled data. The network then sorts through unlabeled data, selecting frames that it doesn’t recognize, thereby finding data that would be challenging to the autonomous vehicle algorithm.

That data is then reviewed and labeled by human annotators, and added to the training data pool.

Active learning has already shown it can improve the detection accuracy of self-driving DNNs over manual curation. In our own research, we’ve found that the increase in precision when training with active learning data can be 3x for pedestrian detection and 4.4x for bicycle detection relative to the increase for data selected manually.

Advanced training methods like active learning, as well as transfer learning and federated learning, are most effective when run on a robust, scalable AI infrastructure. This makes it possible to manage massive amounts of data in parallel, shortening the development cycle.

NVIDIA will be providing developers access to these training tools as well as our rich library of autonomous driving deep neural networks on the NVIDIA GPU Cloud container registry.

The post What Is Active Learning? appeared first on The Official NVIDIA Blog.

AI Calling: How to Kickoff a Career in Data Science

Paul Mahler remembers the day in May 2013 he decided to make the switch.

The former economist was waiting at a bus stop in Washington, D.C., reading the New York Times on his smartphone. He was struck by the story of a statistics professor who wrote an app that let computers review screenplays. It launched the academic into a lucrative new career in Hollywood.

“That seemed like a monumental breakthrough. I decided I wanted to get into data science, too,” said Mahler. Today, he’s a senior data scientist in Silicon Valley, helping NVIDIA’s customers use AI to make their own advances.

Like Mahler, Eyal Toledano made a big left turn a decade into his career. He describes “an existential crisis … I thought if I have any talent, I should try to do something I’m really proud of that’s bigger than myself and even if I fail, I will love every minute,” he said.

Then “an old friend from my undergrad days told me about his diving accident in a remote area and how no one could read his X-rays. He said we should build a database of images [using AI] to facilitate diagnoses in situations where people need this help — it was the first time I devoted myself to a seed of an idea that came from someone else,” Toledano recalled.

The two friends co-founded Zebra Medical Vision in 2014 to apply AI to medical imaging. For Toledano, there was only one way into the emerging field of deep learning.

“Roll up your sleeves, shovel some dirt and join the effort, that’s what helped me — in data science, you really need to get dirty,” he said.

Plenty of Room in the Sandbox

The field is still wide open. Data scientist tops the list of best jobs in America, according to a 2019 ranking from Glassdoor, a service that connects 67 million monthly visitors with 12 million job postings. It pegged median base salary for an entry-level data scientist at $108,000, job satisfaction at 4.3 out of 5 and said there are 6,510 job openings.

The job of data engineer was not far behind at $100,000, 4.2 out of 5 and 4,524 openings.

A 2018 study by recruiters at Burtch Works adds detail to the picture. It estimated starting salaries range from $95,000 to $168,000, depending on skill level. Data scientists come to the job with a wide range of academic backgrounds including math/statistics (25%), computer science and physical science (20% each), engineering (18%) and general business (8%). Nearly half had Ph.D.s and 40 percent held master’s degrees.

“Now that data is the new oil, data science is one of the most important jobs,” said Alen Capalik, co-founder and chief executive of startup FASTDATA.io, a developer of GPU software backed in part by NVIDIA. “Demand is incredible, so the unemployment in data science is zero.”

Like Mahler and Toledano, Capalik jumped in head first. “I just read a lot to understand data, the data pipeline and how customers use their data — different verticals use data differently,” he said.

The Nuts and Bolts

Data scientists are hybrid creatures. Some are statisticians who learned to code. Some are Python wizards learning the nuances of data analytics and machine learning. Others are domain experts who wanted to be part of the next big thing in computing.

All face a common flow of tasks. They must:

  • Identify business problems suited for big data
  • Set up and maintain tool chains
  • Gather large, relevant datasets
  • Structure datasets to address business concerns
  • Select an appropriate AI model family
  • Optimize model hyperparameters
  • Postprocess machine learning models
  • Critically analyze the results

“The unicorn data scientists do it all, from setting up a server to presenting to the board,” said Mahler.

But the reality is the field is quickly segmenting into subtasks. Data engineers work on the frontend of the process, massaging datasets through the so-called extract, transform and load process.
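A toy version of that extract, transform and load flow, using only Python's standard library; the purchases schema and the raw data are invented for illustration:

```python
# Extract rows from a CSV source, transform them into clean typed values,
# and load them into a database -- the classic ETL pass in miniature.
import csv
import io
import sqlite3

raw = io.StringIO("user_id,amount\n1,10.5\n2,  3.0\n2,7.25\n")  # extract source

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (user_id INTEGER, amount REAL)")

for row in csv.DictReader(raw):                       # extract
    cleaned = (int(row["user_id"]),
               float(row["amount"].strip()))          # transform: types, trim
    conn.execute("INSERT INTO purchases VALUES (?, ?)", cleaned)  # load

total = conn.execute("SELECT SUM(amount) FROM purchases").fetchone()[0]
```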

Big operations may employ data librarians, privacy experts and AI pipeline engineers who ensure systems deliver time-sensitive recommendations fast.

“The proliferation of titles is another sign the field is maturing,” said Mahler.

Play a Game, Learn the Job

One of the fastest, most popular ways into the field is to have some fun with AI by entering Kaggle contests, said Mahler. The online matches provide forums with real-world problems and code examples to get started. “People on our NVIDIA RAPIDS product team are continually competing in Kaggle contests,” he said.

Wins can lead to jobs, too. Owkin, an NVIDIA partner that designs AI software for healthcare, declares on its website, “Our data scientists are among the best in the world, with several Kaggle Masters.”

These days, at least some formal study is recommended. Online courses from fast.ai aim to give experienced programmers a jumpstart into deep learning. Co-founder Rachel Thomas maintains a list of her talks encouraging everyone, especially women, to get into data science.

We compiled our own list of online courses in data science given by the likes of MIT, Google and NVIDIA’s Deep Learning Institute.

“Having a strong grasp of linear algebra, probability and statistical modeling is important for creating and interpreting AI models,” said Mahler. “A lot of employers require a degree in data or computer science and a strong understanding of Python,” he added.

“I was never one to look for degrees,” countered Capalik of FASTDATA.io. “Having real-world experience is better because the first day on a job you will find out things people never showed you in school,” he said.

Both agreed the best data scientists have a strong creative streak. And employers covet data scientists who are imaginative problem solvers.

Getting Picked for a Job

One startup gives job candidates a test of technical skills, but the test is just part of the screening process, said Capalik.

“I like to just look someone in the eye and ask a few questions,” he said. “You want to know if they are a problem solver and can work with a team because data science is a team effort — even Michael Jordan needed a team to win,” he said.

To pass the test and get an interview with Capalik, “you need to know what the data pipeline looks like, how data is collected, where it’s stored and how to work around the nuances and inefficiencies to solve problems with algorithms,” he said.

Toledano of Zebra is suspicious of candidates with pat answers.

“This is an experimental science,” he said. “The results are asymptotic to your ability to run many experiments, so you need to come up with different pieces and ideas quickly and test them in training experiments over and over again,” he said.

“People who want to solve a problem once might be very intelligent, but they will probably miss things. Don’t build a bow and arrow, build a catapult to throw a gazillion arrows — each one a potential solution you can evaluate quickly,” he added.

Chris Rowen, a veteran entrepreneur and chief executive of AI startup BabbleLabs, is impressed by candidates who can explain their work. “Understand the theory about why models work on which problems and why,” he advised.

The Developer’s Path

Unlike the pure digital world of IT where answers are right or wrong, data science challenges often have no definitive answer, so they invite the curious who like to explore options and tradeoffs.

Indeed, IT and data science are radically different worlds.

IT departments use carefully structured processes to check code in and out and verify compliance. They write apps once that may be used for years. Data science teams, on the other hand, conduct experiments continuously with models based on probability curves and frequently massage models and datasets.

“Software engineering is more of a straight line while data science is a loop,” said James Kobielus, a veteran market watcher and lead AI analyst at Wikibon.

That said, it’s also true that “data science is the core of the next-generation developer, really,” Kobielus said. Although many subject matter experts jump into data science and learn how to code, “even more people are coming in from general app development,” he said, in part because that’s where the money is these days.

Clouds, Robots and Soft Skills

Whatever path you take, data scientists need to be familiar with the cloud. Many AI projects are born on remote servers using containers and modern orchestration techniques.

And you should understand the latest mobile and edge hardware and its constraints.

“There’s a lot of work going on in robotics using trial-and-error algorithms for reinforcement learning. This is beyond traditional data science, so personnel shortages are more acute there — and computer vision in cameras could not be hotter,” Kobielus said.

A diplomat’s skills for negotiation comes in handy, too. Data scientists are often agents of change, disrupting jobs and processes, so it’s important to make allies.

A Philosophical Shift

It sounds like a lot of work, but don’t be intimidated.

“I don’t know that I’ve made such a huge shift,” said Rowen of BabbleLabs, his first startup to leverage data science.

“The nomenclature has changed. The idea that the problem’s specs are buried in the data is a philosophical shift, but at the root of it, I’m doing something analogous to what I’ve done in much of my career,” he said.

In the past Rowen explored the “computational profile of a problem and found the processor to make it work. Now, we turn that upside down. We look at what’s at the heart of a computation and what data we need to do it — that insight carried me into deep learning,” he said.

In a May 2018 talk, fast.ai co-founder Thomas was equally encouraging. Using transfer learning, you can do excellent AI work by training just the last few layers of a neural network, she said. And you don’t always need big data. For example, one system was trained to recognize images of baseball vs. cricket using just 30 pictures.

“The world needs more people in AI, and the barriers are lower than you thought,” she added.
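The last-few-layers idea can be sketched with a toy stand-in (nothing here is a real framework API): a frozen feature function plays the role of the pretrained layers, and only the final layer's weights are updated:

```python
# Transfer-learning sketch: backbone() is "frozen" -- its behavior never
# changes during fine-tuning -- while the head weights w are trained with
# simple perceptron updates. The task and data are toys for illustration.

def backbone(x):
    # Stand-in for frozen pretrained layers: fixed features, no learning here.
    return [1.0, x, x * x]

# Toy task: label 1 when |x| > 1. Only the head weights w are learned.
data = [(-2.0, 1), (-0.5, 0), (0.3, 0), (1.8, 1)]
w = [0.0, 0.0, 0.0]

for _ in range(1000):                       # train the last layer only
    for x, y in data:
        f = backbone(x)
        p = 1 if sum(wi * fi for wi, fi in zip(w, f)) > 0 else 0
        if p != y:
            w = [wi + 0.1 * (y - p) * fi for wi, fi in zip(w, f)]

pred = lambda x: 1 if sum(wi * fi for wi, fi in zip(w, backbone(x))) > 0 else 0
accuracy = sum(pred(x) == y for x, y in data) / len(data)
```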

The post AI Calling: How to Kickoff a Career in Data Science appeared first on The Official NVIDIA Blog.

Read ‘em and Reap: 6 Success Factors for AI Startups

Now that data is the new oil, AI software startups are sprouting across the tech terrain like pumpjacks in Texas. A whopping $80 billion in venture capital is fueling as many as 12,000 new companies.

Only a few will tap a gusher. Those who do, experts say, will practice six key success factors.

  1. Master your domain
  2. Gather big data fast
  3. See (a little) ahead of the market
  4. Make a better screwdriver
  5. Scale across the clouds
  6. Stay flexible

Some of the biggest wins will come from startups with AI apps that “turn an existing provider on its head by figuring out a new approach for call centers, healthcare or whatever it is,” said Rajeev Madhavan who manages a $300 million fund at Clear Ventures, nurturing nine AI startups.

1. Master Your Domain

Madhavan sold his electronic design automation startup Magma Design in 2012 to Synopsys for $523 million. His first stop on the way to becoming a VC was to take Andrew Ng’s Stanford course in AI.

“For a brief period in Silicon Valley every startup’s pitch would just throw in jargon on AI, but most of them were just doing collaborative filtering,” he said. “The app companies we look for have to be heavy on AI, but success comes down to how good a startup is in its domain space,” he added.

Chris Rowen agrees. The veteran entrepreneur who in 2013 sold his startup Tensilica to Cadence Design for $380 million considers domain expertise the top criteria for an AI software startup’s success.

Rowen’s latest startup, BabbleLabs, uses AI to filter noise from speech in real time. “At the root of it, I’m doing something analogous to what I’ve done in much of my career — work on really hard real-time computing problems that apply to mass markets,” Rowen said.

Overall, “deep learning is still at the stage where people are having challenges understanding which problems can be handled with this technique. The companies that recognize a vertical-market need and deliver a solution for it have a bigger chance of getting early traction. Over time, there will be more broad, horizontal opportunities,” he added.

Jeff Herbst nurtures more than 5,000 AI startups under the NVIDIA Inception program that fuels entrepreneurs with access to its technology and market connections. But the AI tag is just shorthand.

In a way, it’s like a rerun of The Invasion of the DotComs. “We call them AI companies today, but they are all in specialized markets — in the not-so-distant future, every company will be an AI company,” said Herbst, vice president of business development at NVIDIA.

Today’s AI software landscape looks like a barbell to Herbst. Lots of activity by a handful of cloud-computing giants at one end and a bazillion startups at the other.

2. Get Big Data Fast

Collecting enough bits to fill a data lake is perhaps the hardest challenge for an AI startup.

Among NVIDIA’s Inception startups, Zebra Medical Vision uses AI on medical images to make faster, smarter diagnoses. To get the data it needed, it partnered both with Israel’s largest healthcare provider as well as Intermountain Healthcare, which manages 215 clinics and 24 hospitals in the U.S.

“We understood data was the most important asset we needed to secure, so we invested a lot in the first two years of the startup not only in data but also in developing all kinds of algorithms in parallel,” said Eyal Toledano, co-founder and CTO of Zebra. “To find one good clinical solution, you have to go through many candidates.”

Getting access to 20 years of digital data from top drawer healthcare organizations “took a lot of convincing” both from Zebra’s chief executive and Toledano.

“My contribution was showing how security, compliance and anonymity could be done. There was a lot of education and co-development so they would release the data and we could do research that could contribute back to their patient population in return,” he added.

It’s working. To date Zebra has raised $50 million, received FDA approvals on three products with two more pending “and a few other submissions are on the way,” he said.

Toledano also gave kudos to NVIDIA’s Inception program.

“We had many opportunities to examine new technologies before they became widely used. We saw the difference in applying new GPUs to current processes, and looked at inference in the hospital with GPUs to improve the user experience, especially in time-critical applications,” he said.

“We also got some good know-how and ideas to improve our own infrastructure with training and infrastructure libraries to build projects. We tried quite a lot of the NVIDIA technologies and some were really amazing and fruitful, and we adopted a DGX server and decreased our development and training time substantially in many evaluations,” he added.

Six Steps to AI Startup Gold

Success Factor | Call to Action | Startups Using It
Master your domain | Have deep expertise in your target application | BabbleLabs
Gather big data fast | Tap partners, customers to gather data and refine models | Zebra Medical Vision, Scale
See (a little) ahead of the market | Find solutions to customer pain points before rivals see them | FASTDATA.io, Netflix
Make a better screwdriver | Create tools that simplify the work of data scientists | Scale, Dataiku
Scale across the clouds | Support private and multiple public cloud services | Robin.io
Stay flexible | Follow changing customer pain points to novel solutions | Keyhole Corp.

Another Inception startup, Scale, which provides training and validation data for self-driving cars and other platforms, got on board with Toyota and Lyft. “Working with more people makes your algorithms smarter, and then more people want to work with you — you get into a cycle of success,” said Herbst.

Reflektion, one of Madhavan’s startups, now has a database of 200 million unique shoppers, the third largest retail database after Amazon and Walmart. It started with zero. Getting big took three years and a few great partners.

Rowen’s BabbleLabs applied a little creativity and elbow grease to get a lot of data cheaply and fast. It siphoned speech data from free sources as diverse as YouTube and the Library of Congress. When it needed specialized data, it activated a network of global contractors “quite economically,” he said.

“You can find low-cost, low-quality data sources, then use algorithms to filter and curate the data. Controlling the amount of noise associated with the speech helped simplify training,” he added.

“In AI, access to data no one else has is the big win,” said Herbst. “The world has a lot of open source frameworks and tools, but a lot of the differentiation comes from proprietary access to the data that does the programming,” he added.

When seeking data-rich customers and partners “the fastest way to get in the door is knowing what their pain points are,” said Alen Capalik, founder of FASTDATA.io.

Work in high-frequency trading on Wall Street taught Capalik the value of GPUs. When he came up with an idea for using them to ingest real-time data fast for any application, he sought out Herbst at NVIDIA in 2017.

“He almost immediately wrote me a check for $1.5 million,” Capalik said.

3. See (a Little) Ahead of the Market

Today, FASTDATA.io is poised for a Series A financing round to fuel its recently released PlasmaENGINE, which already has two customers and over 20 more in the pipeline. “I think we are 12-18 months ahead of the market, which is a great spot to be in,” said Capalik, whose product can process as much data as 100 Spark instances.

That wasn’t the position Capalik found himself in his last time out. His cybersecurity startup — GoSecure, formerly CounterTack — pioneered the idea of end-point threat detection as much as six years before it caught on.

“People told me I was crazy. Palo Alto Networks and FireEye were doing perimeter security, and users thought they’d never install agents again because they slowed systems down. So, we struggled for a while and had to educate the market a lot,” he said.

Education and awareness are the kinds of jobs established corporations tackle. For startups, being visionary is like Steve Jobs unveiling an iPhone — “show them what they didn’t know they wanted,” he said.

“Netflix went after video streaming before there was enough bandwidth or end points — they skated to where the puck was going,” said Herbst.

4. Make a Better Screwdriver

AI holds opportunities for arms dealers, too — the kind who sell the software tools data scientists use to tighten down the screws on their neural networks.

The current Swiss Army knife of AI is the workbench. It’s a software platform for developing and deploying machine-learning models in today’s DevOps IT environment.

Jupyter notebooks could be seen as a sort of two-blade model you get for free as open source. Giants such as AWS, IBM and Microsoft and dozens of startups such as H2O.ai and Dataiku are rolling out versions with more forks, corkscrews and toothpicks.

Despite all the players and a fast-moving market, there are still opportunities here, said James Kobielus, a lead analyst for AI and data science at Wikibon. Start as a plug-in for a popular workbench, he suggested.

Startups can write modules to support emerging frameworks and languages, or a mod to help a workbench tap into the AI goodness embedded in the latest smartphones. Alternatively, you can automate streaming operations or render logic automatically into code, the former IBM data-science evangelist advised.

If workbenches aren’t for you, try robotic process automation, another emerging category trying to make AI easier for more people to use. “You can clean up if you can democratize RPA for makers and kids — that’s exciting,” Kobielus said.

There’s a wide-open opportunity for tools that cram neural nets into the kilobytes of memory on devices such as smart speakers, appliances and even thermostats, BabbleLabs’ Rowen said. His company aims to run its speech models on some of the world’s smallest microcontrollers.

“We need compilers that take trained models and do quantization, model compression and optimized model generation to fit into the skinny memory of embedded systems — nothing solves this problem yet,” he said.

5. Expand Across the Clouds

The playing field is very competitive with more startups than ever because it’s easier than ever to start a company, said Herbst, who worked closely with entrepreneurs as a corporate and IP attorney even before he joined NVIDIA 18 years ago.

All you need to get started today is an idea, a laptop, a cup of coffee and a cloud-computing account. “All the infrastructure is a service now,” he said.

But if you get lucky and scale, that one cloud-computing account can become a bottleneck and your biggest cost after payroll.

“That’s a good problem to have, but to hit breakeven and make it easier for customers, you need your software running on any cloud,” said Madhavan.

The need is so striking, he wound up funding a startup to address it. Robin.io is an expert in stateful and stateless workloads, helping companies become cloud-agnostic. “We have been extremely successful with 5G telcos going cloud native and embracing containers,” he said.

6. Stay Flexible as a Yogi

Few startups wind up where they thought they were going. Apple planned to make desktop computers; Amazon aimed to sell books online.

Over time “they pivot one way or another. They go in with a problem to solve, but as they talk to customers the smart ones learn from those interactions how to re-target or tailor themselves,” said Herbst, who gives an example from his pre-AI days.

Keyhole Corp. wanted to provide 3D mapping services, initially for real estate agents and other professionals. Its first product was distributed on CDs.

As a veteran of early search startup AltaVista, “I thought this startup belonged more to a Yahoo! or some other internet company. I realized it was not a professional but a major consumer app,” said Herbst, who was happy to fund them as one of NVIDIA’s first investments outside gaming.

In time, Google agreed with Herbst and acquired the company. Keyhole’s technology became part of the underpinnings of Google Maps and Google Earth.

“They had a nice exit, their people went on to have rock-star careers at Google, and I believe were among the original creators of Pokemon Go,” he said.

The lesson is simple: Follow good directions — like the six success factors for AI software startups — and there’s no telling where you may end up.

The post Read ‘em and Reap: 6 Success Factors for AI Startups appeared first on The Official NVIDIA Blog.

What’s the Difference Between Single-, Double-, Multi- and Mixed-Precision Computing?

There are a few different ways to think about pi. As apple, pumpkin and key lime … or as the different ways to represent the mathematical constant π: 3.14159, or, in binary, a long line of ones and zeroes.

An irrational number, pi has decimal digits that go on forever without repeating. So when doing calculations with pi, both humans and computers must pick how many decimal digits to include before truncating or rounding the number.

In grade school, one might do the math by hand, stopping at 3.14. A high schooler’s graphing calculator might go to 10 decimal places — using a higher level of detail to express the same number. In computer science, that’s called precision. Rather than decimals, it’s usually measured in bits, or binary digits.

For complex scientific simulations, developers have long relied on high-precision math to understand events like the Big Bang or to predict the interaction of millions of atoms.

Having more bits or decimal places to represent each number gives scientists the flexibility to represent a larger range of values, with room for a fluctuating number of digits on either side of the decimal point during the course of a computation. With this range, they can run precise calculations for the largest galaxies and the smallest particles.

But the higher precision level a machine uses, the more computational resources, data transfer and memory storage it requires. It costs more and it consumes more power.

Since not every workload requires high precision, AI and HPC researchers can benefit by mixing and matching different levels of precision. NVIDIA Tensor Core GPUs support multi- and mixed-precision techniques, allowing developers to optimize computational resources and speed up the training of AI applications and those apps’ inferencing capabilities.

Difference Between Single-Precision, Double-Precision and Half-Precision Floating-Point Format 

The IEEE Standard for Floating-Point Arithmetic is the common convention for representing numbers in binary on computers. In double-precision format, each number takes up 64 bits. Single-precision format uses 32 bits, while half-precision is just 16 bits.
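These widths are easy to verify with Python’s standard struct module, which supports the IEEE 754 double (`'d'`), single (`'f'`) and half (`'e'`) formats:

```python
import struct

# IEEE 754 storage sizes: double ('d'), single ('f'), half ('e')
print(struct.calcsize('d') * 8)  # 64 bits
print(struct.calcsize('f') * 8)  # 32 bits
print(struct.calcsize('e') * 8)  # 16 bits
```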

To see how this works, let’s return to pi. In traditional scientific notation, pi is written as 3.14 x 10⁰. But computers store that information in binary as a floating-point number, a series of ones and zeroes that represent a number and its corresponding exponent, in this case 1.1001001 x 2¹.

In single-precision, 32-bit format, one bit is used to tell whether the number is positive or negative. Eight bits are reserved for the exponent, which (because it’s binary) is 2 raised to some power. The remaining 23 bits are used to represent the digits that make up the number, called the significand.

Double precision instead reserves 11 bits for the exponent and 52 bits for the significand, dramatically expanding the range and size of numbers it can represent. Half precision takes an even smaller slice of the pie, with just five bits for the exponent and 10 for the significand.
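One way to see this layout is to pull a single-precision number apart bit by bit. Here’s a minimal sketch using Python’s struct module (the helper name `float32_fields` is ours, purely for illustration):

```python
import struct

def float32_fields(x):
    """Split a number's 32-bit single-precision encoding into its
    sign (1 bit), biased exponent (8 bits) and significand (23 bits)."""
    (bits,) = struct.unpack('>I', struct.pack('>f', x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF   # stored with a bias of 127
    significand = bits & 0x7FFFFF    # fraction bits of the implicit 1.xxx
    return sign, exponent, significand

sign, exponent, significand = float32_fields(3.14159)
# pi is positive (sign 0) and equals 1.57... x 2^1,
# so the stored exponent is 1 + 127 = 128
print(sign, exponent)  # 0 128
```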

Here’s what pi looks like at each precision level:
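You can reproduce the comparison yourself by storing pi in each format and reading it back; this sketch uses Python’s struct module to do the rounding:

```python
import struct

def roundtrip(fmt, x):
    """Round x to the given IEEE 754 format and convert it back."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]

pi = 3.14159265358979323846
print(roundtrip('d', pi))  # double: 3.141592653589793
print(roundtrip('f', pi))  # single: about 3.1415927
print(roundtrip('e', pi))  # half:   3.140625
```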

Difference Between Multi-Precision and Mixed-Precision Computing 

Multi-precision computing means using processors that are capable of calculating at different precisions — using double precision when needed, and relying on half- or single-precision arithmetic for other parts of the application.

Mixed-precision, also known as transprecision, computing instead uses different precision levels within a single operation to achieve computational efficiency without sacrificing accuracy.

In mixed precision, calculations start with half-precision values for rapid matrix math. But as the numbers are computed, the machine stores the result at a higher precision. For instance, if multiplying two 16-bit matrices together, the answer is 32 bits in size.

With this method, by the time the application gets to the end of a calculation, the accumulated answers are comparable in accuracy to running the whole thing in double-precision arithmetic.
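The value of the wider accumulator can be simulated without special hardware. In this illustrative sketch (standard library only; struct’s `'e'` and `'f'` formats stand in for fp16 and fp32 rounding), a long dot product of half-precision values stalls when the running sum is also kept in half precision, but stays close to the true answer with a single-precision accumulator:

```python
import struct

def to_half(x):
    """Round x to the nearest IEEE half-precision value."""
    return struct.unpack('e', struct.pack('e', x))[0]

def to_single(x):
    """Round x to the nearest IEEE single-precision value."""
    return struct.unpack('f', struct.pack('f', x))[0]

# Dot product of 4,096 pairs of fp16 inputs, accumulated two ways.
a = [to_half(0.1)] * 4096
b = [to_half(0.1)] * 4096

acc_half = 0.0
acc_single = 0.0
for x, y in zip(a, b):
    p = to_half(x * y)                      # fp16 product
    acc_half = to_half(acc_half + p)        # fp16 sum: rounds away each step
    acc_single = to_single(acc_single + p)  # fp32 sum: keeps the extra bits

print(acc_single)  # close to the true sum of about 40.94
print(acc_half)    # stalls once each addend falls below half a ulp of the sum
```

The half-precision accumulator gets stuck at 32 here, because once the sum is that large the spacing between adjacent fp16 values exceeds twice each addend, so every addition rounds back to the same value.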

This technique can accelerate traditional double-precision applications by up to 25x, while shrinking the memory, runtime and power consumption required to run them. It can be used for AI and simulation HPC workloads.

As mixed-precision arithmetic grew in popularity for modern supercomputing applications, HPC luminary Jack Dongarra outlined a new benchmark, HPL-AI, to estimate the performance of supercomputers on mixed-precision calculations. When NVIDIA ran HPL-AI computations in a test run on Summit, the fastest supercomputer in the world, the system achieved unprecedented performance levels of nearly 445 petaflops, almost 3x faster than its official performance on the TOP500 ranking of supercomputers.

How to Get Started with Mixed-Precision Computing

NVIDIA Volta and Turing GPUs feature Tensor Cores, which are built to simplify and accelerate multi- and mixed-precision computing. And with just a few lines of code, developers can enable the automatic mixed-precision feature in the TensorFlow, PyTorch and MXNet deep learning frameworks. The tool gives researchers speedups of up to 3x for AI training.

The NGC catalog of GPU-accelerated software also includes iterative refinement solver and cuTensor libraries that make it easy to deploy mixed-precision applications for HPC.

For more information, check out our developer resources on training with mixed precision.

What Is Mixed-Precision Used for?

Researchers and companies rely on the mixed-precision capabilities of NVIDIA GPUs to power scientific simulation, AI and natural language processing workloads. A few examples:

Earth Sciences

  • Researchers from the University of Tokyo, Oak Ridge National Laboratory and the Swiss National Supercomputing Center used AI and mixed-precision techniques for earthquake simulation. Using a 3D simulation of the city of Tokyo, the scientists modeled how a seismic wave would impact hard soil, soft soil, above-ground buildings, underground malls and subway systems. They achieved a 25x speedup with their new model, which ran on the Summit supercomputer and used a combination of double-, single- and half-precision calculations.
  • A Gordon Bell prize-winning team from Lawrence Berkeley National Laboratory used AI to identify extreme weather patterns from high-resolution climate simulations, helping scientists analyze how extreme weather is likely to change in the future. Using the mixed-precision capabilities of NVIDIA V100 Tensor Core GPUs on Summit, they achieved performance of 1.13 exaflops.

Medical Research and Healthcare

  • San Francisco-based Fathom, a member of the NVIDIA Inception virtual accelerator program, is using mixed-precision computing on NVIDIA V100 Tensor Core GPUs to speed up training of its deep learning algorithms, which automate medical coding. The startup works with many of the largest medical coding operations in the U.S., turning doctors’ typed notes into alphanumeric codes that represent every diagnosis and procedure insurance providers and patients are billed for.
  • Researchers at Oak Ridge National Laboratory were awarded the Gordon Bell prize for their groundbreaking work on opioid addiction, which leveraged mixed-precision techniques to achieve a peak throughput of 2.31 exaops. The research analyzes genetic variations within a population, identifying gene patterns that contribute to complex traits.

Nuclear Energy

  • Nuclear fusion reactions are highly unstable and tricky for scientists to sustain for more than a few seconds. Another team at Oak Ridge is simulating these reactions to give physicists more information about the variables at play within the reactor. Using mixed-precision capabilities of Tensor Core GPUs, the team was able to accelerate their simulations by 3.5x.

The post What’s the Difference Between Single-, Double-, Multi- and Mixed-Precision Computing? appeared first on The Official NVIDIA Blog.