Intel Highlighted Why NVIDIA Tensor Core GPUs Are Great for Inference

It’s not every day that one of the world’s leading tech companies highlights the benefits of your products.

Intel did just that last week, comparing the inference performance of two of their most expensive CPUs to NVIDIA GPUs.

To achieve the performance of a single mainstream NVIDIA V100 GPU, Intel combined two power-hungry, highest-end CPUs with an estimated price of $50,000-$100,000, according to AnandTech. Intel’s performance comparison also highlighted the clear advantage of NVIDIA T4 GPUs, which are built for inference. When compared to a single highest-end CPU, they’re not only faster but also 7x more energy-efficient and an order of magnitude more cost-efficient.

Inference performance is crucial, as AI-powered services are growing exponentially. And Intel’s latest Cascade Lake CPUs include new instructions that improve inference, making them the best CPUs for inference. However, they’re hardly competitive with NVIDIA deep learning-optimized Tensor Core GPUs.

Inference (also known as prediction), in simple terms, is the “pattern recognition” that a neural network does after being trained. It’s where AI models provide intelligent capabilities in applications, like detecting fraud in financial transactions, conversing in natural language to search the internet, and predictive analytics to fix manufacturing breakdowns before they even happen.

While most AI inference today happens on CPUs, NVIDIA Tensor Core GPUs are rapidly being adopted across the full range of AI models. Tensor Cores, a breakthrough innovation, have transformed NVIDIA GPUs into highly efficient and versatile AI processors. They perform multi-precision calculations at high rates to provide optimal precision for diverse AI models, and they have automatic support in popular AI frameworks.

It’s why a growing list of consumer internet companies, including Microsoft, PayPal, Pinterest, Snap and Twitter, are adopting GPUs for inference.

Compelling Value of Tensor Core GPUs for Computer Vision

First introduced with the NVIDIA Volta architecture, Tensor Core GPUs are now in their second generation with NVIDIA Turing. Tensor Cores perform extremely efficient computations for AI for a full range of precision — from 16-bit floating point with 32-bit accumulate to 8-bit and even 4-bit integer operations with 32-bit accumulate.
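Why the 32-bit accumulate matters can be seen in a small, CPU-only sketch. It is not how Tensor Cores are programmed. It only emulates FP16 and FP32 rounding with Python’s struct module and compares a dot product accumulated in FP16 with the same dot product accumulated in FP32. The helper names to_fp16, to_fp32 and dot are made up for this illustration.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision (16-bit) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round x to the nearest IEEE 754 single-precision (32-bit) value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot(a, b, round_accumulator):
    """Dot product with FP16 inputs and a caller-chosen accumulator precision."""
    acc = 0.0
    for x, y in zip(a, b):
        product = to_fp16(x) * to_fp16(y)       # inputs are rounded to FP16
        acc = round_accumulator(acc + product)  # running sum is rounded after each add
    return acc


ones = [1.0] * 4096
fp16_sum = dot(ones, ones, to_fp16)  # accumulator kept in FP16
fp32_sum = dot(ones, ones, to_fp32)  # accumulator kept in FP32
print(fp16_sum, fp32_sum)
```

With an FP16 accumulator the running sum stops growing at 2,048, because 2,049 is not representable in half precision, while the FP32 accumulator reaches the exact answer of 4,096.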

They’re designed to accelerate both AI training and inference, and are easily enabled using automatic mixed precision features in the TensorFlow and PyTorch frameworks. Developers can achieve 3x training speedups by adding just two lines of code to their TensorFlow projects.

On computer vision, as the table below shows, when comparing the same number of processors, the NVIDIA T4 is faster, 7x more power-efficient and far more affordable. NVIDIA V100, designed for AI training, is 2x faster and 2x more energy efficient than CPUs on inference.

Table 1: Inference on ResNet-50.

                                          Intel Xeon 9282   NVIDIA V100      NVIDIA T4
ResNet-50 Inference (images/sec)          7,878             7,844            4,944
# of Processors                           2                 1                1
Total Processor TDP                       800 W             350 W            70 W
Energy Efficiency (Using TDP)             10 img/sec/W      22 img/sec/W     71 img/sec/W
Performance per Processor (images/sec)    3,939             7,844            4,944
GPU Performance Advantage                 1.0 (baseline)    2.0x             1.3x
GPU Energy-Efficiency Advantage           1.0 (baseline)    2.3x             7.2x
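The derived rows of Table 1 follow from its throughput, processor-count and TDP rows. The short sketch below recomputes them. The numbers are copied from the table, the V100 and T4 labels follow the surrounding text, and the dictionary layout and the derive helper are only an illustration.

```python
# Figures copied from Table 1; column labels follow the surrounding text.
systems = {
    "Intel Xeon 9282": {"images_per_sec": 7878, "processors": 2, "tdp_watts": 800},
    "NVIDIA V100": {"images_per_sec": 7844, "processors": 1, "tdp_watts": 350},
    "NVIDIA T4": {"images_per_sec": 4944, "processors": 1, "tdp_watts": 70},
}


def derive(system):
    """Return (images/sec per watt, images/sec per processor) for one system."""
    per_watt = system["images_per_sec"] / system["tdp_watts"]
    per_processor = system["images_per_sec"] / system["processors"]
    return per_watt, per_processor


# The CPU column is the 1.0 baseline for both advantage rows.
baseline_watt, baseline_proc = derive(systems["Intel Xeon 9282"])
advantages = {}
for name, system in systems.items():
    per_watt, per_processor = derive(system)
    advantages[name] = {
        "performance": per_processor / baseline_proc,
        "energy_efficiency": per_watt / baseline_watt,
    }
    print(f"{name}: {per_watt:.0f} img/sec/W, "
          f"{advantages[name]['performance']:.1f}x performance, "
          f"{advantages[name]['energy_efficiency']:.1f}x energy efficiency")
```

Note that the performance advantage is computed per processor, which is why a V100 with slightly lower total throughput than the two-CPU system still shows a 2.0x advantage.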

Source: Intel Xeon performance; NVIDIA GPU performance

Compelling Value of Tensor Core GPUs for Understanding Natural Language

AI has been moving at a frenetic pace. This rapid progress is fueled by teams of AI researchers and data scientists who continue to innovate and create highly accurate and exponentially more complex AI models.

Over four years ago, computer vision was among the first applications where AI from Microsoft was able to perform at superhuman accuracy using models like ResNet-50. Today’s advanced models perform even more complex tasks like understanding language and speech at superhuman accuracy. BERT, a highly complex AI model open-sourced by Google last year, can now understand prose and answer questions with superhuman accuracy.

A measure of the complexity of AI models is the number of parameters they have. Parameters in an AI model are the variables that store information the model has learned. While ResNet-50 has 25 million parameters, BERT has 340 million, a 13x increase.

On an advanced model like BERT, a single NVIDIA T4 GPU is 59x faster than a dual-socket CPU server and 240x more power-efficient.

Table 2: Inference on BERT. Workload: Fine-Tune Inference on BERT Large dataset.

                                                      Dual Intel Xeon Gold 6240    NVIDIA T4
BERT Inference, Question-Answering (sentences/sec)
Processor TDP                                         300 W (150 W x 2)            70 W
Energy Efficiency (using TDP)                         0.007 sentences/sec/W        1.7 sentences/sec/W
GPU Performance Advantage                             1.0 (baseline)               59x
GPU Energy-Efficiency Advantage                       1.0 (baseline)               240x

CPU server: Dual-socket Xeon Gold 6240@2.6GHz; 384GB system RAM; FP32 precision; with Intel’s TF Docker container v. 1.13.1. Note: Batch-size 4 results yielded the best CPU score.

GPU results: T4: Dual-socket Xeon Gold 6240@2.6GHz; 384GB system RAM; mixed precision; CUDA 10.1.105; NCCL 2.4.3, cuDNN, cuBLAS 10.1.105; NVIDIA driver 418.67; on TensorFlow using automatic mixed precision and XLA compiler; batch-size 4 and sequence length 128 used for all platforms tested. 

Compelling Value of Tensor Core GPUs for Recommender Systems

Another key usage of AI is in recommendation systems, which are used to provide relevant content recommendations on video sharing sites, news feeds on social sites and product recommendations on e-commerce sites.

Neural collaborative filtering, or NCF, is a recommender system that uses the prior interactions of users with items to provide recommendations. When running inference on the NCF model that is a part of the MLPerf 0.5 training benchmark, NVIDIA T4 brings 10x more performance and 20x higher energy efficiency than CPUs.

Table 3: Inference on NCF.

                                                                           Single Intel Xeon Gold 6140    NVIDIA T4
Recommender Inference Throughput (MovieLens) (thousands of samples/sec)    2,860                          27,800
Processor TDP                                                              150 W                          70 W
Energy Efficiency (using TDP) (thousands of samples/sec/W)                 19                             397
GPU Performance Advantage                                                  1.0 (baseline)                 10x
GPU Energy-Efficiency Advantage                                            1.0 (baseline)                 20x

CPU server: Single-socket Xeon Gold 6240@2.6GHz; 384GB system RAM; Used Intel Benchmark for NCF on TensorFlow with Intel’s TF Docker container version 1.13.1; FP32 precision. Note: Single-socket CPU config used for CPU tests as it yielded a better score than dual-socket.

GPU results: T4: Single-socket Xeon Gold 6140@2.3GHz; 384GB system RAM; CUDA 10.1.105; NCCL 2.4.3, cuDNN, cuBLAS 10.1.105; NVIDIA driver 418.40.04; on TensorFlow using automatic mixed precision and XLA compiler; batch-size: 2,048 for CPU, 1,048,576 for T4; precision: FP32 for CPU, mixed precision for T4. 

Unified Platform for AI Training and Inference

The use of AI models in applications is an iterative process designed to continuously improve their performance. Data science teams constantly update their models with new data and algorithms to improve accuracy. Developers then update these models in applications.

Updates can happen monthly, weekly and even on a daily basis. Having a single platform for both AI training and inference can dramatically simplify and accelerate this process of deploying and updating AI in applications.

NVIDIA’s data center GPU computing platform leads the industry in performance by a large margin for AI training, as demonstrated by the standard AI benchmark, MLPerf. And the NVIDIA platform provides compelling value for inference, as the data presented here attests. That value increases with the growing complexity and progress of modern AI.

To help fuel the rapid progress in AI, NVIDIA has deep engagements with the ecosystem and constantly optimizes software, including key frameworks like TensorFlow, PyTorch and MXNet as well as inference software like TensorRT and TensorRT Inference Server.

NVIDIA also regularly publishes pre-trained AI models for inference and model scripts for training models using your own data. All of this software is freely made available as containers, ready to download and run from NGC, NVIDIA’s hub for GPU-accelerated software.

Get the full story about our comprehensive AI platform.

The post Intel Highlighted Why NVIDIA Tensor Core GPUs Are Great for Inference appeared first on The Official NVIDIA Blog.

ACR AI-LAB and NVIDIA Make AI in Hospitals Easy on IT, Accessible to Every Radiologist

For radiology to benefit from AI, there need to be easy, consistent and scalable ways for hospital IT departments to implement the technology. It’s a return to a service-oriented architecture, where logical components are separated and can each scale individually, and an efficient use of the additional compute power these tools require.

AI is coming from dozens of vendors as well as internal innovation groups, and needs a place within the hospital network to thrive. That’s why NVIDIA and the American College of Radiology (ACR) have published a Hospital AI Reference Architecture Framework. It helps hospitals easily get started with AI initiatives.

A Cookbook to Make AI Easy

The Hospital AI Reference Architecture Framework was published at yesterday’s annual ACR meeting for public comment. This follows the recent launch of the ACR AI-LAB, which aims to standardize and democratize AI in radiology. The ACR AI-LAB uses infrastructure such as NVIDIA GPUs and the NVIDIA Clara AI toolkit, as well as GE Healthcare’s Edison platform, which helps bring AI from research into FDA-cleared smart devices.

The Hospital AI Reference Architecture Framework outlines how hospitals and researchers can easily get started with AI initiatives. It includes descriptions of the steps required to build and deploy AI systems, and provides guidance on the infrastructure needed for each step.

Hospital AI Architecture Framework

To drive an effective AI program within a healthcare institution, there must first be an understanding of the workflows involved, compute needs and data required. It comes from a foundation of enabling better insights from patient data with easy-to-deploy compute at the edge.

Using a transfer client, seed models can be downloaded from a centralized model store. A clinical champion uses an annotation tool to locally create data that can be used for fine-tuning the seed model or training a new model. Then, using the training system with the annotated data, a localized model is instantiated. Finally, an inference engine is used to conduct validation and ultimately inference on data within the institution.
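The order of these four workflows can be sketched in a few lines of Python. This is a toy outline only, not the framework’s API: every class, function and name below (Model, download_seed_model, annotate, fine_tune, infer, the "chest-xray" model and the study IDs) is hypothetical, and training and inference are stubbed out.

```python
from dataclasses import dataclass, field


@dataclass
class Model:
    name: str
    training_sites: list = field(default_factory=list)  # where the model was trained


def download_seed_model(model_store: dict, name: str) -> Model:
    """Transfer client: copy a seed model out of the central model store."""
    seed = model_store[name]
    return Model(seed.name, list(seed.training_sites))


def annotate(studies: list, label_study) -> list:
    """Annotation tool: pair each local study with a clinician-provided label."""
    return [(study, label_study(study)) for study in studies]


def fine_tune(seed: Model, annotated: list, site: str) -> Model:
    """Training system: produce a localized model from the seed and local data."""
    if not annotated:
        raise ValueError("fine-tuning needs at least one annotated study")
    return Model(f"{seed.name}-{site}", seed.training_sites + [site])


def infer(model: Model, study: str) -> dict:
    """Inference engine: run the localized model on one study (stubbed here)."""
    return {"study": study, "model": model.name}


model_store = {"chest-xray": Model("chest-xray", ["central"])}
seed = download_seed_model(model_store, "chest-xray")
annotated = annotate(["study-001", "study-002"], lambda study: "normal")
local_model = fine_tune(seed, annotated, "hospital-a")
result = infer(local_model, "study-003")
print(result)
```

In this sketch the annotated studies never leave the institution. Only the seed model crosses the boundary, and the copy in the central store is left unchanged.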

These four workflows sit atop AI compute infrastructure, which can be accelerated with NVIDIA GPU technology for best performance, alongside storage for models and annotated studies. These workflows tie back into other hospital systems such as PACS, where medical images are archived.

Three Magic Ingredients: Hospital Data, Clinical AI Workflows, AI Computing

Healthcare institutions don’t have to build the systems to deploy AI tools themselves.

This scalable architecture is designed to support and provide computing power to solutions from different sources. GE Healthcare’s Edison platform now uses NVIDIA’s TensorRT Inference Server (TRT-IS) to help AI run in an optimized way within GPU-powered software and medical devices. This integration makes it easier to deliver AI from multiple vendors into clinical workflows, and it is the first example of the AI-LAB’s efforts to help hospitals adopt solutions from different vendors.

Together, Edison with TRT-IS offers a ready-made device inferencing platform that is optimized for GPU-compliant AI, so models built anywhere can be deployed in an existing healthcare workflow.

Hospitals and researchers are empowered to embrace AI technologies without building their own standalone technology or yielding their data to the cloud, which has privacy implications.

The post ACR AI-LAB and NVIDIA Make AI in Hospitals Easy on IT, Accessible to Every Radiologist appeared first on The Official NVIDIA Blog.

Cracking the Code on Opioid Addiction with Summit Supercomputer

About 10 percent of people who are prescribed opioids experience full-blown addiction, yet there’s no way to know who is most susceptible.

Genomics can likely explain this riddle — by identifying genetic predispositions to addiction — but such research will take enormous calculations on massive datasets.

Researchers at Oak Ridge National Laboratory are using the biggest GPU-powered computing cluster ever built to perform these calculations faster than ever. It’s part of an effort that could one day lead to alternative medications for pain management and help mitigate the opioid addiction crisis.

“We’ve been dreaming about solving these sorts of problems for years,” said Dan Jacobson, chief scientist of computational systems biology at ORNL, one of several labs run by the U.S. Department of Energy and home to Summit, the world’s fastest supercomputer.

Scientists across numerous fields are hailing Summit as a huge leap in computational research capabilities. Its 27,000-plus NVIDIA V100 Tensor Core GPUs deliver exponentially more processing power than was possible just a few years ago.

In the world of biology, this translates to being able to zoom in closer on the molecular level and explore new frontiers. That, in turn, will enable researchers to learn more about how all the components of a cell interact, and to do studies on a population scale.

Among the first such projects will be an effort led by Jacobson to train a machine learning model on genomic data in the hopes of accurately predicting whether a patient is predisposed to opioid addiction.

“Tensor Cores on GPUs on Summit will give us this enormous boost in performance to solve fundamental biological problems we simply couldn’t before,” he said.

Serious Math

Jacobson’s team plans to tap Summit’s prodigious mathematical capabilities by running immense calculations on genetics data that will help establish correlations between that data and the likelihood of addiction.

First, the team will use Summit to look for genetic changes across an entire population; then it’ll write algorithms to search for correlations between those changes.

To do this, the team is working with a Veterans Administration dataset of clinical records for 23 million people going back two decades. It already has assembled a dataset of genomics correlations on 600,000 people. The goal is to build that to about 2 million.

Once it has a large enough set of these correlations, the team can start testing them against two groups: those who have developed opioid addiction and a control group of people who have been exposed but have not developed an addiction.

Which brings us to the math: Jacobson said the very first calculation would require somewhere on the order of 10 to the 16th power (or 10 quadrillion) comparisons, and that operation would be repeated thousands, possibly even hundreds of thousands, of times.
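To get a feel for that scale, the sketch below assumes the comparisons are all-pairs comparisons among n items, which is an assumption made here for illustration; the article does not say how the comparisons are counted. Under that assumption the count is n(n - 1)/2, and the code finds the smallest n that reaches 10 to the 16th power. The function names are made up.

```python
from math import isqrt


def pair_count(n: int) -> int:
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def items_for_pairs(target: int) -> int:
    """Smallest n whose pair count is at least target."""
    # Start from the integer solution of n * (n - 1) / 2 = target, then adjust up.
    n = (1 + isqrt(1 + 8 * target)) // 2
    while pair_count(n) < target:
        n += 1
    return n


n = items_for_pairs(10**16)
print(f"{n:,} items give {pair_count(n):,} pairwise comparisons")
```

Under this assumption, roughly 141 million items are enough to produce 10 quadrillion pairs, which shows how quickly all-pairs work grows with the size of the dataset.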

Being able to perform such calculations in manageable amounts of time will open the doors to new ways of dealing with the growing opioid addiction crisis.

“We can develop better therapies for addiction, we can develop better therapies for chronic pain, and we can predict which patients will become addicted to opioids and then not give them opioids,” said Jacobson.

More Breakthroughs to Come

The team already has been able to test some of its applications on Summit and has managed to boost performance from 1.8 exaflops to 2.36 exaflops on its algorithm. It’s the fastest science application ever reported, and earned the team a Gordon Bell Prize in 2018. (For reference, one exaflops equals 1 quintillion, or billion billion, operations per second.)

As it continues to refine performance, Jacobson’s team expects to achieve higher levels of accuracy and to get those results faster.

That’s almost hard to imagine, given that Jacobson already said that his team can, in one hour on Summit, complete tasks that would require 35 years on a competing supercomputer, or 12,000 years on a laptop.

Jacobson believes that being able to do his team’s work on opioid addiction on Summit will lead to breakthrough treatments of other conditions, such as Alzheimer’s, dementia, prostate cancer and cardiovascular disease.

“Once we understand the complex genetic architecture underlying addiction, we want to do this really for all clinical disease states that seem to have some sort of genetic underpinning,” said Jacobson. “It’s machines like Summit that give us the ability to do that at scale, so we can now start to answer scientific questions that were literally impossible earlier this year.”

The post Cracking the Code on Opioid Addiction with Summit Supercomputer appeared first on The Official NVIDIA Blog.

DUG Selects Intel to Build Its Latest Cloud-Based Supercomputer Tailored for Oil and Gas Exploration

DownUnder GeoSolutions* (DUG) on Thursday announced its new Intel-based high-performance computing (HPC) system tailored for the geophysics community. The system was unveiled in a ceremony that took place at the Skybox* Houston data center, where the supercomputer will be housed.

Harnessing the power of Intel technology, the 250 petaflop (single-precision) supercomputer, known as “Bubba,” will join other DUG data centers around the world to form DUG’s McCloud service — a global network of cloud-based high-performance computing systems used by the oil and gas industry for geophysics research and exploration.

The supercomputers that make up the DUG McCloud global network feature over 40,000 Intel® Xeon® processor-based nodes, and are some of the most powerful, energy-efficient HPC systems in the world that are optimized for geophysical research. Geophysical research relies heavily on advanced computing resources to obtain a more detailed picture of the earth’s subsurface. Intel Xeon processors deliver companies like DUG highly optimized computing, artificial intelligence and analytics capabilities for advanced simulation and modeling.

Like other supercomputers in the DUG McCloud global network, Bubba is housed in a purpose-built facility that leverages immersive cooling technology. The computing nodes are submerged in more than 700 specially designed tanks filled with polyalphaolefin dielectric fluid. The facility features ten 20-foot-tall cooling towers with over 13 miles of pipes to cool the system.

The post DUG Selects Intel to Build Its Latest Cloud-Based Supercomputer Tailored for Oil and Gas Exploration appeared first on Intel Newsroom.

Intel Drives Innovation across the Software Stack with Open Source for AI and Cloud

What’s New: Intel is hosting the annual Open Source Technology Summit (OSTS) May 14-16. What started as an internal conference in 2004 with a few dozen engineers now brings together 500 participants. This year is the most open yet, with leaders from Alibaba*, Amazon*, AT&T*, Google*, Huawei*, Microsoft*, MontaVista*, Red Hat*, SUSE* and Wind River* taking part in discussions of open source software that is optimized for Intel hardware and will drive the next generation of data-centric technology in areas such as containers, artificial intelligence (AI), machine learning and other cloud to edge to device workloads.

“OSTS is at its heart a technology conference, and it’s the depth of technical content, engineering engagement and community focus that make the summit so valuable. This year we’re open-sourcing our open source summit, inviting customers, partners and industry stakeholders for the first time. I’m excited by the opportunity to connect the community with the amazing people who are driving open source at Intel.”
–Imad Sousou, Intel corporate vice president and general manager of System Software Products

The Details: The latest contributions Intel is sharing at OSTS represent critical advances in:

  • Modernizing core infrastructure for uses well-suited to Intel architecture:
    • The ModernFW Initiative aims to remove legacy code and modularize design for scalability and security. By delivering just enough code to boot the kernel, this approach can help reduce exposure to security risks and make management easier for users.
    • rust-vmm offers a set of common hypervisor components, developed by Intel with industry leaders including Alibaba, Amazon, Google and Red Hat, to deliver use-case-specific hypervisors. Intel has released a special-purpose cloud hypervisor based on rust-vmm with partners to provide a more secure, higher-performance container technology designed for cloud-native environments.
    • Intel is also committing to advancing critical system infrastructure projects by assigning developers to contribute code, as well as incorporating our “0-day Continuous Integration” best practices to technologies beyond the Linux* kernel. Projects Intel plans to contribute to include (but are not limited to) bash*, chrony*, the Fuzzing Project*, GnuPG*, libffi*, the Linux Kernel Self Protection Project*, OpenSSH*, OpenSSL* and the R* programming language.
  • Enhancing Intel Linux-based solutions for developers and partners: Intel’s Clear Linux* Distribution is adding Clear Linux Developer Edition, which includes a new installer and store, bringing together toolkits to give developers an operating system with all Intel hardware features already enabled. Additionally, Clear Linux usages are expanding to provide end-to-end integration and optimization for Intel hardware features and key workloads supporting the Deep Learning and Data Analytics software stacks. The performance, security, ease-of-use and customization advantages make Clear Linux a great choice for Linux developers.
    • The Deep Learning Reference Stack is an integrated, highly-performant open source stack optimized for Intel® Xeon® Scalable Processors. This stack includes Intel® Deep Learning Boost (Intel DL Boost) and is designed to accelerate AI use cases such as image recognition, object detection, speech recognition and language translation.
    • The Data Analytics Reference Stack was developed to help enterprises analyze, classify, recognize and process large amounts of data built on Intel® Xeon® Scalable platforms using Apache Hadoop and Apache Spark*.
  • Enabling new usages across automotive and industrial automation: In a world where functional safety is increasingly important, workload consolidation is both complex and critical. And with the growing reliance on software-defined systems, virtualization has never been more important. Intel is working to transform the software-defined environment to support a mix of safety critical, non-safety critical and time critical workloads to help support automotive, industrial automation and robotics uses.
    • Fusion Edge Stacks support the consolidated workloads that today’s connected devices demand using the ACRN* device hypervisor, Clear Linux OS, Zephyr Project* and Android*.
    • The Intel Robot SDK brings together the best of Intel hardware and software in one resource, simplifying the process of creating AI-enabled robotics and automation solutions, with an optimized computer vision stack.

    Why It Matters: Open source powers the software-defined infrastructure that transformed the modern data center and ushered in the data-centric era. Today, the vast majority of the public cloud runs on open source software; new contributions by Intel are poised to drive a future where everything is software-defined, including new areas such as automotive, industrial and retail.

    With more than 15,000 software engineers, Intel invests in software and in work on standards initiatives to optimize workloads and unlock the performance of our processors. In addition to significant contributions to the Linux kernel, Chromium OS* and OpenStack*, Intel’s leadership in the open source community drives industry advancements that fuel new models for hardware and software interaction in emerging workloads.

    Intel is in a unique position to bring together key industry players to address the complexity of building for diverse architectures and workloads and enable faster deployments of new innovations at scale. Software is a key technology pillar for Intel to fully realize the advancements in architecture, process, memory, interconnect and security.

    The post Intel Drives Innovation across the Software Stack with Open Source for AI and Cloud appeared first on Intel Newsroom.

    NVIDIA Invests in WekaIO’s Speedier Data Transfer for AI

    NVIDIA today made a strategic investment in WekaIO, whose technology speeds data transfer for performance-hungry AI applications.

    The startup, which is based in San Jose, with engineering offices in Tel Aviv, offers high-performance file storage software that accelerates data transfer between network-attached shared flash storage and GPUs.

    Founded in 2014, WekaIO offers high-throughput, high-bandwidth, low-latency data access to GPU-based servers.

    “We have developed a flash-native parallel file system with a low-latency storage protocol that natively leverages 100 Gigabit Ethernet and InfiniBand to make sure that customers get faster than local storage performance,” said Liran Zvibel, co-founder and CEO of WekaIO.

    Today’s AI workloads aren’t just exceeding the limits of Moore’s law for CPUs. They’re now requiring a host of new technologies — like smart interconnects for faster networking — to eliminate bottlenecks in the data center.

    Data center customers use WekaIO’s market-leading Matrix software-defined storage system to handle data-heavy input and output workloads found in AI, machine learning, high-velocity analytics and high performance computing. Customers include TuSimple, Tre Altamira, Untold Studios and San Diego Supercomputer Center.

    “As GPU performance has continued to grow, data movement becomes increasingly important and WekaIO has pioneered an impressive modern parallel file system that delivers important capabilities to accelerate AI and workloads at scale,” said Jeff Herbst, vice president of business development at NVIDIA.

    The WekaIO investment is the latest addition to an expanding portfolio for NVIDIA’s GPU Ventures program, which has invested in 11 companies in the past two years. This brings the portfolio to 24 companies, including three investments in Israeli startups. Below are some of the investments:

    • Deep Instinct, based in Tel Aviv and New York, uses deep learning against cyberattacks.
    • DeepMap, in Silicon Valley, develops HD mapping and localization for autonomous vehicles.
    • ElementAI is a Montreal-based startup that integrates AI capabilities.
    •, based in Santa Monica, develops software for fast data processing.
    • H2O, in Silicon Valley and Prague, offers an open source machine learning platform.
    • OmniSci, in San Francisco, is a pioneer in GPU-driven analytics.
    • TuSimple, based in Beijing and San Diego, is an autonomous truck startup.
    • WeRide, a Beijing- and Silicon Valley-based startup, develops on NVIDIA DRIVE AGX Pegasus.

    The post NVIDIA Invests in WekaIO’s Speedier Data Transfer for AI appeared first on The Official NVIDIA Blog.

    NVIDIA Delivers More Than 6,000x Speedup on Key Algorithm for Hedge Funds

    NVIDIA’s AI platform is delivering more than 6,000x acceleration for running an algorithm that the hedge fund industry uses to benchmark backtesting of trading strategies.

    This enormous GPU-accelerated speedup has big implications across the financial services industry.

    Hedge funds — there are more than 10,000 of them — will be able to design more sophisticated models, stress test them harder, and still backtest them in just hours instead of days. And quants, data scientists and traders will be able to build smarter algorithms, get them into production more quickly and save millions on hardware.

    Financial trading algorithms account for about 90 percent of public trading, according to the Global Algorithmic Trading Market 2016–2020 report. Quants, specifically, have grown to about a third of all trading on the U.S. stock markets today, according to the Wall Street Journal.

    The breakthrough results have been validated by the Securities Technology Analysis Center (STAC), whose membership includes more than 390 of the world’s leading banks, hedge funds and financial services technology companies.

    NVIDIA demonstrated its computing platform’s capability using STAC-A3, the financial services industry benchmark suite for backtesting trading algorithms to determine how strategies would have performed on historical data.

    Using an NVIDIA DGX-2 system running accelerated Python libraries, NVIDIA shattered several previous STAC-A3 benchmark results, in one case running 20 million simulations on a basket of 50 instruments in the prescribed 60-minute test period versus the previous record of 3,200 simulations, a 6,250-fold increase. This is the STAC-A3.β1.SWEEP.MAX60 benchmark; see the official STAC Report for details.

    STAC-A3 parameter-sweep benchmarks use realistic volumes of data and backtest many variants of a simplified trading algorithm to determine profit and loss scores for each simulation. While the underlying algorithm is simple, testing many variants in parallel was designed to stress systems in realistic ways.
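The shape of a parameter sweep can be sketched in plain Python. The strategy below is a made-up toy (hold one unit whenever the price is above its recent average), not the STAC-A3 algorithm, and the prices are invented; the point is only that each parameter value is backtested independently against the same history and yields its own profit and loss score.

```python
def backtest(prices, window):
    """Profit and loss of a toy long-only strategy for one lookback window.

    At each step the strategy holds one unit for the next period if the
    current price is above the average of the previous `window` prices.
    """
    pnl = 0.0
    for t in range(window, len(prices) - 1):
        average = sum(prices[t - window:t]) / window
        if prices[t] > average:
            pnl += prices[t + 1] - prices[t]
    return pnl


def sweep(prices, windows):
    """Backtest every parameter variant against the same historical data."""
    return {window: backtest(prices, window) for window in windows}


prices = [100, 101, 103, 102, 105, 107, 106, 110]  # made-up price history
scores = sweep(prices, windows=[2, 3, 4])
best = max(scores, key=scores.get)
print(scores, "best window:", best)
```

Because no variant depends on another, the work inside sweep is a natural candidate for running in parallel, which is the property the benchmark is designed to stress.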

    According to Michel Debiche, a former Wall Street quant who is now STAC’s director of analytics research, “The ability to run many simulations on a given set of historical data is often important to trading and investment firms. Exploring more combinations of parameters in an algorithm can lead to more optimized models and thus more profitable strategies.”

    The benchmark results were achieved by harnessing the parallel processing power of the 16 NVIDIA V100 GPUs in a DGX-2 server with Python code that uses NVIDIA CUDA-X AI software along with NVIDIA RAPIDS and Numba machine learning software.

    RAPIDS is an evolving set of libraries that simplifies GPU acceleration of common Python data science tasks. Numba allows data scientists to write Python that is compiled into the GPU’s native CUDA, making it easy to extend the capabilities of RAPIDS.

    RAPIDS and Numba software make it possible for data scientists and traders to replicate this performance without needing in-depth knowledge of GPU programming.


    Feature image credit: Lorenzo Cafaro

    The post NVIDIA Delivers More Than 6,000x Speedup on Key Algorithm for Hedge Funds appeared first on The Official NVIDIA Blog.

    NVIDIA and Red Hat Team to Accelerate Enterprise AI

    For enterprises looking to get their GPU-accelerated AI and data science projects up and running more quickly, life just got easier.

    At Red Hat Summit today, NVIDIA and Red Hat introduced the combination of NVIDIA’s GPU-accelerated computing platform and the just-announced Red Hat OpenShift 4 to speed on-premises Kubernetes deployments for AI and data science.

    The result: Kubernetes management tasks that used to take an IT administrator the better part of a day can now be completed in under an hour.

    More GPU Acceleration, Less Deployment Hassle

    This collaboration comes at a time when enterprises are relying on AI and data science to turn their vast amounts of data into actionable intelligence.

    But meaningful AI and data analytics work requires accelerating the full stack of enterprise IT software with GPU computing. Every layer of software — from NVIDIA drivers to container runtimes to application frameworks — needs to be optimized.

    Our CUDA parallel computing architecture and CUDA-X acceleration libraries have been embraced by a community of more than 1.2 million developers for accelerating applications across a broad set of domains — from AI to high-performance computing to VDI.

    And because NVIDIA’s common architecture runs on every computing device imaginable, from a laptop to the data center to the cloud, the investment in GPU-accelerated applications is easy to justify.

    Accelerating AI and data science workloads is only the first step, however. Getting the optimized software stack deployed the right way in large-scale, GPU-accelerated data centers can be frustrating and time consuming for IT organizations. That’s where our work with Red Hat comes in.

    Red Hat OpenShift is the leading enterprise-grade Kubernetes platform in the industry. Advancements in OpenShift 4 make it easier than ever to deploy Kubernetes across a cluster. Red Hat’s investment in Kubernetes Operators, in particular, reduces administrative complexity by automating many routine data center management and application lifecycle management tasks.

    NVIDIA has been working on its own GPU operator to automate much of the work IT managers previously did through shell scripts, such as installing device drivers, ensuring the proper GPU container runtimes are present on all nodes in the data center, and monitoring GPUs.

    Thanks to our work with Red Hat, once the cluster is set up, you simply run the GPU operator to add the necessary dependencies to the worker nodes in the cluster. It’s just that easy. This can make it as simple for an organization to get its GPU-powered data center clusters up and running with OpenShift 4 as it is to spin up new cloud resources.

    Preview and Early Access Program

    At Red Hat Summit, in booth 1039, we’re showing a preview of how easy it is to set up bare-metal GPU clusters with OpenShift and GPU operators.

    Also, you won’t want to miss Red Hat Chief Technology Officer Chris Wright’s keynote on Thursday when NVIDIA Vice President of Compute Software Chris Lamb will join him on stage to demonstrate how our technologies work together and discuss our collaboration in further detail.

    Red Hat and NVIDIA are inviting our joint customers to a white-glove early access program. Customers who want to learn more or participate in the early access program can sign up at

    The post NVIDIA and Red Hat Team to Accelerate Enterprise AI appeared first on The Official NVIDIA Blog.

    Intel and Google Cloud Announce Strategic Partnership to Accelerate Hybrid Cloud

    Navin Shenoy, Intel executive vice president and general manager of the Data Center Group, displays a wafer containing Intel Xeon processors during a keynote on Tuesday, April 2, 2019. In San Francisco on April 2, Intel Corporation introduces a portfolio of data-centric tools to help its customers extract more value from their data. (Credit: Walden Kirsch/Intel Corporation)

    SAN FRANCISCO – GOOGLE CLOUD NEXT – April 9, 2019 — Intel and Google Cloud today announced a strategic partnership aimed at helping enterprise customers seamlessly deploy applications across on-premise and cloud environments. The two companies will collaborate on Anthos, a new reference design based on the 2nd-Generation Intel® Xeon® Scalable processor and an optimized Kubernetes software stack that will deliver increased workload portability to customers who want to take advantage of hybrid cloud environments. Intel will publish the production design as an Intel Select Solution, as well as a developer platform.

    While organizations are embracing multi-cloud solutions to fuel their businesses, many companies remain challenged to find the right hybrid cloud solutions that enable seamless workload migration across clouds. The new Anthos reference design will address this challenge by delivering a stack optimized for workload portability, enabling deployment of applications across on-premise data centers and public cloud provider services.

    “Google and Intel enjoy a long-standing partnership focused on delivering infrastructure innovation to customers,” said Urs Hölzle, senior vice president of Technical Infrastructure at Google Cloud. “Data center environments today are complex, and hardware and software infrastructure is not ‘one size fits all.’ Our ability to collaborate with Intel and take advantage of their technology and product innovation to deliver Anthos solutions ensures that our customers can run their applications in the way that best suits them.”

    “Our collaboration with Google in delivering the infrastructure and software optimizations required to advance their hybrid and multi-cloud solution is a natural fit with Intel’s vision for data-centric computing,” said Navin Shenoy, executive vice president and general manager of the Data Center Group at Intel Corporation. “We’re delivering an Intel technology foundation for customers to take advantage of their data, and that requires delivery of architectures that can span across various operating environments. This collaboration will give customers a choice of optimized solutions that can be utilized both in the on-prem as well as cloud environments.”

    This collaboration is an extension of a technology alliance between the two companies that already spans many infrastructure optimizations, collaboration on high-growth workloads like artificial intelligence, and integration of new technologies into the Google Cloud Platform, such as the 2nd-Generation Intel Xeon Scalable processors and Intel® Optane™ DC Persistent Memory.

    The new reference design will be delivered by mid-year 2019 with expected solution delivery from OEMs and solutions integrators in market later this year.

    About Google Cloud

    Google Cloud (NASDAQ: GOOG, GOOGL) is widely recognized as a global leader in delivering a secure, open and intelligent enterprise cloud platform. Our technology is built on Google’s private network and is the product of nearly 20 years of innovation in security, network architecture, collaboration, artificial intelligence and open source software. We offer a simply engineered set of tools and unparalleled technology across Google Cloud Platform and G Suite that help bring people, insights and ideas together. Customers across more than 150 countries trust Google Cloud to modernize their computing environment for today’s digital world.

    The post Intel and Google Cloud Announce Strategic Partnership to Accelerate Hybrid Cloud appeared first on Intel Newsroom.

    From Microns to Miles: Intel’s Interconnect Tech is a Foundation for the Data-Centric Computing Era

    By Sailesh Kottapalli

    Last year, Intel leaders announced the six technology pillars that underpin all our products: process, architecture, memory, interconnect, security and software. Last week, we launched an amazing array of new data-centric products, featuring new processors, memory, network controller, SSDs, FPGAs and more. Both events were built around our vision of data for this new era – move faster, store more and process everything.

    This week, along with a team of Intel’s technical leaders, I provided an update focused on “move faster,” detailing the vital role interconnect technology plays throughout Intel’s entire portfolio.

    Connecting Data
    Intel’s investments in interconnect technology are among the broadest in the industry. Our technology moves data within a chip, to package, to processor node, through the data center, to the edge, via long-distance wired or wireless networks, and back to the core. Intel is uniquely positioned as a pioneer and leader in all these interconnects that span microns to miles.

    Interconnect technologies provide the ability to connect and transfer data that is growing at an exponential rate. It is estimated that only 2% of the world’s data has been analyzed, leaving a great untapped opportunity to propel business and to fuel societal insights. Interconnect technology is the literal link between the raw data and the compute engines that can extract value from it.

    Interconnect is Vital to Accelerating Performance

    Rapidly growing data drives an insatiable need to scale computing, storage and interconnects, in concert. Put simply, we can create the fastest processing cores, smartest field programmable gate arrays (FPGAs) or the highest-capacity solid-state drives (SSDs), but without high-performance interconnect technology to allow data to flow quickly and efficiently, the performance of the whole system or service will never reach its potential.

    How data flows across the entire spectrum between microns and miles informs our future architecture direction for each of the interconnect elements, as well as the computing, memory and I/O elements that reside on either end of the connection. It takes a holistic approach to sustain high rates of efficient data flow from chip to edge.

    Intel’s participation in many interconnect market segments and leadership in industry standards, such as USB and the new Compute Express Link (CXL), stimulates the entire ecosystem to innovate faster and advance the state of the art. The creativity of the whole market is unleashed to deliver higher-performance systems and better digital services for everyone.

    Sailesh Kottapalli is a senior fellow and chief architect of data center processor architecture at Intel Corporation.

    At the processor level: Data within the chip is where it all begins, whether it is within the silicon (IP-to-IP interconnect) or on package (chip-to-chip). Advancements in interconnect improve efficiencies and enable performance scaling. Intel’s portfolio approach to on-die interconnect allows us to create best-in-class interconnects across a wide range of SoCs. Example technologies: NetSpeed fabric, Foveros 3D chip technology

    Between processors and devices: Data-centric computing involves massive scale and large numbers of processors. This involves a specialized set of high-bandwidth and low-latency interconnect technologies that are critical to moving data between different computing engines as well as memory and IO within a data center rack, enabling these technologies to come together as one. Example technologies: Thunderbolt 3/USB4, Compute Express Link (CXL)

    In the data center: Growing demand for computing has been driving the trend toward hyperscale data centers, some of which have the footprint of several football fields. Today, there is five times more data transferred in the data center than overall internet traffic, resulting in accelerated growth in data center fabric speeds and smart processing capabilities. High-speed, long-distance interconnect technologies are critical in driving performance at large scale while computing at low latency. Example technologies: Intel Ethernet 800 Series, Silicon Photonics, Omni-Path

    Out in the world: Machine-generated data is driving explosive growth. Wireless interconnect technologies are the conduit that connects the massive amounts of generated data to the data center, and that connects billions of people and things to each other. 5G high-bandwidth, low-latency wireless connectivity will facilitate more useful applications of the data, allowing for smarter and more personalized experiences. Example technologies: Snow Ridge, N3000 FPGA

    The post From Microns to Miles: Intel’s Interconnect Tech is a Foundation for the Data-Centric Computing Era appeared first on Intel Newsroom.