What’s New: At Baidu World 2020, Intel announced a series of collaborations with Baidu in artificial intelligence (AI), 5G, data center and cloud computing infrastructure. Intel and Baidu executives discussed the trends of intelligent infrastructure and intelligent computing, and shared details on the two companies’ strategic vision to jointly drive the industry’s intelligent transformation within the cloud, network and edge computing environments.
“In China, the most important thing for developing an industry ecosystem is to truly take root in the local market and its users’ needs. With the full-speed development of ‘new infrastructure’ and 5G, China has entered the stage of accelerated development of the industrial internet. Intel and Baidu will collaborate comprehensively to create infinite possibilities for the future through continuous innovation, so that technology can enrich the lives of everyone.”
– Rui Wang, Intel vice president in the Sales & Marketing Group and PRC country manager
Why It Matters: Zhenyu Hou, corporate vice president of Baidu, said that Baidu and Intel are both extremely focused on technology innovation and have always been committed to promoting intelligent transformation through innovative technology exploration. In the wave of new infrastructure, Baidu continues to deepen its collaboration with Intel to seize opportunities in the AI industry and bring more value to the industry, society and individuals.
A Series of Recent Collaborations:
AI in the Cloud: Intel and Baidu have delivered technological innovations over the past decade, from search and AI to autonomous driving, 5G and cloud services. Recently, Baidu and Intel worked on customizing Intel® Xeon® Scalable processors to deliver optimized performance, thermal design power (TDP), temperature and feature sets within Baidu’s cloud infrastructure. With the latest 3rd Gen Intel Xeon Scalable processor and its built-in BFloat16 instruction set, Intel supports Baidu’s optimization of the PaddlePaddle framework to provide enhanced speech prediction services and multimedia processing support within the Baidu cloud, delivering highly optimized, highly efficient cloud management, operation and maintenance.
Next-Gen Server Architecture: Intel and Baidu have designed and carried out the commercial deployment of next-generation 48V rack servers based on Intel Xeon Scalable processors to achieve higher rack power density, reduce power consumption and improve energy efficiency. The two companies are working to drive ecosystem maturity of 48V technology and promote its full adoption in the future based on the next-generation Xeon® Scalable processor (code-named Sapphire Rapids).
Networking: In an effort to improve virtualization and workload performance, while accelerating data processing speeds and reducing total cost of ownership (TCO) within Baidu infrastructure, Intel and Baidu are deploying Smart NIC (network interface card) innovations based on Intel® SoC FPGAs and Intel® Ethernet 800 Series adapter with Application Device Queues (ADQ) technology. Smart NICs greatly increase port speed, optimize network load, realize large-scale data processing, and create an efficient and scalable bare metal and virtualization environment for the Baidu AICloud.
Baidu Smart NICs are built on the latest Intel Ethernet 800 series, Intel Xeon-D processor and Intel Arria® 10-based FPGAs. From the memory and storage side, Intel and Baidu built a high-performance, ultra-scalable, and unified user space single-node storage engine using Intel® Optane™ persistent memory and Intel Optane NVMe SSDs to enable Baidu to configure multiple storage scenarios through one set of software.
5G and Edge Computing: In the area of 5G and edge computing, Intel and Baidu have utilized their technology expertise and collaborated on a joint innovation using the capabilities of the OpenNESS (Open Network Edge Services Software) toolkit developed by Intel, and Baidu IME (Intelligent Mobile Edge), to help achieve a highly reliable edge compute solution with AI capabilities for low-latency applications.
What’s Next: Looking forward, Intel will continue to leverage its comprehensive data center portfolio to collaborate with Baidu on a variety of developments including:
Developing a future autonomous vehicle architecture platform and intelligent transportation vehicle-road computing architecture.
Exploring mobile edge computing to provide users with edge resources to connect Baidu AI operator services.
Expanding Baidu Smart Cloud in infrastructure technology.
Improving the optimization of Xeon Scalable processors and PaddlePaddle.
Bringing increased benefits to Baidu online AI businesses, thus creating world-changing technology that truly enriches lives.
What’s New: Samsung Medison and Intel are collaborating on new smart workflow solutions to improve obstetric measurements that contribute to maternal and fetal safety and can help save lives. Using an Intel® Core™ i3 processor, the Intel® Distribution of OpenVINO™ toolkit and OpenCV library, Samsung Medison’s BiometryAssist™ automates and simplifies fetal measurements, while LaborAssist™ automatically estimates the fetal angle of progression (AoP) during labor for a complete understanding of a patient’s birthing progress, without the need for invasive digital vaginal exams.
“Samsung Medison’s BiometryAssist is a semi-automated fetal biometry measurement system that automatically locates the region of interest and places a caliper for fetal biometry, demonstrating a success rate of 97% to 99% for each parameter1. Such high efficacy enables its use in the current clinical practice with high precision.”
–Professor Jayoung Kwon, MD PhD, Division of Maternal Fetal Medicine, Department of Obstetrics and Gynecology, Yonsei University College of Medicine, Yonsei University Health System in Seoul, Korea
Why It’s Needed: According to the World Health Organization, about 295,000 women died during and following pregnancy and childbirth in 2017, even as maternal mortality rates decreased. While every pregnancy and birth is unique, most maternal deaths are preventable. Research from the Perinatal Institute found that tracking fetal growth is essential for good prenatal care and can help prevent stillbirths when physicians are able to recognize growth restrictions.
“At Intel, we are focused on creating and enabling world-changing technology that enriches the lives of every person on Earth,” said Claire Celeste Carnes, strategic marketing director for Health and Life Sciences at Intel. “We are working with companies like Samsung Medison to adopt the latest technologies in ways that enhance patient safety and improve clinical workflows, in this case for the important and time-sensitive care provided during pregnancy and delivery.”
How It Works: BiometryAssist automates and standardizes fetal measurements in approximately 85 milliseconds with a single click, providing over 97% accuracy1. This allows doctors to spend more time talking with their patients while also standardizing fetal measurements, which have historically proved challenging to accurately provide. With BiometryAssist, physicians can quickly verify consistent measurements for high volumes of patients.
“Samsung is working to improve the efficiency of new diagnostic features, as well as healthcare services, and the Intel Distribution of OpenVINO toolkit and OpenCV library have been a great ally in reaching these goals,” said Won-Chul Bang, corporate vice president and head of Product Strategy, Samsung Medison.
During labor, LaborAssist helps physicians estimate the fetal AoP and head direction. This enables both the physician and patient to understand the fetal descent and labor process and determine the best method for delivery. There is always risk with delivery, and slowing progress could result in issues for the baby. Obtaining a more accurate, real-time picture of labor progression can help physicians determine the best mode of delivery and potentially help reduce the number of unnecessary cesarean sections.
“LaborAssist provides automatic measurement of the angle of progression as well as information pertaining to fetal head direction and estimated head station. So it is useful for explaining to the patient and her family how the labor is progressing, using ultrasound images which show the change of head station during labor. It is expected to be of great assistance in the assessment of labor progression and decision-making for delivery,” said Professor Min Jeong Oh, MD, PhD, Department of Obstetrics and Gynecology, Korea University Guro Hospital in Seoul, Korea.
BiometryAssist and LaborAssist are already in use in 80 countries, including the United States, Korea, Italy, France, Brazil and Russia. The solutions received Class 2 clearance from the FDA in 2020.
What’s Next: Intel and Samsung Medison will continue to collaborate to advance the state of the art in ultrasounds by accelerating AI and leveraging advanced technology in Samsung Medison’s next-generation ultrasound solutions, including Nerve Tracking, SW Beamforming and AI Module.
1 Source: Internal Samsung testing. System configuration: Intel® Core™ i3-4100Q CPU @ 2.4 GHz, 8 GB memory; OS: 64-bit Windows 10. Inference time without OpenVINO enhancements was 480 milliseconds. Inference time with OpenVINO enhancements was 85 milliseconds.
In short, machine learning, one part of the broad field of AI, is set to become as mainstream as software applications. That’s why the process of running ML needs to be as buttoned down as the job of running IT systems.
Machine Learning Layered on DevOps
MLOps is modeled on the existing discipline of DevOps, the modern practice of efficiently writing, deploying and running enterprise applications. DevOps got its start a decade ago as a way warring tribes of software developers (the Devs) and IT operations teams (the Ops) could collaborate.
MLOps adds to the team the data scientists, who curate datasets and build AI models that analyze them. It also includes ML engineers, who run those datasets through the models in disciplined, automated ways.
It’s a big challenge in raw performance as well as management rigor. Datasets are massive and growing, and they can change in real time. AI models require careful tracking through cycles of experiments, tuning and retraining.
So, MLOps needs a powerful AI infrastructure that can scale as companies grow. For this foundation, many companies use NVIDIA DGX systems, CUDA-X and other software components available on NVIDIA’s software hub, NGC.
Lifecycle Tracking for Data Scientists
With an AI infrastructure in place, an enterprise data center can layer on the following elements of an MLOps software stack:
Data sources and the datasets created from them
A repository of AI models tagged with their histories and attributes
An automated ML pipeline that manages datasets, models and experiments through their lifecycles
Software containers, typically based on Kubernetes, to simplify running these jobs
It’s a heady set of related jobs to weave into one process.
Data scientists need the freedom to cut and paste datasets together from external sources and internal data lakes. Yet their work and those datasets need to be carefully labeled and tracked.
Likewise, they need to experiment and iterate to craft great models well torqued to the task at hand. So they need flexible sandboxes and rock-solid repositories.
And they need ways to work with the ML engineers who run the datasets and models through prototypes, testing and production. It’s a process that requires automation and attention to detail so models can be easily interpreted and reproduced.
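The tracking requirements above can be sketched in a few lines. What follows is a hypothetical, dependency-free illustration of dataset and experiment lineage, not any particular vendor’s API: datasets are content-addressed so any change produces a new version, and every run records exactly which version it used.

```python
import hashlib
import json
import time

# Hypothetical sketch (not any vendor's API): track datasets and model
# experiments by content hash so every run can be reproduced later.

class Registry:
    def __init__(self):
        self.datasets = {}     # content hash -> dataset metadata
        self.experiments = []  # append-only run history

    def register_dataset(self, name, records):
        # Content-address the dataset so any change yields a new version.
        payload = json.dumps(records, sort_keys=True).encode()
        digest = hashlib.sha256(payload).hexdigest()
        self.datasets[digest] = {"name": name, "rows": len(records)}
        return digest

    def log_experiment(self, model_name, dataset_hash, params, metrics):
        # Tie each run to the exact dataset version and hyperparameters used.
        self.experiments.append({
            "model": model_name,
            "dataset": dataset_hash,
            "params": params,
            "metrics": metrics,
            "time": time.time(),
        })

reg = Registry()
h = reg.register_dataset("clickstream-v1", [{"user": 1, "clicks": 3}])
reg.log_experiment("ranker", h, {"lr": 0.01}, {"auc": 0.91})
```

With lineage recorded this way, an ML engineer can later look up any run and retrieve the exact dataset version behind it.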
Today, these capabilities are becoming available as part of cloud-computing services. Companies that see machine learning as strategic are creating their own AI centers of excellence using MLOps services or tools from a growing set of vendors.
Data Science in Production at Scale
In the early days, companies such as Airbnb, Facebook, Google, NVIDIA and Uber had to build these capabilities themselves.
“We tried to use open source code as much as possible, but in many cases there was no solution for what we wanted to do at scale,” said Nicolas Koumchatzky, a director of AI infrastructure at NVIDIA.
“When I first heard the term MLOps, I realized that’s what we’re building now and what I was building before at Twitter,” he added.
Koumchatzky’s team at NVIDIA developed MagLev, the MLOps software that hosts NVIDIA DRIVE, our platform for creating and testing autonomous vehicles. As part of its foundation for MLOps, it uses the NVIDIA Container Runtime and Apollo, a set of components developed at NVIDIA to manage and monitor Kubernetes containers running across huge clusters.
Laying the Foundation for MLOps at NVIDIA
Koumchatzky’s team runs its jobs on NVIDIA’s internal AI infrastructure based on GPU clusters called DGX PODs. Before the jobs start, the infrastructure crew checks whether they are using best practices.
First, “everything must run in a container — that spares an unbelievable amount of pain later looking for the libraries and runtimes an AI application needs,” said Michael Houston, whose team builds NVIDIA’s AI systems including Selene, a DGX SuperPOD recently ranked the most powerful industrial computer in the U.S.
Among the team’s other checkpoints, jobs must:
Launch containers with an approved mechanism
Prove the job can run across multiple GPU nodes
Show performance data to identify potential bottlenecks
Show profiling data to ensure the software has been debugged
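The checkpoints above amount to an automated pre-flight check. Here is a minimal sketch of that idea; the field names and job schema are illustrative assumptions, not NVIDIA’s actual tooling:

```python
# Hypothetical pre-flight check mirroring the checkpoints above; the job
# dict schema is invented for illustration.

def preflight(job):
    failures = []
    if not job.get("container_image"):
        failures.append("everything must run in a container")
    if job.get("gpu_nodes", 0) < 2:
        failures.append("job must prove it runs across multiple GPU nodes")
    if not job.get("performance_report"):
        failures.append("performance data required to spot bottlenecks")
    if not job.get("profiling_report"):
        failures.append("profiling data required to show code was debugged")
    return failures

job = {"container_image": "nvcr.io/example/train:1.0", "gpu_nodes": 4,
       "performance_report": True, "profiling_report": True}
print(preflight(job))  # an empty list means the job may launch
```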
The maturity of MLOps practices used in business today varies widely, according to Edwin Webster, a data scientist who started the MLOps consulting practice a year ago for Neal Analytics and wrote an article defining MLOps. At some companies, data scientists still squirrel away models on their personal laptops; others turn to big cloud-service providers for a soup-to-nuts service, he said.
Two MLOps Success Stories
Webster shared success stories from two of his clients.
One involves a large retailer that used MLOps capabilities in a public cloud service to create an AI service that reduced waste by 8-9 percent with daily forecasts of when to restock shelves with perishable goods. A budding team of data scientists at the retailer created datasets and built models; the cloud service packed key elements into containers, then ran and managed the AI jobs.
Another involves a PC maker that developed software using AI to predict when its laptops would need maintenance so it could automatically install software updates. Using established MLOps practices and internal specialists, the OEM wrote and tested its AI models on a fleet of 3,000 notebooks. The PC maker now provides the software to its largest customers.
Many, but not all, Fortune 100 companies are embracing MLOps, said Shubhangi Vashisth, a senior principal analyst following the area at Gartner. “It’s gaining steam, but it’s not mainstream,” she said.
Vashisth co-authored a white paper that lays out three steps for getting started in MLOps: Align stakeholders on the goals, create an organizational structure that defines who owns what, then define responsibilities and roles — Gartner lists a dozen of them.
Beware Buzzwords: AIOps, DLOps, DataOps, and More
Don’t get lost in the thicket of buzzwords that has grown up around this area. The industry has clearly coalesced its energy around MLOps.
By contrast, AIOps is a narrower practice of using machine learning to automate IT functions. One part of AIOps is IT operations analytics, or ITOA. Its job is to examine the data AIOps generates to figure out how to improve IT practices.
Similarly, some have coined the terms DataOps and ModelOps to refer to the people and processes for creating and managing datasets and AI models, respectively. Those are two important pieces of the overall MLOps puzzle.
Interestingly, every month thousands of people search for the meaning of DLOps. They may imagine DLOps means IT operations for deep learning. But the industry uses the term MLOps, not DLOps, because deep learning is part of the broader field of machine learning.
Despite the many queries, you’d be hard pressed to find anything online about DLOps. By contrast, household names like Google and Microsoft as well as up-and-coming companies like Iguazio and Paperspace have posted detailed white papers on MLOps.
MLOps: An Expanding Software and Services Smorgasbord
Those who prefer to let someone else handle their MLOps have plenty of options.
Major cloud-service providers like Alibaba, AWS and Oracle are among several that offer end-to-end services accessible from the comfort of your keyboard.
For users who spread their work across multiple clouds, DataBricks’ MLFlow supports MLOps services that work with multiple providers and multiple programming languages, including Python, R and SQL. Other cloud-agnostic alternatives include open source software such as Polyaxon and KubeFlow.
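The cloud-agnostic idea behind tools like MLflow can be illustrated with a dependency-free sketch: the same run-logging calls go to whichever backend a storage URI selects. The classes and URI handling below are illustrative assumptions, not MLflow’s real API:

```python
# Illustrative sketch of cloud-agnostic run tracking: the backend is
# chosen by URI scheme, so logging code never changes across providers.
# These classes are hypothetical, not MLflow's actual interfaces.

class LocalStore:
    def __init__(self):
        self.runs = []

    def save(self, run):
        self.runs.append(run)

class Tracker:
    def __init__(self, uri, stores):
        scheme = uri.split("://", 1)[0]
        self.store = stores[scheme]  # e.g. "file", "s3", "gs" backends

    def log_run(self, params, metrics):
        self.store.save({"params": params, "metrics": metrics})

stores = {"file": LocalStore()}          # swap in cloud stores as needed
tracker = Tracker("file:///tmp/runs", stores)
tracker.log_run({"lang": "python"}, {"rmse": 0.12})
```

Because only the URI changes between providers, the same training script can log to a laptop today and a managed cloud service tomorrow.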
Companies that believe AI is a strategic resource they want behind their firewall can choose from a growing list of third-party providers of MLOps software. Compared to open-source code, these tools typically add valuable features and are easier to put into use.
These vendors provide software to manage datasets and models that can work with Kubernetes and NGC.
It’s still early days for off-the-shelf MLOps software.
Gartner tracks about a dozen vendors offering MLOps tools, including ModelOp and ParallelM, now part of DataRobot, said analyst Vashisth. Beware offerings that don’t cover the entire process, she warns: they force users to import and export data between programs they must stitch together themselves, a tedious and error-prone process.
The edge of the network, especially for partially connected or unconnected nodes, is another underserved area for MLOps so far, said Webster of Neal Analytics.
Koumchatzky, of NVIDIA, puts tools for curating and managing datasets at the top of his wish list for the community.
“It can be hard to label, merge or slice datasets or view parts of them, but there is a growing MLOps ecosystem to address this. NVIDIA has developed these internally, but I think it is still undervalued in the industry,” he said.
Long term, MLOps needs the equivalent of IDEs, the integrated software development environments like Microsoft Visual Studio that apps developers depend on. Meanwhile Koumchatzky and his team craft their own tools to visualize and debug AI models.
The good news is there are plenty of products for getting started in MLOps.
In addition to software from its partners, NVIDIA provides a suite of mainly open-source tools for managing an AI infrastructure based on its DGX systems, and that’s the foundation for MLOps. These software tools include:
Foreman and MAAS (Metal as a Service) for provisioning individual systems
Ansible and Git for cluster configuration management
DeepOps for scripts and instructions on how to deploy and orchestrate all of the elements above
Many are available on NGC and other open source repositories. Pulling these ingredients into a recipe for success, NVIDIA provides a reference architecture for creating GPU clusters called DGX PODs.
In the end, each team needs to find the mix of MLOps products and practices that best fits its use cases. They all share a goal of creating an automated way to run AI smoothly as a daily part of a company’s digital life.
What’s New: Intel® Studios will premiere two new volumetric films, “Queerskins: ARK” and “HERE,” at the 77th Venice International Film Festival beginning Sept. 2. Part of the Venice VR Expanded competition on Viveport, the films are two of the latest innovative virtual reality (VR) experiences captured and produced at Intel Studios.
“Using the groundbreaking volumetric capture and production abilities of Intel Studios, whether it be through the unique movement-based experiences of ‘ARK’ via six degrees of freedom or the ability to layer scenes from various time periods on top of one another in ‘HERE,’ we are ushering in a new age of content creation and immersive experiences.”
–Diego Prilusky, head of Intel Studios
Why It Matters: Intel Studios is driving the future of content creation with next-generation filmmaking technology and production that allow captured content to be translated into VR, augmented reality and movement-based six degrees of freedom (6DoF) experiences. After unveiling previous productions at the Tribeca Film Festival and Cannes, Venice is a defining moment for Intel Studios to showcase interactive experiences that demonstrate the capabilities and future of longer-form volumetric content.
About the Interactive Experience: “Queerskins: ARK” is an Intel Studios original, written by Illya Szilak and co-produced by Cloudred. It follows a Catholic mother living in rural Missouri who reads a diary left by the estranged son she has lost to AIDS, as a way to transcend herself and her grief by imagining him alive and in love. The 6DoF interactive experience allows viewers to navigate around a large space and enter the mother’s imagination, co-creating and controlling the experience through movements.
About the Layers of Time: “HERE,” an Intel Studios original co-produced with 59 Productions, presents an immersive adaptation of Richard McGuire’s groundbreaking graphic novel. This unique experience is a grand biopic where the main character is a place rather than a person. Through volumetric capture and VR technology, we join the host of characters across layers of generations who have called this particular room home. The immersive VR narrative invites audiences to reflect on human experiences across generations.
At Hot Chips 2020, Raja Koduri, senior vice president, chief architect and general manager of Architecture, Graphics and Software at Intel, delivered a keynote presentation. While silicon technology scaled exponentially over the past few decades, the breadth and depth of the software stack scaled at an unprecedented rate, bringing inefficiencies and leaving room for increasing performance. Raja’s keynote explored opportunities in hardware-software co-design from the edge to the cloud to fully leverage the transistors that are left behind by the incredible pace of software change at the top of the stack.
What’s New: At SIGGRAPH 2020, Intel announced the latest additions to the Intel® oneAPI Rendering Toolkit. Part of Intel’s oneAPI family of products, the toolkit brings premier high-performance, high-fidelity capabilities to the graphics and rendering industry. The toolkit is designed to accelerate workloads with large data sets and high complexity that require built-in artificial intelligence (AI), through a set of open-source rendering and ray-tracing libraries, to create high-performance, high-fidelity visual experiences. The new additions are Intel® OSPRay Studio and OSPRay for Hydra, both available later in 2020, and visualization capabilities for the Intel® DevCloud for oneAPI, with sign-up available now on the Intel Developer Zone website.1
Intel oneAPI Rendering Toolkit provides high-performance, high-fidelity, extensible and cost-effective platforms with powerful ray-tracing and rendering capabilities to bring rendering to new heights.
Why It Matters: By taking advantage of Intel’s XPU hardware, Intel® Optane™ persistent memory, networking solutions and oneAPI software solutions, content creators and developers can bring their ideas to photorealistic reality with performance, efficiency and flexibility across today’s and future generations of systems and accelerators.
Intel OSPRay Studio is a scene graph application that demonstrates high-fidelity, ray-traced, interactive, real-time rendering, and provides capabilities to visualize multiple formats of 3D models and time series. It is used for robust scientific visualization and photoreal rendering and consists of Intel OSPRay in conjunction with other Intel rendering libraries (Intel® Embree, Intel® Open Image Denoise, etc.).
Intel OSPRay for Hydra is a Universal Scene Description (USD) Hydra API-compliant renderer that provides high-fidelity, scalable ray tracing performance and real-time rendering with a viewport-focused interface for film animation and 3D CAD/CAM modeling.
New Intel DevCloud for oneAPI capabilities let users visualize and iterate on renderings and create applications with real-time interactivity via remote desktop. Users can apply the Intel oneAPI Rendering Toolkit to optimize visualization performance and evaluate workloads across a variety of the latest Intel hardware (CPU, GPU, FPGA). Access is free, with no installation, setup or configuration required.
Advantages of using Intel’s platform with an open-development environment for ray tracing and rendering for developers include:
An open-platform approach addresses customer concerns about single-vendor lock-in, with cross-architecture support for a variety of platform choices in performance and cost.
Rendering toolkit open-source libraries drive innovation through powerful ray-tracing and rendering features that extend beyond the capabilities of GPUs, such as model complexity beyond triangles, path tracing, combined volume and geometry rendering, and addressing the data explosion in today’s workloads.
Simplified AI integration is included via Intel Open Image Denoise, Intel® Distribution of OpenVINO™ toolkit and acceleration via 3rd Gen Intel® Xeon® Scalable processors with Intel® Deep Learning Boost and bfloat16.
Intel tools provide readiness for next-generation hardware innovations to ensure visualization applications automatically scale to support future Intel CPUs, GPUs and other accelerators.
What the oneAPI Rendering Toolkit brings to life: Entertainment, gaming, HPC and other industries increasingly demand high-quality visuals that are produced at a fast rate with increasingly large data sets and complex workloads, as well as the integration of AI. Examples of the oneAPI Rendering Toolkit in use include:
Learn how the oneAPI Rendering Toolkit decreases render time for Tangent Studios. (Next Gen Now Available on Netflix. Netflix subscription required.)
Learn how the oneAPI Rendering Toolkit boosts V-Ray and Corona for Chaos Group.
Learn how Bentley Motors is using the oneAPI Rendering Toolkit to deliver hyperrealism.
Learn how LAIKA is using the oneAPI Rendering Toolkit to speed up the stop-motion filmmaking process.
Customer usages of the Intel® oneAPI Rendering Toolkit span film/VFX, scientific visualization, and product and architectural design.
1 Currently there is a limit on concurrent users; available instances will be increased as demand requires.
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Intel technologies may require enabled hardware, software or service activation.
NVIDIA’s enterprise partner program has grown to more than 1,500 members worldwide and added new resources to boost opportunities for training, collaboration and sales.
The expanded NVIDIA Partner Network boasts an array of companies that span the globe and help customers across a variety of needs, from high performance computing to systems integration.
The NPN has seen exponential growth over the past two years, and these new program enhancements enable future expansion as Mellanox and Cumulus partner programs are set to be integrated into NPN throughout 2021.
Mellanox and Cumulus bring strong partners into the NVIDIA fold. Focused on enterprise data center markets, they provide accelerated, disaggregated and software-defined networking to meet the rapid growth in AI, cloud and HPC.
In anticipation of this growth, the NPN has introduced educational opportunities, tools and resources for training and collaboration, as well as added sales incentives. These benefits include:
Industry-Specific Training Curriculums: New courses and enablement tools in healthcare, higher education and research, financial services and insurance, and retail. Additional courses in energy and telco are coming next year.
NPN Learning Maps: These dramatically reduce the time partners need to get up and running. Partners can discover and build their NVIDIA learning matrix based on industry and cross-referenced by role, including sales, solution architect or data scientist.
New tools and resources:
AI Consulting Network: New AI consulting services for data scientists and solution architects who are part of our Service Delivery Partner-Professional Services program to help build and deploy HPC and AI solutions.
Enhanced NPN Partner Portal: Expanded to allow access to the vast storehouse of NVIDIA-built sales tools and data, including partner rebate history and registered opportunities. The simplified portal gives partners increased visibility and easy access to the information required to quickly track sales and build accurate forecasts.
Industry-Specific Marketing Campaigns: Provides partners with the opportunity to build campaigns that more accurately target customers with content built from data-driven insights.
A fixed backend rebate for Elite-level Solution Provider and Solutions Integration partners for compute, compute DGX, visualization and virtualization.
An enhanced quarterly performance bonus program, incorporating an annualized goal to better align with sudden fluctuations in partner selling seasons.
Dedicated market development funds for Elite-level providers and integration partners for most competencies.
NPN expanded categories:
Solution advisors focused on storage solutions and mutual reference architectures
Federal government system integrators
The NVIDIA Partner Network is dedicated to supporting partners that deliver world-class products and services to customers. The NPN collaborates with hundreds of companies globally, across a range of businesses and competencies, to serve customers in HPC, AI and emerging high-growth areas such as visualization, edge computing and virtualization.
DGX SuperPODs are driving business results for companies like Continental in automotive, Lockheed Martin in aerospace and Microsoft in cloud-computing services.
Birth of an AI System
The story of how and why NVIDIA built Selene starts in 2015.
NVIDIA engineers started their first system-level design with two motivations. They wanted to build something both powerful enough to train the AI models their colleagues were building for autonomous vehicles and general purpose enough to serve the needs of any deep-learning researcher.
The result was the SATURNV cluster, born in 2016 and based on the NVIDIA Pascal GPU. When the more powerful NVIDIA Volta GPU debuted a year later, the budding systems group’s motivation and its designs expanded rapidly.
AI Jobs Grow Beyond the Accelerator
“We’re trying to anticipate what’s coming based on what we hear from researchers, building machines that serve multiple uses and have long lifetimes, packing as much processing, memory and storage as possible,” said Michael Houston, a chief architect who leads the systems team.
As early as 2017, “we were starting to see new apps drive the need for multi-node training, demanding very high-speed communications between systems and access to high-speed storage,” he said.
AI models were growing rapidly, requiring multiple GPUs to handle them. Workloads were demanding new computing styles, like model parallelism, to keep pace.
So, in fast succession, the team crafted ever larger clusters of V100-based NVIDIA DGX-2 systems, called DGX PODs. They used 32, then 64 DGX-2 nodes, culminating in a 96-node architecture dubbed the DGX SuperPOD.
They christened it Circe for the irresistible Greek goddess. It debuted in June 2019 at No. 22 on the TOP500 list of the world’s fastest supercomputers and currently holds No. 23.
Cutting Cables in a Computing Jungle
Along the way, the team learned lessons about networking, storage, power and thermals. Those learnings got baked into the latest NVIDIA DGX systems, reference architectures and today’s 280-node Selene.
In the race through ever larger clusters to get to Circe, some lessons were hard won.
“We tore everything out twice, we literally cut the cables out. It was the fastest way forward, but it still had a lot of downtime and cost. So we vowed to never do that again and set ease of expansion and incremental deployment as a fundamental design principle,” said Houston.
The team redesigned the overall network to simplify assembling the system.
They defined modules of 20 nodes connected by relatively simple “thin switches.” Each of these so-called scalable units could be laid down, cookie-cutter style, turned on and tested before the next one was added.
The design let engineers specify set lengths of cables that could be bundled together with Velcro at the factory. Racks could be labeled and mapped, radically simplifying the process of filling them with dozens of systems.
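The cookie-cutter pattern described above — lay down a 20-node scalable unit, power it on, validate it, and only then add the next — can be sketched as a simple loop. This is an illustrative sketch only, not NVIDIA's actual tooling; the node names and the acceptance check are invented:

```python
# Illustrative sketch of incremental, test-as-you-go deployment of
# 20-node "scalable units"; node names and checks are hypothetical.
NODES_PER_UNIT = 20

def acceptance_test(node):
    """Stand-in for real cabling validation and burn-in on one node."""
    return node["cabled"] and node["powered"]

def deploy_cluster(num_units):
    cluster = []
    for unit_id in range(num_units):
        # Lay down one scalable unit behind its own "thin switch".
        unit = [{"name": f"su{unit_id}-n{i:02d}", "cabled": True, "powered": True}
                for i in range(NODES_PER_UNIT)]
        # Validate the whole unit before the next one is added.
        assert all(acceptance_test(n) for n in unit), f"unit {unit_id} failed"
        cluster.extend(unit)
    return cluster

# Selene's 280 nodes correspond to 14 such units.
print(len(deploy_cluster(14)))  # 280
```

The key design choice is that each unit is independently testable, so a failure stops one 20-node module rather than forcing a tear-out of the whole cluster.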
Doubling Up on InfiniBand
Early on, the team learned to split up compute, storage and management fabrics into independent planes, spreading them across more, faster network-interface cards.
The number of NICs per GPU doubled to two. So did their speeds, going from 100 Gbit per second InfiniBand in Circe to 200G HDR InfiniBand in Selene. The result was a 4x increase in the effective node bandwidth.
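The 4x figure follows directly from the two doublings: twice as many NICs per GPU, each running at twice the link speed. A quick back-of-the-envelope check (the helper function is hypothetical; the figures are from the text above):

```python
# Back-of-the-envelope check of the 4x effective-bandwidth claim.
def gpu_bandwidth(nics_per_gpu, gbits_per_nic):
    """Aggregate network bandwidth available to one GPU, in Gbit/s."""
    return nics_per_gpu * gbits_per_nic

circe = gpu_bandwidth(1, 100)   # one 100G InfiniBand NIC per GPU
selene = gpu_bandwidth(2, 200)  # two 200G HDR InfiniBand NICs per GPU

print(selene / circe)  # 4.0
```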
Likewise, memory and storage links grew in capacity and throughput to handle jobs with hot, warm and cold storage needs. Four storage tiers spanned 100 TB/s memory links to 100 GB/s storage pools.
Power and thermals stayed within air-cooled limits. The default designs used 35kW racks typical in leased data centers, but they can stretch beyond 50kW for the most aggressive supercomputer centers and down to 7kW racks some telcos use.
Seeking the Big, Balanced System
The net result is a more balanced design that can handle today’s many different workloads. That flexibility also gives researchers the freedom to explore new directions in AI and high performance computing.
“To some extent HPC and AI both require max performance, but you have to look carefully at how you deliver that performance in terms of power, storage and networking as well as raw processing,” said Julie Bernauer, who leads an advanced development team that’s worked on all of NVIDIA’s large-scale systems.
In the best of times, it can take dozens of engineers a few months to assemble, test and commission a supercomputer-class system. NVIDIA had to get Selene running in a few weeks to participate in industry benchmarks and fulfill obligations to customers like Argonne.
And engineers had to stay well within public-health guidelines of the pandemic.
“We had skeleton crews with strict protocols to keep staff healthy,” said Bernauer.
“To unbox and rack systems, we used two-person teams that didn’t mix with the others — they even took vacation at the same time. And we did cabling with six-foot distances between people. That really changes how you build systems,” she said.
Even with the COVID restrictions, engineers racked up to 60 systems in a day, the maximum their loading dock could handle. Virtual log-ins let administrators validate cabling remotely, testing the 20-node modules as they were deployed.
Bernauer’s team put several layers of automation in place. That cut the need for people at the co-location facility where Selene was built, a block from NVIDIA’s Silicon Valley headquarters.
Slacking with a Supercomputer
Selene talks to staff over a Slack channel as if it were a co-worker, reporting loose cables and isolating malfunctioning hardware so the system can keep running.
“We don’t want to wake up in the night because the cluster has a problem,” Bernauer said.
It’s part of the automation customers can access if they follow the guidance in the DGX POD and SuperPOD architectures.
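The pattern — classify a hardware event, isolate the faulty part, and post a human-readable alert to a chat channel — can be sketched in a few lines. This is a hypothetical illustration, not NVIDIA's actual automation; the event schema, actions and webhook URL are all invented, and Slack's incoming-webhook endpoint simply accepts a JSON payload with a `text` field:

```python
# Hypothetical sketch of chat-ops cluster monitoring: classify hardware
# events, isolate bad parts, and report to a Slack channel.
import json
import urllib.request

SLACK_WEBHOOK = "https://hooks.slack.com/services/EXAMPLE"  # placeholder URL

def classify(event):
    """Map a hardware event to a remediation action (event schema is invented)."""
    if event["type"] == "link_down":
        return f"cordon {event['node']}: loose cable on {event['port']}"
    if event["type"] == "gpu_ecc_error":
        return f"drain {event['node']}: GPU {event['gpu']} flagged for service"
    return None  # benign event, no action

def report(event):
    """Post the action to the ops channel so the cluster 'talks' to staff."""
    action = classify(event)
    if action:
        payload = json.dumps({"text": action}).encode()
        req = urllib.request.Request(
            SLACK_WEBHOOK, payload, {"Content-Type": "application/json"})
        urllib.request.urlopen(req)  # fire-and-forget alert
    return action
```

Isolating the node automatically means the rest of the system keeps running, and the Slack message becomes a morning to-do item rather than a 2 a.m. page.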
Thanks to this approach, the University of Florida, for example, is expected to rack and power up a 140-node extension to its HiPerGator system, switching on the most powerful AI supercomputer in academia within as little as 10 days of receiving it.
As an added touch, the NVIDIA team bought a telepresence robot from Double Robotics so non-essential designers sheltering at home could maintain daily contact with Selene. Tongue-in-cheek, they dubbed it Trip, given early concerns that essential technicians on site might bump into it.
The fact that Trip is powered by an NVIDIA Jetson TX2 module was an added attraction for team members who imagined some day they might tinker with its programming.
Since late July, Trip’s been used regularly to let them virtually drive through Selene’s aisles, observing the system through the robot’s camera and microphone.
“Trip doesn’t replace a human operator, but if you are worried about something at 2 a.m., you can check it without driving to the data center,” she said.
Delivering HPC, AI Results at Scale
In the end, it’s all about the results, and they came fast.
In June, Selene hit No. 7 on the TOP500 list and No. 2 on the Green500 list of the most power-efficient systems. In July, it broke records in all eight systems tests for AI training performance in the latest MLPerf benchmarks.
“The big surprise for me was how smoothly everything came up given we were using new processors and boards, and I credit all the testing along the way,” said Houston. “To get this machine up and do a bunch of hard back-to-back benchmarks gave the team a huge lift,” he added.
The work pre-testing NGC containers and HPC software for Argonne was even more gratifying. The lab is already hammering on hard problems in protein docking and quantum chemistry to shine a light on the coronavirus.
At the same time, NVIDIA’s own researchers are using Selene to train autonomous vehicles and refine conversational AI, nearing advances they’re expected to report soon. They are among more than a thousand jobs run, often simultaneously, on the system so far.
Meanwhile the team already has on the whiteboard ideas for what’s next. “Give performance-obsessed engineers enough horsepower and cables and they will figure out amazing things,” said Bernauer.
At top: An artist’s rendering of a portion of Selene.
Join Intel at Hot Chips, a conference on high-performance microprocessors and related integrated circuits, where Raja Koduri, senior vice president, chief architect and general manager of Architecture, Graphics and Software, will deliver a keynote presentation explaining the opportunities for hardware-software co-design. Together with partners and customers, Intel is building the trusted foundation for computing in a data-centric world.
While silicon technology scaled exponentially over the past few decades, the breadth and depth of the software stack scaled at a much faster rate than could have been predicted by Moore’s Law, bringing inefficiencies and leaving plenty of room for increasing performance. In this keynote, Raja Koduri will explore opportunities in hardware-software co-design from the edge to the cloud to fully leverage the transistors that are left behind by the incredible pace of software change at the top of the stack.
Towards a Large-Scale Quantum Computer Using Silicon Spin Qubits
James S. Clarke
When: Sunday, Aug. 16, 3:45-5:15 p.m.
James S. Clarke, director of quantum hardware at Intel, will present progress toward the realization of a 300mm Si-MOS based spin qubit device in a production environment. In addition, this talk will focus on a key bottleneck to moving beyond today’s few-qubit devices: the interconnect scheme and control of a large quantum circuit. To address this bottleneck, at Intel we have developed customized control chips, optimized for performance at low temperature, with a goal of simplifying wiring of quantum systems and replacing the racks and racks of discrete electrical components.
Next-Generation Intel Xeon Scalable Server Processor: Icelake-SP
Irma Esmer Papazian
When: Monday, Aug. 17, 9:30-11:30 a.m.
Irma Esmer Papazian, senior principal engineer in architecture at Intel, will present on the next-generation Intel® Xeon® processor (code-named “Icelake-SP”), a general-purpose server processor designed to bring a significant performance boost over prior-generation Xeon processors. Icelake-SP is the first Xeon server processor implemented on Intel 10nm process technology and is targeted for the end of 2020. Irma will describe new enhancements to Icelake-SP, including ISA, security technologies and core microarchitecture improvements, diving into their use and impact in server applications. She will also highlight key features, microarchitecture insights and performance trends of Icelake-SP not shared previously.
Inside Tiger Lake: Intel’s Next Generation Mobile Client CPU
Xavier Vera
When: Monday, Aug. 17, 12-1 p.m.
Xavier Vera, principal engineer and SoC/performance power architect at Intel, will lead a discussion on Tiger Lake, Intel’s next generation mobile client CPU. Tiger Lake provides the best performance and battery life Intel has ever offered for mobile content creation, productivity and gaming usages. The new high-performance core in Tiger Lake is called Willow Cove, which adds enhanced security features to help protect against return-oriented programming attacks along with increases in performance via frequency push and L2 cache size increases. Intel® Xe-LP is the new graphics engine within Tiger Lake that increases the maximum number of execution units from 64 to 96. Xavier will cover how Tiger Lake was able to achieve significant gains in performance and drive competitive battery life.
The Xe GPU Architecture
David Blythe
When: Monday, Aug. 17, 5-6:30 p.m.
David Blythe, Intel senior fellow and director of Graphics Architecture, will discuss Xe, Intel’s next-generation GPU architecture. Xe was designed for scalability – spanning teraflop to petaflop compute performance – and is slated for use in multiple Intel products including Tiger Lake Client SoC and HPC GPU Ponte Vecchio. In this presentation, he will explain the principles behind Xe configurability and scalability and its many efficiency enhancements, as well as new capabilities, including hardware acceleration support for ray tracing, matrix/tensor processing and multidie/multipackage connectivity. Blythe will then delve into details for one configuration of the architecture, Xe-LP, employed in lower TDP product segments.
Agilex Generation of Intel FPGAs
Ilya Ganusov and Mahesh A. Iyer
When: Tuesday, Aug. 18, 8:30-10 a.m.
Ilya Ganusov, senior principal engineer and director of Programmable Architecture at Intel, and Mahesh A. Iyer, Intel fellow in the Data Platforms Group and chief architect and technologist of Electronic Design Automation, will provide an in-depth technical disclosure of Intel’s next-generation Agilex™ FPGA platform. Agilex™ FPGAs deliver over 40% higher peak performance and use 40% less power, on average, than prior-generation Stratix® 10 FPGAs. Details on volume production shipments and engineering samples of Agilex™ FPGAs will be revealed in the presentation.
Tofino2 — A 12.9Tbps Programmable Ethernet Switch
Anurag Agrawal and Changhoon Kim
When: Tuesday, Aug. 18, 10:30 a.m.-12:30 p.m.
In this presentation, Anurag Agrawal, senior principal engineer in the Barefoot Division of the Data Center Group, and Changhoon Kim, Intel fellow in the Connectivity Group and chief technical officer of applications in the Barefoot Switch Division, will describe Tofino2, a 12.9Tbps fully-programmable switch silicon that is an evolution of the original Tofino architecture in both speed and capability. Enhancements to Tofino2 include an increase in fungible resources and the addition of several novel primitives to help users realize new types of networking protocols and functions, such as advanced telemetry, advanced congestion control and flexible scheduling.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.
Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks.
Statements in this document that refer to future plans or expectations are forward-looking statements. These statements are based on current expectations and involve many risks and uncertainties that could cause actual results to differ materially from those expressed or implied in such statements. For more information on the factors that could cause actual results to differ materially, see our most recent earnings release and SEC filings at www.intc.com.
By Raja Koduri, senior vice president and general manager, Intel Architecture, Graphics, and Software
At Intel, we truly believe in the potential of technology to enrich lives and change the world. This has been a guiding principle since the company was founded. It started with the PC era, when technology enabled the mass digitization of knowledge and networking, bringing 1 billion people onto the internet. Then came the mobile and cloud era, a disruption that changed the way we live. We now have over 10 billion devices connected to supercomputers in the cloud.
We believe the next era will be the intelligent era. An era where we will experience 100 billion intelligent connected devices. Exascale performance and architecture will make this intelligence available to all, enriching our lives in more ways than we can imagine today. This is a future that inspires and motivates me and my fellow Intel architects every day.
We are generating data at a faster rate than our current ability to analyze, understand, transmit, secure and reconstruct it in real time. Analyzing a ton of data requires a ton of compute. More important, for this data to yield insights, it needs access to compute in real time, which means low latency, close to the user. At Intel, we are on a journey to solve this exponentially hard problem.
Since the end of the Dennard scaling era, extracting the exponential value from transistor technology inspired us to look at new approaches across the whole stack. This led us to what we call our Six Pillars of Technology Innovation, which we introduced at our Architecture Day in December 2018. We believe that delivering advances across these pillars is necessary to continue the exponential essence of Moore’s Law.
This week, at Architecture Day 2020, we showcased how we are taking this forward with a broad range of exciting new breakthroughs. We have made great progress with our diverse mix of scalar, vector, matrix and spatial architectures – designed with state-of-the-art process technology, fed by disruptive memory hierarchies, integrated into systems with advanced packaging, deployed at hyperscale with lightspeed interconnect links, unified by a single software abstraction, and developed with benchmark defining security features.
You can watch all of the Architecture Day 2020 presentations in the event press kit, but let me underscore some of the highlights.
We provided more details about our disaggregated design methodology and our advanced packaging roadmap. We demonstrated our mastering of fine bump pitches in EMIB and Foveros technologies through several product iterations in graphics and FPGAs, and on the client with Lakefield.
We also shared one of the most exciting advancements in our transistor roadmap by introducing our new 10nm SuperFin technology, a redefinition of the FinFET with new SuperMIM capacitors that enables the largest single, intranode enhancement in Intel’s history, delivering performance improvements comparable to a full-node transition and enabling a leadership product roadmap.
When we integrate our next-generation Willow Cove CPU architecture with our 10nm SuperFin technology, the result is the incredible new Tiger Lake platform. We unpacked details of the upcoming Tiger Lake system-on-chip architecture, which provides a generational leap in CPU performance, leadership graphics, leadership artificial intelligence (AI), more memory bandwidth, additional security features, better display, better video and more. I know everyone is eager for all of the details on Tiger Lake and we look forward to sharing more in the coming weeks.
In addition to Tiger Lake, we provided a deep dive into our next generation Intel® Agilex™ FPGA, which provides breakthrough performance per watt. In fact, we showcased two generations of disaggregated products using EMIB and shared the first results of our 224 Gbps transceivers.
We also highlighted how Intel’s Xe GPU architecture is the foundation that helps us build GPUs that are scalable from teraflops to petaflops. Xe-LP powers leadership graphics in Tiger Lake and is our most efficient microarchitecture for PC and mobile computing platforms. Xe-LP also powers our first discrete GPU in more than 20 years, codenamed DG1. This GPU is now in production. We also introduced the first Intel® server GPU, powered by Xe-LP. This GPU will ship later this year and deliver class-leading stream density and visual quality for media transcode and streaming.
On the data center front, we announced that our first Xe-HP chip is sampling to customers. Xe-HP is the industry’s first multitiled, highly scalable, high-performance GPU architecture, providing petaflop-scale AI performance and rack-level media performance in a single package based on our EMIB technology. Xe-HP will leverage enhanced SuperFin technology.
And to our enthusiast and gamer friends: we heard your requests for an Xe built for enthusiast gaming. We added a fourth microarchitecture to the Xe family: Xe-HPG, optimized for gaming, with many new graphics features including ray tracing support. We expect to ship this microarchitecture in 2021, and I can’t wait to get my hands on this GPU!
On software, we have talked before about our vision for providing developers a unified, standards-based programming model across all our XPU architectures. We are executing on that vision with our oneAPI Gold release, available later this year. We also announced that we are offering DG1 early access to developers in Intel® DevCloud, enabling them to start developing with oneAPI without the need for any setup, downloads or hardware installs.
Since our last Architecture Day, we have made some big steps in memory. Most recently, as part of the 3rd Gen Intel® Xeon® Scalable processor launch (code-named “Cooper Lake”), we announced our 2nd Gen Intel® Optane™ persistent memory product (code-named “Barlow Pass”). We also remain on track to move Intel’s 4-bit-per-cell QLC into production by the end of 2020.
We also took a deeper look at how we are advancing security amid a constantly evolving threat landscape. This includes the introduction of new technologies, such as Intel® Control-Flow Enforcement Technology, which delivers CPU-level security structures to help protect against common malware attack methods. And we gave the first look at our longer-term vision around foundational security, workload protection and software reliability.
We have made major progress in advancing interconnect, too. Intel announced in March 2019 that it was working with the industry to build broad support for Compute Express Link, which is designed to accelerate next-generation data center performance and will be offered in Sapphire Rapids. We have also had a significant lead with silicon photonics in terms of customer engagements, and as data centers continue their transformation, Intel is addressing customers’ needs through leadership speeds and foundational and SmartNIC products for network processing offloads.
Our Intel fellows and architects are passionately working on technology for 2021, 2022 and beyond. We provided a glimpse into our product vision for client and data center, leveraging all six pillars and disaggregated design. Our head of Intel Labs also provided a look at where emerging research areas can deliver 100x to 1,000x improvements in compute efficiency, including a sneak peek at neuromorphic architectures being researched in our world-leading labs.
For decades, Intel has been at the center of the technology industry. Our products, along with those of our customers, have reshaped the way we all work, live and play. But our collective journey is far from over. I believe we are at the start of a new era, an intelligent era, an exascale for everyone era. This era will be powered by unprecedented levels of compute performance and disruptions across all Six Pillars of Technology Innovation.
Raja M. Koduri is senior vice president, chief architect, and general manager of Architecture, Graphics, and Software at Intel Corporation.
No product or component can be absolutely secure. Intel technologies may require enabled hardware, software or service activation. All product plans and roadmaps are subject to change without notice.