Industry Article

Crafting a Silicon Lifecycle Management Strategy for HPC and Data Centers

4 days ago by Randy Fish, Synopsys

As data center computing and HPC advance, the stakes for ensuring reliability are high. Learn how to develop a silicon lifecycle management (SLM) strategy that ensures a successful future for your designs.

Article co-authored by Synopsys’ Guy Cortez

From advancing mathematical models to producing climate projections, supercomputers play a crucial role in answering today’s largest problems, while the cloud data centers powering them process and move extreme volumes of data.

With all that in mind, demand for high-performance computing (HPC) and enormous amounts of data storage is greater now than ever.

 


Figure 1. As the systems powering high-performance computing and data centers get more complex, a myriad of life-cycle issues become critical for semiconductor engineers.
 

As the electronic systems that power HPC and data centers become more advanced, issues such as device aging, thermal challenges, and power constraints pose growing difficulties for semiconductor designers (Figure 1). A lesser-known issue is silent data corruption (SDC): errors that go undetected and occur for unknown and unexpected reasons within data centers.

 

Silent Data Corruption a Growing Problem

Because SDCs appear random and are difficult to detect, they are becoming a widespread concern across the semiconductor industry and beyond. In a 2021 report on SDC, Meta ran a silent error test scenario in its large-scale infrastructure across hundreds of thousands of machines in its fleets and found hundreds of CPUs exhibiting these silent errors.

SDC can cause widespread problems within infrastructure systems, so consistent testing during manufacturing and in the field is imperative. In today’s digital era, millions of operations or more are happening within and across devices, which can amplify the effect of even a few errors. If an error isn’t detected and mitigated quickly, it can lead to data loss and impact business operations and user experiences at hyperscale.
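
One simple way to picture in-field screening is a known-answer test: run a deterministic workload and compare its result with a reference obtained earlier on trusted hardware. The sketch below is only an illustration of that idea, not the method used by Meta or Synopsys; the `workload` and `known_answer_test` names and the workload itself are hypothetical.

```python
import hashlib

MASK64 = (1 << 64) - 1

def workload(seed: int, rounds: int = 1000) -> str:
    """Deterministic integer workload; returns a digest of every intermediate value."""
    digest = hashlib.sha256()
    x = seed & MASK64
    for _ in range(rounds):
        # 64-bit linear congruential step, chosen only to exercise integer arithmetic.
        x = (x * 6364136223846793005 + 1442695040888963407) & MASK64
        digest.update(x.to_bytes(8, "little"))
    return digest.hexdigest()

def known_answer_test(seed: int, expected_digest: str, compute=workload) -> bool:
    """Return True if this machine reproduces the reference digest."""
    return compute(seed) == expected_digest

# Assumption: in practice the reference would come from trusted hardware,
# not from the machine under test as it does in this demo.
reference = workload(42)
print("pass" if known_answer_test(42, reference) else "possible silent error")
```

A mismatch does not say where the fault is; it only flags the machine for deeper diagnosis.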

To address SDC, designers must know what is happening beneath the surface of a chip to ensure the reliability, availability, and serviceability (RAS) of devices. That calls for a silicon lifecycle management (SLM) strategy, because awareness of long-term RAS implications is key to successful product lifecycle management.

 

What is a Silicon Lifecycle Management Strategy?

SLM is an emerging approach that consists of monitoring, analyzing, and optimizing devices from design and development through deployment, to ensure that silicon “health” remains robust and that the chip performs as intended (Figure 2).
 


Figure 2. Industry drivers demand Silicon Lifecycle Management.

 

A chip must do more than work when it is produced and shipped; it also needs continuous monitoring and testing throughout its life. Data center providers and their silicon partners must be able to monitor and analyze the components inside each chip, from the transistors to the data being transmitted, not only to identify and track expected degradation and potential issues, but also to troubleshoot and fix problems.

To guarantee RAS throughout a chip’s life, an SLM strategy provides the following actionable insights:

  • In-Design: Pinpoint the best candidate components in the design for monitoring. Embed monitor IP directly into the infrastructure of the design.
  • In-Ramp: Focus on the highest yield-limiter candidates, conduct accurate failure analysis, and adjust the design and fab process to satisfy high yield requirements.
  • In-Production: Detect yield and quality outliers through automated insights, perform root-cause analysis across the stages of high-volume manufacturing, and course-correct in the semiconductor supply chain as necessary.
  • In-Field: Assess silicon health to enable predictive maintenance, and improve performance metrics such as power and throughput, especially as the device ages.
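
As a rough illustration of the In-Production step, outlier screening can be sketched with a robust statistic such as the modified z-score (based on the median and the median absolute deviation). This is a generic technique, not a description of any Synopsys tool; the function name, the 3.5 cutoff, and the sample leakage values are all assumptions made for the example.

```python
from statistics import median

def mad_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All values (or most of them) are identical; nothing can be ranked.
        return []
    # 0.6745 scales the MAD so the score is comparable to a standard z-score.
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical per-die leakage measurements in arbitrary units.
leakage = [1.00, 1.02, 0.98, 1.01, 0.99, 1.03, 1.75, 1.00]
print(mad_outliers(leakage))
```

A median-based score is used here because a single extreme die would inflate a mean and standard deviation and could hide itself.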

To put this into perspective, think of SLM like car maintenance. Just as an oil change isn’t a one-time task but a routine activity throughout the vehicle’s lifespan, SLM is a preventive maintenance measure essential for maintaining the health and safety of the chip and the larger device.

 

An SLM Strategy in Action

In SoC systems, managing power optimization and thermal issues becomes increasingly challenging when multiple dies are involved. For long-term success in data center and HPC applications, building monitors into chip designs is crucial to mitigate potential damage from excessive heat and voltage, as well as to minimize power consumption.

These monitors, which support dynamic voltage and frequency scaling (DVFS) and adaptive voltage scaling (AVS), measure a chip’s real-time thermal profile. That profile is also instrumental in triggering automatic shutoffs when the chip nears an overheating threshold.
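
To make the idea concrete, here is a minimal sketch of a temperature-driven DVFS policy with an emergency shutoff. The operating points, the three temperature limits, and the `next_level` function are hypothetical values chosen for illustration; real policies are device-specific and typically run in firmware or hardware.

```python
from typing import Optional

# Hypothetical operating points as (frequency in MHz, voltage in V), slowest first.
OPERATING_POINTS = [(800, 0.70), (1600, 0.80), (2400, 0.90)]

THROTTLE_C = 95.0   # assumed: step down at or above this temperature
RECOVER_C = 80.0    # assumed: step up only at or below this (hysteresis gap)
SHUTDOWN_C = 110.0  # assumed: emergency shutoff threshold

def next_level(level: int, temp_c: float) -> Optional[int]:
    """Return the next operating-point index, or None to request shutdown."""
    if temp_c >= SHUTDOWN_C:
        return None
    if temp_c >= THROTTLE_C:
        return max(level - 1, 0)
    if temp_c <= RECOVER_C:
        return min(level + 1, len(OPERATING_POINTS) - 1)
    return level  # between RECOVER_C and THROTTLE_C: hold the current point

level = 2
for reading in (90.0, 96.0, 97.0, 85.0, 75.0):
    level = next_level(level, reading)
    print(reading, OPERATING_POINTS[level])
```

The gap between the throttle and recover limits keeps the policy from oscillating between two operating points when the temperature hovers near a single threshold.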

During wafer sort testing, process, voltage, and temperature (PVT) monitors provide initial data that can be used immediately to gain a deeper understanding of the thermal profile and to run test sequences that observe voltage values across the die. Moreover, analytics based on test data, PVT values, and path margin monitor IP data can be fed back into the design environment to understand the real margins of the silicon and correlate them with models, so that the models more accurately predict performance and power, all without sacrificing RAS.
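
The feedback loop described above can be pictured as comparing what the design model predicted with what the silicon actually delivered. The sketch below computes a per-die margin between predicted and measured maximum frequency; the `margin_report` function and all numbers are invented for illustration and do not reflect any particular monitor IP or analytics product.

```python
def margin_report(predicted_mhz, measured_mhz):
    """Compare model-predicted and silicon-measured maximum frequency per die.

    Returns the mean and worst-case margin as a percentage of the prediction.
    A positive margin means the silicon is faster than the model predicted.
    """
    if not predicted_mhz or len(predicted_mhz) != len(measured_mhz):
        raise ValueError("need two non-empty lists of equal length")
    margins = [(m - p) / p * 100.0 for p, m in zip(predicted_mhz, measured_mhz)]
    return {"mean_pct": sum(margins) / len(margins), "worst_pct": min(margins)}

# Hypothetical data: the model predicted 2000 MHz for each of three dies.
report = margin_report([2000, 2000, 2000], [2100, 2200, 2040])
print(report)
```

A consistently positive worst-case margin would suggest the model is pessimistic, while a negative one would indicate dies that fall short of the prediction and deserve closer analysis.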

Additionally, establishing thresholds is an effective way to catch problems before something goes wrong. A preset threshold on a temperature monitor tells you when to start bringing the temperature back down; the same goes for voltage and path margin monitoring. The tighter your thresholds, the earlier you can act to avoid issues.
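
A threshold scheme of this kind might be sketched as a two-tier check (warning and critical) applied uniformly to temperature, voltage, and path margin readings. The monitor names, units, and limit values below are assumptions for the example only. Note that for path margin a smaller value is worse, so the comparison direction is configurable.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Threshold:
    warn: float
    critical: float
    higher_is_worse: bool = True

# Hypothetical limits; real values depend on the process and the product.
LIMITS = {
    "temp_c": Threshold(warn=85.0, critical=100.0),
    "vdd_droop_mv": Threshold(warn=30.0, critical=50.0),
    "path_margin_ps": Threshold(warn=20.0, critical=5.0, higher_is_worse=False),
}

def classify(name: str, value: float) -> str:
    """Classify a monitor reading as 'ok', 'warn', or 'critical'."""
    t = LIMITS[name]
    # Flip the sign for monitors where a lower reading is the dangerous direction.
    sign = 1.0 if t.higher_is_worse else -1.0
    if sign * value >= sign * t.critical:
        return "critical"
    if sign * value >= sign * t.warn:
        return "warn"
    return "ok"

for name, value in [("temp_c", 90.0), ("vdd_droop_mv", 12.0), ("path_margin_ps", 3.0)]:
    print(name, value, classify(name, value))
```

Tightening the `warn` limits moves the first alert earlier, which is the trade-off the paragraph above describes: more lead time in exchange for more frequent alerts.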

 

A Holistic SLM Strategy

As high-performance computing continues to advance and large amounts of data continue to move from one digital device to another, there are more challenges that designers need to contend with than ever before. With a holistic SLM strategy, chip designers can test, monitor, and repair devices, while receiving meaningful data and actionable next steps.

There is no room to overlook what is going on within a chip; having an SLM strategy in place helps identify and prevent the problems that arise from extreme performance and design requirements before they cause failures. While we discussed the features of PVT monitors, SLM goes beyond that to include analysis of silicon health data in one place, providing a deeper diagnosis of possible issues and helping avoid data center downtime.

Synopsys’ SLM solutions allow designers to put a holistic SLM strategy in place. The Synopsys SLM family is a set of integrated tools, IP, and methodologies aimed at improving silicon health and operational metrics at every phase of the device lifecycle. In the design phase, silicon data can be fed back for better design tuning. The chip production phase is where device screening takes place to improve quality and reliability. Finally, the system bring-up phase is where analytics data from previous phases and sensor data can be gathered and analyzed.

Specific solutions within the Synopsys SLM family (Figure 3), including PVT Monitors, Path Margin Monitors (PMMs), and Real-time High-Speed Access and Test (HSAT) IP, provide the ability to monitor data and run manufacturing and in-field tests.

 


Figure 3. Synopsys SLM solution flow from in-design to in-field

 

Additionally, the Synopsys HSAT IP allows designers to continue performing diagnostics once the device is deployed. Overall, the Synopsys SLM family of products provides analysis and insights into different silicon health data in one place, helping designers track the RAS of their chip and the larger device in real time and detect SDCs before they cause harm.

 

All images used courtesy of Synopsys

Industry Articles are a form of content that allows industry partners to share useful news, messages, and technology with All About Circuits readers in a way editorial content is not well suited to. All Industry Articles are subject to strict editorial guidelines with the intention of offering readers useful news, technical expertise, or stories. The viewpoints and opinions expressed in Industry Articles are those of the partner and not necessarily those of All About Circuits or its writers.