Practical Chip Failure Analysis Methodology: From Fault Verification to Dynamic Root Cause Capture

A practical framework for chip failure analysis engineers: from A-B-A swap verification and ATE retest to EMMI/FIB/SEM, then on to dynamic waveform capture and TLP-based reproduction. It addresses three recurring pain points: misclassifying chip failures, seeing only outcomes instead of causes, and failing to capture transient root causes. Keywords: Failure Analysis, ATE, Latch-up.

Technical specifications snapshot

Domain: Semiconductor Failure Analysis (FA)
Intended audience: MCU/SoC/analog chip test, reliability, and FA engineers
Core methods: A-B-A Swap, ATE, EMMI/OBIRCH, FIB, SEM, TLP
Key models: Impedance Deviation Model, Probabilistic Closed Loop
Data source: Experience-based technical blog article
Language: Chinese
License: Not specified in the original article
Stars: Not applicable
Core dependencies: Oscilloscope, ATE platform, layout/GDS, failure analysis lab equipment

This failure analysis methodology begins by defining the system boundary

The most common mistake in chip failure analysis is not a lack of technical skill, but attributing the issue to the chip itself too early. High-quality FA must first upgrade “experience-based judgment” into “physical-evidence-driven fault verification.”

The article emphasizes that without rigorous A-B-A swap cross-validation, engineers should not rush the case into the lab. At the same time, stay alert to the illusion created by condition-dependent behavior: a chip that fails on a customer board does not necessarily contain deterministic internal hard damage, because the failure may follow the board's stress conditions rather than a defect inside the device.

# Example fault verification workflow
fa_checklist = {
    "swap_test": True,      # A-B-A swap verification is mandatory
    "scope_power_on": True, # Check power-on sequencing and surge behavior
    "ground_bounce": True,  # Rule out ground bounce noise
    "sample_expand": True,  # Expand the sample set and observe failure-rate distribution
    "ate_retest": True      # Retest on the original ATE platform to locate the failing block
}

# Enter the next stage only after all critical checks are complete
ready_for_lab = all(fa_checklist.values())

This code expresses the minimum prerequisite before entering the lab: first narrow the system boundary using system-level evidence, then proceed to physical analysis.

Expanding the sample set and tracing lot history quickly separates isolated defects from batch issues

If failures are highly scattered, the issue is more likely an isolated device defect. If failures cluster around a specific lot or operating condition, immediately trace the wafer lot number, CP/FT records, and fab process drift. The value of this step lies in quickly determining whether the issue belongs to design, manufacturing, or application stress.
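
The scatter-versus-cluster judgment above can be sketched as a simple lot-concentration check. This is a minimal illustration; the 0.5 threshold and the lot-ID representation are assumptions, not from the original article:

```python
from collections import Counter

def classify_distribution(failed_lots, cluster_ratio=0.5):
    """Separate isolated defects from batch issues by lot concentration.

    failed_lots: list of wafer-lot IDs, one entry per failed unit.
    cluster_ratio: assumed threshold; if a single lot holds more than
    this share of failures, treat it as a suspected batch issue.
    """
    if not failed_lots:
        return "No Data"
    counts = Counter(failed_lots)
    top_lot, top_count = counts.most_common(1)[0]
    if top_count / len(failed_lots) > cluster_ratio:
        return f"Batch Suspect: {top_lot}"  # trace CP/FT records and process drift for this lot
    return "Isolated Defect Likely"         # failures scattered across lots
```

For example, eight failures from one lot plus two stragglers would flag that lot for CP/FT record tracing, while four failures from four different lots would point toward an isolated device defect.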

ATE retest should not be skipped either. Through datalog analysis, it can point directly to Flash, RAM, ADC, or a specific register anomaly, providing a clear starting point for subsequent physical localization.
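
Mapping datalog failures to functional blocks can be sketched as below. The test-name prefixes and the (test, pass/fail) tuple format are illustrative assumptions, not a real ATE datalog schema:

```python
def locate_failing_block(datalog):
    """Map failing ATE test names to the functional blocks they exercise.

    datalog: list of (test_name, passed) tuples; prefixes are assumed
    to encode the block under test, e.g. "FLASH_march".
    """
    block_map = {"FLASH": "Flash", "RAM": "RAM", "ADC": "ADC", "REG": "Register"}
    failing = []
    for test_name, passed in datalog:
        if passed:
            continue
        prefix = test_name.split("_")[0].upper()
        failing.append(block_map.get(prefix, "Unknown"))
    # De-duplicate while preserving first-seen order
    return list(dict.fromkeys(failing))
```

The resulting block list gives physical localization a concrete starting point instead of a whole-die search.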

Building an impedance deviation model is the key to unifying electrical anomalies

Instead of guessing causes from symptoms, mapping anomalies into “impedance change” is closer to the physical essence of the problem. This model can explain hard shorts, opens, and soft failures at the same time, preventing the analysis path from devolving into fragmented experience-based heuristics.

A short or leakage path can be understood as a sharp drop in DC impedance. Common root causes include ESD diode breakdown, PMOS drain-body junction failure, or a weak internal leakage path within the logic. An open corresponds to DC impedance approaching infinity, commonly caused by bond wire breakage, package delamination, or complete burnout of a metal interconnect.

# Coarse failure classification using an impedance model

def classify_failure(dc_resistance_ohms, ac_drive_ok):
    if dc_resistance_ohms < 10:
        return "Short/Leakage"   # DC impedance below ~10 ohms suggests a short or leakage path
    if dc_resistance_ohms > 1e6:
        return "Open"            # DC impedance above ~1 Mohm suggests an open
    if not ac_drive_ok:
        return "Soft Failure"    # AC drive is abnormal, suggesting timing or edge-related issues
    return "Need More Data"      # Evidence is insufficient; collect more measurements

This code shows that abstracting failure into impedance and drive-strength deviation helps establish a unified diagnostic framework quickly.

Soft failures often leave no visible damage, but they are the most misleading

AC impedance anomalies often appear as incorrect output levels, functional misbehavior, or timing failures at high or low temperature. They may leave no microscopic damage, yet still be triggered by increased RC delay, slower edges, or degraded drive strength.

As a result, if you focus only on microscopy images, you can easily misclassify the real dynamic problem as NFF. This is also why the article stresses the importance of respecting the boundary of physical evidence.
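
The soft-failure mechanism described here is essentially timing-margin erosion, which a simple slack calculation can illustrate. The delay numbers below are invented for illustration only:

```python
def timing_margin_ns(clock_period_ns, logic_delay_ns, rc_delay_ns, setup_ns):
    """Slack remaining after logic delay, interconnect RC delay, and setup time.

    A negative margin means the path fails functionally even though no
    physical damage is visible -- the soft-failure case in the text.
    """
    return clock_period_ns - (logic_delay_ns + rc_delay_ns + setup_ns)

# A degraded node with increased RC delay can flip the sign of the margin:
healthy = timing_margin_ns(10.0, 6.0, 1.5, 1.0)   # +1.5 ns slack: passes
degraded = timing_margin_ns(10.0, 6.0, 3.5, 1.0)  # -0.5 ns: fails with nothing to see under a microscope
```

This is why temperature and voltage corners expose such failures: they shift the delay terms without leaving any image-based proof.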

Physical analysis must distinguish primary damage from secondary effects

When you see a large metal melt crater, it is easy to conclude that the root cause is overcurrent burnout. In many cases, however, that is only the final result. The true primary damage may be an earlier nanoscale soft gate-oxide breakdown that triggered latch-up and only then caused the subsequent high-current destruction.

This reversal of cause and effect is one of the most expensive mistakes in failure analysis. If you only fix the “melt-crater result,” for example by simply widening a metal line, you will often miss the real trigger.

Layout alignment and schematic cross-validation determine root cause quality

After identifying a suspicious site, map the physical coordinates precisely to the GDS layout and confirm the connected pad, functional block, voltage rating, and current limit. Only then can you upgrade “seeing damage” into “explaining why the damage occurred.”

# Minimum data structure for root-cause closure
root_cause = {
    "physical_site": "Gate oxide near input pad",   # Initial suspect location
    "layout_mapping": "PAD_ESD_CELL_IN_A",         # Mapped layout cell
    "rated_voltage": 3.3,                           # Device rated voltage
    "observed_stress": 5.0,                         # Measured or inferred stress voltage
    "conclusion": "Over-voltage induced breakdown" # Root-cause conclusion
}

This code illustrates the basic elements of root-cause closure: the physical site, layout point, rated value, stress value, and conclusion must all align.
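
A closure record like the one above can also be sanity-checked mechanically. This sketch reuses the field names from the root_cause dict; the specific consistency rule (an over-voltage conclusion requires stress above rating) is an illustrative assumption:

```python
def check_closure(root_cause):
    """Check that a root-cause record is complete and internally consistent."""
    required = {"physical_site", "layout_mapping", "rated_voltage",
                "observed_stress", "conclusion"}
    missing = required - root_cause.keys()
    if missing:
        return f"Incomplete: missing {sorted(missing)}"
    # An over-voltage conclusion must be backed by stress exceeding the rating
    if "Over-voltage" in root_cause["conclusion"] and \
            root_cause["observed_stress"] <= root_cause["rated_voltage"]:
        return "Inconsistent: over-voltage claimed but stress <= rating"
    return "Closed"
```

A record that claims over-voltage breakdown while the measured stress sits below the rating should send the analysis back a step rather than into the report.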

Dynamic root cause capture is the real dividing line in advanced FA

The hardest part is not seeing the damage, but capturing the abnormal waveform at the exact moment the failure occurs. ESD or latch-up often triggers on the nanosecond scale, and probe parasitics can in turn alter the original circuit, making the measurement itself a source of disturbance.

If the sample burns out immediately after the failure event, reproduction becomes even more difficult. At that point, the highest-value engineering method is not blindly adding more equipment, but using “dimensionality reduction for reproduction.”

Delaying failure turns unmeasurable events into observable anomalies

By lowering the supply voltage, clock frequency, or adjusting temperature, you can transform a high-speed destructive failure into a repeatable functional anomaly, then use an oscilloscope to capture timing errors and voltage droop. This is the core strategy for converting “instant death” into “slow exposure.”
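
The "slow exposure" strategy can be organized as a derating sweep over supply and frequency. The grid values and the fails_at predicate below are stand-ins for real bench measurements, not part of the original article:

```python
def find_repeatable_point(fails_at, vdd_steps, freq_steps):
    """Sweep reduced VDD and clock frequency for a condition where the
    failure still reproduces but is no longer destructive.

    fails_at(vdd, freq) -> "destructive" | "repeatable" | "pass" is a
    placeholder for the actual measurement at each operating point.
    """
    for vdd in vdd_steps:          # derate the supply first
        for freq in freq_steps:    # then the clock frequency
            if fails_at(vdd, freq) == "repeatable":
                return vdd, freq   # a scope-triggerable operating point
    return None                    # no safe reproduction window found
```

Once such a point exists, ordinary trigger-based oscilloscope capture becomes viable for an event that was previously instantly destructive.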

For internal high-speed nodes, dynamic EMMI or TREM provides a measurement capability that is closer to non-invasive observation. It does not directly replace an oscilloscope, but when probes cannot physically reach the node, it provides optical side evidence for internal ultrafast activity.

# Failure reproduction strategy selector

def reproduce_strategy(repeatable, destructive, internal_node):
    if internal_node:
        return "Dynamic EMMI/TREM"   # Prefer non-invasive probing for internal high-speed nodes
    if destructive and not repeatable:
        return "Reduce VDD/Freq/Temp Stress"  # Delay the failure first
    if repeatable:
        return "Scope + Trigger Capture"      # If repeatable, capture the waveform directly
    return "TLP/ESD Gun Reverse Verification" # Recreate the waveform for reverse verification

This code summarizes the decision logic of dynamic analysis: first evaluate repeatability, destructiveness, and the observable boundary.

In practice, probabilistic closure matters more than perfect evidence

In the real world, not every case can deliver perfect waveforms, microscopy images, and design-side confirmation all at once. Senior FA engineers must accept a basic reality: many complex failures can only be closed with high probability.

That does not mean drawing conclusions carelessly. It means combining historical case libraries, DFMEA-based elimination paths, and statistical reproduction results from accelerated tests such as HTOL and HAST to build a strongly correlated evidence chain that is sufficient to drive design improvement.
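
One way to make "high-probability closure" concrete is to combine independent evidence sources into a rough confidence number. The independence assumption and all probability values here are illustrative, not a method prescribed by the article:

```python
def closure_confidence(evidence):
    """Combine independent evidence into a rough closure confidence.

    evidence: dict mapping an evidence source (case-library match, DFMEA
    elimination, HTOL/HAST reproduction, ...) to the assumed probability
    that it correctly implicates the hypothesized root cause.
    Assumes independence: confidence = 1 - product of per-source misses.
    """
    p_all_wrong = 1.0
    for p in evidence.values():
        p_all_wrong *= (1.0 - p)
    return 1.0 - p_all_wrong
```

Three individually weak sources at 0.6, 0.5, and 0.7 already combine to 0.94, which is the sense in which a strongly correlated evidence chain can drive a design change without a perfect waveform.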

FAQ: The three questions engineers ask most often

1. Why should ATE retest come first?

Because ATE datalogs can narrow the failure scope to memory, analog, or digital blocks before lab teardown begins, improving the efficiency of downstream localization.

2. Why do so many failure cases end up as NFF?

Because soft failures, timing races, and transient noise coupling do not necessarily leave visible physical damage. If you ignore the boundary of dynamic evidence and force a search for “image-based proof,” you may instead destroy the sample.

3. When should you use TLP or an ESD gun for reverse verification?

When damage morphology suggests a type of external stress but you lack direct waveform evidence, you can apply targeted stress to known-good units. If you reproduce similar damage, you can significantly increase confidence in the root cause.
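
The reverse-verification logic in this answer can be sketched as a signature comparison across stressed known-good units. The set-of-attributes signature representation and the 0.8 confidence bar are illustrative assumptions:

```python
def reverse_verify(field_signature, induced_signatures):
    """Count how many TLP/ESD-stressed known-good units reproduce the
    field damage signature; a high match ratio raises root-cause confidence.
    """
    if not induced_signatures:
        return 0.0, False
    matches = sum(1 for s in induced_signatures if s == field_signature)
    ratio = matches / len(induced_signatures)
    return ratio, ratio >= 0.8  # 0.8 is an illustrative confidence bar
```

If four of five stressed units show the same damage site and mode as the field return, the external-stress hypothesis is substantially strengthened even without a captured waveform.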


Core Summary: This article reconstructs practical failure analysis experience into an executable methodology that covers fault verification, impedance modeling, physical analysis, dynamic waveform capture, and probabilistic closure, helping chip, test, and reliability engineers locate true root causes more efficiently.