

# System-on-chip Fault-Tolerance: the DeSyRe approach

#### **Ioannis Sourdis**

Computer Science and Engineering dept. Chalmers University of Technology, Sweden











University of Bristol







# Technology trends

### As technology scales, chips are becoming less reliable


- Harder to keep defect density constant
  - **Manufacturing cost** increases
- Variations become more severe
- Transistor aging is accelerated
- Soft-error rate grows exponentially
  - mostly logic state bits



DeSyRe: on Demand System Reliability © DeSyRe consortium, all rights reserved






Figure 5. Soft-error rate (logic and memory).

## Transistor count






## Performance trends



### New technology nodes do not deliver significant performance improvements



4/26/2012


## Power trends



### Power consumption is a major limiting factor






# Power density limits gate density



- By 2018 chips will exceed the max allowed power consumption:
  - Then significant parts of a chip will need to be inactive to stay within the available power budget
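
As a rough reading of the bullet above, assume (this is an illustration, not a figure from the talk) that power scales linearly with the fraction of the chip that is active. With $P_{\text{full}}$ the power of a fully active chip and $P_{\text{budget}}$ the allowed budget, the active fraction $f$ is bounded by

$$
f \le \frac{P_{\text{budget}}}{P_{\text{full}}}
$$

For hypothetical values $P_{\text{full}} = 200\,\text{W}$ and $P_{\text{budget}} = 125\,\text{W}$, this gives $f \le 0.625$, so at least 37.5% of the chip would have to stay inactive.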



## The Cost of Fault Tolerance

### Fault Tolerance requires Redundancy

### In time:

Performance overhead

### In space:

Power and area overhead

### Both in time and space:

Energy overhead
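
The two kinds of redundancy can be sketched as follows. This is a minimal illustration, not the DeSyRe implementation: the function names and the `good`/`faulty` replicas are hypothetical, and the cost figures in the comments are the textbook approximations.

```python
from collections import Counter


def run_time_redundant(fn, x):
    """Redundancy in time: run twice on one unit and compare (about 2x latency)."""
    first, second = fn(x), fn(x)
    if first != second:
        raise RuntimeError("mismatch detected; re-execution needed")
    return first


def run_space_redundant(replicas, x):
    """Redundancy in space: replicas vote on the result (about 3x area and power)."""
    results = [replica(x) for replica in replicas]
    value, votes = Counter(results).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("no majority among replicas")
    return value


def good(x):
    return x * x


def faulty(x):
    return x * x + 1  # hypothetical replica with a persistent fault


time_result = run_time_redundant(good, 4)
space_result = run_space_redundant([good, faulty, good], 4)
```

The time-redundant version pays in latency, the space-redundant version pays in area and power, and both pay in energy because the work is done more than once.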





# **DeSyRe:** on-Demand System Reliability

- Generic design framework for future reliable SoCs
- Main Goal:
  - to reduce the costs for fault tolerance
- How?
  - Flexibility/reconfigurability
  - System-level support for dynamic adaptation
- Applications: DeSyRe will be applied to two Medical SoCs











# DeSyRe objectives

Develop a generic design framework for heterogeneous embedded SoCs, to:

- 1. Provide reliability at a reduced performance, energy and power cost.
  - Tolerate permanent, intermittent and transient faults
- 2. **Reliable by design** (for fabless chip developers)
  - Guarantee reliability of chips by design
  - Technology independent
- 3. Build on-demand adaptive systems: systems built on flexible/reconfigurable hardware complemented by system-level techniques, adaptive to:
  - Various types of faults
  - System constraints (power, energy, resources)
  - Application requirements (performance constraints, reliability requirements/safety/availability)
- 4. Increased defect tolerance (design defects, manufacturing defects, defects due to aging)
  - Increased manufacturing yield
  - Longer SoC lifetime
  - Lower manufacturing cost
  - Shorter time-to-market




# The DeSyRe design framework

- Only a small fraction of the chip will be designed to be Fault Free.
  - To Reduce Cost for Fault Tolerance
- Fault Free part will manage the remaining unreliable resources.
- The remaining Fault Prone resources will provide flexibility through reconfiguration
- System reliability managed at runtime based on the:
  - Application requirements
  - System constraints
  - Fault types and density
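
One way to picture the runtime decision is a lookup over candidate techniques, filtered by the current fault type and system constraints. The sketch below is an assumption-laden illustration: the catalogue, its technique names and all cost numbers are hypothetical and not taken from DeSyRe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Technique:
    name: str
    power_mw: float       # hypothetical extra power cost
    slowdown: float       # hypothetical execution-time factor
    covers: frozenset     # fault types the technique handles


# Hypothetical catalogue; the numbers are illustrative only.
CATALOGUE = [
    Technique("checkpoint_reexecute", 5.0, 1.3, frozenset({"transient"})),
    Technique("task_migration", 10.0, 1.1,
              frozenset({"intermittent", "permanent"})),
    Technique("triple_redundancy", 40.0, 1.0,
              frozenset({"transient", "intermittent", "permanent"})),
]


def choose_technique(fault_type, power_budget_mw, max_slowdown):
    """Pick the lowest-power technique that covers the fault and fits the limits."""
    feasible = [
        t for t in CATALOGUE
        if fault_type in t.covers
        and t.power_mw <= power_budget_mw
        and t.slowdown <= max_slowdown
    ]
    # None means no technique fits: the runtime would have to relax a constraint.
    return min(feasible, key=lambda t: t.power_mw, default=None)
```

For example, `choose_technique("transient", 20, 1.5)` selects the checkpointing entry, while a strict latency limit (`max_slowdown=1.0`) with a larger power budget pushes the choice to the redundancy entry.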



#### System-on-Chip




# DeSyRe framework layers



Runtime System Software-layer

MiddleWare Software-layer

#### Components:

- 1. Component Architecture
- 2. Realization (HW Substrate)




# DeSyRe Reconfigurable Substrate

#### Two levels of Reconfiguration:

- 1. Coarse-grain (Component-level)
- 2. Fine-grain (FPGA-like)

### Example: RISC processor

- partitioned in pipeline stages
- Reconfigurable wires for interconnection
- Fine-grain reconfigurable blocks
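
The RISC example can be sketched as a stage-to-block mapping in which a faulty pipeline stage is rerouted to a fine-grain reconfigurable block. The five-stage split, the block names and the `build_pipeline` helper are assumptions made for illustration; the slides do not specify them.

```python
# Assumed classic five-stage split; the slides only say "pipeline stages".
STAGES = ["fetch", "decode", "execute", "memory", "writeback"]


def build_pipeline(faulty_stages, spare_blocks):
    """Map each stage to its fixed block, or to a fine-grain block if it is faulty."""
    spares = list(spare_blocks)
    mapping = {}
    for stage in STAGES:
        if stage not in faulty_stages:
            mapping[stage] = f"fixed:{stage}"
        elif spares:
            # The reconfigurable wires would reroute the stage to this block.
            mapping[stage] = f"fine-grain:{spares.pop(0)}"
        else:
            raise RuntimeError(f"no spare block for stage {stage}")
    return mapping


pipeline = build_pipeline({"execute"}, ["blk0", "blk1"])
```

Here only the `execute` stage moves to a fine-grain block; the remaining stages keep their fixed implementations.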






## Fault Types: causes / DeSyRe solutions

#### Permanent faults:

- Cause:
  - Manufacturing process or Aging
- Detection (diagnosis):
  - Online and offline testing managed by the fault-free part (or even locally)
- Correction:
  - System reconfiguration: isolate, then replace, migrate tasks, or reconfigure
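
The isolate-and-migrate step for a permanent fault can be sketched as below. It assumes a simple model (tasks placed on numbered tiles, migration to the least-loaded healthy tile); the function and the task names are hypothetical.

```python
def isolate_and_migrate(placement, healthy, faulty_tile):
    """Retire faulty_tile and move its tasks to the least-loaded healthy tiles.

    placement maps task name -> tile id; healthy is the set of usable tile ids.
    Returns the new placement and the remaining healthy tiles.
    """
    remaining = set(healthy) - {faulty_tile}            # isolate the tile
    if not remaining:
        raise RuntimeError("no healthy tile left")
    load = {tile: 0 for tile in remaining}
    for tile in placement.values():
        if tile in load:
            load[tile] += 1
    new_placement = dict(placement)
    for task, tile in sorted(placement.items()):
        if tile == faulty_tile:                          # migrate the task
            target = min(sorted(load), key=load.get)
            new_placement[task] = target
            load[target] += 1
    return new_placement, remaining


new_placement, remaining = isolate_and_migrate(
    {"filter": 0, "control": 1, "log": 1}, {0, 1, 2}, faulty_tile=1
)
```

In this model the fault-free part would own `placement` and `healthy`, so the unreliable tiles never decide their own status.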

#### Intermittent Faults:

- Cause: variations (process, temperature, etc.)
- Detection: same as transient (+ extra step to distinguish from transient)
- Correction:
  - 1<sup>st</sup> step: same as transient
  - 2<sup>nd</sup> step: similar to permanent (mostly task-migration)
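
The extra step that separates intermittent from transient faults could, for instance, count how often errors recur on the same tile. The rule below (at least `threshold` errors within `window` cycles) and its default values are assumptions for illustration, not the DeSyRe criterion.

```python
from collections import defaultdict, deque


class FaultClassifier:
    """Label errors that recur on one tile as intermittent (hypothetical rule)."""

    def __init__(self, window=100, threshold=3):
        self.window = window          # cycles to look back (assumed value)
        self.threshold = threshold    # errors within the window (assumed value)
        self.history = defaultdict(deque)

    def report_error(self, tile, cycle):
        events = self.history[tile]
        events.append(cycle)
        # Forget errors that fall outside the observation window.
        while events and cycle - events[0] > self.window:
            events.popleft()
        if len(events) >= self.threshold:
            return "intermittent"     # second step: treat like permanent
        return "transient"            # first step: recover and re-execute


classifier = FaultClassifier(window=100, threshold=3)
labels = [classifier.report_error(0, c) for c in (10, 50, 90, 500)]
```

The first two errors are handled as transient, the third within the window triggers the intermittent path, and an isolated error much later is treated as transient again.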




### **Transient Faults:**

- Cause:
  - Radiation, α-particles, etc.
- Detection:
  - Local checkers per tile
  - Checkpoints set by the fault-free part
- Correction:
  - Checkpointing (HW assisted): recover and re-execute
    - based on the application needs
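
Recover-and-re-execute can be sketched in software as follows. DeSyRe describes hardware-assisted checkpointing; this Python version only illustrates the control flow, and `run_with_checkpoints`, `flaky_increment` and the retry limit are hypothetical.

```python
import copy


def run_with_checkpoints(state, steps, check, max_retries=3):
    """Run steps in order; on a failed check, roll back and re-execute the step."""
    for step in steps:
        checkpoint = copy.deepcopy(state)              # set a checkpoint
        for _attempt in range(max_retries + 1):
            candidate = step(copy.deepcopy(checkpoint))
            if check(candidate):                       # local checker passes
                state = candidate
                break
        else:
            raise RuntimeError("fault persists; probably not transient")
    return state


calls = {"n": 0}


def flaky_increment(s):
    calls["n"] += 1
    s["x"] += 1
    if calls["n"] == 1:
        s["x"] = -999      # simulated transient upset on the first attempt
    return s


result = run_with_checkpoints(
    {"x": 0}, [flaky_increment, flaky_increment], lambda s: s["x"] >= 0
)
```

How often a checkpoint is taken is the knob that the application needs would set: frequent checkpoints cost more in normal operation but lose less work on a fault.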

www.desyre.eu

ECCs

# Applications

### Two Medical SoCs

- Applications provided by Neurasmus BV
  - Implantable Artificial Pancreas
  - Wearable Artificial Cerebellum
- Very high reliability constraints
- Different power efficiency requirements
- Different processing requirements





## Artificial Cerebellum

- Replace damaged parts of the brain cerebellum
  - recovering sensorimotor control
- Build an artificial cerebellar system
  - Multi-level, closed-loop control
  - Use Deep-Cerebellar-Nuclear-Neuron models to artificially compute the output of the replaced part.





Realtime constraints:

- Low latency
- High throughput
- High reliability
- Power efficiency
- Adaptive to different input patterns






## **Artificial Pancreas**



- Glucose-level sensing
- Control insulin injection
- Lightweight processing
- Ultra-low power
- Wireless (and secure) user interface




## Conclusions

- DeSyRe will describe a design framework for reliable future SoCs
  - Lowering the cost of Fault tolerance
    - Power
    - Energy
    - Performance

- On-demand adaptation to
  - Fault types and densities
  - System constraints
  - Application requirements
- DeSyRe SoCs will exploit
  - The flexibility of a reconfigurable substrate
  - Runtime System support for adaptation





Thank you! Questions?

Email: sourdis@chalmers.se

www.DeSyRe.eu




