On Designing Fault Tolerant Nanoscale Systems

Teijo Lehtonen, On Designing Fault Tolerant Nanoscale Systems. 2007.

Abstract:

This thesis addresses fault tolerance design aspects of nanoscale systems. Fault tolerance concepts will be important in designing nanoscale systems, since the probability of faults is expected to increase due to shrinking device sizes, larger relative parameter deviations and a higher degree of integration. Applying fault tolerance methods enables the use of circuits that contain some faults, thus giving better manufacturing yield and longer circuit lifetime. Furthermore, faults occurring at run time do not compromise the correct operation of the system if proper fault tolerance methods are employed.

The fault sources in nanoscale systems are identified and the faults are classified into permanent, intermittent and transient errors. The fault tolerance methods can be divided into static and dynamic redundancy. A survey of fault tolerance methods presents principles and structures for a number of methods from both categories, together with a discussion of their suitability for nanoscale systems.
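
As an illustration of static redundancy, the sketch below shows a bitwise majority voter of the kind used in triple modular redundancy (TMR). The abstract does not say which methods the survey covers, so TMR is chosen here only as a well-known example, and the function name `majority_vote` is hypothetical.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority of three redundant module outputs.

    Assumption: the three modules compute the same word, and at most one
    of them is faulty in any given bit position.
    """
    return (a & b) | (a & c) | (b & c)


# One replica disagrees in two bit positions; the voter masks the fault.
voted = majority_vote(0b1010, 0b1010, 0b0110)
```

Static redundancy of this kind masks a fault without detecting or locating it, whereas dynamic redundancy first detects the fault and then reconfigures the system.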

An architecture-level approach for analyzing and improving fault tolerance is presented and applied to radio architectures. The results show that the parallel structures inherently present in many radio systems can be used to increase system reliability by trading off either system performance or circuit area, the latter by inserting spare modules that can replace faulty modules. Parallel structures enable the use of one module as a spare for many other modules.
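
The benefit of a shared spare can be estimated with a simple k-out-of-n reliability model. The sketch below is not the analysis used in the thesis: it assumes identical modules that fail independently, each working with probability `r`, and it ignores the reliability of the switching logic that routes around a faulty module. The function name and the numbers are illustrative only.

```python
from math import comb


def k_out_of_n_reliability(n: int, k: int, r: float) -> float:
    """Probability that at least k of n identical modules work.

    Assumptions: modules fail independently, each works with
    probability r, and reconfiguration logic is fault-free.
    """
    return sum(comb(n, i) * r**i * (1 - r) ** (n - i) for i in range(k, n + 1))


# Four parallel modules are required; compare no spare with one shared spare.
without_spare = k_out_of_n_reliability(4, 4, 0.9)  # all four must work
with_spare = k_out_of_n_reliability(5, 4, 0.9)     # any four of five suffice
```

Under these assumptions a single spare shared by four modules raises the probability of a working system from about 0.656 to about 0.919 at the cost of one extra module.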

The same approach is applied to network-on-chip (NoC) architectures, which are believed to be the basic platforms of future complex multicore systems. The basic building blocks needed to realize a NoC are implemented for the analysis. The results show that the overall reliability can be enhanced by inserting a second network interface to each core and using a communication network that is constructed from minimum-size routers.

Finally, a fault tolerant on-chip link targeted at NoC platforms is presented. The link system combines several fault tolerance methods to achieve a system that is capable of tolerating different types of errors. The applied methods include coding to detect errors and retransmissions as the recovery method against transient errors, together with spare wires or split transmissions to handle intermittent and permanent errors. The presented structures are implemented and their impact on area, performance and power consumption is demonstrated with numerous simulations.
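
The combination of error detection and retransmission can be sketched as follows. The abstract does not name the code used on the link, so a single even-parity bit stands in for it here. The channel is modelled as a hypothetical callback, and all function names are assumptions made for illustration.

```python
from typing import Callable, Optional, Tuple


def add_parity(word: int, width: int) -> int:
    """Append one even-parity bit as the least significant bit."""
    word &= (1 << width) - 1
    parity = bin(word).count("1") & 1
    return (word << 1) | parity


def check_parity(codeword: int) -> bool:
    """Return True if the codeword has an even number of ones."""
    return bin(codeword).count("1") % 2 == 0


def send_with_retransmission(
    word: int,
    width: int,
    channel: Callable[[int, int], int],
    max_attempts: int = 3,
) -> Tuple[Optional[int], int]:
    """Send a word, retransmitting while the receiver detects an error.

    Returns (received word or None, number of attempts used).
    """
    codeword = add_parity(word, width)
    for attempt in range(1, max_attempts + 1):
        received = channel(codeword, attempt)
        if check_parity(received):
            return received >> 1, attempt  # strip the parity bit
    return None, max_attempts


def glitch_on_first_attempt(codeword: int, attempt: int) -> int:
    """Hypothetical channel: a transient fault flips one bit on attempt 1."""
    return codeword ^ 0b100 if attempt == 1 else codeword


def stuck_wire(codeword: int, attempt: int) -> int:
    """Hypothetical channel: a permanent fault flips the same bit every time."""
    return codeword ^ 0b100


transient_result = send_with_retransmission(0b1011, 4, glitch_on_first_attempt)
permanent_result = send_with_retransmission(0b1011, 4, stuck_wire)
```

In this sketch the transient fault is recovered on the second attempt, while the permanent fault defeats every retransmission. That second case is the one spare wires or split transmissions are meant to handle. A single parity bit also misses any even number of bit errors, so a real link would likely need a stronger code.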

BibTeX entry:

@LICTHESIS{licLehtonen07a,
  title = {On Designing Fault Tolerant Nanoscale Systems},
  author = {Lehtonen, Teijo},
  year = {2007},
  keywords = {fault tolerance, reliability, nanoscale systems, on-chip communication},
}

Belongs to TUCS Research Unit(s): Distributed Systems Laboratory (DS Lab)
