On Fault Tolerance Methods for Networks-on-Chip

Teijo Lehtonen, On Fault Tolerance Methods for Networks-on-Chip. TUCS Dissertations 122. Turku Centre for Computer Science, 2009.

Abstract:

Technology scaling has reached dimensions at which the reliability of manufactured devices is endangered. The decrease in reliability is a consequence of physical limitations, a relative increase in variations, and decreasing noise margins, among others. A promising solution for bringing the reliability of circuits back to a desired level is the use of design methods which introduce tolerance against possible faults in an integrated circuit.

This thesis studies and presents fault tolerance methods for networks-on-chip (NoC), a design paradigm targeted at very large systems-on-chip. In a NoC, resources such as processors and memories are connected to a communication network, comparable to the Internet. Fault tolerance in such a system can be achieved at many abstraction levels.

The thesis studies the origin of faults in modern technologies and explains their classification into transient, intermittent and permanent faults. A survey of fault tolerance methods is presented to demonstrate the diversity of available methods. Networks-on-chip are approached by exploring their main design choices: the selection of a topology, routing protocol, and flow control method. Fault tolerance methods for NoCs are studied at different layers of the OSI reference model.

The data link layer provides a reliable communication link over a physical channel. At this abstraction level, error control coding is an efficient fault tolerance method, especially against transient faults. Error control coding methods suitable for on-chip communication are studied and their implementations presented. Error control coding loses its effectiveness in the presence of intermittent and permanent faults, so other solutions against these faults are presented. Spare wires and split transmissions are shown to provide good tolerance against intermittent and permanent errors, and their combination with error control coding is illustrated.

At the network layer, positioned above the data link layer, fault tolerance can be achieved through the design of fault-tolerant network topologies and routing algorithms. Both approaches are presented in the thesis, together with realizations in both categories. The thesis concludes that an optimal fault tolerance solution contains carefully co-designed elements from different abstraction levels.

BibTeX entry:

@PHDTHESIS{phdLehtonen09a,
  title = {On Fault Tolerance Methods for Networks-on-Chip},
  author = {Lehtonen, Teijo},
  number = {122},
  series = {TUCS Dissertations},
  school = {Turku Centre for Computer Science},
  year = {2009},
  keywords = {fault tolerance, network-on-chip},
  ISBN = {978-952-12-2355-6},
}

Belongs to TUCS Research Unit(s): Distributed Systems Laboratory (DS Lab)
