
Lund University Publications


Optimizing Fault Tolerance for Real-Time Systems

Nikolov, Dimitar (LU) (2012)
Abstract
For the vast majority of computer systems, correct operation is defined as producing the correct result within a time constraint (deadline). We refer to such computer systems as real-time systems (RTSs). RTSs manufactured in recent semiconductor technologies are increasingly susceptible to soft errors, which necessitates the use of fault tolerance to detect and recover from errors that may occur. However, fault tolerance usually introduces a time overhead, which may cause an RTS to violate its time constraints. Depending on the consequences of violating the deadlines, RTSs are divided into hard RTSs, where the consequences are severe, and soft RTSs, otherwise. Traditionally, worst-case execution time (WCET) analyses are used for hard RTSs to ensure that deadlines are not violated, and average execution time (AET) analyses are used for soft RTSs. However, at design time the designer of an RTS faces the challenging task of deciding whether the system should be a hard or a soft RTS. In such a case, focusing only on WCET analyses may result in an over-designed system, while focusing only on AET analyses may result in a system that allows occasional deadline violations.
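The design-time dilemma above can be made concrete with a small numerical sketch. The Python fragment below is an illustration added here, not material from the thesis: it contrasts a hard real-time test based on WCET with a probabilistic test of the kind a confidence-style metric captures. The function names and all numbers are hypothetical.

# Illustrative sketch (not from the thesis): a hard real-time admission test
# based on worst-case execution time (WCET) versus a probabilistic test over
# observed completion times. All numbers below are hypothetical.

def meets_deadline_hard(wcet: float, deadline: float) -> bool:
    # Hard RTS view: acceptable only if even the worst case finishes in time.
    return wcet <= deadline

def meets_deadline_probabilistic(completion_times, deadline, confidence):
    # Soft RTS view: acceptable if a large enough fraction of observed (or
    # simulated) executions finishes before the deadline.
    met = sum(1 for t in completion_times if t <= deadline)
    return met / len(completion_times) >= confidence

# Hypothetical task: WCET 12 ms, deadline 10 ms, but 99 of 100 runs take 8 ms.
samples = [8.0] * 99 + [12.0]
print(meets_deadline_hard(12.0, 10.0))                    # False: WCET-only view rejects the task
print(meets_deadline_probabilistic(samples, 10.0, 0.95))  # True: 99% of runs meet the deadline

Designing solely for the first test can over-provision the system, while designing solely for the second can admit occasional deadline misses; this is the gap that the Level of Confidence metric introduced in the next paragraph is meant to bridge.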



To overcome this problem, we introduce the Level of Confidence (LoC) as a metric that evaluates to what extent a deadline is met in the presence of soft errors. The advantage is that the same metric can be used for both soft and hard RTSs; thus, a system designer can precisely specify to what extent a deadline is to be met. In this thesis, we address the optimization of Roll-back Recovery with Checkpointing (RRC), a representative fault tolerance technique because it enables detection of and recovery from soft errors at the cost of a time overhead that affects the execution time of tasks. This time overhead depends on the number of checkpoints that are used. Therefore, we provide mathematical expressions for finding the optimal number of checkpoints that leads to 1) minimal AET and 2) maximal LoC. To obtain these expressions, we assume that the error probability is given. However, the error probability is not known in advance and may even vary at runtime. Therefore, we propose two error probability estimation techniques, Periodic Probability Estimation and Aperiodic Probability Estimation, which estimate the error probability at runtime and adjust the RRC scheme with the goal of reducing the AET. Experiments show that both techniques provide near-optimal performance of RRC.
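To illustrate the kind of optimization the abstract refers to, the sketch below evaluates the expected execution time of a job under RRC as a function of the number of checkpoints and searches for the minimum. It uses a generic checkpointing cost model (equal-length segments, a fixed per-checkpoint overhead, errors arriving at a constant rate) that is assumed here for illustration only and is not necessarily the model or the closed-form expressions derived in the thesis; the constants are hypothetical.

# Minimal sketch of checkpoint-count optimization for roll-back recovery with
# checkpointing (RRC), under an assumed generic cost model (not the thesis's
# own expressions): a job with fault-free execution time T is split into n
# equal segments, each checkpoint adds an overhead tau, and a segment hit by
# an error (rate lam) is re-executed from its most recent checkpoint.
import math

def expected_execution_time(n: int, T: float, tau: float, lam: float) -> float:
    # Expected (average) execution time with n checkpoints: each of the n
    # segments of length T/n + tau is retried until it completes error-free,
    # so the number of attempts per segment is geometrically distributed.
    segment = T / n + tau
    p_error = 1.0 - math.exp(-lam * segment)   # probability one attempt fails
    return n * segment / (1.0 - p_error)

def optimal_checkpoint_count(T: float, tau: float, lam: float, n_max: int = 200) -> int:
    # Brute-force search for the checkpoint count that minimizes the AET,
    # mirroring the AET-minimization objective described in the abstract.
    return min(range(1, n_max + 1),
               key=lambda n: expected_execution_time(n, T, tau, lam))

# Hypothetical numbers: 100 ms job, 1 ms checkpoint overhead, 0.05 errors/ms.
T, tau, lam = 100.0, 1.0, 0.05
n_opt = optimal_checkpoint_count(T, tau, lam)
print(n_opt, round(expected_execution_time(n_opt, T, tau, lam), 1))

Too few checkpoints make each roll-back expensive, while too many inflate the checkpointing overhead itself, which is why an optimum exists. When the error rate is instead estimated at runtime, as with the Periodic and Aperiodic Probability Estimation techniques described above, such a search (or a closed-form counterpart) can be re-run with the updated estimate to adjust the number of checkpoints.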
author: Nikolov, Dimitar
supervisor:
organization:
publishing date: 2012
type: Thesis
publication status: published
subject:
pages: 98
publisher: Linköping University
language: English
LU publication?: no
id: a19456f1-4e70-46ce-b56a-01628a6c6f77 (old id 5050604)
alternative location: http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-84865
date added to LUP: 2016-04-01 13:15:08
date last changed: 2020-05-28 11:52:44
@misc{a19456f1-4e70-46ce-b56a-01628a6c6f77,
  abstract     = {{For the vast majority of computer systems, correct operation is defined as producing the correct result within a time constraint (deadline). We refer to such computer systems as real-time systems (RTSs). RTSs manufactured in recent semiconductor technologies are increasingly susceptible to soft errors, which necessitates the use of fault tolerance to detect and recover from errors that may occur. However, fault tolerance usually introduces a time overhead, which may cause an RTS to violate its time constraints. Depending on the consequences of violating the deadlines, RTSs are divided into hard RTSs, where the consequences are severe, and soft RTSs, otherwise. Traditionally, worst-case execution time (WCET) analyses are used for hard RTSs to ensure that deadlines are not violated, and average execution time (AET) analyses are used for soft RTSs. However, at design time the designer of an RTS faces the challenging task of deciding whether the system should be a hard or a soft RTS. In such a case, focusing only on WCET analyses may result in an over-designed system, while focusing only on AET analyses may result in a system that allows occasional deadline violations.

To overcome this problem, we introduce the Level of Confidence (LoC) as a metric that evaluates to what extent a deadline is met in the presence of soft errors. The advantage is that the same metric can be used for both soft and hard RTSs; thus, a system designer can precisely specify to what extent a deadline is to be met. In this thesis, we address the optimization of Roll-back Recovery with Checkpointing (RRC), a representative fault tolerance technique because it enables detection of and recovery from soft errors at the cost of a time overhead that affects the execution time of tasks. This time overhead depends on the number of checkpoints that are used. Therefore, we provide mathematical expressions for finding the optimal number of checkpoints that leads to 1) minimal AET and 2) maximal LoC. To obtain these expressions, we assume that the error probability is given. However, the error probability is not known in advance and may even vary at runtime. Therefore, we propose two error probability estimation techniques, Periodic Probability Estimation and Aperiodic Probability Estimation, which estimate the error probability at runtime and adjust the RRC scheme with the goal of reducing the AET. Experiments show that both techniques provide near-optimal performance of RRC.}},
  author       = {{Nikolov, Dimitar}},
  language     = {{eng}},
  note         = {{Licentiate Thesis}},
  publisher    = {{Linköping University}},
  title        = {{Optimizing Fault Tolerance for Real-Time Systems}},
  url          = {{http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-84865}},
  year         = {{2012}},
}