Advanced

Evaluation of Level of Confidence and Optimization of Roll-back Recovery with Checkpointing for Real-Time Systems

Nikolov, Dimitar LU ; Ingelsson, Urban; Singh, Virendra and Larsson, Erik LU (2014) In Microelectronics and Reliability 54(5). p.1022-1049
Abstract
Increasing soft error rates for semiconductor devices manufactured in later technologies enforce the usage of fault tolerant techniques such as Roll-back Recovery with Checkpointing (RRC). As RRC introduces time overhead that increases the completion (execution) time, time constraints (deadlines) might be violated. This is a drawback for a class of computer systems where the correct operation is defined not only by providing the correct outcome of an operation but also by ensuring that the deadlines are met. These computer systems are referred to as real-time systems (RTSs). In general RTSs are classified as soft and hard RTSs depending on the consequences of violating the deadlines. For soft RTSs, where consequences of violating the... (More)
Increasing soft error rates for semiconductor devices manufactured in later technologies enforce the usage of fault tolerant techniques such as Roll-back Recovery with Checkpointing (RRC). As RRC introduces time overhead that increases the completion (execution) time, time constraints (deadlines) might be violated. This is a drawback for a class of computer systems where the correct operation is defined not only by providing the correct outcome of an operation but also by ensuring that the deadlines are met. These computer systems are referred to as real-time systems (RTSs). In general RTSs are classified as soft and hard RTSs depending on the consequences of violating the deadlines. For soft RTSs, where consequences of violating the deadlines are not very severe, research have focused on optimizing RRC and shown that it is possible to find the optimal number of checkpoints such that the average execution time (AET) is minimal. While minimal AET is important for soft RTSs, it is more important to provide a high probability that deadlines are met for hard RTSs, where consequences of violating the deadlines may be catastrophic. Hence, there is a need of probabilistic guarantees that jobs employing RRC complete before a given deadline. Traditionally, AET analysis have been used for soft RTSs and worst case execution time (WCET) analysis along with schedule feasibility have been used for hard RTSs. In this paper we introduce a reliability metric, Level of Confidence (LoC), which is equally applicable to both soft and hard RTS. LoC is used as a metric to evaluate to what extent a deadline is met. The main contributions of this paper are as follows. First, we present a mathematical framework for the evaluation of LoC when RRC is employed. Second, we provide a proof to verify the correctness of the proposed expression. Third, in the context of hard RTSs, we provide a method to obtain the optimal number of checkpoints that maximizes the LoC. Fourth, in the context of soft RTSs where the maximal LoC may not be needed, but instead some LoC requirement is needed, we present an optimization method for RRC that finds the number of checkpoints that results in the minimal completion time while the minimal completion time satisfies a given LoC requirement. Fifth, we use the proposed framework to evaluate and compare probabilistic guarantees when RRC is optimized towards soft RTSs. (Less)
Please use this url to cite or link to this publication:
author
organization
publishing date
type
Contribution to journal
publication status
published
subject
keywords
fault tolerance, reliability, real-time systems, soft errors, checkpointing, Level of Confidence
in
Microelectronics and Reliability
volume
54
issue
5
pages
1022 - 1049
publisher
Elsevier
external identifiers
  • wos:000336777100024
  • scopus:84899960545
ISSN
0026-2714
DOI
10.1016/j.microrel.2014.02.004
language
English
LU publication?
yes
id
dfeb6863-46ff-43f4-ad0c-9e38ae200592 (old id 4359416)
date added to LUP
2014-03-17 12:45:23
date last changed
2017-07-09 04:12:24
@article{dfeb6863-46ff-43f4-ad0c-9e38ae200592,
  abstract     = {Increasing soft error rates for semiconductor devices manufactured in later technologies enforce the usage of fault tolerant techniques such as Roll-back Recovery with Checkpointing (RRC). As RRC introduces time overhead that increases the completion (execution) time, time constraints (deadlines) might be violated. This is a drawback for a class of computer systems where the correct operation is defined not only by providing the correct outcome of an operation but also by ensuring that the deadlines are met. These computer systems are referred to as real-time systems (RTSs). In general RTSs are classified as soft and hard RTSs depending on the consequences of violating the deadlines. For soft RTSs, where consequences of violating the deadlines are not very severe, research have focused on optimizing RRC and shown that it is possible to find the optimal number of checkpoints such that the average execution time (AET) is minimal. While minimal AET is important for soft RTSs, it is more important to provide a high probability that deadlines are met for hard RTSs, where consequences of violating the deadlines may be catastrophic. Hence, there is a need of probabilistic guarantees that jobs employing RRC complete before a given deadline. Traditionally, AET analysis have been used for soft RTSs and worst case execution time (WCET) analysis along with schedule feasibility have been used for hard RTSs. In this paper we introduce a reliability metric, Level of Confidence (LoC), which is equally applicable to both soft and hard RTS. LoC is used as a metric to evaluate to what extent a deadline is met. The main contributions of this paper are as follows. First, we present a mathematical framework for the evaluation of LoC when RRC is employed. Second, we provide a proof to verify the correctness of the proposed expression. Third, in the context of hard RTSs, we provide a method to obtain the optimal number of checkpoints that maximizes the LoC. Fourth, in the context of soft RTSs where the maximal LoC may not be needed, but instead some LoC requirement is needed, we present an optimization method for RRC that finds the number of checkpoints that results in the minimal completion time while the minimal completion time satisfies a given LoC requirement. Fifth, we use the proposed framework to evaluate and compare probabilistic guarantees when RRC is optimized towards soft RTSs.},
  author       = {Nikolov, Dimitar and Ingelsson, Urban and Singh, Virendra and Larsson, Erik},
  issn         = {0026-2714},
  keyword      = {fault tolerance,reliability,real-time systems,soft errors,checkpointing,Level of Confidence},
  language     = {eng},
  number       = {5},
  pages        = {1022--1049},
  publisher    = {Elsevier},
  series       = {Microelectronics and Reliability},
  title        = {Evaluation of Level of Confidence and Optimization of Roll-back Recovery with Checkpointing for Real-Time Systems},
  url          = {http://dx.doi.org/10.1016/j.microrel.2014.02.004},
  volume       = {54},
  year         = {2014},
}