
Lund University Publications


Analysis of DAWNBench, a time-to-accuracy machine learning performance benchmark

Coleman, Cody; Kang, Daniel; Narayanan, Deepak; Nardi, Luigi (LU); Zhao, Tian; Zhang, Jian; Bailis, Peter; Olukotun, Kunle; Ré, Chris and Zaharia, Matei (2019) In Operating Systems Review (ACM) 53(1). p.14-25
Abstract

Researchers have proposed hardware, software, and algorithmic optimizations to improve the computational performance of deep learning. While some of these optimizations perform the same operations faster (e.g., increasing GPU clock speed), many others modify the semantics of the training procedure (e.g., reduced precision) and can impact the final model's accuracy on unseen data. Due to a lack of standard evaluation criteria that consider these trade-offs, it is difficult to directly compare these optimizations. To address this problem, we recently introduced DAWNBENCH, a benchmark competition focused on end-to-end training time to achieve near-state-of-the-art accuracy on an unseen dataset: a combined metric called time-to-accuracy (TTA). In this work, we analyze the entries from DAWNBENCH, which received optimized submissions from multiple industrial groups, to investigate the behavior of TTA as a metric as well as trends in the best-performing entries. We show that TTA has a low coefficient of variation and that models optimized for TTA generalize nearly as well as those trained using standard methods. Additionally, even though DAWNBENCH entries were able to train ImageNet models in under 3 minutes, we find they still underutilize hardware capabilities such as Tensor Cores. Furthermore, we find that distributed entries can spend more than half of their time on communication. We show similar findings with entries to the MLPERF v0.5 benchmark.
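To make the abstract's two key quantities concrete, the sketch below illustrates (a) time-to-accuracy, the wall-clock time until a validation-accuracy target is first reached, and (b) the coefficient of variation used to argue that TTA is a stable metric across repeated runs. This is a minimal toy, not the benchmark's actual harness: the training loop, accuracy curve, and timings are stand-ins invented for illustration.

```python
import statistics
import time


def time_to_accuracy(train_step, evaluate, target_accuracy, max_epochs=100):
    """Run training until validation accuracy first reaches the target.

    Returns elapsed wall-clock seconds, or None if the target is never hit.
    """
    start = time.perf_counter()
    for epoch in range(max_epochs):
        train_step(epoch)
        if evaluate(epoch) >= target_accuracy:
            return time.perf_counter() - start
    return None


def coefficient_of_variation(samples):
    """CV = sample standard deviation / mean; a low CV means the
    measurement is stable relative to its magnitude."""
    return statistics.stdev(samples) / statistics.mean(samples)


# Toy stand-ins for a real training run: accuracy climbs each "epoch".
accuracy_curve = [0.50, 0.70, 0.85, 0.93, 0.94]
tta = time_to_accuracy(
    train_step=lambda epoch: None,             # no-op "training" step
    evaluate=lambda epoch: accuracy_curve[epoch],
    target_accuracy=0.93,                      # hypothetical accuracy target
    max_epochs=len(accuracy_curve),
)
print(tta is not None)  # target reached, so a finite TTA is returned

# CV over repeated TTA measurements (made-up timings, in seconds).
cv = coefficient_of_variation([100.0, 104.0, 98.0, 102.0])
print(round(cv, 3))
```

A run whose accuracy never reaches the target yields `None`, which mirrors how a benchmark entry that fails to hit the accuracy threshold simply does not score.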
publishing date: 2019
type: Contribution to journal
publication status: published
in: Operating Systems Review (ACM)
volume: 53
issue: 1
pages: 12 pages (pp. 14-25)
publisher: Association for Computing Machinery (ACM)
external identifiers: scopus:85071332053
ISSN: 0163-5980
DOI: 10.1145/3352020.3352024
language: English
LU publication?: no
additional info: Publisher Copyright: © Copyright held by the owner/author(s). Publication rights licensed to ACM.
id: caa379d0-9d54-4c2b-9a21-7b2bc5fe7d3f
date added to LUP: 2022-09-16 18:06:35
date last changed: 2022-09-19 13:07:45
@article{caa379d0-9d54-4c2b-9a21-7b2bc5fe7d3f,
  abstract     = {{<p>Researchers have proposed hardware, software, and algorithmic optimizations to improve the computational performance of deep learning. While some of these optimizations perform the same operations faster (e.g., increasing GPU clock speed), many others modify the semantics of the training procedure (e.g., reduced precision), and can impact the final model's accuracy on unseen data. Due to a lack of standard evaluation criteria that considers these trade-offs, it is difficult to directly compare these optimizations. To address this problem, we recently introduced DAWNBENCH, a benchmark competition focused on end-to-end training time to achieve near-state-of-the-art accuracy on an unseen dataset-a combined metric called time-to-accuracy (TTA). In this work, we analyze the entries from DAWNBENCH, which received optimized submissions from multiple industrial groups, to investigate the behavior of TTA as a metric as well as trends in the best-performing entries. We show that TTA has a low coefficient of variation and that models optimized for TTA generalize nearly as well as those trained using standard methods. Additionally, even though DAWNBENCH entries were able to train ImageNet models in under 3 minutes, we find they still underutilize hardware capabilities such as Tensor Cores. Furthermore, we find that distributed entries can spend more than half of their time on communication. We show similar findings with entries to the MLPERF v0.5 benchmark.</p>}},
  author       = {{Coleman, Cody and Kang, Daniel and Narayanan, Deepak and Nardi, Luigi and Zhao, Tian and Zhang, Jian and Bailis, Peter and Olukotun, Kunle and Ré, Chris and Zaharia, Matei}},
  issn         = {{0163-5980}},
  language     = {{eng}},
  number       = {{1}},
  pages        = {{14--25}},
  publisher    = {{Association for Computing Machinery (ACM)}},
  series       = {{Operating Systems Review (ACM)}},
  title        = {{Analysis of DAWNBench, a time-to-accuracy machine learning performance benchmark}},
  url          = {{http://dx.doi.org/10.1145/3352020.3352024}},
  doi          = {{10.1145/3352020.3352024}},
  volume       = {{53}},
  year         = {{2019}},
}