
Lund University Publications


Network Parameterisation and Activation Functions in Deep Learning

Trimmel, Martin LU (2023) In Doctoral Theses in Mathematical Sciences
Abstract
Deep learning, the study of multi-layered artificial neural networks, has received tremendous attention over the course of the last few years. Neural networks are now able to outperform humans in a growing variety of tasks and increasingly have an impact on our day-to-day lives. There is a wide range of potential directions to advance deep learning, two of which we investigate in this thesis:

(1) A network's activation functions are among its key components and have a big impact on the overall mathematical form of the network. The first paper studies generalisation of neural networks with rectified linear activation units (“ReLUs”). Such networks partition the input space into so-called linear regions, which are the maximally connected subsets on which the network is affine. In contrast to previous work, which focused on estimating the number of linear regions, we propose a tropical algebra-based algorithm called TropEx to extract the coefficients of the linear regions. Applied to fully-connected and convolutional neural networks, TropEx reveals significant differences between the linear regions of these network types. The second paper proposes a parametric rational activation function called ERA, which is learnable during network training. Although ERA only adds about ten parameters per layer, it significantly increases network expressivity and enables small architectures to perform nearly as well as large ones. ERA outperforms previous activations when used in small architectures. This is relevant because neural networks keep growing larger, and the computational resources they require drive up costs and electricity usage (which in turn increases the CO2 footprint).
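As a minimal illustration of the linear-region concept (my sketch, not the TropEx algorithm itself), the following PyTorch snippet reads off the affine map z -> A z + b that a small ReLU network realises on the linear region containing a given input x: since the network is piecewise affine, the Jacobian at x gives the linear coefficients A, and b = f(x) - A x gives the offset. The architecture and tolerances are arbitrary choices for the example.

import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy fully-connected ReLU network (hypothetical example, not from the thesis).
net = nn.Sequential(
    nn.Linear(4, 16), nn.ReLU(),
    nn.Linear(16, 16), nn.ReLU(),
    nn.Linear(16, 3),
)

x = torch.randn(4)

# On the linear region containing x, the network equals the affine map z -> A z + b.
A = torch.autograd.functional.jacobian(net, x)  # linear coefficients, shape (3, 4)
b = net(x) - A @ x                              # offset of the affine piece

# A nearby point typically lies in the same region, where the affine map is exact.
x_near = x + 1e-4 * torch.randn(4)
print(torch.allclose(net(x_near), A @ x_near + b, atol=1e-5))  # usually True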

(2) For a given network architecture, each parameter configuration gives rise to a mathematical function. This functional realisation is far from unique, and many different parameterisations can give rise to the same function. Changes to the parameterisation that do not change the function are called symmetries. The third paper theoretically studies and classifies all the symmetries of 2-layer networks with ReLU activations. Finally, the fourth paper studies the effect of network parameterisation on network training. We provide a theoretical analysis of the effect that scaling layers has on the gradient updates. This motivates us to propose a Cooling method, which automatically scales the network parameters during training. Cooling reduces the network's reliance on training tricks, in particular the use of a learning rate schedule.
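As context for the symmetry classification, the sketch below checks two well-known symmetries of a 2-layer ReLU network, permuting hidden units and positively rescaling them (using ReLU(c z) = c ReLU(z) for c > 0), and verifies that the realised function is unchanged. This is a hand-written illustration under assumed PyTorch tooling, not the paper's full classification.

import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(5, 8), nn.ReLU(), nn.Linear(8, 2))
W1, b1 = net[0].weight, net[0].bias
W2 = net[2].weight

x = torch.randn(32, 5)
y = net(x)  # outputs before re-parameterising

with torch.no_grad():
    # Positive scaling symmetry: scale a hidden unit's incoming weights by c > 0
    # and its outgoing weights by 1/c; ReLU's positive homogeneity makes it cancel.
    c = torch.rand(8) + 0.5
    W1.mul_(c[:, None]); b1.mul_(c)
    W2.div_(c[None, :])

    # Permutation symmetry: reorder hidden units consistently in both layers.
    perm = torch.randperm(8)
    W1.copy_(W1[perm]); b1.copy_(b1[perm])
    W2.copy_(W2[:, perm])

print(torch.allclose(net(x), y, atol=1e-5))  # True: same function, new parameters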
Please use this URL to cite or link to this publication:
author
supervisor
opponent
  • Prof. Montúfar, Guido, UCLA, USA.
organization
publishing date
type
Thesis
publication status
published
subject
keywords
deep learning, linear region, network parameterisation, activation function, network calibration, conformal prediction, tropical algebra, rational function, temperature scaling, network symmetries
in
Doctoral Theses in Mathematical Sciences
pages
220 pages
publisher
Lunds Universitet, Centre for Mathematical Sciences
defense location
Lecture Hall Hörmander, Centre for Mathematical Sciences, Sölvegatan 18 A, Faculty of Engineering LTH, Lund University, Lund. The dissertation will be live streamed, but part of the premises is to be excluded from the live stream.
defense date
2023-05-16 15:15:00
ISSN
1404-0034
ISBN
978-91-8039-573-1
978-91-8039-572-4
language
English
LU publication?
yes
id
31f59552-5e27-488c-b8e3-70c3680dd81b
date added to LUP
2023-04-08 14:12:34
date last changed
2024-02-13 11:35:29
@phdthesis{31f59552-5e27-488c-b8e3-70c3680dd81b,
  abstract     = {{Deep learning, the study of multi-layered artificial neural networks, has received tremendous attention over the course of the last few years. Neural networks are now able to outperform humans in a growing variety of tasks and increasingly have an impact on our day-to-day lives. There is a wide range of potential directions to advance deep learning, two of which we investigate in this thesis:<br/><br/>(1) One of the key components of a network is its activation functions. The activations have a big impact on the overall mathematical form of the network. The \textit{first paper} studies generalisation of neural networks with rectified linear activation units (“ReLUs”). Such networks partition the input space into so-called linear regions, which are the maximally connected subsets on which the network is affine. In contrast to previous work, which focused on obtaining estimates of the number of linear regions, we proposed a tropical algebra-based algorithm called TropEx to extract coefficients of the linear regions. Applied to fully-connected and convolutional neural networks, TropEx shows significant differences between the linear regions of these network types. The \textit{second paper} proposes a parametric rational activation function called ERA, which is learnable during network training. Although ERA only adds about ten parameters per layer, the activation significantly increases network expressivity and makes small architectures have a performance close to large ones. ERA outperforms previous activations when used in small architectures. This is relevant because neural networks keep growing larger and larger and the computational resources they require result in greater costs and electricity usage (which in turn increases the CO2 footprint).<br/><br/>(2) For a given network architecture, each parameter configuration gives rise to a mathematical function. This functional realisation is far from unique and many different parameterisations can give rise to the same function. Changes to the parameterisation that do not change the function are called symmetries. The \textit{third paper} theoretically studies and classifies all the symmetries of 2-layer networks using the ReLU activation. Finally, the \textit{fourth paper} studies the effect of network parameterisation on network training. We provide a theoretical analysis of the effect that scaling layers have on the gradient updates. This provides a motivation for us to propose a Cooling method, which automatically scales the network parameters during training. Cooling reduces the reliance of the network on specific tricks, in particular the use of a learning rate schedule.<br/>}},
  author       = {{Trimmel, Martin}},
  isbn         = {{978-91-8039-573-1}},
  issn         = {{1404-0034}},
  keywords     = {{deep learning; linear region; network parameterisation; activation function; network calibration; conformal prediction; tropical algebra; rational function; temperature scaling; network symmetries}},
  language     = {{eng}},
  month        = {{05}},
  publisher    = {{Lunds Universitet, Centre for Mathematical Sciences}},
  school       = {{Lund University}},
  series       = {{Doctoral Theses in Mathematical Sciences}},
  title        = {{Network Parameterisation and Activation Functions in Deep Learning}},
  url          = {{https://lup.lub.lu.se/search/files/144514005/kappa_Martin.pdf}},
  year         = {{2023}},
}