Lund University Publications

100 miljoner ord : Reflektioner kring forskningsarbete med storskaliga dataset som historisk empiri

Snickars, Pelle LU (2022) In Historisk Tidskrift 142(3). p.320-352
Abstract
A hundred million words: Reflections on historical research with large-scale textual datasets as empirical evidence

The research project Welfare State Analytics: Text Mining and Modelling Swedish Politics, Media & Culture, 1945–1989 uses probabilistic methods and text-mining models to study three massive textual datasets from Swedish politics, news media, and literary culture. By topic modelling and distant reading a dataset from some 3,100 Swedish Government Official Reports, findings have been made which previous historical scholarship has neglected – or rather, cannot detect because of the limitations of traditional, small-scale examinations of only a few such reports. This article presents some of the project’s findings, but concentrates on the practical issues of curating large-scale textual datasets, and thus the possibilities – and shortcomings – of digital history research practices.

Large-scale textual datasets, often containing hundreds of millions of words, are a new type of empirical material that presents the historian with fresh challenges. The preparation of datasets is usually a resource-intensive task, where algorithmic machine learning is combined with the manual curation of data, a process that compiles the empirical material into datasets (in different versions).

Plainly, historical empirical material must be compiled into datasets to enable large-scale analyses, and such work can be laborious, as it depends on extensive programming efforts; what may come as a surprise is how complicated the relationship between data and empirical material can be in a digital-historical context, and the fact that preparing datasets is usually an iterative procedure that fundamentally changes the historical sources. In this type of research, compiled empirical material will usually result in several datasets, depending not only on how effective the available software is to curate and correct errors but also the specific research questions – given that data can be modelled in many ways. The relationship between empirical material and curated datasets is therefore complex, and highly dependent on both software and research practices.
Abstract (translated from Swedish)
In the research project Välfärdsstaten analyserad (1945–89), we work with various kinds of algorithmic text analysis of large-scale empirical material from the political sphere (digitized parliamentary records and Swedish Government Official Reports from the period), the daily press, and fiction. Post-war Sweden is a well-researched period, but by applying digital methods to curated datasets, the spheres of politics, news media, and culture can be examined anew. The article presents our digital history research, but the focus is mainly on experiences of and reflections on the practical craft of working with large-scale empirical material, on preparing datasets and curating data, and on the possibilities and shortcomings that such research practices entail.
alternative title
A hundred million words : Reflections on historical research with large-scale textual datasets as empirical evidence
type
Contribution to journal
publication status
published
keywords
digital historia, datakurering, data curation, machine learning, textual datasets, digital history
in
Historisk Tidskrift
volume
142
issue
3
pages
320 - 352
publisher
Svenska historiska föreningen
external identifiers
  • scopus:85160959218
ISSN
0345-469X
project
Welfare State Analytics. Text Mining and Modeling Swedish Politics, Media & Culture, 1945-1989
language
Swedish
LU publication?
yes
id
abf7c703-689f-4208-b48a-d073a9bccc52
alternative location
https://www.historisktidskrift.se/index.php/june20/article/view/524
date added to LUP
2022-09-24 14:04:34
date last changed
2024-02-03 11:20:58
@article{abf7c703-689f-4208-b48a-d073a9bccc52,
  abstract     = {{A hundred million words: Reflections on historical research with large-scale textual datasets as empirical evidence

The research project Welfare State Analytics: Text Mining and Modelling Swedish Politics, Media \& Culture, 1945–1989 uses probabilistic methods and text-mining models to study three massive textual datasets from Swedish politics, news media, and literary culture. By topic modelling and distant reading a dataset from some 3,100 Swedish Government Official Reports, findings have been made which previous historical scholarship has neglected – or rather, cannot detect because of the limitations of traditional, small-scale examinations of only a few such reports. This article presents some of the project’s findings, but concentrates on the practical issues of curating large-scale textual datasets, and thus the possibilities – and shortcomings – of digital history research practices.

Large-scale textual datasets, often containing hundreds of millions of words, are a new type of empirical material that presents the historian with fresh challenges. The preparation of datasets is usually a resource-intensive task, where algorithmic machine learning is combined with the manual curation of data, a process that compiles the empirical material into datasets (in different versions).

Plainly, historical empirical material must be compiled into datasets to enable large-scale analyses, and such work can be laborious, as it depends on extensive programming efforts; what may come as a surprise is how complicated the relationship between data and empirical material can be in a digital-historical context, and the fact that preparing datasets is usually an iterative procedure that fundamentally changes the historical sources. In this type of research, compiled empirical material will usually result in several datasets, depending not only on how effective the available software is to curate and correct errors but also the specific research questions – given that data can be modelled in many ways. The relationship between empirical material and curated datasets is therefore complex, and highly dependent on both software and research practices.}},
  author       = {{Snickars, Pelle}},
  issn         = {{0345-469X}},
  keywords     = {{digital historia; datakurering; data curation; machine learning; textual datasets; digital history}},
  language     = {{swe}},
  month        = {{09}},
  number       = {{3}},
  pages        = {{320--352}},
  publisher    = {{Svenska historiska föreningen}},
  series       = {{Historisk Tidskrift}},
  title        = {{100 miljoner ord : Reflektioner kring forskningsarbete med storskaliga dataset som historisk empiri}},
  url          = {{https://www.historisktidskrift.se/index.php/june20/article/view/524}},
  volume       = {{142}},
  year         = {{2022}},
}