100 miljoner ord : Reflektioner kring forskningsarbete med storskaliga dataset som historisk empiri

Snickars, Pelle

(2022) In Historisk Tidskrift 142(3), pp. 320–352

Abstract: A hundred million words: Reflections on historical research with large-scale textual datasets as empirical evidence

The research project Welfare State Analytics: Text Mining and Modelling Swedish Politics, Media & Culture, 1945–1989 uses probabilistic methods and text-mining models to study three massive textual datasets from Swedish politics, news media, and literary culture. Topic modelling and distant reading of a dataset of some 3,100 Swedish Government Official Reports have yielded findings that previous historical scholarship has neglected – or rather, could not detect because of the limitations of traditional, small-scale examinations of only a few such reports. This article presents some of the project’s findings, but concentrates on the practical issues of curating large-scale textual datasets, and thus the possibilities – and shortcomings – of digital history research practices.

Large-scale textual datasets, often containing hundreds of millions of words, are a new type of empirical material that presents the historian with fresh challenges. The preparation of datasets is usually a resource-intensive task, where algorithmic machine learning is combined with the manual curation of data, a process that compiles the empirical material into datasets (in different versions).

Plainly, historical empirical material must be compiled into datasets to enable large-scale analyses, and such work can be laborious, as it depends on extensive programming efforts. What may come as a surprise is how complicated the relationship between data and empirical material can be in a digital-historical context, and that preparing datasets is usually an iterative procedure that fundamentally changes the historical sources. In this type of research, compiled empirical material will usually result in several datasets, depending not only on how effectively the available software can curate data and correct errors but also on the specific research questions, given that data can be modelled in many ways. The relationship between empirical material and curated datasets is therefore complex, and highly dependent on both software and research practices.

Abstract (translated from Swedish): In the research project Välfärdsstaten analyserad (Welfare State Analytics, 1945–89), we work with various types of algorithmic text analysis of large-scale empirical material from the sphere of politics (digitized parliamentary records and Swedish Government Official Reports from the period), the daily press, and fiction. The Swedish postwar period is well researched, but by applying digital methods to curated datasets, the spheres of politics, news media, and culture can be examined anew. The article presents our digital historical research, but the main focus is on experiences of and reflections on the practical craft of working with large-scale empirical material, on the preparation of datasets and data curation, and on the possibilities and shortcomings that such research practices entail.

Please use this URL to cite or link to this publication: https://lup.lub.lu.se/record/abf7c703-689f-4208-b48a-d073a9bccc52

author

Snickars, Pelle (LU)

organization

Division of ALM, Digital Cultures and Publishing Studies

alternative title

A hundred million words : Reflections on historical research with large-scale textual datasets as empirical evidence

publishing date

2022-09-23

type

Contribution to journal

publication status

published

subject

keywords

digital historia, datakurering, data curation, machine learning, textual datasets, digital history

in

Historisk Tidskrift

volume

142

issue

3

pages

320 - 352

publisher

Svenska Historiska Föreningen

external identifiers

scopus:85160959218

ISSN

0345-469X

project

Welfare State Analytics. Text Mining and Modeling Swedish Politics, Media & Culture, 1945-1989

language

Swedish

LU publication?

yes

id

abf7c703-689f-4208-b48a-d073a9bccc52

alternative location

https://www.historisktidskrift.se/index.php/june20/article/view/524

date added to LUP

2022-09-24 14:04:34

date last changed

2025-10-14 10:55:09

@article{abf7c703-689f-4208-b48a-d073a9bccc52,
  abstract     = {{A hundred million words: Reflections on historical research with large-scale textual datasets as empirical evidence

The research project Welfare State Analytics: Text Mining and Modelling Swedish Politics, Media \& Culture, 1945–1989 uses probabilistic methods and text-mining models to study three massive textual datasets from Swedish politics, news media, and literary culture. Topic modelling and distant reading of a dataset of some 3,100 Swedish Government Official Reports have yielded findings that previous historical scholarship has neglected – or rather, could not detect because of the limitations of traditional, small-scale examinations of only a few such reports. This article presents some of the project’s findings, but concentrates on the practical issues of curating large-scale textual datasets, and thus the possibilities – and shortcomings – of digital history research practices.

Large-scale textual datasets, often containing hundreds of millions of words, are a new type of empirical material that presents the historian with fresh challenges. The preparation of datasets is usually a resource-intensive task, where algorithmic machine learning is combined with the manual curation of data, a process that compiles the empirical material into datasets (in different versions).

Plainly, historical empirical material must be compiled into datasets to enable large-scale analyses, and such work can be laborious, as it depends on extensive programming efforts. What may come as a surprise is how complicated the relationship between data and empirical material can be in a digital-historical context, and that preparing datasets is usually an iterative procedure that fundamentally changes the historical sources. In this type of research, compiled empirical material will usually result in several datasets, depending not only on how effectively the available software can curate data and correct errors but also on the specific research questions, given that data can be modelled in many ways. The relationship between empirical material and curated datasets is therefore complex, and highly dependent on both software and research practices.}},
  author       = {{Snickars, Pelle}},
  issn         = {{0345-469X}},
  keywords     = {{digital historia; datakurering; data curation; machine learning; textual datasets; digital history}},
  language     = {{swe}},
  month        = {{09}},
  number       = {{3}},
  pages        = {{320--352}},
  publisher    = {{Svenska Historiska Föreningen}},
  journal      = {{Historisk Tidskrift}},
  title        = {{100 miljoner ord : Reflektioner kring forskningsarbete med storskaliga dataset som historisk empiri}},
  url          = {{https://www.historisktidskrift.se/index.php/june20/article/view/524}},
  volume       = {{142}},
  year         = {{2022}},
}

Lund University Publications
