100 miljoner ord : Reflektioner kring forskningsarbete med storskaliga dataset som historisk empiri
(2022) In Historisk Tidskrift 142(3). p.320-352
- Abstract
- A hundred million words: Reflections on historical research with large-scale textual datasets as empirical evidence
The research project Welfare State Analytics: Text Mining and Modelling Swedish Politics, Media & Culture, 1945–1989 uses probabilistic methods and text-mining models to study three massive textual datasets from Swedish politics, news media, and literary culture. Topic modelling and distant reading of a dataset of some 3,100 Swedish Government Official Reports have yielded findings that previous historical scholarship has neglected – or rather, could not detect because of the limitations of traditional, small-scale examinations of only a few such reports. This article presents some of the project’s findings, but concentrates on the practical issues of curating large-scale textual datasets, and thus the possibilities – and shortcomings – of digital history research practices.
Large-scale textual datasets, often containing hundreds of millions of words, are a new type of empirical material that presents the historian with fresh challenges. The preparation of datasets is usually a resource-intensive task, where algorithmic machine learning is combined with the manual curation of data, a process that compiles the empirical material into datasets (in different versions).
Plainly, historical empirical material must be compiled into datasets to enable large-scale analyses, and such work can be laborious, as it depends on extensive programming efforts; what may come as a surprise is how complicated the relationship between data and empirical material can be in a digital-historical context, and the fact that preparing datasets is usually an iterative procedure that fundamentally changes the historical sources. In this type of research, compiled empirical material will usually result in several datasets, depending not only on how effectively the available software can curate data and correct errors but also on the specific research questions – given that data can be modelled in many ways. The relationship between empirical material and curated datasets is therefore complex, and highly dependent on both software and research practices.
- Abstract (Swedish)
- In the research project Välfärdsstaten analyserad (1945–89), we work with various kinds of algorithmic text analysis of large-scale empirical material from the political sphere (digitised parliamentary records and Swedish Government Official Reports from the period), the daily press, and fiction. The Swedish post-war era is a well-researched period, but by applying digital methods to curated datasets, the spheres of politics, news media, and culture can be examined anew. The article presents our digital history research, but focuses mainly on experiences of and reflections on the practical craft of working with large-scale empirical material, on the preparation of datasets and data curation, and on the possibilities and shortcomings that such research practices entail.
Please use this url to cite or link to this publication:
https://lup.lub.lu.se/record/abf7c703-689f-4208-b48a-d073a9bccc52
- author
- Snickars, Pelle LU
- organization
- alternative title
- A hundred million words : Reflections on historical research with large-scale textual datasets as empirical evidence
- publishing date
- 2022-09-23
- type
- Contribution to journal
- publication status
- published
- subject
- keywords
- digital historia, datakurering, data curation, machine learning, textual datasets, digital history
- in
- Historisk Tidskrift
- volume
- 142
- issue
- 3
- pages
- 320 - 352
- publisher
- Svenska historiska föreningen
- external identifiers
- scopus:85160959218
- ISSN
- 0345-469X
- project
- Welfare State Analytics. Text Mining and Modeling Swedish Politics, Media & Culture, 1945-1989
- language
- Swedish
- LU publication?
- yes
- id
- abf7c703-689f-4208-b48a-d073a9bccc52
- alternative location
- https://www.historisktidskrift.se/index.php/june20/article/view/524
- date added to LUP
- 2022-09-24 14:04:34
- date last changed
- 2024-02-03 11:20:58
@article{abf7c703-689f-4208-b48a-d073a9bccc52,
  abstract = {{A hundred million words: Reflections on historical research with large-scale textual datasets as empirical evidence

The research project Welfare State Analytics: Text Mining and Modelling Swedish Politics, Media & Culture, 1945–1989 uses probabilistic methods and text-mining models to study three massive textual datasets from Swedish politics, news media, and literary culture. Topic modelling and distant reading of a dataset of some 3,100 Swedish Government Official Reports have yielded findings that previous historical scholarship has neglected – or rather, could not detect because of the limitations of traditional, small-scale examinations of only a few such reports. This article presents some of the project’s findings, but concentrates on the practical issues of curating large-scale textual datasets, and thus the possibilities – and shortcomings – of digital history research practices.

Large-scale textual datasets, often containing hundreds of millions of words, are a new type of empirical material that presents the historian with fresh challenges. The preparation of datasets is usually a resource-intensive task, where algorithmic machine learning is combined with the manual curation of data, a process that compiles the empirical material into datasets (in different versions).

Plainly, historical empirical material must be compiled into datasets to enable large-scale analyses, and such work can be laborious, as it depends on extensive programming efforts; what may come as a surprise is how complicated the relationship between data and empirical material can be in a digital-historical context, and the fact that preparing datasets is usually an iterative procedure that fundamentally changes the historical sources. In this type of research, compiled empirical material will usually result in several datasets, depending not only on how effectively the available software can curate data and correct errors but also on the specific research questions – given that data can be modelled in many ways. The relationship between empirical material and curated datasets is therefore complex, and highly dependent on both software and research practices.}},
  author = {{Snickars, Pelle}},
  issn = {{0345-469X}},
  keywords = {{digital historia; datakurering; data curation; machine learning; textual datasets; digital history}},
  language = {{swe}},
  month = {{09}},
  number = {{3}},
  pages = {{320--352}},
  publisher = {{Svenska historiska föreningen}},
  series = {{Historisk Tidskrift}},
  title = {{100 miljoner ord : Reflektioner kring forskningsarbete med storskaliga dataset som historisk empiri}},
  url = {{https://www.historisktidskrift.se/index.php/june20/article/view/524}},
  volume = {{142}},
  year = {{2022}},
}