A geolocated dataset of German news articles

Kriesch, Lukas; Losacker, Sebastian

Kriesch, Lukas and Losacker, Sebastian (2025) In Scientific Data 12(1).

Abstract: The emergence of large language models and the exponential growth of digitized text data have revolutionized research methodologies across a broad range of social sciences. News data is crucial for the social sciences as it provides real-time insights into public discourse and societal trends. In this paper, we provide insights into how news articles can be geolocated and how the texts can then be further analyzed. We collect data from the CommonCrawl News dataset and clean the text data. We then use a named-entity recognition model for geocoding. Finally, we transform the news articles into text embeddings using SBERT, enabling semantic searches within the news data corpus. In the paper, we apply this process to all German news articles and make the German location data, as well as the embeddings, available for download. We compile a dataset containing text embeddings for about 50 million German news articles, of which about 70% include geographic locations. The process can be replicated for news data from other countries.
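The semantic search that the abstract describes can be illustrated with a minimal sketch. It assumes each article is available as a record holding an identifier, a list of geocoded place names, and an embedding vector. The field names (`id`, `locations`, `embedding`), the helper functions, and the three-dimensional toy vectors are hypothetical and chosen for illustration only; they are not the published dataset's schema, and real SBERT embeddings have several hundred dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction, so treat it as dissimilar
    return dot / (norm_a * norm_b)


def semantic_search(query_vec, articles, top_k=2, location=None):
    """Rank articles by similarity to query_vec.

    If location is given, only articles tagged with that place are considered.
    """
    candidates = [
        a for a in articles
        if location is None or location in a["locations"]
    ]
    ranked = sorted(
        candidates,
        key=lambda a: cosine_similarity(query_vec, a["embedding"]),
        reverse=True,
    )
    return [a["id"] for a in ranked[:top_k]]


# Toy records (hypothetical): not taken from the published dataset.
ARTICLES = [
    {"id": "a1", "locations": ["Berlin"], "embedding": [0.9, 0.1, 0.0]},
    {"id": "a2", "locations": ["Hamburg"], "embedding": [0.8, 0.2, 0.1]},
    {"id": "a3", "locations": [], "embedding": [0.0, 0.1, 0.9]},
    {"id": "a4", "locations": ["Berlin", "Potsdam"], "embedding": [0.1, 0.9, 0.2]},
]

# Stand-in for the embedding of a search phrase.
QUERY = [1.0, 0.0, 0.0]

print(semantic_search(QUERY, ARTICLES))
print(semantic_search(QUERY, ARTICLES, location="Berlin"))
```

In practice the query vector would come from the same embedding model as the article vectors, and a corpus of about 50 million articles would call for an approximate nearest-neighbour index rather than the exhaustive scan shown here.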

Please use this url to cite or link to this publication: https://lup.lub.lu.se/record/b583609a-2c14-4f82-a554-529f4199d062

author

Kriesch, Lukas and Losacker, Sebastian

organization

CIRCLE

publishing date

2025-12

type

Contribution to journal

publication status

published

subject

Other Engineering and Technologies

in

Scientific Data

volume

12

issue

1

article number

1128

publisher

Nature Publishing Group

external identifiers

pmid:40603360
scopus:105010049735

ISSN

2052-4463

DOI

10.1038/s41597-025-05422-w

language

English

LU publication?

yes

id

b583609a-2c14-4f82-a554-529f4199d062

date added to LUP

2025-10-27 12:00:53

date last changed

2026-01-19 19:41:48

@article{b583609a-2c14-4f82-a554-529f4199d062,
  abstract     = {{The emergence of large language models and the exponential growth of digitized text data have revolutionized research methodologies across a broad range of social sciences. News data is crucial for the social sciences as it provides real-time insights into public discourse and societal trends. In this paper, we provide insights into how news articles can be geolocated and how the texts can then be further analyzed. We collect data from the CommonCrawl News dataset and clean the text data. We then use a named-entity recognition model for geocoding. Finally, we transform the news articles into text embeddings using SBERT, enabling semantic searches within the news data corpus. In the paper, we apply this process to all German news articles and make the German location data, as well as the embeddings, available for download. We compile a dataset containing text embeddings for about 50 million German news articles, of which about 70% include geographic locations. The process can be replicated for news data from other countries.}},
  author       = {{Kriesch, Lukas and Losacker, Sebastian}},
  issn         = {{2052-4463}},
  language     = {{eng}},
  number       = {{1}},
  publisher    = {{Nature Publishing Group}},
  journal      = {{Scientific Data}},
  title        = {{A geolocated dataset of German news articles}},
  url          = {{http://dx.doi.org/10.1038/s41597-025-05422-w}},
  doi          = {{10.1038/s41597-025-05422-w}},
  volume       = {{12}},
  year         = {{2025}},
}

Lund University Publications
