Lund University Publications

Connecting firm's web scraped textual content to body of science: Utilizing Microsoft Academic Graph hierarchical topic modeling

Hajikhani, Arash; Pukelis, Lukas; Suominen, Arho; Ashouri, Sajad; Schubert, Torben (LU); Notten, Ad and Cunningham, Scott W. (2022). In MethodsX 9.
Abstract

This paper demonstrates a method to transform and link textual information scraped from companies' websites to the scientific body of knowledge. The method illustrates the benefit of Natural Language Processing (NLP) in creating links between established economic classification systems with novel and agile constructs that new data sources enable. Therefore, we experimented on the European classification of economic activities (known as NACE) on sectoral and company levels. We established a connection with Microsoft Academic Graph hierarchical topic modeling based on companies' website content. Central to the operationalization of our method are a web scraping process, NLP and a data transformation/linkage procedure. The method contains three main steps: data source identification, raw data retrieval, and data preparation and transformation. These steps are applied to two distinct data sources.

type
Contribution to journal
publication status
published
keywords
A method for creating a linkage between web scraped company's website content to scientific literature topical structure, Economic classification scheme, Knowledge transformation, Natural language processing, Web scraping
in
MethodsX
volume
9
article number
101650
publisher
Elsevier
external identifiers
  • scopus:85125892314
  • pmid:35284247
ISSN
2215-0161
DOI
10.1016/j.mex.2022.101650
language
English
LU publication?
yes
additional info
Funding Information: This project has received funding from the European Union's Horizon 2020 research and innovation program under grant agreement No 870822. Publisher Copyright: © 2022 The Authors
id
17f70ca1-ff4f-4e4a-b571-66daef21bfa5
date added to LUP
2022-03-17 07:26:13
date last changed
2024-06-13 11:29:43
@article{17f70ca1-ff4f-4e4a-b571-66daef21bfa5,
  abstract     = {{This paper demonstrates a method to transform and link textual information scraped from companies' websites to the scientific body of knowledge. The method illustrates the benefit of Natural Language Processing (NLP) in creating links between established economic classification systems with novel and agile constructs that new data sources enable. Therefore, we experimented on the European classification of economic activities (known as NACE) on sectoral and company levels. We established a connection with Microsoft Academic Graph hierarchical topic modeling based on companies' website content. Central to the operationalization of our method are a web scraping process, NLP and a data transformation/linkage procedure. The method contains three main steps: data source identification, raw data retrieval, and data preparation and transformation. These steps are applied to two distinct data sources.}},
  author       = {{Hajikhani, Arash and Pukelis, Lukas and Suominen, Arho and Ashouri, Sajad and Schubert, Torben and Notten, Ad and Cunningham, Scott W.}},
  issn         = {{2215-0161}},
  keywords     = {{A method for creating a linkage between web scraped company's website content to scientific literature topical structure; Economic classification scheme; Knowledge transformation; Natural language processing; Web scraping}},
  language     = {{eng}},
  publisher    = {{Elsevier}},
  series       = {{MethodsX}},
  title        = {{Connecting firm's web scraped textual content to body of science: Utilizing Microsoft Academic Graph hierarchical topic modeling}},
  url          = {{http://dx.doi.org/10.1016/j.mex.2022.101650}},
  doi          = {{10.1016/j.mex.2022.101650}},
  volume       = {{9}},
  year         = {{2022}},
}