Lund University Publications

LUND UNIVERSITY LIBRARIES

Can we trust Web-page metadata?

Ardö, Anders LU (2010) In Journal of Library Metadata 10(1). p.58-74
Abstract
A statistical study of embedded metadata in a sample of more than 4 million HTML Web-pages is reported. The paper tries to determine and quantify the validity of this metadata. Of particular interest is to see if it is trustworthy enough for determining the topic of a Web-page. Datasets are collected by a Web crawler running both as a general and a focused crawler. Metadata fields 'title', 'author', 'keywords', 'description', and 'language' are analyzed in detail together with Dublin Core metadata. The study reveals problems with how metadata is created. Among the 75% of all Web-pages that have interesting metadata, the field 'language' is the most trustworthy. All other metadata fields show a high degree of duplication thus degrading their usefulness. The strict answer to the title question is 'No', however there is a lot of meaningful and useful information, but it must be interpreted and used with care. The study also provides statistics on the usage of metadata today and how it has changed over time.
author
Ardö, Anders
organization
publishing date
2010
type
Contribution to journal
publication status
published
subject
keywords
Dublin Core, metadata analysis, metadata usage, metadata validity, Web metadata
in
Journal of Library Metadata
volume
10
issue
1
pages
58 - 74
publisher
Routledge
external identifiers
  • scopus:78649753454
ISSN
1938-6389
DOI
10.1080/19386380903547008
language
English
LU publication?
yes
id
c42574fd-c196-4be9-ad94-0abf9f9e6fbb (old id 838109)
alternative location
http://www.eit.lth.se/fileadmin/eit/home/hs.aar/Publ/Metadata.pdf
date added to LUP
2016-04-01 10:46:54
date last changed
2022-02-25 05:39:55
@article{c42574fd-c196-4be9-ad94-0abf9f9e6fbb,
  abstract     = {{A statistical study of embedded metadata in a sample of
 more than 4 million HTML Web-pages is reported. The paper tries to
 determine and quantify the validity of this metadata. Of particular
 interest is to see if it is trustworthy enough for determining the
 topic of a Web-page. Datasets are collected by a Web crawler running
 both as a general and a focused crawler. Metadata fields 'title',
 'author', 'keywords', 'description', and 'language' are analyzed in
 detail together with Dublin Core metadata. The study reveals
 problems with how metadata is created. Among the 75 \% of all
 Web-pages that have interesting metadata, the field 'language' is
 the most trustworthy. All other metadata fields show a high degree
 of duplication thus degrading their usefulness. The strict answer to
 the title question is 'No', however there is a lot of meaningful and
 useful information, but it must be interpreted and used with
 care. The study also provides statistics on the usage of metadata
 today and how it has changed over time.}},
  author       = {{Ardö, Anders}},
  issn         = {{1938-6389}},
  keywords     = {{Dublin Core; metadata analysis; metadata usage; metadata validity; Web metadata}},
  language     = {{eng}},
  number       = {{1}},
  pages        = {{58--74}},
  publisher    = {{Routledge}},
  journal      = {{Journal of Library Metadata}},
  title        = {{Can we trust Web-page metadata?}},
  url          = {{http://dx.doi.org/10.1080/19386380903547008}},
  doi          = {{10.1080/19386380903547008}},
  volume       = {{10}},
  year         = {{2010}},
}