Can we trust Web-page metadata?
(2010) In Journal of Library Metadata 10(1). p.58-74
- Abstract
- A statistical study of embedded metadata in a sample of
more than 4 million HTML Web-pages is reported. The paper tries to
determine and quantify the validity of this metadata. Of particular
interest is to see if it is trustworthy enough for determining the
topic of a Web-page. Datasets are collected by a Web crawler running
both as a general and a focused crawler. Metadata fields 'title',
'author', 'keywords', 'description', and 'language' are analyzed in
detail together with Dublin Core metadata. The study reveals
problems with how metadata is created. Among the 75% of all
Web-pages that have interesting metadata, the field 'language' is
the most trustworthy. All other metadata fields show a high degree
of duplication thus degrading their usefulness. The strict answer to
the title question is 'No', however there is a lot of meaningful and
useful information, but it must be interpreted and used with
care. The study also provides statistics on the usage of metadata
today and how it has changed over time.
Please use this url to cite or link to this publication:
https://lup.lub.lu.se/record/838109
- author
- Ardö, Anders LU
- organization
- publishing date
- 2010
- type
- Contribution to journal
- publication status
- published
- subject
- keywords
- Dublin Core, metadata analysis, metadata usage, metadata validity, Web metadata
- in
- Journal of Library Metadata
- volume
- 10
- issue
- 1
- pages
- 58 - 74
- publisher
- Routledge
- external identifiers
- scopus:78649753454
- ISSN
- 1938-6389
- DOI
- 10.1080/19386380903547008
- language
- English
- LU publication?
- yes
- id
- c42574fd-c196-4be9-ad94-0abf9f9e6fbb (old id 838109)
- alternative location
- http://www.eit.lth.se/fileadmin/eit/home/hs.aar/Publ/Metadata.pdf
- date added to LUP
- 2016-04-01 10:46:54
- date last changed
- 2022-02-25 05:39:55
@article{c42574fd-c196-4be9-ad94-0abf9f9e6fbb,
  abstract = {{A statistical study of embedded metadata in a sample of more than 4 million HTML Web-pages is reported. The paper tries to determine and quantify the validity of this metadata. Of particular interest is to see if it is trustworthy enough for determining the topic of a Web-page. Datasets are collected by a Web crawler running both as a general and a focused crawler. Metadata fields 'title', 'author', 'keywords', 'description', and 'language' are analyzed in detail together with Dublin Core metadata. The study reveals problems with how metadata is created. Among the 75 \% of all Web-pages that have interesting metadata, the field 'language' is the most trustworthy. All other metadata fields show a high degree of duplication thus degrading their usefulness. The strict answer to the title question is 'No', however there is a lot of meaningful and useful information, but it must be interpreted and used with care. The study also provides statistics on the usage of metadata today and how it has changed over time.}},
  author = {{Ardö, Anders}},
  issn = {{1938-6389}},
  journal = {{Journal of Library Metadata}},
  keywords = {{Dublin Core; metadata analysis; metadata usage; metadata validity; Web metadata}},
  language = {{eng}},
  number = {{1}},
  pages = {{58--74}},
  publisher = {{Routledge}},
  title = {{Can we trust Web-page metadata?}},
  url = {{http://dx.doi.org/10.1080/19386380903547008}},
  doi = {{10.1080/19386380903547008}},
  volume = {{10}},
  year = {{2010}},
}