
Lund University Publications


Hawk : An industrial-strength multi-label document classifier

Javeed, Arshad (2024) In Natural Language Processing Journal 9.
Abstract

There is a plethora of methods for solving the classical multi-label document classification problem. However, when it comes to deployment and use in an industry setting, most, if not all, contemporary approaches fail to address some vital requirements of an ideal solution: i) the ability to operate on variable-length texts or rambling documents, ii) the catastrophic forgetting problem, and iii) the ability to visualize the model's predictions. The paper describes the significance of these problems in detail and adopts the hydranet architecture to address them. The proposed architecture views documents as sequences of sentences and leverages sentence-level embeddings for input representation, turning the problem into a sequence classification task. Furthermore, two specific architectures are explored for the heads: Bi-LSTM heads and transformer heads. The proposed architecture is benchmarked on popular datasets such as Web of Science - 5763, Web of Science - 11967, BBC Sports, and BBC News. The experimental results reveal that the proposed model performs at least as well as previous SOTA architectures and even outperforms prior SOTA in a few cases, while also addressing the practicality issues discussed. The ablation study compares the impact of the attention mechanism and of the weighted loss functions used to train the task-specific heads in the hydranet. The claims regarding catastrophic forgetting are further corroborated by empirical evaluations under incremental learning scenarios, whose results demonstrate the robustness of the proposed architecture compared to other benchmarks.
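The record contains no code; the sketch below is a minimal, hypothetical PyTorch illustration of the hydranet-style setup the abstract describes: documents represented as sequences of sentence embeddings, a shared backbone, and independent per-task heads trained with a weighted multi-label loss. All names, layer sizes, the single Bi-LSTM backbone, and the mean-pooling step are assumptions made for illustration, not the paper's implementation.

import torch
import torch.nn as nn

class HydranetStyleClassifier(nn.Module):
    # Illustrative multi-head ("hydranet"-style) document classifier.
    # Each document arrives as a sequence of pre-computed sentence embeddings;
    # a shared Bi-LSTM backbone encodes the sequence, and one independent
    # linear head per task emits multi-label logits.
    def __init__(self, emb_dim, hidden_dim, labels_per_task):
        super().__init__()
        self.backbone = nn.LSTM(emb_dim, hidden_dim,
                                batch_first=True, bidirectional=True)
        self.heads = nn.ModuleDict({
            task: nn.Linear(2 * hidden_dim, n_labels)
            for task, n_labels in labels_per_task.items()
        })

    def forward(self, sent_embs):
        # sent_embs: (batch, num_sentences, emb_dim)
        encoded, _ = self.backbone(sent_embs)
        pooled = encoded.mean(dim=1)  # mean pooling over sentences (an assumption)
        # Return raw logits per head; apply a sigmoid at inference time for
        # multi-label probabilities.
        return {task: head(pooled) for task, head in self.heads.items()}

# Hypothetical usage: two task heads, weighted multi-label loss on one head.
model = HydranetStyleClassifier(emb_dim=384, hidden_dim=128,
                                labels_per_task={"topics": 5, "domains": 11})
docs = torch.randn(4, 20, 384)                  # 4 documents, 20 sentences each
logits = model(docs)["topics"]                  # (4, 5) logits
targets = torch.randint(0, 2, (4, 5)).float()   # dummy multi-label targets
pos_weight = torch.full((5,), 2.0)              # hypothetical class weights
loss = nn.BCEWithLogitsLoss(pos_weight=pos_weight)(logits, targets)
loss.backward()

Because each head is a separate module, a new label set can be attached as an extra head without revisiting the parameters of the existing heads, which matches the incremental-learning motivation given in the abstract.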
Please use this url to cite or link to this publication:
author: Javeed, Arshad
publishing date: 2024
type: Contribution to journal
publication status: published
subject:
keywords: Document classification, Hydranet, Machine learning, Natural language processing
in: Natural Language Processing Journal
volume: 9
article number: 100115
publisher: Elsevier
external identifiers: scopus:105022129039
ISSN: 2949-7191
DOI: 10.1016/j.nlp.2024.100115
language: English
LU publication?: no
id: bd8d3b84-684e-4de1-b1d0-f00287cbaeac
date added to LUP: 2026-02-10 09:39:11
date last changed: 2026-02-12 10:08:42
@article{bd8d3b84-684e-4de1-b1d0-f00287cbaeac,
  abstract     = {{There is a plethora of methods for solving the classical multi-label document classification problem. However, when it comes to deployment and use in an industry setting, most, if not all, contemporary approaches fail to address some vital requirements of an ideal solution: i) the ability to operate on variable-length texts or rambling documents, ii) the catastrophic forgetting problem, and iii) the ability to visualize the model's predictions. The paper describes the significance of these problems in detail and adopts the hydranet architecture to address them. The proposed architecture views documents as sequences of sentences and leverages sentence-level embeddings for input representation, turning the problem into a sequence classification task. Furthermore, two specific architectures are explored for the heads: Bi-LSTM heads and transformer heads. The proposed architecture is benchmarked on popular datasets such as Web of Science - 5763, Web of Science - 11967, BBC Sports, and BBC News. The experimental results reveal that the proposed model performs at least as well as previous SOTA architectures and even outperforms prior SOTA in a few cases, while also addressing the practicality issues discussed. The ablation study compares the impact of the attention mechanism and of the weighted loss functions used to train the task-specific heads in the hydranet. The claims regarding catastrophic forgetting are further corroborated by empirical evaluations under incremental learning scenarios, whose results demonstrate the robustness of the proposed architecture compared to other benchmarks.}},
  author       = {{Javeed, Arshad}},
  issn         = {{2949-7191}},
  keywords     = {{Document classification; Hydranet; Machine learning; Natural language processing}},
  language     = {{eng}},
  publisher    = {{Elsevier}},
  series       = {{Natural Language Processing Journal}},
  title        = {{Hawk : An industrial-strength multi-label document classifier}},
  url          = {{http://dx.doi.org/10.1016/j.nlp.2024.100115}},
  doi          = {{10.1016/j.nlp.2024.100115}},
  volume       = {{9}},
  year         = {{2024}},
}