Crouching TIGER, hidden structure : Exploring the nature of linguistic data using TIGER values

Syrjänen, Kaj; Maurits, Luke; Leino, Unni; Honkola, Terhi; Rota, Jadranka; Vesakoski, Outi

Syrjänen, Kaj; Maurits, Luke; Leino, Unni; Honkola, Terhi; Rota, Jadranka (LU) and Vesakoski, Outi (2021) In Journal of Language Evolution 6(2). p.99-118

Abstract: In recent years, techniques such as Bayesian inference of phylogeny have become a standard part of the quantitative linguistic toolkit. While these tools successfully model the tree-like component of a linguistic dataset, real-world datasets generally include a combination of tree-like and nontree-like signals. Alongside developing techniques for modeling nontree-like data, an important requirement for future quantitative work is to build a principled understanding of this structural complexity of linguistic datasets. Some techniques exist for exploring the general structure of a linguistic dataset, such as NeighborNets, δ scores, and Q-residuals; however, these methods are not without limitations or drawbacks. In general, the question of what kinds of historical structure a linguistic dataset can contain and how these might be detected or measured remains critically underexplored from an objective, quantitative perspective. In this article, we propose TIGER values, a metric that estimates the internal consistency of a genetic dataset, as an additional metric for assessing how tree-like a linguistic dataset is. We use TIGER values to explore simulated language data ranging from very tree-like to completely unstructured, and also use them to analyze a cognate-coded basic vocabulary dataset of Uralic languages. As a point of comparison for the TIGER values, we also explore the same data using δ scores, Q-residuals, and NeighborNets. Our results suggest that TIGER values are capable of both ranking tree-like datasets according to their degree of treelikeness and distinguishing datasets with tree-like structure from datasets with a nontree-like structure. Consequently, we argue that TIGER values serve as a useful metric for measuring the historical heterogeneity of datasets. Our results also highlight the complexities in measuring treelikeness from linguistic data, and how the metrics approach this question from different perspectives.

Please use this url to cite or link to this publication: https://lup.lub.lu.se/record/d24e22f1-88fb-423d-a7ba-074934911bd0

author

Syrjänen, Kaj; Maurits, Luke; Leino, Unni; Honkola, Terhi; Rota, Jadranka (LU) and Vesakoski, Outi

organization

publishing date

2021-11-15

type

Contribution to journal

publication status

published

subject

Natural Language Processing

keywords

Language evolution, Quantitative linguistics, Simulated language data, TIGER algorithm, Uralic languages

in

Journal of Language Evolution

volume

6

issue

2

pages

99 - 118

publisher

Oxford University Press

external identifiers

scopus:85121237385

ISSN

2058-4571

DOI

10.1093/jole/lzab004

language

English

LU publication?

yes

additional info

id

d24e22f1-88fb-423d-a7ba-074934911bd0

date added to LUP

2022-01-04 06:39:42

date last changed

2025-10-14 09:06:45

@article{d24e22f1-88fb-423d-a7ba-074934911bd0,
  abstract     = {{In recent years, techniques such as Bayesian inference of phylogeny have become a standard part of the quantitative linguistic toolkit. While these tools successfully model the tree-like component of a linguistic dataset, real-world datasets generally include a combination of tree-like and nontree-like signals. Alongside developing techniques for modeling nontree-like data, an important requirement for future quantitative work is to build a principled understanding of this structural complexity of linguistic datasets. Some techniques exist for exploring the general structure of a linguistic dataset, such as NeighborNets, δ scores, and Q-residuals; however, these methods are not without limitations or drawbacks. In general, the question of what kinds of historical structure a linguistic dataset can contain and how these might be detected or measured remains critically underexplored from an objective, quantitative perspective. In this article, we propose TIGER values, a metric that estimates the internal consistency of a genetic dataset, as an additional metric for assessing how tree-like a linguistic dataset is. We use TIGER values to explore simulated language data ranging from very tree-like to completely unstructured, and also use them to analyze a cognate-coded basic vocabulary dataset of Uralic languages. As a point of comparison for the TIGER values, we also explore the same data using δ scores, Q-residuals, and NeighborNets. Our results suggest that TIGER values are capable of both ranking tree-like datasets according to their degree of treelikeness and distinguishing datasets with tree-like structure from datasets with a nontree-like structure. Consequently, we argue that TIGER values serve as a useful metric for measuring the historical heterogeneity of datasets. Our results also highlight the complexities in measuring treelikeness from linguistic data, and how the metrics approach this question from different perspectives.}},
  author       = {{Syrjänen, Kaj and Maurits, Luke and Leino, Unni and Honkola, Terhi and Rota, Jadranka and Vesakoski, Outi}},
  issn         = {{2058-4571}},
  keywords     = {{Language evolution; Quantitative linguistics; Simulated language data; TIGER algorithm; Uralic languages}},
  language     = {{eng}},
  month        = {{11}},
  number       = {{2}},
  pages        = {{99--118}},
  publisher    = {{Oxford University Press}},
  series       = {{Journal of Language Evolution}},
  title        = {{Crouching TIGER, hidden structure : Exploring the nature of linguistic data using TIGER values}},
  url          = {{http://dx.doi.org/10.1093/jole/lzab004}},
  doi          = {{10.1093/jole/lzab004}},
  volume       = {{6}},
  year         = {{2021}},
}

