Authors: Mike Bryant, Frank Uiterwaal
Reviewer: Maciej Maryl
The heterogeneity and multilinguality of data sources pose a significant challenge to researchers across many fields in the Social Sciences and Humanities (SSH). This blog post discusses this challenge in the context of the European Holocaust Research Infrastructure (EHRI) and describes an approach to classifying textual metadata so that researchers can find information more easily, even if the descriptive text is in a language they don’t speak or read.
Classification in this context is sometimes called “Subject Indexing” because you’re essentially building an index like that found in the back of a book (as distinct from a “Search Index”, which is structured for retrieving information based on free text queries). EHRI has a database — the EHRI Portal — which contains a few hundred thousand descriptions of physical material held in archives around the world, and because most of these descriptions were written by the archives themselves, they’re in many different languages. EHRI also has a set of subject headings — called “EHRI Terms” — which, because it was developed largely by and for subject-matter specialists, has over 900 entries, varying from the general to the quite specific. This set of subjects has also been translated into 12 languages to improve accessibility.
The visualisation below shows connections between archival institutions that have the same subject headings applied to their holdings in the EHRI Portal (coloured by country). While a shared controlled vocabulary is used, its application varies across archival descriptions from different holding institutions. This inconsistency highlights an opportunity for LLM-assisted data augmentation.

Automated Subject Indexing (ASI) is a well-established information retrieval task that can be approached using many different “text understanding” techniques. There’s also a specialisation of ASI tailored towards very large sets of subjects – or classes, if we’re talking about classifying, for example, product descriptions into thousands of categories – called “extreme classification”. EHRI’s use-case is not quite extreme, but it still involves a significant number of subjects. With more than a few subjects, one of the complications in ASI becomes assessing whether a particular tool produces good results. This is because even humans trained at classifying the same material will tend to do the job quite differently from one another – that is, there will typically be limited overlap between the subjects they choose for a specific piece of text.
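To make the evaluation problem concrete, the limited agreement between trained human indexers can be quantified as set overlap between the subjects each of them assigns to the same text. The sketch below uses Jaccard similarity for this; the indexer names and subject headings are purely illustrative, not taken from the EHRI dataset.

```python
def jaccard_overlap(subjects_a, subjects_b):
    """Jaccard similarity between two sets of assigned subject headings:
    size of the intersection divided by size of the union."""
    a, b = set(subjects_a), set(subjects_b)
    if not a and not b:
        return 1.0  # both empty: trivially identical
    return len(a & b) / len(a | b)

# Two hypothetical indexers labelling the same archival description:
indexer_1 = {"deportations", "ghettos", "forced labour"}
indexer_2 = {"deportations", "concentration camps", "forced labour", "transports"}

print(jaccard_overlap(indexer_1, indexer_2))  # 2 shared / 5 total = 0.4
```

An overlap of 0.4 between careful human indexers is not unusual, which is why a single “gold standard” annotation set is a shaky yardstick for judging an automated tool.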
Employing LLMs for subject indexing
Because Large Language Models (LLMs) are seemingly being used for everything these days, it may be unsurprising that they can also work as classification tools. (Because the underlying technique that powers most LLMs now — the transformer model — was originally developed for automated translation tasks, it might be easier to see how the technology fits the multilingual ASI use-case, rather than being the proverbial Maslow’s hammer.) EHRI previously ran some experiments comparing a variety of classification techniques — text similarity approaches and what we could call “classical” machine learning (ML) — against a small LLM. When we reviewed the results in an ad hoc manner, they seemed pretty good! But measured against our “gold standard” dataset, the scores were, in fact, very bad. What could account for this difference? We think it is at least partly due to deficiencies in the dataset and gold standard, which, owing to the way it was created, favoured high-level, general subject terms over the more specific ones preferred by the LLM. This led us to conduct a more systematic form of qualitative (rather than quantitative) adjudication, and we concluded that the LLM really did have some potential, particularly in “person-in-the-loop” situations, where there is someone to check the appropriateness of the chosen subjects.
Chances are good that if you try some multilingual ASI with a cutting-edge proprietary LLM like GPT-4o, Claude, or Mistral, you’ll be pretty impressed with the results, especially if you spend a while in the prompt instructions explaining the context of your task and what it should do in cases where, for example, the text input contains too little information to make much of a judgement (a common case in our dataset, we found). Depending on the “context size” (the amount of input an LLM can process in one go), you can typically paste in all your subjects, along with the input text, and ask for a list of, say, the five most appropriate subjects. The results can be so good, when reviewed in an ad hoc manner, that it might be tempting to declare the problem solved.
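The prompt-assembly step described above can be sketched in a few lines. This is a minimal, hypothetical example: the instruction wording, the `build_prompt` helper, and the sample vocabulary are our own illustrations, not EHRI’s actual prompt, and the call to the model itself is deliberately left out since it varies by provider.

```python
def build_prompt(text, subjects, k=5):
    """Assemble a zero-shot classification prompt: the full controlled
    vocabulary plus the input text, asking for the top-k subjects.
    Feasible only if the vocabulary fits in the model's context window."""
    vocab = "\n".join(f"- {s}" for s in sorted(subjects))
    return (
        "You are indexing archival descriptions for a research portal.\n"
        f"Choose the {k} most appropriate subject headings for the text "
        "below, using ONLY headings from this controlled vocabulary. "
        "If the text is too short to judge, return fewer headings.\n\n"
        f"Vocabulary:\n{vocab}\n\n"
        f"Text:\n{text}\n\n"
        f"Answer with up to {k} headings, one per line."
    )

prompt = build_prompt(
    "Correspondence concerning deportations from Westerbork, 1942-1944.",
    ["deportations", "transit camps", "correspondence", "ghettos"],
)
print(prompt)
```

The resulting string would then be sent to whichever model you are testing; constraining answers to the controlled vocabulary, and allowing the model to return fewer headings for thin inputs, are the two instructions that mattered most in our informal trials.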
Challenges for large databases
There are a few issues, however, aside from the well-known propensity of LLMs to output spurious information when dealing with topics not well covered by their training data (an issue somewhat mitigated when using a controlled vocabulary). For one thing, if you’re going to classify hundreds of thousands (or millions) of texts with a large set of subjects, proprietary LLMs are unlikely to be a cheap way to do it. They also use huge amounts of electricity, knock independent websites offline, fuel misinformation, and require placing a lot of trust in a third party in a way that could justifiably raise data protection or privacy concerns. And while some very capable LLMs can be self-hosted if you have a powerful enough computer, even the most efficient of those are tremendously slow and energy-intensive relative to other methods of text classification, such as partitioned label trees.
Rigorously evaluating the results of ASI performed via a commercial LLM can be complicated too, especially if, like EHRI, much of your gold-standard labelled input data has been available on the open web for years. It’s difficult to really judge the output of an LLM if it was at some point trained on the same information you’re using to evaluate it, since that’s like giving it the answers before the test (seemingly a factor in a lot of LLM benchmarking, and a difficult problem to get around given the lack of transparency around proprietary LLM training datasets).
These qualms aside, the way LLMs can translate concepts quite seamlessly between languages for which they have seen a lot of training data is potentially very valuable in EHRI’s extensively multilingual setting. LLMs can, moreover, classify text in what is known in the jargon as a “zero-shot” manner, meaning you don’t have to train them or give them explicit examples beforehand; much of the information they use to classify text is (or should be) in the vast amounts of data they ingested when originally created. These attributes make them quite promising not just for pure classification, but for data augmentation and the construction of training datasets for use with other tools and systems.
Why GRAPHIA?
For its GRAPHIA use-case, EHRI will continue these experiments using GRAPHIA’s LLM4SSH infrastructure, which will be tailored for use in SSH and as transparent as possible about things like training data, making it somewhat less of a black box than most commercial offerings. As before, we’ll continue to use the excellent Annif tool, a sort of Swiss Army Knife for subject indexing, which provides the machinery for comparing subject indexing results across different classifiers and quantitatively evaluating them.
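The kind of quantitative evaluation that Annif supports typically boils down to comparing a classifier’s ranked suggestions against a gold-standard label set. The sketch below shows one such comparison, precision and recall at k; the suggested and gold subject lists are invented for illustration, and this is our own simplified re-implementation of the idea rather than Annif’s code.

```python
def precision_recall_at_k(suggested, gold, k=5):
    """Precision and recall of the top-k ranked suggestions against a
    gold-standard set of subject labels."""
    top_k = set(suggested[:k])
    gold = set(gold)
    hits = len(top_k & gold)
    precision = hits / k
    recall = hits / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical ranked output from a classifier, and a gold-standard set:
suggested = ["deportations", "transit camps", "children", "ghettos", "resistance"]
gold = ["deportations", "transit camps", "forced labour"]

print(precision_recall_at_k(suggested, gold))  # 2 hits: (0.4, 0.666...)
```

Metrics like these make different classifiers directly comparable, though, as noted above, low human inter-indexer agreement means the absolute numbers should be read with some scepticism.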
We’re also planning to integrate Annif into the EHRI Portal’s administrative interface, so EHRI staff can assess person-in-the-loop subject indexing for themselves on new and existing archival descriptions, and give us feedback on how this system works in practice. The outcome, we hope, will not just be a more effective system for classifying multilingual archival descriptions in our domain and for the EHRI Portal, but also one that has applicability across SSH in general for making multilingual information easier for researchers to use.
This post is part of the GRAPHIA series presenting use-cases and pilots of the GRAPHIA project, of which OPERAS is a part.
