Accepted Papers

Full Papers

Natural Language Querying for Humanities Knowledge Graphs: A case study on the GOLEM Knowledge Graph

Jose Maldonado-Rodríguez, Arianna Graciotti, Valentina Presutti and Federico Pianzola

Large-scale Knowledge Graphs (KGs) are increasingly relevant for humanities research, yet querying them via SPARQL poses challenges for non-technical users. While Text-to-SPARQL studies predominantly target popular KGs such as Wikidata or DBpedia, domain-specific KGs remain underexplored. This paper introduces a bilingual (English-Spanish) dataset designed for evaluating automatic text-to-SPARQL translation on GOLEM, a humanities KG containing metadata and extracted features from fanfiction stories hosted on Archive of Our Own (AO3). The dataset includes 477 manually crafted natural language questions paired with gold SPARQL queries, augmented to 1,895 questions through automatic paraphrasing. We benchmark several Large Language Models (LLMs) with prompt-based approaches, particularly examining in-context learning methods that select prompt examples based on semantic similarity, which yield the best results. Error analysis highlights entity linking as essential for improving query generation. This work provides practical insights and opens pathways for future research on natural language interfaces for querying domain-specific KGs in Digital Humanities. The dataset and output of our experiments are available at: https://github.com/GOLEM-lab/GOLEM_Text-to-SPARQL

CorefLat. Coreference Resolution for Latin as Linked Open Data

Eleonora Delfino, Roberta Grazia Leotta, Francesco Mambrini, Marco Passarotti, and Giovanni Moretti

This paper presents the publication as Linked Open Data of a set of coreference and anaphora annotations (called CorefLat) performed on a set of Latin texts. Annotations are made on texts already available as Linked Open Data as part of the LiLa Knowledge Base of interoperable linguistic resources for Latin. By adopting a lemma-centered architecture and established guidelines for annotation inspired by those of the GUM corpus, CorefLat systematically identifies and tags entities and mentions, creating relational links. The annotated corpus covers multiple periods and genres, including Augustine’s Confessiones, Plautus’ Curculio, Caesar’s De Bello Gallico, and Seneca’s Medea, ensuring a balanced dataset for broader linguistic analysis. The publication of CorefLat as Linked Open Data relies on an OWL ontology that extends the POWLA framework, thus enabling interoperability with diverse linguistic resources within LiLa. We detail how coreference relations, including phenomena such as anaphora, cataphora, split antecedents, and multiword units, are encoded through specialized classes and object properties.

Curated datasets for literary tourism: a case study in knowledge graph creation

Miriam Begliuomini, Marius Crisan, Enrico Daga, Rossana Damiano, Florin Nechita, Laurence Roussillon-Constanty, Marco Antonio Stranisci, and Cristina Trinchero

European mountains have inspired generations of writers whose works can play a significant role in determining the touristic potential of the area. However, fragmentation of research and cultural initiatives about Europeanmountains hinder this potential, even in the digital age. In this paper, we describe the use of the World Literature Knowledge Graph (WL-KG) to integrate the curated data sets of writers and works created by a set of research projects about European mountains as part of the CON.NE.C.T W.O.N.D.E.R.S. project using the SPARQL Anything library for triplification. The goal of the project is two-fold: on the one side, it aims at bridging local repositories of literary data, remodeling them according to a common model when needed, to overcome the fragmentation of the otherwise underrepresented research about the mountain areas across Europe. On the other side, it aims at creating applications that leverage the networked representation of literary, geographical and temporal data for the discovery and exploitation of new paths and connections in the field of literary tourism.

Short/Position Papers

Harold: an iterative and interactive query system for exploring cultural heritage corpus

Prunelle Daudre-Treuil, Olivier Bruneau, Jean Lieber, Emmanuel Nauer and Laurent Rollet

With the development of Semantic Web technologies in digital humanities, more and more users with little knowledge about these technologies need to interact with them. This paper presents Harold, a system for accessing the cultural heritage corpus without having to know any particular computer language, such as SPARQL. Harold is a conversational system with which a historian can interact using a user-friendly query interface together with navigation access to explore the corpus. Harold organizes the documents into a hierarchical structure using formal concept analysis (FCA) to provide a synthetic way to navigate the documents. The user may interact with the hierarchy concepts to focus on relevant documents and remove irrelevant ones. In addition, an ontology management interface is provided to assist the user in managing concepts related to their research problem. This ontology can be used by the retrieval process to better structure hierarchical access to the documents. Moreover, the concepts built by FCA provide interesting information that can guide the user to a new retrieval step to find more relevant documents related to their research problem.

Enhancing Provenance Research with Linked Data: A Visual Approach to Knowledge Discovery

Sarah Binta Alam Shoilee, Annastiina Ahola, Heikki Rantala, Eero Hyvönen, Victor de Boer, Jacco van Ossenbruggen, and Susan Legene

Provenance research is critical for understanding the historical trajectories of cultural objects housed in museums, yet it is often hindered by fragmented, ambiguous, or missing data. With the increasing adoption of Linked Data (LD) in cultural heritage, new possibilities emerge for analysing provenance metadata. This paper presents the PM-Sampo demonstrator, a structured approach to analysing provenance data through Linked Data methodologies and visualisation techniques. By connecting historical events, places, and actors to object collections and analysing data with visualisation tools, PM-Sampo aims to facilitate large-scale provenance analysis, enabling domain researchers to detect patterns, inconsistencies, and hidden connections that could otherwise go unnoticed. A case study on objects from Dutch museums associated with the Aceh War (1873–1914), an armed conflict between the Netherlands and the Muslim sultanate of Aceh, illustrates the functionalities of the demonstrator, revealing gaps in acquisition records, unexpected geographical distributions, and acquisition timelines extending well beyond the formal end of the conflict. The establishment of actor-connections further brings to the surface overlooked relationships between individuals and institutions, while provenance visualisation highlights the need for more comprehensive provenance documentation by domain experts. The study underscores the opportunities of data-driven approaches in provenance research, demonstrating how visualisation tools can aid in knowledge discovery and exploring knowledge gaps.

How to Create a Portal for Digital Humanities Research Using a Linked Open Data Cloud of Cultural Heritage Knowledge Graphs: Case SampoSampo

Eero Hyvönen, Petri Leskinen, Annastiina Ahola, Heikki Rantala, and Jouni Tuominen

This paper presents a novel approach and first results of creating a global data service and portal, SampoSampo, based on a cloud of interlinked Cultural Heritage knowledge graphs of different application domains. In this way, a more comprehensive global view for searching, exploring, and analyzing entities with enriched linked data and their semantic connections can be provided than by using local KGs separately.

CIDOC-CRM and the First Prototype of a Semantic Portal for the CHExRISH project

Luiz do Valle Miranda, Krzysztof Kutt, and Grzegorz J. Nalepa

In this paper we present work in progress towards the integration of cultural heritage resources from different units of the Jagiellonian University as linked data and the prototyping of their presentation in a semantic portal. CIDOC-CRM has been chosen as the data model behind such an interoperability given its wide use and its flexibility. Challenges arose when converting bibliographical authorship relations into CRM’s event-centered structure and when migrating instances’ hierarchical classification into CRM’s “E55 Type” and “P127 has broader term” conversion standard. Despite following CRM’s data modeling best practices, these challenges reappeared while publishing the data in an Omeka S-powered website, thus showing some lack of compatibility between these two frameworks.

Exploring and Visualizing Italian Advertising Fliers and Posters through an Iconographical Lens with Linked Open Data

Bruno Sartini

This paper explores Italian fliers and posters through an iconographical lens, leveraging Linked Open Data (LOD) from IICONGRAPH, a KG that extends the iconographical and iconological statements of ArCo. First, we examine the findings of a qualitative study on fliers and posters and assess them with SPARQL queries and visualizations, producing results that match with claims in the study. We also investigate annual promotion trends of fliers and posters through temporal analysis, identifying shifts in advertising themes and iconographical representations. Then, we conduct a small-scale study on gender representation, examining how male and female figures co-occur with specific elements in fliers and posters, highlighting variations in visual composition and associations to specific types of advertisements. Finally, we analyze the statistical dependency between promotional themes and depictions using the chi-square test. This study demonstrates how structured iconographical data can bridge qualitative insights with empirical validation in cultural heritage research.

Data-rich Web Annotations. Embedding datasets to link complex metaphor analyses with their textual basis

Philipp Tögel, Henning Gebhard, Frederik Elwert, Stefanie Dipper, Makar Fedorov, Vandana Jha, Volkhard Krech, and Danah Tonne

Annotating is a central activity with a long history in the humanities. For the purpose of digital annotations, the Web Annotation Data Model (WADM) is an established W3C standard that enables data sharing and is supported by a wide array of applications. Storing simple annotations is quite easy, but storing complex data is difficult. We propose a generic extension mechanism for the WADM that allows storing structured data inside the body of a Web Annotation. In contrast to previous research, our proposal uses the base WADM without custom extensions (except for the embedded data themselves), and thus facilitates data sharing. As a use case, we show how a domain ontology is used to model structured information about metaphors in religious texts, and how we apply our approach to store the information in data-rich annotations that can be used for queries that support comparative research across languages.

LRMoo as the Conceptual Model for the Lem Knowledge Grap

Luiz do Valle Miranda, Jakub Gomułka, Szymon Piotr Kukulak, Krzysztof Kutt, and Grzegorz J. Nalepa

This paper explores the challenges of representing the works of Stanisław Lem (1921–2006) in a knowledge graph (KG), focusing on their complex versions across languages, editions, and collections. Lem’s works, characterized by an evolving content and multiple versions, require a refined approach to capture their nuances. We propose using LRMoo, an extension of CIDOC-CRM, to model Lem’s prose. Through a case study of The Star Diaries, we demonstrate how LRMoo can appropriately represent these complexities.

Comparing FAIR Assessment Tools and their Alignment with FAIR Implementation Profiles using Digital Humanities Datasets

Andre Valdestilhas, Menzo Windhouwer, Ronald Siebes, and Shuai Wang

FAIR principles serve as guidelines for implementing data and metadata to improve Findability, Accessibility, Interoperability, and Reusability. In recent years, numerous tools have been developed to assess how well datasets adhere to each FAIR principle. However, due to their diverse designs, these tools interpret the FAIR principles differently and provide varying assessment results, which can be confusing. Many communities publish datasets that follow similar data management practices, and some of these common practices have recently been compiled into community standards known as FAIR Implementation Profiles (FIPs). This paper compares the metrics of FAIR assessment tools with FIP. We illustrate these differences by analyzing the assessment results of two datasets in the Digital Humanities domain and further explore how these results compare with their corresponding FIPs.

Everything is biased ... now what?! Introducing the Bias-Aware Framework

Amber Zijlma and Mrinalini Luthra

In the digital humanities, datasets inherit and perpetuate biases through multiple channels: individual and institutional biases, discriminatory language in archives, unequal representation in collection practices, and algorithmic biases in AI-assisted processing. These biases are compounded throughout the research process, yet the term “bias” itself lacks a clear definition, often causing “bias paralysis.” This paper proposes treating “bias” as a productive category of analysis for digital humanities research through the development of a “Bias-Aware Framework” for dataset creation and contextualisation. It has three components: a Bias Thesaurus creating shared vocabulary across disciplines to address the conceptual instability of “bias” by breaking down this nebulous concept into interrelated issues like representation, gaps, positionality, CARE, etc; a Bias-Aware Dataset Lifecycle Model showing where biases enter the research process; and Guidelines for documenting, describing, and mitigating bias. We approach bias not simply as an error, but as a revealing analytical lens that shapes knowledge production. By explicitly describing these conditions of production, researchers can improve transparency, improve dataset documentation, and enable more informed reuse of their data.

Accepted Papers SemDH 2025

Full Papers

Short/Position Papers