Date
2013-09
Embargo
Journal title
Journal ISSN
Volume title
Publisher
Biblioteca Nacional de Portugal (BNP)
Alternative title
Abstract
The ability to recognize when digital content is becoming endangered is essential for maintaining long-term, continuous and authentic access to digital assets. To achieve this ability, knowledge about aspects of the world that might hinder the preservation of content is needed. However, the processes of gathering, managing and reasoning on knowledge can become manually infeasible when the volume and heterogeneity of content increase, multiplying the aspects to monitor. Automation of these processes is possible [11,21], but its usefulness is limited by the data it is able to gather. Up to now, automatic digital preservation processes have been restricted to knowledge expressed in a machine-understandable language, ignoring a plethora of data expressed in natural language, such as the DPC Technology Watch Reports, which could greatly contribute to the completeness and freshness of data about aspects of the world related to digital preservation.
This paper presents a real-case scenario from the National Library of the Netherlands, where the monitoring of publishers and journals is needed. This knowledge is mostly represented in natural language on the Web sites of the publishers and is therefore difficult to monitor automatically. In this paper, we demonstrate how we use information extraction technologies to find and extract machine-readable information on publishers and journals for ingestion into automatic digital preservation watch tools. We show that the results of automatic semantic extraction are a good complement to existing knowledge bases on publishers [9, 20], finding newer and more complete data. We demonstrate the viability of the approach as an alternative or auxiliary method for automatically gathering information on preservation risks in digital content.
Keywords
Digital preservation, Monitoring, Watch, Natural language, Information extraction
Document type
Conference paper
Publisher version
Dataset
Citation
Identifiers
TID
Designation
Access type
Open access