Enrichment of Publications

Example: SDGs for KTH Theses

Curation and enrichment of DiVA publications data (OAI-PMH harvest to oai.db)
Published: 2024-10-22

DiVA harvesting using OAI-PMH

The DAUF project now harvests DiVA publication data using the OAI-PMH protocol. The harvest regularly updates a single-file DuckDB database, which is openly available from object storage:

https://data.bibliometrics.lib.kth.se/kthcorpus/oai.db

The database with the harvested information is currently about 4.4 GB. It is regularly updated and contains MODS and JSON representations of the “all-kth” DiVA records.
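As a minimal sketch of how the database could be fetched and inspected from Python, the following downloads the file once and lists its tables. It assumes the third-party `duckdb` package is installed. The helper names `download` and `list_tables` are illustrative and not part of kthcorpus. Since the table names are not listed here, the sketch only enumerates them rather than assuming any schema.

```python
import shutil
import urllib.request
from pathlib import Path

# URL as given above; the file is several GB, so download it only once.
OAI_DB_URL = "https://data.bibliometrics.lib.kth.se/kthcorpus/oai.db"


def download(url, dest, opener=urllib.request.urlopen):
    """Stream `url` to `dest` unless `dest` already exists; return the path."""
    dest = Path(dest)
    if dest.exists():
        return dest
    # Write to a temporary name first so an interrupted download
    # never leaves a truncated file under the final name.
    part = dest.with_name(dest.name + ".part")
    with opener(url) as response, open(part, "wb") as out:
        shutil.copyfileobj(response, out)
    part.replace(dest)
    return dest


def list_tables(db_path):
    """Open the DuckDB file read-only and return its table names."""
    import duckdb  # third-party; assumed to be installed

    con = duckdb.connect(str(db_path), read_only=True)
    try:
        return [row[0] for row in con.execute("SHOW TABLES").fetchall()]
    finally:
        con.close()
```

A typical call would be `list_tables(download(OAI_DB_URL, "oai.db"))`. Opening the file read-only avoids accidental changes to the local copy.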

Harvesting mechanism

The harvesting mechanism is similar to the approach used in swepub-redux, but adapted to KTH needs. The kthcorpus R package contains the functions that power the harvesting process (including creating, downloading and updating the oai.db DuckDB database).
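The actual harvester lives in the kthcorpus R package. As a language-neutral illustration of the protocol itself, here is a minimal Python sketch of an OAI-PMH `ListRecords` loop that follows resumption tokens. It is not the kthcorpus implementation. The base URL and metadata prefix are supplied by the caller and are assumptions here, and the `harvest` and `http_get` names are illustrative.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def http_get(url):
    """Fetch a URL and return the raw response body."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def harvest(base_url, metadata_prefix, set_spec=None, from_date=None, fetch=http_get):
    """Yield one dict per harvested record, following resumption tokens."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    if from_date:
        params["from"] = from_date  # enables incremental updates
    while True:
        url = base_url + "?" + urllib.parse.urlencode(params)
        root = ET.fromstring(fetch(url))
        error = root.find("oai:error", OAI_NS)
        if error is not None:
            if error.get("code") == "noRecordsMatch":
                return  # nothing new since `from_date`
            raise RuntimeError(f"OAI-PMH error {error.get('code')}: {error.text}")
        for record in root.iterfind("oai:ListRecords/oai:record", OAI_NS):
            header = record.find("oai:header", OAI_NS)
            yield {
                "identifier": header.findtext("oai:identifier", namespaces=OAI_NS),
                "datestamp": header.findtext("oai:datestamp", namespaces=OAI_NS),
                "deleted": header.get("status") == "deleted",
                "xml": ET.tostring(record, encoding="unicode"),
            }
        token = root.findtext("oai:ListRecords/oai:resumptionToken", namespaces=OAI_NS)
        if not token or not token.strip():
            return  # a missing or empty token marks the last page
        # A resumption token is an exclusive argument: send it with the verb only.
        params = {"verb": "ListRecords", "resumptionToken": token.strip()}
```

Passing `fetch` as a parameter keeps the loop testable without network access. A regular update job would call `harvest` with `from_date` set to the time of the previous run and upsert the yielded records into the database.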

Enrichment example

To illustrate the options for enriching DiVA publications, we use the Aurora classifier to associate SDGs with all DiVA theses.

SDG goals from Aurora

The Aurora classifier service enables you to relate a text fragment to one of the 17 United Nations Sustainable Development Goals (SDGs), get an SDG Badge and use the SDG API.

More details can be read in the paper “AI for mapping multi-lingual academic papers to the United Nations’ Sustainable Development Goals (SDGs)”.

Illustrations

Out of 48592 theses in the “all-kth” set from DiVA, 37547 (about 77%) were classified by Aurora (prediction score > 0.4, i.e. at most 2 goals per thesis).
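The threshold step can be sketched as follows. The function name and the example scores are hypothetical and not actual Aurora output. The "at most 2 goals" bound holds under the assumption that the scores for one text are non-negative and sum to at most 1, because three scores above 0.4 would already add up to more than 1.2.

```python
def goals_above_threshold(scores, threshold=0.4):
    """Return (goal, score) pairs with score > threshold, highest score first.

    `scores` maps a goal label to its prediction score for one text.
    """
    hits = [(goal, score) for goal, score in scores.items() if score > threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical scores for one abstract (illustrative values only).
example = {"SDG 7": 0.55, "SDG 13": 0.41, "SDG 9": 0.04}
selected = goals_above_threshold(example)
```

A thesis whose scores all fall at or below the threshold gets an empty list, which corresponds to the theses that were not classified.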

The following table presents an overview of the number of distinct theses per year that were classified to a specific SDG, for the last ten years. Darker cells indicate higher frequencies (number of distinct theses associated with the specific goal).
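The counts behind such a table can be sketched as a small aggregation that counts each thesis once per year and goal. The row layout `(thesis_id, year, goal)` and the sample rows are assumptions for illustration and do not reflect the actual dataset schema.

```python
from collections import defaultdict


def count_theses_by_year_and_goal(rows):
    """Return {year: {goal: number of distinct theses}}.

    `rows` is an iterable of (thesis_id, year, goal) tuples; duplicates of
    the same thesis within a year and goal are counted once.
    """
    seen = defaultdict(set)
    for thesis_id, year, goal in rows:
        seen[(year, goal)].add(thesis_id)
    table = defaultdict(dict)
    for (year, goal), thesis_ids in seen.items():
        table[year][goal] = len(thesis_ids)
    return dict(table)


# Hypothetical rows: thesis "t1" appears twice for the same year and goal.
rows = [
    ("t1", 2023, "SDG 7"),
    ("t1", 2023, "SDG 7"),
    ("t2", 2023, "SDG 7"),
    ("t2", 2023, "SDG 13"),
    ("t3", 2024, "SDG 7"),
]
counts = count_theses_by_year_and_goal(rows)
```

A thesis assigned to two goals contributes to two cells, so the cells of one year can sum to more than the number of theses in that year.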

Future work

  • Enrichments with categorical data have recently been added, making it possible to break down the data above by type of thesis (student / licentiate / doctoral) and at the school level. These dimensions are not yet reflected in this report, which illustrates the full set of publications.

  • Local AI can be used for SDG (and other) classifications; so far the Aurora service has been used.

  • More and richer text fragments can be used; so far the abstract has been used. This can be complemented by full text, titles and other relevant text extracts.

Accessing this data

Various clients (R, Python, curl, NodeJS, harlequin.sh, PowerBI etc) can be used to read the data from here:

KTH Theses SDGs data in .parquet format
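As one example of client access, a Parquet file served over HTTPS can be previewed from Python with DuckDB. This is a minimal sketch, assuming the third-party `duckdb` package is installed and can load its `httpfs` extension. The URL in the usage note below is a placeholder, not the real link target, and no column names are assumed because the schema is not described here.

```python
def sql_literal(value):
    """Quote a string as a SQL literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def preview_sql(parquet_url, limit=5):
    """Build a query that returns the first `limit` rows of a Parquet file."""
    return f"SELECT * FROM read_parquet({sql_literal(parquet_url)}) LIMIT {int(limit)}"


def run_query(sql):
    """Run `sql` in an in-memory DuckDB session and return all rows."""
    import duckdb  # third-party; assumed to be installed

    con = duckdb.connect()
    try:
        return con.execute(sql).fetchall()
    finally:
        con.close()
```

With the real link substituted for the placeholder, `run_query(preview_sql("https://example.org/kth-theses-sdgs.parquet"))` would return the first few rows. Older DuckDB versions may need `INSTALL httpfs; LOAD httpfs;` to be run first.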

The data used in the illustrations above can be inspected and queried openly, directly in the web browser, using this link:

SQL Workbench for KTH Theses SDG dataset