DiVA harvesting using OAI-PMH

The DAUF project now harvests DiVA publication data using the OAI-PMH protocol. The harvest regularly updates a single-file DuckDB database, which is openly available from object storage:

https://data.bibliometrics.lib.kth.se/kthcorpus/oai.db

The database with the harvested information is currently about 4.4 GB in size. It is regularly updated and contains MODS and JSON representations of “all-kth” DiVA records.

Harvesting mechanism

The harvesting mechanism is similar to the approach used in swepub-redux, but adapted to KTH needs. The kthcorpus R package contains functions that power the harvesting process (including creating, downloading and updating the oai.db DuckDB database).
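As a rough illustration of what an OAI-PMH harvest involves, here is a minimal Python sketch of the protocol's ListRecords loop with resumption tokens. It is not the kthcorpus implementation (which is written in R). The base URL and metadata prefix are left as parameters because the values DiVA expects are not stated here, and the function names harvest and fetch_xml are hypothetical.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def fetch_xml(url):
    """Download one OAI-PMH response and return its raw body (bytes)."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def harvest(base_url, metadata_prefix, set_spec, fetch=fetch_xml):
    """Yield (identifier, datestamp) for every record header in a set.

    Follows resumptionToken elements until the repository signals the
    end of the list with an empty or missing token.
    """
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "set": set_spec,
    }
    while True:
        url = base_url + "?" + urllib.parse.urlencode(params)
        root = ET.fromstring(fetch(url))
        for header in root.iterfind(".//oai:record/oai:header", OAI_NS):
            yield (
                header.findtext("oai:identifier", namespaces=OAI_NS),
                header.findtext("oai:datestamp", namespaces=OAI_NS),
            )
        token = root.findtext(".//oai:resumptionToken", namespaces=OAI_NS)
        if not token:
            break
        # A resumed request carries only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": token}
```

A call such as harvest(base_url, prefix, "all-kth") would then walk the whole set page by page. Incremental updates could additionally pass the protocol's from parameter, which this sketch leaves out.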

Enrichment example

To illustrate options for enriching DiVA publications, we use the Aurora classifier to associate SDGs with all DiVA theses.

SDG goals from Aurora

The Aurora classifier service enables you to relate a text fragment to one of the 17 United Nations Sustainable Development Goals (SDGs), get an SDG Badge and use the SDG API.

More details can be read in the paper “AI for mapping multi-lingual academic papers to the United Nations’ Sustainable Development Goals (SDGs)”.

Illustrations

Out of 48592 theses in the “all-kth” set from DiVA, 37547 were classified by Aurora (prediction > 0.4, i.e. at most two goals per thesis).
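The limit of two goals per thesis follows if the scores for one text sum to at most 1: three scores above 0.4 would already add up to more than 1.2. The Python sketch below applies such a threshold to a set of scores. That the Aurora scores are normalised in this way is an assumption here, and both the scores and the function name are made up for illustration.

```python
THRESHOLD = 0.4


def goals_above_threshold(scores, threshold=THRESHOLD):
    """Return (goal, score) pairs strictly above the threshold, highest first."""
    kept = [(goal, score) for goal, score in scores.items() if score > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Made-up scores for a single abstract; they sum to 1.0 (assumed normalisation).
example_scores = {"SDG 7": 0.45, "SDG 13": 0.41, "SDG 9": 0.10, "SDG 12": 0.04}

selected = goals_above_threshold(example_scores)
print(selected)  # [('SDG 7', 0.45), ('SDG 13', 0.41)]
```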

The following table presents an overview of the number of distinct theses per year that were classified to a specific SDG, for the last ten years. Darker cells indicate higher frequencies (number of distinct theses associated with the specific goal).

Trends

The graphs below show the relative position (ranking) of specific SDGs over the ten-year period.
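One way to derive such rankings is to order the goals within each year by their number of distinct theses. The Python sketch below does this for a small table of counts. The counts are invented for illustration, and the tie-breaking rule (alphabetical by goal name) is an assumption rather than the method used for the graphs.

```python
from collections import defaultdict


def rank_goals_by_year(counts):
    """Turn {(year, goal): count} into {goal: {year: rank}}, rank 1 = most theses."""
    by_year = defaultdict(list)
    for (year, goal), n in counts.items():
        by_year[year].append((goal, n))

    ranks = defaultdict(dict)
    for year, rows in by_year.items():
        # Sort by descending count; ties are broken alphabetically by goal name.
        ordered = sorted(rows, key=lambda row: (-row[1], row[0]))
        for position, (goal, _) in enumerate(ordered, start=1):
            ranks[goal][year] = position
    return {goal: dict(sorted(years.items())) for goal, years in ranks.items()}


# Invented counts of distinct theses per (year, goal).
counts = {
    (2022, "Goal 3"): 120, (2022, "Goal 7"): 300, (2022, "Goal 12"): 90,
    (2023, "Goal 3"): 150, (2023, "Goal 7"): 280, (2023, "Goal 12"): 160,
}

ranks = rank_goals_by_year(counts)
print(ranks["Goal 12"])  # {2022: 3, 2023: 2}
```

Plotting each goal's rank against the year then gives one line per goal, where a falling rank number means a stronger position.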

Top Three Goals

The top three consists of more or less the same three goals throughout, although Goal 12 seems to have entered the top three recently.

Amongst “gainers” or risers, we find “Good health and well-being”, “Peace, Justice and strong institutions” and “Quality Education”.

Goals on the rise

Amongst the “losers”, whose relative position weakens over the ten-year period, we find “No poverty”, “Life below water” and “Decent work and economic growth”.

Goals on the decline

Goals reported to external rankings

Future work

Recently, enrichments with categorical data have been added, making it possible to break down the data above by type of thesis (student / licentiate / doctoral) and by school. These dimensions are not yet reflected in this report, which illustrates the full set of publications.
Local AI can be used for SDG (and other) classifications; currently the Aurora service has been used.
More and richer text fragments can be used; currently the abstract has been used. This can be complemented by full text, titles and other relevant text extracts.

Accessing this data

Various clients (R, Python, curl, NodeJS, harlequin.sh, PowerBI etc) can be used to read the data from here:

KTH Theses SDGs data in .parquet format

The data used in the illustrations above can be inspected and queried openly, directly in the web browser, by using this link:

SQL Workbench for KTH Theses SDG dataset