Metadata Export Formats

Introduction

Dataverse ships with a number of metadata export formats available for published datasets. A given metadata export format may be available for user download (via the UI and API) and/or be available for use in Harvesting between Dataverse instances.

As of v5.14, Dataverse provides a mechanism for third-party developers to create new metadata Exporters that implement new metadata formats or that replace existing formats. All the necessary dependencies are packaged in an interface JAR file available from Maven Central. Developers can distribute their new Exporters as JAR files which can be dynamically loaded into Dataverse instances - see External Metadata Exporters. Developers are encouraged to work with the core Dataverse team (see Getting Help) to distribute these JAR files via Maven Central. See the Croissant and Debug artifacts as examples. You may find other examples under Inventory of External Exporters in the Installation Guide.

Exporter Basics

New Exporters must implement the io.gdcc.spi.export.Exporter interface. The interface includes a few methods through which the Exporter tells Dataverse the name of the format it produces, a display name, the format's MIME type, whether the format is for download and/or harvesting use, etc. It also includes a main exportDataset(ExportDataProvider dataProvider, OutputStream outputStream) method through which the Exporter receives metadata about the given dataset (via the ExportDataProvider, described further below) and writes its output (to the OutputStream).

Exporters that create an XML format must implement the io.gdcc.spi.export.XMLExporter interface (which extends the Exporter interface). XMLExporter adds a few methods through which the XMLExporter provides information to Dataverse about the XML namespace and version being used.

Exporters also need to use the @AutoService(Exporter.class) annotation, which makes the class discoverable as an Exporter implementation.
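The shape of a minimal Exporter can be sketched as follows. So that the sketch compiles without the dataverse-spi JAR, it declares simplified stand-in types nested inside one class; their method names and signatures are assumptions based on the description above, and the RawJsonExporter class and its "raw_json_sketch" format name are hypothetical. A real Exporter would import the io.gdcc.spi.export types instead and carry the @AutoService(Exporter.class) annotation.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class MinimalExporterSketch {

    /** Simplified stand-in for io.gdcc.spi.export.ExportException. */
    static class ExportException extends Exception {
        ExportException(String message, Throwable cause) { super(message, cause); }
    }

    /** Stand-in for ExportDataProvider; the real getDatasetJson() returns a JSON object, not a String. */
    interface ExportDataProvider {
        String getDatasetJson();
    }

    /** Stand-in for Exporter; names and signatures are assumptions, not the published API. */
    interface Exporter {
        String getFormatName();
        String getDisplayName(Locale locale);
        String getMediaType();
        Boolean isHarvestable();
        Boolean isAvailableToUsers();
        void exportDataset(ExportDataProvider dataProvider, OutputStream outputStream)
                throws ExportException;
    }

    // Hypothetical exporter. In a real project: @AutoService(Exporter.class)
    static class RawJsonExporter implements Exporter {
        @Override public String getFormatName() { return "raw_json_sketch"; }
        @Override public String getDisplayName(Locale locale) { return "Raw JSON (sketch)"; }
        @Override public String getMediaType() { return "application/json"; }
        @Override public Boolean isHarvestable() { return false; }
        @Override public Boolean isAvailableToUsers() { return true; }

        @Override
        public void exportDataset(ExportDataProvider dataProvider, OutputStream outputStream)
                throws ExportException {
            try {
                // Pass the JSON through unchanged; a real exporter would transform it.
                outputStream.write(dataProvider.getDatasetJson().getBytes(StandardCharsets.UTF_8));
                outputStream.flush();
            } catch (IOException e) {
                throw new ExportException("Could not write export", e);
            }
        }
    }

    /** Runs the exporter against a hard-coded sample and returns what it wrote. */
    static String runExample() {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            new RawJsonExporter().exportDataset(() -> "{\"title\":\"Sample\"}", out);
        } catch (ExportException e) {
            throw new IllegalStateException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(runExample());
    }
}
```

The descriptive methods are kept separate from exportDataset so that the format's name, MIME type and availability can be reported without running an export.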

The ExportDataProvider interface provides several methods through which your Exporter can receive dataset and file metadata in various formats. Your exporter would parse the information in one or more of these inputs to retrieve the values needed to generate the Exporter’s output format.

The most important methods/input formats are:

getDatasetJson() - metadata in the internal Dataverse JSON format used in the native API and available via the built-in JSON metadata export.
getDatasetORE() - metadata in the OAI_ORE format available as a built-in metadata format and as used in Dataverse’s BagIT-based Archiving capability.
getDatasetFileDetails() - detailed file-level metadata for ingested tabular files.

The first two of these provide nearly complete metadata about the dataset along with the metadata common to all files. This includes all metadata entries from all metadata blocks, PIDs, tags, licenses and custom terms, etc. Almost all built-in exporters today use the JSON input. The newer OAI_ORE export, which is JSON-LD-based, provides a flatter structure and references metadata terms by their external vocabulary ids (e.g. http://purl.org/dc/terms/title), which may make it a preferable starting point in some cases.

The last method above provides a new JSON-formatted serialization of the variable-level file metadata Dataverse generates during ingest of tabular files. This information has only been included in the built-in DDI export, as the content of a dataDscr element. (Hence inspecting the edu.harvard.iq.dataverse.export.DDIExporter and related classes would be a good way to explore how the JSON is structured.)

The interface also provides:

getDatasetSchemaDotOrg()
getDataCiteXml()

These provide subsets of metadata in the indicated formats. They may be useful starting points if your exporter will, for example, only add one or two additional fields to the given format.

If an Exporter cannot create a requested metadata format for some reason, it should throw an io.gdcc.spi.export.ExportException.
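As a rough illustration of this failure path, the sketch below refuses to export when a required value is missing and wraps I/O problems in the same exception type. The nested ExportException is a simplified stand-in so that the sketch compiles on its own, and the exportTitle and tryExport helpers are hypothetical; a real Exporter would throw io.gdcc.spi.export.ExportException from its exportDataset method.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

class ExportFailureSketch {

    /** Simplified stand-in for io.gdcc.spi.export.ExportException. */
    static class ExportException extends Exception {
        ExportException(String message) { super(message); }
        ExportException(String message, Throwable cause) { super(message, cause); }
    }

    /** Hypothetical export step that needs a non-blank title to produce its format. */
    static void exportTitle(String title, OutputStream out) throws ExportException {
        if (title == null || title.trim().isEmpty()) {
            // The format cannot be created, so signal that instead of writing partial output.
            throw new ExportException("Cannot create format: dataset has no title");
        }
        try {
            out.write(("title=" + title).getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new ExportException("Cannot write export output", e);
        }
    }

    /** Returns the exported text, or a failure message if the export was refused. */
    static String tryExport(String title) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            exportTitle(title, out);
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (ExportException e) {
            return "FAILED: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryExport("Sample"));
        System.out.println(tryExport(""));
    }
}
```

Checking the inputs before writing anything avoids leaving a partially written document in the output stream when the export has to be abandoned.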

Building an Exporter

The examples at https://github.com/gdcc/exporter-croissant and https://github.com/gdcc/exporter-debug provide a Maven pom.xml file suitable for building an Exporter JAR file and those repositories provide additional development guidance.

There are four dependencies needed to build an Exporter:

io.gdcc dataverse-spi library containing the interfaces discussed above and the ExportException class
com.google.auto.service auto-service, which provides the @AutoService annotation
jakarta.json jakarta.json-api for JSON classes
jakarta.ws.rs jakarta.ws.rs-api, which provides a MediaType enumeration for specifying mime types.

Specifying a Prerequisite Export

An advanced feature of the Exporter mechanism allows a new Exporter to specify that it requires, as input, the output of another Exporter. An example of this is the built-in HTMLExporter, which requires the output of the DDI XML Exporter to produce an HTML document with the same DDI content.

This is configured by providing the metadata format name via the Exporter.getPrerequisiteFormatName() method. When this method returns a non-empty format name, Dataverse will provide the requested format to the Exporter via the ExportDataProvider.getPrerequisiteInputStream() method.

Developers and administrators deploying Exporters using this mechanism should be aware that, since metadata formats can be changed by other Exporters, the InputStream received may not hold the expected metadata. Developers should clearly document their compatibility with the built-in or third-party Exporters they support as prerequisites.
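The prerequisite mechanism, including a defensive check on the received stream, might look roughly like the sketch below. The nested types are simplified stand-ins so that the sketch compiles without the dataverse-spi JAR. The Optional return types, the "ddi" format name, the WrappingExporter class and the check for a codeBook element are all assumptions made for illustration, not a description of how the built-in HTMLExporter works.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

class PrerequisiteExporterSketch {

    /** Simplified stand-in for io.gdcc.spi.export.ExportException. */
    static class ExportException extends Exception {
        ExportException(String message) { super(message); }
        ExportException(String message, Throwable cause) { super(message, cause); }
    }

    /** Stand-in for ExportDataProvider; the Optional return type is an assumption. */
    interface ExportDataProvider {
        Optional<InputStream> getPrerequisiteInputStream();
    }

    /** Hypothetical exporter that wraps another exporter's output in a pre element. */
    static class WrappingExporter {
        // Names the format whose output this exporter needs as input (assumed name).
        Optional<String> getPrerequisiteFormatName() { return Optional.of("ddi"); }

        void exportDataset(ExportDataProvider dataProvider, OutputStream outputStream)
                throws ExportException {
            InputStream in = dataProvider.getPrerequisiteInputStream()
                    .orElseThrow(() -> new ExportException("Prerequisite format was not provided"));
            try {
                ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                byte[] chunk = new byte[4096];
                int n;
                while ((n = in.read(chunk)) != -1) {
                    buffer.write(chunk, 0, n);
                }
                String xml = new String(buffer.toByteArray(), StandardCharsets.UTF_8);
                // Another Exporter may have replaced the format, so sanity-check the input.
                if (!xml.contains("<codeBook")) {
                    throw new ExportException("Prerequisite input is not the expected XML");
                }
                String escaped = xml.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
                outputStream.write(("<pre>" + escaped + "</pre>").getBytes(StandardCharsets.UTF_8));
            } catch (IOException e) {
                throw new ExportException("Could not read or write export data", e);
            }
        }
    }

    /** Feeds the given text (or nothing, if null) to the exporter as its prerequisite input. */
    static String runExample(String prerequisiteContent) {
        ExportDataProvider provider = () -> prerequisiteContent == null
                ? Optional.<InputStream>empty()
                : Optional.<InputStream>of(new ByteArrayInputStream(
                        prerequisiteContent.getBytes(StandardCharsets.UTF_8)));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            new WrappingExporter().exportDataset(provider, out);
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (ExportException e) {
            return "FAILED: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(runExample("<codeBook>x</codeBook>"));
        System.out.println(runExample("<other/>"));
    }
}
```

Failing with an exception when the input does not look as expected is one way to keep a replaced prerequisite format from silently producing a misleading document.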
