Make Data Count

Make Data Count is a project to collect and standardize metrics on data use, especially views, downloads, and citations. The Dataverse Software can integrate Make Data Count to collect and display usage metrics including counts of dataset views, file downloads, and dataset citations.

Introduction

Make Data Count is part of the broader Research Data Alliance (RDA) Data Usage Metrics Working Group, which helped to produce a specification called the COUNTER Code of Practice for Research Data (PDF, HTML). The Dataverse Software makes every effort to comply with this specification. The Code of Practice (CoP) is built on top of existing standards such as COUNTER and SUSHI that come out of the article publishing world. The Make Data Count project has emphasized that it would like feedback on the Code of Practice. You can keep up to date on the Make Data Count project by subscribing to its newsletter.

Architecture

Dataverse installations that want support for Make Data Count must install Counter Processor, a Python project created by California Digital Library (CDL), which is part of the Make Data Count project and runs the software in production as part of its DASH data sharing platform.

The diagram below shows how Counter Processor interacts with your Dataverse installation and the DataCite hub, once configured. Dataverse installations using Handles rather than DOIs should note the limitations in the next section of this page.

[Diagram: makedatacount_components]

The most important takeaways from the diagram are:

  • Once enabled, your Dataverse installation will log activity (views and downloads) to a specialized date-stamped file.

  • You should run Counter Processor once a day to create reports in SUSHI (JSON) format that are saved to disk for your Dataverse installation to process and that are sent to the DataCite hub.

  • You should set up a cron job to have your Dataverse installation process the daily SUSHI reports, updating the Dataverse installation database with the latest metrics.

  • You should set up a cron job to have your Dataverse installation pull the latest list of citations for each dataset on a periodic basis, perhaps weekly or daily. These citations come from Crossref via the DataCite hub.

  • APIs are available in the Dataverse Software to retrieve Make Data Count metrics: views, downloads, and citations.

Limitations for Dataverse Installations Using Handles Rather Than DOIs

Make Data Count does not support data repositories that use Handles or other non-DOI identifiers. The Make Data Count project’s response on this topic appears in the notes following a July 2018 webinar. In short, the DataCite hub does not want to receive reports for non-DOI datasets. Additionally, citations are only available from the DataCite hub for datasets that have DOIs. See also the table below.

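|                 | DOIs                                    | Handles                  |
|-----------------|-----------------------------------------|--------------------------|
| Out of the box  | Classic download counts                 | Classic download counts  |
| Make Data Count | MDC views, MDC downloads, MDC citations | MDC views, MDC downloads |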

That said, a Dataverse installation that uses Handles can still log usage and process those logs with Counter Processor to create JSON that details usage at the dataset level. The installation can then ingest this locally generated JSON.

When editing the counter-processor-config.yaml file mentioned below, make sure that the upload_to_hub boolean is set to False.
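A quick check before each run can catch an accidental upload setting. The sketch below is only illustrative: it assumes that upload_to_hub appears on its own line as "upload_to_hub: False" in the config file, and the function name upload_disabled is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: succeed only if the config file sets upload_to_hub to False.
# Assumption: the key sits on its own line, optionally followed by a comment.
upload_disabled() {
  local config_file="$1"
  grep -Eq '^[[:space:]]*upload_to_hub:[[:space:]]*False[[:space:]]*(#.*)?$' "$config_file"
}

# Example (commented out so the snippet has no side effects):
# upload_disabled counter-processor-config.yaml || echo "upload_to_hub is not False" >&2
```

The function only reads the file, so it is safe to call from a wrapper script before Counter Processor starts.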

Configuring Your Dataverse Installation for Make Data Count Views and Downloads

If you haven’t already, follow the steps for installing Counter Processor in the Prerequisites section of the Installation Guide.

Enable Logging for Make Data Count

To make your Dataverse installation log dataset usage (views and downloads) for Make Data Count, you must set the :MDCLogPath database setting. See :MDCLogPath for details.

If you wish to start logging in advance of setting up other components, or wish to log without displaying MDC metrics for any other reason, you can set the optional :DisplayMDCMetrics database setting to false. See :DisplayMDCMetrics for details.

After you have your first day of logs, you can process them the next day.
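Database settings are set through the admin settings API. As a minimal sketch, assuming the admin API is reachable at http://localhost:8080 (the address used in the curl examples on this page), :MDCLogPath could be set as shown below. SERVER_URL, DRY_RUN, and set_db_setting are illustrative names, not part of the Dataverse Software.

```shell
#!/usr/bin/env bash
# Assumption: the admin API is reachable locally; override SERVER_URL if not.
SERVER_URL="${SERVER_URL:-http://localhost:8080}"

# Set one database setting.
# If DRY_RUN is set to a non-empty value, the curl command is printed, not run.
set_db_setting() {
  local name="$1" value="$2"
  ${DRY_RUN:+echo} curl -X PUT -d "$value" "$SERVER_URL/api/admin/settings/$name"
}

# Example (the path is the one this page assumes for MDC logs):
# set_db_setting :MDCLogPath /usr/local/payara6/glassfish/domains/domain1/logs/mdc
```

Running it once with DRY_RUN=1 shows the exact request before anything is changed.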

Enable or Disable Display of Make Data Count Metrics

By default, when MDC logging is enabled (when :MDCLogPath is set), your Dataverse installation will display MDC metrics instead of its internal (legacy) metrics. You can avoid this (e.g. to collect MDC metrics for some period of time before starting to display them) by setting :DisplayMDCMetrics to false.

The following discussion assumes :MDCLogPath has been set to /usr/local/payara6/glassfish/domains/domain1/logs/mdc.

You can also decide to display MDC metrics along with Dataverse’s traditional download counts from the time before MDC was enabled. To do this, set :MDCStartDate to the date when you started MDC logging.
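The two display-related settings can be sketched as a single helper. This is an illustration only: it assumes the admin settings API at http://localhost:8080 and a start date written as YYYY-MM-DD. The date in the example is a placeholder, and configure_mdc_display, SERVER_URL, and DRY_RUN are hypothetical names.

```shell
#!/usr/bin/env bash
# Assumption: the admin API is reachable locally; override SERVER_URL if not.
SERVER_URL="${SERVER_URL:-http://localhost:8080}"

# Show or hide MDC metrics and, optionally, record when MDC logging began.
# If DRY_RUN is set to a non-empty value, commands are printed instead of run.
configure_mdc_display() {
  local display="$1" start_date="${2:-}"
  ${DRY_RUN:+echo} curl -X PUT -d "$display" "$SERVER_URL/api/admin/settings/:DisplayMDCMetrics"
  if [ -n "$start_date" ]; then
    # Assumption: the start date is given as YYYY-MM-DD.
    ${DRY_RUN:+echo} curl -X PUT -d "$start_date" "$SERVER_URL/api/admin/settings/:MDCStartDate"
  fi
}

# Examples (placeholder date):
# configure_mdc_display false              # log now, display later
# configure_mdc_display true 2019-10-01    # display MDC metrics alongside older counts
```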

Configure Counter Processor

  • First, become the “counter” Unix user.

    • sudo su - counter

  • Change to the directory where you installed Counter Processor.

    • cd /usr/local/counter-processor-0.1.04

  • Download counter-processor-config.yaml to /usr/local/counter-processor-0.1.04.

  • Edit the config file and pay particular attention to the FIXME lines.

    • vim counter-processor-config.yaml
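To review the lines that need attention without opening an editor, something like the following may help. It is a small sketch built on grep; the function name list_fixme_lines is hypothetical.

```shell
#!/usr/bin/env bash
# Print every line of the config that mentions FIXME, prefixed by its line number.
list_fixme_lines() {
  local config_file="$1"
  grep -n 'FIXME' "$config_file" || echo "no FIXME lines found in $config_file"
}

# Example:
# list_fixme_lines /usr/local/counter-processor-0.1.04/counter-processor-config.yaml
```

A FIXME marker may remain in a comment after the value next to it has been edited, so treat the output as a checklist rather than a pass/fail result.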

Populate Views and Downloads for the First Time

We will soon set up a cron job to run nightly, but we start with a single successful configuration and manual run of Counter Processor, followed by calls to your Dataverse installation’s APIs. (The scripts discussed in the next section automate the steps described here, including creating empty log files if you’re starting mid-month.)

  • Change to the directory where you installed Counter Processor.

    • cd /usr/local/counter-processor-0.1.04

  • If you are running Counter Processor for the first time in the middle of a month, you will need to create blank log files for the previous days. For example:

    • cd /usr/local/payara6/glassfish/domains/domain1/logs/mdc

    • touch counter_2019-02-01.log

    • ...

    • touch counter_2019-02-20.log

  • Run Counter Processor.

    • CONFIG_FILE=counter-processor-config.yaml python39 main.py

    • A JSON file in SUSHI format will be created in the directory you specified under “output_file” in the config file.

  • Populate views and downloads for your datasets based on the SUSHI JSON file. The “/tmp” directory is used in the example below.

    • curl -X POST "http://localhost:8080/api/admin/makeDataCount/addUsageMetricsFromSushiReport?reportOnDisk=/tmp/make-data-count-report.json"

  • Verify that views and downloads are available via API.

    • Now that views and downloads have been recorded in the Dataverse installation’s database, you should make sure you can retrieve them from a dataset or two. Use the Dataset Metrics endpoints in the Native API section of the API Guide.
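The blank log files in the steps above can be created in a loop rather than one touch at a time. The sketch below assumes the counter_YYYY-MM-DD.log naming shown in the example; the function name create_blank_logs is hypothetical.

```shell
#!/usr/bin/env bash
# Create empty counter_YYYY-MM-DD.log files for days 1..last_day of one month.
# Existing log files are left untouched.
create_blank_logs() {
  local log_dir="$1" year_month="$2" last_day="$3"
  local day=1 file
  while [ "$day" -le "$last_day" ]; do
    file=$(printf '%s/counter_%s-%02d.log' "$log_dir" "$year_month" "$day")
    [ -e "$file" ] || touch "$file"
    day=$((day + 1))
  done
}

# Example matching the steps above (first real log on 2019-02-21):
# create_blank_logs /usr/local/payara6/glassfish/domains/domain1/logs/mdc 2019-02 20
```

Skipping files that already exist keeps the helper safe to rerun.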

Populate Views and Downloads Nightly

Add a cron job that runs main.py to create the SUSHI JSON file and then calls the Dataverse Software API to process it.

The Dataverse Software provides two example scripts: counter_daily.sh runs the steps to process new accesses and uploads and update your Dataverse installation’s database, and counter_weekly.sh retrieves citations for all datasets from DataCite. These scripts should be configured for your environment and can be run manually or as cron jobs.
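For orientation, the two nightly steps can be sketched as follows. This is not the counter_daily.sh script shipped with the Dataverse Software. It is a simplified illustration that reuses the paths and URL from the manual run on this page, and it assumes the report path matches the output location set in the config file. COUNTER_DIR, REPORT, SERVER_URL, DRY_RUN, and mdc_daily are hypothetical names.

```shell
#!/usr/bin/env bash
# Assumptions: these defaults mirror the examples on this page.
COUNTER_DIR="${COUNTER_DIR:-/usr/local/counter-processor-0.1.04}"
REPORT="${REPORT:-/tmp/make-data-count-report.json}"
SERVER_URL="${SERVER_URL:-http://localhost:8080}"

# Run Counter Processor, then ask the Dataverse installation to ingest the report.
# If DRY_RUN is set to a non-empty value, commands are printed instead of run.
mdc_daily() {
  (
    ${DRY_RUN:+echo} cd "$COUNTER_DIR" &&
    ${DRY_RUN:+echo} env CONFIG_FILE=counter-processor-config.yaml python39 main.py
  ) &&
  ${DRY_RUN:+echo} curl -X POST "$SERVER_URL/api/admin/makeDataCount/addUsageMetricsFromSushiReport?reportOnDisk=$REPORT"
}

# Example: DRY_RUN=1 mdc_daily
```

The steps are chained with && so the report is only ingested if Counter Processor exits successfully.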

Sending Usage Metrics to the DataCite Hub

Once you are satisfied with your testing, you should contact support@datacite.org for your JSON Web Token and change “upload_to_hub” to “True” in the config file. The next time you run main.py the following metrics will be sent to the DataCite hub for each published dataset:

  • Views (“investigations” in COUNTER)

  • Downloads (“requests” in COUNTER)
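The config change can also be scripted. The sketch below assumes that upload_to_hub sits at the start of its own line in counter-processor-config.yaml; the function name enable_hub_upload is hypothetical. It keeps a .bak copy so the change is easy to revert.

```shell
#!/usr/bin/env bash
# Set upload_to_hub to True in place, keeping the original as <file>.bak.
# Assumption: the key starts at the beginning of its own line.
enable_hub_upload() {
  local config_file="$1"
  sed -i.bak 's/^upload_to_hub:.*/upload_to_hub: True/' "$config_file"
}

# Example:
# enable_hub_upload /usr/local/counter-processor-0.1.04/counter-processor-config.yaml
```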

Configuring Your Dataverse Installation for Make Data Count Citations

Please note: as explained in the note above about limitations, this feature is not available to Dataverse installations that use Handles.

To configure whether your Dataverse installation pulls citations from the test or the production DataCite server, see Legacy Single PID Provider: dataverse.pid.datacite.rest-api-url in the Installation Guide.

Please note that in the curl example, Bash environment variables are used with the idea that you can set a few environment variables and copy and paste the examples as is. For example, “$DOI” could become “doi:10.5072/FK2/BL2IBM” by issuing the following export command from Bash:

export DOI="doi:10.5072/FK2/BL2IBM"

To confirm that the environment variable was set properly, you can use echo like this:

echo $DOI

On some periodic basis (perhaps weekly) you should call the following curl command for each published dataset to update the list of citations that have been made for that dataset.

curl -X POST "http://localhost:8080/api/admin/makeDataCount/:persistentId/updateCitationsForDataset?persistentId=$DOI"

Citations will be retrieved for each published dataset and recorded in your Dataverse installation’s database.
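Calling the endpoint once per dataset lends itself to a loop. As a rough sketch, assuming you keep a plain-text file with one persistent ID per line, the update could look like this. The file and the names update_citations, SERVER_URL, and DRY_RUN are hypothetical.

```shell
#!/usr/bin/env bash
# Assumption: the admin API is reachable locally; override SERVER_URL if not.
SERVER_URL="${SERVER_URL:-http://localhost:8080}"

# Read one persistent ID per line (e.g. doi:10.5072/FK2/BL2IBM) and request
# a citation update for each. Blank lines are skipped.
# If DRY_RUN is set to a non-empty value, commands are printed instead of run.
update_citations() {
  local doi_file="$1" doi
  while IFS= read -r doi || [ -n "$doi" ]; do
    [ -n "$doi" ] || continue
    ${DRY_RUN:+echo} curl -X POST "$SERVER_URL/api/admin/makeDataCount/:persistentId/updateCitationsForDataset?persistentId=$doi"
  done < "$doi_file"
}

# Example: update_citations published-dois.txt
```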

For how to get the citations out of your Dataverse installation, see “Retrieving Citations for a Dataset” under Dataset Metrics in the Native API section of the API Guide.

Please note that while the Dataverse Software has a metadata field for “Related Dataset”, this information is not currently sent as a citation to Crossref.

Retrieving Make Data Count Metrics from the DataCite Hub

The following metrics can be downloaded directly from the DataCite hub (see https://support.datacite.org/docs/eventdata-guide) for datasets hosted by Dataverse installations that have been configured to send these metrics to the hub:

  • Total Views for a Dataset

  • Unique Views for a Dataset

  • Total Downloads for a Dataset

  • Unique Downloads for a Dataset

  • Citations for a Dataset (via Crossref)
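As a hedged illustration of querying the hub for one dataset, the helper below builds a request URL. It assumes that the hub exposes events under an /events path filtered by a doi query parameter, and that the DOI is passed without the "doi:" prefix. Both assumptions should be checked against the Event Data guide linked above. The function name datacite_events_url is hypothetical.

```shell
#!/usr/bin/env bash
# Build an Event Data query URL for one DOI, stripping a leading "doi:".
# Assumption: events are served at /events and filtered with ?doi=...
datacite_events_url() {
  local pid="$1"
  printf 'https://api.datacite.org/events?doi=%s\n' "${pid#doi:}"
}

# Example:
# curl -s "$(datacite_events_url "$DOI")"
```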

Retrieving Make Data Count Metrics from a Dataverse Installation

The Dataverse Software API endpoints for retrieving Make Data Count metrics are described under Dataset Metrics in the Native API section of the API Guide.
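As a quick spot check, the metrics for one dataset can be requested in a loop. The sketch below assumes the endpoint pattern and metric names (viewsTotal, viewsUnique, downloadsTotal, downloadsUnique, citations) as they are understood from the API Guide, so please confirm them there. show_mdc_metrics, SERVER_URL, and DRY_RUN are hypothetical names.

```shell
#!/usr/bin/env bash
# Assumption: the API is reachable locally; override SERVER_URL if not.
SERVER_URL="${SERVER_URL:-http://localhost:8080}"

# Request each Make Data Count metric for one dataset.
# If DRY_RUN is set to a non-empty value, commands are printed instead of run.
show_mdc_metrics() {
  local pid="$1" metric
  for metric in viewsTotal viewsUnique downloadsTotal downloadsUnique citations; do
    ${DRY_RUN:+echo} curl -s "$SERVER_URL/api/datasets/:persistentId/makeDataCount/$metric?persistentId=$pid"
  done
}

# Example:
# export DOI="doi:10.5072/FK2/BL2IBM"
# show_mdc_metrics "$DOI"
```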

Please note that it is also possible to retrieve metrics from the DataCite hub itself via https://api.datacite.org