Make Data Count

Support for Make Data Count is a feature of the Dataverse Software that is described in the Make Data Count section of the Admin Guide. In order for developers to work on the feature, they must install Counter Processor, a Python 3 application, as described below. Counter Processor can be found at https://github.com/gdcc/counter-processor

Architecture

There are many components involved in the Dataverse Software’s architecture for Make Data Count as shown in the diagram below.

makedatacount_components

Dev Environment Setup for Make Data Count

Generate Fake Metrics Only

The quickest way to put populate tables with some data is to run the MakeDataCountApiIT integration test. For more on running integration tests see the Testing section. This will populate views and downloads.

To insert a citation you could insert a row like below, changing “72” in the example below with the dataset id for your dataset.

INSERT INTO datasetexternalcitations (id, citedbyurl, dataset_id) VALUES (1, 'https://doi.org/10.1007/s12520-017-0578-2', 72);

Full Setup

The recommended way to work on the Make Data Count feature is to spin up an EC2 instance that has both the Dataverse Software and Counter Processor installed. Go to the Deployment page for details on how to spin up an EC2 instance and make sure that your Ansible file is configured to install Counter Processor before running the “create” script.

After you have spun to your EC2 instance, set :MDCLogPath so that the Dataverse installation creates a log for Counter Processor to operate on. For more on this database setting, see the Configuration section of the Installation Guide.

Next you need to have the Dataverse installation add some entries to the log that Counter Processor will operate on. To do this, click on some published datasets and download some files.

Next you should run Counter Processor to convert the log into a SUSHI report, which is in JSON format. Before running Counter Processor, you need to put a configuration file into place. As a starting point use counter-processor-config.yaml and edit the file, paying particular attention to the following settings:

log_name_pattern You might want something like /usr/local/payara7/glassfish/domains/domain1/logs/counter_(yyyy-mm-dd).log
year_month You should probably set this to the current month.
output_file This needs to be a directory that the “dataverse” Unix user can read but that the “counter” user can write to. In dev, you can probably get away with “/tmp” as the directory.
platform Out of the box from Counter Processor this is set to Dash but this should be changed to match the name of your Dataverse installation. Examples are “Harvard Dataverse Repository” for Harvard University or “LibraData” for the University of Virginia.
upload_to_hub This should be “False” unless you are testing sending SUSHI reports to the DataCite hub.
simulate_date You should probably set this to tomorrow.

Once you are done with your configuration, you can run Counter Processor like this:

sudo -i

su - counter

cd /usr/local/counter-processor-1.06

CONFIG_FILE=counter-processor-config.yaml python39 main.py

(Please note that the Counter Processor README says you can also pass in values like START_DATE, END_DATE etc. at the command line if you find this to be more convenient.)

After main.py has finished, you should see output that a JSON file has been placed in the directory you specified in output_file. Next, pass this JSON file to curl like this, substituting the DOI of the dataset you’re testing with:

curl -X POST 'http://localhost:8080/api/admin/makeDataCount/:persistentId/addUsageMetricsFromSushiReport?reportOnDisk=/tmp/sushi_sample_logs.json&persistentId=doi:10.5072/FK2/BL2IBM

(Note that in production the persistentId of a dataset is not passed. Instead the PIDs for the datasets are pulled out of the JSON file.)

Assuming the curl command above worked, you should be able to retrieve the views and downloads via API like this, substituting the metric you’re interested in (viewsTotal, viewsUnique, downloadsTotal, downloadsUnique) and the DOI of the dataset you’re testing with:

curl -X POST 'http://localhost:8080/api/datasets/:persistentId/makeDataCount/viewsTotal/2019-01&persistentId=doi:10.5072/FK2/BL2IBM

If all this is working and you want to send data to the test instance of the Datacite hub, change upload_to_hub to “True” and contact support@datacite.org to get a JSON Web Token (JWT) to test with. Counter Processor should send the SUSHI reports for you but if you need to troubleshoot sending the reports manually, you can try the following curl command, substituting your JWT:

curl --header "Content-Type: application/json; Accept: application/json" -H "Authorization: Bearer $JSON_WEB_TOKEN" -X POST https://api.test.datacite.org/reports/ -d @sushi_report.json

For how to put citations into your dev database and how to get them out again, see Configuring Your Dataverse Installation for Make Data Count Citations section in Make Data Count of the Admin Guide.

Testing Make Data Count and Your Dataverse Installation

A developer running Counter Processor alongside the Dataverse installation for development or testing purposes will notice that once the raw Dataverse installation logs have been processed, there is no straightforward way to re-test those same logs.

The first thing to fix is to clear two files from Counter Processor state folder, statefile.json and counter_db_[yyyy-mm].sqlite3

Second, if you are also sending your SUSHI report to Make Data Count, you will notice that re-running Counter Processor will not update the information logged on their servers. This is due to us clearing the state of Counter Processor, which in turn causes Counter Processor to send incorrectly structured requests to Make Data Count. The easiest way to resolve this issue is to DELETE the record Counter Processor has created on Make Data Count:

curl -H "Authorization: Bearer $JSON_WEB_TOKEN" -X DELETE https://$MDC_SERVER/reports/$REPORT_ID

To get the REPORT_ID, look at the logs generated in /usr/local/counter-processor-1.06/tmp/datacite_response_body.txt

To read more about the Make Data Count api, see https://github.com/datacite/sashimi

You can compare the MDC metrics display with the Dataverse installation’s original by toggling the :DisplayMDCMetrics setting (true by default to display MDC metrics).

Processing Archived Logs

A new script (release date TBD) will be available for processing archived Dataverse log files. Monthly logs that are zipped, TARed, and copied to an archive can be processed by this script running nightly or weekly.

The script will keep track of the state of each tar file they are processed and will make use of the following “processingState” API endpoints, which allow the state of each file to be checked or modified.

The possible states are new, done, skip, processing, and failed.

Setting the state to “skip” will prevent the file from being processed if the developer needs to analyze the contents.

“failed” files will be re-tried in a later run.

“done” files are successful and will be ignored going forward.

The files currently being processed will have the state “processing”.

The script will process the newest set of log files (merging files from multiple nodes) and call Counter Processor.

APIs to manage the states include GET, POST, and DELETE (for testing), as shown below.

Note: yearMonth must be in the format yyyymm or yyyymmdd. Note: If running the new script on multiple servers add the query parameter &server=serverName on the first POST call. The server name can not be changed once set. To clear the name out you must delete the state and post a new one.

curl -X GET http://localhost:8080/api/admin/makeDataCount/{yearMonth}/processingState

curl -X POST http://localhost:8080/api/admin/makeDataCount/{yearMonth}/processingState?state=processing&server=server1 curl -X POST http://localhost:8080/api/admin/makeDataCount/{yearMonth}/processingState?state=done

curl -X DELETE http://localhost:8080/api/admin/makeDataCount/{yearMonth}/processingState

Resources

The original issue for adding Make Data Count support is https://github.com/IQSS/dataverse/issues/4821

«Previous Next»