Big data support is highly experimental. Eventually this content will move to the Installation Guide.
Various components need to be installed and/or configured for big data support.
A lightweight option for supporting file sizes beyond a few gigabytes - a size that can cause performance issues when uploaded through the Dataverse server itself - is to configure an S3 store to provide direct upload and download via ‘pre-signed URLs’. When these options are configured, file uploads and downloads are made directly to and from a configured S3 store using secure (https) connections that enforce Dataverse’s access controls. (The upload and download URLs are signed with a unique key that only allows access for a short time period and Dataverse will only generate such a URL if the user has permission to upload/download the specific file in question.)
This option can handle files >40GB and could be appropriate for files up to a TB. Other options can scale farther, but this option has the advantages that it is simple to configure and does not require any user training - uploads and downloads are done via the same interface as normal uploads to Dataverse.
To configure these options, an administrator must set two JVM options for the Dataverse server using the same process as for other configuration options:
./asadmin create-jvm-options "-Ddataverse.files.<id>.download-redirect=true"
./asadmin create-jvm-options "-Ddataverse.files.<id>.upload-redirect=true"
With multiple stores configured, it is possible to configure one S3 store with direct upload and/or download to support large files (in general or for specific dataverses) while configuring only direct download, or no direct access at all, for another store.
It is also possible to set file upload size limits per store. See the :MaxFileUploadSizeInBytes setting described in the Configuration guide.
At present, one potential drawback for direct-upload is that files are only partially ‘ingested’: tabular and FITS files are processed, but zip files are not unzipped, and the file contents are not inspected to evaluate their mimetype. This could be appropriate for large files, or it may be useful to completely turn off ingest processing for performance reasons (ingest processing requires a copy of the file to be retrieved by Dataverse from the S3 store). A store using direct upload can be configured to disable all ingest processing for files above a given size limit:
./asadmin create-jvm-options "-Ddataverse.files.<id>.ingestsizelimit=<size in bytes>"
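Because the limit is given in bytes, it is easy to mistype for multi-gigabyte values. The following is a minimal sketch of a helper that builds the option string from a limit in GiB. The function name, the store id s3, and the 2 GiB limit are illustrative assumptions, and the sketch only prints the asadmin command rather than running it.

```shell
#!/usr/bin/env bash
# Sketch (hypothetical helper): print the ingestsizelimit JVM option for a store.
# Assumes the limit is given as a whole number of GiB (1 GiB = 1024^3 bytes).
ingest_limit_option() {
  local store_id="$1" limit_gib="$2"
  local bytes=$(( limit_gib * 1024 * 1024 * 1024 ))
  # "--" stops printf from treating the leading "-D" as an option.
  printf -- '-Ddataverse.files.%s.ingestsizelimit=%d\n' "$store_id" "$bytes"
}

# Dry run: show the command for a store with id "s3" and a 2 GiB limit.
echo ./asadmin create-jvm-options "\"$(ingest_limit_option s3 2)\""
# prints: ./asadmin create-jvm-options "-Ddataverse.files.s3.ingestsizelimit=2147483648"
```

Printing the command first makes it possible to review the computed byte count before changing the server configuration.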
IMPORTANT: One additional step that is required to enable direct download to work with previewers is to allow cross-origin (CORS) requests on your S3 store. The example below shows how to enable the minimum needed CORS rules on a bucket using the AWS CLI command line tool. Note that you may need to add more methods and/or locations, if you also need to support certain previewers and external tools.
aws s3api put-bucket-cors --bucket <BUCKET_NAME> --cors-configuration file://cors.json
with the contents of the file cors.json as follows:
{
  "CORSRules": [
    {
      "AllowedOrigins": ["https://<DATAVERSE SERVER>"],
      "AllowedHeaders": ["*"],
      "AllowedMethods": ["PUT", "GET"]
    }
  ]
}
Alternatively, you can enable CORS using the AWS S3 web interface, using json-encoded rules as in the example above.
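As a sketch of how the CORS file could be produced for a particular installation, the helper below writes the same minimal rules for a given origin. The function name cors_json and the host dataverse.example.edu are hypothetical, and the origin is assumed to contain no characters that need JSON escaping.

```shell
#!/usr/bin/env bash
# Sketch (hypothetical helper): emit the minimal CORS rules for one origin.
cors_json() {
  local origin="$1"
  cat <<EOF
{
  "CORSRules": [
    {
      "AllowedOrigins": ["$origin"],
      "AllowedHeaders": ["*"],
      "AllowedMethods": ["PUT", "GET"]
    }
  ]
}
EOF
}

# Usage (commented out because it needs AWS credentials and a real bucket):
# cors_json "https://dataverse.example.edu" > cors.json
# aws s3api put-bucket-cors --bucket <BUCKET_NAME> --cors-configuration file://cors.json
# aws s3api get-bucket-cors --bucket <BUCKET_NAME>   # read the rules back to confirm
```

Reading the rules back with get-bucket-cors is a quick way to confirm that the bucket accepted the configuration.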
Since the direct upload mechanism creates the final file rather than an intermediate temporary file, user actions, such as neither saving nor canceling an upload session before closing the browser page, can leave an abandoned file in the store. The direct upload mechanism attempts to use S3 Tags to aid in identifying/removing such files. Upon upload, files are given a “dv-status”:“temp” tag which is removed when the dataset changes are saved and the new file(s) are added in Dataverse. Note that not all S3 implementations support Tags: Minio does not. With such stores, direct upload works, but Tags are not used.
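One way an administrator might look for such files is to list the objects that still carry the temporary tag. The sketch below uses the AWS CLI for this. The function name is hypothetical, the AWS CLI is assumed to be installed and configured for the store, and keys are assumed to contain no tabs or newlines. It only lists keys: an upload that is still in progress presumably carries the same tag, so the list should be reviewed before anything is deleted.

```shell
#!/usr/bin/env bash
# Sketch (hypothetical helper): print keys in a bucket tagged dv-status=temp.
list_temp_objects() {
  local bucket="$1" key status
  aws s3api list-objects-v2 --bucket "$bucket" \
      --query 'Contents[].Key' --output text |
    tr '\t' '\n' |
    while IFS= read -r key; do
      # An empty bucket yields "None" in text output; skip it and blank lines.
      if [ -z "$key" ] || [ "$key" = "None" ]; then continue; fi
      # Look up the value of the dv-status tag for this object (empty if unset).
      status=$(aws s3api get-object-tagging --bucket "$bucket" --key "$key" \
          --query "TagSet[?Key=='dv-status'].Value" --output text </dev/null)
      if [ "$status" = "temp" ]; then printf '%s\n' "$key"; fi
    done
}

# Usage: list_temp_objects <BUCKET_NAME>
```

The sketch makes one tagging request per object, so on large buckets it may be slow and is better run occasionally than routinely.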
Data Capture Module (DCM) is an experimental component that allows users to upload large datasets via rsync over ssh.
Installation instructions can be found at https://github.com/sbgrid/data-capture-module/blob/master/doc/installation.md. Note that shared storage (posix or AWS S3) between Dataverse and your DCM is required. You cannot use a DCM with Swift at this point in time.
Once you have installed a DCM, you will need to configure two database settings on the Dataverse side. These settings are documented in the Configuration section of the Installation Guide:
- :DataCaptureModuleUrl should be set to the URL of a DCM you installed.
- :UploadMethods should include dcm/rsync+ssh.
This will allow your Dataverse installation to communicate with your DCM, so that Dataverse can download rsync scripts for your users.
The rsync script can be downloaded from Dataverse via API using an authorized API token. In the curl example below, substitute $PERSISTENT_ID with a DOI or Handle:
curl -H "X-Dataverse-key: $API_TOKEN" "$DV_BASE_URL/api/datasets/:persistentId/dataCaptureModule/rsync?persistentId=$PERSISTENT_ID"
Once the user uploads files to a DCM, that DCM will perform checksum validation and report to Dataverse the results of that validation. The DCM must be configured to pass the API token of a superuser. The implementation details, which are subject to change, are below.
The JSON that a DCM sends to Dataverse on successful checksum validation looks something like the contents of checksumValidationSuccess.json below:
{
  "status": "validation passed",
  "uploadFolder": "OS7O8Y",
  "totalSize": 72
}
- status: The valid strings to send are validation passed and validation failed.
- uploadFolder: This is the directory on disk where Dataverse should attempt to find the files that a DCM has moved into place. There should always be a files.sha file and at least one data file. files.sha is a manifest of all the data files and their checksums. The uploadFolder directory is inside the directory where data is stored for the dataset and may have the same name as the “identifier” of the persistent id (DOI or Handle). For example, you would send "uploadFolder": "DNXV2H" in the JSON file when the absolute path to this directory is /usr/local/glassfish4/glassfish/domains/domain1/files/10.5072/FK2/DNXV2H/DNXV2H.
- totalSize: Dataverse will use this value to represent the total size in bytes of all the files in the “package” that’s created. If 360 data files and one files.sha manifest file are in the uploadFolder, this value is the sum of the sizes of the 360 data files.
Here’s the syntax for sending the JSON.
curl -H "X-Dataverse-key: $API_TOKEN" -X POST -H 'Content-type: application/json' --upload-file checksumValidationSuccess.json "$DV_BASE_URL/api/datasets/:persistentId/dataCaptureModule/checksumValidation?persistentId=$PERSISTENT_ID"
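To illustrate how the three fields relate to the files on disk, here is a minimal sketch that builds the success message for one upload folder. It is not the DCM's actual implementation. The function name is hypothetical, and the sketch assumes a flat folder (no subdirectories) whose name needs no JSON escaping.

```shell
#!/usr/bin/env bash
# Sketch (hypothetical helper): build the checksum-validation JSON for a folder.
build_validation_json() {
  local upload_dir="$1" total=0 f
  for f in "$upload_dir"/*; do
    # Count regular files only, and skip the files.sha manifest itself.
    if [ -f "$f" ] && [ "$(basename "$f")" != "files.sha" ]; then
      total=$(( total + $(wc -c < "$f") ))
    fi
  done
  # uploadFolder is the last path component; totalSize is in bytes.
  printf '{"status": "validation passed", "uploadFolder": "%s", "totalSize": %d}\n' \
    "$(basename "$upload_dir")" "$total"
}

# Usage:
# build_validation_json /path/to/files/10.5072/FK2/DNXV2H/DNXV2H > checksumValidationSuccess.json
```

Excluding files.sha from the sum matches the description of totalSize as the total size of the data files only.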
See instructions at https://github.com/sbgrid/data-capture-module/blob/master/doc/mock.md
Add Dataverse settings to use mock (same as using DCM, noted above):
curl http://localhost:8080/api/admin/settings/:DataCaptureModuleUrl -X PUT -d "http://localhost:5000"
curl http://localhost:8080/api/admin/settings/:UploadMethods -X PUT -d "dcm/rsync+ssh"
At this point you should be able to download a placeholder rsync script. Dataverse is then waiting for news from the DCM about whether checksum validation has succeeded. First, you have to put files in place, which is usually the job of the DCM. You should substitute “X1METO” for the “identifier” of the dataset you create. You must also use the proper path for where you store files in your dev environment.
mkdir /usr/local/glassfish4/glassfish/domains/domain1/files/10.5072/FK2/X1METO
mkdir /usr/local/glassfish4/glassfish/domains/domain1/files/10.5072/FK2/X1METO/X1METO
cd /usr/local/glassfish4/glassfish/domains/domain1/files/10.5072/FK2/X1METO/X1METO
echo "hello" > file1.txt
shasum file1.txt > files.sha
Now the files are in place and you need to send JSON to Dataverse with a success or failure message as described above. Make a copy of doc/sphinx-guides/source/_static/installation/files/root/big-data-support/checksumValidationSuccess.json and put the identifier (such as “X1METO”) in place under “uploadFolder”. Then use curl as described above to send the JSON.
The following low level command should only be used when troubleshooting the “import” code a DCM uses but is documented here for completeness.
curl -H "X-Dataverse-key: $API_TOKEN" -X POST "$DV_BASE_URL/api/batch/jobs/import/datasets/files/$DATASET_DB_ID?uploadFolder=$UPLOAD_FOLDER&totalSize=$TOTAL_SIZE"
If you need a fully operating DCM client for development purposes, these steps will guide you to setting one up. This includes steps to set up the DCM on S3 variant.
See https://github.com/IQSS/dataverse/blob/develop/conf/docker-dcm/readme.md
Before: the default bucket for DCM to hold files in S3 is named test-dcm. It is coded into post_upload_s3.bash (line 30). Change to a different bucket if needed.
Also note: with the new support for multiple file stores in Dataverse, DCM requires a store with id="s3" and will only work with this store.
Add AWS bucket info to dcmsrv
- Add AWS credentials to ~/.aws/credentials
[default]
aws_access_key_id =
aws_secret_access_key =
Dataverse configuration (on dvsrv):
Set S3 as the storage driver
cd /opt/glassfish4/bin/
./asadmin delete-jvm-options "\-Ddataverse.files.storage-driver-id=file"
./asadmin create-jvm-options "\-Ddataverse.files.storage-driver-id=s3"
./asadmin create-jvm-options "\-Ddataverse.files.s3.type=s3"
./asadmin create-jvm-options "\-Ddataverse.files.s3.label=s3"
Add AWS bucket info to Dataverse
- Add AWS credentials to ~/.aws/credentials
[default]
aws_access_key_id =
aws_secret_access_key =
- Set the region in ~/.aws/config to create a region file. Add these contents:
[default]
region = us-east-1
Add the S3 bucket names to Dataverse
/usr/local/glassfish4/glassfish/bin/asadmin create-jvm-options "-Ddataverse.files.s3.bucket-name=iqsstestdcmbucket"
/usr/local/glassfish4/glassfish/bin/asadmin create-jvm-options "-Ddataverse.files.dcm-s3-bucket-name=test-dcm"
Set download method to be HTTP, as DCM downloads through S3 are over this protocol:
curl -X PUT "http://localhost:8080/api/admin/settings/:DownloadMethods" -d "native/http"
For using these commands, you will need to connect to the shell prompt inside various containers (e.g. docker exec -it dvsrv /bin/bash).
- Connect to the client container: docker exec -it dcm_client bash
- Create a dataset: cd /mnt ; ./create.bash ; this will echo the database ID to stdout.
- Download the transfer script: ./get_transfer.bash $database_id_from_create_script
- Execute the transfer script: bash ./upload-${database_id_from_create_script}.bash , and follow the instructions from the script. For example, bash ./upload-3.bash (3 being the database id from earlier commands in this example).
- Run the post-upload script on dcmsrv for the posix implementation: docker exec -it dcmsrv /opt/dcm/scn/post_upload.bash
- Run the post-upload script on dcmsrv for the S3 implementation: docker exec -it dcmsrv /opt/dcm/scn/post_upload_s3.bash
docker-compose -f docker-compose.yml down -v
tail -n 2000 -f /opt/glassfish4/glassfish/domains/domain1/logs/server.log
tail -n 2000 -f /var/log/lighttpd/breakage.log
tail -n 2000 -f /var/log/lighttpd/access.log
See https://github.com/IQSS/dataverse/blob/develop/conf/docker-dcm/readme.md
cd /mnt; ./publish_major.bash ${database_id}
docker exec -it rsalsrv /opt/rsal/scn/pub.py
Info for configuring the RSAL Mock: https://github.com/sbgrid/rsal/tree/master/mocks
Also, to configure Dataverse to use the new workflow you must do the following (see also the Workflows section):
curl -X PUT -d 'http://<myipaddr>:5050' http://localhost:8080/api/admin/settings/:RepositoryStorageAbstractionLayerUrl
Edit internal-httpSR-workflow.json and replace url and rollbackUrl to be the url of your RSAL mock.
curl http://localhost:8080/api/admin/workflows -X POST --data-binary @internal-httpSR-workflow.json -H "Content-type: application/json"
curl http://localhost:8080/api/admin/workflows
curl http://localhost:8080/api/admin/workflows/default/PrePublishDataset -X PUT -d 2
curl http://localhost:8080/api/admin/workflows/default/PrePublishDataset
curl -X DELETE http://localhost:8080/api/admin/workflows/default/PrePublishDataset
In order to see the rsync URLs, you must run this command:
curl -X PUT -d 'rsal/rsync' http://localhost:8080/api/admin/settings/:DownloadMethods
To specify replication sites that appear in rsync URLs:
Download add-storage-site.json and adjust it to meet your needs. The file should look something like this:
{
  "hostname": "dataverse.librascholar.edu",
  "name": "LibraScholar, USA",
  "primaryStorage": true,
  "transferProtocols": "rsync,posix,globus"
}
Then add the storage site using curl:
curl -H "Content-type:application/json" -X POST http://localhost:8080/api/admin/storageSites --upload-file add-storage-site.json
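If several storage sites need to be registered, the JSON can be generated instead of edited by hand. The sketch below is one possible approach. The function name is hypothetical, values are assumed to need no JSON escaping, and primaryStorage is checked so that it is emitted as a JSON boolean.

```shell
#!/usr/bin/env bash
# Sketch (hypothetical helper): emit storage-site JSON from four arguments.
storage_site_json() {
  local hostname="$1" name="$2" primary="$3" protocols="$4"
  # primaryStorage must be an unquoted JSON boolean.
  case "$primary" in
    true|false) ;;
    *) echo "primaryStorage must be true or false" >&2; return 1 ;;
  esac
  printf '{"hostname": "%s", "name": "%s", "primaryStorage": %s, "transferProtocols": "%s"}\n' \
    "$hostname" "$name" "$primary" "$protocols"
}

# Usage:
# storage_site_json dataverse.librascholar.edu "LibraScholar, USA" true "rsync,posix,globus" > add-storage-site.json
```

The generated file can then be posted with the curl command shown above.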
You can make a storage site the primary site by passing “true”, or pass “false” to make it not the primary site (id “1” in the example):
curl -X PUT -d true http://localhost:8080/api/admin/storageSites/1/primaryStorage
You can delete a storage site like this (id “1” in the example):
curl -X DELETE http://localhost:8080/api/admin/storageSites/1
You can view a single storage site like this (id “1” in the example):
curl http://localhost:8080/api/admin/storageSites/1
You can view all storage sites like this:
curl http://localhost:8080/api/admin/storageSites
In the GUI, this is called “Local Access”. It’s where you can compute on files on your cluster.
curl http://localhost:8080/api/admin/settings/:LocalDataAccessPath -X PUT -d "/programs/datagrid"