Developers often only deploy the Dataverse Software to their Development Environment but it can be useful to deploy the Dataverse Software to cloud services such as Amazon Web Services (AWS).
We have written scripts to deploy the Dataverse Software to Amazon Web Services (AWS) but they require some setup.
First, you need to have AWS Command Line Interface (AWS CLI) installed, which is called aws in your terminal. To find out whether it is installed, launch your terminal and run aws --version, which prints the version of AWS CLI.
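The installation check can be sketched as follows. This is a minimal example, not part of the Dataverse scripts: the function name check_aws_cli is made up for illustration, and it only relies on the standard command -v shell builtin and on aws --version.

```shell
# Hypothetical helper: report the AWS CLI version, or say that aws is missing.
check_aws_cli() {
    if command -v aws >/dev/null 2>&1; then
        # Some AWS CLI releases print the version to stderr, so merge both streams.
        aws --version 2>&1
    else
        echo "aws: command not found on PATH"
    fi
}

check_aws_cli
```

If the second message appears, AWS CLI is either not installed or not on your $PATH.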
If you have not yet installed AWS CLI you should install it by following the instructions at https://docs.aws.amazon.com/cli/latest/userguide/installing.html
Afterwards, you should re-run the “version” command above to verify that AWS CLI has been properly installed. If “version” still doesn’t work, read on for troubleshooting advice. If “version” works, you can skip down to the “Configure AWS CLI” step.
Please note that as of this writing the AWS docs are not especially clear about how to fix errors such as aws: command not found. If the AWS CLI cannot be found after you followed the AWS installation docs, it is very likely that the aws program is not in your $PATH. $PATH is an “environment variable” that tells your shell (usually Bash) where to look for programs.
To see what $PATH is set to, run echo $PATH.
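Because $PATH is a single colon-separated string, it can be hard to read when it is long. The sketch below prints the raw value and then the same directories one per line, using only echo and tr.

```shell
# Print the raw value of PATH.
echo "$PATH"

# Print the same directories one per line; the shell searches them in this order.
echo "$PATH" | tr ':' '\n'
```

If the directory that contains the aws program is not in this list, the shell will not find it.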
On Mac, to update your $PATH to include the location where the current AWS docs install AWS CLI on the version of Python included with your Mac, run the following command:
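The exact directory depends on how AWS CLI was installed. As a sketch only, the example below assumes that a per-user install against the system Python 2.7 on a Mac placed aws in ~/Library/Python/2.7/bin. That directory and the variable name AWS_BIN_DIR are assumptions for illustration; check where aws actually lives on your machine and substitute that location.

```shell
# ASSUMPTION: aws was installed under the per-user Python 2.7 directory on macOS.
# Replace this value with the directory that actually contains the aws program.
AWS_BIN_DIR="$HOME/Library/Python/2.7/bin"

# Append the directory to PATH for the current terminal session only.
export PATH="$PATH:$AWS_BIN_DIR"
```

The change lasts only until the terminal is closed.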
After all this, you can try the “version” command again.
Note that it’s possible to add an export line like the one above to your ~/.bash_profile file so you don’t have to run it yourself when you open a new terminal.
Next you need to configure AWS CLI.
Create a .aws directory in your home directory (which is called ~) by running mkdir ~/.aws.
We will be creating two plain text files in the .aws directory and it is important that these files do not end in “.txt” or any other extension. After creating the files, you can verify their names by listing the directory with ls ~/.aws.
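Text editors sometimes append an extension silently, so it can help to have the shell point out offending names. The sketch below is one way to do that; the function name list_aws_files is hypothetical, and it simply treats any file name containing a dot as having an extension.

```shell
# Hypothetical helper: list the files in a directory (default: ~/.aws), one per line,
# and flag any name that carries an extension such as ".txt".
list_aws_files() {
    dir="${1:-$HOME/.aws}"
    for path in "$dir"/*; do
        # Skip the unexpanded pattern when the directory is empty or missing.
        [ -e "$path" ] || continue
        name="${path##*/}"
        case "$name" in
            *.*) echo "$name   <-- has an extension, rename it" ;;
            *)   echo "$name" ;;
        esac
    done
}

list_aws_files
```

Once both files have been created correctly, the output should be just the two names config and credentials.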
Create a plain text file at ~/.aws/config with the following content:

    [default]
    region = us-east-1
Please note that at this time the region must be set to “us-east-1” but in the future we could improve our scripts to support other regions.
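If you prefer to create the config file from the terminal rather than a text editor, which also avoids an accidental “.txt” extension, something like the following sketch could be used. The function name write_aws_config is made up for this example; it writes the same two lines shown above and refuses to overwrite a file that already exists.

```shell
# Hypothetical helper: write a config file that sets the region to us-east-1
# into a directory (default: ~/.aws), without overwriting an existing file.
write_aws_config() {
    dir="${1:-$HOME/.aws}"
    mkdir -p "$dir"
    if [ -e "$dir/config" ]; then
        echo "$dir/config already exists, leaving it alone" >&2
        return 1
    fi
    printf '[default]\nregion = us-east-1\n' > "$dir/config"
}
```

Running write_aws_config with no argument targets ~/.aws; passing a directory name writes the file there instead, which is handy for trying it out safely.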
Create a plain text file at ~/.aws/credentials with the following content:

    [default]
    aws_access_key_id = XXXXXXXXXXXXXXXXXXXX
    aws_secret_access_key = XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Then update the file and replace the values for “aws_access_key_id” and “aws_secret_access_key” with your actual credentials by following the instructions at https://aws.amazon.com/blogs/security/wheres-my-secret-access-key/
If you are having trouble configuring the files manually as described above, see https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html which documents the aws configure command.
In order to configure Dataverse installation settings such as the password of the dataverseAdmin user, download https://raw.githubusercontent.com/GlobalDataverseCommunityConsortium/dataverse-ansible/master/defaults/main.yml and edit the file to your liking.
You can skip this step if you’re fine with the values in the “main.yml” file in the link above.
Once you have done the configuration above, you are ready to try running the “ec2-create-instance.sh” script to spin up a Dataverse installation in AWS.
Download ec2-create-instance.sh and put it somewhere reasonable. For the purpose of these instructions we’ll assume it’s in the “Downloads” directory in your home directory.
To run it with default values you just need the script, but you may also want a current copy of the Ansible group vars file.
ec2-create-instance accepts a number of command-line switches, including:
-r: GitHub Repository URL (defaults to https://github.com/IQSS/dataverse.git)
-b: branch to build (defaults to develop)
-p: pemfile directory (defaults to $HOME)
-g: Ansible GroupVars file (if you wish to override role defaults)
-h: help (displays usage for each available option)
bash ~/Downloads/ec2-create-instance.sh -b develop -r https://github.com/scholarsportal/dataverse.git -g main.yml
You will need to wait for 15 minutes or so until the deployment is finished, longer if you’ve enabled sample data and/or the API test suite. Eventually, the output should tell you how to access the Dataverse installation in a web browser or via SSH. It will also provide instructions on how to delete the instance when you are finished with it. Please be aware that AWS charges per minute for a running instance. You may also delete your instance from https://console.aws.amazon.com/console/home?region=us-east-1.
Please note that while the script should work well on new-ish branches, older branches that have different dependencies such as an older version of Solr may not produce a working Dataverse installation. Your mileage may vary.
A number of pilot Dataverse installations start on local storage, then administrators are tasked with migrating datafiles into S3 or similar object stores. The files may be copied with a command-line utility such as s3cmd (https://s3tools.org/s3cmd). You will want to retain the local file hierarchy, keeping the authority (for example: 10.5072) at the bucket “root.”
The example queries below may assist with updating dataset and datafile locations in the Dataverse installation’s PostgreSQL database. Depending on the initial version of the Dataverse Software and subsequent upgrade path, Datafile storage identifiers may or may not include a file:// prefix, so you’ll want to catch both cases.
UPDATE dvobject SET storageidentifier=REPLACE(storageidentifier,'file://','s3://') WHERE dtype='Dataset';
UPDATE dvobject SET storageidentifier=REPLACE(storageidentifier,'file://','s3://your-s3-bucket:') WHERE id IN (SELECT o.id FROM dvobject o, dataset s WHERE o.dtype = 'DataFile' AND s.id = o.owner_id AND s.harvestingclient_id IS null AND o.storageidentifier NOT LIKE 's3://%');
UPDATE dvobject SET storageidentifier=CONCAT('s3://your-s3-bucket:', storageidentifier) WHERE id IN (SELECT o.id FROM dvobject o, dataset s WHERE o.dtype = 'DataFile' AND s.id = o.owner_id AND s.harvestingclient_id IS null AND o.storageidentifier NOT LIKE '%://%');