Dataverse is exploring the use of Docker, Kubernetes, OpenShift and other container-related technologies.
From the Dataverse perspective, we are in the business of providing a “template” for OpenShift that describes how the various components we build our application on (Glassfish, PostgreSQL, Solr, the Dataverse war file itself, etc.) work together. We publish Docker images to DockerHub at https://hub.docker.com/u/iqss/ that are used in this OpenShift template.
Dataverse’s (light) use of Docker is documented below in a separate section. We actually started with Docker in the context of OpenShift, which is why OpenShift is listed first but we can imagine rearranging this in the future.
The OpenShift template for Dataverse can be found at conf/openshift/openshift.json and, if you need to hack on the template or related files under conf/docker, it is recommended that you iterate on them using Minishift.
The instructions below will walk you through spinning up Dataverse within Minishift. It is recommended that you do this on the “develop” branch to make sure everything is working before changing anything.
Minishift requires a hypervisor and since we already use VirtualBox for Vagrant, you should install VirtualBox from http://virtualbox.org .
Download the Minishift tarball from https://docs.openshift.org/latest/minishift/getting-started/installing.html and put the minishift binary in /usr/local/bin or somewhere in your $PATH. This assumes Mac or Linux. These instructions were last tested on version v1.14.0+1ec5877 of Minishift.
At this point, you might want to consider going through the Minishift quickstart to get oriented: https://docs.openshift.org/latest/minishift/getting-started/quickstart.html
minishift start --vm-driver=virtualbox --memory=8GB
eval $(minishift oc-env)
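At this point it can help to confirm that the command-line tools used in this guide are actually on your $PATH. The following is only a minimal sketch and is not part of the Dataverse repo: the function name missing_tools is hypothetical, and the list of tools is simply the set of commands that appear in these instructions.

```shell
# Print the name of each given command that cannot be found on $PATH.
# missing_tools is a hypothetical helper, not something shipped with Dataverse.
missing_tools() {
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf '%s\n' "$tool"
    fi
  done
}

# Report which of the tools used in this guide are not installed.
# Prints nothing when everything is available.
missing_tools minishift oc docker jq curl
```

If oc is reported as missing, re-run the eval $(minishift oc-env) step above in the current shell.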
Note that if you just installed and started Minishift, you are probably logged in already. This oc login step is included in case you aren’t logged in anymore.
oc login --username developer --password=whatever
Use “developer” as the username and a couple of characters as the password.
Calling the project “project1” is fairly arbitrary. We’ll probably want to revisit this name in the future. A project is necessary in order to create an OpenShift app.
oc new-project project1
Note that oc projects will return a list of projects.
The following command operates on the conf/openshift/openshift.json file that resides in the main Dataverse git repo. It will download images from Docker Hub and use them to spin up Dataverse within Minishift/OpenShift. Later we will cover how to make changes to the images on Docker Hub.
oc new-app conf/openshift/openshift.json
After running the oc new-app command above, deployment of Dataverse within Minishift/OpenShift will begin. You should log into the OpenShift web interface to check on the status of the deployment. If you just created the Minishift VM with the minishift start command above, the oc new-app step is expected to take a while because the images need to be downloaded from Docker Hub. Also, the installation of Dataverse takes a while.
Typing minishift console should open the OpenShift web interface in your browser. The IP address might not be “192.168.99.100” but it’s used below as an example.
minishift console
In the OpenShift web interface you should see a link that looks something like http://dataverse-project1.192.168.99.100.nip.io but the IP address will vary and will match the output of minishift ip. Eventually, after deployment is complete, the Dataverse web interface will appear at this URL and you will be able to log in with the username “dataverseAdmin” and the password “admin”.
Another way to verify that Dataverse has been successfully deployed is to make sure that the Dataverse “info” API endpoint returns a version (note that minishift ip is used because the IP address will vary):
curl http://dataverse-project1.`minishift ip`.nip.io/api/info/version
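Because deployment takes a while, the check above usually has to be repeated by hand. The following is a minimal sketch of polling the same endpoint until it answers. The helper names dataverse_url and wait_for_dataverse, and the default of 60 attempts spaced 10 seconds apart, are assumptions made for illustration and are not part of the Dataverse repo.

```shell
# Build the route URL for a Dataverse deployed in a given project.
# dataverse_url and wait_for_dataverse are hypothetical helper names.
dataverse_url() {
  project=$1
  ip=$2
  printf 'http://dataverse-%s.%s.nip.io\n' "$project" "$ip"
}

# Poll the version endpoint until it answers or the attempts run out.
# Arguments: base URL, attempts (default 60), seconds between tries (default 10).
wait_for_dataverse() {
  url=$1/api/info/version
  attempts=${2:-60}
  delay=${3:-10}
  while [ "$attempts" -gt 0 ]; do
    # -f makes curl return non-zero on HTTP errors while Dataverse is starting.
    if curl -fsS "$url" 2>/dev/null; then
      return 0
    fi
    attempts=$((attempts - 1))
    sleep "$delay"
  done
  echo "Dataverse did not respond at $url" >&2
  return 1
}

# Usage, once Minishift is running:
#   wait_for_dataverse "$(dataverse_url project1 "$(minishift ip)")"
```

Building the URL in a separate function keeps the project name and IP address in one place if you later rename “project1”.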
From the perspective of OpenShift and the openshift.json config file, the HTTP link to Dataverse is called a route. See also documentation for oc expose.
Here are some tips on troubleshooting your deployment of Dataverse to Minishift.
oc status
Once images have been downloaded from Docker Hub, the output below will change from Pulling to Pulled.
oc get events | grep Pull
This is a deep dive:
oc get all
Logs are provided in the web interface for each of the deployment configurations. The URLs should be something like this (but the IP address will vary) and you should click “View Log”. The installation of Dataverse is done within the one Glassfish deployment configuration:
You can also see logs from each of the components (Glassfish, PostgreSQL, and Solr) from the command line with oc logs like this (just change the grep at the end):
oc logs $(oc get po -o json | jq '.items[] | select(.kind=="Pod").metadata.name' -r | grep glassfish)
You can get a shell on any of the containers for each of the components (Glassfish, PostgreSQL, and Solr) with oc rsh (just change the grep at the end):
oc rsh $(oc get po -o json | jq '.items[] | select(.kind=="Pod").metadata.name' -r | grep glassfish)
From the rsh prompt of the Glassfish container you could run something like the following to make sure that Dataverse is running on port 8080:
curl http://localhost:8080/api/info/version
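The oc logs and oc rsh commands above pass along whatever the grep matches, which is more than one pod name once a component has been scaled beyond a single replica. As a minimal sketch, the pod selection can be narrowed to the first match. The helper name first_matching_pod is hypothetical and is not part of the Dataverse repo.

```shell
# Read pod names on stdin (one per line) and print the first one that
# contains the given component name. first_matching_pod is a hypothetical helper.
first_matching_pod() {
  grep -- "$1" | head -n 1
}

# Usage with the same oc and jq pipeline as the commands above:
#   oc rsh "$(oc get po -o json \
#     | jq -r '.items[] | select(.kind=="Pod").metadata.name' \
#     | first_matching_pod glassfish)"
```

Taking the first match means you land on the lowest-listed pod, which is normally the “-0” pod when only one replica exists.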
If you simply wanted to try out Dataverse on Minishift and want to clean up, you can run oc delete project project1 to delete the project or minishift stop and minishift delete to delete the entire Minishift VM and all the Docker containers inside it.
If you’re interested in using Minishift for development and want to change the Dataverse code, you will need to get set up to create Docker images based on your changes and make them available within Minishift.
It is recommended to add experimental images to Minishift’s internal registry. Note that despite what https://docs.openshift.org/latest/minishift/openshift/openshift-docker-registry.html says you will not use docker push because we have seen “unauthorized: authentication required” when trying to push to it as reported at https://github.com/minishift/minishift/issues/817 . Rather you will run docker build and run docker images to see that your newly built images are listed in Minishift’s internal registry.
First, set the Docker environment variables so that docker build and docker images refer to the internal Minishift registry rather than your normal Docker setup:
eval $(minishift docker-env)
When you’re ready to build, change to the right directory:
cd conf/docker
And then run the build script in “internal” mode:
./build.sh internal
Note that conf/openshift/openshift.json must not have imagePullPolicy set to Always or it will pull from “iqss” on Docker Hub. Changing it to IfNotPresent allows Minishift to use the images shown from docker images rather than the ones on Docker Hub.
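Forgetting to change the pull policy is easy to miss, because the deployment still succeeds using the Docker Hub images. The following is a minimal sketch of a check you could run before oc new-app. The function name template_forces_pull is hypothetical, and the check is a plain text match rather than a real JSON parse.

```shell
# Succeed (exit 0) if the given template still contains
# "imagePullPolicy": "Always" anywhere, fail otherwise.
# template_forces_pull is a hypothetical helper; it matches text only.
template_forces_pull() {
  grep -Eq '"imagePullPolicy"[[:space:]]*:[[:space:]]*"Always"' "$1"
}

# Usage from the root of the Dataverse git repo:
#   if template_forces_pull conf/openshift/openshift.json; then
#     echo "openshift.json will pull from Docker Hub; use IfNotPresent" >&2
#   fi
```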
Using Minishift for day to day Dataverse development might be something we want to investigate in the future. These blog posts talk about developing Java applications using Minishift/OpenShift:
If you are interested in changing the OpenShift config file for Dataverse at conf/openshift/openshift.json, note that in many cases once you have Dataverse running in Minishift you can use oc process and oc apply like this (but please note that some errors and warnings are expected):
oc process -f conf/openshift/openshift.json | oc apply -f -
The slower way to iterate on the openshift.json file is to delete the project and re-create it.
You can access and modify the PostgreSQL database via an interactive terminal called psql.
To log in to psql from the command line of the Glassfish pod, type the following command:
PGPASSWORD=$POSTGRES_PASSWORD; export PGPASSWORD; /usr/bin/psql -h $POSTGRES_SERVER.$POSTGRES_SERVICE_HOST -U $POSTGRES_USER -d $POSTGRES_DATABASE
To log in as an admin, type this command instead:
PGPASSWORD=$POSTGRESQL_ADMIN_PASSWORD; export PGPASSWORD; /usr/bin/psql -h $POSTGRES_SERVER.$POSTGRES_SERVICE_HOST -U postgres -d $POSTGRES_DATABASE
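Both psql commands above depend on environment variables being present in the Glassfish pod. If psql fails to connect, a quick first step is to see which of those variables are unset. This is a minimal sketch; the function name missing_env is hypothetical, and the variable names are taken from the commands above.

```shell
# Print the name of each given environment variable that is not set.
# missing_env is a hypothetical helper, not part of the Dataverse images.
missing_env() {
  for name in "$@"; do
    if ! printenv "$name" >/dev/null 2>&1; then
      printf '%s\n' "$name"
    fi
  done
}

# Usage from the rsh prompt of the Glassfish pod:
#   missing_env POSTGRES_PASSWORD POSTGRES_SERVER POSTGRES_SERVICE_HOST \
#     POSTGRES_USER POSTGRES_DATABASE POSTGRESQL_ADMIN_PASSWORD
```

Note that printenv only sees exported variables, which is what psql and other child processes see as well.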
Glassfish, Solr and PostgreSQL Pods are in a “StatefulSet” which is a concept from OpenShift and Kubernetes that you can read about at https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/
As of this writing, the openshift.json file has a single “replica” for each of these stateful sets. It’s possible to increase the number of replicas from 1 to 3, for example, with this command:
oc scale statefulset/dataverse-glassfish --replicas=3
The command above should result in two additional Glassfish pods being spun up. The names of the pods are significant and there is special logic in the “zeroth” pod (“dataverse-glassfish-0” and “dataverse-postgresql-0”). For example, only “dataverse-glassfish-0” makes itself the dedicated timer server as explained in the Dataverse Application Timers section of the Admin Guide. “dataverse-glassfish-1” and other higher-numbered pods will not be configured as a timer server.
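The “zeroth pod” idea can be illustrated with a small sketch. Pods in a StatefulSet get the set name followed by a dash and an ordinal, and the hostname inside the pod matches the pod name, so a startup script can branch on the ordinal. The function names below are hypothetical and this is not necessarily how the Dataverse entrypoint scripts implement the check.

```shell
# Print the ordinal at the end of a StatefulSet pod name,
# e.g. "dataverse-glassfish-2" gives "2".
# pod_ordinal and is_zeroth_pod are hypothetical helper names.
pod_ordinal() {
  printf '%s\n' "${1##*-}"
}

# Succeed only for the first pod of a StatefulSet.
is_zeroth_pod() {
  [ "$(pod_ordinal "$1")" = "0" ]
}

# Usage inside a container, where the hostname equals the pod name:
#   if is_zeroth_pod "$(hostname)"; then
#     echo "this pod would take on the timer server role"
#   fi
```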
Once you have multiple Glassfish servers you may notice bugs that will require additional configuration to fix. One such bug has to do with Dataverse logos which are stored at /usr/local/glassfish4/glassfish/domains/domain1/docroot/logos on each of the Glassfish servers. This means that the logo will look fine when you have just uploaded it, because you’re on the server with the logo on the local file system, but when you visit that dataverse in the future and you’re on a different Glassfish server, you will see a broken image. (You can find some discussion of this logo bug at https://github.com/IQSS/dataverse-aws/issues/10 and http://irclog.iq.harvard.edu/dataverse/2016-10-21 .) This is all “advanced” installation territory (see the Advanced Installation section of the Installation Guide) and OpenShift might be a good environment in which to work on some of these bugs.
Multiple PostgreSQL servers are possible within the OpenShift environment as well and have been set up with some amount of replication. “dataverse-postgresql-0” is the master and non-zero pods are the slaves. We have just scratched the surface of this configuration but replication from master to slave seems to be working. Future work could include failover and making Dataverse smarter about utilizing multiple PostgreSQL servers for reads. Right now we assume Dataverse is only being used with a single PostgreSQL server and that it’s the master.
Solr supports index distribution and replication for scaling. For OpenShift use, we choose replication. It’s possible to scale up Solr using a method similar to the one for Glassfish mentioned above. In OpenShift, the first Solr pod, dataverse-solr-0, will be the master node, and the rest will be slave nodes.
Solr requires backing up the search index to persistent storage. For our proof of concept, we configure a hostPath, which allows Solr containers to access the host’s file system, for our Solr containers’ backups. To read more about OpenShift/Kubernetes’ persistent volumes, please visit: https://kubernetes.io/docs/concepts/storage/persistent-volumes
To allow containers to use a host’s storage, we need to allow access to that directory first. In this example, we expose /tmp/share to the containers:
# mkdir /tmp/share
# chcon -R -t svirt_sandbox_file_t /tmp/share
# chgrp root -R /tmp/share
# oc login -u system:admin
# oc edit scc restricted # Update allowHostDirVolumePlugin to true and runAsUser type to RunAsAny
To add a persistent volume and persistent volume claim, add the following to objects in conf/openshift/openshift.json. Here, we are using hostPath for development purposes. Since OpenShift supports many types of cluster storage, an administrator who wishes to use cluster storage such as EBS or Google Cloud Storage would have to use a different type of persistent storage:
{
  "kind" : "PersistentVolume",
  "apiVersion" : "v1",
  "metadata":{
    "name" : "solr-index-backup",
    "labels":{
      "name" : "solr-index-backup",
      "type" : "local"
    }
  },
  "spec":{
    "capacity":{
      "storage" : "8Gi"
    },
    "accessModes":[
      "ReadWriteMany", "ReadWriteOnce", "ReadOnlyMany"
    ],
    "hostPath": {
      "path" : "/tmp/share"
    }
  }
},
{
  "kind" : "PersistentVolumeClaim",
  "apiVersion": "v1",
  "metadata": {
    "name": "solr-claim"
  },
  "spec": {
    "accessModes": [
      "ReadWriteMany", "ReadWriteOnce", "ReadOnlyMany"
    ],
    "resources": {
      "requests": {
        "storage": "3Gi"
      }
    },
    "selector":{
      "matchLabels":{
        "name" : "solr-index-backup",
        "type" : "local"
      }
    }
  }
}
To make the Solr container mount the hostPath, add the following part under .spec.spec (for the Solr StatefulSet):
{
"kind": "StatefulSet",
"apiVersion": "apps/v1beta1",
"metadata": {
"name": "dataverse-solr",
....
"spec": {
"serviceName" : "dataverse-solr-service",
.....
"spec": {
"volumes": [
{
"name": "solr-index-backup",
"persistentVolumeClaim": {
"claimName": "solr-claim"
}
}
],
"containers": [
....
"volumeMounts":[
{
"mountPath" : "/var/share",
"name" : "solr-index-backup"
}
Solr is now ready for backup and recovery. To back up:
oc rsh dataverse-solr-0
curl 'http://localhost:8983/solr/collection1/replication?command=backup&location=/var/share'
The Solr entrypoint.sh is configured so that if dataverse-solr-0 fails, it will get the latest version of the index from the backup and restore it. All backups are stored in /tmp/share on the host, or /var/share in the Solr containers.
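To see which backup a restore would most likely pick up, you can look for the newest snapshot in the shared directory. The sketch below assumes that Solr’s replication handler writes each backup into a directory named snapshot. followed by a timestamp, so that sorting by name also sorts by time; please verify this against your Solr version. The function name latest_snapshot is hypothetical and this is not the logic used by entrypoint.sh.

```shell
# Print the snapshot.* entry under the given directory whose name sorts last.
# Assumes timestamped names such as snapshot.<timestamp>; prints nothing
# when the directory has no snapshots. latest_snapshot is a hypothetical helper.
latest_snapshot() {
  for entry in "$1"/snapshot.*; do
    # An unmatched glob stays literal, so skip anything that does not exist.
    if [ -e "$entry" ]; then
      printf '%s\n' "$entry"
    fi
  done | sort | tail -n 1
}

# Usage on the host (or with /var/share inside a Solr container):
#   latest_snapshot /tmp/share
```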
It is not recommended to run containers as root in Minishift because for security reasons OpenShift doesn’t support running containers as root. However, it’s good to know how to allow containers to run as root in case you need to work on a Docker image to make it run as non-root.
For more information on improving Docker images to run as non-root, see “Support Arbitrary User IDs” at https://docs.openshift.org/latest/creating_images/guidelines.html#openshift-origin-specific-guidelines
Let’s say you have a container that you suspect works fine when it runs as root. You want to see it working as-is before you start hacking on the Dockerfile and entrypoint file. You can configure Minishift to allow containers to run as root with this command:
oc adm policy add-scc-to-user anyuid -z default --as system:admin
Once you are done testing you can revert Minishift back to not allowing containers to run as root with this command:
oc adm policy remove-scc-from-user anyuid -z default --as system:admin
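Since it is easy to forget the second command, the two can be paired so that the policy is always reverted after a test. This is a minimal sketch under stated assumptions: the function name with_anyuid and the OC override variable are hypothetical, and the two oc invocations are exactly the ones shown above.

```shell
# Grant anyuid to the default service account, run the given command,
# then revoke anyuid again regardless of how the command exited.
# with_anyuid and the OC variable are hypothetical; OC defaults to "oc".
with_anyuid() {
  oc_cmd=${OC:-oc}
  "$oc_cmd" adm policy add-scc-to-user anyuid -z default --as system:admin || return 1
  rc=0
  "$@" || rc=$?
  "$oc_cmd" adm policy remove-scc-from-user anyuid -z default --as system:admin
  # Report the exit status of the wrapped command, not of the cleanup.
  return "$rc"
}

# Usage: wrap whatever you run while testing the root-only image, e.g.
#   with_anyuid sh -c 'echo "deploy and check the container here"'
```

Setting OC=echo gives a dry run that prints the two policy commands instead of executing them.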
The following resources might be helpful.
From the Dataverse perspective, Docker is important for a few reasons:
On Linux, you can probably get Docker from your package manager.
On Mac, download the .dmg from https://www.docker.com and install it. As of this writing it is known as Docker Community Edition for Mac.
On Windows, we have heard reports of success using Docker on a Linux VM running in VirtualBox or similar. There’s something called “Docker Community Edition for Windows” but we haven’t tried it. See also the Windows Development section.
As explained above, we use Docker images in two different contexts:
The “all in one” Docker files are in conf/docker-aio and you should follow the readme in that directory for more information on how to use them.
FIXME: rewrite this section to talk about only pushing stable images to Docker Hub.
When working with Docker in the context of Minishift, follow the instructions above and make sure you get the Dataverse Docker images running in Minishift before you start messing with them.
As of this writing, the Dataverse Docker images we publish under https://hub.docker.com/u/iqss/ are highly experimental. They were originally tagged with branch names like kick-the-tires and as of this writing the latest tag should be considered highly experimental and not for production use. See https://github.com/IQSS/dataverse/issues/4040 for the latest status and please reach out if you’d like to help!
Change to the docker directory:
cd conf/docker
Edit one of the files:
vim dataverse-glassfish/Dockerfile
At this point you want to build the image and run it. We are assuming you want to run it in your Minishift environment. We will be building your image and pushing it to Docker Hub.
Log in to Docker Hub with an account that has access to push to the iqss organization:
docker login
(If you don’t have access to push to the iqss organization, you can push elsewhere and adjust your openshift.json file accordingly.)
Build and push the images to Docker Hub:
./build.sh
Note that you will see output such as digest: sha256:213b6380e6ee92607db5d02c9e88d7591d81f4b6d713224d47003d5807b93d4b that should later be reflected in Minishift to indicate that you are using the latest image you just pushed to Docker Hub.
You can get a list of all repos under the iqss organization with this:
curl https://hub.docker.com/v2/repositories/iqss/
To see a specific repo:
curl https://hub.docker.com/v2/repositories/iqss/dataverse-glassfish/
Again, Dataverse Docker images on Docker Hub are highly experimental at this point. As of this writing, their purpose is primarily for kicking the tires on Dataverse. Here are some known issues:
/sys/fs/cgroup/memory/memory.limit_in_bytes
and incorporating this into the Dataverse installation script.