Important: TwoRavens is an optional component. The software was created here at IQSS, as a standalone tool, by an independent developers team (outside of the main Dataverse development team). TwoRavens was successfully integrated with Dataverse and made available to the users of the production Dataverse cluster at Harvard. However, the original developers have since left IQSS. Plans for the future of this collaboration are still being worked out. As the result, the support for TwoRavens is somewhat limited at the moment (as of Dataverse release 4.6.1). Though it is possible to try and contact the developers via the TwoRavens project on GitHub. The Dataverse project has however invested into improving the installation process; specifically, the notoriously tricky part of getting the correct versions of the required third party R packages installed. We are expecting to release the new installation scripts (and update this guide) within a few days of the 4.6.1 release.
Be warned that the installation process for TwoRavens has proven to be tricky and complicated. Besides the application proper, several required components need to be installed and configured. Specifically, R, rApache and, most importantly, a collection of required third-party R packages. The installation steps for these components are described in the individual sections of the document below.
yum install httpd
Disable SELinux:
setenforce permissive
getenforce
(Note: a pull request to get rApache working with SELinux is welcome! Please see the SELinux section of the Developer Guide to get started.)
https strongly recommended; signed certificate (as opposed to self-signed) is recommended.
Directory listing needs to be disabled on the web documents folder served by Apache:
In the main Apache configuration file (/etc/httpd/conf/httpd.conf
in the default setup), find the section that configures your web directory. For example, if the DocumentRoot
, defined elsewhere in the file, is set to the default "/var/www/html"
, the opening line of the section will look like this:
<Directory "/var/www/html">
Find the Options
line in that section, and make sure that it doesn’t contain the Indexes
statement.
For example, if the options line in your configuration is
Options Indexes FollowSymLinks
change it to
Options FollowSymLinks
yum install R R-devel
(EPEL distribution recommended; version 3.* required; 3.1.* recommended as of writing this)
To pick up any needed dependencies, CentOS users may simply install the epel-release RPM.
RHEL users will want to log in to their organization’s respective RHN interface, find the particular machine in question and:
rpm distribution from the HMDC systems group is recommended (latest available version is 1.2.6, as of writing this). The rpm requires Apache libapreq2, that should be available via yum.
install rApache as follows:
yum install libapreq2
rpm -ivh http://mirror.hmdc.harvard.edu/HMDC-Public/RedHat-6/rapache-1.2.6-rpm0.x86_64.rpm
If you are using RHEL/CentOS 7, you can download an experimental rapache-1.2.7-rpm0.x86_64.rpm
and install it with:
rpm -ivh rapache-1.2.7-rpm0.x86_64.rpm
The r-setup.sh script launches child processes which log to RINSTALL.* files. Once the script exits, search these files for the word “error” and be sure to install any missing dependencies and run the script again. At present, at minimum it needs:
yum install libcurl-devel openssl-devel libxml2-devel ed libX11-devel libpng-devel mesa-libGL-devel mesa-libGLU-devel libpqxx-devel
Make sure you have the standard GNU compilers installed (needed for 3rd-party R packages to build themselves). CentOS 6 users will need gcc-fortran 4.6 or greater, available from the CentOS devtools repo.
Again, without these rpms, R package devtools was failing to install, silently or with a non-informative error message.
Note: this package devtools
has proven to be very flaky; it is being very actively maintained, new dependencies are being constantly added and new bugs introduced... however, it is only needed to install the package Zelig
, the main R workhorse behind TwoRavens. It cannot be installed from CRAN, like all the other 3rd party packages we use - becase TwoRavens requires version 5, which is still in beta. So devtools is needed to build it from sources downloaded directly from github. Once Zelig 5 is released, we’ll be able to drop the requirement for devtools - and that will make this process much simpler. For now, be prepared for it to be somewhat of an adventure.
R is used both by the Dataverse application, directly, and the TwoRavens companion app.
Two distinct interfaces are used to access R: Dataverse uses Rserve; and TwoRavens sends jobs to R running under rApache using Rook interface.
We provide a shell script (conf/R/r-setup.sh
in the Dataverse source tree; you will need the other 3 files in that directory as well - https://github.com/IQSS/dataverse/tree/master/conf/R) that will attempt to install the required 3rd party packages; it will also configure Rserve and rserve user. rApache configuration will be addressed in its own section.
The script will attempt to download the packages from CRAN (or a mirror) and GitHub, so the system must have access to the internet. On a server fully firewalled from the world, packages can be installed from downloaded sources. This is left as an exercise for the reader. Consult the script for insight.
See the Appendix for the information on the specific packages, and versions that the script will attempt to install.
and rename the resulting directory dataexplore
.
Place it in the web root directory of your apache server. We’ll assume /var/www/html/dataexplore
in the examples below.
a scripted, interactive installer is provided at the top level of the TwoRavens distribution. Run it as:
cd /var/www/html/dataexplore
chmod +x install.pl
./install.pl
The installer will ask you to provide the following:
Setting | default | Comment |
---|---|---|
TwoRavens directory | /var/www/html/dataexplore |
File directory where TwoRavens is installed. |
Apache config dir. | /etc/httpd |
rApache config file for TwoRavens will be placed under conf.d/ there. |
Apache web dir. | /var/www/html |
|
Apache host address | local hostname | rApache host |
Apache host port | 80 |
rApache port (see the next section for the discussion on ports!) |
Apache web protocol | http |
http or https for rApache (https recommended) |
Dataverse URL | http://{local hostname}:8080 |
URL of the Dataverse from which TwoRavens will be receiving metadata and data files. |
Once everything is installed and configured, the installer script will print out a confirmation message with the URL of the TwoRavens application. For example:
The application URL is https://server.dataverse.edu/dataexplore/gui.html
This URL must be configured in the settings of your Dataverse application! This can be done by issuing the following settings API call:
curl -X PUT -d {TWORAVENS_URL} http://localhost:8080/api/admin/settings/:TwoRavensUrl
where “{TWORAVENS_URL}” is the URL reported by the installer script (as in the example above).
By default, Glassfish will install itself on ports 8080 and 8181 (for http and https, respectively), and Apache - on port 80 (the default port for http). Under this configuration, your Dataverse will be accessible at http://{your host}:8080 and https://{your host}:8181; and rApache - at http://{your host}/. The TwoRavens installer, above, will default to these values (and assume you are running both the Dataverse and TwoRavens/rApache on the same host).
This configuration may be the easiest to set up if you are simply trying out/testing the Dataverse and TwoRavens. Accept all the defaults, and you should have a working installation in no time. However, if you are planning to use this installation to actually serve data to real users, you’ll probably want to run Glassfish on ports 80 and 443. This way, there will be no non-standard ports in the Dataverse url visible to the users. Then you’ll need to configure the Apache to run on some other port - for example, 8080, instead of 80. This port will only appear in the URL for the TwoRavens app. If you want to use this configuration - or any other that is not the default one described above! - it is your job to reconfigure Glassfish and Apache to run on the desired ports before you run the TwoRavens installer.
Furthermore, while the default setup assumes http as the default protocol for both the Dataverse and TwoRavens, https is strongly recommended for a real production system. Again, this will be your responsibility, to configure https in both Glassfish and Apache. Glassfih comes pre-configured to run https on port 8181, with a self-signed certificiate. For a production system, you will most certainly will want to obtain a properly signed certificate and configure Glassfish to use it. Apache does not use https out of the box at all. Again, it is the responsibility of the installing user, to configure Apache to run https, and, providing you are planning to run rApache on the same host as the Dataverse, use the same SSL certificate as your Glassfish instance. Again, it will need to be done before you run the installer script above. All of this may involve some non-trivial steps and will most likely require help from your local network administrator - unless you happen to be your local sysadmin. Unfortunately, we cannot provide step-by-step instructions for these tasks. As the actual steps required will likely depend on the specifics of how your institution obtains signed SSL certificates, the format in which you receive these certificates, etc. Good luck!
Finally: If you choose to have your Dataverse support secure Shibboleth authentication, it will require a server and port configuration that is different still. Under this arrangement Glassfish instance is running on a high local port unaccessible from the outside, and is “hidden” behind Apache. With the latter running on the default https port, accepting and proxying the incoming connections to the former. This is described in the Shibboleth section of the Installation Guide. With this proxying setup in place, the TwoRavens and rApache configuration actually becomes simpler. As both the Dataverse and TwoRavens will be served on the same port - 443 (the default port for https). So when running the installer script above, and providing you are planning to run both on the same server, enter “https”, your host name and “443” for the rApache protocol, host and port, respectively. The base URL of the Dataverse app will be simply https://{your host name}/.
Explained below are the steps needed to manually install and configure the required R packages, and to configure TwoRavens to run under rApache (these are performed by the r-setup.sh
and install.pl
scripts above). Provided for reference.
TwoRavens requires the following R packages and versions to be installed:
R Package | Version Number |
---|---|
Zelig | 5.0.5 |
Rook | 1.1.1 |
rjson | 0.2.13 |
jsonlite | 0.9.16 |
DescTools | 0.99.11 |
Note that some of these packages have their own dependencies, and additional installations are likely necessary. TwoRavens is not compatible with older versions of these R packages.
Edit the file /var/www/html/dataexplore/app_ddi.js
.
find and edit the following 3 lines:
var production=false;
and change it to true
;
hostname="localhost:8080";
so that it points to the dataverse app, from which TwoRavens will be obtaining the metadata and data files. (don’t forget to change 8080 to the correct port number!)
and
var rappURL = "http://0.0.0.0:8000/custom/";
set this to the URL of your rApache server, i.e.
"https://<rapacheserver>:<rapacheport>/custom/";
rApache is a loadable httpd module that provides a link between Apache and R.
When you installed the rApache rpm, under 0., it placed the module in the Apache library directory and added a configuration entry to the config file (/etc/httpd/conf/httpd.conf
).
Now we need to configure rApache to serve several R “mini-apps”, from the R sources provided with TwoRavens.
in dataexplore/rook
:
rookdata.R, rookzelig.R, rooksubset.R, rooktransform.R, rookselector.R, rooksource.R
and replace every instance of production<-FALSE
line with production<-TRUE
.
(yeah, that’s why we provide that installer script...)
and change the following line:
setwd("/usr/local/glassfish4/glassfish/domains/domain1/docroot/dataexplore/rook")
to
setwd("/var/www/html/dataexplore/rook")
(or your dataexplore directory, if different from the above)
url <- paste("https://demo.dataverse.org/custom/preprocess_dir/preprocessSubset_",sessionid,".txt",sep="")
and
imageVector[[qicount]]<<-paste("https://dataverse-demo.iq.harvard.edu/custom/pic_dir/", mysessionid,"_",mymodelcount,qicount,".png", sep = "")
and change the URL to reflect the correct location of your rApache instance - make sure that the protocol and the port number are correct too, not just the host name!
(This configuration is now supplied in its own config file tworavens-rapache.conf
, it can be dropped into the Apache’s /etc/httpd/conf.d
. Again, the scripted installer will do this for you automatically.)
RSourceOnStartup "/var/www/html/dataexplore/rook/rooksource.R"
<Location /custom/zeligapp>
SetHandler r-handler
RFileEval /var/www/html/dataexplore/rook/rookzelig.R:Rook::Server$call(zelig.app)
</Location>
<Location /custom/subsetapp>
SetHandler r-handler
RFileEval /var/www/html/dataexplore/rook/rooksubset.R:Rook::Server$call(subset.app)
</Location>
<Location /custom/transformapp>
SetHandler r-handler
RFileEval /var/www/html/dataexplore/rook/rooktransform.R:Rook::Server$call(transform.app)
</Location>
<Location /custom/dataapp>
SetHandler r-handler
RFileEval /var/www/html/dataexplore/rook/rookdata.R:Rook::Server$call(data.app)
</Location>
mkdir --parents /var/www/html/custom/pic_dir
mkdir --parents /var/www/html/custom/preprocess_dir
mkdir --parents /var/www/html/custom/log_dir
chown -R apache.apache /var/www/html/custom
service httpd restart
The final step of TwoRavens’ installation is to tell Dataverse to display its Explore button alongside tabular datafiles by executing the following command on the Glassfish host:
curl -X PUT -d true http://localhost:8080/api/admin/settings/:TwoRavensTabularView