Tools
These are handy tools for your Development Environment.
Contents:
Netbeans Connector Chrome Extension
The Netbeans Connector extension for Chrome allows you to see changes you’ve made to HTML pages the moment you save the file without having to refresh your browser. See also http://wiki.netbeans.org/ChromeExtensionInstallation
Unfortunately, while the Netbeans Connector Chrome Extension used to “just work”, these days a workaround described at https://www.youtube.com/watch?v=J6lOQS2rWK0&t=130 seems to be necessary. For now, under “Run” (under project properties), choose “Chrome” as the browser rather than “Chrome with NetBeans Connector”. After you run the project, click the Netbeans logo in Chrome and then “Debug in NetBeans”. For more information, please see the “workaround for Netbeans Connector Chrome Extension” post at https://groups.google.com/d/msg/dataverse-dev/agJZilD1l0Q/cMBkt5KDBQAJ
pgAdmin
You probably installed pgAdmin when following the steps in the Development Environment section but if not, you can download it from https://www.pgadmin.org
Maven
With Maven installed you can run mvn package
and mvn test
from the command line. It can be downloaded from https://maven.apache.org
Vagrant
Vagrant allows you to spin up a virtual machine running the Dataverse Software on your development workstation. You’ll need to install Vagrant from https://www.vagrantup.com and VirtualBox from https://www.virtualbox.org.
We assume you have already cloned the repo from https://github.com/IQSS/dataverse as explained in the Development Environment section.
From the root of the git repo (where the Vagrantfile
is), run vagrant up
and eventually you should be able to reach a Dataverse installation at http://localhost:8888 (the forwarded_port
indicated in the Vagrantfile
).
Please note that running vagrant up
for the first time should run the downloads/download.sh
script for you to download required software such as an app server, Solr, etc. However, these dependencies change over time so it’s a place to look if vagrant up
was working but later fails.
On Windows if you see an error like /usr/bin/perl^M: bad interpreter
you might need to run dos2unix
on the installation scripts.
PlantUML
PlantUML is used to create diagrams in the guides and other places. Download it from http://plantuml.com and check out an example script at https://github.com/IQSS/dataverse/blob/v4.6.1/doc/Architecture/components.sh . Note that for this script to work, you’ll need the dot
program, which can be installed on Mac with brew install graphviz
.
Eclipse Memory Analyzer Tool (MAT)
The Memory Analyzer Tool (MAT) from Eclipse can help you analyze heap dumps, showing you “leak suspects” such as seen at https://github.com/payara/Payara/issues/350#issuecomment-115262625
It can be downloaded from http://www.eclipse.org/mat
If the heap dump provided to you was created with gcore
(such as with gcore -o /tmp/app.core $app_pid
) rather than jmap
, you will need to convert the file before you can open it in MAT. Using app.core.13849
as example of the original 33 GB file, here is how you could convert it into a 26 GB app.core.13849.hprof
file. Please note that this operation took almost 90 minutes:
/usr/java7/bin/jmap -dump:format=b,file=app.core.13849.hprof /usr/java7/bin/java app.core.13849
A file of this size may not “just work” in MAT. When you attempt to open it you may see something like “An internal error occurred during: “Parsing heap dump from ‘/tmp/heapdumps/app.core.13849.hprof’”. Java heap space”. If so, you will need to increase the memory allocated to MAT. On Mac OS X, this can be done by editing MemoryAnalyzer.app/Contents/MacOS/MemoryAnalyzer.ini
and increasing the value “-Xmx1024m” until it’s high enough to open the file. See also http://wiki.eclipse.org/index.php/MemoryAnalyzer/FAQ#Out_of_Memory_Error_while_Running_the_Memory_Analyzer
PageKite
PageKite is a fantastic service that can be used to share your local development environment over the Internet on a public IP address.
With PageKite running on your laptop, the world can access a URL such as http://pdurbin.pagekite.me to see what you see at http://localhost:8080
Sign up at https://pagekite.net and follow the installation instructions or simply download https://pagekite.net/pk/pagekite.py
The first time you run ./pagekite.py
a file at ~/.pagekite.rc
will be
created. You can edit this file to configure PageKite to serve up port 8080
(the default app server HTTP port) or the port of your choosing.
According to https://pagekite.net/support/free-for-foss/ PageKite (very generously!) offers free accounts to developers writing software the meets http://opensource.org/docs/definition.php such as the Dataverse Project.
MSV
MSV (Multi Schema Validator) can be used from the command line to validate an XML document against a schema. Download the latest version from https://java.net/downloads/msv/releases/ (msv.20090415.zip as of this writing), extract it, and run it like this:
$ java -jar /tmp/msv-20090415/msv.jar Version2-0.xsd ddi.xml
start parsing a grammar.
validating ddi.xml
the document is valid.
FontCustom
The custom file type icons were created with the help of FontCustom <https://github.com/FontCustom/fontcustom>. Their README provides installation instructions as well as directions for producing your own vector-based icon font.
Here is a vector-based SVG file to start with as a template: icon-template.svg
SonarQube
SonarQube is a static analysis tool that can be used to identify possible problems in the codebase, or with new code. It may report false positives or false negatives, but can help identify potential problems before they are reported in prodution or to identify potential causes of problems reported in production.
Download SonarQube from https://www.sonarqube.org and start look in the bin directory for a sonar.sh script for your architecture. Once the tool is running on http://localhost:9000 you can use it as the URL in this example script to run sonar:
#!/bin/sh
mvn sonar:sonar \
-Dsonar.host.url=${your_sonar_url} \
-Dsonar.login=${your_sonar_token_for_project} \
-Dsonar.test.exclusions='src/test/**,src/main/webapp/resources/**' \
-Dsonar.issuesReport.html.enable=true \
-Dsonar.issuesReport.html.location='sonar-issues-report.html' \
-Dsonar.jacoco.reportPath=target/jacoco.exec
Once the analysis is complete, you should be able to access http://localhost:9000/dashboard?id=edu.harvard.iq%3Adataverse to see the report. To learn about resource leaks, for example, click on “Bugs”, the “Tag”, then “leak” or “Rule”, then “Resources should be closed”.
Infer
Infer is another static analysis tool that can be downloaded from https://github.com/facebook/infer
Example command to run infer:
$ infer -- mvn package
Look for “RESOURCE_LEAK”, for example.
lsof
If file descriptors are not closed, eventually the open but unused resources can cause problems with system (app servers in particular) stability. Static analysis and heap dumps are not always sufficient to identify the sources of these problems. For a quick sanity check, it can be helpful to check that the number of file descriptors does not increase after a request has finished processing.
For example…
$ lsof | grep M6EI0N | wc -l
0
$ curl -X GET "http://localhost:8083/dataset.xhtml?persistentId=doi:10.5072/FK2/M6EI0N" > /dev/null
$ lsof | grep M6EI0N | wc -l
500
would be consistent with a file descriptor leak on the dataset page.
jmap and jstat
jmap
and jstat
are parts of the standard JDK distribution.
jmap allows you to look at the contents of the java heap. It can be used to create a heap dump, that you can then feed to another tool, such as Memory Analyzer Tool
(see above). It can also be used as a useful tool of its own, for example, to list all the classes currently instantiated in memory:
$ jmap -histo <app process id>
will output a list of all classes, sorted by the number of instances of each individual class, with the size in bytes.
This can be very useful when looking for memory leaks in the application. Another useful tool is jstat
, that can be used in combination with jmap
to monitor the effectiveness of garbage collection in reclaiming allocated memory.
In the example script below we stress running Dataverse Software applicatione with GET requests to a specific page in a Dataverse installation, use jmap
to see how many Dataverse collection, Dataset and DataFile class object get allocated, then run jstat
to see how the numbers are affected by both “Young Generation” and “Full” garbage collection runs (YGC
and FGC
respectively):
(This is script is provided as an example only! You will have to experiment and expand it to suit any specific needs and any specific problem you may be trying to diagnose, and this is just to give an idea of how to go about it)
#!/bin/sh
# the script takes the numeric id of the app server process as the command line argument:
id=$1
while :
do
# Access the Dataverse collection xxx 10 times in a row:
for ((i = 0; i < 10; i++))
do
# hide the output, standard and stderr:
curl http://localhost:8080/dataverse/xxx 2>/dev/null > /dev/null
done
sleep 1
# run jmap and save the output in a temp file:
jmap -histo $id > /tmp/jmap.histo.out
# grep the output for Dataverse Collection, Dataset and DataFile classes:
grep '\.Dataverse$' /tmp/jmap.histo.out
grep '\.Dataset$' /tmp/jmap.histo.out
grep '\.DataFile$' /tmp/jmap.histo.out
# (or grep for whatever else you may be interested in)
# print the last line of the jmap output (the totals):
tail -1 /tmp/jmap.histo.out
# run jstat to check on GC:
jstat -gcutil ${id} 1000 1 2>/dev/null
# add a time stamp and a new line:
date
echo
done
The script above will run until you stop it, and will output something like:
439: 141 28200 edu.harvard.iq.dataverse.Dataverse
472: 160 24320 edu.harvard.iq.dataverse.Dataset
674: 60 9600 edu.harvard.iq.dataverse.DataFile
S0 S1 E O P YGC YGCT FGC FGCT GCT
0.00 100.00 35.32 20.15 ? 7 2.145 0 0.000 2.145
Total 108808814 5909776392
Wed Aug 14 23:13:42 EDT 2019
385: 181 36200 edu.harvard.iq.dataverse.Dataverse
338: 320 48640 edu.harvard.iq.dataverse.Dataset
524: 120 19200 edu.harvard.iq.dataverse.DataFile
S0 S1 E O P YGC YGCT FGC FGCT GCT
0.00 100.00 31.69 45.11 ? 9 3.693 0 0.000 3.693
Total 167998691 9080163904
Wed Aug 14 23:14:59 EDT 2019
367: 201 40200 edu.harvard.iq.dataverse.Dataverse
272: 480 72960 edu.harvard.iq.dataverse.Dataset
442: 180 28800 edu.harvard.iq.dataverse.DataFile
S0 S1 E O P YGC YGCT FGC FGCT GCT
0.00 100.00 28.05 69.94 ? 11 5.001 0 0.000 5.001
Total 226826706 12230221352
Wed Aug 14 23:16:16 EDT 2019
... etc.
How to analyze the output, what to look for:
First, look at the numbers in the jmap output. In the example above, you can immediately see, after the first three iterations, that every 10 Dataverse installation page loads results in the increase of the number of Dataset classes by 160. I.e., each page load leaves 16 of these on the heap. We can also see that each of the 10 page load cycles increased the heap by roughly 3GB; that each cycle resulted in a couple of YG (young generation) garbage collections, and in the old generation allocation being almost 70% full. These numbers in the example are clearly quite high and are an indication of some problematic memory allocation by the Dataverse installation page - if this is the result of something you have added to the page, you probably would want to investigate and fix it. However, overly generous memory use is not the same as a leak necessarily. What you want to see now is how much of this allocation can be reclaimed by “Full GC”. If all of it gets freed by FGC
, it is not the end of the world (even though you do not want your system to spend too much time running FGC
; it costs CPU cycles, and actually freezes the application while it’s in progress!). It is however a really serious problem, if you determine that a growing portion of the old. gen. memory ("O"
in the jmap
output) is not getting freed, even by FGC
. This is a real leak now, i.e. something is leaving behind some objects that are still referenced and thus off limits to garbage collector. So look for the lines where the FGC
counter is incremented. For example, the first FGC
in the example output above:
271: 487 97400 edu.harvard.iq.dataverse.Dataverse
216: 3920 150784 edu.harvard.iq.dataverse.Dataset
337: 372 59520 edu.harvard.iq.dataverse.DataFile
Total 277937182 15052367360
S0 S1 E O P YGC YGCT FGC FGCT GCT
0.00 100.00 77.66 88.15 ? 17 8.734 0 0.000 8.734
Wed Aug 14 23:20:05 EDT 2019
265: 551 110200 edu.harvard.iq.dataverse.Dataverse
202: 4080 182400 edu.harvard.iq.dataverse.Dataset
310: 450 72000 edu.harvard.iq.dataverse.DataFile
Total 142023031 8274454456
S0 S1 E O P YGC YGCT FGC FGCT GCT
0.00 100.00 71.95 20.12 ? 22 25.034 1 4.455 29.489
Wed Aug 14 23:21:40 EDT 2019
We can see that the first FGC
resulted in reducing the "O"
by almost 7GB, from 15GB down to 8GB (from 88% to 20% full). The number of Dataset classes has not budged at all - it has grown by the same 160 objects as before (very suspicious!). To complicate matters, FGC
does not guarantee to free everything that can be freed - it will balance how much the system needs memory vs. how much it is willing to spend in terms of CPU cycles performing GC (remember, the application freezes while FGC
is running!). So you should not assume that the “20% full” number above means that you have 20% of your stack already wasted and unrecoverable. Instead, look for the next minium value of "O"
; then for the next, etc. Now compare these consecutive miniums. With the above test (this is an output of a real experiment, a particularly memory-hungry feature added to the Dataverse installation page), the minimums sequence (of old. gen. usage, in %) was looking as follows:
2.19
2.53
3.00
3.13
3.95
4.03
4.21
4.40
4.64
5.06
5.17
etc. ...
It is clearly growing - so now we can conclude that indeed something there is using memory in a way that’s not recoverable, and this is a clear problem.
Previous: Making Releases | Next: Universal Numerical Fingerprint (UNF)