(this document is a draft!)
The document is primarily intended for those who are interested in implementing their own UNF Version 6 calculator. We would like to encourage multiple parallel implementations, since that would be a great (the only, really) way to cross-validate UNF signatures calculated for specific sets of data.
Algorithm Description
UNF v5, on which v6 is based, was originally described in Dr. Micah Altman’s paper “A Fingerprint Method for Verification of Scientific Data”, Springer Verlag, 2008. The reader is encouraged to consult it for the explanation of the theory behind UNF. However, various changes and clarifications concerning the specifics of normalization have been made to the algorithm since the publication. These crucial details were only documented in the author’s unpublished edits of the article and in private correspondence. With this document, a serious effort has been made to produce a complete step-by-step description of the entire process. It should be fully sufficient for the purposes of implementing the algorithm.
Contents:
For each individual vector in a data frame, calculate its UNF signature as follows:
Ia. Normalize each vector element as follows:
1. For a vector of numeric elements: Round each vector element to N significant digits using the IEEE 754 “round towards nearest, ties to even” rounding mode. The default value of N is 7.
(See an Important Note on the use of default and optional values and methods!)
Convert each vector element into a character string in exponential notation, as follows:
Special cases:
Infinity and NaN (“Not a Number”) values:
If an element is an IEEE 754, non-finite, special floating-point value, represent it as the signed, lowercase, IEEE minimal printable equivalent, that is, +inf, -inf, or +nan. No attempt is made to differentiate between various types of NaNs allowed under IEEE 754.
Examples:
An “official” list of pre-calculated sample UNFs is supplied with the source of the Java implementation of UNF v6; see the Note at the end of the document.
2. For a vector of character strings:
Encode each character string with Unicode bit encoding. In UNF Version 6 UTF-8 is used. Truncate each string to X characters; the default value of X is 128. No further normalization is performed.
3. Vectors of Boolean values
Should be treated as numeric vectors of 0 s and 1 s.
4. Bit fields.
Normalize bit fields by converting to big-endian form, truncating all leading empty bits, aligning to a byte boundary by padding with leading zero bits, and base64 encoding to form a character string representation.
5. Normalize dates, times and intervals as follows:
(see the Note at the end of this document for a discussion on potential issues when calculating UNFs of time values).
Examples:
2:29 pm on Jun 10, 2012 is normalized as “2012-06-10T14:29:00”.
6. Missing values Missing values, of all of the above types, are encoded as 3 null bytes: \000\000\000.
Ib. Calculate the UNF of the vector as follows:
Terminate each character string representing a NON-MISSING value with a POSIX end-of-line character and a null byte (\000). Do not terminate missing value representations (3 null bytes \000\000\000). Concatenate all the individual character strings, and compute the SHA256 hash of the combined string. Truncate the resulting hash to 128 bits (128 being the default, with other values possible - see the note at the end of the document). Encode the resulting string in base64, for readability. Prepend the encoded hash string with the signature header UNF:6: (with 6 indicating the current version).
Example:
IIa. Combine the UNFs of multiple variables to form the UNF for an entire data frame as follows:
Do note the sorting part, above, it is important! In a vector of observations, the order is important; changing the order of observations changes the UNF. A data frame, however, is considered an unordered set of individual vectors. I.e., re-arranging the order in which data variable columns occur in an R or Stata file should not affect the UNF. Hence the UNFs of individual variables are sorted, before the combined UNF of the data frame is calculated.
IIb. Similarly, combine the UNFs for a set of data frames to form a single UNF that represents an entire research study (“dataset”).
Again, the UNF of a study (dataset) with a single file = the UNF of the file; for more than one file, calculate the study UNF as described above.
Using a consistent UNF version and level of precision across an entire dataset is recommended when calculating the UNFs of individual data objects.
Note: On default and non-default parameter values: Here and throughout the rest of this document, phrases like “The default value of N is 7” suggest that it is possible to use non-default values, such as a different number of digits of precision, in this case. This has been a source of some confusion in the past. UNF relies on data normalization to produce “data fingerprints” that are meaningful and descriptive. So how do you generate reproducible and verifiable signatures if any flexibility is allowed in the normalization algorithm? The answer, as specified in the original UNF paper: any non-default parameters used are embedded in the header portion of the UNF!
For example, to specify a non-default precision the parameter it is specified using the parameter N, formatted as follows:
Nnnn - where nnn is the number of precision digits, different from the default 7.
Dr. Micah Altman’s classic UNF v5 paper mentions another optional parameter T###, for specifying rounding of date and time values (implemented as stripping the values of entire components - fractional seconds, seconds, minutes, hours... etc., progressively) - but it doesn’t specify its syntax. It is left as an exercise for a curious reader to contact the author and work out the details, if so desired. (Not implemented in UNF Version 6 by the Dataverse Project).
Note: we do not recommend truncating character strings at fewer bytes than the default 128 (the X parameter). At the very least this number must be high enough so that the printable UNFs of individual variables or files are not truncated, when calculating combined UNFs of files or datasets, respectively.
It should also be noted that the Dataverse application never calculates UNFs with any non-default parameters. And we are not aware of anyone else actually doing so. If you are considering creating your own implementation of the UNF, it may be worth trying to create a simplified, defaults-only version first. Such an implementation would be sufficient to independently verify Dataverse-produced UNFs, among other things.
Note: Negative Zero
Note: UNFs of time values in real-life statistical packages
The following is not by itself an implementation concern. But it is something you may need to consider when calculating UNFs of time values from real-world data.
The fact that the same time value with and without the time zone specified produces different UNFs presents an interesting issue when converting data between different formats. For example, in STATA none of the available time types support time zones. In R, on the other hand, ALL time values are stored with a time zone. While it is possible to create an R time value from a character representation with no time zone - for example:
timevar<-as.POSIXct("03/19/2013 18:20:00", format = "%m/%d/%Y %H:%M:%OS");
it still results in R assuming the time is in the current time zone, and storing the UTC equivalent of that time. In fact R always stores its time values in UTC; specific time zones can be defined, as attributes, in which case the values will be adjusted accordingly for display. Otherwise the display representation will be readjusted each time the vector is viewed, according to the time zone current to the viewer. Meaning that the human readable representation of the same stored time value will be different when viewed on systems in different time zones. With that in mind, it appears that the only way to calculate a meaningful UNF of a time value from an R data frame is to use the stored UTC time - resulting in the “Z” in the normalized string. And that further means that it is impossible to convert a data frame with time values from STATA to R, or the other way around, and have the same UNF preserved.
We do not consider this a problem with the algorithm. These differences between the two approaches to handling time values, in R and STATA, should in fact be considered as significant. Enough so to conclude that the format conversion actually changes the data semantically. Which, in turn, justifies a new UNF.
If for whatever reason it is important to produce an R version of a STATA file while preserving the UNF, it can still be done. One way to achieve that would be to convert the original time vector to a String vector in R, in the format identical to that used in the UNF normalization algorithm, e.g., “yy-mm-ddThh:mm:ss”. One would not be able to use this resulting R vector in any time-based calculations without extra type conversion. But the data frame would produce the same UNF.
More UNF Examples:
An “official” list of sample UNFs of various data types is provided with the source of the UNF v6 Java implementation.