OpenCDISC Validator Performance and Scalability Guide

Our developers continue to improve the performance of the OpenCDISC codebase, but there is also a lot that users can do to get the most out of the OpenCDISC Validator in their own environment. This guide describes the available customizations and the benefits each can bring to the validation process.

Using Multiple Processors/Cores

Version 1.1 of the Validator introduces multicore dataset processing: on computers with multiple processors or logical cores, more than one dataset can be validated simultaneously. However, because most user machines have only one or two cores and validation is very processor-intensive, the default configuration processes only one dataset at a time.

Users with more powerful hardware can change this setting by editing the settings.properties file in the lib/properties directory. The value of Engine.ThreadCount controls how many cores the program uses. Set it to a number to use a fixed number of cores, or to auto to let the program detect and use all available cores.
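The two forms of the setting can be illustrated with a short Python sketch. This is not the Validator's own code: the function names, the sample file content, and the fallback to a single core when the core count cannot be detected are all assumptions made for illustration. Only the Engine.ThreadCount key and its two value forms (a number or auto) come from this guide.

```python
import os

# Hypothetical content of lib/properties/settings.properties. Only the
# Engine.ThreadCount key and its value forms are taken from this guide.
SAMPLE_SETTINGS = """
# use four cores
Engine.ThreadCount=4
"""


def read_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def resolve_thread_count(value, available_cores=None):
    """Turn an Engine.ThreadCount value into a number of cores to use."""
    if available_cores is None:
        # os.cpu_count() may return None; falling back to 1 is an assumption.
        available_cores = os.cpu_count() or 1
    if value.strip() == "auto":
        return available_cores
    count = int(value)  # raises ValueError for anything but a whole number
    if count < 1:
        raise ValueError("thread count must be at least 1")
    return count


if __name__ == "__main__":
    props = read_properties(SAMPLE_SETTINGS)
    print(resolve_thread_count(props["Engine.ThreadCount"]))
```

With the sample content above the script prints 4. Changing the value to auto would print the number of logical cores Python detects on the machine.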

Note: Processing the datasets can be extremely processor-intensive, so we recommend against using all cores in situations where the performance of other critical applications may be compromised.

Increasing Available Memory

Because the OpenCDISC Validator does all of its processing without the help of databases or temporary files, the memory demands for very large datasets can be high. Currently, the default memory limit for a validation run is 1024 MB (1 GB), a fairly safe "standard" value that allows the Validator to run on most modern workstations and laptops. However, this setting limits the size of the datasets that can be processed.

Tests by our development team suggest that, with this memory limit, the largest single dataset that can be handled is around two million records. Available memory does not limit the total number of datasets that can be processed, although more memory is required if you choose to process more than one dataset at once using the multithreading technique described in the previous section.

Note: Running the OpenCDISC Validator (or any Java program) on a 64-bit Java Virtual Machine (JVM) may require slightly more memory than running on a 32-bit runtime.

Users whose machines have several gigabytes of RAM installed may find it useful to increase this memory limit to support studies containing large datasets. How far the limit can be raised depends on the platform. On a 32-bit version of Windows®, it can only be increased to about 1500 MB. On a 64-bit OS running a 64-bit JVM, however, it can be increased to about 75 percent of available RAM.
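As a worked example of these limits, the following Python sketch computes a candidate value for the memory setting. The function name is hypothetical, and two assumptions are made: installed RAM is treated as "available" RAM, and the 75 percent guideline is also applied on 32-bit Windows before capping the result at 1500 MB.

```python
def suggest_max_heap_mb(total_ram_mb, is_64bit):
    """Return a candidate maximum heap size in MB for the limits above."""
    three_quarters = (total_ram_mb * 3) // 4
    if is_64bit:
        # 64-bit OS with a 64-bit JVM: about 75 percent of RAM.
        return three_quarters
    # 32-bit Windows: about 1500 MB at most, however much RAM is installed.
    # Applying the 75 percent guideline here as well is an assumption.
    return min(1500, three_quarters)


if __name__ == "__main__":
    # A 64-bit machine with 8 GB of RAM.
    print("-Xmx%dm" % suggest_max_heap_mb(8192, True))
    # A 32-bit Windows machine with 4 GB of RAM.
    print("-Xmx%dm" % suggest_max_heap_mb(4096, False))
```

The first call gives -Xmx6144m and the second gives -Xmx1500m. Treat these as starting points only, since other applications on the same machine also need memory.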

To make this change, edit the client.bat file and change the -Xmx1024m entry to a higher value. For example, to increase the maximum available memory to three gigabytes, replace that entry with -Xmx3072m, so that the line

START /B javaw -XX:+HeapDumpOnOutOfMemoryError -Xms256m -Xmx1024m -jar lib/validator-gui-1.1.jar

becomes

START /B javaw -XX:+HeapDumpOnOutOfMemoryError -Xms256m -Xmx3072m -jar lib/validator-gui-1.1.jar
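The same edit can be scripted. The Python sketch below is a hypothetical helper, not part of the Validator distribution. It rewrites the -Xmx entry in a launch line such as the one above and leaves every other option, including -Xms256m, untouched. The function and constant names are illustrative only.

```python
import re

# Matches a JVM maximum heap option such as -Xmx1024m or -Xmx2g.
XMX_PATTERN = re.compile(r"-Xmx\d+[kKmMgG]?")


def set_max_heap(command_line, megabytes):
    """Return command_line with its -Xmx option set to the given size in MB."""
    new_line, replaced = XMX_PATTERN.subn("-Xmx%dm" % megabytes, command_line)
    if replaced == 0:
        raise ValueError("no -Xmx option found in the command line")
    return new_line


if __name__ == "__main__":
    original = ("START /B javaw -XX:+HeapDumpOnOutOfMemoryError "
                "-Xms256m -Xmx1024m -jar lib/validator-gui-1.1.jar")
    print(set_max_heap(original, 3072))
```

Run as a script, this prints the modified launch line with -Xmx3072m in place of -Xmx1024m. Applying it to the real client.bat would also require reading and writing the file, which is left out here, and keeping a backup copy of the original file is advisable.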

Other Considerations

Users who load their source data from network shares may find that adjusting the Engine.InputBufferSize setting in the settings.properties file improves performance somewhat, at the cost of a little more memory. Ultimately, though, poor performance in this situation is generally caused by the responsiveness of the network share, which is outside the program's control.

Also, users running the Validator on a machine that they use for other tasks should remember that using other programs during a validation may make it take longer, because the computer has to switch between the running applications.

Important: These settings are read when the program is first launched, so changes to the files mentioned in this document require a restart of the Validator to take effect.