Record-Replay Test Mechanism for GPUs

Project Description

Record-replay analysis is a software debugging technique for accurately reconstructing the execution scenarios that causes a software application to fail because of a nondeterministically occurring bug. By identifying and logging the application's inputs and internal states during execution, this technique enables developers to reliably reproduce reported execution failures in order to identify and eliminate their causes [Bell2012].

The complexity of neural models that comprise a nontrivial number of neurons and synapses makes the analysis of unexpected model outputs almost inevitable. We believe that the record-replay technique may be useful in improving the accuracy of such models by enabling the designers of these models to reliably reconstruct unusual model outputs elicited by unanticipated input stimuli. Since our neural models will be run on NVIDIA's Graphics Processing Units (GPUs) using a software framework written in Python and CUDA, we need to implement a record-replay mechanism that can be used with GPU-based code.

Possible Project Goals

A mechanism that automatically instruments GPU-based code (possibly after being translated to PTX) to log states and inputs needed to reconstruct executions leading to failure. This mechanism might be implemented as an addition to the run-time code generation (RTCG) support in PyCUDA or Theano.
A means of tracing execution errors back to the code that failed to run correctly.

Skills Gained

Experience developing parallel programming using both NVIDIA CUDA and Python.
Exposure to powerful open-source scientific software packages such as Scientific Python
Experience implementing and using novel software testing methodologies for novel applications.