Mining software repositories to detect incomplete systematic changes

  • An interest in analysing the history of software projects and data mining

During the development of a software project developers often make systematic changes. A systematic change is a set of similar modifications that are applied in several different locations. Refactorings are typical examples of systematic changes. While IDEs can automate certain refactorings, such as renaming methods, developers often need to manually perform systematic changes.

Such systematic changes can happen in a single commit, e.g. when a large-scale refactoring is performed and several source files need to be modified in a similar manner. They can also happen across several commits: for example, the developer may need to make multiple similar changes manually to fix a bug or add a new feature. The developer may intend to make all of these changes in one commit, but forget a few instances and have to apply the remaining ones in a later commit.

To study how often such systematic changes actually occur in practice, and to assist developers in detecting missed instances of systematic changes, this thesis is concerned with automatically finding occurrences of systematic changes in the history of software projects. Last year, an initial prototype of a "frequent change pattern mining tool" was developed during a Master's thesis. This tool looks for similar groups of fine-grained changes within a single commit, and hence can be used to detect systematic changes.

To retrieve these fine-grained changes, a change distilling algorithm was used. Such an algorithm takes as input two revisions of a file, and outputs a sequence of changes (each one an insert, a move, a delete or an update of an AST node) that transforms the first revision into the second revision. The change pattern mining tool then uses a frequent itemset mining algorithm to look for patterns in these changes.
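The mining step can be illustrated with a minimal sketch. It is not the prototype's implementation: the `Change` record, the AST node type labels, the method names and the brute-force enumeration are all assumptions made for illustration. Each method is treated as one transaction containing the changes distilled for it, and a set of changes counts as a pattern when it occurs in at least `min_support` transactions.

```python
from collections import Counter
from itertools import combinations
from typing import NamedTuple


class Change(NamedTuple):
    """One fine-grained change, as a change distiller might report it (hypothetical shape)."""
    operation: str   # "insert", "move", "delete" or "update"
    node_type: str   # AST node type label, assumed for illustration


def frequent_itemsets(transactions, min_support):
    """Return every set of changes occurring in at least min_support transactions.

    Brute-force enumeration of all subsets: acceptable for a few small
    transactions, but exponential in transaction size, so a real tool
    would use an algorithm such as Apriori or FP-growth instead.
    """
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for size in range(1, len(items) + 1):
            for itemset in combinations(items, size):
                counts[frozenset(itemset)] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


def maximal(itemsets):
    """Keep only itemsets that are not strictly contained in another frequent itemset."""
    return {s: n for s, n in itemsets.items()
            if not any(s < other for other in itemsets)}


# Hypothetical changes distilled for three methods touched by one commit.
null_check = Change("insert", "IfStatement")
log_call = Change("insert", "MethodInvocation")
rename = Change("update", "SimpleName")

changes_per_method = {
    "Parser.parse": [null_check, log_call],
    "Parser.reset": [null_check, log_call, rename],
    "Lexer.next": [rename],
}

patterns = maximal(frequent_itemsets(changes_per_method.values(), min_support=2))
for itemset, support in patterns.items():
    print(sorted(itemset), "occurs in", support, "methods")
```

In this toy input, the null check and the logging call are inserted together in two methods, so they surface as one candidate pattern. Note that an itemset ignores the order of the changes, whereas a change distiller outputs a sequence.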

Goal & Research Activities: 

While the initial prototype of the tool is a good starting point, there are several areas that can be extended and improved upon in this thesis:

First, the current implementation is limited to detecting patterns that occur at the level of methods. We would like to extend this to patterns occurring in a complete file, in a complete revision or even across multiple revisions.
Second, the current implementation returns instances of a pattern (namely different concrete change sequences implementing a pattern), but the actual pattern must be retrieved by manually inspecting these instances. The automatic generation of such patterns has several technical challenges that need to be overcome.
Third, the tool can currently only be used to detect systematic changes, but it does not report whether a systematic change has any missing instances. This could be done by generalizing the locations where the instances of a systematic change occur, and then looking for similar locations where the change was not applied.
Finally, the current implementation has only been tested on a few toy examples. As such, a study on detecting unknown change patterns in open-source projects must be performed. As a starting point, a set of commits containing known patterns can be used to see whether the tool can identify them. Our lab also has several tools to navigate and query the history of a software project that may be helpful.
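One possible reading of the third point, generalizing the locations of a systematic change to find missed instances, is sketched below. This is an assumption about how it could work, not a design taken from the prototype. The feature strings, the method names and both helper functions are hypothetical. Each location is described by a set of features, the changed locations are generalized to the features they all share, and every unchanged location with those same features is reported as a candidate missed instance.

```python
def common_features(changed_locations, features):
    """Generalize the changed locations to the set of features they all share."""
    feature_sets = [features[loc] for loc in changed_locations]
    if not feature_sets:
        return set()
    return set.intersection(*feature_sets)


def missed_candidates(changed_locations, features):
    """Return unchanged locations that have every feature shared by the changed ones."""
    shared = common_features(changed_locations, features)
    if not shared:
        # Nothing in common: the generalization is too weak to suggest anything.
        return []
    return sorted(loc for loc, loc_features in features.items()
                  if loc not in changed_locations and shared <= loc_features)


# Hypothetical features per method; a real tool would derive them from the AST.
features = {
    "Parser.parse": {"calls:Stream.read", "class:Parser"},
    "Parser.reset": {"calls:Stream.read", "class:Parser", "returns:void"},
    "Lexer.next": {"calls:Stream.read", "class:Lexer"},
    "Lexer.peek": {"class:Lexer"},
}

# Locations where the systematic change was applied in the commit.
changed = {"Parser.parse", "Lexer.next"}

candidates = missed_candidates(changed, features)
print(candidates)
```

Intersecting features is the simplest possible generalization and will both over- and under-approximate in practice, so any location it reports is only a suggestion for the developer to inspect.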