Student projects for supervision by Hayo Thielecke

Generally, I am interested in supervising projects that are about programming languages (C/C++, OCaml, Agda, Scala, Rust, even Javascript...), such as analyzing or translating programs. If you have your own project idea related to the above, please feel free to discuss it.

Formalizing correctness properties of RXXR2 in Agda

Usage patterns of templates in real-world C++ code

Gamification in teaching computer science

Browser games (or mobile apps) that help sutdents learn computer science concepts and techniques. See for example the regular expressions game that was implemented in Javascript.

Traceabily enhancements for LLVM

(Suggested by Asiri at ARM.) Modern compilers perform such aggressive optimisations on programs that render the correspondence between the input sources and the final binaries quite un-obvious for human understanding. Yet, certain sensitive application domains (such as aviation) demand that final binaries be manually inspected to ensure that they are semantically equivalent to the input sources. Traditionally, debug information allows one to establish binary-to-source correspondence, but this technique cannot be relied upon at higher optimisation levels, where the most advanced transformations make it extremely difficult to maintain a useful debug view (e.g. code-motion). Moreover, the resource-constrained nature of the concerned application domains makes it infeasible to compile programs at lower optimisation levels.
The aim of this project is to develop tools/techniques to aid in the manual verification of programs compiled at higher optimisation levels. A good starting point is the --print-after=all option provided by clang that dumps the LLVM IR output for each of the optimisation passes involved. This is a rudimentary debug routine which is neither user-friendly nor complete enough to be deployed in a production environment, but one can imagine ways to build on this option.

Porting RXXR2 from Ocaml to C++

Out tool RXXR2 automatically finds REDoS vulnerabilities in regular expressions. The target audience and intended users for this tool are systems security people, whose canonical language tends to be C rather than functional languages. The aim of this project is to re-implement RXXR2 in C++ using the current Ocaml implementation as a guideline. It is a good opportunity for students who have taken the C/C++ module to become more fluent by writing a sizeable program in C++.

Contification in LLVM and CPS

Contification is a compiler optimization that turns some function calls into jumps. It has been studied in CPS compilers for functional languages, but LLVM does at least some of it for compiling C, generating some impressively tight machine code. The aim of this project is to compare what LLVM does to contification expressed in CPS, and see if there is scope for LLVM to do more optimization than it currently does.

LLVM memory layout visualizer

Add code to the LLVM compiler that does the following: for each function, draw the layout of its stack frames; and/or for each C struct or C++ object, draw the how its members are laid out in memory. The information is contained in the LLVM internal representation; the output should be in LaTeX or a format suitable for a drawing package.

Adding unicode support to RXXR2

Adding unicode to RXXR2 would be useful. See here for some of the issues.

Integration of RXXR2 into an IDE

The project involves the integration of the static analyser RXXR2 into a popular IDE, such as Eclipse for Java. The aim is that programmers should receive automatic warnings about potential REDoS vulnerabilities. When using the IDE, the user's code should be automatically searched for uses of regular expressions by running RXXR2 in the background. For example, certain arguments to API calls in the Java regular expression libraries should be extracted and fed to RXXR2.

Fast parallel regular expression matching on GPUs

In much of the literature you will see the claim that GPUs can match regular expression in log time. This seems impossible at first, but there is some ingenious theory that makes it work in principle, namely Blelloch’s prefix sum operation. I would like to see some experiments to gauge how practical such fast parallel matching actually is. The project consists of implementing the lexer sketched in Steele and Hillis’s paper in CUDA or OpenGL. Potential applications may be in areas such as biology rather than CS, e.g. finding patterns in DNA.

Verifying abstract machines in Agda

Various abtract machine could be written in Agda: the machines from our reg exp DoS work, or for LR parsing.

C++ templates and design

Traditional object-oriented design patterns emphasize inheritance. However, in modern C++ the importance of parametrization using templates has increased; see for example the Standard Template Library. This topic will investigate high-level program design using C++11 templates, with emphasis on type safety and correctness.

Safe systems-level programming in Rust

C is still the language of choice for programming operating systems and other code that needs to be efficient and manage system resources. However, C code is often afflicted with buffer overflows and other memory corruption vulnerabilities. Some new programming languages (such as Rust) aim to give the same power as C while being much safer (eliminating buffer overflows, for instance).

Recursive-ascent parsing

Top-down, or recursive descent, parsers can be written in a pleasantly direct style as a collection of mutually recursive methods in Java. By contrast, bottom-up parsers (like those generated by yacc/bison) are typically table-driven. Recursive ascent generates methods for a bottom-up parser. The project would explore writing such parsers in C/C++, OCaml or Haskell.

Regular expression learning game

The aim is to develop a game that helps students learn about regular expressions, which is something every CS student in the world has to master. It may not sound like fun at first, and it would certainly be possible to make a very tedious game. However, the aim is to make the game as interesting as possible by coming up with new challenges that are appropriate for the player’s level. At the lowest level, the questions can be fairly obvious, such as: does string w match reg exp e. As the player progresses, more challenges can be unlocked (and badges won), such as: find a string w that matches both of e1 and e2, or only e1 and not e2; or: write an the expression e (perhaps of minimal size) such that two given strings w1 and w2 are matched. On the technical level, the reason that a game like this can work is that regular expressions are closed under many operations like union, intersection and negation, so the game can determine correct answers even for quite complex challenges. The game can be implemented in any language that has a good regular expression library, e.g. Java, C++, or Haskell. If implemented in Javascript, it could be brower game, and thus easy to use without having to install anything.

Eliminating left recursion and parsing actions

Some popular parser generators such as ANTLR will not work on grammars with left recursion. There is a standard technqiue for eliminating left recursion from grammars. However, a parser in ANTLR also contains semantic actions in the form of Java or C code. These actions also need to be transformed. Doing so is the aim of this project, by programming an appropriate pre-processor for ANTLR. It should be based on the L-calculus in my 2014 paper “On the Semantics of Parsing Actions” in Science of Computer Programming (see my publications page).