This is a list of useful libraries for developing new “Big Code” tools.

Add your library by creating a pull request here.


codemining-* is a suite of Java-based tools for tokenizing and parsing Java code. The repository also contains code for analyzing Git-based repositories.
  • codemining-core contains code for tokenizing Java, JavaScript, Python, C and C++ on the JVM.
  • codemining-treelm contains Java AST parsing and tree-level language models.
  • commitmining-tools contains tools for traversing a Git repository, its history and, optionally, its files.

Tags: #codeanalysis


bigcode-tools is a suite of tools to fetch, parse and process source code. It also contains a utility for generating vector embeddings from source code. It currently supports Python 2 and 3, Java and JavaScript. The tools are designed to be compatible with the py150 and js150 datasets.
Tags: #codeanalysis #embeddings