Sharing datasets in a research community has a number of advantages. It allows for meaningful comparison between systems and enables researchers to spend more time on developing state-of-the-art methods as opposed to collecting and cleaning data.

If a dataset is useful for a specific task, a challenge can be created so that other people can try to solve the same problem.

The datasets here should not require sign-up for web services or writing emails to their authors or third parties.

Python ASTs

This dataset includes 100'000 + 50'000 Python files as parsed abstract syntax trees, along with the code of the parser (which wraps the built-in Python AST parser).
[download dataset]
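
Wrapping Python's built-in parser to obtain a serializable tree can be sketched as follows. This is a minimal illustration and not the parser shipped with the dataset: the function name `ast_to_nodes` and the output layout (a flat list of nodes with `type`, an optional `value`, and `children` indices) are assumptions made for this example.

```python
import ast
import json


def ast_to_nodes(source):
    """Flatten a Python AST into a list of dicts, one per node (pre-order).

    The layout is hypothetical; the dataset's own format may differ.
    """
    nodes = []

    def visit(node):
        index = len(nodes)
        entry = {"type": type(node).__name__}
        nodes.append(entry)
        # Record an identifier where the node carries one.
        if isinstance(node, ast.Name):
            entry["value"] = node.id
        elif isinstance(node, ast.FunctionDef):
            entry["value"] = node.name
        # Children are stored as indices into the flat node list.
        children = [visit(child) for child in ast.iter_child_nodes(node)]
        if children:
            entry["children"] = children
        return index

    visit(ast.parse(source))
    return nodes


example = "def add(a, b):\n    return a + b\n"
print(json.dumps(ast_to_nodes(example)))
```

Storing children as indices keeps each node small and lets the whole file be written as a single JSON array.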

JavaScript ASTs

This dataset includes 100'000 + 50'000 JavaScript files. The data is available as JavaScript source and as parsed abstract syntax trees (parsed with acorn.js and serialized in a JSON format as described in the link).
[download dataset]
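
A JSON-serialized syntax tree can be explored with a few lines of Python. The sketch below assumes nested ESTree-style objects, where every node is a JSON object with a `type` field, which is the shape acorn itself produces. The dataset's actual serialization is the one described at the link and may differ, so the helper `count_node_types` and the sample tree are illustrative only.

```python
import json
from collections import Counter


def count_node_types(node, counts=None):
    """Count node types in a nested, ESTree-style JSON tree."""
    if counts is None:
        counts = Counter()
    if isinstance(node, dict):
        # Any object with a string "type" field is treated as an AST node.
        if isinstance(node.get("type"), str):
            counts[node["type"]] += 1
        for value in node.values():
            count_node_types(value, counts)
    elif isinstance(node, list):
        for item in node:
            count_node_types(item, counts)
    return counts


# Hand-written sample tree for `var x = 1;` (an assumption, not dataset content).
sample = json.loads("""
{"type": "Program", "body": [
  {"type": "VariableDeclaration", "kind": "var", "declarations": [
    {"type": "VariableDeclarator",
     "id": {"type": "Identifier", "name": "x"},
     "init": {"type": "Literal", "value": 1}}]}]}
""")
print(count_node_types(sample))
```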

Java GitHub corpus

This dataset includes about 14'000 Java files from GitHub, split into a training set and a test set. The files are from open-source projects that have been forked at least once.
[download dataset]

Java Variable and Method Naming Dataset and Embeddings

This dataset includes the Java source code and JSON-serialized files containing its tokens and the locations of the tokens that refer to the same variable. It also contains pre-trained embeddings for variable and method names.
[download dataset]

Estimating Types in Stripped Binaries Dataset

This dataset includes 20 stripped binaries used as benchmarks for evaluating the estimation of object types and virtual function call targets. It also contains the ground truth determined for the object types. The binaries were collected from publicly available open-source projects, compiled with default optimizations, and stripped. The dataset is provided in the form of dumps of MongoDB databases holding the data (see README for further details).
[download dataset]

Similarity of code fragments Dataset

This dataset includes 3 different collections that provide pairs of code fragments with our tool's similarity score, the users' similarity score, and the code's metadata.

The dataset is provided as a MongoDB export (mongoexport) of the database holding the data (see README for further details).
[download dataset]
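
If the export uses mongoexport's default output of one JSON document per line, it can be read without a running MongoDB instance. The sketch below works under that assumption. The field names `tool_score` and `user_score` and the sample records are made up for illustration, so consult the README for the real schema.

```python
import io
import json


def read_export(lines):
    """Yield one dict per non-empty line of a mongoexport-style JSON file."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical records; the field names are not taken from the dataset.
sample_export = io.StringIO(
    '{"_id": {"$oid": "000000000000000000000001"}, "tool_score": 0.8, "user_score": 0.6}\n'
    '{"_id": {"$oid": "000000000000000000000002"}, "tool_score": 0.2, "user_score": 0.3}\n'
)
records = list(read_export(sample_export))

# Mean absolute gap between the tool's score and the users' score.
mean_gap = sum(abs(r["tool_score"] - r["user_score"]) for r in records) / len(records)
print(len(records), mean_gap)
```

With a real file, `open(path, encoding="utf-8")` can be passed in place of the in-memory `io.StringIO` object.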

Method Naming Dataset

This dataset includes the Java source code and JSON files containing the names and the tokens of the methods of 11 of the most popular GitHub Java projects.

[download dataset]

Parallel Django Dataset

This dataset includes all source code of the Python web framework Django, with line-by-line English annotations.
[download dataset]
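
A line-by-line parallel corpus is naturally handled as a list of (code, annotation) pairs. The following sketch assumes the code lines and annotation lines are available as two equally long sequences. The helper `pair_lines` and the two sample lines are hypothetical and say nothing about how the download is actually laid out.

```python
def pair_lines(code_lines, annotation_lines):
    """Pair each code line with the annotation on the same line number."""
    if len(code_lines) != len(annotation_lines):
        raise ValueError("code and annotations must have the same number of lines")
    return [
        (code.rstrip("\n"), text.rstrip("\n"))
        for code, text in zip(code_lines, annotation_lines)
    ]


# Made-up sample lines for illustration only.
code = ["import os\n", "x = 1\n"]
annotations = ["import module os.\n", "x is an integer 1.\n"]

for source_line, description in pair_lines(code, annotations):
    print(f"{source_line!r} -> {description!r}")
```

Checking the lengths up front catches misaligned inputs early, since a silent offset would pair every later line with the wrong annotation.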
