150k Python Dataset

This dataset is released as part of the Machine Learning for Programming project, which aims to create new kinds of programming tools and techniques based on machine learning and statistical models learned over massive codebases. For more information about the project, its tools and other resources, please visit the main project page.


We provide a dataset of parsed ASTs that were used to train and evaluate the DeepSyn tool. The Python programs were collected from GitHub repositories and filtered as follows: duplicate files were removed, project forks (copies of another existing repository) were removed, and only programs that parse and have at most 30'000 nodes in the AST were kept. We also aimed to remove obfuscated files. Furthermore, we only used repositories with permissive and non-viral licenses such as MIT, BSD and Apache. For parsing, we used the Python AST parser included in Python 2.7. We also include the parser as part of our dataset. The dataset is split into two parts: 100'000 files used for training and 50'000 files used for evaluation.
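The "parses and has at most 30'000 AST nodes" criterion can be illustrated with a minimal sketch. This is not the filter used to build the dataset: the function name keep_program is hypothetical, the sketch uses the ast module of Python 3 rather than the Python 2.7 parser, and it counts nodes with ast.walk, which may not match the node count of the serialized format. Python 2 only syntax (for example a print statement) is rejected by this sketch.

```python
import ast

MAX_NODES = 30000  # node limit stated in the dataset description


def keep_program(source, max_nodes=MAX_NODES):
    """Return True if `source` parses and its AST has at most `max_nodes` nodes.

    Hypothetical helper for illustration; not part of the released dataset.
    """
    try:
        tree = ast.parse(source)
    except (SyntaxError, ValueError):
        # The source does not parse, so it is dropped.
        return False
    # ast.walk yields every node in the tree, including the root.
    node_count = sum(1 for _ in ast.walk(tree))
    return node_count <= max_nodes


print(keep_program("x = 7\n"))
print(keep_program("x = = 7\n"))
```

Counting nodes after a successful parse keeps both checks in one pass over each file.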


Below you can download an archive of the dataset. The archive contains the following files:
  • parse_python.py -- The parser that we used to convert each Python source file into the JSON in this dataset.
  • python100k_train.json -- Parsed ASTs in JSON format. This is a dataset for training.
  • python50k_eval.json -- Parsed ASTs in JSON format. This is a dataset for evaluation.

Version 1.0 [526.6MB]
Published research using this dataset may cite the following paper:


Now we briefly explain the JSON format in which each AST is stored. The python100k_train.json and python50k_eval.json files contain one such JSON document per line. As an example, given a simple program:
x = 7
print x+1
The serialized AST is as follows (here we show it pretty-printed, but the entire JSON is on a single line in the data):
[ {"type":"Module","children":[1,4]},
        {"type":"Num","value":"1"} ]
As can be seen, the JSON contains an array of objects. Each object contains several name/value pairs:
  • (Required) type: string containing the type of the current AST node.
  • (Optional) value: string containing the value (if any) of the current AST node.
  • (Optional) children: array of integers denoting the indices of the children (if any) of the current AST node. Indices are 0-based, starting from the first node in the JSON.
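To make the format concrete, here is a minimal Python 3 sketch that reads one AST per line and walks the tree by following the children indices. The function names iter_asts, walk and render are hypothetical and not part of the dataset; the sketch assumes that the root is the node at index 0, as in the example above, and the two-node sample is made up purely to exercise the code.

```python
import json


def iter_asts(path):
    """Yield one AST (a list of node dicts) per non-empty line of `path`."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def walk(nodes, root=0):
    """Yield (depth, node) pairs in pre-order by following `children` indices.

    Assumes the root is the node at index 0. An explicit stack is used so
    that deep trees do not hit Python's recursion limit.
    """
    stack = [(root, 0)]
    while stack:
        index, depth = stack.pop()
        node = nodes[index]
        yield depth, node
        # Push children in reverse so they are popped left to right.
        for child in reversed(node.get("children", [])):
            stack.append((child, depth + 1))


def render(nodes):
    """Return an indented, one-node-per-line text view of the tree."""
    lines = []
    for depth, node in walk(nodes):
        label = node["type"]
        if "value" in node:
            label += " " + str(node["value"])
        lines.append("  " * depth + label)
    return "\n".join(lines)


# Hypothetical two-node AST, built only to exercise the functions above.
sample = json.loads('[{"type":"Module","children":[1]},{"type":"Num","value":"1"}]')
print(render(sample))
```

With the archive unpacked, the same functions could be applied to the real data, for example by calling render on each AST yielded by iter_asts("python100k_train.json"). Reading line by line avoids loading the whole file into memory.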