150k JavaScript Dataset

This dataset is released as part of the Machine Learning for Programming project, which aims to create new kinds of programming tools and techniques based on machine learning and statistical models learned over massive codebases. For more information about the project, tools, and other resources, please visit the main project page.

Overview

We provide a dataset consisting of 150'000 JavaScript files and their corresponding parsed ASTs that were used to train and evaluate the DeepSyn tool. The JavaScript programs were collected from GitHub repositories. We removed duplicate files and project forks (copies of another existing repository), kept only programs that parse, and aimed to remove obfuscated files. For parsing we used the error-tolerant Acorn parser (using the parse_dammit interface). The dataset is split into two parts: 100'000 files used for training and 50'000 files used for evaluation.
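This page does not say how duplicate files were detected. Purely as an illustration, exact duplicates can be dropped by hashing file contents. The helper name drop_exact_duplicates and the choice of SHA-256 are assumptions made for this sketch, not a description of the authors' pipeline.

```python
import hashlib
from pathlib import Path
from typing import Iterable, List


def drop_exact_duplicates(paths: Iterable[Path]) -> List[Path]:
    """Keep the first file seen for each distinct byte content.

    Hypothetical helper: only byte-identical files count as duplicates.
    """
    seen = set()
    unique = []
    for path in paths:
        path = Path(path)
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(path)
    return unique
```

A scheme like this would miss near-duplicates (for example, files that differ only in whitespace), so it should be read as a lower bound on what a real cleaning step needs.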

Download

Below you can download an archive of the dataset. The archive contains the following files:
  • [478MB] data.tar.gz -- 150'000 JavaScript source files
  • [6.6MB] programs_training.txt -- List of 100'000 filenames (from data.tar.gz) used to build the training dataset, one per line
  • [3.3MB] programs_eval.txt -- List of 50'000 filenames (from data.tar.gz) used to build the evaluation dataset, one per line
  • [11GB] programs_training.json -- Parsed ASTs in JSON format for the files in programs_training.txt, one per line
  • [4.8GB] programs_eval.json -- Parsed ASTs in JSON format for the files in programs_eval.txt, one per line
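
Because the JSON files are several gigabytes in size, it is usually more practical to stream them line by line than to load them whole. The sketch below assumes that line i of a .json file corresponds to line i of the matching .txt file. The listing above suggests this ordering but does not state it, so it is worth verifying on your copy. The name iter_programs is a hypothetical helper.

```python
import json
from typing import Iterator, List, Tuple


def iter_programs(list_path: str, json_path: str) -> Iterator[Tuple[str, List]]:
    """Yield (filename, serialized AST) pairs without loading whole files.

    Assumes both files list programs in the same order, one per line.
    """
    with open(list_path, encoding="utf-8") as names, \
         open(json_path, encoding="utf-8") as asts:
        # zip stops at the shorter file, so a truncated download
        # yields fewer pairs rather than raising an error.
        for name, line in zip(names, asts):
            yield name.strip(), json.loads(line)
```

For example, iter_programs("programs_eval.txt", "programs_eval.json") could be used in a for loop to process the evaluation set one program at a time.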

Version 1.0 (TAR archive)
Published research using this dataset may cite the following paper:
Raychev, V., Bielik, P., Vechev, M. and Krause, A. Learning Programs from Noisy Data. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (2016), POPL ’16, ACM.

Format

We now briefly explain the JSON format into which each JavaScript AST is serialized. Note that the files programs_training.json and programs_eval.json contain one serialized AST per line. As an example, given a simple program:
  console.log("Hello World!");
The serialized AST is as follows:
[ { "id":0, "type":"Program", "children":[1] }, 
    { "id":1, "type":"ExpressionStatement", "children":[2] }, 
      { "id":2, "type":"CallExpression", "children":[3,6] }, 
        { "id":3, "type":"MemberExpression", "children":[4,5] }, 
          { "id":4, "type":"Identifier", "value":"console" }, 
          { "id":5, "type":"Property", "value":"log" }, 
        { "id":6, "type":"LiteralString", "value":"Hello World!" }, 0]
As can be seen, the JSON contains an array of objects followed by the number 0. Each object contains several name/value pairs:
  • (Required) id: unique integer identifying the current AST node.
  • (Required) type: string containing the type of the current AST node.
  • (Optional) value: string containing the value (if any) of the current AST node.
  • (Optional) children: array of integers giving the ids of the children (if any) of the current AST node.
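
To make the format concrete, here is a minimal sketch that indexes the nodes of one serialized line by id and lists the tree depth-first. It relies only on the four fields described above. The helper names load_nodes and render are illustrative, and treating id 0 as the root is an assumption taken from the example, where the Program node has id 0.

```python
import json
from typing import Dict, List


def load_nodes(line: str) -> Dict[int, dict]:
    """Parse one serialized AST and index its nodes by id."""
    items = json.loads(line)
    # The array ends with the number 0, which is not a node; keep only objects.
    return {item["id"]: item for item in items if isinstance(item, dict)}


def render(nodes: Dict[int, dict], root_id: int = 0) -> List[str]:
    """Return an indented depth-first listing of the tree.

    Uses an explicit stack instead of recursion so that deeply nested
    programs do not hit Python's recursion limit.
    """
    lines = []
    stack = [(root_id, 0)]
    while stack:
        node_id, depth = stack.pop()
        node = nodes[node_id]
        label = node["type"]
        if "value" in node:  # value is optional
            label += ": " + str(node["value"])
        lines.append("  " * depth + label)
        # Push children in reverse so the first child is visited first.
        for child_id in reversed(node.get("children", [])):
            stack.append((child_id, depth + 1))
    return lines


example = (
    '[{"id":0,"type":"Program","children":[1]},'
    '{"id":1,"type":"ExpressionStatement","children":[2]},'
    '{"id":2,"type":"CallExpression","children":[3,6]},'
    '{"id":3,"type":"MemberExpression","children":[4,5]},'
    '{"id":4,"type":"Identifier","value":"console"},'
    '{"id":5,"type":"Property","value":"log"},'
    '{"id":6,"type":"LiteralString","value":"Hello World!"},0]'
)
print("\n".join(render(load_nodes(example))))
```

For the Hello World example this prints Program at the top, with ExpressionStatement, CallExpression, and MemberExpression nested beneath it, and the three leaves shown with their values (console, log, and Hello World!).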