Probabilistic Models for Code

As part of our research at the intersection of machine learning and programming languages, we develop new program synthesis techniques based on domain-specific languages to create precise probabilistic models over structures such as trees. This idea has proved useful for building probabilistic models of source code. Here we provide several already synthesized models, together with the code needed to replicate our results. The code can also serve as a basis for other tools built on top of these models.

Code to replicate results

The code and the trained models are available at: https://github.com/eth-srl/ModelsPHOG

Precision for code completion tasks

We provide code and models, with their respective accuracy, for several code completion tasks. Running the commands below requires first downloading the corresponding datasets listed under "Training datasets".
  • JavaScript dataset
    • Best terminal prediction (based on the E13 synthesis procedure defined in [1] with additional pruning of the tree; this is a better result than the one shown in [1])
      Accuracy: 85.9%
      Command to replicate:
      bazel-bin/phog/model/evaluate --logtostderr \
          --training_data programs_training.json \
          --evaluation_data programs_eval.json \
          --tgen_program synthesized/js/values_best.tgen
      
    • Best non-terminal prediction (based on the ID3+ synthesis procedure defined in [1]; this is the result shown in [1])
      Accuracy: 83.9%
      Command to replicate:
      bazel-bin/phog/model/evaluate --logtostderr \
          --training_data programs_training.json \
          --evaluation_data programs_eval.json \
          --tgen_program synthesized/js/types_id3.tgen \
          --is_for_node_type
      
  • Python dataset
    • Best terminal prediction (based on the E13 synthesis procedure defined in [1]; this is the result shown in [1])
      Accuracy: 69.2%
      Command to replicate:
      bazel-bin/phog/model/evaluate --logtostderr \
          --training_data python100k_train.json \
          --evaluation_data python50k_eval.json \
          --tgen_program synthesized/py/values_e13.tgen
      
    • Best non-terminal prediction (based on the ID3+ synthesis procedure defined in [1]; this is the result shown in [1])
      Accuracy: 76.1%
      Command to replicate:
      bazel-bin/phog/model/evaluate --logtostderr \
          --training_data python100k_train.json \
          --evaluation_data python50k_eval.json \
          --tgen_program synthesized/py/types_id3.tgen \
          --is_for_node_type
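
The four evaluation commands above can be run in one go. The following is a minimal sketch, not part of the released code: the helper names run_eval and run_all and the DRY_RUN variable are hypothetical, and it assumes the dataset files and the synthesized/ directory are reachable from the current directory under the names used in the commands above. The flags themselves are copied unchanged from those commands.

```shell
# Sketch only: run_eval, run_all and DRY_RUN are illustrative names,
# not part of the ModelsPHOG repository.
EVALUATE="bazel-bin/phog/model/evaluate"

# run_eval TRAIN EVAL TGEN [extra flags...]
# Runs one evaluation, or prints the command when DRY_RUN=1.
run_eval() {
  train="$1"; eval_data="$2"; tgen="$3"
  shift 3
  # Rebuild the positional parameters as the full command line.
  set -- "$EVALUATE" --logtostderr \
      --training_data "$train" \
      --evaluation_data "$eval_data" \
      --tgen_program "$tgen" "$@"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
    return 0
  fi
  # Fail early if a dataset or model file has not been downloaded yet.
  for f in "$train" "$eval_data" "$tgen"; do
    [ -f "$f" ] || { echo "missing file: $f" >&2; return 1; }
  done
  "$@"
}

# Runs the JavaScript and Python evaluations in the order listed above,
# stopping at the first failure.
run_all() {
  run_eval programs_training.json programs_eval.json \
      synthesized/js/values_best.tgen &&
  run_eval programs_training.json programs_eval.json \
      synthesized/js/types_id3.tgen --is_for_node_type &&
  run_eval python100k_train.json python50k_eval.json \
      synthesized/py/values_e13.tgen &&
  run_eval python100k_train.json python50k_eval.json \
      synthesized/py/types_id3.tgen --is_for_node_type
}
```

After sourcing the file, `DRY_RUN=1 run_all` prints the four commands without executing them, which is a cheap way to check paths before starting a long evaluation; a plain `run_all` executes them.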
      

Training datasets

150k Python Dataset
Dataset consisting of 150'000 Python ASTs
150k JavaScript Dataset
Dataset consisting of 150'000 JavaScript files and their parsed ASTs

Relevant publications

[1] Probabilistic Model for Code with Decision Trees
Veselin Raychev, Pavol Bielik, Martin Vechev
ACM OOPSLA'16
[2] PHOG: Probabilistic Model for Code
Pavol Bielik, Veselin Raychev, Martin Vechev
ICML'16
[3] Learning Programs from Noisy Data
Veselin Raychev, Pavol Bielik, Martin Vechev, Andreas Krause
ACM POPL'16