Diff Models – A New Way to Edit Code

2023-01-28 05:04:16

CarperAI is releasing a series of diff models: models trained to predict a code diff, trained on millions of commits scraped from GitHub. We are releasing three models of different sizes (350M, 2B, and 6B parameters), all fine-tuned from Salesforce's CodeGen code synthesis models.

The dataset of diffs we scraped to train these models will be released separately in the near future. We hope these models will be useful for suggesting intelligent changes to existing code, controllable through a specific commit message describing the change. We will continue to iterate on our diff models, so stay tuned for further releases.

Read on for more details on how the models were trained, with benchmark results!


A diff model is an autoregressive language model trained on edits to a piece of text, formatted in Unified Diff Format. Given a section of text and a description of the desired change, these diff models can suggest an intelligent change to the text that fits the description, marking the lines added, changed, and deleted in diff format. The primary use case for these models is suggesting changes to code; as such, the models we are releasing are fine-tuned versions of models already trained on code datasets.
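For readers unfamiliar with Unified Diff Format, the following minimal sketch produces one with Python's standard `difflib` module. The toy function and the filename `example.py` are illustrative assumptions and are not taken from the training data.

```python
import difflib


def make_unified_diff(before: str, after: str, filename: str = "example.py") -> str:
    """Return the edit from `before` to `after` in Unified Diff Format."""
    diff_lines = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"a/{filename}",
        tofile=f"b/{filename}",
    )
    return "".join(diff_lines)


# Illustrative toy edit: fix a wrong binary operator.
before = "def add(a, b):\n    return a - b\n"
after = "def add(a, b):\n    return a + b\n"
print(make_unified_diff(before, after))
```

The printed diff contains a hunk header (`@@ -1,2 +1,2 @@`), one unchanged context line, one removed line prefixed with `-`, and one added line prefixed with `+`. This is the kind of text a diff model is trained to generate.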

Compared to few-shot prompting of normal code generation models, diff models are specialized for suggesting intelligent changes to existing code, particularly for longer pieces of code and where a change is required to follow some natural language description (provided in the form of a commit message).

Prior work by Microsoft Research (Li et al., 2022) and OpenAI (Ray and McCandlish, 2020 [1]; Lehman et al., 2022) identified the potential for diffs as a source of rich data on how to make changes to code, and trained models on diffs, but did not release any diff models or publish an analysis of how to obtain good performance.

[1] Alex Ray and Sam McCandlish, OpenAI. Independent contribution: Training diff models, 2020.

A Diff Dataset

Our dataset for this fine-tune consists of commits from GitHub, obtained using the Google BigQuery Public Dataset, a public, up-to-date snapshot of a huge number of open-source GitHub repositories. We took this dataset and filtered using BigQuery on the number of stars in the repository to exclude repos with fewer than 100 stars, and further restricted the query to only repositories with open-source non-copyleft licenses (e.g. MIT, Apache, etc.) and commits with more than 10 characters in the commit message. We also restricted ourselves to a list of 22 popular programming, scripting, and markup languages, including Python, HTML, Bash scripts, SQL, C++, etc. This resulted in a dataset of 19 million commits after filtering.
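The filtering criteria above can be expressed as a simple predicate. This is only a sketch in Python: the actual filtering was done in BigQuery, and the record fields, license identifiers, and language subset below are hypothetical stand-ins (the post names only a few of the 22 languages).

```python
from dataclasses import dataclass

# Hypothetical subsets for illustration; the full lists are not given in the post.
ALLOWED_LICENSES = {"mit", "apache-2.0"}
ALLOWED_LANGUAGES = {"Python", "HTML", "Shell", "SQL", "C++"}


@dataclass
class CommitRecord:
    """Hypothetical metadata for one commit."""
    repo_stars: int
    license: str
    language: str
    message: str


def keep_commit(commit: CommitRecord) -> bool:
    """Apply the four filters described in the text."""
    return (
        commit.repo_stars >= 100                         # exclude repos with fewer than 100 stars
        and commit.license.lower() in ALLOWED_LICENSES   # non-copyleft licenses only
        and commit.language in ALLOWED_LANGUAGES         # restricted language list
        and len(commit.message) > 10                     # more than 10 characters in the message
    )


commits = [
    CommitRecord(250, "MIT", "Python", "Fix off-by-one error in parser"),
    CommitRecord(40, "MIT", "Python", "Fix off-by-one error in parser"),
    CommitRecord(250, "MIT", "Python", "typo"),
]
kept = [c for c in commits if keep_commit(c)]
print(len(kept))
```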

At this point we had the commit hashes, repository names, and other metadata for the commits we wanted in our dataset. We then ran `git clone` on every repository in our dataset and used a Python script to obtain the raw code files before the diff is applied, together with the diff itself in Unified Diff Format. These were processed into Apache Parquet format using Dask with Apache Arrow to efficiently get the data into a dataframe format, with one row per file changed (e.g. if a diff affected multiple files it was split up), and included only rows where each file + diff was short enough to fit into the context of the language model.
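The "one row per file changed" step can be sketched as follows. This is a simplified illustration and not the script used for the dataset: it assumes the diff text comes from `git diff` or `git show`, so that each file section begins with a `diff --git a/<path> b/<path>` line, and it does not handle paths that themselves contain the substring ` b/`.

```python
from typing import Dict


def split_diff_by_file(diff_text: str) -> Dict[str, str]:
    """Split a multi-file git diff into one diff string per changed file."""
    per_file: Dict[str, str] = {}
    current_name = None
    current_lines = []
    for line in diff_text.splitlines(keepends=True):
        if line.startswith("diff --git "):
            # A new file section starts: store the previous one, if any.
            if current_name is not None:
                per_file[current_name] = "".join(current_lines)
            # Header looks like "diff --git a/path b/path"; keep the b/ path.
            current_name = line.rstrip("\n").split(" b/", 1)[1]
            current_lines = []
        elif current_name is not None:
            current_lines.append(line)
    if current_name is not None:
        per_file[current_name] = "".join(current_lines)
    return per_file


sample = (
    "diff --git a/a.py b/a.py\n--- a/a.py\n+++ b/a.py\n@@ -1 +1 @@\n-x = 1\n+x = 2\n"
    "diff --git a/docs/b.md b/docs/b.md\n--- a/docs/b.md\n+++ b/docs/b.md\n@@ -1 +1 @@\n-old\n+new\n"
)
for name, file_diff in split_diff_by_file(sample).items():
    print(name, len(file_diff.splitlines()))
```

Each resulting entry would then become one dataframe row, paired with the corresponding file contents before the change.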

From there, we processed the dataset into EleutherAI's lm_dataformat, a utility to create compressed data files for efficient language model training. The final format of the data seen by the language model consisted of the filename changed by the diff, the file before changes, the commit message, and the diff itself, all concatenated together with delineating tags in between:

<NME> {filename}
<BEF> {file_before_changes}
<MSG> {commit_message}
<DFF> {diff}

The model is then typically prompted with everything up to and including <DFF>, but you can also optionally include the hunk header of the unified diff format immediately after <DFF>, which specifies exactly which lines the model should change. For example, appending @@ -1,3 +1,9 @@ after the diff tag would instruct the model to change the file starting at line 1, replacing a 3-line span with a 9-line span, for a net addition of 9 - 3 = 6 lines. We do not add these 4 tags as special tokens, since we prioritized leaving the tokenizer unchanged.
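As a minimal sketch of the prompt format and the hunk header arithmetic: the helper names below are hypothetical, and the exact whitespace between tags is an assumption based on the layout shown above.

```python
import re
from typing import Optional


def build_prompt(filename: str, file_before: str, commit_message: str,
                 hunk_header: Optional[str] = None) -> str:
    """Assemble a prompt ending in <DFF>, optionally followed by a hunk header."""
    prompt = f"<NME> {filename}\n<BEF> {file_before}\n<MSG> {commit_message}\n<DFF>"
    if hunk_header is not None:
        prompt += f" {hunk_header}"
    return prompt


# Matches "@@ -old_start[,old_len] +new_start[,new_len] @@".
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def net_lines_added(hunk_header: str) -> int:
    """Return new_len - old_len; an omitted length means 1 in unified diff format."""
    match = HUNK_RE.match(hunk_header)
    if match is None:
        raise ValueError(f"not a unified diff hunk header: {hunk_header!r}")
    old_len = int(match.group(2)) if match.group(2) is not None else 1
    new_len = int(match.group(4)) if match.group(4) is not None else 1
    return new_len - old_len


print(net_lines_added("@@ -1,3 +1,9 @@"))  # 6
print(build_prompt("utils.py", "def f():\n    pass\n", "Add docstring", "@@ -1,3 +1,9 @@"))
```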

The final dataset consisted of 1.4 million files from 19 million commits, which resulted in 1.086 billion tokens after tokenizing with a modified GPT-2 tokenizer to include whitespace tokens, an average of 888 tokens per sample.

Fine-tuning CodeGen

The model suite we worked with as a base was Salesforce's CodeGen series of models, which are decoder-only transformer language models trained to predict the next token in a sequence. These models were first pre-trained on The Pile, an 800GB dataset of diverse text released by EleutherAI, and then further trained on a large dataset of permissively licensed code from GitHub BigQuery in 6 programming languages, before finally being trained on Python-only code from the same source. Note that the code in these pre-training datasets will inevitably overlap to some extent with our diff dataset, although they do not contain diffs.

Salesforce have released variants of their models at 4 scales (350M, 2B, 6B, and 16B parameters) with 3 variants at each scale corresponding to the three different stages of pre-training described above. We chose to fine-tune the "mono" variants at each model scale, meaning the version trained on Python-only code in addition to multi-language code.

To fine-tune these models on our diff dataset, we used HuggingFace's standard fine-tuning script with slight modifications to customize it to CodeGen's architecture, using the default hyperparameters and without freezing any layers. To pre-process the data we concatenated each sample (file with changes) together in the format described above and cut it into chunks of 2048 tokens (the context length of the CodeGen models). We then fine-tuned all the model sizes with this dataset as an initial trial run and baseline for further experiments. For all fine-tuning experiments in this post, we used 64 Nvidia A100 GPUs; we thank Stability AI for access to their compute resources!
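One common reading of this pre-processing step is sketched below: tokenized samples are concatenated into a single stream and cut into fixed-size blocks. The function name is hypothetical, and dropping the trailing partial block is an assumption, not something stated in the post.

```python
from typing import Iterable, List


def chunk_token_stream(samples: Iterable[List[int]], block_size: int = 2048) -> List[List[int]]:
    """Concatenate tokenized samples and cut the stream into blocks of `block_size` tokens."""
    stream = [token for sample in samples for token in sample]
    # Assumption: the final partial block is dropped so every block is full length.
    usable = (len(stream) // block_size) * block_size
    return [stream[i:i + block_size] for i in range(0, usable, block_size)]


# Tiny illustration with block_size=4 instead of 2048.
blocks = chunk_token_stream([[1, 2, 3], [4, 5, 6, 7, 8], [9]], block_size=4)
print(blocks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```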

To test a range of hyperparameters, we did a 12-run sweep with the 350M model across a range of learning rates and batch sizes, and settled on a learning rate of 3e-5 and a batch size of 1024 samples.

Token Masking

We then experimented with masking tokens in the loss computation, as described in the ELM paper. Specifically, we include only the tokens in the diff (after the tag <DFF>) in the loss, which is intended to encourage the model to predict the diff and not memorize the file and commit message. For example, we expect that filenames in <NME> and file contexts in <BEF> are given by the prompt, while <DFF> is the only target in diff generation. Therefore, it is natural to ignore unrelated prediction targets and exclude tokens outside <DFF> from the computation of the loss function. We fine-tuned the full suite of models with this modification to compare the results across model scale.
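A minimal sketch of such loss masking follows, assuming the common convention of marking ignored positions with the label -100 (the default `ignore_index` of PyTorch's cross-entropy loss). Because the tags are not special tokens, <DFF> is represented here as a sequence of token ids to search for; the function name and the toy ids are hypothetical.

```python
from typing import List

IGNORE_INDEX = -100  # label value skipped by PyTorch's CrossEntropyLoss by default


def mask_labels(input_ids: List[int], dff_tag_ids: List[int]) -> List[int]:
    """Copy input_ids to labels, ignoring everything up to and including the <DFF> tag.

    If the tag is not found (e.g. it was cut off by chunking), every position is ignored.
    """
    n = len(dff_tag_ids)
    for start in range(len(input_ids) - n + 1):
        if input_ids[start:start + n] == dff_tag_ids:
            diff_start = start + n
            return [IGNORE_INDEX] * diff_start + input_ids[diff_start:]
    return [IGNORE_INDEX] * len(input_ids)


# Toy ids: pretend <DFF> tokenizes to [7, 8]; tokens 9 and 10 are the diff.
print(mask_labels([5, 6, 7, 8, 9, 10], dff_tag_ids=[7, 8]))
# [-100, -100, -100, -100, 9, 10]
```

This sketch matches the first occurrence of the tag, so a file that happens to contain the literal text <DFF> would need extra handling.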

File Truncation

We also experimented with different ways of truncating the file before changes to fit more of it into the context length. Without any truncation, roughly half of the files in the original dataset fit into the 2048 context length, for a total of 1.086 billion tokens. If we crop the file before changes to only contain the lines in the diff file, we can then fit 95% of the original dataset in the context, for a total of 2.181 billion tokens (see Figure 1). We hoped that including the extra data at the cost of some context in the file being changed would improve the model's performance. However, we found that this experiment resulted in a model significantly worse than without truncation, likely because being able to see an entire class/function that a change relies on is important for modelling.

Figure 1: Histograms of the samples in the dataset, ordered by length of file in tokens on the x-axis. (left) The baseline dataset, showing that the 2048 context length of our language models cuts off around 50% of the files. (right) The result of truncating the code file before changes to only contain lines altered by the diff and surrounding context.
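The cropping experiment can be illustrated with the sketch below, which keeps only the lines of the pre-change file covered by the diff's hunks (each hunk's old-file range already includes its surrounding context lines). This is an assumed reconstruction for illustration; the post does not describe the exact cropping procedure, and the function names are hypothetical.

```python
import re
from typing import List, Tuple

# Captures the old-file start and optional length from each hunk header.
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+\d+(?:,\d+)? @@", re.MULTILINE)


def hunk_ranges(diff_text: str) -> List[Tuple[int, int]]:
    """Return (start, length) of each hunk in the pre-change file (1-indexed start)."""
    ranges = []
    for match in HUNK_RE.finditer(diff_text):
        start = int(match.group(1))
        length = int(match.group(2)) if match.group(2) is not None else 1
        ranges.append((start, length))
    return ranges


def crop_to_hunks(file_before: str, diff_text: str) -> str:
    """Keep only the lines of `file_before` that fall inside some hunk."""
    lines = file_before.splitlines(keepends=True)
    keep = set()
    for start, length in hunk_ranges(diff_text):
        keep.update(range(start - 1, start - 1 + length))  # convert to 0-indexed
    return "".join(line for i, line in enumerate(lines) if i in keep)


file_before = "".join(f"line {i}\n" for i in range(1, 11))
diff_text = "@@ -2,3 +2,4 @@\n line 2\n-line 3\n+line three\n+extra\n line 4\n"
print(crop_to_hunks(file_before, diff_text))
```

As the post notes, this kind of cropping removes the wider class or function context, which is the likely reason the resulting model performed worse.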


To evaluate our models, we test their bug fixing capabilities on two tasks: 4-Parity, a simple toy benchmark where the model is required to fix basic bugs in a Python function to calculate the parity of a 4-bit sequence, and a more complex dataset of many synthetic and real Python bugs scraped from GitHub repositories by He et al. (2022). These benchmarks provide a simple testbed for whether diff LLMs can make multiple coordinated and effective changes to code.

For 4-Parity, we generate completions using a prompt consisting of the original function followed by the commit message <MSG> # Fixed bugs. We generate 3200 completions for each model, apply the resulting diff patches to the original function, execute the generated code, and report the percentage of generations where the generated 4-Parity function is correct across all test cases, at the best model temperature from {0.7, 0.8, 0.9}. We report results across 1-5 bugs synthetically introduced to the original function.
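The scoring step for 4-Parity can be sketched as follows: each patched function is executed and compared against a reference on all 16 possible inputs. The function name `parity`, its four-argument signature, and the helper names are assumptions for illustration. Note that `exec` on model output should only be run in a sandbox, and this sketch does not guard against non-terminating code.

```python
from itertools import product
from typing import Iterable


def reference_parity(b1: int, b2: int, b3: int, b4: int) -> int:
    """Return 0 for even parity and 1 for odd parity of four bits."""
    return (b1 + b2 + b3 + b4) % 2


def is_correct(source: str, func_name: str = "parity") -> bool:
    """Execute generated source and check it on all 16 four-bit inputs."""
    namespace = {}
    try:
        exec(source, namespace)  # untrusted code: sandbox this in practice
        candidate = namespace[func_name]
        return all(
            candidate(*bits) == reference_parity(*bits)
            for bits in product((0, 1), repeat=4)
        )
    except Exception:
        # Syntax errors, missing functions, and runtime errors all count as failures.
        return False


def percent_correct(sources: Iterable[str]) -> float:
    """Percentage of generated programs that pass every test case."""
    sources = list(sources)
    if not sources:
        return 0.0
    return 100.0 * sum(is_correct(s) for s in sources) / len(sources)


good = "def parity(b1, b2, b3, b4):\n    return (b1 + b2 + b3 + b4) % 2\n"
buggy = "def parity(b1, b2, b3, b4):\n    return (b1 + b2 + b3) % 2\n"
print(percent_correct([good, buggy, buggy, good]))  # 50.0
```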

For the latter task of real Python bugs, we filter the dataset down to 1000 bugs across several bug fixing problems (e.g. the wrong binary operator and wrong variable name problems), where we generate a diff for each bug and measure the exact string match accuracy between the generated function after applying the diff and the correct (bug-free) function. The commit message for this task is Fix {bug_class}, where the bug class might be, for example, "wrong binary operator". Note that in this case we do not execute the generated code to test it, since these bugs are scraped from many different GitHub repositories and execution would be impractical.
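Exact string match accuracy is straightforward to compute once the diffs have been applied. The sketch below assumes a strict character-for-character comparison with no whitespace normalization, which the post does not specify; the helper names are hypothetical.

```python
from typing import Sequence


def commit_message(bug_class: str) -> str:
    """Build the commit message used as the prompt for one bug class."""
    return f"Fix {bug_class}"


def exact_match_accuracy(predictions: Sequence[str], targets: Sequence[str]) -> float:
    """Fraction of patched functions that equal the bug-free function exactly."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)


targets = ["def f(a, b):\n    return a + b\n", "def g(x):\n    return x * 2\n"]
predictions = ["def f(a, b):\n    return a + b\n", "def g(x):\n    return x ** 2\n"]
print(commit_message("wrong binary operator"))     # Fix wrong binary operator
print(exact_match_accuracy(predictions, targets))  # 0.5
```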

The results from 4-Parity, shown in Figure 2, demonstrate that our diff models can perform basic bug fixing at comparable skill to the prompted CodeGen models. There is a clear performance increase with scale, and the 350M diff model performs better on the bug fixing task. We can also see that the loss masking approach described above results in significantly better diff models on this task.

Figure 2: Results from evaluating our diff models on the simple 4-Parity bug fixing task. Note the log scale y-axis. The x-axis is the number of progressively introduced bugs in the 4-Parity function. The bolded lines show our best diff models, while the dot-dash lines show the CodeGen models we used as a starting point, and the baseline models are trained without loss masking.

Table 1 shows the results from our diff models on the synthetic + real bugs benchmark, using the pass@k metric with k = 1 (defined as the fraction of problems solved when the model generates k code samples per problem). We can see that the masked diff models perform slightly better than the CodeGen models at the 2B and 6B scales (though not at 350M), and better than the baseline diff models at every scale.

Model | pass@1 (Synthetic + Real Bugs)
Baseline Diff 350M | 0.9%
Baseline Diff 2B | 1.9%
Baseline Diff 6B | 2.3%
Masked Diff 350M | 1.7%
Masked Diff 2B | 3.9%
Masked Diff 6B | 4.8%
CodeGen 350M | 2.0%
CodeGen 2B | 3.8%
CodeGen 6B | 4.5%
Table 1: Pass@1 (Chen et al., 2021) on the real + synthetic bugs benchmark (He et al., 2022).
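For reference, the pass@k estimator from Chen et al. (2021) can be written as below, where n samples are drawn per problem and c of them are correct. Whether the numbers in Table 1 were computed with n > 1 samples per problem is not stated, so this is a sketch of the metric and not of the exact evaluation script.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        # Every size-k subset of the n samples contains at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k: int) -> float:
    """Average pass@k over problems; `results` is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# With k = 1 the estimator reduces to c / n, the fraction of correct samples.
print(pass_at_k(n=10, c=3, k=1))
# One sample per problem: 1 of 4 problems solved.
print(mean_pass_at_k([(1, 1), (1, 0), (1, 0), (1, 0)], k=1))  # 0.25
```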

Qualitatively, we also evaluated the accuracy of the line numbers in the generated diff hunk, and noticed that the larger scale models do very well at generating line numbers which correspond to the lines that the diff below actually changes. This opens the door to prompting the model with specific line numbers to change, add, or remove, allowing for more control over the code generation in comparison with a non-diff model.

We also noticed that diff models (especially the 2B and 6B) tend to do better when prompted with longer code generation tasks (such as fixing bugs in a large function), and that varying the prompt induces greater diversity in generated code in comparison with the normal CodeGen models.

In further work, we hope to examine in greater detail the improved diversity and localised mutation abilities that diff models offer over standard code generation models, across many model scales.

Accelerated Inference with Triton and FasterTransformer

We also investigated the use of Nvidia's FasterTransformer (FT) framework with the Triton Inference Server using an FT backend to achieve significantly accelerated inference. FasterTransformer is a collection of fused CUDA kernels optimized for inference, written in C++. The Triton Inference Server is an optimized system for serving large language models at scale, in both multi-GPU and multi-node setups using Docker containers.

Converting the CodeGen models to FT involved significant technical work, since CodeGen is not supported natively in FT. We first converted the CodeGen weights to GPT-J format via a linear algebra trick, since GPT-J has a very similar architecture, building on Brendan Dolan-Gavitt's work with the Fauxpilot framework. From there, we used the FT script to convert the GPT-J HuggingFace checkpoint into FT's format, which can be run with the Triton server. We struggled to get this to run on our cluster (which does not use Docker), but eventually succeeded and achieved a significant speedup on inference of our models, in some cases up to an order of magnitude faster.

Model | Time: HuggingFace Transformers | Time: FasterTransformer + Triton Inference Server
CodeGen 350M | 5m 44s | 31s
CodeGen 2B | 9m 38s | 1m 27s
CodeGen 6B | 10m 45s | 2m 9s
Table 2: Time benchmark results for the base CodeGen models on the 4-Parity task described above, comparing HuggingFace Transformers inference speed with FasterTransformer using the Triton Inference Server.
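The speedups implied by Table 2 can be worked out directly from the reported times. The short script below only performs that arithmetic; the timings are copied from the table.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a minutes/seconds pair to total seconds."""
    return 60 * minutes + seconds


# (HuggingFace Transformers time, FasterTransformer + Triton time), from Table 2.
timings = {
    "CodeGen 350M": (to_seconds(5, 44), to_seconds(0, 31)),
    "CodeGen 2B": (to_seconds(9, 38), to_seconds(1, 27)),
    "CodeGen 6B": (to_seconds(10, 45), to_seconds(2, 9)),
}

speedups = {model: hf / ft for model, (hf, ft) in timings.items()}
for model, speedup in speedups.items():
    print(f"{model}: {speedup:.1f}x")
# CodeGen 350M: 11.1x
# CodeGen 2B: 6.6x
# CodeGen 6B: 5.0x
```

The 350M model shows the roughly order-of-magnitude gain mentioned above, while the larger models see speedups of about 5x to 7x.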

Our scripts to convert and run these models with FasterTransformer and Triton are available in the OpenELM library.

We hope that this work inspires others to take our models and experiment with the potential of diff-based code generation!

To cite this blog post, please use the following entry:

H. Bradley, H. Fan, H. Saini, R. Adithyan, S. Purohit, and J. Lehman. (Jan 2023). Diff Models – A New Way to Edit Code. CarperAI Blog.


  title   = "Diff Models - A New Way to Edit Code",
  author  = "Bradley, Herbie and Fan, Honglu and Saini, Harry and Adithyan, Reshinth and Purohit, Shivanshu and Lehman, Joel",
  journal = "CarperAI Blog",
  year    = "2023",
  month   = "Jan",
  url     = ""


The CarperAI diff models team consisted of Herbie Bradley, Honglu Fan, Harry Saini, Reshinth Adithyan, Shivanshu Purohit, and Joel Lehman.

We thank Stability AI for providing compute resources.
