Intelligent autocomplete and evaluation in Satyrn
Preface
I’m in the middle of optimizing the intelligent autocomplete feature in Satyrn (my Jupyter client). The feature is live in the app, and now I’m working on a more rigorous evaluation. What I’ve learned so far is that Gemini is a pretty good model for this task out of the box, and GPT and Claude are pretty bad, but you can massage them toward usability with some more advanced prompting. In this post I’ll give a little background on adding the feature to Satyrn and then discuss some early results from evaluation.
Motivation
I never loved using VS Code as my Jupyter Notebook client, but I did love using Copilot (and now I’m addicted to Cursor Tab). So when I set out to build Satyrn, my own Jupyter client, I knew I wanted to include intelligent autocomplete.
I looked into integrating GitHub Copilot, but it looked like a bit of a pain. Although several third-party editors like Sublime Text, Zed, and Vim have added Copilot support, there is no official public-facing documentation on how it can be integrated. As far as I can tell, those editors have reverse-engineered how it’s plumbed into VS Code and tried to reproduce that framework.
Instead of immediately going down that rabbit hole I decided to use existing LLM API integrations in Satyrn to provide intelligent autocomplete. How hard could it be?
The MVP
My first attempt was to hack together an MVP I could ship in Satyrn. There was no eval involved; I just wrote the best prompt and glue code I could while testing it out live in the app, and once it felt usable I shipped it. This is clearly not the best approach to shipping a reliable LLM feature, but it was a starting point, and I learned a lot from solving the problem end-to-end.
Since Satyrn already supported the Anthropic, OpenAI, and Gemini APIs, I wanted to let users choose any of these providers to power the autocomplete, but I ran into the (in hindsight obvious) challenge that the same prompt does not produce the same results across models. It turns out that claude-haiku and gpt-4o-mini behave similarly enough, but Gemini was quite different.
I started by tuning the prompt for GPT and quickly learned that it was abysmal at handling newlines and indentation around the cursor. For example, given the code below (where the cursor position is indicated by |), it would try to insert a line at the cursor position and fail to add a newline character at the end.
def mean_absolute_deviation(numbers: List) -> float:
| return sum(abs(x - mean) for x in numbers) / len(numbers)
A common error on the test case above: the model adds the missing line of code at the cursor position but fails to include a newline at the end, resulting in invalid Python code:
def mean_absolute_deviation(numbers: List) -> float:
mean = sum(numbers) / len(numbers) return sum(abs(x - mean) for x in numbers) / len(numbers)
I think this issue is actually related to the infamous chess weirdness, but I’ll come back to that another time.
I stumbled upon a workaround for models failing to generate newlines where they should in a paper on “character-level text infilling”. The technique is to ask the model not just to generate the missing code, but to rewrite the line before and the line after the cursor with the new code included, and then to strip out the code you already have (a rough sketch of that stripping step is below). The prompt I came up with from this technique ended up working OK for both gpt-4o-mini and claude-haiku, but it did not work at all for gemini-1.5-flash. No matter how hard I tried to cajole Gemini into rewriting the lines before and after, it only wanted to generate the missing code. I didn’t realize it at the time, but it turns out this is because Gemini is actually really good at the basic task of intelligent autocomplete, and it does not want to play ball with this weird workaround needed for the other two chat models.
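Here is roughly what that stripping step looks like (an illustrative helper with names of my own choosing, not Satyrn’s exact code): the model returns the rewritten line before the cursor, the new code, and the rewritten line after, and we peel off the parts we already have.
def strip_known_context(model_output: str, line_before: str, line_after: str) -> str:
    # The model rewrites line_before + <inserted code> + line_after;
    # remove the text we already have so only the insertion remains.
    completion = model_output
    if line_before and completion.startswith(line_before):
        completion = completion[len(line_before):]
    if line_after and completion.endswith(line_after):
        completion = completion[:-len(line_after)]
    return completion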
Starting evaluations
After shipping the MVP of intelligent autocomplete I moved on to a more rigorous evaluation approach. In hindsight I definitely should have started here.
To evaluate the performance of each model I am using the human-eval-infilling benchmark from OpenAI. I forked their library and made some slight adjustments to make it more ergonomic to work with. Then I created some scripts that take the test cases from the eval, build a prompt from each one, and generate responses from the LLM providers. You can check out my code here (it’s a work in progress).
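The generation side of those scripts looks roughly like this. It’s a sketch, assuming my fork keeps helpers along the lines of the original human-eval’s read_problems / write_jsonl; the module, helper, and field names here are my guesses rather than the repo’s exact API, and generate_completion is the per-provider call shown later in the post.
from human_eval_infilling.data import read_problems, write_jsonl  # assumed helper names

def generate_samples(benchmark: str = "single-line", attempts: int = 5) -> None:
    # Generate `attempts` completions per task and dump them for the eval harness to score.
    problems = read_problems(benchmark)
    samples = []
    for task_id, problem in problems.items():
        for _ in range(attempts):
            completion = generate_completion(problem["prompt"], problem["suffix"])
            samples.append({"task_id": task_id, "completion": completion})
    write_jsonl("samples.jsonl", samples)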
So far I’ve done a limited evaluation on gpt-4o-mini and gemini-1.5-flash. I ran 100 test samples against both GPT and Gemini, and each sample was attempted 5 times.
I used the same system prompt for both models. Unlike my approach in the MVP, I’ve gone back to basics with a standard FIM (“fill in the middle”) prompting strategy. Basically you take the block of code and split it into a “prefix” and “suffix” around the cursor position. Then you use XML tags to tell the model where the prefix and suffix begin, and add a “middle” tag to ask it to “fill in the middle”. So for example we’ll take the code below (where the | indicates the cursor position):
def mean_absolute_deviation(numbers: List) -> float:
| return sum(abs(x - mean) for x in numbers) / len(numbers)
We’ll re-arrange the code block above, using <PRE>, <SUF>, and <MID> tags to indicate the position of the prefix, suffix, and middle. The middle always comes at the end because the model completes the sequence of tokens.
<PRE>def mean_absolute_deviation(numbers: List) -> float:\n<SUF>return sum(abs(x - mean) for x in numbers) / len(numbers)<MID>
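In the app this split is just string surgery on the cell contents around the cursor offset (the eval tasks conveniently come pre-split into prefix and suffix). A tiny illustrative helper, with names of my own:
def build_fim_prompt(code: str, cursor: int) -> str:
    # Everything before the cursor is the prefix, everything after is the suffix;
    # the model is asked to complete the sequence after the <MID> tag.
    prefix, suffix = code[:cursor], code[cursor:]
    return f"<PRE>{prefix}<SUF>{suffix}<MID>"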
So I created a simple function that takes the prefix and suffix from the evaluation task and turns them into an API call to the LLM provider. Below is the version for GPT, and there is a similar one for Gemini.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_completion(prefix: str, suffix: str) -> str:
    # SYSTEM_PROMPT is the FIM system prompt described above.
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "developer", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"<PRE>{prefix}<SUF>{suffix}<MID>"},
        ],
    )
    return completion.choices[0].message.content
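The similar Gemini version goes through the google-generativeai client; it looks roughly like this (a sketch of the shape I’d expect, not the exact code in the repo):
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
# SYSTEM_PROMPT is the same FIM system prompt used for GPT.
model = genai.GenerativeModel("gemini-1.5-flash", system_instruction=SYSTEM_PROMPT)

def generate_completion_gemini(prefix: str, suffix: str) -> str:
    # Same FIM-tagged user prompt; Gemini takes the system prompt at model construction time.
    response = model.generate_content(f"<PRE>{prefix}<SUF>{suffix}<MID>")
    return response.text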
In both versions you can see the user prompt is very simple; I did not use the line-rewriting workaround from the MVP. Thus, unsurprisingly, GPT performed quite poorly. Gemini, on the other hand, crushed it.
Gemini Flash comes out on top with ~96% pass@1, which is very impressive, and 99% pass@5. The run cost me ~$0.033 for Gemini (you can actually do it for free, but the rate limit is a pain). GPT, on the other hand, does abysmally: ~39% pass@1 and ~43% pass@5, at a cost of ~$0.063.
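A quick note on the metric: pass@k here is the standard HumanEval estimator, where for n attempts per task with c of them passing, the per-task score is 1 - C(n-c, k)/C(n, k), averaged over tasks. The eval harness computes this for you; the snippet below is just to make the formula concrete.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimate of P(at least one of k samples passes),
    # given that c of the n generated attempts were correct.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)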
This is just a little taste of the evaluations I’ve been working on today; I’m excited to get back to it tomorrow to test Claude and perhaps introduce some open source models into the mix as well. But so far it’s clear that Gemini is in the lead.