How to Run a Generative AI Developer Tooling Experiment

Software development company Devexperts tested and compared the code generator tools Copilot and Cursor. Its approach serves as a blueprint for testing AI developer tools.

Apr 27th, 2025 7:00am by Jennifer Riggins


Editor’s note: Jennifer Riggins created this post on behalf of DX.

As a software development company specializing in brokerage solutions, Devexperts must balance the speed of innovation with the caution required by a highly regulated industry.

That includes its consideration of the hottest technologies, like generative AI (GenAI) tooling. Over the past two years, Devexperts has been evaluating AI tools to enhance developer productivity, particularly code generators. Being engineers, the team runs highly measurable experiments using the scientific method.

First, they tested GitHub Copilot, with a mix of stellar and perplexing results. Then they hypothesized that the Cursor AI code editor would perform even better, and tested that too.

Now they give their developers a choice of generative AI coding assistants. Devexperts shares here the developer metrics that were the backbone of its GenAI experiments, so you can test out generative AI with your team.

Can You Perform Like GitHub Developers?

In September 2022, GitHub published an article stating that its engineers saw a 55% increase in task completion speed using Copilot. This inspired Devexperts to test out Copilot to see if its developers could achieve similar results.

Devexperts started with a live experiment from August 1 to November 1, 2024, involving five software engineers. They then conducted a retrospective analysis with an additional 50 participants. The results for both were benchmarked against GitHub’s 55% improvement for time metrics. Because a task that takes 55% less time takes 45% of the original time, the equivalent expected throughput increase is ((100%/45%)-1)*100% ≈ 120%.
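The conversion from a time benchmark to a throughput benchmark can be sketched in a few lines. This is a minimal illustration, assuming the 55% figure is read as a reduction in time per task; the function name is hypothetical and not something Devexperts published.

```python
def throughput_increase_from_time_reduction(time_reduction: float) -> float:
    """Return the expected fractional throughput increase.

    time_reduction is the fraction of time saved per task (0.55 for 55%).
    Assumption: throughput is inversely proportional to time per task.
    """
    if not 0 <= time_reduction < 1:
        raise ValueError("time_reduction must be in the range [0, 1)")
    remaining_time = 1 - time_reduction  # 0.45 of the original time
    return 1 / remaining_time - 1        # tasks per unit of time, relative gain


expected = throughput_increase_from_time_reduction(0.55)
print(f"{expected:.0%}")  # prints 122%, which rounds to the roughly 120% target
```

The same helper shows why time and throughput targets are not interchangeable: a 50% time reduction already implies a 100% throughput increase.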

Overall, the company wanted to observe three trends:

Engineers noticeably increase their coding speed.
Engineers noticeably increase the number of tasks completed.
Engineers work with the same quality.

The Devexperts team had also read enough of other organizations’ experiments to expect that there was a real risk of the solution size and the demand for rework notably increasing, so it also wanted to keep an eye on that.

Both the experiment and the ex post facto analysis looked to measure aspects of the developer experience with the following metrics:

Individual cycle time – time from when the first commit is authored to when the pull request (PR) is merged.
Time to open – time between the earliest commit in a pull request and when the PR is opened.
Time to merge – duration between when a pull request is opened and when it is merged.
Discussion cycles – the number of times a conversation on a pull request goes back and forth between people.
Pull request size – number of lines of code that were added, changed, or removed.
PR throughput – number of pull requests that are merged into the main codebase over a period of time.
Innovation rate – percentage of code changes that are writing brand-new code.
Maintenance rate – percentage of changes that are updating code that is more than three weeks old.
Rework – percentage of code changes that were updated within the past three weeks.
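The three time-based metrics above can be derived from just three timestamps per pull request. The following is a rough sketch, not the tooling Devexperts used; the `PullRequest` class and its field names are hypothetical, and the sample dates are made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class PullRequest:
    """Hypothetical record holding the three timestamps the metrics need."""
    first_commit_at: datetime  # when the earliest commit was authored
    opened_at: datetime        # when the PR was opened
    merged_at: datetime        # when the PR was merged

    @property
    def time_to_open(self) -> timedelta:
        # Earliest commit -> PR opened
        return self.opened_at - self.first_commit_at

    @property
    def time_to_merge(self) -> timedelta:
        # PR opened -> PR merged
        return self.merged_at - self.opened_at

    @property
    def cycle_time(self) -> timedelta:
        # Earliest commit -> PR merged
        return self.merged_at - self.first_commit_at


# Illustrative values only, not data from the experiment.
pr = PullRequest(
    first_commit_at=datetime(2024, 8, 1, 9, 0),
    opened_at=datetime(2024, 8, 2, 9, 0),
    merged_at=datetime(2024, 8, 3, 15, 0),
)
print(pr.time_to_open, pr.time_to_merge, pr.cycle_time)
```

Under these definitions, individual cycle time is simply the sum of time to open and time to merge, which is a useful sanity check on any dashboard that reports all three.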

Measured against its GitHub-derived benchmark of a 55% improvement, Devexperts found that only three of its nine expectations were fulfilled in the five-engineer pilot. The company did see a 69% decrease in discussion cycles and a 176% increase in PR throughput, surpassing these expectations.

There were other positive changes, including a decrease in cycle time and time to merge, but not at the same improvement rate as GitHub’s engineers experienced.

There was also a 14% decrease in the innovation rate. This means that during the three-month trial, developers were spending less time solving unique problems.

The last metric, rework, is a significant consideration with generative AI, as 67% of developers find that they are spending more time debugging AI-generated code. Devexperts experienced a 200% increase in rework and a 30% increase in maintenance. On the other hand, while the majority of organizations are seeing an increase in complexity and lines of code with code generators, these five engineers saw a surprising 15% decrease in lines of code created.
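Innovation, maintenance and rework all come from the same classification: how old was the code that a change touched? Here is a minimal sketch of that rule, assuming each changed line carries the date it was previously modified (or nothing, if it is brand new). The function names are hypothetical, and treating a line that is exactly three weeks old as rework is an assumption.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Iterable, Optional

REWORK_WINDOW = timedelta(weeks=3)  # the three-week boundary from the metric definitions


def classify_change(changed_at: datetime, previous_change_at: Optional[datetime]) -> str:
    """Label one changed line as innovation, maintenance or rework."""
    if previous_change_at is None:
        return "innovation"  # brand-new code
    age = changed_at - previous_change_at
    # Older than three weeks counts as maintenance; anything newer is rework.
    return "maintenance" if age > REWORK_WINDOW else "rework"


def change_rates(labels: Iterable[str]) -> dict:
    """Return each category's share of all changes, as a percentage."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in ("innovation", "maintenance", "rework")}
    return {
        name: counts[name] / total * 100
        for name in ("innovation", "maintenance", "rework")
    }


# Illustrative values only, not data from the experiment.
now = datetime(2024, 9, 1)
labels = [
    classify_change(now, None),                       # new line
    classify_change(now, now - timedelta(days=2)),    # touched two days ago
    classify_change(now, now - timedelta(days=60)),   # touched two months ago
    classify_change(now, None),                       # new line
]
print(change_rates(labels))
```

Because the three rates always sum to 100%, a jump in rework has to come at the expense of innovation or maintenance, so the figures are best read together rather than one at a time.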

“We can conclude that, for the live experiment, GitHub Copilot didn’t deliver the results one could expect after reading their articles,” summarized German Tebiev, the software engineering process architect who ran the experiment.

He did think the results were persuasive enough to believe speed will be enabled if the right processes are put in place: “The fact that the PR throughput shows significant growth tells us that the desired speed increase can be achieved if the tasks’ flow is handled effectively.”

Different Devs, Different Results?

The results from the first pilot called for a further exploration of Copilot with a broader sample audience.

For 50 more engineers, Devexperts counted back and rechecked the same metrics for three months before adopting GitHub Copilot and then again for three months after. The results between the live experiment and the retrospective analysis were very different.
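A before-and-after comparison like this one comes down to a percent change per metric between two windows. The sketch below assumes the median is used to summarize each window, which the article does not specify, and the sample numbers are invented for illustration rather than taken from Devexperts.

```python
from statistics import median
from typing import Sequence


def percent_change(before: float, after: float) -> float:
    """Return the change from before to after as a percentage of before."""
    if before == 0:
        raise ValueError("before must be non-zero to compute a percent change")
    return (after - before) / before * 100


def compare_windows(before_samples: Sequence[float], after_samples: Sequence[float]) -> float:
    """Compare two measurement windows by the percent change of their medians."""
    return percent_change(median(before_samples), median(after_samples))


# Hypothetical cycle times in hours for the three months before and after adoption.
cycle_time_before = [40, 52, 61, 48, 75]
cycle_time_after = [30, 26, 41, 33, 29]

change = compare_windows(cycle_time_before, cycle_time_after)
print(f"Cycle time changed by {change:.1f}%")
```

A negative result is an improvement for time metrics and a regression for throughput, so the sign needs to be interpreted per metric before comparing against any benchmark.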

“With GitHub Copilot, we reduced our workload by 20%, completed tasks 45% faster, and maintained the same level of quality,” Tebiev said. “The most significant improvement was observed during the review stage.”

With GitHub Copilot, this larger pool maintained, but didn’t notably increase, its level of innovation, while decreasing its rework. Most importantly, this team experienced a significant decrease in throughput.

In the retrospective analysis, these engineers completed less work but did it faster. In the live experiment, by contrast, the engineers didn’t see the expected speed increase, but they did observe a productivity increase.

Since the team found the quantitative outcomes counterintuitive, Tebiev suspects there are some flaws in the design of the retrospective analysis, so even more experiments are needed. “At Devexperts, we couldn’t isolate the development process sufficiently or gather a powerful enough selection of engineers to deliver stable results.”

Overall, things trended positively, and any perception of a productivity increase for the majority of developers is not something to be brushed off. It may just be growing pains or a need to review the company’s workflow and processes.

The Qualitative Developer Experience

Numbers matter, but only to an extent. Qualitative data, including how developers feel, is equally important. Often, even the perception of productivity can impact reality.

Devexperts uses the DX engineering intelligence platform to run quarterly developer productivity surveys called Snapshots. This includes collecting self-reported data from developers on the gains they’re seeing from using Copilot.

Just 17% of developers responded that they think Copilot helped them save at least an hour a week, while a whopping 40% saw no time savings from using the code generator. That is well below the industry average.

Developers were also able to share their own anecdotal experience, which is very situation-dependent. Copilot seemed to be a better choice for completing more basic lines of code for new features, and less so for the complexity of working within an existing codebase.

As senior software engineer Aleksandr put it: “It’s perfect for some disgusting but simple routine tasks like generating some test data, objects, JSONs and so on. Sometimes it is really good at generating some unit tests but the effectiveness here is seriously limited.”

But even though Copilot removed some of his toil, he remained unconvinced it could tackle regular coding, and he dubbed it “useless” for the typical Kotlin-Java backend development stack.

Several devs also complained the popular AI coding assistant didn’t help at all with bug fixes because it couldn’t handle the context of a larger codebase. As it becomes harder to isolate pieces within more complex codebases, said Dmitry Derbenev, head of R&D at Devexperts, it will become increasingly crucial for an LLM to understand context to provide accurate recommendations.

“Copilot makes it easier to write new features and sometimes conveniently completes sentences. It handles basic classes and models well, but doesn’t help much with bug fixing or suggestions, which are the main part of my work on the project,” responded software developer Alexandr. “In my opinion, other solutions like ChatGPT and Claude have performed better so far at this point. Copilot is definitely not a game changer.”

One frontend developer, who found Copilot the most useful, responded that he would miss it if it were taken away, and lamented that it’s not included in Android Studio.

Is Cursor Better Than Copilot?

For their next experiment, from Oct. 30, 2024, to Jan. 30, 2025, 11 developers from Devexperts who had previously used Copilot started incorporating Cursor’s code completer into their work.

The measurement of this Cursor cohort was originally to be grounded in the Cycle Time metric — measuring from first commit to merging the pull request. However, they found that it was an unreliable source of truth because it fluctuated too much and had a limited scope, as it only “partially” captured the time required to complete a task.

“When experimenting in a live environment, it’s hard to isolate just one factor — like Copilot or Cursor usage. Different workloads, vacations, limited duration, different codebase and other factors make a proper design of the experiment both challenging and important,” said Dmitry Derbenev, head of R&D at Devexperts. “Additionally, the short duration of the experiment — three months — made it difficult to observe meaningful trends.”

Some quantitative results were impressive. In contrast to Copilot, Cursor delivered a 5% to 10% gain for bug fixing. But again, the biggest win was in greenfield development, with an astounding 80% improvement: tasks that would normally take three to four days were cut down to a single day.

But since the quantitative results weren’t reliable enough, Derbenev said that “we primarily relied on subjective developer feedback rather than purely quantitative measurements to analyze the results.”

Objective measurements still have many limitations in software development, he argued in a post on LinkedIn, and can even interfere with achieving goals and create “twisted incentives.”

Devexperts’ qualitative feedback focused on three questions:

Would you be using Cursor going forward?
Would you continue to use Copilot? If yes, together with Cursor? Alone? Neither?
Any free-form comments?

Even considering the briefness of this time period, the overall qualitative results of Cursor versus Copilot, according to more DX qualitative data, favored Cursor:

Eight out of 11 developers chose to continue using Cursor after the experiment, while two developers did not feel that they would use Cursor again.
Four respondents opted to continue using Copilot with or without Cursor, whereas seven developers discontinued Copilot in favor of Cursor.

The biggest Cursor drawback was that, as a fork of the VS Code integrated development environment, it requires changing to a different IDE. This was fine for most frontend developers, Derbenev explained, but backend developers preferred using other IDEs.

This also inspired enterprise architect Viktor Isaev to run a proof of concept with the Cursor Agent AI code assistant, with zero handholding — just letting developers experiment. It was again great at the toil, but less so at the more complex, human problem-solving work.

“My perception is that it helps automate 80% of work that takes 20% of time,” Isaev said. “It’s good at standard, well-defined things that require a lot of boiler-plating or making things that are well-known how to make, this speeds up greatly.”

But, similar to Copilot, as tasks and code become more complex, he found it to be a time-waster more or less. This was the same result, he said, across tasks that included “building the project from scratch, making additional all works, doing refactoring, creating building tests, and introducing some architectural layers like authentication [and] authorization.”

An essential part of these experiments to consider is that none of them involved training on the intellectual property of Devexperts or its clients. Derbenev said, “I believe that organizing work with the context of the codebase and documentation in a proper, secure, and reliable way is the biggest challenge for having decent productivity gain when it comes to the LLM usage.”

In the meantime, with all of these results in mind, and considering that the whole goal is to improve developer experience, Devexperts is now giving devs a choice of AI developer tools. It will pay for premium licenses for either Copilot or Cursor, and it will keep collecting developer feedback.

Jennifer Riggins is a tech storyteller and journalist, event and panel host. She bridges the gap between business, culture and technology, with her work grounded in the developer experience. She has been a working writer since 2003, and is based...