AI pair programmer should be supervised like a toddler, says researcher
“How risky is it to allow an AI to write some, or all of your code?”
Far too risky without rigorous oversight, concludes security researcher ‘0xabad1dea’ after documenting a trio of security vulnerabilities generated by AI pair programmer GitHub Copilot during a risk assessment.
GitHub Copilot is designed to accelerate software development by suggesting entire lines and functions, adapting to developers’ coding style as it does so.
Trained on billions of lines of code publicly available on GitHub, the machine learning tool is currently in a trial phase and available for testing as a Visual Studio Code extension.
‘Reasonable at first glance’
0xabad1dea says Copilot sometimes generates code that is “so obviously, trivially wrong that no professional programmer could think otherwise”.
More alarmingly still, it also suggests “bad code that looks reasonable at first glance, something that might slip by a programmer in a hurry, or seem correct to a less experienced coder”.
GitHub admits that “the code it suggests may not always work, or even make sense”, but adds that “it’s getting smarter all the time”.
Central to these improvements will be ongoing tuning of a sliding ‘temperature’ scale that runs from conservatism (mimicking the most common inputs) to originality, with the latter making output “less structured” and more prone to “gibberish”, says 0xabad1dea.
This ‘generative model’ reduces duplication between users but “is at odds with one of the most basic principles of reliability: determinism”, says 0xabad1dea.
She demonstrates this with differing implementations of a moon phase calculator generated from identical inputs.
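For readers unfamiliar with the term, the temperature trade-off can be sketched in a few lines of Python. This is an illustrative toy, not Copilot’s implementation: the function names and the three candidate scores are assumptions made up for the example.

```python
import math
import random


def temperature_probabilities(scores, temperature):
    """Turn raw scores into probabilities, scaled by a temperature.

    Low temperatures pile probability onto the top-scoring option;
    high temperatures flatten the distribution.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def sample_with_temperature(scores, temperature, rng):
    """Randomly pick the index of one candidate using the scaled probabilities."""
    probabilities = temperature_probabilities(scores, temperature)
    return rng.choices(range(len(scores)), weights=probabilities, k=1)[0]


# Hypothetical scores for three candidate completions of the same prompt.
candidate_scores = [2.0, 1.0, 0.5]

rng = random.Random()  # unseeded, so repeated runs can differ
print([sample_with_temperature(candidate_scores, 0.05, rng) for _ in range(20)])
print([sample_with_temperature(candidate_scores, 5.0, rng) for _ in range(20)])
```

At the low setting the first candidate is chosen almost every time, while at the high setting the picks vary from run to run, which is the loss of determinism the researcher describes.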
The researcher also notes that Copilot is currently “unreliable” at generating comments and offers variables with “useless names”, potentially making outputs “utterly inscrutable”.
When she fed Copilot the prompt ‘general purpose HTML parser with regex’ – an ill-advised approach, she says – Copilot “declined to use regex and wrote a complete C function and a decent main() to drive it”.
Alarmingly, however, “if the parsed string contains no >, the parser will run off the end of the buffer and crash”, among other parsing issues.
There was at least qualified praise for the presence of “a surprising amount of delicate pointer math”, and for Copilot being “80% of the way to something that could conceivably be considered a basic parser”.
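The missing ‘>’ failure is easy to reproduce in miniature. Copilot’s output was C, where the overrun reads past the buffer; the sketch below is a Python analogue in which the same unchecked loop ends in an IndexError instead. Both function names are hypothetical and are not taken from the researcher’s write-up.

```python
def read_tag_unchecked(text, start):
    """Scan forward to '>' with no bound check (the flawed pattern)."""
    i = start
    while text[i] != ">":  # raises IndexError if '>' never appears
        i += 1
    return text[start:i + 1]


def read_tag(text, start):
    """Scan forward to '>' and report an unterminated tag instead of crashing."""
    end = text.find(">", start)
    if end == -1:
        return None  # no closing '>' anywhere after `start`
    return text[start:end + 1]


print(read_tag("<p>hello", 0))  # <p>
print(read_tag("<p", 0))        # None
```

The safer version treats “no closing bracket” as an ordinary outcome the caller has to handle, rather than assuming the input is well formed.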
The AI tool also “blundered right into the most classic security flaw of the early 2000s: a PHP script taking a raw GET variable and interpolating it into a string to be used as an SQL query, causing SQL injection”, says 0xabad1dea. “Now PHP’s notorious propensity for security issues is infecting even non-human life.
“Furthermore, when prompted with shell_exec(), Copilot was happy to pass raw GET variables to the command line.”
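The SQL injection pattern described above is not specific to PHP. As a minimal sketch, assuming an in-memory SQLite database with a made-up users table, the Python below contrasts interpolating raw input into a query with passing it as a bound parameter. The table, rows, and function names are invented for illustration.

```python
import sqlite3


def build_demo_db():
    """Create a throwaway in-memory database with two hypothetical users."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "a-secret"), ("bob", "b-secret")],
    )
    return conn


def lookup_unsafe(conn, name):
    # Vulnerable: the input becomes part of the SQL text itself.
    query = f"SELECT secret FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def lookup_safe(conn, name):
    # Parameterized: the input is bound as data and never parsed as SQL.
    query = "SELECT secret FROM users WHERE name = ?"
    return conn.execute(query, (name,)).fetchall()


conn = build_demo_db()
hostile = "nobody' OR '1'='1"
print(len(lookup_unsafe(conn, hostile)))  # 2: the injected OR clause matches every row
print(lookup_safe(conn, hostile))         # []
```

The two functions differ by a single line, which is part of why the flawed version can look reasonable at first glance.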
Prompted “for a basic listening socket”, Copilot also created “a basic off-by-one buffer error” in the listening function.
The researcher was unable to verify whether Copilot excludes secret information such as API keys and passwords from its training data.
“The most realistic risk here is a naive programmer accepting an autocomplete for a cryptographic key which sets it to be a random-looking but dangerously low-entropy value,” she said.
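One way to avoid that trap, sketched here in Python under the assumption that the key is generated at runtime rather than pasted into source, is to draw it from the operating system’s cryptographic random number generator. The hardcoded constant and the function name are hypothetical.

```python
import secrets

# Hypothetical example of a "random-looking" value an autocomplete might offer.
# Anything hardcoded in source is known to everyone who can read the code.
SUGGESTED_KEY = "0123456789abcdef0123456789abcdef"


def generate_key(num_bytes=32):
    """Return a fresh key drawn from the operating system's CSPRNG."""
    return secrets.token_bytes(num_bytes)


key = generate_key()
print(len(key))  # 32
```

In practice the generated key would be stored in a secrets manager or environment configuration rather than committed alongside the code.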
‘Neural network see, neural network do’
“The inevitable conclusion is that Copilot can and will write security vulnerabilities on a regular basis, especially in memory-unsafe languages,” says the researcher.
While Copilot excels at generating boilerplate that may “bog down” programmers and accurately guesses constants and setup functions, it’s less adroit at handling application logic, she says.
“Copilot cannot always maintain sufficient context to write correct code across many lines”, 0xabad1dea explains, while there’s no apparent “systematic separation of professionally produced code” from the profusion of “buggy code on GitHub”.
She added: “Neural network see, neural network do”.
Supervising a toddler
0xabad1dea tells The Daily Swig that she expects GitHub to be diligent in addressing Copilot’s shortcomings, but that developers should “be realistic about the limitations”.
She likens the Copilot model to a toddler. “They will impress you with how much they have learned, but they will still always lack context and experience. And of course, they shouldn’t be left unsupervised.”
0xabad1dea also notes that a below-the-line commenter flagged a “tiny flaw” in an Easter date calculator she generated through Copilot.
“So even when I was on the lookout, I missed something. Of course this can happen with human-written code as well, but the fact that we have so much trouble just means we don’t need our tools introducing new random faults.”
The Daily Swig invited GitHub to comment on the findings but we have yet to hear back. We will update the story if we do.