Science
Interview with software researcher Panichella

‘ChatGPT gets little criticism, while the outcomes are sometimes really wrong’

In a much-downloaded article, TU Delft scientists warn against blind faith in artificially intelligent programming tools such as ChatGPT. “Testing remains necessary,” stresses software researcher Dr Annibale Panichella.

Software researchers Annibale Panichella (left) and June Sallou question the use of AI coding assistants. (Photo: Jaden Accord)

By a happy coincidence, modern programming languages resemble ordinary language so closely that the same techniques used in language models can also be used to generate computer code. So if you expose a language model to lots of computer code, it learns to program by itself, much like toddlers learn to talk. Popular AI programming assistants such as GitHub Copilot and ChatGPT are examples of this.

But where did a programming assistant acquire its language skills? And, in research terms, what was the training data? This is usually not disclosed, and it is increasingly troubling researchers.

Researchers do know, for instance, that successive versions of programs have been used as input, including versions that were later replaced because of bugs or vulnerabilities to hackers. Tests for that software may also have ended up in the training data. It is comparable to sitting a driving theory exam after practising on a test that contains the actual exam questions. Programmers call this data leakage.
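To make the data-leakage concern concrete: if the snippets used to evaluate a model also appear in its training corpus, the evaluation says little. Below is a toy sketch of such an overlap check, assuming both corpora are simply available as lists of code snippets; in practice, the training data of commercial models is exactly what is not disclosed.

```python
import hashlib

def fingerprint(snippet: str) -> str:
    """Hash a whitespace-normalised code snippet so large corpora can be compared cheaply."""
    normalised = " ".join(snippet.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

def leaked_fraction(train_snippets: list[str], eval_snippets: list[str]) -> float:
    """Return the fraction of evaluation snippets that literally occur in the training set."""
    train_hashes = {fingerprint(s) for s in train_snippets}
    leaked = sum(1 for s in eval_snippets if fingerprint(s) in train_hashes)
    return leaked / len(eval_snippets) if eval_snippets else 0.0

# Hypothetical example: one of the two evaluation snippets also sits in the training data.
train = ["def add(a, b):\n    return a + b", "def sub(a, b):\n    return a - b"]
evaluation = ["def add(a, b):\n    return a + b", "def mul(a, b):\n    return a * b"]
print(leaked_fraction(train, evaluation))  # 0.5
```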

‘ChatGPT doesn’t always give the same answer to the same question. Normally, you would figure out why that is’

These kinds of emerging doubts led Dr Annibale Panichella to write a high-profile paper with his colleagues Dr June Sallou and Dr Thomas Durieux. The paper will be presented at the International Conference on Software Engineering (ICSE) in Lisbon in April, but it has already been downloaded hundreds of times from the TU Delft repository.

The three authors work in the Software Engineering Department of the Faculty of Electrical Engineering, Mathematics and Computer Science (EEMCS). Delta spoke to Dr Panichella to find out where the interest comes from.

The article is called ‘Breaking the silence’. What silence is that?

“People have a lot of faith in generative artificial intelligence, such as ChatGPT. But the results are not always accurate. Also, researchers are often less critical of large language models, or LLMs, than of other techniques. ChatGPT gets little criticism, while the results are sometimes really wrong. For example, ChatGPT does not always give the same answer to the same question. Normally, you would find out why that is by asking the question many times and comparing the outcomes. But people don’t do that with large language models. They run their task a few times and pick out the best result. And nobody thinks that’s weird.”
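The repeated-sampling check Panichella describes is easy to sketch. In the snippet below, `ask_llm` is a hypothetical wrapper around whichever assistant is being queried; the point is simply to send the same prompt many times and compare the answers, rather than cherry-picking the best one.

```python
from collections import Counter

def ask_llm(prompt: str) -> str:
    """Hypothetical wrapper around an LLM API; replace with a real client call."""
    raise NotImplementedError

def sample_answers(prompt: str, n: int = 30) -> Counter:
    """Ask the same question n times and count how often each distinct answer occurs."""
    return Counter(ask_llm(prompt) for _ in range(n))

# Usage sketch: report how consistent the model actually is for this prompt.
# answers = sample_answers("Write a Python function that reverses a linked list.")
# for answer, count in answers.most_common():
#     print(f"{count}/30 runs produced:\n{answer}\n")
```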

Why does this silence need to be broken?

“When I start talking about this silence, for instance during a presentation, people often say it’s a good observation. But they still accept the outcomes because they simply cannot know how the model was trained and with which datasets. I think this is unfair. After all, when I use traditional algorithms to train an artificial intelligence model, the first questions are ‘what training set did you use? And what was the model tested with?’ But when large language models are used, no one asks.”

Has the use of ChatGPT and the like become standard in programming?

“I myself use it a lot, for instance when I am writing a piece of software and want to automate parts of it. Then I ask an LLM to write code that does this or that. Afterwards, I check whether the code works. That check is, as I said, an essential step.”

‘Students can never trust ChatGPT’s code blindly’

What does this mean for coding students?

“Students using programming assistants should be aware that the code may be vulnerable to hackers, or even introduce bugs. This is because the language models are trained with multiple versions of the same software, some of which have known vulnerabilities. So students should never trust ChatGPT’s code blindly. They should always check the code. Testing remains necessary.”
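As an illustration of why that check matters (the snippet below is invented, not an actual ChatGPT answer): a generated function can look plausible and still mishandle an edge case, which a small test catches immediately.

```python
# Hypothetical assistant-generated function: looks reasonable, but is wrong for
# lists of even length, where the median should average the two middle values.
def median(values):
    ordered = sorted(values)
    return ordered[len(ordered) // 2]

def test_median():
    assert median([3, 1, 2]) == 2          # odd length: passes
    assert median([1, 2, 3, 4]) == 2.5     # even length: fails, exposing the bug

if __name__ == "__main__":
    test_median()  # raises AssertionError, so the generated code cannot be trusted as-is
```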

How did your fellow researchers react to the article?

“Most agree with us that there should be guidelines for the use of LLMs in software engineering. Others fear that guidelines would make it even more difficult to get their research published, since they would then have to prove that they complied with those guidelines throughout their research.”

Science editor Jos Wassink

Do you have a question or comment about this article?

j.w.wassink@tudelft.nl
