Science
Interview with software researcher Panichella

‘ChatGPT gets little criticism, while the outcomes are sometimes really wrong’

In a much-downloaded article, TU Delft scientists warn against blind faith in artificially intelligent programming tools such as ChatGPT. “Testing remains necessary,” stresses software researcher Dr Annibale Panichella.

Software researchers Annibale Panichella (left) and June Sallou question the use of AI coding assistants. (Photo: Jaden Accord)

By a happy coincidence, modern programming languages are so much like ordinary language that the same techniques used in language models can also be used to generate computer code. So if you expose a language model to a lot of computer code, it learns to program by itself, just like toddlers learn to talk. Popular AI programming assistants such as GitHub Copilot and ChatGPT are examples of this.

But where did a programming assistant acquire its language skills? And, in research terms, what was the training data? This is usually not disclosed, and it increasingly troubles researchers.

They know, for instance, that successive versions of programs have been used as input, including software that was later replaced because of bugs or vulnerabilities to hackers. Tests for that software may also have ended up in the training data. This is comparable to passing a driving theory test because you already know the questions from a practice exam. Programmers call this data leakage.

‘ChatGPT doesn’t always give the same answer to the same question. Normally, you would figure out why that is’

These kinds of emerging doubts led Dr Annibale Panichella to write a high-profile paper with his colleagues Dr June Sallou and Dr Thomas Durieux. The paper will be presented at the International Conference on Software Engineering (ICSE) in Lisbon in April, but it has already been downloaded hundreds of times from the TU Delft repository.

The three authors work in the Software Engineering Department of the Faculty of Electrical Engineering, Mathematics and Computer Science (EEMCS). Delta spoke to Dr Panichella to find out where the interest comes from.

The article is called ‘Breaking the silence’. What silence is that?

“People have a lot of faith in generative artificial intelligence, such as ChatGPT. But the results are not always accurate. Also, researchers are often less critical about large language models or LLMs than about other techniques. ChatGPT gets little criticism, while sometimes the results are really wrong. For example, ChatGPT does not always give the same answer to the same question. Normally, you would find out why that is by asking the question many times and comparing the outcomes. But people don’t do that with large language models. They repeat their task a few times and pick out the best result. And nobody thinks that’s weird.”
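As a concrete illustration (not taken from the paper itself), the comparison Panichella describes could look like the sketch below: send the same prompt several times and tally the distinct answers. The `ask_llm` helper and the example prompt are hypothetical placeholders for whichever LLM client is actually used.

```python
from collections import Counter

# Hypothetical helper: wrap whichever LLM client you actually use here.
def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your own LLM client")

def measure_variability(prompt: str, runs: int = 10) -> Counter:
    """Ask the same question repeatedly and count the distinct answers."""
    answers = [ask_llm(prompt).strip() for _ in range(runs)]
    return Counter(answers)

if __name__ == "__main__":
    counts = measure_variability("Is this Java method thread-safe? Answer yes or no.")
    for answer, n in counts.most_common():
        print(f"{n:2d}x  {answer}")
```

If the counter shows several different answers to an identical question, that variance is exactly the behaviour Panichella argues should be examined rather than averaged away by picking the best run.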

Why does this silence need to be broken?

“When I start talking about this silence, for instance during a presentation, people often say it’s a good observation. But they still accept the outcomes because they simply cannot know how the model was trained and with which datasets. I think this is unfair. After all, when I use traditional algorithms to train an artificial intelligence model, the first questions are ‘what training set did you use? And what was the model tested with?’ But when large language models are used, no one asks.”

Has the use of ChatGPT and the like become standard in programming?

“I myself use it a lot, for instance when I write a piece of software and want to automate parts of it. Then I ask an LLM to write code that does this and that. Afterwards, I check whether the code works. That check is, as I said, an essential step.”

‘Students can never trust ChatGPT’s code blindly’

What does this mean for coding students?

“Students using programming assistants should be aware that the code may be vulnerable to hackers, or even introduce bugs. This is because the language models are trained with multiple versions of the same software, some of which have known vulnerabilities. So students should never trust ChatGPT’s code blindly. They should always check the code. Testing remains necessary.”
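To show what such a check might look like in practice (this example is illustrative and not from the article or the paper), a student could wrap an assistant-generated function in a small unit test before trusting it. The function name and logic below are made up for the sketch.

```python
import unittest

# Suppose an AI assistant generated this helper; name and logic are illustrative.
def parse_port(value: str) -> int:
    """Parse a TCP port number from user input."""
    port = int(value)
    if not 0 < port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port

# A minimal test suite that exercises the generated code before it is used.
class TestParsePort(unittest.TestCase):
    def test_valid_port(self):
        self.assertEqual(parse_port("8080"), 8080)

    def test_rejects_out_of_range(self):
        with self.assertRaises(ValueError):
            parse_port("70000")

    def test_rejects_garbage(self):
        with self.assertRaises(ValueError):
            parse_port("not-a-port")

if __name__ == "__main__":
    unittest.main()
```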

How did your fellow researchers react to the article?

“Most agree with us that there should be guidelines for the use of LLMs in software engineering. Others fear that guidelines would make it even more difficult to get their research published. They will then have to prove that they complied with those guidelines throughout their research.”

Science editor Jos Wassink

Do you have a question or comment about this article?

j.w.wassink@tudelft.nl
