New study: Research on Microsoft Bing Chat

AI Chatbot produces misinformation about elections

Bing Chat, the AI-driven chatbot on Microsoft’s search engine Bing, makes up false scandals about real politicians and invents polling numbers. Microsoft seems unable or unwilling to fix the problem. These findings are based on a joint investigation by AlgorithmWatch and AI Forensics, whose final report was published today. We tested whether the chatbot would provide factual answers when prompted about the Swiss, Bavarian and Hessian elections that took place in October 2023.

Publication

15 December 2023


#ai #elections #generativeai #publicsphere

Clara Helming

Senior Advocacy & Policy Manager

helming@algorithmwatch.org

Bing Chat, recently rebranded as «Microsoft Copilot», is a conversational AI tool released by Microsoft in February 2023 as part of its search engine Bing. The AI tool generates answers based on current news by combining the Large Language Model (LLM) GPT-4 with search engine capabilities.

In this investigation, we tested whether the generative chatbot would provide correct and informative answers to questions about the federal elections in Switzerland as well as the state elections in Bavaria and Hesse that took place in October 2023. We prompted the chatbot with questions relating to candidates, polling and voting information, as well as more open requests for recommendations on whom to vote for when concerned with specific subjects, such as the environment. We collected the chatbot’s answers from 21 August 2023 to 2 October 2023.


What we found

One third of Bing Chat’s answers to election-related questions contained factual errors. Errors include wrong election dates, outdated candidates, or even invented scandals concerning candidates.
The chatbot’s safeguards are unevenly applied, leading to evasive answers 40% of the time. Evasion can be considered positive when it reflects the limits of the LLM’s ability to provide relevant information. However, this safeguard is not applied consistently. The chatbot often could not answer simple questions about the respective elections’ candidates, which devalues the tool as a source of information.
This is a systemic problem, as the generated answers to specific prompts remain prone to error. The chatbot’s inconsistency is consistent. Answers did not improve over time, which they could have done, for instance, as a result of more information becoming available online. The probability of a factually incorrect answer being generated remained constant.
Factual errors pose a risk to candidates’ and news outlets’ reputation. While generating factually incorrect answers, the chatbot often attributed them to a source that had reported correctly on the subject. Furthermore, Bing Chat made up stories about candidates being involved in scandalous behavior – and sometimes even attributed them to sources.
Microsoft is unable or unwilling to fix the problem. After we informed Microsoft about some of the issues we discovered, the company announced that it would address them. A month later, we took another sample, which showed that little had changed regarding the quality of the information provided to users.
Generative AI must be regulated. These chatbots do not have any concept of ‘truth’. The LLM underlying them uses statistics to calculate the most probable word order. The chatbot thereby generates plausible-sounding text, which in many cases is not true. Bing Chat does not copy-paste information from the sources it refers to, but by listing a source after every sentence in the style of a footnote, it gives precisely the impression that it does. For users, this is highly misleading. These results correspond to the established knowledge on the mathematical instability of generative AI outside of specific applications. This finding, along with a lack of substantial stakeholder oversight, demonstrates the need for regulation, all the more so because such models are commercialized as general-purpose AI, which means that the compounding errors extend over different fields of application.
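
To illustrate the point that a model choosing the most probable next word can produce fluent but false text, here is a minimal sketch in Python. It is a toy word-pair (bigram) counter, a drastic simplification and not how GPT-4 or Bing Chat actually work. The corpus, the function names and the example sentences are all invented for this illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus, invented for illustration only.
# Note: it says the referendum is in March, never in October.
CORPUS = [
    "the election takes place in october",
    "the election takes place in october",
    "the election takes place in october",
    "the referendum takes place in march",
]


def train_bigrams(sentences):
    """Count, for every word, which words follow it and how often."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    return counts


def most_probable_next(counts, word):
    """Return the most frequent follower of `word`, or None if there is none."""
    followers = counts.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


def generate(counts, start, max_words=10):
    """Greedily append the most probable next word until no follower exists."""
    words = [start]
    while len(words) < max_words:
        next_word = most_probable_next(counts, words[-1])
        if next_word is None:
            break
        words.append(next_word)
    return " ".join(words)


BIGRAMS = train_bigrams(CORPUS)
print(generate(BIGRAMS, "referendum"))
```

In this toy setup, the generated sentence places the referendum in October, although the only corpus sentence about the referendum says March. The sketch has no notion of which statement is true; it only follows the most frequent word sequence, which is the kind of failure described above, at a vastly smaller scale.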