When German journalist Martin Bernklau typed his name and location into Microsoft’s Copilot to see how his articles would be picked up by the chatbot, the answers horrified him. Copilot’s results asserted that Bernklau was an escapee from a psychiatric institution, a convicted child abuser, and a conman preying on widowers. For years, Bernklau had served as a courts reporter, and the AI chatbot had falsely blamed him for the crimes whose trials he had covered.

The accusations against Bernklau weren’t true, of course, and are examples of generative AI’s “hallucinations.” These are inaccurate or nonsensical responses to a prompt provided by the user, and they’re alarmingly common. Anyone attempting to use AI should always proceed with great caution, because information from such systems needs validation and verification by humans before it can be trusted.

But why did Copilot hallucinate these terrible and false accusations?

  • daniskarma@lemmy.dbzer0.com
    3 months ago

    It actually can be fixed. There is an accuracy to answers, like how confident the statistical model is in the answer. That’s why some questions get consistent answers while others don’t.

    The fix is not that hard; it’s a matter of reputation, of having the chatbot answer “I don’t know” when the confidence in an answer isn’t high enough. It’s pretty similar to what the chatbot does when you ask it to make you a bomb: it just hijacks the answer calculated by the model and gives a predefined answer instead.

    But it makes the AI look bad, so most publicly available models just answer anything even if they are not confident about it. Also, your reaction to the incorrect answer is used to train the model better, so it’s not even efficient for them to stop the hallucinations in their product. But it can be done.

    Models used by companies usually have a higher confidence threshold and answer “I don’t know” if they don’t have enough statistical proof for a particular answer.
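
The thresholding idea in this comment can be sketched roughly as follows. This is only an illustration under stated assumptions: the "confidence" is taken to be the geometric mean of per-token probabilities, the threshold value and the function names are hypothetical, and nothing here is how any particular vendor does it. Whether such a score tracks factual correctness at all is a separate question.

```python
import math

# Hypothetical threshold; the thread gives no concrete value.
CONFIDENCE_THRESHOLD = 0.8


def sequence_confidence(token_probs):
    """Geometric mean of the per-token probabilities (a crude proxy score)."""
    if not token_probs or min(token_probs) <= 0:
        return 0.0
    return math.exp(sum(math.log(p) for p in token_probs) / len(token_probs))


def answer_or_abstain(answer, token_probs, threshold=CONFIDENCE_THRESHOLD):
    """Return the model's answer, or a canned reply if the proxy score is low."""
    if sequence_confidence(token_probs) < threshold:
        return "I don't know"
    return answer


# Made-up probabilities, for illustration only.
print(answer_or_abstain("Paris", [0.99, 0.97, 0.98]))
print(answer_or_abstain("Berlin", [0.41, 0.35, 0.52]))
```

Raising the threshold trades more "I don't know" replies for fewer low-score answers, which is the trade-off the comment describes.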

    • Terrasque@infosec.pub
      3 months ago

      The fix is not that hard; it’s a matter of reputation, of having the chatbot answer “I don’t know” when the confidence in an answer isn’t high enough.

      This has been tried; it helps, but it’s not enough by itself. It’s one of the mitigation steps I was thinking of. And companies do work very hard to reduce hallucinations, just look at Microsoft’s newest thing.

      From that article:

      “Trying to eliminate hallucinations from generative AI is like trying to eliminate hydrogen from water,” said Os Keyes, a PhD candidate at the University of Washington who studies the ethical impact of emerging tech. “It’s an essential component of how the technology works.”

      Text-generating models hallucinate because they don’t actually “know” anything. They’re statistical systems that identify patterns in a series of words and predict which words come next based on the countless examples they are trained on.

      It follows that a model’s responses aren’t answers, but merely predictions of how a question would be answered were it present in the training set. As a consequence, models tend to play fast and loose with the truth. One study found that OpenAI’s ChatGPT gets medical questions wrong half the time.

      • daniskarma@lemmy.dbzer0.com
        3 months ago

        The hydrogen-from-water thing is simply wrong, if it is supposed to mean that hallucinations are an inherent part of generative LLM technology that cannot be solved.

        They are not inherent to the technology. They are a product of a lack of control over the statistical output, of prioritizing any answer over no answer.

        As with any statistics, you have a confidence in how true something is based on your data. It’s just a matter of putting the threshold higher or lower.

        If you ask an easy question like “What is the capital of France?” you won’t ever get a hallucination, because all models will have that answer with very high confidence. You just have to make it so that if that level of confidence is not reached, it defaults to an “I don’t know” answer. But, once again, this will make the chatbots seem very dumb, as they will answer with lots of “I don’t know”.

        The problem here is the amount of data and the efficiency of the model. To get a usable general-purpose model with a confidence threshold high enough to not hallucinate, at today’s model efficiency it would need to be a humongous model, too big and with too much training data even for big tech. So we can go that big, we can try to improve efficiency (which is proving very hard for general models), or we can do both. Time will tell, but I’m quite confident that we will reach a general-use model without hallucinations sooner or later.

        • Terrasque@infosec.pub
          3 months ago

          As with any statistics, you have a confidence in how true something is based on your data. It’s just a matter of putting the threshold higher or lower.

          You just have to make it so that if that level of confidence is not reached, it defaults to an “I don’t know” answer. But, once again, this will make the chatbots seem very dumb, as they will answer with lots of “I don’t know”.

          I think you misunderstand how LLMs work. It doesn’t have a confidence; it’s not like it looks at its data and says “hmm, yes, most say Paris is the capital of France, so that’s the answer”. It “just” puts weight on the next token depending on its internal statistics, then one of those tokens is picked, and the process starts anew.
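
The weigh, pick, repeat loop described here can be sketched with a toy model. Everything in it is an assumption for illustration: `TOY_LOGITS` is an invented lookup table that only looks at the previous token, whereas a real LLM computes its scores from the whole context with a neural network.

```python
import math
import random

# Toy "model": maps the previous token to made-up scores (logits) for the next one.
TOY_LOGITS = {
    "<start>": {"The": 3.0, "A": 1.0},
    "The": {"capital": 4.0, "city": 1.5},
    "A": {"capital": 2.0, "city": 2.0},
    "capital": {"is": 5.0},
    "city": {"is": 5.0},
    "is": {"Paris": 3.5, "Berlin": 1.0, "<end>": 0.1},
    "Paris": {"<end>": 5.0},
    "Berlin": {"<end>": 5.0},
}


def softmax(logits):
    """Turn raw scores into a probability distribution over tokens."""
    peak = max(logits.values())
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}


def generate(rng, max_tokens=10):
    """Repeat: weight the candidate tokens, pick one, start over."""
    token, output = "<start>", []
    for _ in range(max_tokens):
        probs = softmax(TOY_LOGITS[token])
        token = rng.choices(list(probs), weights=list(probs.values()))[0]
        if token == "<end>":
            break
        output.append(token)
    return output


print(" ".join(generate(random.Random(0))))
```

Nothing in the loop checks whether the finished sentence is true; each step only sees a distribution over the next token.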

          Teaching the model to say “I don’t know” helps a bit, and was lauded as “The Solution” a year or two ago, but it turns out it didn’t really help that much. Then you got grounded approaches, RAG, CoT, and so on, all with the goal of making the LLM more reliable. None of them solves the problem because, as the PhD said, it’s inherent in how LLMs work.

          And no, local LLMs aren’t better, they’re actually much worse, and the big companies are throwing billions at trying to solve this. And no, it’s not because “that makes the LLM look dumb” that they haven’t solved it.

          Early on I was looking into making a business of providing local AI to businesses, especially RAG. But no model I tried, even with the documents being part of the context, came close to reliable enough. They all hallucinated too much. I still check this out now and then out of my own interest, and while it’s become a lot better, it’s still a big issue. Which is why you see it in the news again and again.

          This is the single biggest hurdle for the big companies in turning their AIs from a curiosity and something assisting a human into a full-fledged autonomous / knowledge system they can sell to customers. You bet your dangleberries they try everything they can to solve this.

          And if you think you have the solution that every researcher, developer, and machine learning engineer has missed, then please go prove it and collect some fat checks.

          • daniskarma@lemmy.dbzer0.com
            3 months ago

            What do you think “weight” is?

            It is, simplifying, the amount of data that says “The capital of France is Paris”. It doesn’t need to understand anything. It just has to stop the process if the statistics don’t provide enough to continue with confidence. If the data is all over the place and you have several “The capital of France is Berlin/Madrid/Milan”, that’s measurable compared to all the data saying it is Paris. No need for any kind of “understanding” of the meaning of the individual words, just measuring confidence in what the next word should be.

            A couple of years back, when we played with small neural networks playing Mario, you could see the internal process in real time, as there were not that many layers. It was evident how the process and the levels of confidence changed depending on how deep the training was. Here it is just orders of magnitude bigger, but nothing impossible to overcome, as some people try to sell it.

            An alternative way of measuring confidence is to just run the same question several times and check whether the answers are equivalent.
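
A rough sketch of that repeated-sampling check, under loud assumptions: `toy_ask` is a made-up stand-in for a sampled model call, and "equivalent" is reduced to an exact match after lowercasing, which is much cruder than comparing free-text answers would need to be.

```python
import random
from collections import Counter


def agreement(ask, question, runs=20, seed=0):
    """Ask the same question `runs` times; return the top answer and its share."""
    rng = random.Random(seed)
    answers = [ask(question, rng).strip().lower() for _ in range(runs)]
    top_answer, count = Counter(answers).most_common(1)[0]
    return top_answer, count / runs


def toy_ask(question, rng):
    """Stand-in for a sampled model call; the answer distributions are invented."""
    if "France" in question:
        return "Paris"
    return rng.choice(["1912", "1915", "1921", "1912"])


print(agreement(toy_ask, "What is the capital of France?"))
print(agreement(toy_ask, "When was the town hall of a small village built?"))
```

A high share only says the samples agree with each other; a model that is consistently wrong would score just as well.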

            The PhD is a PhD in scaremongering about technology, so they’re not an authority on anything here.

            IDK what you did, but SLMs don’t really hallucinate that much, if at all, especially if they are trained with good datasets.

            As I said, the solution is not in my hands, as it involves improving the efficiency or the amount of data. Efficiency has issues, as current techniques seem unable to improve it beyond a certain level. And the amount of data is, obviously, costly.

            • Terrasque@infosec.pub
              3 months ago

              What do you think “weight” is?

              You can call that confidence if you want, but it has very little to do with how “sure” the model is.

              It just has to stop the process if the statistics don’t provide enough to continue with confidence. If the data is all over the place and you have several “The capital of France is Berlin/Madrid/Milan”, that’s measurable compared to all the data saying it is Paris. No need for any kind of “understanding” of the meaning of the individual words, just measuring confidence in what the next word should be.

              Actually, it would be “The confidence of token Th is 0.95, the confidence of S is 0.32, the confidence of …” and so on for each possible token; many LLMs have a token vocabulary of around 16k-32k. Most will be at or near 0. So you pick Th, and then token “e” will probably be very high next, then a space token, then… Anyway, the confidence of the word “Paris” won’t come up until far into the generation.
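
To make the token-by-token point concrete, here is a small sketch with invented numbers. It assumes word-sized tokens rather than the sub-word pieces ("Th", "e") mentioned above, and the probabilities in `STEPS` are made up; the only real rule used is that the probability of a whole sequence is the product of the per-step probabilities.

```python
import math

# Made-up per-step probabilities of the token that was actually picked.
# The token carrying the fact (" Paris") only shows up at the sixth step.
STEPS = [
    ("The", 0.95),
    (" capital", 0.90),
    (" of", 0.99),
    (" France", 0.97),
    (" is", 0.98),
    (" Paris", 0.60),
    (".", 0.99),
]


def sequence_probability(steps):
    """Chain rule: multiply the conditional probability of each picked token."""
    return math.prod(prob for _, prob in steps)


def weakest_step(steps):
    """Index and token of the least certain step in the generation."""
    index = min(range(len(steps)), key=lambda i: steps[i][1])
    return index, steps[index][0]


print(round(sequence_probability(STEPS), 3))
print(weakest_step(STEPS))
```

In this toy case most steps are near-certain filler, so a score averaged over all tokens would say little about the one step that carries the fact.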

              Now, there is some overseeing logic in a way: if you ask what the capital of a nonexistent country is, it’ll say there’s no such country. But is that because it understands it doesn’t know, or because the training data has enough examples of such questions that it has the statistical data for writing out such an answer?

              IDK what you did, but SLMs don’t really hallucinate that much, if at all.

              I assume by SLM you mean smaller LLMs, like for example Mistral 7B and Llama 3.1 8B? Well, those were the kind of models I did try for local RAG.

              Well, it was before Llama 3, but I remember trying Mistral, Mixtral, Llama 2 70B, Command-R, Phi, Vicuna, Yi, and a few others. They all made mistakes.

              I especially remember one case where a product manual had this text: “If the same or a newer version of <product> is already installed on the computer, then the <product> installation will be aborted, and the currently installed version will be maintained” and the question was “What happens if an older version of <product> is already installed?” Every local model answered that the installed version would be kept and the installation aborted.

              When trying with OpenAI’s latest model at that time, I think 4, it got it right. In general, about 1 in ~5-7 answers to RAG-backed questions were wrong, depending on the model and type of question. I could usually reword the question to get the correct answer, but to do that you kinda already have to know the answer is wrong. Which defeats the whole point of it.

              • daniskarma@lemmy.dbzer0.com
                3 months ago

                More or less that. There’s a point along the path the input takes through the language model where the induced randomness can significantly affect the output or not. If all the weights are pointing to the same end node, because the “confidence” is high, then no matter the random seed, the output will be the same. When the seed greatly affects the final result, it is because the weights don’t point with that confidence to a unique end node, so the small randomness introduced at the beginning (the seed, so to speak) greatly changes the result. It is here where you are most likely to get a hallucination.
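
The seed-sensitivity argument can be illustrated for a single sampling step. This is a sketch with invented distributions (`peaked` and `spread` are assumptions, not measurements from any model): Shannon entropy summarizes how concentrated the weights are, and sampling with many seeds shows how much the pick varies.

```python
import math
import random


def entropy(probs):
    """Shannon entropy in bits; values near 0 mean one token dominates."""
    return sum(-p * math.log2(p) for p in probs.values() if p > 0)


def distinct_outputs(probs, seeds):
    """How many different tokens are sampled across the given seeds."""
    tokens, weights = list(probs), list(probs.values())
    picks = {random.Random(s).choices(tokens, weights=weights)[0] for s in seeds}
    return len(picks)


# Made-up next-token distributions, for illustration only.
peaked = {"Paris": 0.97, "Berlin": 0.02, "Madrid": 0.01}
spread = {"Paris": 0.25, "Berlin": 0.25, "Madrid": 0.25, "Milan": 0.25}

for name, dist in (("peaked", peaked), ("spread", spread)):
    print(name, round(entropy(dist), 2), distinct_outputs(dist, range(50)))
```

This only covers one step; in a full generation the effect of the seed compounds across every token that follows.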

                To put it again in terms of the much easier-to-inspect earlier neural networks: when you didn’t train the model enough, Mario just made random movements without attempting to complete the level, because the weights of the neurons could not reliably take the input and transform it into a useful output. It is something that could be solved in smaller models. For larger models it gets incredibly complicated because of the massive amount of data, the complexity of the data, and the complexity of proper training. But it’s not something impossible or something that can’t be gotten rid of. The same way you can get Mario to finally complete all levels every time without issues, you can get a non-hallucinating chatbot; it just takes more technological improvements.

                I suppose it could be said that the nature of language is chaotic like weather and not deterministic like a Mario level, and thus it would be actually “impossible” to get large results, like it’s impossible to get precise weather a month in advance. But I’m not sure there would be enough evidence to support that, as hallucinations are not across the board; they tend to happen on matters that had little training data. Matters with plenty of training data do not produce hallucinations even in today’s models.

                I searched for SLM online and found the small models you mentioned. I wasn’t referring to those. Those are just small large language models IMO, if that makes any sense. A proper SLM should also have a small purpose; it cannot be general chat. I mostly mean the current chatbots that point you to predefined answers, or summarizing ones. Nothing that could really elaborate a written answer word by word.

                Currently, and to my knowledge, there isn’t any general language model that can just write up answers and that is good enough to not hallucinate. But certainly we are getting closer each year.

                Edit: I’ve been looking for an example, here: https://www.tax.service.gov.uk/ask-hmrc/chat/self-assessment These kinds of chatbots know when their answer is not precise and default to a polite “ask again” answer instead of just telling you the first “hallucination” that came to them. They are powered by similar AI technology, but it’s not general-use and cannot write word by word. But it “knows” when the answer is precise or not.

                • Railcar8095@lemm.ee
                  3 months ago

                  The example you shared is not an LLM. It’s a classic chatbot with predefined answers. It basically maps keywords to KB articles. If no term is known, it will say “I don’t know”. It will also suggest an incorrect KB article if it picks up one keyword while ignoring the rest of the context. It has no idea whether the answer is correct by any means. At best, somebody will periodically check a sample of the questions whose answers users didn’t consider correct to evaluate the pairings, but it’s not AI, at least not a good one.
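
The kind of keyword-to-article bot described here might look like the sketch below. It is not the implementation of any real tax-agency chatbot; the table `KB_ARTICLES`, the article titles, and the matching rule are all invented for illustration.

```python
# Hypothetical keyword-to-article table; the entries are invented.
KB_ARTICLES = {
    "deadline": "KB-101: Filing deadlines",
    "penalty": "KB-102: Late filing penalties",
    "refund": "KB-103: Claiming a refund",
}

FALLBACK = "I don't know. Could you rephrase the question?"


def reply(question):
    """Return the article for the first known keyword, else the fallback."""
    words = question.lower().replace("?", " ").split()
    for word in words:
        if word in KB_ARTICLES:
            return KB_ARTICLES[word]
    return FALLBACK


print(reply("How do I appeal a penalty?"))
print(reply("Is my pension taxable?"))
print(reply("Can I get a refund of a penalty I paid?"))
```

The third question shows the failure mode mentioned above: the first matching keyword wins and the rest of the context is ignored. The fallback fires only when no keyword is present, not when the answer is likely to be wrong.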

                  • daniskarma@lemmy.dbzer0.com
                    3 months ago

                    If you read my answers you’ll see that I said they are not LLMs. They are language models powered by smaller datasets and with smaller neural networks.

                    I picked a tax agency in particular because I know firsthand that tax agencies (it would surprise me if the UK’s didn’t) do use language models with neural networks (notice that, again, I’m not saying generative LLMs) to parse the question and select a proper answer. Not the keyword method you think they use.

                    I would have provided the firsthand example I know, but it is in Spanish and people may not be able to understand it well. But I do know that tax agencies usually use very similar tools from one country to another, so the UK probably does use it. If you want to test the Spanish one, here it is, along with sources on what type of AI is used.

                    https://sede.agenciatributaria.gob.es/Sede/ayuda/herramientas-asistencia-virtual.html

                    https://es.newsroom.ibm.com/2018-02-28-La-Agencia-Tributaria-utiliza-IBM-Watson-para-ayudar-a-las-empresas-en-la-gestion-del-IVA

                    Again, because it seems that I need to repeat this so people can properly train on the info I’m writing: not LLM, not GPT, not a large general-use language model. With that number of parameters, cutting non-confident answers would probably cut most answers. At least with today’s state of technology; things keep improving each year.

                    Edit: found an English source on the matter: https://www.investinspain.org/en/news/2024/ibm

                    The chatbot is still only available in Spanish and the co-official languages.

        • jj4211@lemmy.world
          3 months ago

          This article is an example where statistical confidence doesn’t help. The model has lots of data, so it likely has high confidence, but it didn’t have any understanding of the nature of the relations in the data.

          I recently built an application where we indicated the confidence of the model’s output. For some scenarios, the high-confidence output had even more mistakes than the low-confidence output.
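
One way to check for the effect described here is to bucket outputs by reported confidence and compare accuracy per bucket. The sketch below assumes you already have (confidence, was_correct) pairs from a labeled evaluation; the bucket edges are arbitrary and the records are invented to mirror the scenario, not taken from the application mentioned.

```python
from collections import defaultdict


def accuracy_by_bucket(records, edges=(0.5, 0.8)):
    """Group (confidence, was_correct) pairs into buckets; return accuracy per bucket."""
    labels = ["low", "medium", "high"]
    totals, hits = defaultdict(int), defaultdict(int)
    for confidence, was_correct in records:
        index = sum(confidence >= edge for edge in edges)
        totals[labels[index]] += 1
        hits[labels[index]] += int(was_correct)
    return {label: hits[label] / totals[label] for label in labels if totals[label]}


# Invented evaluation records, shaped to mirror the scenario in the comment.
records = [
    (0.95, False), (0.92, True), (0.90, False), (0.85, False),
    (0.40, True), (0.35, True), (0.30, False), (0.20, True),
]

print(accuracy_by_bucket(records))
```

If accuracy does not rise with the confidence bucket, a fixed confidence threshold would filter out the wrong answers for the wrong reasons.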

        • futatorius@lemm.ee
          3 months ago

          They are a product of a lack of control over the statistical output.

          OK, so describe how you control that output so that hallucinations don’t occur. Does the anti-hallucination training set exceed the size of the original LLM’s training set? How is it validated? If it’s validated by human feedback, then how much of that validation feedback is required, and how do you know that the feedback is not being used to subvert the model rather than to train it?