I’m not sure where you’re getting the idea that language models are effective lie detectors; it’s widely known that LLMs have no concept of truth and hallucinate constantly.
And that’s before we even get into the inherent biases and moral judgements required for any form of truth detection.
The point isn’t to have it be a lie detector but a factual claim detector. You have a neural network that reads statements and says “this is making a factual claim” or “this is just an opinion/obvious joke/whatever,” and a person grades the responses to train it. Then the AI just says “hey, this thing is making some sort of fact-related claim,” and the warning applies no matter what the claim actually is.
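A minimal sketch of what that could look like, assuming a tiny hand-labeled dataset and a simple linear classifier standing in for the neural network plus human-grading loop described above (the example statements, labels, and the `flag_if_claim` helper are all hypothetical):

```python
# Sketch of a "factual claim detector" (not a lie detector): it only decides
# whether a statement asserts something checkable, not whether it's true.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical human-graded examples: 1 = makes a factual claim, 0 = opinion/joke.
statements = [
    "The Eiffel Tower is 330 metres tall.",                # factual claim
    "Vaccines cause more harm than good.",                 # factual claim (possibly false, still a claim)
    "I think pineapple on pizza is amazing.",              # opinion
    "lol imagine actually reading the terms of service",   # joke
    "Unemployment fell to 4% last quarter.",               # factual claim
    "This movie was a total waste of time.",               # opinion
]
labels = [1, 1, 0, 0, 1, 0]

# Train the claim/no-claim classifier on the human-graded examples.
detector = make_pipeline(TfidfVectorizer(), LogisticRegression())
detector.fit(statements, labels)

def flag_if_claim(text: str) -> str:
    """Attach a generic warning whenever the model thinks a factual claim is present."""
    if detector.predict([text])[0] == 1:
        return f"[warning: this post appears to make a factual claim] {text}"
    return text

print(flag_if_claim("The moon landing was filmed in a studio."))
print(flag_if_claim("Honestly, Mondays are the worst."))
```

The point of the design is that the model never has to judge truth, only whether something is being asserted as fact, which sidesteps the bias/moral-judgement problem above.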