15/10/2024
By Clément Bénesse: The gist of this paper is the following "chain of thoughts" (since this term appears to be quite popular nowadays):
Italian data is not readily available (with only some data in X-Fact dataset and FakeCovid).
Lack of context and lack of domain diversity make some claims unverifiable.
Because of this, the authors explore claim ambiguity and distribution shift across genres and sources, plus the benefit of allowing the model to abstain.
They begin with a manual annotation of all Italian data reported in the train set of X-Fact, and the manual creation of synthetic data (pre-existing text twisted to go from "news-like" to "social-like" and vice versa). The aim is to train a model that, given a text, provides the truth level of said message, using semantic search (spoiler: it is Sentence-BERT + cosine similarity with known anchor points + a majority vote over these anchors).
There are two hyperparameters: the maximal number of anchor points ($n$) and the threshold on the cosine similarity. They found that $n=1$ works best. Given this, the threshold has the following meaning: if an anchor point is close enough, we transfer its label; otherwise, we abstain. Hence, the choice of this hyperparameter defines the number of "jokers" we allow the model to use. This leads to the figures (esp. Figure 2) given in the article.
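To make the pipeline concrete, here is a minimal sketch of the nearest-anchor classification with abstention. This is my own illustrative reconstruction, not the authors' code: the function name, the default values `n=1` and `threshold=0.8`, and the toy embeddings are all assumptions; in the paper, the embeddings would come from Sentence-BERT.

```python
import numpy as np

def classify_with_abstention(query_emb, anchor_embs, anchor_labels,
                             n=1, threshold=0.8):
    """Transfer the label of the nearest anchor(s) by cosine similarity,
    or abstain (use a "joker") when no anchor is similar enough.
    Sketch only: n=1 and threshold=0.8 are illustrative defaults."""
    # Cosine similarity between the query and every anchor embedding.
    sims = anchor_embs @ query_emb / (
        np.linalg.norm(anchor_embs, axis=1) * np.linalg.norm(query_emb))
    # Keep at most n anchors, best first, that pass the threshold.
    best = np.argsort(sims)[::-1][:n]
    kept = [i for i in best if sims[i] >= threshold]
    if not kept:
        return "abstain"
    # Majority vote over the kept anchors
    # (with n=1 this reduces to a plain label transfer).
    labels, counts = np.unique([anchor_labels[i] for i in kept],
                               return_counts=True)
    return labels[np.argmax(counts)]

# Toy example with 2-D placeholder embeddings.
anchors = np.array([[1.0, 0.0], [0.0, 1.0]])
labels = ["true", "false"]
print(classify_with_abstention(np.array([0.9, 0.1]), anchors, labels))  # close to anchor 0
print(classify_with_abstention(np.array([0.5, 0.5]), anchors, labels))  # too far from both
```

Raising the threshold makes the model abstain more often, which is exactly the trade-off explored in Figure 2 of the article.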
My opinion:
The lack of data is real for non-English disinformation! Manual annotation of these texts, while time-intensive, may indeed be the first step toward better automated disinformation detection. I am a bit curious about the methodology for synthetic data creation, as human input may not be reproducible. I like the idea of allowing the model to abstain, as I often speak about it :). I would have liked more comparison with other models, especially since the pipeline is quite "simple". Otherwise, a nice paper that exhibits once again the need for context, in particular when working with short texts.