Hi. I’m working with a team from Columbia University, funded by a Wikimedia Foundation rapid grant. We are seeking Wikipedia editors who are willing to participate in a study on GenAI reliability, with a commitment of 10 – 20 hours in mid December – mid January 2026, and a symbolic stipend to compensate for your time.
The Research Project. Our goal is to find out if using a Wikipedia-inspired fact-checking process can increase the reliability of chatbots responding to queries related to Wikipedia’s content. The study uses open-source language models and frameworks, and our full results will be openly shared, with the aim of finding better methods for addressing AI hallucinations that are inspired by the well-established and highly successful practices of Wikimedia projects.
Please note that this project is a '''pure and contained experiment''' for analyzing how far or close large language models are to editor-level factuality. We don’t plan on implementing any live tools at the moment.
The Task. The task required from participants is to fact-check an AI-generated response to a general knowledge question. This will be done by checking whether each claim in a paragraph-long response is supported by the provided sources (each paragraph will be supported by up to 3 citations; the text of each citation is up to a few paragraphs long).
Each participant will be asked to fact-check about 50 samples, with flexibility to do a bit more or less according to your availability. We recognize that this will be a demanding task, which is why we’re offering a stipend to those willing to make the time. The amount of the stipend is based on the number of samples fact-checked.
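To make the task concrete, a single fact-checking sample might be structured along the following lines; this is an illustrative sketch, and the field names and layout are assumptions rather than the study’s actual data format.

<syntaxhighlight lang="python">
# Hypothetical shape of one fact-checking sample; field names are illustrative,
# not the study's actual format.
from dataclasses import dataclass, field


@dataclass
class Citation:
    title: str
    text: str   # up to a few paragraphs of source text


@dataclass
class Sample:
    question: str              # general-knowledge question posed to the chatbot
    response: str              # paragraph-long AI-generated answer
    claims: list[str]          # individual claims extracted from the response
    citations: list[Citation]  # up to 3 supporting citations
    labels: list[bool] = field(default_factory=list)  # one supported/unsupported judgement per claim


sample = Sample(
    question="When did the Wright brothers first fly?",
    response="The Wright brothers made their first powered flight on 17 December 1903 ...",
    claims=["The Wright brothers made their first powered flight on 17 December 1903."],
    citations=[Citation(title="Wright Flyer", text="On December 17, 1903, ...")],
)
sample.labels = [True]  # the participant marks each claim as supported by the citations or not
</syntaxhighlight>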
Privacy & Security. If you choose to participate, we’re open to either crediting your efforts in our paper, or maintaining your full anonymity, whichever you prefer.
We adhere to the Wikimedia Foundation’s privacy policy. Participants may be asked to provide basic demographics for research purposes, which will be completely discarded after research concludes in early 2026.
Participation. All Wikipedia editors are eligible to participate. For methodological purposes, we may prioritize editors with expertise in specific subject matters, more Wikimedia project editing experience, or a focus and interest in fact-checking. If interested, please take a few minutes to submit the form! (Qualtrics external link). If you’re not comfortable filling out an external form, you may just send the answers to me directly using EmailUser.
Happy to share the research proposal or answer any questions! –Abbad (talk) 00:27, 26 November 2025 (UTC).
- @عباد ديرانية Well that seems like a major waste of time. The page you linked to says
We’ll build an experimental AI assistant for readers that exclusively draws answers from Wikipedia pages, and integrates an explicit and novel fact-checking step into its architecture that’s inspired by Wikipedia’s own fact-checking process by editors.
and
This assistant is not intended for public use but only as a time-bound experiment, which will be used for rigorous testing and evaluation of this model’s reliability compared to Wikipedia’s baseline of reliable information
and
using open source large language models (LLMs) as fact-checkers that can provide a reliable paraphrasing of Wikipedia’s content
- it won’t be able to differentiate between its training data and the Wikipedia pages it’s supposed to use as sources.
- current LLM technology can’t reliably paraphrase or summarize content
- training models requires copyright infringement on a massive scale, or it will be inferior to alternatives, which already have an established install base and a trillion dollars; kinda difficult to compete with.[1][2]
- doesn’t it make more sense to actually check the sources and verify if they support the claim made in the article, instead of having yet another chatbot which can do something any chatbot can do, but worse?
- 2300 dollars is not enough to achieve something meaningful.
- sample size is tiny
- moderate agreement is a very very low bar
We’ll consider this a success if more than two thirds of respondents support further experimentation in the future.
Makes no sense: of course 100% will support further experimentation, I do too, but not down this dead-end street. Having people support further experimentation does not mean this was a good idea.
- It will just be another lossy, unreliable, vague layer between users and reliable sources, like Wikipedia often is. We need less of that (e.g. by using the |quote= parameter), not more.
- This sounds like “I want to use AI, let me invent a use case”, not “I have a problem, let me fix it with whatever the best tool for the job is”.
- It is unclear what the results will be used for. The output will just be some numbers, which are meaningless by themselves.
- It is unclear what
an explicit and novel fact-checking step into its architecture that’s inspired by Wikipedia’s own fact-checking process by editors.
means. Using MiniCheck isn’t novel, and “We’ll ask an AI model to check the work of an AI model” leads to diminishing returns. If MiniCheck can do verification, why can’t the original model incorporate fact-checking? The root problem is that the base model generates facts, half-truths and nonsense. Instead of trying to sort fact from fiction, the goal should be to create a model that can verify its own output during generation, but that is far outside the scope of the WMF.
- A binary metric (true/false) is clearly inadequate when checking whether the paraphrasing is any good (a minimal sketch of such a binary check appears after this reply). A good summary doesn’t leave out important facts; yet the proposal only measures pure falsehoods instead of omissions of important material, distortions, cherry-picking, loss of nuance, synthesis et cetera. Pure hallucinations are a minority of the mistakes an LLM makes, but according to the proposal they’re the only ones being measured.
- We already had this same discussion, for example over at Wikipedia:Village_pump_(technical)/Archive_221#Simple_summaries:_editor_survey_and_2-week_mobile_study. The response there was universally negative, and we already know why this can’t work, so why try again?
- Why ask for volunteers and WMF money when Wikipedia doesn’t benefit from the results? Why ask Wikipedians, who have a lot of stuff to do, to volunteer to do stuff that doesn’t help Wikipedia? It’s not like the AI companies will improve their products based on the results, and one can’t improve Wikipedia based on the outcome, so who benefits?
- The proposal says
We’ll build an experimental AI assistant
and if that were true, testing it would make sense. But it also says the plan is to just mash some pre-existing stuff together. If so, why ask volunteers to check how good or bad Llama and MiniCheck are? Shouldn’t Meta Platforms employees test Llama? Shouldn’t Mistral AI SAS employees test Mixtral? These are commercial companies who can surely hire some people to test their stuff, if they wanted. If there is no plan to add anything new that should improve performance, why bother testing? One datapoint is no datapoint. I already know the outcome: current AI tech is not as good as humans, especially not the nerdy type who edits Wikipedia, and attempts to quantify the difference are pointless because they are just a weighted random number generator one could build a narrative around. In order to make it slightly less meaningless you’d have to keep doing it with each new version and track performance over time, but that would only help AI companies, not Wikipedia.
- You can’t measure success by comparing this chatbot against commercially available chatbots. The correct baseline is Wikipedia itself, which anyone can access already and read what it actually says.
- Showing that this chatbot produces fewer errors than commercial LLMs only proves that it is slightly less bad than commercial LLMs, not that it is a good approach to deliver Wikipedia content to users.
- If any hallucinations or distortions are added by the chatbot, then it is worse than just reading Wikipedia yourself.
- The interesting variable is how many hallucinations/misrepresentations/distortions are added compared to just reading the Wikipedia articles; how the chatbot compares to commercial LLMs is irrelevant to us.
- I may be stupid but I don’t get it. Polygnotus (talk) 00:49, 26 November 2025 (UTC)
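To make the binary-metric concern above concrete, here is a minimal sketch of what a true/false claim check against a provided source looks like, using an off-the-shelf NLI model from Hugging Face as a stand-in for MiniCheck (whose actual interface may differ); the model name and threshold are assumptions. Such a check only says supported/not supported per claim, and says nothing about omissions, distortions or loss of nuance.

<syntaxhighlight lang="python">
# Minimal sketch of a binary (supported / not supported) claim check.
# A generic NLI model stands in for MiniCheck; the model name, label handling
# and threshold are illustrative assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "roberta-large-mnli"  # assumption: any NLI model would do for this sketch
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)


def claim_supported(source_text: str, claim: str, threshold: float = 0.5) -> bool:
    """Return True if the NLI model judges that the source entails the claim."""
    inputs = tokenizer(source_text, claim, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
    # Read the 'entailment' index from the model config instead of hard-coding it.
    entail_idx = {v.lower(): k for k, v in model.config.id2label.items()}["entailment"]
    return probs[entail_idx].item() >= threshold


source = "Photosynthesis converts light energy into chemical energy in plants."
print(claim_supported(source, "Plants use light to produce chemical energy."))  # expected: True
print(claim_supported(source, "Photosynthesis was discovered in 1903."))        # expected: False
</syntaxhighlight>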
- @DSaroyan (WMF) and FElgueretly-WMF: Please explain why this is a good idea. Which technical experts has the Review team consulted? It would be nice to hear from them as well. It is also unclear to me how a Rapid grant can be awarded to a project that is ineligible:
Applications to complete proposed research related to the Wikimedia movement are not eligible. Please review the Wikimedia Research Fund for these funding opportunities.
—meta:Grants:Project/Rapid#Eligibility_requirements Thanks, Polygnotus (talk) 01:06, 26 November 2025 (UTC)
- This was also posted over at Wikipedia:Village_pump_(miscellaneous)#Looking_for_participants_in_a_GenAI_factuality_study. Doubleposting is generally discouraged because it wastes people’s time. Polygnotus (talk) 03:29, 26 November 2025 (UTC)
- @Polygnotus I appreciate the thoughtful critique. To what I interpret as your main point – yes, any hallucinations are bad. However, LLMs are already prevalent in the industry and academia, as you must know, and from our daily observations, their use almost completely lacks any sense of responsibility towards reliability. Honestly, Wikipedia itself, as a tertiary source, shouldn’t even be the ideal baseline for factuality, but we recognize that research is an incremental endeavour, therefore our approach is to start with introducing a methodical way to improve over the status quo of LLM usage. Realistically, we can’t even expect LLMs to improve without such experiments. Please note that because Wikipedia is our chatbot’s source, it is effectively a baseline for this study as well.
- In-line responses:
- Points 1-3: We examined the differentiation between retrieval and training data in-depth when scoping our research, and we have two considerations: A. From our literature review, we’re aware of methods that aim exactly to differentiate when an LLM’s answer is grounded in the provided context versus training data. If our resources allow, we do aim to implement the methodology from this paper in drawing this differentiation. However, this is a challenging setup, and our team is
100% volunteer-based (or more like 90%; we’ve had a little budget planned for some team members, but with fiscal sponsorship + paying evaluators + computing, we now expect a couple hundred USD surplus only), so even with the humble grant we may not be able to go that far. B. The eventual purpose of this study is to evaluate the factuality of LLMs in practice. Whether they make errors due to their training data, architecture, or Wikipedia-grounded context, it’s eventually an error.
- Re: Point 4: 100% agreed, and honestly my original idea was to build something exactly like the Source Verification tool using the MiniCheck model, which is open source, very lightweight, and has shown impressive accuracy in dozens of experiments that I did with it. My fellow researchers recommended a RAG approach because it has much more impact on the irresponsible use of chatbots in the industry, which is true. Also, because I discovered now that the Source Verification tool exists, I’m not sure if this approach is any different. I do still hope to run a methodical experiment, once we’re done with this project, by: A. Extracting the full text of some citations (e.g. a book), B. Extracting instances where they’re cited on Wikipedia pages, C. Running the full text + cited phrases through MiniCheck to see how accurate it is (a rough sketch of this loop appears below). I believe the results could be impressive.
- Re: Points 5 – 6: Indeed! That’s why all the researchers are 100% volunteer. We’re doing what we can with our budget, but we also understand that the community may not support pouring larger resources into experimental research at this point.
- Re: Point 7: This is almost exclusively the annotation baseline from other LLM research we ran across. I’ll do more homework on this, but please feel free to advise if you’re aware of alternatives.
- Re: Point 8: This is a goal to determine the success of the grant itself, so it needed to be experiment-tied, and a user-testing goal seemed appropriate. You’re right, though, and I’m open to revising. I’m hesitant to set a specific goal of factuality improvement because we won’t know, obviously, until we conduct the experiment.
- Re: Point 9: While I don’t disagree, lossy middle layers are not only a reality, but a necessity. As you mention, Wikipedia itself is a mediator of information, simply because most people lack the depth of knowledge and/or the time to digest information directly from secondary sources. LLMs, as far as we know, are here to stay, and this is a debate about that reality rather than about how it can be improved.
- Re: Points 10-11: This is clearly a huge use case, which is literally why we opted for it (over, as I mentioned above, what could be personally more interesting to me in terms of a tool to fact-check Wikipedia sources). For example, my company, which is not special in this in any way, pays for what’s easily hundreds of millions of LLM queries a month, mainly to power chatbots. As of now, the vast majority of these chatbots on the internet barely make any attempt at truth-seeking that’s analogous to what we’re proposing. The results from our study have the basic purpose of proving or disproving that the approach we’re trying can have an impact on factuality. In case it does, that’s an improvement on the status quo that will affect millions of users.
- Re: Point 13: Yes, strictly speaking, this is a factuality-centered study. Other aspects would fall under a summarizing task.
- Re: Points 14 – 16: This is very intentionally designed as an experiment of how existing tools like MiniCheck work. MiniCheck has already been developed, but how do we know if it’s doing its job well? The fact that these LLMs have been developed by labs has little to do with who’s using them, which extends to researchers, educators, non-profits, and even Wikimedians. However, the commercial labs obviously don’t care that much about how factual their models are in an academic sense, and have done little work in this area (otherwise, we would have seen way fewer hallucinations). We’re volunteering our time for this because we feel like it’s a critical under-researched area, and you’re free to think it’s worth or not worth your own time. Because this is such a small study, the impact won’t be astronomical, but we believe it can be very significant for Wikipedia contributors, because our results will show how effective MiniCheck can be as a fact-checker. This will be evidence of whether or not it’s usable for the Source Verification tool, rather than the simple fact that it exists. Did anyone else systematically test whether the fact-checking framework of that tool is consistent and usable?
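As a rough illustration of the A/B/C experiment described in the reply to Point 4 above, a sketch of the loop over (citation full text, cited Wikipedia sentence) pairs might look like this; is_supported() is a placeholder for the real MiniCheck call, and the pairs are invented examples.

<syntaxhighlight lang="python">
# Rough sketch of the proposed experiment: for each (citation full text,
# cited Wikipedia sentence) pair, ask a MiniCheck-style checker whether the
# sentence is supported, then report the overall fraction.
def is_supported(source_text: str, sentence: str) -> bool:
    # Placeholder: in the real experiment this would call MiniCheck (or a similar checker).
    return sentence.lower() in source_text.lower()


pairs = [
    # (full text of the cited source, sentence on Wikipedia that cites it)
    ("Full text of the cited book or article ...", "Sentence on Wikipedia citing it ..."),
]

supported = sum(is_supported(src, sent) for src, sent in pairs)
print(f"{supported}/{len(pairs)} cited sentences judged supported by their source")
</syntaxhighlight>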
TBC – there are lots of good points here, I’ll come back for the rest as soon as I have the chance! Answered —Abbad (talk) 21:24, 26 November 2025 (UTC).
- @عباد ديرانية ReDeEP looks cool but if I were you I would completely ignore Mixtral and stick to LLaMA. I do not think ReDeEP will be able to fix the problem that the model will mix training data and Wikipedia content.
- Please correct me if I am wrong, but if I am reading between the lines I think we mostly agree on the facts (although I would recommend using a different tactic).
- While LLM factuality is interesting (and annoyingly under-researched by the guys with the big bucks), most Wikipedians are always gonna be more interested in using MiniCheck to determine if a claim in a Wikipedia article is supported by the source (or not).
- We Wikipedians are a very simple people of humble peasant farmers like myself who just want results; not an academic study.
- So while you do your thing, can you please allow others to use MiniCheck as well? You already know exactly how I want to use it.
- Adding “MiniCheck was correct” and “MiniCheck was wrong” buttons is not very complicated (a rough sketch follows this reply).
- If we can show the masses practical results, it is much much easier to get them to volunteer/contribute/whatever.
- That way we have both academic validation and real-world testing, which benefits both.
- I do not agree that
our results will show how effective MiniCheck can be as a fact-checker
because that is not what is being tested (and you wouldn’t need such a complex pipeline for such a test).
- Testing whether a complex AI pipeline produces fewer (or filters out more) hallucinations than the base model is interesting, but not relevant to Wikipedia.
- I think the study needs to benefit Wikipedia, not just use it as a testbed, before you should be able to get WMF money or Wikipedia volunteers. And I don’t really see it doing that at the moment. Polygnotus (talk) 07:15, 27 November 2025 (UTC)
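For what it’s worth, a minimal sketch of the kind of endpoint suggested above (claim checking plus “correct”/“wrong” feedback buttons) could look like the following; the Flask framework, the routes, the field names and the check_claim() stand-in are all assumptions, not an existing or planned service.

<syntaxhighlight lang="python">
# Minimal sketch of a claim-checking endpoint with "correct"/"wrong" feedback.
# check_claim() is a hypothetical stand-in for a MiniCheck (or similar) backend;
# routes and field names are illustrative assumptions.
from flask import Flask, jsonify, request

app = Flask(__name__)
feedback_log: list[dict] = []  # in a real deployment this would be persisted


def check_claim(source_text: str, claim: str) -> bool:
    # Placeholder for the real checker sitting behind this API.
    return claim.lower() in source_text.lower()


@app.post("/check")
def check():
    data = request.get_json(force=True)
    return jsonify({"supported": check_claim(data["source"], data["claim"])})


@app.post("/feedback")
def feedback():
    # The "MiniCheck was correct" / "MiniCheck was wrong" buttons would POST here.
    data = request.get_json(force=True)
    feedback_log.append({"claim": data["claim"], "verdict": data["verdict"]})
    return jsonify({"recorded": len(feedback_log)})


if __name__ == "__main__":
    app.run(port=8080)
</syntaxhighlight>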
- ~500 responses total need evaluation.
- At least 300 of those need at least 3 evaluators.
- Let’s say the remaining 200 get one evaluation each.
- That’s at least 300 × 3 + 200 × 1 = 1100 evaluation tasks (worked through in the sketch after this reply).
- I don’t agree that a simple true/false evaluation will lead to meaningful results (point 13 above), but let’s assume it is fine.
- Each participant will be asked to fact-check about 50 samples so you need 22 people.
commitment of 10 – 20 hours in mid December
So 220 – 440 hours of volunteer time. Assuming an 8-hour work day, we are talking 1.25 – 2.5 workdays per person. I am not sure why evaluating 50 samples should take 10 – 20 hours, but whatever.
- I find it extremely difficult to outsource my to-do list. Finding 22 Wikipedians who are willing to spend a significant amount of time doing a very boring task that does not benefit Wikipedia is gonna be real hard. I don’t think a symbolic stipend is gonna do much to motivate em.
- In summary, the study as proposed won’t work. But installing MiniCheck somewhere and giving me an API endpoint and credentials is a good idea. Polygnotus (talk) 08:09, 27 November 2025 (UTC)
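For reference, the workload arithmetic above can be worked through in a few lines, assuming the multi-evaluator samples get exactly three evaluations each (the minimum consistent with the 1100 figure).

<syntaxhighlight lang="python">
# Worked-through version of the workload estimate above.
multi_eval_samples = 300    # samples needing at least 3 evaluators
single_eval_samples = 200   # remaining samples, one evaluation each
tasks = multi_eval_samples * 3 + single_eval_samples * 1  # = 1100 evaluation tasks

samples_per_person = 50
people = tasks // samples_per_person                       # = 22 participants

hours_low, hours_high = 10, 20                             # hours per participant
total_hours = (people * hours_low, people * hours_high)    # = (220, 440) volunteer hours
workdays_per_person = (hours_low / 8, hours_high / 8)      # = (1.25, 2.5) workdays each

print(tasks, people, total_hours, workdays_per_person)
</syntaxhighlight>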
- This study appears to have the goal of encouraging the use of LLMs, based on ‘fact-checking’ using Wikipedia as a source. Given that Wikipedia makes it entirely clear that it does not consider itself a reliable source, the study is clearly ill-thought-out, or at best, engaging in wishful thinking. And furthermore, any encouragement of this misleading LLM use can only make things worse for Wikipedia itself, as it faces a deluge of LLM-generated garbage, generated by a technology which routinely hallucinates (as has been demonstrated to be mathematically inherent in such software), engages in synthesis, contrary to Wikipedia policy, and mangles source citations to the extent that even if they originate from something genuine (and meet Wikipedia sourcing policy, which LLM citations routinely don’t), the amount of effort required to find the actual source is totally disproportionate to their utility. I would advise anyone contemplating engaging with this study to question whether it is in the interest of Wikipedia’s contributors, and perhaps more importantly its readers, to do so. AndyTheGrump (talk)


