BJCR

British Journal of Contemporary Research

2979-8582

Bexford Publishing Ltd

https://bexfordpublishing.co.uk

BEX_MAY_26_025

10.67693/BJCR-ZK75R86D

Meta-Analysis

Algorithmic Poisoning: Quantifying the Infiltration of Synthetic Data in Open Public Knowledge Bases

Jhanna Mariel A. Merino

lorenzo.delacruz@bipsu.edu.ph

Biliran Province State University, Philippines

Lorenzo Romero de la Cruz

romero.lorenzo98@gmail.com

Biliran Province State University, Philippines

10 06 2026

1 1

Received: 01 06 2026

2026

This work is licensed under a Creative Commons Attribution 4.0 International License.

The mass deployment of large language models (LLMs) since the release of ChatGPT in late 2022 has produced a recursive feedback loop in which model-generated (“synthetic”) text and code are scraped back into the corpora used to train subsequent models. Theoretical work warns that this loop can trigger “model collapse”, an irreversible loss of distributional tails and information integrity. This study synthesises and quantifies the rate of synthetic-data infiltration across three pillars of the open public knowledge commons, namely web crawls (Common Crawl), Wikipedia, and open code repositories, and evaluates how that infiltration correlates with degradation of information integrity. We conducted a structured review and meta-synthesis of peer-reviewed and government-sourced empirical studies, triangulating three lines of evidence: corpus-level detection estimates (excess-vocabulary maximum-likelihood models, perplexity and cross-perplexity detectors such as Binoculars, and proprietary classifiers such as GPTZero), provenance documentation analyses of Hugging Face dataset cards, and multi-way-parallelism audits of Common Crawl. Corpora predating March 2022 served as baselines for calibrating detector false-positive rates. Convergent evidence shows synthetic infiltration rising sharply after 2022. Detectors flag over 5% of newly created English Wikipedia articles as AI-generated (lower bound 4.36%); at least 13.5% of 2024 biomedical abstracts show LLM excess-vocabulary signatures; 6.5–16.9% of AI-conference peer reviews are LLM-modified; and 57.1% of sentences in a Common Crawl-derived corpus are multi-way machine translations. Flagged content is of systematically lower quality. Controlled experiments confirm model collapse under data replacement but show that data accumulation bounds the damage. Synthetic infiltration of open knowledge bases is therefore measurable, accelerating, and correlated with quality degradation, but it is not yet catastrophic and its severity is conditional on training methodology. Provenance standards, watermarking, and regulatory marking obligations are necessary but currently insufficient mitigations.
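The baseline-calibration step mentioned above can be illustrated with a minimal sketch. It assumes that a detector returns a numeric score per document (higher meaning more likely AI-generated), that the threshold is set so that a chosen fraction of pre-LLM baseline documents is flagged, and that a conservative lower bound on the true synthetic share is the flagged rate minus that false-positive rate. The function names, the 1% target rate, and all scores are hypothetical and purely illustrative; they are not taken from the reviewed studies.

```python
def calibrate_threshold(baseline_scores, target_fpr):
    """Pick a threshold so that at most target_fpr of baseline items exceed it.

    baseline_scores: detector scores for documents known to predate LLMs
    (assumed human-written). target_fpr is assumed to lie in [0, 1).
    """
    ranked = sorted(baseline_scores)
    allowed = int(target_fpr * len(ranked))  # baseline items we accept flagging
    # Items scoring strictly above this value are flagged.
    return ranked[len(ranked) - allowed - 1]


def flagged_rate(scores, threshold):
    """Fraction of documents whose score is strictly above the threshold."""
    return sum(s > threshold for s in scores) / len(scores)


def lower_bound(flag_rate, fpr):
    """Conservative lower bound on the true synthetic share.

    Even if every false positive fell in this corpus at the baseline rate,
    at least flag_rate - fpr of the documents would still be synthetic.
    """
    return max(0.0, flag_rate - fpr)


# Toy, made-up scores (not real detector output).
baseline = [i / 1000 for i in range(1000)]   # pre-LLM documents
recent = [2.0] * 60 + [0.5] * 940            # post-LLM documents

threshold = calibrate_threshold(baseline, target_fpr=0.01)
rate = flagged_rate(recent, threshold)
print(f"threshold={threshold}, flagged={rate:.2%}, "
      f"lower bound={lower_bound(rate, 0.01):.2%}")
```

Subtracting the false-positive rate is deliberately conservative: it ignores missed detections, so the true share can only be higher than the bound if the baseline calibration transfers to the newer corpus.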