Algorithmic Poisoning: Quantifying the Infiltration of Synthetic Data in Open Public Knowledge Bases

JHANNA MARIEL A. MERINO; Lorenzo Romero de la Cruz

doi:10.67693/BJCR-ZK75R86D

Publication Details

Publication Date: 10/06/2026
Volume / Issue: Vol 1, Issue 1 (2026)
Article No.: 011
Journal: British Journal of Contemporary Research
DOI: 10.67693/BJCR-ZK75R86D
Received: 01 Jun 2026

Affiliations

JHANNA MARIEL A. MERINO: Biliran Province State University, Philippines

Lorenzo Romero de la Cruz: Biliran Province State University, Philippines

Abstract

The mass deployment of large language models (LLMs) since the late-2022 release of ChatGPT has produced a recursive feedback loop in which model-generated (“synthetic”) text and code are scraped back into the corpora used to train subsequent models. Theoretical work warns that this loop can trigger “model collapse”, an irreversible loss of distributional tails and information integrity. This study synthesises and quantifies the rate of synthetic-data infiltration across three pillars of the open public knowledge commons, namely web crawls (Common Crawl), Wikipedia, and open code repositories, and evaluates how that infiltration correlates with degradation of information integrity. We conducted a structured review and meta-synthesis of peer-reviewed and government-sourced empirical studies, triangulating three kinds of evidence: corpus-level detection estimates (excess-vocabulary maximum-likelihood models, perplexity and cross-perplexity detectors such as Binoculars, and proprietary classifiers such as GPTZero), provenance documentation analyses of Hugging Face dataset cards, and multi-way-parallelism audits of Common Crawl. Pre-March-2022 corpora served as calibrated false-positive-rate baselines. Convergent evidence shows synthetic infiltration rising sharply after 2022. Detectors flag over 5% of newly created English Wikipedia articles as AI-generated (lower bound 4.36%); at least 13.5% of 2024 biomedical abstracts show LLM excess-vocabulary signatures; 6.5–16.9% of AI-conference peer reviews are LLM-modified; and 57.1% of sentences in a Common Crawl-derived corpus are multi-way machine translations. Flagged content is of systematically lower quality. Controlled experiments confirm model collapse under data replacement but show that data accumulation bounds the damage. Synthetic infiltration of open knowledge bases is therefore measurable, accelerating, and correlated with quality degradation, but it is not yet catastrophic and its severity is conditional on training methodology. Provenance standards, watermarking, and regulatory marking obligations are necessary but currently insufficient mitigations.
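The role of a false-positive-rate baseline in turning a raw detector flag rate into a conservative prevalence estimate can be illustrated with a minimal sketch. It uses the standard Rogan-Gladen correction for an imperfect classifier; this is an illustrative assumption, not necessarily the procedure used by the studies synthesised here. The function name `corrected_prevalence` and the example rates are hypothetical and are not figures from the article.

```python
def corrected_prevalence(flag_rate: float, fpr: float, tpr: float = 1.0) -> float:
    """Estimate the true share of synthetic documents from a detector's flag rate.

    Rogan-Gladen correction: if a fraction `a` of documents is synthetic, the
    expected flag rate is a * tpr + (1 - a) * fpr, so
    a = (flag_rate - fpr) / (tpr - fpr).

    fpr is the false-positive rate measured on a baseline corpus known to be
    human-written (for example, text predating LLM deployment).
    tpr is the detector's true-positive rate. Leaving it at 1.0 assumes a
    perfect detector, which yields a conservative (lower-bound) estimate
    because any real tpr below 1 would raise the result.
    """
    if not 0.0 <= flag_rate <= 1.0:
        raise ValueError("flag_rate must lie in [0, 1]")
    if not 0.0 <= fpr < tpr <= 1.0:
        raise ValueError("need 0 <= fpr < tpr <= 1")
    estimate = (flag_rate - fpr) / (tpr - fpr)
    # Sampling noise can push the raw estimate outside [0, 1]; clip it.
    return min(1.0, max(0.0, estimate))


if __name__ == "__main__":
    # Hypothetical rates chosen only for illustration.
    observed = 0.06   # share of new documents the detector flags
    baseline = 0.01   # share of known-human baseline documents it flags
    print(f"lower bound on synthetic share: {corrected_prevalence(observed, baseline):.4f}")
```

The same formula also covers a simple excess-vocabulary estimate, where `fpr` and `tpr` are replaced by the rates at which a marker word appears in human-written and model-written text respectively.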

Keywords

Model Collapse; Synthetic Data; Data Provenance; Large Language Models; Information Integrity; Machine-Generated Text Detection; Common Crawl; Content Authenticity

License

This article is published under the Creative Commons Attribution 4.0 International License. Free to read, share, and adapt with attribution.

Cite This Article

JHANNA MARIEL A. MERINO, Lorenzo Romero de la Cruz (2026). Algorithmic Poisoning: Quantifying the Infiltration of Synthetic Data in Open Public Knowledge Bases. British Journal of Contemporary Research, 1(1), Article 011. https://doi.org/10.67693/BJCR-ZK75R86D

JHANNA MARIEL A. MERINO. “Algorithmic Poisoning: Quantifying the Infiltration of Synthetic Data in Open Public Knowledge Bases.” British Journal of Contemporary Research, vol. 1, no. 1, 2026.

JHANNA MARIEL A. MERINO. “Algorithmic Poisoning: Quantifying the Infiltration of Synthetic Data in Open Public Knowledge Bases.” British Journal of Contemporary Research 1, no. 1.

Metadata

ISSN 2979-8582

DOI Prefix 10.67693

Tracking ID BEX_MAY_26_025

British Journal of Contemporary Research

Open Access · Peer Reviewed · Published by Bexford Publishing Ltd
