<?xml version="1.0" encoding="UTF-8"?>
<article xmlns:xlink="http://www.w3.org/1999/xlink" xml:lang="en"
         xmlns:mml="http://www.w3.org/1998/Math/MathML">
  <front>
    <journal-meta>
      <journal-id journal-id-type="publisher">BJCR</journal-id>
      <journal-title-group>
        <journal-title xml:lang="en">British Journal of Contemporary Research</journal-title>
        <abbrev-journal-title xml:lang="en">BJCR</abbrev-journal-title>
      </journal-title-group>
      
      <publisher>
        <publisher-name>Bexford Publishing Ltd</publisher-name>
        <publisher-loc><uri>https://bexfordpublishing.co.uk</uri></publisher-loc>
      </publisher>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="publisher-id">BEX_MAY_26_025</article-id>
      
      <article-categories>
        <subj-group xml:lang="en" subj-group-type="heading">
          <subject>Meta-Analysis</subject>
        </subj-group>
      </article-categories>
      <title-group>
        <article-title xml:lang="en">Algorithmic Poisoning: Quantifying the Infiltration of Synthetic Data in Open Public Knowledge Bases</article-title>
      </title-group>
      <contrib-group content-type="author">
      <contrib corresp="yes">
        <name-alternatives>
          <name name-style="western" specific-use="primary">
            <given-names>JHANNA MARIEL A. MERINO</given-names>
          </name>
        </name-alternatives>
        <email>lorenzo.delacruz@bipsu.edu.ph</email>
        <bio xml:lang="en"><p>Biliran Province State University, Philippines</p></bio>
      </contrib>
      <contrib>
        <name-alternatives>
          <name name-style="western" specific-use="primary">
            <given-names>Lorenzo Romero de la Cruz</given-names>
          </name>
        </name-alternatives>
        <email>romero.lorenzo98@gmail.com</email>
        <bio xml:lang="en"><p>Biliran Province State University, Philippines</p></bio>
      </contrib>
      </contrib-group>
      <pub-date date-type="pub" publication-format="epub">
        <day>10</day>
        <month>06</month>
        <year>2026</year>
      </pub-date>
      <volume>1</volume>
      <issue>1</issue>
      
      
      <pub-history>
        <event event-type="received">
          <event-desc>Received: <date date-type="received">
            <day>01</day>
            <month>06</month>
            <year>2026</year>
          </date></event-desc>
        </event>
        
      </pub-history>
      <permissions>
        <copyright-statement>Copyright (c) 2026 JHANNA MARIEL A. MERINO</copyright-statement>
        <copyright-year>2026</copyright-year>
        <license xlink:href="https://creativecommons.org/licenses/by/4.0">
          <license-p>This work is licensed under a Creative Commons Attribution 4.0 International License.</license-p>
        </license>
      </permissions>
      <abstract><p>The mass deployment of large language models (LLMs) since the late-2022 release of ChatGPT has produced a recursive feedback loop in which model-generated (“synthetic”) text and code are scraped back into the corpora used to train subsequent models. Theoretical work warns that this loop can trigger “model collapse”, an irreversible loss of distributional tails and information integrity. This study synthesises and quantifies the rate of synthetic-data infiltration across three pillars of the open public knowledge commons: web crawls (Common Crawl), Wikipedia, and open code repositories. It also evaluates how that infiltration correlates with degradation of information integrity. We conducted a structured review and meta-synthesis of peer-reviewed and government-sourced empirical studies, triangulating three kinds of evidence: corpus-level detection estimates (excess-vocabulary maximum-likelihood models, perplexity and cross-perplexity detectors such as Binoculars, and proprietary classifiers such as GPTZero), provenance documentation analyses of Hugging Face dataset cards, and multi-way-parallelism audits of Common Crawl. Corpora predating March 2022 served as baselines for calibrating detector false-positive rates. Convergent evidence shows synthetic infiltration rising sharply after 2022. Detectors flag over 5% of newly created English Wikipedia articles as AI-generated (lower bound 4.36%); at least 13.5% of 2024 biomedical abstracts show LLM excess-vocabulary signatures; 6.5–16.9% of AI-conference peer reviews are LLM-modified; and 57.1% of sentences in a Common Crawl-derived corpus are multi-way machine translations. Flagged content is of systematically lower quality. Controlled experiments confirm model collapse under data replacement but show that data accumulation bounds the damage. Synthetic infiltration of open knowledge bases is measurable, accelerating, and correlated with quality degradation, but its impact is not yet catastrophic and depends on training methodology. Provenance standards, watermarking, and regulatory marking obligations are necessary but currently insufficient mitigations.</p></abstract>
    </article-meta>
  </front>
  <body/>
</article>