Researchers have identified a phenomenon called “context rot” where increasing the number of input tokens degrades the performance of large language models [1].
This discovery challenges the assumption that LLMs can handle vast amounts of data uniformly. As companies integrate longer prompts and multi-turn conversations into commercial products, performance drops could lead to inaccuracies in critical downstream applications [1, 3].
The study, published in 2024, evaluated 18 LLMs [1], including GPT-4.1, Claude 4, Gemini 2.5, and Qwen 3 [2]. The Trychroma team reported a consistent drop in accuracy as context length grew, suggesting that the models struggle to maintain coherence when overwhelmed by too much input data [1, 3].
However, other researchers suggest this limitation is not absolute. Researchers at MIT CSAIL developed a recursive framework designed to address these stability issues [2]. According to their findings, the framework allows LLMs to process up to 10 million tokens without observable context rot [2].
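The source does not detail how the MIT CSAIL framework works, but the general idea behind recursive long-context processing can be sketched as follows: split an oversized input into bounded chunks, have the model compress each chunk independently, then recurse on the concatenated results so that no single model call ever sees more than a fixed budget. The `call_llm` helper and `CHUNK_SIZE` below are hypothetical stand-ins, not part of the published framework.

```python
# Minimal sketch of a recursive chunk-and-reduce strategy for long inputs.
# call_llm is a placeholder, NOT a real API: it truncates its input to
# simulate a bounded summary returned by a model.

CHUNK_SIZE = 1000  # max characters per model call (stand-in for a token budget)


def call_llm(prompt: str) -> str:
    # Placeholder: in practice this would call an actual model endpoint.
    return prompt[:200]


def recursive_reduce(text: str) -> str:
    """Recursively compress `text` so no single call exceeds CHUNK_SIZE."""
    if len(text) <= CHUNK_SIZE:
        return call_llm(text)
    # Split into bounded chunks and compress each one independently...
    chunks = [text[i:i + CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]
    summaries = [call_llm(chunk) for chunk in chunks]
    # ...then recurse on the joined summaries until one call suffices.
    return recursive_reduce("\n".join(summaries))
```

Because each call operates well inside the model's comfortable range, an orchestration layer like this could in principle sidestep context rot regardless of total input size, which is consistent with the 10-million-token figure reported for the MIT framework [2].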
The discrepancy between the Trychroma paper and the MIT framework highlights a central tension in AI development. While standard model architectures appear to suffer from degradation as input grows, new orchestration methods may bypass that bottleneck [1, 2].
The Trychroma team conducted the analysis to investigate whether LLMs truly handle long contexts uniformly [1]. Their results indicate that performance varies significantly as input length changes, even when the models are performing simple tasks [3].
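Long-context evaluations of this kind are often built as "needle in a haystack" probes: a known fact is buried in filler text of varying length, and the harness checks whether the model can still retrieve it. The sketch below illustrates that methodology only; `fake_model` is a toy stand-in that crudely mimics the reported accuracy drop, not any model the study tested.

```python
# Toy needle-in-a-haystack probe, illustrating the shape of a long-context
# evaluation. fake_model is NOT a real LLM: it "retrieves" the needle only
# when the prompt is short, mimicking degradation at long context lengths.
import random

NEEDLE = "The magic number is 7421."


def fake_model(prompt: str) -> str:
    # Stand-in behavior: succeeds only below an arbitrary length threshold.
    if "magic number is 7421" in prompt and len(prompt) < 5000:
        return "7421"
    return "unknown"


def probe(context_len: int) -> bool:
    """Bury the needle at a random position in ~context_len chars of filler,
    then check whether the model answers the retrieval question correctly."""
    filler = "lorem ipsum " * (context_len // 12)
    pos = random.randint(0, len(filler))
    prompt = (filler[:pos] + " " + NEEDLE + " " + filler[pos:]
              + "\nWhat is the magic number?")
    return fake_model(prompt) == "7421"
```

A real harness would sweep `context_len` across many values and models and plot accuracy against input length; the consistent downward trend in that curve is what the study labels context rot [1, 3].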
“Increasing the number of input tokens degrades the performance of large language models.”
— from the Trychroma study [1]
The emergence of “context rot” suggests that simply enlarging a model's context window does not guarantee the model can actually use that memory effectively. The contrast between the Trychroma findings and the MIT recursive framework indicates that the solution to long-context processing may lie not in the models themselves, but in the architectural frameworks used to feed data into those models.