Researchers have identified a phenomenon called “context rot” where increasing the number of input tokens degrades the performance of large language models [1].
This discovery challenges the assumption that LLMs can handle vast amounts of data uniformly. As companies integrate longer prompts and multi-turn conversations into commercial products, performance drops could lead to inaccuracies in critical downstream applications [1, 3].
The study, published in 2024, evaluated 18 LLMs [1], including GPT-4.1, Claude 4, Gemini 2.5, and Qwen 3 [2]. The Trychroma team reported a consistent drop in accuracy as context length grew, suggesting that the models struggle to maintain coherence when overwhelmed by too much input data [1, 3].
However, other researchers suggest this limitation is not absolute. Researchers at MIT CSAIL developed a recursive framework designed to address these stability issues [2]. According to their findings, the framework allows LLMs to process up to 10 million tokens without observable context rot [2].
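The source does not detail how the MIT CSAIL framework works, but the general idea behind recursive long-context processing can be sketched as follows: split an oversized input into bounded chunks, have the model compress each chunk independently, then recurse on the concatenated results so that no single model call ever sees more than a fixed budget. The `call_llm` helper and `CHUNK_SIZE` below are hypothetical stand-ins, not part of the published framework.

```python
# Minimal sketch of a recursive chunk-and-reduce strategy for long inputs.
# call_llm is a placeholder, NOT a real API: it truncates its input to
# simulate a bounded summary returned by a model.

CHUNK_SIZE = 1000  # max characters per model call (stand-in for a token budget)


def call_llm(prompt: str) -> str:
    # Placeholder: in practice this would call an actual model endpoint.
    return prompt[:200]


def recursive_reduce(text: str) -> str:
    """Recursively compress `text` so no single call exceeds CHUNK_SIZE."""
    if len(text) <= CHUNK_SIZE:
        return call_llm(text)
    # Split into bounded chunks and compress each one independently...
    chunks = [text[i:i + CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]
    summaries = [call_llm(chunk) for chunk in chunks]
    # ...then recurse on the joined summaries until one call suffices.
    return recursive_reduce("\n".join(summaries))
```

Because each call operates well inside the model's comfortable range, an orchestration layer like this could in principle sidestep context rot regardless of total input size, which is consistent with the 10-million-token figure reported for the MIT framework [2].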
The discrepancy between the Trychroma paper and the MIT framework highlights a central tension in AI development. While standard model architectures appear to suffer from degradation as input grows, new orchestration methods may bypass that bottleneck [1, 2].
The Trychroma team conducted the analysis to investigate whether LLMs truly handle long contexts uniformly [1]. Their results indicate that performance varies significantly as input length changes, even when the models are performing simple tasks [3].
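Long-context evaluations of this kind are often built as "needle in a haystack" probes: a known fact is buried in filler text of varying length, and the harness checks whether the model can still retrieve it. The sketch below illustrates that methodology only; `fake_model` is a toy stand-in that crudely mimics the reported accuracy drop, not any model the study tested.

```python
# Toy needle-in-a-haystack probe, illustrating the shape of a long-context
# evaluation. fake_model is NOT a real LLM: it "retrieves" the needle only
# when the prompt is short, mimicking degradation at long context lengths.
import random

NEEDLE = "The magic number is 7421."


def fake_model(prompt: str) -> str:
    # Stand-in behavior: succeeds only below an arbitrary length threshold.
    if "magic number is 7421" in prompt and len(prompt) < 5000:
        return "7421"
    return "unknown"


def probe(context_len: int) -> bool:
    """Bury the needle at a random position in ~context_len chars of filler,
    then check whether the model answers the retrieval question correctly."""
    filler = "lorem ipsum " * (context_len // 12)
    pos = random.randint(0, len(filler))
    prompt = (filler[:pos] + " " + NEEDLE + " " + filler[pos:]
              + "\nWhat is the magic number?")
    return fake_model(prompt) == "7421"
```

A real harness would sweep `context_len` across many values and models and plot accuracy against input length; the consistent downward trend in that curve is what the study labels context rot [1, 3].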
“Increasing the number of input tokens degrades the performance of large language models.”
— from the Trychroma study [1]
The emergence of “context rot” suggests that simply enlarging a model's context window does not guarantee the model can actually use that memory effectively. The contrast between the Trychroma findings and the MIT recursive framework indicates that the solution to long-context processing may lie not in the models themselves, but in the architectural frameworks used to feed data into those models.