A. Bulatov

Y. Kuratov

M. Burtsev

Training a transformer with recurrent memory on several key reasoning tasks at once substantially improves its capabilities.

Long-context reasoning with language models remains computationally costly as attention scales quadratically and contexts grow to millions of tokens. We show that a compact recurrent-memory transformer, trained across several reasoning tasks and guided by task descriptions, can answer questions over very long texts more accurately than far larger models, while generalising to longer inputs, new tasks and input noise.

Recent advancements have significantly improved the capabilities and performance of language models, but have also increased computational demands due to the growing number of parameters and the quadratic complexity of the attention mechanism. As context sizes expand into millions of tokens, making long-context processing more accessible and efficient becomes a critical challenge. Furthermore, modern benchmarks such as BABILong [1] underscore the inefficiency of even the most powerful LLMs in long-context reasoning. In this paper, we employ fine-tuning and multi-task learning to train a model capable of mastering multiple BABILong long-context reasoning skills. We demonstrate that even models with fewer than 140 million parameters can outperform much larger counterparts by learning multiple essential tasks simultaneously. By conditioning the Recurrent Memory Transformer [2] on a task description, we achieve state-of-the-art results on the multi-task BABILong QA1–QA5 set for inputs of up to 32k tokens. The proposed model also generalizes to new lengths and tasks and is more robust to input perturbations.

Multitasking memory

Mastering long-context multi-task reasoning with transformers and recurrent memory