
Discover, Discuss, and Read arXiv papers

Discover new, recommended papers

DeepSeek-OCR explores "contexts optical compression" to enable Large Language Models (LLMs) to process lengthy texts more efficiently by representing information visually. The model achieves approximately 97% text decoding precision at 9-10x vision-text compression ratios and sets new benchmarks in OCR performance with significantly fewer vision tokens, while also offering advanced deep parsing capabilities for structured data.
Glyph scales large language model context windows by visually compressing long texts into compact images, enabling Vision-Language Models to process 3-4x more original text tokens. This method, developed by Tsinghua University and Zhipu AI, achieves significantly faster inference and training while maintaining competitive performance on long-context benchmarks.
The Free Transformer augments standard decoder-only Transformer architectures by conditioning their generative process on learned, unsupervised random latent variables injected into a middle layer. This architectural innovation from FAIR at Meta enhances performance on reasoning-intensive tasks like code generation and math problems, demonstrating improved inductive bias with minimal computational overhead.
Researchers from Renmin University of China and Tsinghua University introduce DeepAnalyze, an 8-billion-parameter agentic large language model designed for autonomous end-to-end data science, from raw data processing to generating comprehensive research reports. The model's curriculum-based training and novel architecture enable it to consistently outperform existing automated methods and most proprietary LLMs across various data analysis, modeling, and open-ended research benchmarks.
Ring-1T is the first open-source trillion-parameter thinking model developed by the Ling Team at Inclusion AI, achieving state-of-the-art reasoning capabilities in competitive mathematics, coding, and logical reasoning benchmarks through a multi-stage reinforcement learning pipeline and specialized infrastructure.
LightMem introduces a lightweight and efficient memory system for Large Language Models, enabling them to effectively process long and dynamic conversational contexts with improved accuracy and drastically reduced computational overhead. It achieves up to 117x fewer tokens, 177x fewer API calls, and over 12x faster runtime while increasing question-answering accuracy by up to 10.9%.
Researchers from Microsoft Research Asia and Peking University developed Deep Self-Evolving Reasoning (DSER), a probabilistic framework that significantly extends the complex reasoning capabilities of open-weight large language models. The framework enabled an 8B-parameter model to solve 5 out of 9 previously intractable AIME problems and surpass its 600B teacher model's performance on AIME benchmarks.
FineVision introduces a meticulously curated open multimodal dataset comprising 24 million samples derived from over 200 public sources. Models trained on FineVision achieve an average performance improvement of 5.1 to 14.3 percentage points across 11 diverse benchmarks compared to previous open-source datasets, while exhibiting a low test-set contamination rate of 1.02%.
Researchers at Harvard University developed power sampling, a training-free method leveraging the Metropolis-Hastings algorithm to sample from a sharpened distribution of a base large language model. This technique unlocks latent reasoning capabilities, achieving single-shot performance comparable to or exceeding reinforcement learning post-training methods across various tasks, while also preserving generation diversity.
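As a rough illustration of what sampling from a sharpened distribution with Metropolis-Hastings can look like, the sketch below targets a distribution proportional to p(x)^alpha for a toy three-outcome categorical p. It is not the paper's procedure, which operates on token sequences from a language model. The function names, the choice of drawing proposals from the base distribution itself, and all numeric values are assumptions made for this example.

```python
import random
from collections import Counter


def sharpened(probs, alpha):
    """Exact sharpened distribution: probs[i] ** alpha, renormalized."""
    weights = [p ** alpha for p in probs]
    total = sum(weights)
    return [w / total for w in weights]


def power_sample_mh(probs, alpha, n_steps, rng):
    """Run a Metropolis-Hastings chain whose target is proportional to probs[x] ** alpha.

    Proposals are drawn independently from the base distribution `probs`
    (an assumption of this sketch), so the acceptance ratio simplifies to
    (p(proposal) / p(current)) ** (alpha - 1).
    """
    states = list(range(len(probs)))
    current = rng.choices(states, weights=probs, k=1)[0]
    samples = []
    for _ in range(n_steps):
        proposal = rng.choices(states, weights=probs, k=1)[0]
        ratio = (probs[proposal] / probs[current]) ** (alpha - 1)
        if rng.random() < min(1.0, ratio):
            current = proposal  # accept the proposed state
        samples.append(current)  # on rejection the current state repeats
    return samples


if __name__ == "__main__":
    base = [0.5, 0.3, 0.2]  # toy base distribution (hypothetical)
    chain = power_sample_mh(base, alpha=4.0, n_steps=20000, rng=random.Random(0))
    counts = Counter(chain)
    empirical = [counts[i] / len(chain) for i in range(len(base))]
    print("exact sharpened:", sharpened(base, 4.0))
    print("MH empirical:   ", empirical)
```

With alpha greater than 1, the chain spends more of its time on the outcomes the base distribution already favors. No weights are updated at any point, which is the sense in which such a scheme is training-free.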
QueST, a framework from the University of Zurich and Microsoft Research, trains large language models to generate difficult competitive coding problems at an unprecedented scale. This enables an 8B model, fine-tuned on the generated synthetic data, to achieve code reasoning performance on par with a 671B model, establishing a new Pareto optimum.
This research introduces Loopholing, a mechanism that deterministically propagates rich continuous latent information across denoising steps in discrete diffusion models, directly addressing the "sampling wall" problem. Loopholing Discrete Diffusion Models (LDDMs) demonstrate enhanced language generation quality and improved performance on reasoning tasks, often surpassing autoregressive baselines in generative perplexity and consistency.
Researchers at Google DeepMind and collaborating institutions found that open-weight large language models readily regurgitate semantically similar alignment data, which traditional string-matching metrics undercount by at least 10 times. The study shows this extracted data is sufficiently potent to train new models that achieve comparable performance to those trained on original, proprietary datasets.
While Multimodal Large Language Models (MLLMs) excel at holistic understanding, they struggle in capturing the dense world with complex scenes, requiring fine-grained analysis of intricate details and object inter-relationships. Region-level MLLMs have been a promising step. However, previous attempts are generally optimized to understand given regions in isolation, neglecting crucial global contexts. To address this, we introduce Grasp Any Region (GAR) for comprehensive region-level visual understanding. Empowered by an effective RoI-aligned feature replay technique, GAR supports (1) precise perception by leveraging necessary global contexts, and (2) modeling interactions between multiple prompts. Together, it then naturally achieves (3) advanced compositional reasoning to answer specific free-form questions about any region, shifting the paradigm from passive description to active dialogue. Moreover, we construct GAR-Bench, which not only provides a more accurate evaluation of single-region comprehension, but also, more importantly, measures interactions and complex reasoning across multiple regions. Extensive experiments have demonstrated that GAR-1B not only maintains the state-of-the-art captioning capabilities, e.g., outperforming DAM-3B +4.5 on DLC-Bench, but also excels at modeling relationships between multiple prompts with advanced comprehension capabilities, even surpassing InternVL3-78B on GAR-Bench-VQA. More importantly, our zero-shot GAR-8B even outperforms in-domain VideoRefer-7B on VideoRefer-BenchQ, indicating its strong capabilities can be easily transferred to videos.
Salesforce AI Research developed Enterprise Deep Research (EDR), a steerable multi-agent system that conducts comprehensive research for enterprise analytics by leveraging diverse data sources, including proprietary ones. This framework enables real-time human intervention and transparent reasoning, outperforming several existing deep research systems on benchmarks and significantly reducing time-to-insight in internal enterprise use cases.
Vision-Language Models (VLMs) achieve strong results on multimodal tasks such as visual question answering, yet they can still fail even when the correct visual evidence is present. In this work, we systematically investigate whether these failures arise from not perceiving the evidence or from not leveraging it effectively. By examining layer-wise attention dynamics, we find that shallow layers focus primarily on text, while deeper layers sparsely but reliably attend to localized evidence regions. Surprisingly, VLMs often perceive the visual evidence when outputting incorrect answers, a phenomenon we term "seeing but not believing" that widely exists in major VLM families. Building on this, we introduce an inference-time intervention that highlights deep-layer evidence regions through selective attention-based masking. It requires no training and consistently improves accuracy across multiple families, including LLaVA, Qwen, Gemma, and InternVL. These results show that VLMs encode reliable evidence internally but under-utilize it; making such signals explicit can bridge the gap between perception and reasoning, advancing the diagnostic understanding and reliability of VLMs.
SEETOK proposes a vision-centric tokenization method that converts text into images for Large Language Models (LLMs), enabling them to "read" text visually. This approach reduces token counts by 4.43x and FLOPs by 70.5%, demonstrating improved multilingual fairness, translation quality, and robustness to text perturbations, while maintaining or exceeding performance on language understanding tasks.
Researchers at Johns Hopkins University and collaborators introduce "World-in-World," the first comprehensive open platform to benchmark visual world models by their practical utility in supporting embodied agents in closed-loop interaction. The platform reveals that for embodied tasks, controllability and scaling post-training with action-observation data are more effective than relying on visual quality alone, and that increased inference-time computation leads to substantially improved closed-loop performance.
Researchers at The University of Tokyo established a general algebraic-topological framework to analyze the entanglement structure of quantum many-body states possessing Rep(G) loop symmetries. This method determines the full entanglement spectrum across arbitrary dimensions and manifold topologies, including a direct verification of the Li-Haldane conjecture for Kitaev quantum double models.
Researchers demonstrate that representing long textual inputs as images for multimodal LLMs can reduce the number of decoder tokens by nearly half while maintaining high accuracy. This approach, developed by AI2, University of Chicago, and Stony Brook University, leads to faster inference for large models on tasks like retrieval and summarization.
OmniNWM unifies comprehensive multi-modal state generation, precise action control, and occupancy-grounded rewards into a single framework for autonomous driving world models. It achieves state-of-the-art performance in generating high-fidelity RGB videos, panoramic depth maps, and 3D semantic occupancy, while demonstrating robust long-term generation and zero-shot generalization.