alphaXiv

Discover, Discuss, and Read arXiv papers

Discover new, recommended papers


2,447

21 Oct 2025

computer-science computer-vision-and-pattern-recognition efficient-transformers

DeepSeek-OCR: Contexts Optical Compression

DeepSeek

DeepSeek-OCR explores "contexts optical compression" to enable Large Language Models (LLMs) to process lengthy texts more efficiently by representing information visually. The model achieves approximately 97% text decoding precision at 9-10x vision-text compression ratios and sets new benchmarks in OCR performance with significantly fewer vision tokens, while also offering advanced deep parsing capabilities for structured data.

9,072

1,103

21 Oct 2025

computer-science computation-and-language computer-vision-and-pattern-recognition

Glyph: Scaling Context Windows via Visual-Text Compression

Tsinghua University

Zhipu AI

Glyph scales large language model context windows by visually compressing long texts into compact images, enabling Vision-Language Models to process 3-4x more original text tokens. This method, developed by Tsinghua University and Zhipu AI, achieves significantly faster inference and training while maintaining competitive performance on long-context benchmarks.

493

20 Oct 2025

bayesian-deep-learning computer-science machine-learning

The Free Transformer

DeepAnalyze: Agentic Large Language Models for Autonomous Data Science

Tsinghua University

Renmin University of China

Researchers from Renmin University of China and Tsinghua University introduce DeepAnalyze, an 8-billion-parameter agentic large language model designed for autonomous end-to-end data science, from raw data processing to generating comprehensive research reports. The model's curriculum-based training and novel architecture enable it to consistently outperform existing automated methods and most proprietary LLMs across various data analysis, modeling, and open-ended research benchmarks.

264

21 Oct 2025

computer-science artificial-intelligence computation-and-language

Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model

Inclusion AI Ling Team

Ring-1T is the first open-source trillion-parameter thinking model developed by the Ling Team at Inclusion AI, achieving state-of-the-art reasoning capabilities in competitive mathematics, coding, and logical reasoning benchmarks through a multi-stage reinforcement learning pipeline and specialized infrastructure.

186

21 Oct 2025

agents computer-science artificial-intelligence

LightMem: Lightweight and Efficient Memory-Augmented Generation

National University of Singapore

Zhejiang University

LightMem introduces a lightweight and efficient memory system for Large Language Models, enabling them to effectively process long and dynamic conversational contexts with improved accuracy and drastically reduced computational overhead. It achieves up to 117x fewer tokens, 177x fewer API calls, and over 12x faster runtime while increasing question-answering accuracy by up to 10.9%.

242

20 Oct 2025

agents chain-of-thought computer-science

Deep Self-Evolving Reasoning

Peking University

Microsoft

Researchers from Microsoft Research Asia and Peking University developed Deep Self-Evolving Reasoning (DSER), a probabilistic framework that significantly extends the complex reasoning capabilities of open-weight large language models. The framework enabled an 8B-parameter model to solve 5 out of 9 previously intractable AIME problems and surpass its 600B teacher model's performance on AIME benchmarks.

236

20 Oct 2025

agents computer-science artificial-intelligence

FineVision: Open Data Is All You Need

Stanford University

Hugging Face

Technical University Munich

FineVision introduces a meticulously curated open multimodal dataset comprising 24 million samples derived from over 200 public sources. Models trained on FineVision achieve an average performance improvement of 5.1 to 14.3 percentage points across 11 diverse benchmarks compared to previous open-source datasets, while exhibiting a low test-set contamination rate of 1.02%.

3,647

16 Oct 2025

computer-science artificial-intelligence computation-and-language

Reasoning with Sampling: Your Base Model is Smarter Than You Think

Harvard University

Researchers at Harvard University developed power sampling, a training-free method leveraging the Metropolis-Hastings algorithm to sample from a sharpened distribution of a base large language model. This technique unlocks latent reasoning capabilities, achieving single-shot performance comparable to or exceeding reinforcement learning post-training methods across various tasks, while also preserving generation diversity.
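As a rough illustration of the idea in this summary, the sketch below runs Metropolis-Hastings on a toy three-outcome distribution and targets the sharpened distribution p(x)^alpha. It is not the paper's implementation: the function name `power_sample`, the toy probabilities, and the choice to propose candidates from the base distribution itself are assumptions made here for illustration, and a real language model would work over token sequences rather than three labels.

```python
import random


def power_sample(probs, alpha, steps, seed=0):
    """Toy Metropolis-Hastings chain whose target is proportional to p(x) ** alpha.

    `probs` is a hypothetical base distribution given as {outcome: probability}.
    """
    rng = random.Random(seed)
    states = list(probs)
    weights = [probs[s] for s in states]

    # Start from a draw of the base distribution.
    current = rng.choices(states, weights=weights)[0]
    samples = []
    for _ in range(steps):
        # Independence proposal: the candidate is drawn from the base distribution p.
        candidate = rng.choices(states, weights=weights)[0]
        # With target p**alpha and proposal p, the acceptance ratio
        # simplifies to (p(candidate) / p(current)) ** (alpha - 1).
        accept = min(1.0, (probs[candidate] / probs[current]) ** (alpha - 1))
        if rng.random() < accept:
            current = candidate
        samples.append(current)
    return samples


if __name__ == "__main__":
    base = {"a": 0.5, "b": 0.3, "c": 0.2}
    chain = power_sample(base, alpha=4, steps=20000)
    # The most likely outcome under the base distribution should dominate the chain.
    print(chain.count("a") / len(chain))
```

With alpha = 1 every proposal is accepted and the chain simply reproduces the base distribution; larger alpha concentrates mass on outcomes the base distribution already favors.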

186

20 Oct 2025

chain-of-thought computer-science computation-and-language

QueST: Incentivizing LLMs to Generate Difficult Problems

University of Zurich

Microsoft

QueST, a framework from the University of Zurich and Microsoft Research, trains large language models to generate difficult competitive coding problems at an unprecedented scale. This enables an 8B model, fine-tuned on the generated synthetic data, to achieve code reasoning performance on par with a 671B model, establishing a new Pareto optimum.

22 Oct 2025

computer-science machine-learning

Loopholing Discrete Diffusion: Deterministic Bypass of the Sampling Wall

This research introduces Loopholing, a mechanism that deterministically propagates rich continuous latent information across denoising steps in discrete diffusion models, directly addressing the "sampling wall" problem. Loopholing Discrete Diffusion Models (LDDMs) demonstrate enhanced language generation quality and improved performance on reasoning tasks, often surpassing autoregressive baselines in generative perplexity and consistency.

103

21 Oct 2025

adversarial-attacks computer-science artificial-intelligence

Extracting alignment data in open models

Google DeepMind

Anthropic

National University of Singapore

University of Oxford

OpenAI

AI Sequrity Company

MentaLeap

Researchers at Google DeepMind and collaborating institutions found that open-weight large language models readily regurgitate semantically similar alignment data, which traditional string-matching metrics undercount by at least 10 times. The study shows this extracted data is sufficiently potent to train new models that achieve comparable performance to those trained on original, proprietary datasets.

140

22 Oct 2025

computer-science artificial-intelligence computation-and-language

Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs

Wuhan University

ByteDance

Peking University

CASIA

UCAS

While Multimodal Large Language Models (MLLMs) excel at holistic understanding, they struggle in capturing the dense world with complex scenes, requiring fine-grained analysis of intricate details and object inter-relationships. Region-level MLLMs have been a promising step. However, previous attempts are generally optimized to understand given regions in isolation, neglecting crucial global contexts. To address this, we introduce Grasp Any Region (GAR) for comprehensive region-level visual understanding. Empowered by an effective RoI-aligned feature replay technique, GAR supports (1) precise perception by leveraging necessary global contexts, and (2) modeling interactions between multiple prompts. Together, it then naturally achieves (3) advanced compositional reasoning to answer specific free-form questions about any region, shifting the paradigm from passive description to active dialogue. Moreover, we construct GAR-Bench, which not only provides a more accurate evaluation of single-region comprehension, but also, more importantly, measures interactions and complex reasoning across multiple regions. Extensive experiments have demonstrated that GAR-1B not only maintains the state-of-the-art captioning capabilities, e.g., outperforming DAM-3B +4.5 on DLC-Bench, but also excels at modeling relationships between multiple prompts with advanced comprehension capabilities, even surpassing InternVL3-78B on GAR-Bench-VQA. More importantly, our zero-shot GAR-8B even outperforms in-domain VideoRefer-7B on VideoRefer-BenchQ, indicating its strong capabilities can be easily transferred to videos.

169

20 Oct 2025

agentic-frameworks agents computer-science

Enterprise Deep Research: Steerable Multi-Agent Deep Research for Enterprise Analytics

Salesforce AI Research

Salesforce AI Research developed Enterprise Deep Research (EDR), a steerable multi-agent system that conducts comprehensive research for enterprise analytics by leveraging diverse data sources, including proprietary ones. This framework enables real-time human intervention and transparent reasoning, outperforming several existing deep research systems on benchmarks and significantly reducing time-to-insight in internal enterprise use cases.

108

20 Oct 2025

attention-mechanisms computer-science artificial-intelligence

Seeing but Not Believing: Probing the Disconnect Between Visual Attention and Answer Correctness in VLMs

Vision-Language Models (VLMs) achieve strong results on multimodal tasks such as visual question answering, yet they can still fail even when the correct visual evidence is present. In this work, we systematically investigate whether these failures arise from not perceiving the evidence or from not leveraging it effectively. By examining layer-wise attention dynamics, we find that shallow layers focus primarily on text, while deeper layers sparsely but reliably attend to localized evidence regions. Surprisingly, VLMs often perceive the visual evidence when outputting incorrect answers, a phenomenon we term "seeing but not believing" that widely exists in major VLM families. Building on this, we introduce an inference-time intervention that highlights deep-layer evidence regions through selective attention-based masking. It requires no training and consistently improves accuracy across multiple families, including LLaVA, Qwen, Gemma, and InternVL. These results show that VLMs encode reliable evidence internally but under-utilize it; making such signals explicit can bridge the gap between perception and reasoning, advancing the diagnostic understanding and reliability of VLMs.

111

21 Oct 2025

computer-science computation-and-language computer-vision-and-pattern-recognition

See the Text: From Tokenization to Visual Reading

Nanjing University of Science and Technology

Central South University

Nanjing Forestry University

SEETOK proposes a vision-centric tokenization method that converts text into images for Large Language Models (LLMs), enabling them to "read" text visually. This approach reduces token counts by 4.43x and FLOPs by 70.5%, demonstrating improved multilingual fairness, translation quality, and robustness to text perturbations, while maintaining or exceeding performance on language understanding tasks.

118

20 Oct 2025

computer-science computer-vision-and-pattern-recognition

World-in-World: World Models in a Closed-Loop World

Harvard University

Peking University

Johns Hopkins University

MIT

Princeton University

Researchers at Johns Hopkins University and collaborators introduce "World-in-World," the first comprehensive open platform to benchmark visual world models by their practical utility in supporting embodied agents in closed-loop interaction. The platform reveals that for embodied tasks, controllability and scaling post-training with action-observation data are more effective than relying on visual quality alone, and that increased inference-time computation leads to substantially improved closed-loop performance.

129

21 Oct 2025

statistical-mechanics strongly-correlated-electrons high-energy-physics-theory

Entanglement Spectrum Resolved by Loop Symmetries

The University of Tokyo

Researchers at The University of Tokyo established a general algebraic-topological framework to analyze the entanglement structure of quantum many-body states possessing Rep(G) loop symmetries. This method determines the full entanglement spectrum across arbitrary dimensions and manifold topologies, including a direct verification of the Li-Haldane conjecture for Kitaev quantum double models.

21 Oct 2025

computer-science artificial-intelligence computation-and-language

Text or Pixels? It Takes Half: On the Token Efficiency of Visual Text Inputs in Multimodal LLMs

University of Chicago

Allen Institute for AI

Stony Brook University

Researchers demonstrate that representing long textual inputs as images for multimodal LLMs can cut the number of decoder tokens nearly in half while maintaining high accuracy. This approach, developed by AI2, University of Chicago, and Stony Brook University, leads to faster inference for large models on tasks like retrieval and summarization.

21 Oct 2025

autonomous-vehicles computer-science computer-vision-and-pattern-recognition

OmniNWM: Omniscient Driving Navigation World Models

National University of Singapore

Shanghai Jiao Tong University

Tsinghua University

Eastern Institute of Technology, Ningbo

Phigent

OmniNWM unifies comprehensive multi-modal state generation, precise action control, and occupancy-grounded rewards into a single framework for autonomous driving world models. It achieves state-of-the-art performance in generating high-fidelity RGB videos, panoramic depth maps, and 3D semantic occupancy, while demonstrating robust long-term generation and zero-shot generalization.

