Ai2

Ai2 · 2026-01-28T18:43:49.999Z

🧪 Introducing Theorizer: Generating scientific theories from thousands of papers Most automated discovery systems focus on experimentation. Theorizer tackles the other half of science: theory building—compressing scattered findings into structured, testable claims. Experiments drive science forward, but progress compounds when findings coalesce into theories that explain and predict. Kepler's laws distilled centuries of observations into a few statements about planetary motion. We asked: can an AI build theories by reading the literature? Theorizer is a multi-LLM framework. Ask "make me theories about X" and it reads relevant papers and outputs candidate laws, looking for regularities across studies and writing them as ⟨LAW, SCOPE, EVIDENCE⟩ tuples. Theorizer gathers a focused corpus (up to ~100 papers), pulling full text when available and expanding via citations when needed. It then builds a query-specific schema and extracts structured records from each paper. Finally, Theorizer aggregates evidence into candidate laws, refining for clarity and attribution. Benchmarking theory generation is hard, so we evaluate on 5 desiderata: specificity, empirical support, predictive accuracy, novelty, and plausibility. We find that grounding in papers boosts specificity, empirical support, and plausibility—especially when pushing for novelty. In backtesting, literature-supported generation is ~7× pricier but more predictive (precision ~0.88–0.90; novelty-focused precision jumps from 0.34 to 0.61). We’re releasing the Theorizer code and framework plus a dataset of ~3,000 theories generated by Theorizer across the field of AI/NLP, built from 13,744 source papers. ✍️ Learn more in our blog: https://lnkd.in/eKGiFGb4 💻 Code: https://lnkd.in/eH89QyGh 📝 Technical report: https://lnkd.in/ef2RSsvC
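The ⟨LAW, SCOPE, EVIDENCE⟩ tuples mentioned in the post could be represented roughly as below. This is a minimal sketch, not Theorizer's actual schema: the class names `EvidenceItem` and `CandidateLaw`, their fields, and the `aggregate` helper are all hypothetical, and grouping by exact string match is a naive stand-in for the LLM-based aggregation the post describes.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EvidenceItem:
    """One supporting finding attributed to a source paper (hypothetical shape)."""
    paper_id: str  # placeholder identifier, not a real paper
    finding: str


@dataclass
class CandidateLaw:
    """A <LAW, SCOPE, EVIDENCE> tuple; field names are assumptions."""
    law: str
    scope: str
    evidence: list[EvidenceItem] = field(default_factory=list)

    def supporting_papers(self) -> set[str]:
        """Distinct papers that contribute evidence for this law."""
        return {item.paper_id for item in self.evidence}


def aggregate(records: list[tuple[str, str, EvidenceItem]]) -> list[CandidateLaw]:
    """Group per-paper records that state the same law under the same scope.

    Exact-match grouping is a simplification made for this sketch.
    """
    grouped: dict[tuple[str, str], CandidateLaw] = {}
    for law, scope, item in records:
        grouped.setdefault((law, scope), CandidateLaw(law, scope)).evidence.append(item)
    return list(grouped.values())
```

Keeping the evidence attached to each law is what makes attribution checkable: every candidate law can be traced back to the papers that support it.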

Non-profit Organizations

Seattle, WA 60,771 followers

Breakthrough AI to solve the world's biggest problems.

Discover all 325 employees

About us

We are a Seattle-based non-profit AI research institute founded in 2014 by the late Paul Allen. We develop foundational AI research and innovation to deliver real-world impact through large-scale open models, data, robotics, conservation, and beyond.

Website: http://allenai.org
Industry: Non-profit Organizations
Company size: 201-500 employees
Headquarters: Seattle, WA
Type: Nonprofit
Founded: 2014
Specialties: Artificial Intelligence, Deep Learning, Natural Language Processing, Computer Vision, Machine Reading, Machine Learning, Knowledge Extraction, Common Sense AI, Machine Reasoning, Information Extraction, and Language Modeling

Locations

Primary

Seattle

Seattle, WA 98013, US

Updates

Ai2

20h
We’ve released a Chrome extension for Asta—a faster way to go from finding a paper to asking questions about it while you read. 🔍 When you land on a supported research page or PDF, the extension adds an "Ask AI about this paper" button. One click opens Asta with the paper pre-loaded, drawing on a corpus of 108M+ abstracts and 12M+ full-text studies. No copying links or switching tools mid-read required. It works across popular research sites and PDF pages—if Asta can fully parse the paper, the button appears automatically. Try it out → https://lnkd.in/eE6FjAmX
1 Comment

Ai2

5d Edited
Introducing Olmix: a framework for data mixing throughout language model development. 🔀 Modern LMs train on many data types—web text, code, PDFs, math, and more. Data mix is a first-order lever on model quality, but the "best" recipe depends on many choices that practitioners often have to guess at. And training data evolves constantly throughout model development as datasets are revised and expanded, so recomputing mixes from scratch becomes a tax on every change. Olmix makes mixing a repeatable workflow with two components: ✅ OlmixBase is a mixing method that provides answers to key configuration questions based on a comprehensive empirical study: what model size do you need? How many experiments do you need to run? How do you handle limited data? ✅ Mixture reuse lets you efficiently update your mix as data domains change without starting over. When the domain set changes, you can bundle unchanged domains into a single "virtual domain" and optimize only what's changed—translating to fewer proxy model runs. In experiments across 5 domain updates and 64 domains, Olmix produces a mix that's 12% better on our downstream evaluation task suite and 3× more data-efficient than no mixing at all, while being 74% cheaper than recomputing from scratch after each update. If you're training LMs on diverse data and tired of guessing at configuration choices or re-running expensive experiments every time your data pipeline updates, Olmix is for you. ✍️ Learn more in our blog: https://lnkd.in/dicG2Hp2 💻 Code: https://lnkd.in/d42xXfye 📚 Technical report: https://lnkd.in/dFEFUz8a
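The "virtual domain" idea in mixture reuse can be illustrated with a little bookkeeping. This is a minimal sketch under stated assumptions, not Olmix's implementation: the function names, the `__virtual__` key, and all weights are hypothetical, and the step that actually re-optimizes the reduced mix is represented by a hand-written dictionary.

```python
VIRTUAL = "__virtual__"  # hypothetical key for the bundled unchanged domains


def bundle_unchanged(old_mix: dict[str, float], changed: set[str]) -> dict[str, float]:
    """Return each unchanged domain's share of the single virtual domain."""
    unchanged = {d: w for d, w in old_mix.items() if d not in changed}
    total = sum(unchanged.values())
    if total <= 0:
        raise ValueError("no unchanged weight to bundle")
    return {d: w / total for d, w in unchanged.items()}


def expand_mix(reduced_mix: dict[str, float], shares: dict[str, float]) -> dict[str, float]:
    """Spread the virtual domain's new weight back over its member domains."""
    full = {d: w for d, w in reduced_mix.items() if d != VIRTUAL}
    for domain, share in shares.items():
        full[domain] = reduced_mix[VIRTUAL] * share
    return full


# Illustrative numbers only; not taken from the Olmix experiments.
old_mix = {"web": 0.5, "code": 0.3, "math": 0.2}
shares = bundle_unchanged(old_mix, changed={"math"})

# Stand-in for the optimizer's result over the reduced domain set.
reduced_mix = {VIRTUAL: 0.7, "math": 0.3}
new_mix = expand_mix(reduced_mix, shares)
```

Because the unchanged domains keep their relative proportions, the search only has to cover the virtual domain plus the domains that changed, which is where the saving in proxy model runs would come from.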
1 Comment

Ai2

6d Edited
Today we're releasing AutoDiscovery in AstaLabs, an AI system that explores your data and generates its own hypotheses. 🔬 Most AI tools for science wait for a research question, then help answer it. AutoDiscovery works differently. Give it a dataset and it proposes novel hypotheses, runs experiments, interprets results, and uses what it learns to keep exploring. How does it decide what to pursue? Bayesian surprise: a measure of how much the system's beliefs change after seeing evidence. By chasing surprise, AutoDiscovery gravitates toward the unexpected, prioritizing findings most likely to represent genuine discoveries rather than obvious patterns. Researchers across disciplines are already using it: 👩‍🔬 Oncology: Dr. Kelly Paulson at the Swedish Cancer Institute used AutoDiscovery to explore breast cancer and melanoma data, surfacing hypotheses about immune responses and lymph node spread her team hadn't thought to look for. → Read the report: https://lnkd.in/gnxpVU9t 🌊 Marine ecology: Fabio Favoretto at the Scripps Institution of Oceanography used AutoDiscovery to explore 20+ years of rocky reef monitoring data from the Gulf of California. His team already knew marine heatwaves were hurting fish populations, but AutoDiscovery helped them move beyond broad patterns, surfacing relationships between productivity across trophic levels that would have required extensive manual iterations to discover. → Read the report: https://lnkd.in/g3BqxdAc 📄 Social science: AutoDiscovery helped economist Sanchaita Hazra surface an unexpected finding, which she later confirmed independently: doctoral-level authors made significantly more edits to AI-generated abstracts than less experienced researchers, suggesting expertise shapes how critically people engage with AI output. → Her work was published in a peer-reviewed paper: https://lnkd.in/gn3NdPA3 Everything AutoDiscovery generates is transparent and reproducible: every hypothesis, statistical analysis, and line of Python code is there for you to inspect. Science has always had more data than time to explore it. AutoDiscovery helps surface the questions hiding in yours. We're giving early users 1,000 free Hypothesis Credits to get started. 🧪 Try it today → https://lnkd.in/gsPvS5AP 📚 Learn more in our blog → https://lnkd.in/gtUQevy8 📖 Read researcher case studies → https://lnkd.in/gzkXF-Kd
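The post describes Bayesian surprise as how much beliefs change after seeing evidence. One common way to quantify that is the KL divergence from the prior to the posterior. The sketch below assumes that formalization and a small discrete set of hypotheses. The post does not say how AutoDiscovery computes surprise, so the function names and the numbers here are illustrative assumptions only.

```python
import math


def posterior(prior: list[float], likelihoods: list[float]) -> list[float]:
    """Bayes' rule over a discrete set of hypotheses."""
    unnormalized = [p * l for p, l in zip(prior, likelihoods)]
    evidence = sum(unnormalized)
    if evidence == 0:
        raise ValueError("evidence has zero probability under every hypothesis")
    return [u / evidence for u in unnormalized]


def bayesian_surprise(prior: list[float], post: list[float]) -> float:
    """KL(posterior || prior) in nats; zero when beliefs do not move.

    Terms with zero posterior mass contribute nothing. A posterior built by
    `posterior` is never positive where the prior is zero.
    """
    return sum(q * math.log(q / p) for q, p in zip(post, prior) if q > 0)


# Two hypotheses, initially equally likely (made-up numbers).
prior = [0.5, 0.5]
uninformative = posterior(prior, [0.5, 0.5])  # evidence fits both equally
informative = posterior(prior, [0.9, 0.1])    # evidence favors the first

low = bayesian_surprise(prior, uninformative)
high = bayesian_surprise(prior, informative)
```

Under this measure, a result that leaves beliefs unchanged scores zero, while one that shifts them sharply scores high, so ranking candidate experiments by surprise steers exploration toward the unexpected.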
1 Comment

Ai2

1w Edited
🤖 Introducing MolmoSpaces: a large-scale, fully open platform + benchmark for embodied AI research. The next wave of AI will act in the physical world, but building robots that generalize across new environments rather than simply replaying learned behaviors requires far more diverse training data than exists today. That's where MolmoSpaces comes in. MolmoSpaces brings together 230k+ indoor scenes, 130k+ object models, and 42M annotated robotic grasps into a single open ecosystem built on two foundations: ◘ Objaverse, one of the largest open collections of 3D objects ◘ Our THOR family of interactive simulation environments MolmoSpaces is grounded in physics simulation with validated physical parameters tuned for realistic robotics manipulation, and includes a trajectory-generation pipeline for reproducible embodied AI demonstrations and imitation learning at scale. All assets, scenes, and tools are open and modular – provided in MJCF with USD conversion for cross-simulator portability – so you can plug in new embodiments, regenerate grasps, and run experiments across MuJoCo, ManiSkill, and NVIDIA Isaac Lab/Sim. MolmoSpaces supports teleoperation via mobile platforms like Teledex, so you can collect demonstrations right from your phone, compatible with embodiment setups including DROID and CAP with no extra configuration needed. We're also releasing MolmoSpaces-Bench, a new benchmark for evaluating generalist policies under systematic, controlled variation. Researchers can isolate individual factors – object properties, layouts, task complexity, lighting, dynamics, instruction phrasing, and more – across thousands of realistic scenes. Explore MolmoSpaces today and start building—we can't wait to see what the community does with it: 📝 Blog: https://lnkd.in/ewHM5RNC 💻 Demo: https://lnkd.in/e2i8nSrT ⬇️ Code: https://lnkd.in/ehyzZYbq 📊 Data: https://lnkd.in/ebBtM-UZ ✍️ Paper: https://lnkd.in/edgb_a9p

Ai2

1w
Introducing How2Everything—an open framework for benchmarking & improving how LLMs generate step-by-step procedures. LLMs constantly produce instructions for everything from filing taxes to plans for AI agents, but improving this capability is challenging. Outputs can sound fluent while describing steps that don't actually work, surface-level metrics miss critical mistakes like omitted prerequisites or contradictory instructions, and manual verification doesn't scale. How2Everything closes this gap with a practical loop: mine real procedures from the web → benchmark LLM outputs → detect critical failures (missing steps, wrong order, omissions) → use that signal to train better models. It has three main components: ✅ How2Mine—a pipeline that extracts & standardizes procedures from web pages covering 14 topics ✅ How2Bench—a 7,000-procedure benchmark built from How2Mine ✅ How2Score—an evaluation protocol powered by How2Judge, an open 8B judge model trained to flag critical failures How2Judge agrees with human judgments ~80% of the time and is cheap enough for large-scale eval, making it practical as both a benchmark scorer and an RL reward signal. RL training with How2Score yields >10-point gains on Qwen3 4B, Qwen3 8B, and Olmo 3 7B Think, with no regressions across 12 standard benchmarks covering knowledge, reasoning, chat, math, and code. How2Bench also scales cleanly, remaining informative from early 1B pretraining checkpoints through frontier LLMs. And we stress-tested two shortcut explanations (format compliance and memorization); neither accounts for the improvements, pointing to real gains in procedure generation. The full How2Everything framework, including How2Judge, is available now. 👇 📝 Blog: https://lnkd.in/eXrW4g9p 📄 Paper: https://lnkd.in/e3aq4eV9 💻 Code: https://lnkd.in/eeYfeM3E 🤗 HF: https://lnkd.in/eKeCe9Ta
3 Comments

Ai2

1w Edited
New: A web demo to make using DR Tulu even simpler, built by our collaborators at the Massachusetts Institute of Technology and the University of Washington. 🔎 DR Tulu is our open, end-to-end recipe for long-form deep research–and the first deep research agent trained directly for long-form responses. Ask a question and watch it plan, search, and synthesize a citation-grounded report you can share. The browser UI lets you pick a checkpoint, choose between Brief Answer or Detailed Report, and set tool use intensity from Quick to Extensive. Every run shows DR Tulu's research steps as they happen—analysis, searches issued, and a running tally of tool calls and documents found. 📝 A dedicated sources view lists retrieved files with snippets, and all reports are citation-backed. The experimental personalized mode lets you bookmark sources from any session to guide future queries—DR Tulu will prioritize your curated references for more targeted results. This demo is designed to make it easier to explore DR Tulu without extensive configuration, & to show how deep research – training, evaluating long-form outputs, & personalization – remains an open academic question. Try the demo and learn more: 🔗 Demo: https://dr-tulu.org 📝 Blog: https://lnkd.in/eJtgyChR
Ai2

2w Edited
Our paper is now published in Nature 🎉 Scientists can't keep up with millions of new papers every year. AI should help—but general-purpose models still hallucinate, especially citations. In our study, GPT-4o fabricated 78–90% of its research citations. So with the University of Washington, we built an open-source model for scientific synthesis with citation-grounded answers, trained on 45M papers. Because web search alone can be noisy, it uses retrieval-augmented generation (RAG) to search for, incorporate, and cite new sources—even papers published after training. 🔎 We also created ScholarQABench, the first large, multi-domain benchmark for scientific search and synthesis: 3,000 queries and 250 long-form expert answers across CS, physics, biomedicine, and neuroscience. What started as research into literature-grounded AI now powers real tools. The core concepts developed from this research now live on in Asta and DR Tulu. Our goal: systems scientists can trust and build on. 📄 Nature: https://lnkd.in/eJY6dcWB 📝 Blog: https://lnkd.in/exuNaBdP
6 Comments

Ai2

2w
Since launching Open Coding Agents, it's been exciting to see how quickly the community has adopted them. Today we're releasing SERA-14B – a new 14B-parameter coding model – plus a major refresh of our open training datasets. 🚀 SERA-14B is built for more setups and easier deployment: a smaller, more accessible option that still keeps SERA's cheap, customizable approach. ⚠️ Revamped training data: We've converted our open SERA training data into a general, model-agnostic format that's easier to reuse across different workflows. It now includes verification thresholds per sample and more metadata for filtering and analysis. You can drop in the new SERA-14B weights or retrain with our refreshed data. We look forward to seeing what you build! 💻 Model & data: https://lnkd.in/e_wqZ74C 📝 Learn more: https://lnkd.in/esnJp8r2
1 Comment

Ai2 reposted this
NVIDIA AI

1,630,184 followers
3w Edited
🎉 Congrats to Ai2 on your just-released Open Coding Agents. 🏎️ Turbocharged on NVIDIA Blackwell GPUs, their fully open models are achieving state-of-the-art SWE-Bench Verified performance for open systems. What sets this model apart is that it is open source and easily customizable: you can train it on your proprietary code base. Specialization is simple with an SFT pipeline. Learn how to level up your code base. 👇
Ai2

3w Edited

🚀 Introducing Ai2 Open Coding Agents, starting with SERA, our first-ever coding models. We're entering the coding agents space. SERA models are fast, accessible open coding agents (8B–32B) that adapt to any repo, including private & internal codebases. Train a specialized coding agent with a recipe that can cost as little as ~$400 to reproduce prior best open-source performance, and it works with Claude Code out of the box. Coding agents are changing how software gets built, but most remain closed, expensive, and difficult to customize. Adapting to private codebases has been hard because you need agent-ready synthetic training data, ideally without building complex RL infrastructure for every repo. SERA (Soft-verified Efficient Repository Agents) is our method for training repo-specialized agents quickly and affordably. It generates diverse, realistic training data from any codebase, teaching agents how developers actually work. SERA is high-performing: our SERA-32B model solves roughly 49.5%/54.2% (32K/64K context) of SWE-Bench Verified while training on a single small cluster of GPUs (H100s or RTX 6000 equivalents). We’re sharing: ✅ A family of strong open coding models ✅ SERA, our training method for building your own agents ✅ Code, recipes, data, and Claude Code integration We built this so anyone can train a coding agent that knows their codebase inside and out, without needing a massive compute budget. We're eager to see what you create. 👨🏻‍💻 ⬇️ Models: https://lnkd.in/e_wqZ74C 💻 SERA CLI: https://lnkd.in/ej_4K-ya | PyPI: https://lnkd.in/e3sUZu72 📝 Tech report: https://lnkd.in/ezPvSzvR ✏️ Ai2 news blog: https://lnkd.in/esnJp8r2
9 Comments


Employees at Ai2

Eran Megiddo

Chris Doehring

Ryan Kiskis

Eric Watson
