cascadeflow

Software Development

New York, NY 2,581 followers

The runtime intelligence layer for AI agents. Govern cost, quality, latency, compliance, and more, from inside the agent loop.

About us

cascadeflow is the agent runtime intelligence layer. We sit inside AI agent execution and optimize every decision (model calls, tool calls, sub-agent handoffs) across cost, latency, quality, compliance, and energy. Not a router. Not a proxy. A governance system inside the loop.

How it works: cascadeflow starts with efficient models, validates quality during generation, and escalates only when needed. Four enforcement actions (allow, switch_model, deny_tool, stop) give you direct control over agent behavior at runtime. Sub-5ms overhead.

Results: 69% savings (MT-Bench), 93% (GSM8K), 52% (MMLU), 80% (TruthfulQA), while retaining 96% of GPT-4 quality.

Works with your stack: LangChain, LangGraph, Vercel AI SDK, Google ADK, OpenAI Agents SDK, CrewAI, n8n. Supports OpenAI, Anthropic, Groq, Ollama, vLLM, Together AI, HuggingFace, and 100+ more via LiteLLM.

3 lines of code: pip install cascadeflow / npm install @cascadeflow/core

Open source (MIT). 286 GitHub stars. 17K+ downloads. 94 forks.

cascadeflow Studio: closed beta for enterprise teams needing budget enforcement, compliance gates, KPI weights, and audit trails.

Built by Lemony Inc. (NYC / Switzerland). github.com/lemony-ai/cascadeflow | cascadeflow.ai
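
The cascade pattern described here (start with an efficient model, validate quality, escalate only when needed) can be sketched in plain Python. This is an illustrative sketch only, not the cascadeflow API: the model names, costs, and `quality_score` heuristic are all invented for the example.

```python
# Illustrative sketch of model cascading. NOT the cascadeflow API:
# model names, prices, and the quality heuristic are hypothetical.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Model:
    name: str
    cost_per_call: float            # hypothetical flat cost in USD
    generate: Callable[[str], str]

def quality_score(answer: str) -> float:
    """Toy quality check; a real system validates during generation."""
    return 0.9 if len(answer) > 20 else 0.4

def cascade(prompt: str, models: list[Model], threshold: float = 0.75):
    """Try models cheapest-first; escalate while quality is below threshold."""
    spent = 0.0
    for model in models:
        answer = model.generate(prompt)
        spent += model.cost_per_call
        if quality_score(answer) >= threshold:
            return answer, model.name, spent
    return answer, model.name, spent  # fall back to the last model's answer

cheap = Model("flash", 0.0001, lambda p: "short")  # fails the quality check
strong = Model("frontier", 0.01, lambda p: "a long, well-grounded answer")
answer, used, cost = cascade("Summarize our Q3 report", [cheap, strong])
print(used, round(cost, 4))  # frontier 0.0101
```

Most requests never reach the expensive model, which is where the headline savings come from: you only pay frontier-model prices for the calls that actually need them.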

Website
https://cascadeflow.ai
Industry
Software Development
Company size
2-10 employees
Headquarters
New York, NY
Type
Public Company
Founded
2023
Specialties
LLM cost optimization, inference optimization, model routing, model cascading, open source, MLOps, LLMOps, FinOps, AI infrastructure, developer tools, enterprise AI, AI FinOps, API cost management, AI cost reduction, model benchmarking, latency reduction, agent harness, and runtime intelligence

Updates

  • Your GDPR compliance policy lives in a system prompt. That does not hold in production. Prompts get overridden. They drift across versions. They leave no audit trail. When your agent calls a non-approved model on EU customer data, nothing fires. Runtime compliance enforcement is different. Set a compliance mode and only approved models run: enforced at the step level, inside the loop, before the call executes. GDPR restricts to a specific model allowlist. HIPAA narrows it further. PCI enforces a different set. Every decision is logged automatically: step number, action taken, model used, reason. No prompt content. No user data. An audit trail your legal team can actually read. Full compliance gating docs: GDPR, HIPAA, PCI modes, custom allowlists, and enforcement behavior: https://lnkd.in/eQNzmzEg
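
The step-level gating described in the post can be sketched as a pre-call allowlist check. This is a hypothetical illustration, not cascadeflow's actual configuration: the mode names, model identifiers, and log fields are invented, though the logged fields mirror the ones the post lists (step, action, model, reason).

```python
# Hypothetical sketch of step-level compliance gating; the allowlists and
# model names are invented for illustration, not cascadeflow's config.
ALLOWLISTS = {
    "gdpr":  {"model-eu-1", "model-eu-2", "model-eu-3"},
    "hipaa": {"model-eu-1"},  # HIPAA narrows the set further
}

audit_log = []

def gate_model_call(step: int, mode: str, requested_model: str) -> str:
    """Check the requested model against the mode's allowlist BEFORE the
    call executes. Log the decision (no prompt content, no user data) and
    return the model to actually use."""
    allowed = ALLOWLISTS[mode]
    if requested_model in allowed:
        action, model = "allow", requested_model
    else:
        action, model = "switch_model", sorted(allowed)[0]
    audit_log.append({"step": step, "action": action, "model": model,
                      "reason": f"{mode} allowlist"})
    return model

print(gate_model_call(1, "gdpr", "model-eu-2"))  # model-eu-2 (allowed)
print(gate_model_call(2, "gdpr", "model-us-9"))  # model-eu-1 (switched)
```

The key property is that the non-approved call never fires: the switch happens before execution, and the audit trail records only decision metadata.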

  • At every agent step, a runtime layer makes exactly one of four decisions: allow - all constraints pass, proceed normally. switch_model - compliance or KPI mismatch, route to a better model automatically. deny_tool - tool call cap reached, block this execution. stop - budget or energy exceeded, halt the agent entirely. Four verbs. One decision engine. Evaluated at every step across six dimensions simultaneously. This is the grammar of agent governance. Not a dashboard reading, not a post-run report: a live decision made inside the loop before the call executes. The difference between watching agents and controlling them. Full documentation on all four harness actions, decision priority, and how hard vs soft controls work: https://lnkd.in/eWxEFvxe
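
The four-verb grammar can be sketched as a single decision function evaluated before each call. The priority order and state fields below are illustrative assumptions, not the documented cascadeflow semantics (the linked docs cover the real decision priority).

```python
# Sketch of a four-verb runtime decision made before each call executes.
# Priority order and state fields are illustrative assumptions.
from enum import Enum

class Action(Enum):
    STOP = "stop"                  # budget/energy exceeded: halt the agent
    DENY_TOOL = "deny_tool"        # tool call cap reached: block execution
    SWITCH_MODEL = "switch_model"  # compliance/KPI mismatch: reroute
    ALLOW = "allow"                # all constraints pass

def decide(state: dict) -> Action:
    """Hard stops first, then tool caps, then model switches, else allow."""
    if state["spent"] >= state["budget"]:
        return Action.STOP
    if state["is_tool_call"] and state["tool_calls"] >= state["tool_cap"]:
        return Action.DENY_TOOL
    if state["model"] not in state["approved_models"]:
        return Action.SWITCH_MODEL
    return Action.ALLOW

state = {"spent": 0.04, "budget": 0.05, "is_tool_call": False,
         "tool_calls": 3, "tool_cap": 5, "model": "cheap-model",
         "approved_models": {"cheap-model", "strong-model"}}
print(decide(state).value)  # allow
state["spent"] = 0.05
print(decide(state).value)  # stop
```

One function, one return value per step: that is what makes the four verbs a grammar rather than a set of disconnected alerts.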

  • Observability ≠ governance. Observability records what happened. A dashboard shows you a spike in model costs at 2am. A log tells you which agent ran over budget. A metric confirms your p99 latency crossed the threshold. None of that changes what happens next. Governance acts on the problem in real time. It switches the model before the budget is exceeded. It blocks the tool call before the cap is reached. It stops the agent before it violates a compliance constraint. The difference is not a feature. It is a category. Most AI infrastructure today stops at observability. Production agents in regulated environments need governance. The two are not the same thing. The full architecture of agent runtime governance and why dashboards are not enough: https://lnkd.in/e57i_u5r
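
The observability-vs-governance distinction reduces to where the check runs relative to the call. A toy sketch, with made-up costs and thresholds, showing the same budget handled both ways:

```python
# Toy contrast: observability records a budget breach after it happens;
# governance blocks the call before it executes. All numbers are invented.
def run_with_observability(calls, budget):
    spent, alerts = 0.0, []
    for cost in calls:
        spent += cost                          # the call already executed
        if spent > budget:
            alerts.append(f"over budget at ${spent:.2f}")  # after the fact
    return round(spent, 2), alerts

def run_with_governance(calls, budget):
    spent = 0.0
    for cost in calls:
        if spent + cost > budget:              # checked BEFORE the call
            break                              # the breach never happens
        spent += cost
    return round(spent, 2)

calls = [0.02, 0.02, 0.02, 0.02]
print(run_with_observability(calls, budget=0.05))  # overruns, then alerts
print(run_with_governance(calls, budget=0.05))     # 0.04: breaching call never ran
```

Same inputs, same budget: one system ends over budget holding alerts, the other ends under budget because the overrunning call was never made.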

  • Every proxy hop costs you 40 to 60ms. Single call: invisible. 10 step agent loop: 400 to 600ms of avoidable overhead. 25 step deep loop: up to 1.5 seconds added per run. In-process governance: less than 10ms for ten steps. Less than 25ms for twenty-five. This is not a minor optimization. For real-time UX with a 100ms budget, a single proxy hop consumes up to 60% of your entire latency allowance before your agent has done anything. Optimization should not come with a latency penalty. In-process architecture eliminates the penalty entirely. Full latency comparison table with proxy vs in-process overhead across agent loop depths: https://lnkd.in/e57i_u5r Github: https://lnkd.in/e_ZVKbyE
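
The latency arithmetic in the post is straightforward to reproduce; the 40-60ms per-hop range and the in-process figures are the post's own numbers:

```python
# Reproducing the proxy-tax arithmetic from the post above.
# The 40-60ms per-hop range is the post's own figure.
def proxy_overhead_ms(steps: int, per_hop_ms: int) -> int:
    """Proxy overhead compounds linearly with agent loop depth."""
    return steps * per_hop_ms

for steps in (1, 10, 25):
    low, high = proxy_overhead_ms(steps, 40), proxy_overhead_ms(steps, 60)
    print(f"{steps:>2} steps: proxy adds {low}-{high}ms")
# vs in-process per the post: <10ms at 10 steps, <25ms at 25 steps

# Share of a 100ms real-time UX budget consumed by one 60ms proxy hop:
print(f"{60 / 100:.0%} of the latency budget")  # 60%
```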

  • There are two places to govern an AI agent. At the HTTP boundary - where a proxy sees a request and a response. Or inside the agent loop - where the system sees step count, budget state, tool history, and compliance context simultaneously. Proxies optimize what they can see. Which is not much. A proxy at the boundary adds 40 to 60ms of latency per model call. It has no view of cumulative budget, tool history, or compliance state. It cannot act on what it cannot see. In-process governance runs inside the agent loop with less than 1ms overhead. It knows the full decision context at every step. Different architecture. Different capability. Different outcomes. The full case for inside-the-loop agent intelligence - with benchmarks and architecture comparison: https://lnkd.in/e57i_u5r

  • Every agent loop runs the same three events: A model call. A tool call. A sub-agent handoff. Repeat until the task is done, or until something goes wrong. The model call generates tokens and costs money. The tool call executes a function and adds latency. The sub-agent handoff spawns a new process and splits the budget. Most AI infrastructure tools sit at the HTTP boundary and watch request/response pairs. They never see what's happening inside. Governance needs to live where failure actually happens: inside the loop, at every step. That's what agent runtime intelligence is for. Check out how agent loops actually work, with a full breakdown of what happens at each step: https://lnkd.in/eKidHumr
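
The three-event loop described above can be sketched with an in-loop check at every step. The event names and the step-budget check are illustrative, not the cascadeflow implementation; the point is structural: the check sees loop state that an HTTP-boundary proxy never does.

```python
# Sketch of the three-event agent loop with a per-step governance check.
# Event names and the step-budget check are illustrative assumptions.
def agent_loop(plan, step_budget):
    """plan: sequence of "model_call" | "tool_call" | "handoff" events.
    A check inside the loop knows the cumulative step count; a proxy at
    the HTTP boundary sees only isolated request/response pairs."""
    executed = []
    for step, event in enumerate(plan, start=1):
        if step > step_budget:     # in-loop state: cumulative step count
            break                  # stop before the next event executes
        executed.append((step, event))
    return executed

plan = ["model_call", "tool_call", "model_call", "handoff", "model_call"]
print(agent_loop(plan, step_budget=3))
# [(1, 'model_call'), (2, 'tool_call'), (3, 'model_call')]
```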

  • $15M to Portkey. $20M to Helicone. Congrats to both teams on the funding news! Governance and observability for AI agents is clearly a space people are paying attention to, and we are excited about it. But: observability sits outside the loop and reports. Runtime intelligence sits inside the loop and acts. cascadeflow enforces policy checks, model selection, and decision tracing at the moment a query runs, not in a post-hoc dashboard. For teams running agents in production where auditability is not optional, that difference matters.

  • One line from Gartner's top data and analytics predictions for 2026 is worth reading twice: "By 2030, 50% of AI agent deployment failures will be due to insufficient AI governance platform runtime enforcement for capabilities and multisystem interoperability." Not strategy failures. Not data quality failures. Not talent gaps. Runtime enforcement failures. Gartner is naming something precise: agents that take actions, call tools, and make decisions across systems, without a governance layer controlling what they're allowed to do, at the point of execution, in real time. This is the exact problem cascadeflow was built to solve. Not as a proxy that adds compounding latency to every agent step, but as an in-process runtime intelligence layer: sitting between the model and the application, enforcing policy, validating quality, and making every agentic decision business-aware at under 5ms overhead. The same report flags a $58 billion productivity tool shakeup as agentic AI rewrites how enterprise work gets done. The organizations that govern that transition well will pull ahead. Those that don't will be part of the 50%. Runtime enforcement isn't a feature. It's the foundation. Link to the Gartner press release in the comments.

  • We LOVE to hear from cascadeflow users. Nari writes about her $2/day startup: "Lewis is an autonomous AI agent running on a MacBook Air in my apartment in Zurich. He has a brain (PostgreSQL with vector search), a fleet of six specialized subagents, and one job: take a business idea from “what if” to “go or kill” with minimal hand-holding from me." And her testimonial about cascadeflow: "Cascadeflow: Spending Smart. Here is something lots of us struggle with when starting out with AI agents: every single request gets routed through GPT-4 or, worse, Claude Opus 4.6, and whoopsy - why is the API bill suddenly CHF 200 per day?! Lewis uses CascadeFlow, an open-source model cascading library. It sits between the agent and the models as a local proxy. The logic is simple: start with the cheapest model (Gemini Flash at $0.15 per million tokens). If the quality score drops below 0.75, escalate to GPT-4o ($2.50 per million tokens). Budget cap: $0.05 per request. In practice, 80% of Lewis's work runs on Flash. Research summaries, formatting, simple tool calls. The expensive model only fires for complex reasoning, multi-step planning, or when Flash starts hallucinating. My daily API cost is under $2. The config is a YAML file. Two models, a quality threshold, a budget cap. That is it." Thank you! https://lnkd.in/ecWZpXNj

  • External proxies add 40 to 60ms per model call. In an agent workflow with 10+ steps, that is half a second of overhead doing nothing useful. cascadeflow runs in-process. Under 5ms total. Same optimization, zero proxy tax. That difference compounds fast when your agents run at scale.

