What is the real difference between NVIDIA’s Rubin GPU and Vera CPU, and why does it matter for the future of agentic AI? In this new blog, Jubril Oytunji breaks down NVIDIA’s Vera Rubin platform and explains how its “extreme co-design” approach is reshaping data center architecture. He explores how the Rubin GPU is built for next-generation AI inference and training, while the Vera CPU is purpose-built to orchestrate agentic workloads at scale. Read the full piece here: https://civo.io/48lS84Q
Civo’s Post
5 MILLION tokens in a single context window. Together AI showed how context parallelism distributes attention computation across GPUs to break through the memory wall. The "context window is too small" excuse is officially dead. Start designing systems that assume near-infinite context. This changes the architecture of every RAG pipeline, every agent memory system, every document-processing workflow. When context is effectively unlimited, half the infrastructure we are building today becomes unnecessary. #TogetherAI #ContextWindow #Infrastructure
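The mechanics behind context parallelism can be illustrated with the online-softmax trick: attention over a long sequence can be computed block by block, with each block of keys and values living on a different GPU, while only a running max, a running normalizer, and a partial output cross block boundaries. The NumPy sketch below is a minimal single-machine illustration of that general technique, not Together AI's implementation.

```python
import numpy as np

def full_attention(q, K, V):
    """Reference: softmax(q.K^T / sqrt(d)) @ V over the whole sequence."""
    s = (K @ q) / np.sqrt(q.shape[0])
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ V

def blockwise_attention(q, K_blocks, V_blocks):
    """Online-softmax attention over K/V blocks, as if each block lived on a
    different GPU. Only three small quantities cross block boundaries."""
    d = q.shape[0]
    m = -np.inf                          # running max of scores (stability)
    denom = 0.0                          # running softmax normalizer
    out = np.zeros(V_blocks[0].shape[1]) # running partial output
    for K, V in zip(K_blocks, V_blocks):
        s = (K @ q) / np.sqrt(d)
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)        # rescale earlier partial results
        w = np.exp(s - m_new)
        denom = denom * scale + w.sum()
        out = out * scale + w @ V
        m = m_new
    return out / denom

rng = np.random.default_rng(0)
seq, d = 64, 16
q = rng.standard_normal(d)
K = rng.standard_normal((seq, d))
V = rng.standard_normal((seq, d))
# Partition the sequence across 4 hypothetical "devices"
K_blocks = np.split(K, 4)
V_blocks = np.split(V, 4)
assert np.allclose(full_attention(q, K, V),
                   blockwise_attention(q, K_blocks, V_blocks))
```

Because the merge step is exact (not an approximation), the blockwise result matches full attention to floating-point precision, which is what makes sharding the sequence dimension viable.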
Meet the NVIDIA Vera Rubin POD — a next-generation AI supercomputer built for the era of agentic applications.
⚙️ Powered by seven co-designed chips across compute, networking, and storage
🧠 Built on 3rd-gen NVIDIA MGX rack-scale architecture
🧩 Five specialized rack-scale systems, one coherent AI platform
Purpose-built for agentic AI workloads, from massive context memory to ultra-low-latency inference. Learn how Vera Rubin is redefining POD-scale intelligence: https://nvda.ws/3NIaCFG
The shift toward agentic AI is driving a new class of infrastructure—impressive to see how NVIDIA is redefining data center architecture with the Vera Rubin POD.
Happy to geek out at #KubeCon. Just heard a speaker describe tokens as the new unit of work. Let's be clear: tokenization is a hardware constraint, not an architectural ideal. It's just the current way we serialize data for matrix math on GPUs. As AI architectures mature, the token will recede; we will move toward models that interact with natural data-object boundaries, not linear streams of integers. We are currently building for the hardware we have, not the data we actually want to process. #KubeCon #GenerativeAI #Infrastructure
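The "linear stream of integers" point is easy to see concretely. The simplest tokenizer is byte-level: one token per UTF-8 byte. Subword schemes like BPE merge frequent byte sequences into larger units, but the model still consumes a flat integer sequence either way; the sketch below just shows the serialization step.

```python
# A model never sees "hello, world" - it sees a flat stream of integers.
# Byte-level encoding is the simplest tokenizer: one token per UTF-8 byte.
text = "hello, world"
tokens = list(text.encode("utf-8"))
print(tokens)  # [104, 101, 108, 108, 111, 44, 32, 119, 111, 114, 108, 100]

# The mapping is lossless: decoding recovers the original object.
decoded = bytes(tokens).decode("utf-8")
assert decoded == text
```

All structure in the original data (words, sentences, document boundaries) has to be re-learned by the model from this flat sequence, which is the constraint the post is pointing at.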
The open-standard chip architecture thesis is reaching institutional validation simultaneously with agentic AI deployment. SiFive's $400 million round — with Nvidia as a participant — marks the moment when the open-standard CPU architecture category crossed from interesting-niche to institutional-conviction. The timing is not coincidental: agentic AI workloads create a specific CPU compute demand that neither GPU-centric nor Arm-based architectures address optimally, and the RISC-V ecosystem has matured to the point where hyperscalers are prepared to commit to it at production scale. #Funding #SiFive #CredX
Scaling from one colo to multiple edge locations sounds simple until you hit the real question: how does a request reach the right GPU, and how do models get onto the right nodes, without building a centralized service that everything depends on? Our latest technical blog breaks down how PolarGrid approaches that problem. Each edge node is built to operate independently, the router acts as a decision service instead of a proxy, and the client makes the final latency-based routing choice from its own network position. The result is a multi-colo inference architecture with no central brain in the hot path, fewer bottlenecks, and a design that holds up as the fleet grows. Read: Running Inference Across Multiple Colos Without a Central Brain written by sev geraskin 👇 https://lnkd.in/e8Crn-f6
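The control flow described above (router as decision service, client makes the final latency-based choice) can be sketched in a few lines. Everything here is hypothetical, including the node names, port, and function signatures; it is not PolarGrid's actual API, just the shape of the protocol: the router answers "which nodes hold this model?" and then steps out of the hot path while the client probes and picks.

```python
import random

def router_candidates(model: str) -> list[str]:
    """Hypothetical decision service: returns nodes that already hold the
    model's weights. It never proxies traffic - it only answers questions."""
    placement = {"llama-7b": ["edge-nyc:8443", "edge-ord:8443", "edge-sfo:8443"]}
    return placement.get(model, [])

def probe_latency(node: str) -> float:
    """Stand-in for a real probe (e.g. a TCP handshake or HEAD request).
    Simulated here with random delays in seconds."""
    return random.uniform(0.005, 0.120)

def pick_node(model: str) -> str:
    """The client chooses from its own network position, so no central
    service sits between it and the GPU for the actual request."""
    candidates = router_candidates(model)
    return min(candidates, key=probe_latency)

node = pick_node("llama-7b")
print(node)  # one of the three candidates, whichever probed fastest
```

The design choice worth noting: because the router only returns metadata, it can be cached, replicated, or briefly unavailable without taking inference traffic down with it.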
From Sandbox to Scale: How Vultr Is Surfacing the Entire Vera Rubin Stack
Your GPUs Are Running. Your AI Isn't in Production. Here's Why.
A new HyperFRAME Research white paper, produced in collaboration with Vultr, examines why enterprise AI stalls after the infrastructure decision, and what the NVIDIA Vera Rubin architecture changes about that calculus. Download the paper to understand:
• Why nearly half of enterprises lack the integrated stack required for sustained AI execution, despite having secured GPU access
• What the Vera Rubin architectural co-design changes about storage, networking, and orchestration requirements
• How NVIDIA's open-source software strategy (Dynamo, Nemotron, NeMo) reframes the lock-in conversation
• Why hyperscaler incentive structures create friction for enterprises wanting NVIDIA-native inference performance
• What a pre-Vera Rubin evaluation checklist should cover, and why current Blackwell deployments provide a meaningful proxy
https://buff.ly/DGx6ujI
During our daily work optimizing ML infrastructure at Arithmo.io, Yonatan Leibovitch and I wanted to run a 7B-parameter LLM on an L40S GPU with 48GB of VRAM, more memory than most small and mid-range GPUs offer. On paper, that looked like more than enough, and we expected fast responses. Instead, we were surprised to find the inference speed wasn't nearly as fast as we anticipated. That gap between expectation and reality pushed us to dig deeper. We started investigating how GPU memory - not just its size, but its type - affects inference speed and scale. The goal was simple: understand what's actually going on under the hood, so we could choose the right GPU for our workload. What we found was eye-opening. The real bottleneck behind LLM inference performance isn't VRAM size or FLOPS - it's memory bandwidth, and how the available memory is managed at runtime. We put everything we learned into an article covering:
• Why the decode phase is memory-bound - and why that changes everything
• How memory bandwidth directly determines token generation speed
• What the KV cache is and why it controls how many users you can serve in parallel
• Why a newer, more powerful GPU can actually be slower for inference
• HBM vs GDDR
• The real tradeoff between concurrency and throughput
If you're building or scaling LLM-powered products, this one is worth a read. https://lnkd.in/dXGd8jQY Amos Kopels Yonatan Leibovitch Arithmo.io
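The bandwidth-bound intuition fits a simple roofline estimate: during decode, every weight must be read from memory once per generated token, so peak single-sequence throughput is roughly bandwidth divided by model size. The sketch below works that out for a 7B model in FP16 on GDDR6 vs HBM3, using published peak-bandwidth figures; it deliberately ignores KV-cache reads, batching, and compute, so treat the numbers as ceilings, not predictions.

```python
# Back-of-the-envelope: decode reads every weight once per generated token,
# so peak tokens/sec per sequence ~= memory bandwidth / model size in bytes.
params = 7e9
bytes_per_param = 2                       # FP16/BF16
model_bytes = params * bytes_per_param    # ~14 GB of weights

l40s_bandwidth = 864e9     # L40S, GDDR6: ~864 GB/s peak
h100_bandwidth = 3.35e12   # H100 SXM, HBM3: ~3.35 TB/s peak

tok_s_l40s = l40s_bandwidth / model_bytes   # ceiling on tokens/sec
tok_s_h100 = h100_bandwidth / model_bytes
print(round(tok_s_l40s), round(tok_s_h100))  # 62 239
```

The ~4x gap comes entirely from memory technology, not FLOPS, which is why a GPU with plenty of VRAM and compute can still feel slow at decode, and why the HBM-vs-GDDR distinction in the article matters.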
If you’re building or scaling LLM-powered products, this is the kind of infrastructure knowledge that changes how you make decisions. Highly recommend.