
DEV Community

Katz Sakai

How a Rails and K8s Newcomer Cut GKE Costs by 60% by Looking Across the Stack

tl;dr: This is a journal of how an engineer with no prior Rails or Kubernetes experience cut Google Kubernetes Engine (GKE) costs by 60%. The following steps were taken to achieve this cost reduction:

  • Puma, which hosts the Rails app, was running with 1 worker and 33 threads. Because Ruby has the GVL, only one core was effectively being used. Changing to 4 workers and 8 threads let the Rails app take advantage of multi-core CPUs and process multiple requests more effectively per Pod.
  • Every API request was running bcrypt for token authentication. This was clearly too much overhead for API token auth, so replacing it with a lighter scheme reduced per-request CPU load.
  • The GKE Nodes running the Pods were on an older machine generation. Upgrading from n1 to n2d gave 56% better CPU performance, 23% more memory, and a 3% cost reduction, which improved Pod density per Node.
  • There was no autoscaling for either Pods or Nodes, so capacity was always provisioned for peak traffic. By introducing KEDA for Pod autoscaling and GKE Cluster Autoscaler for Node autoscaling, capacity now scales with actual traffic, so we only pay for what we use.

As you can see, this cost reduction came not only from changing Kubernetes infrastructure settings, but also from tuning the Rails application running on top of them. It required a comprehensive understanding of the entire stack, from application to infrastructure, and a thorough review of every layer. For example, simply enabling Kubernetes autoscaling would only have scaled inefficient Pods and Nodes up and down, yielding limited savings.
Furthermore, each individual step might seem like an insignificant improvement on its own. However, diligently accumulating these small improvements ultimately led to a 60% cost reduction.


I work on a B2B SaaS platform that runs on Google Kubernetes Engine. The API server is a Rails application, and it was the biggest cost driver on our GKE cluster.

When I joined this project, I had virtually no experience with Ruby on Rails or Kubernetes. However, this lack of knowledge turned out to be a hidden strength: it allowed me to look at every implementation and configuration with a fresh perspective and keep asking "why is it implemented/configured this way?" until I was satisfied.

The vicious cycle: why the cluster needed so many Pods

The high cost was not caused by a single problem. It was a cycle:

  1. Inefficiency: Running Puma as a single process meant CPUs were not being fully used. On top of that, running bcrypt on every API request added unnecessary CPU load. The Rails API was simply slow.
  2. Band-aid scaling: Instead of tuning Rails, the response to traffic was to add more GKE Nodes and Pods.
  3. Low cost-performance Nodes: GKE Nodes were running on older generation instances, so the cost per unit of CPU and memory was poor.
  4. Overprovisioning: Without autoscaling, the Pod and Node count was fixed to match peak traffic and stayed that way around the clock, which wasted a lot of resources.

Several issues were driving costs up, and we had to tackle them one by one.

How the cost actually came down

Here's what happened to costs. The chart below shows monthly GKE spending by SKU.

(Chart: monthly GKE spending by SKU)

This was not mainly due to lower traffic. API traffic stayed fairly stable during this period, and the savings came mainly from efficiency improvements.

The cost came down in two steps.

In the first step (early 2025 through June 2025), we simply reduced Pod and node counts that were clearly excessive. This brought costs down from the peak, but it was just trimming fat. The underlying inefficiencies remained.

The second step (from July 2025 onward) is where most of the real savings came from. It breaks down into two parts:

  • Part 1: Making each Pod and node more efficient
  • Part 2: Scaling with demand

Part 1: Making each Pod and node more efficient

1.1 GKE Node generation upgrade: n1-highmem-2 → n2d-highmem-2

Our GKE nodes had been running on n1-highmem-2 since the early days of the service. The n1 family is Google Cloud's first generation of general-purpose instances, based on older Intel Skylake/Broadwell processors.

We migrated to n2d-highmem-2, which uses AMD EPYC processors. That alone made the upgrade worth doing. According to Google's official CoreMark benchmarks:

| Instance type | CoreMark score | RAM | Monthly cost (asia-northeast1) |
| --- | --- | --- | --- |
| n1-highmem-2 | 26,293 | 13 GiB | ~$124 |
| n2d-highmem-2 | 41,073 | 16 GiB | ~$120 |

Just upgrading the Node type gives 56% better CPU performance, 23% more RAM, and a 3% lower cost. The migration had no surprises: update the Node Pool configuration, cordon and drain the old Nodes, switch to the new Node Pool. This was done in July 2025 and shows up as a SKU change in the cost graph above.
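The migration steps above could be sketched as the following commands. This is a minimal illustration, not our exact runbook; cluster, pool, and node names are placeholders.

```shell
# Create the new n2d node pool alongside the old one.
gcloud container node-pools create n2d-pool \
  --cluster=my-cluster \
  --machine-type=n2d-highmem-2

# For each old n1 node: stop new scheduling, then evict Pods gracefully
# so they reschedule onto the new pool.
kubectl cordon gke-my-cluster-n1-pool-abc123
kubectl drain gke-my-cluster-n1-pool-abc123 \
  --ignore-daemonsets --delete-emptydir-data

# Once the old pool is empty, delete it.
gcloud container node-pools delete n1-pool --cluster=my-cluster
```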

We stayed on x86-based machines rather than Arm for binary compatibility with our existing Docker images. Switching to Arm for further cost reduction is something we are considering for the future.

1.2 Rails process model: from 33 threads to 4 workers

This was where most of the savings came from.

Our Rails application was running Puma with the following configuration:

  • WEB_CONCURRENCY was not set (defaults to 1, meaning a single worker process)
  • RAILS_MAX_THREADS=33

That is one process with 33 threads. At first, it looked like this should handle up to 33 concurrent requests, but that was not how it worked in practice. The reason is how Ruby works.

Ruby's Global VM Lock (GVL)

Ruby has a Global VM Lock (GVL). Within a single process, only one thread can execute Ruby code at a time. (Threads waiting on I/O such as DB queries, HTTP requests, or file reads do release the GVL, allowing other threads to run.)

The API server was CPU-bound, so even with 33 threads in one process, effective parallelism was essentially 1. Having 33 threads did not mean 33 requests being processed simultaneously.

The Nodes running the Pods had multi-core CPUs, but the GVL meant each Pod was effectively using only one core.
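The GVL behavior described above is easy to demonstrate. The sketch below runs the same amount of CPU-bound and I/O-bound work serially and across 4 threads: the CPU-bound version gets no speedup from threads, while the I/O-bound version (sleeping threads release the GVL) does.

```ruby
# Demonstration: under CRuby's GVL, CPU-bound threads give no speedup,
# but I/O-bound threads (which release the GVL while waiting) do.
require "benchmark"

CPU_WORK = ->(n) { (1..n).reduce(0) { |s, i| s + i * i } }
N = 2_000_000

# CPU-bound: 4 units of work, serial vs 4 threads.
serial_cpu = Benchmark.realtime { 4.times { CPU_WORK.call(N) } }
threaded_cpu = Benchmark.realtime do
  4.times.map { Thread.new { CPU_WORK.call(N) } }.each(&:join)
end

# I/O-bound (simulated with sleep): 4 waits, serial vs 4 threads.
serial_io = Benchmark.realtime { 4.times { sleep 0.5 } }
threaded_io = Benchmark.realtime do
  4.times.map { Thread.new { sleep 0.5 } }.each(&:join)
end

puts format("CPU-bound: serial %.2fs vs 4 threads %.2fs (no speedup)", serial_cpu, threaded_cpu)
puts format("I/O-bound: serial %.2fs vs 4 threads %.2fs (real speedup)", serial_io, threaded_io)
```

This is exactly why a mostly CPU-bound API gains nothing from 33 threads but does gain from 4 worker processes.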

The bcrypt problem

On top of the GVL issue, every API request was running bcrypt for token authentication. bcrypt is a password hashing algorithm deliberately designed to be CPU-intensive in order to resist brute-force attacks. Running that expensive hash operation on every API request was using up CPU across all Pods. We replaced API token authentication with a lighter method for per-request token validation.
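The article does not prescribe a specific replacement scheme, but one common lighter approach is to persist a SHA-256 digest of the token and compare digests in constant time on each request. The sketch below illustrates that pattern (it is an assumption, not necessarily what we shipped).

```ruby
# Illustrative lighter API-token check: store SHA-256(token) instead of a
# bcrypt hash, and compare digests in constant time on each request.
# A single SHA-256 is orders of magnitude cheaper than a bcrypt verify.
require "openssl"
require "securerandom"

# At token creation time: hand the raw token to the client once,
# persist only its digest server-side.
raw_token = SecureRandom.hex(32)
stored_digest = OpenSSL::Digest::SHA256.hexdigest(raw_token)

# On each request: hash the presented token (cheap) and compare in
# constant time to avoid timing attacks. (OpenSSL.secure_compare
# requires Ruby 3.0+ / openssl gem 2.2+.)
def authentic?(presented, stored_digest)
  digest = OpenSSL::Digest::SHA256.hexdigest(presented)
  OpenSSL.secure_compare(digest, stored_digest)
end

puts authentic?(raw_token, stored_digest)     # valid token
puts authentic?("wrong-token", stored_digest) # invalid token
```

Unlike bcrypt, a plain cryptographic hash is appropriate here because API tokens are long random strings, not low-entropy human passwords that need brute-force resistance.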

The fix

We changed the Puma configuration to:

  • WEB_CONCURRENCY=4 (4 worker processes)
  • RAILS_MAX_THREADS=8

With this setup, Puma spawns 4 worker processes, each with its own GVL, so multi-core CPUs can be used properly. Thread count was reduced from 33 to 8. For reference, since Rails 7.2, Puma's default thread count per worker was reduced from 5 to 3, so further reductions may make sense. For more on tuning workers and threads, see Deployment engineering for Puma.

A fair question is whether going from 1 to 4 workers would also mean using 4x more memory. In practice it did not. Copy-on-Write (CoW) lets worker processes share program memory.
(Note that preload_app! needs to be set in the Puma configuration for CoW to work.)
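Put together, the Puma settings described above might look like this in `config/puma.rb` (the values come from the article; the exact file layout is illustrative):

```ruby
# config/puma.rb (sketch)
workers ENV.fetch("WEB_CONCURRENCY", 4).to_i   # 4 forked workers, one GVL each

threads_count = ENV.fetch("RAILS_MAX_THREADS", 8).to_i
threads threads_count, threads_count           # 8 threads per worker

# Load the app in the master process before forking, so Copy-on-Write
# lets workers share program memory.
preload_app!
```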

Setting MALLOC_ARENA_MAX=2 in the Rails container environment also reduced per-Pod memory usage by about 20%. The details are in a separate article.
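In a Kubernetes Deployment, these environment variables would sit on the Rails container spec, roughly like this (container name and layout are illustrative):

```yaml
# Deployment spec fragment (illustrative)
containers:
  - name: rails
    env:
      - name: WEB_CONCURRENCY
        value: "4"
      - name: RAILS_MAX_THREADS
        value: "8"
      - name: MALLOC_ARENA_MAX
        value: "2"
```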

By combining CoW, MALLOC_ARENA_MAX, and reducing the number of threads, the amount of memory used by the Pod decreased from 4.2 GiB to 3.5 GiB, even though the number of worker processes was increased from 1 to 4.

That said, the memory improvement was a side benefit. What mattered for cost was needing fewer Pods. With only one Puma worker per Pod and a CPU-bound API, each Pod could handle a limited number of concurrent requests. Moving to 4 workers per Pod meant the same traffic could be served with fewer Pods. Fewer Pods meant fewer Nodes, and that drove the cost down.


Part 2: Scaling with demand

Now that each Pod was running efficiently, the next step was to reduce costs by stopping unnecessary Pods and Nodes during off-peak hours. However, before enabling autoscaling, it was necessary to configure the system so that Pods could be safely started and stopped at any time.

2.0 Prerequisites: making Pods safe to autoscale

Autoscaling Pods means that Kubernetes creates and destroys Pods at any time. If your Pods are not set up for this, you trade cost savings for reliability problems.

We configured the following before enabling any autoscaling:

  • Startup probe: Our Rails applications can take time to start up (loading the framework, initializing gems, establishing DB connections, warming caches). Without a startup probe, Kubernetes may decide the Pod is not alive during initialization and kill it.
  • Readiness probe: This tells Kubernetes whether a Pod is ready to accept traffic. When a Pod temporarily cannot handle requests (during heavy processing or after a brief DB connection loss), the readiness probe fails and Kubernetes removes that Pod from the Service endpoints. Once the probe recovers, the Pod is added back and starts receiving traffic again.
  • Liveness probe: This detects Pods that are running but stuck (a hung Rails process, for example). Kubernetes automatically restarts them. This is important for long-running Pods.
  • terminationGracePeriodSeconds + preStop hook: terminationGracePeriodSeconds sets how long Kubernetes waits after sending SIGTERM before forcibly killing a Pod during scale-in (Pod deletion), giving in-flight requests time to finish gracefully. This setting alone is not sufficient, though. Kubernetes sends SIGTERM and detaches the Pod from the Service endpoints in parallel, so SIGTERM can arrive before detachment is complete (and once SIGTERM arrives, the Pod enters graceful shutdown and stops accepting new requests). The common fix is a preStop hook that sleeps for about 10 seconds, delaying SIGTERM so detachment from the Service endpoints happens first. For more on this timing issue, see Zero-Downtime Rolling Deployments in Kubernetes.

These are the things we needed to put in place before autoscaling safely. Without them, autoscaling may look fine in theory, but it can cause problems in production. For details on configuring each probe type, see the Kubernetes documentation on liveness, readiness, and startup probes.
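Combined, the settings above could look like the following Deployment fragment. Endpoint paths, ports, thresholds, and timings are assumptions for illustration, not the article's exact values.

```yaml
# Deployment spec fragment (illustrative)
spec:
  terminationGracePeriodSeconds: 60     # time allowed for in-flight requests
  containers:
    - name: rails
      startupProbe:
        httpGet: { path: /healthz, port: 3000 }
        periodSeconds: 10
        failureThreshold: 30            # allow up to ~5 min for slow boots
      readinessProbe:
        httpGet: { path: /readyz, port: 3000 }
        periodSeconds: 5                # gate traffic on readiness
      livenessProbe:
        httpGet: { path: /healthz, port: 3000 }
        periodSeconds: 10               # restart hung processes
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "10"]    # delay SIGTERM until endpoint detach
```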

2.1 KEDA with Cron trigger

As this is a B2B service, traffic is concentrated during weekday business hours (approximately 8:30 AM to 7:30 PM), and quiet at night and on weekends. This traffic pattern remained largely unchanged.

However, the number of Pod replicas (the number of running Pods) was fixed to match the maximum load during peak hours and ran 24/7, 365 days a year. This was a significant waste of resources.

Given this predictability, we chose KEDA's Cron trigger. KEDA is an event-driven autoscaler for Kubernetes, and the Cron trigger is one of its simplest scaling options: it adjusts the replica count on a time-based schedule, rather than reacting to metrics like CPU or memory usage. If your traffic pattern is predictable, this is simpler and more reliable than reactive scaling. There is no lag waiting for metrics to cross a threshold, no risk of flapping, and the configuration is easy to understand.

Our configuration:

  • Weekdays 08:00-20:00 JST: about 3–4x more replicas than the baseline off-hours level
  • All other times: baseline replica count

No metrics, no thresholds, no reactive logic. For our traffic pattern, that simplicity was a strength.
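A KEDA ScaledObject implementing this schedule could look like the sketch below. The Deployment name and the replica counts are placeholders; the article only states a 3-4x ratio between business-hours and baseline capacity.

```yaml
# Illustrative KEDA ScaledObject with a Cron trigger
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: rails-api
spec:
  scaleTargetRef:
    name: rails-api            # Deployment to scale (hypothetical name)
  minReplicaCount: 3           # baseline for nights and weekends
  triggers:
    - type: cron
      metadata:
        timezone: Asia/Tokyo
        start: "0 8 * * 1-5"   # weekdays 08:00 JST
        end: "0 20 * * 1-5"    # weekdays 20:00 JST
        desiredReplicas: "12"  # ~4x baseline during business hours
```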

Note that Cron-based scaling assumes a stable traffic pattern. If total request volume grows with business growth or the pattern itself changes, the current replica counts may not be enough. To catch that early, API response time and similar metrics are monitored externally on an ongoing basis. Degradation there is the signal to revisit the Cron scaling configuration.

Cron scaling is not something you set once and never touch again. It just tends to need fewer updates over time.

2.2 GKE Node Autoscaling

With KEDA changing Pod replica counts, GKE's Cluster Autoscaler was enabled next.

The logic is simple: when KEDA scales Pods in, some nodes end up underutilized. The Cluster Autoscaler cordons those nodes and removes them. When KEDA scales Pods out, the autoscaler provisions new nodes to accommodate them. We configured a minimum node count for availability zone redundancy and a maximum to cap costs.
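One way to enable this on an existing node pool is via gcloud; the pool, cluster, and bounds below are placeholders, not our actual values.

```shell
# Enable the Cluster Autoscaler on a node pool with min/max bounds.
gcloud container node-pools update rails-pool \
  --cluster=my-cluster \
  --region=asia-northeast1 \
  --enable-autoscaling \
  --min-nodes=1 \
  --max-nodes=10
```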

With KEDA controlling Pod count and Cluster Autoscaler controlling Node count, the cluster now uses only the capacity it actually needs.


Results

Comparing the H2 2024 average (before optimization) to the Q1 2026 average (after all changes were in place):

| Metric | Before | After |
| --- | --- | --- |
| Node type | n1-highmem-2 | n2d-highmem-2 |
| CoreMark score | 26,293 | 41,073 (+56%) |
| Node RAM | 13 GiB | 16 GiB (+23%) |
| Node cost (asia-northeast1) | ~$124/month | ~$120/month (-3%) |
| Puma workers (WEB_CONCURRENCY) | 1 | 4 |
| Threads per worker (RAILS_MAX_THREADS) | 33 | 8 |
| API auth per request | bcrypt | Lighter method |
| MALLOC_ARENA_MAX | Not set | 2 |
| Pod memory request | 4.2Gi | 3.5Gi (-20%) |
| Pod scaling | Static (always at peak) | KEDA Cron (higher during weekday daytime, minimum otherwise) |
| Node scaling | Fixed count | Cluster Autoscaler |
| GKE monthly cost | Baseline | -60% |

Costs have stabilized since the latter half of 2025, now that the initial optimization work is complete (see the chart at the top of the page).


Lessons learned

None of this was clever. Puma config, auth scheme, node type, a cron schedule. Each change looked minor in isolation. Together they cut costs by 60%.

Ruby's GVL means adding threads to CPU-bound work does nothing. More processes, not more threads. Thread count only matters once you know how I/O-heavy the workload actually is.

bcrypt is for password hashing, not API tokens. Running bcrypt on every API request was simply a mistake. It's slow by design; that's the point for password hashing, but it makes bcrypt the wrong tool for per-request token verification. We replaced it, and the CPU load dropped immediately.

Simple autoscaling is often enough. KEDA has a lot of trigger types, metrics, queues, custom event sources. We used none of that. A Cron schedule matched our traffic pattern well enough, and it's been lower maintenance than anything reactive would have been.

Node generation matters more than it looks. Just upgrading from n1 to n2d gave 56% better CPU, 23% more memory, and 3% lower cost. If you are still on n1 instances, moving to n2d or similar is worth doing soon.

Optimize the whole system, not just one layer. The key to this improvement was not just looking at the infrastructure or the application in isolation, but reviewing both together. Revising the Puma worker configuration and authentication method was an application-side improvement, while updating node generations and autoscaling were infrastructure-side improvements. Rather than tuning just one part, aiming for overall optimization while understanding the connections of the entire stack led to significant results.
