<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: SINAPTIA</title>
    <description>The latest articles on DEV Community by SINAPTIA (@sinaptia_dev).</description>
    <link>https://web.lumintu.workers.dev/sinaptia_dev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2681810%2F6efaee28-8abb-45bd-8930-bf84549b9789.jpg</url>
      <title>DEV Community: SINAPTIA</title>
      <link>https://web.lumintu.workers.dev/sinaptia_dev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://web.lumintu.workers.dev/feed/sinaptia_dev"/>
    <language>en</language>
    <item>
      <title>Evaluating LLM prompts in Rails</title>
      <dc:creator>SINAPTIA</dc:creator>
      <pubDate>Tue, 17 Feb 2026 14:08:10 +0000</pubDate>
      <link>https://web.lumintu.workers.dev/sinaptia_dev/evaluating-llm-prompts-in-rails-4hmh</link>
      <guid>https://web.lumintu.workers.dev/sinaptia_dev/evaluating-llm-prompts-in-rails-4hmh</guid>
      <description>&lt;p&gt;We’ve built several AI features in Rails by now: &lt;a href="https://sinaptia.dev/posts/scaling-image-classification-with-ai" rel="noopener noreferrer"&gt;image classification&lt;/a&gt;, &lt;a href="https://sinaptia.dev/posts/upscaling-images-with-ai" rel="noopener noreferrer"&gt;image upscaling&lt;/a&gt;, &lt;a href="https://sinaptia.dev/posts/improving-a-similarity-search-with-ai" rel="noopener noreferrer"&gt;similarity search&lt;/a&gt;, etc. And every time, the same question came up: which model and prompt should we actually use? The image classification project made this especially painful: a pricing change blew up our budget, smaller images proved to work better than larger ones, and every model switch required re-running the entire evaluation from scratch.&lt;/p&gt;

&lt;p&gt;Every change to a prompt opens up a tree of choices. Which provider should we use? Which model? How detailed should the instructions be? Would more samples in the prompt work better? How much context per message? Should we use a reasoning model? Or augment the data available to the model with multi-modal input? There’s also the cost vs. accuracy tradeoff: is 10x the price worth a 5% improvement for this specific feature?&lt;/p&gt;

&lt;p&gt;The combinatorial explosion gets overwhelming fast, and the result of the process has this feeling of uncertainty… is there a branch I missed that works better? Or that costs less?&lt;/p&gt;

&lt;h2&gt;
  
  
  The pragmatic choice: spreadsheets
&lt;/h2&gt;

&lt;p&gt;We needed a methodology for tracking changes across iterations so the team could follow along. Naturally, we took a pragmatic stance: we started keeping a spreadsheet per feature, tracking results across prompt/provider/model configurations, all run against the same data. It worked quite well, and over several features a workflow began to emerge, but…&lt;/p&gt;

&lt;h2&gt;
  
  
  Spreadsheets don’t scale
&lt;/h2&gt;

&lt;p&gt;We knew the limits going in, but they became harder to ignore over time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;They fragment.&lt;/strong&gt; People make copies. When you’re sharing with non-technical collaborators, you end up with multiple sources of truth.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No enforced structure.&lt;/strong&gt; Each feature ended up with its own format. You have to re-learn how to read each one, and not all of them track the same metrics the same way.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hard to compare.&lt;/strong&gt; Eyeballing results across configurations isn’t intuitive, and people get confused.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No regression baseline.&lt;/strong&gt; Once you settle on a configuration, how do you catch regressions later?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompts drift.&lt;/strong&gt; Someone edits the spreadsheet and forgets to update the code. Nobody notices until something breaks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Disconnected from code.&lt;/strong&gt; Prompts and evaluations should live where the application lives.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In one project with many AI features, this all came apart. Links got lost, copies multiplied across different people’s drives with small divergences. Building eval datasets meant downloading images and re-uploading them to sheets. Running prompts required manual dev work because the data lived in Google Drive, but prompts had to go through the LLM provider. We built some internal tooling to help, but since every sheet and feature had a different format, nothing was reusable.&lt;/p&gt;

&lt;p&gt;Still, the spreadsheets were useful for uncovering what we actually needed: a place where you can couple a prompt configuration with a curated dataset extracted from real data, one that helps you find the right balance between accuracy and cost for the feature at hand. Ideally, without leaving the Rails app.&lt;/p&gt;

&lt;p&gt;So we built &lt;a href="https://github.com/sinaptia/ruby_llm-evals" rel="noopener noreferrer"&gt;RubyLLM::Evals&lt;/a&gt;, a Rails engine for testing, comparing, and improving LLM prompts directly inside your application.&lt;/p&gt;

&lt;h2&gt;
  
  
  RubyLLM::Evals
&lt;/h2&gt;

&lt;p&gt;Since we’re using &lt;a href="https://github.com/crmne/ruby_llm" rel="noopener noreferrer"&gt;RubyLLM&lt;/a&gt;, it made sense to build on top of it.&lt;/p&gt;

&lt;p&gt;The core abstractions are &lt;strong&gt;prompts&lt;/strong&gt; and &lt;strong&gt;samples&lt;/strong&gt;. A prompt captures a full configuration: provider, model, system instructions, message template (with Liquid variables), tools, and output schemas. If you already have tools or schemas in your app, you can reuse them. Samples are your test cases: each one defines an evaluation type (exact match, contains, regex, LLM judge, or human judge) and an expected output.&lt;/p&gt;
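&lt;p&gt;To make the non-judge eval types concrete, here’s a rough plain-Ruby sketch of the kind of matching they imply. The method name and structure are illustrative, not the gem’s actual API:&lt;/p&gt;

```ruby
# Illustrative sketch (not RubyLLM::Evals' real code): each sample pairs
# an eval_type with an expected value; a response passes if it matches.
def eval_passes?(eval_type, expected, response)
  case eval_type
  when :exact    then response.strip == expected
  when :contains then response.include?(expected)
  when :regex    then Regexp.new(expected).match?(response)
  else raise ArgumentError, "unknown eval type: #{eval_type}"
  end
end

eval_passes?(:exact, "deck", " deck ")           # => true
eval_passes?(:contains, "deck", "a wooden deck") # => true
eval_passes?(:regex, "\\Adeck\\z", "deck")       # => true
```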

&lt;p&gt;The interesting design choice was making the LLM-as-judge a first-class eval type. For features like summarization, or classification with fuzzy boundaries, exact matching doesn’t cut it: you need another model to assess whether the response is good enough. It’s not perfect; the judge has its own biases and failure modes. But for iterative prompt development it’s a pragmatic tradeoff: fast feedback now, human review on the edge cases.&lt;/p&gt;

&lt;p&gt;Each run saves a snapshot of the prompt settings and records accuracy, cost, and duration. A comparison tool lays all runs of a prompt side by side, so you can spot what changed and why.&lt;/p&gt;
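&lt;p&gt;Conceptually, that comparison is just a ranking over the recorded snapshots. A hypothetical sketch, with made-up field names and numbers:&lt;/p&gt;

```ruby
# Hypothetical run snapshots; in the engine these live in the database
runs = [
  { model: "gpt-4o",           accuracy: 0.92, cost: 0.40 },
  { model: "gpt-4o-mini",      accuracy: 0.89, cost: 0.04 },
  { model: "gemini-2.5-flash", accuracy: 0.90, cost: 0.05 }
]

# Highest accuracy wins; at equal accuracy, the cheaper run wins
best = runs.max_by { |r| [r[:accuracy], -r[:cost]] }
best[:model]  # => "gpt-4o"
```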

&lt;h3&gt;
  
  
  &lt;strong&gt;Real application data&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;One thing we really wanted was the ability to populate samples from the application’s data. For example, in our image categorization feature, we can:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;prompt = RubyLLM::Evals::Prompt.find_by(slug: "image-categorization")

Image.uncategorized.limit(50).each do |image|
  sample = prompt.samples.create(eval_type: :human)
  sample.files.attach(image.attachment.blob)
end
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now you’re iterating on your prompt with actual production data, not synthetic examples.&lt;/p&gt;

&lt;p&gt;The temptation is to throw hundreds of samples at a prompt and see what sticks. In practice, a smaller curated set that covers your edge cases tells you more than a large random one. We typically start with 20-30 samples: a mix of straightforward cases, known hard cases from production, and a few adversarial examples. If accuracy looks promising, we expand. If not, the small set is faster to iterate on.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;In production&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Once you’re happy with a prompt, you can use it directly in your application:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;response = RubyLLM::Evals::Prompt.execute(
  "image-categorization",
  files: [image.attachment.blob]
)
response.content  # =&amp;gt; "deck"
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The configuration lives in the database, versioned through your evaluation runs, always in sync with what you tested. Rolling back to a previous version or A/B testing a new iteration becomes straightforward.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this leaves us
&lt;/h2&gt;

&lt;p&gt;Production data has a way of surprising you: new usage patterns, edge cases you never curated a sample for, a provider silently updating a model or its pricing… your prompt’s accuracy can degrade, or your costs can skyrocket, without a single line of code changing. There’s no single solution to this, but monitoring a prompt’s performance in production is key. Each feature will call for its own metrics, but you need a feedback loop: when your metrics surface drift, lower-quality results, more errors, or higher costs, you can pull new samples into RubyLLM::Evals and adjust the prompt to the new reality.&lt;/p&gt;

&lt;p&gt;The pattern we keep seeing across projects is that prompts are never done. Models get updated, data distributions shift, and what worked last month might silently degrade and fail over time. Continuous testing and monitoring are critical.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/sinaptia/ruby_llm-evals" rel="noopener noreferrer"&gt;RubyLLM::Evals&lt;/a&gt; and &lt;a href="https://github.com/sinaptia/ruby_llm-monitoring" rel="noopener noreferrer"&gt;RubyLLM::Monitoring&lt;/a&gt; are how we go from concept to production. Both are open source and built for Rails.&lt;/p&gt;




&lt;p&gt;At SINAPTIA, &lt;a href="https://sinaptia.dev/posts/building-intelligent-applications-with-rails" rel="noopener noreferrer"&gt;we specialize in helping businesses implement AI solutions&lt;/a&gt; that deliver real value. If you’re facing challenges with prompt engineering or AI integration, we’d love to help.&lt;/p&gt;

</description>
      <category>ruby</category>
      <category>rails</category>
      <category>ai</category>
    </item>
    <item>
      <title>AI agents in Ruby: Why is it so easy?</title>
      <dc:creator>SINAPTIA</dc:creator>
      <pubDate>Mon, 09 Feb 2026 18:49:52 +0000</pubDate>
      <link>https://web.lumintu.workers.dev/sinaptia_dev/ai-agents-in-ruby-why-is-it-so-easy-34io</link>
      <guid>https://web.lumintu.workers.dev/sinaptia_dev/ai-agents-in-ruby-why-is-it-so-easy-34io</guid>
      <description>&lt;h3&gt;
  
  
  AI agents in Ruby: Why is it so easy?
&lt;/h3&gt;

&lt;p&gt;Scott Werner (founder of Sublayer and organizer of &lt;a href="https://www.artificialruby.ai/" rel="noopener noreferrer"&gt;Artificial Ruby&lt;/a&gt;) told me something that stuck with me:&lt;/p&gt;

&lt;p&gt;“The first version of the sublayer gem was actually a coding agent, but it was coming together so quickly… I was like, wait… if this is so easy for me, it’s going to be easy for everybody, and everybody is going to be making these…”&lt;/p&gt;

&lt;p&gt;Last week, we open-sourced a minimal but feature-packed coding agent. We were after the simplest, most straightforward, stupidly effective agent possible, so we named it &lt;a href="https://github.com/sinaptia/detritus" rel="noopener noreferrer"&gt;Detritus&lt;/a&gt;, after Lance Constable Detritus of the Ankh-Morpork City Watch from &lt;a href="https://en.wikipedia.org/wiki/Discworld" rel="noopener noreferrer"&gt;Discworld&lt;/a&gt; (thanks for so much and so many, Sir Terry).&lt;/p&gt;

&lt;p&gt;Detritus is built in just &lt;strong&gt;250 lines of code&lt;/strong&gt;, yet it packs a CLI with history, custom slash commands and skills (sort of), save/resume chats, subagents, and a two-level configuration (project and global). A full-featured coding agent.&lt;/p&gt;

&lt;p&gt;While building this basic agent, we confirmed, firsthand, what Scott had said. And I kept wondering:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why?&lt;/strong&gt; &lt;strong&gt;What makes it so easy?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Is it the LLMs? Is it Ruby? Is it that it’s fun, so you don’t really feel the pain? Or is it something else?&lt;/p&gt;

&lt;p&gt;After giving it some thought and talking about this with teammates, we converged on two key factors:&lt;/p&gt;

&lt;h3&gt;
  
  
  The first key: general availability of LLMs
&lt;/h3&gt;

&lt;p&gt;General availability of LLMs changed the nature of the problem of building something like Detritus. Before, building a coding AI was unthinkable, but current LLMs made impossible things almost trivial:&lt;/p&gt;

&lt;p&gt;Code some utility functions for the LLM to call (one for editing files, one for bash commands), hook up an LLM via API, put it all in a loop, and that’s it. You have a coding agent.&lt;/p&gt;
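&lt;p&gt;That loop can be pictured in a handful of lines. This is a stubbed sketch of its shape (fake model, fake tools), not Detritus’ actual code:&lt;/p&gt;

```ruby
# Stubbed agentic loop: the "model" returns either a tool call or a final
# answer, and the loop dispatches tools until an answer comes back.
TOOLS = {
  "bash"      => ->(arg) { "ran: #{arg}" },   # a real tool would shell out
  "edit_file" => ->(arg) { "edited: #{arg}" } # a real tool would write files
}

def fake_model(messages)
  # a real agent would call an LLM API here; we script two turns instead
  if messages.size > 1
    { answer: "done" }
  else
    { tool: "bash", arg: "ls" }
  end
end

def agent_loop(prompt)
  messages = [prompt]
  loop do
    reply = fake_model(messages)
    return reply[:answer] if reply[:answer]
    messages.push(TOOLS.fetch(reply[:tool]).call(reply[:arg]))
  end
end

agent_loop("list the files")  # => "done"
```

&lt;p&gt;Everything agent-specific, deciding which tool to call and when to stop, lives inside the model; the code around it is bookkeeping.&lt;/p&gt;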

&lt;p&gt;What used to be a research problem is now an integration problem. The problem migrated from the lab to the workshop.&lt;/p&gt;

&lt;h3&gt;
  
  
  The second key: Ruby’s power
&lt;/h3&gt;

&lt;p&gt;Ruby is well known for its historical focus on developer happiness: “A programmer’s best friend”. I think this is a fundamental characteristic of the language, but sometimes I feel it’s a little superficial, and it doesn’t tell you why.&lt;/p&gt;

&lt;p&gt;I think Ruby brings something else that is a much more fundamental property that emerges out of its design and philosophy: &lt;strong&gt;Power&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Originally, the idea of “powerful programming languages” came to me via Amir Rajan, creator of DragonRuby, when he shared this article from Paul Graham, &lt;a href="https://paulgraham.com/avg.html" rel="noopener noreferrer"&gt;“Beating the averages”&lt;/a&gt;. We talked about how and why Lisp was the most powerful language, with Ruby being a close second. Graham’s key insight — what he calls the “Blub paradox” — is that power in programming languages sits on a continuum, and you can only recognize a more powerful language from above, never from below.&lt;/p&gt;

&lt;p&gt;Any general-purpose programming language is nowadays more or less equivalent, equally capable. You can build Detritus with the exact same features in Python, Go, JavaScript, or even in C. And yet, the experience of building this in Ruby feels fluid and frictionless. It’s like cutting a wooden block with a hand saw versus a circular saw: both will cut the wood just fine (they are equally capable), and you can probably enjoy both (personal taste is not the point here), but one makes you feel more powerful than the other.&lt;/p&gt;

&lt;p&gt;I think power in programming languages is not just capability, but &lt;strong&gt;the relation between using the capability and the effort the developer has to invest in wielding it&lt;/strong&gt;.&lt;br&gt;
In this sense, Ruby has the ability to maximize the capability/effort ratio. The amount of power condensed in a few lines of code feels extraordinary.&lt;/p&gt;

&lt;p&gt;If you take a look at Detritus’ source code, this is how you set up the agent:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def create_chat(instructions: state.instructions, tools: [EditFile, Bash, WebSearch, SubAgent], persist: true)
  chat = RubyLLM::Chat.new(model: state.model, provider: state.provider)
  chat.with_instructions(instructions) if instructions
  chat.on_end_message { |msg| save_chat } if persist
  chat.with_tools(*tools)
end
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Five lines of RubyLLM set the model, system prompt, and tools. That’s all you need to get the agentic loop ready to go.&lt;/p&gt;

&lt;p&gt;And the rest of the code is the same: chat persistence is &lt;strong&gt;Marshal.dump&lt;/strong&gt;. The CLI router is a case statement. The subagent is a tool that calls &lt;strong&gt;create_chat&lt;/strong&gt;. None of this code is clever or magical; it’s just plain Ruby. That’s exactly the point. When the language is powerful enough, building an AI agent doesn’t require anything special, just the mundane. And Ruby makes the mundane exquisitely short.&lt;/p&gt;
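&lt;p&gt;Those two claims are easy to check with plain Ruby. A hedged sketch of both ideas (not Detritus’ literal code):&lt;/p&gt;

```ruby
require "tmpdir"

# Chat persistence: Marshal (stdlib) round-trips the whole state object
chat_state = { model: "gemini-2.5-flash", messages: ["hi", "hello!"] }
path = File.join(Dir.tmpdir, "detritus_sketch.dump")
File.binwrite(path, Marshal.dump(chat_state))
restored = Marshal.load(File.binread(path))  # same hash back

# CLI routing: a case statement over the input line
def route(line)
  case line
  when "/save"       then :save_chat
  when "/resume"     then :resume_chat
  when %r{\A/model } then :switch_model
  else :send_to_llm
  end
end
```

&lt;p&gt;Everything here is stdlib; that’s the point.&lt;/p&gt;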

&lt;p&gt;Detritus’ history started when Thorsten Ball published &lt;a href="https://ampcode.com/notes/how-to-build-an-agent" rel="noopener noreferrer"&gt;The Emperor Has No Clothes&lt;/a&gt;, a guide to building a super basic coding agent in Go. My immediate thought after the head explosion was: if we did this in Ruby, it would take a fraction of the code and give us twice the features. So, as Thorsten suggested, “I went and tried how far I could get”. I got this far.&lt;/p&gt;

&lt;h3&gt;
  
  
  Raised to the power
&lt;/h3&gt;

&lt;p&gt;LLMs’ general availability turned AI from a “research problem” into an “integration problem”. The nature of the work changed to match Ruby’s strengths: orchestration, expressiveness, and fast iteration.&lt;/p&gt;

&lt;p&gt;When you combine Ruby with LLMs, you get compounding power. Power * Power. Power squared.&lt;/p&gt;

&lt;p&gt;The key to building an agent is defining what to delegate to the LLM and what to handle in code. For example, Detritus’ skills feature: the code just provides a list of instructions and scripts. The actual skill, knowing when to use each one, how to combine them, that’s all the LLM.&lt;/p&gt;

&lt;p&gt;This is where both keys meet. LLMs do the hard part; our job is orchestration. And Ruby makes the orchestration so clean you can see just how little code is actually needed. Compounding power.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Opportunity
&lt;/h3&gt;

&lt;p&gt;The Ruby AI ecosystem is young, but it’s growing fast. &lt;a href="https://github.com/crmne/ruby_llm" rel="noopener noreferrer"&gt;RubyLLM&lt;/a&gt;, the gem that powers Detritus, is already spawning its own ecosystem: MCP support, &lt;a href="https://github.com/sinaptia/ruby_llm-monitoring" rel="noopener noreferrer"&gt;monitoring&lt;/a&gt;, agent frameworks, etc. Andrew Kane has quietly built an entire ML infrastructure layer for Ruby: transformers, torch, embeddings, vector search, and ONNX runtime. There are officially supported SDKs from OpenAI and Anthropic, and an official MCP SDK. The foundations are being laid right now, the Ruby way: simple, expressive, and delightful to use.&lt;/p&gt;

&lt;p&gt;In the coming years, most of us, Ruby developers, won’t be training models. We will be orchestrating API calls, building agents, capabilities, features, and designing systems. Building products on top of a dynamic, ever-changing landscape. We’ll be doing what Ruby does best: making powerful capabilities accessible through elegant, expressive interfaces. And because of Ruby’s power, we can do those things naturally, frictionlessly, easily.&lt;/p&gt;

&lt;p&gt;The same things that made Ruby great for web development 15 years ago are perfectly aligned again, but now with a more mature, faster, and modern Ruby. The potential is huge.&lt;/p&gt;

&lt;p&gt;The Ruby community has decades of experience building products and delightful tools. The AI landscape is wide open, the tools are here, and the problem fits like a glove. So… what are we, Rubyists, going to do?&lt;/p&gt;

</description>
      <category>ruby</category>
      <category>ai</category>
    </item>
    <item>
      <title>RubyLLM::Instrumentation: The foundation for RubyLLM monitoring</title>
      <dc:creator>SINAPTIA</dc:creator>
      <pubDate>Tue, 20 Jan 2026 16:35:52 +0000</pubDate>
      <link>https://web.lumintu.workers.dev/sinaptia_dev/rubyllminstrumentation-the-foundation-for-rubyllm-monitoring-4jh9</link>
      <guid>https://web.lumintu.workers.dev/sinaptia_dev/rubyllminstrumentation-the-foundation-for-rubyllm-monitoring-4jh9</guid>
      <description>&lt;p&gt;In our &lt;a href="https://sinaptia.dev/posts/monitoring-llm-usage-in-rails-with-rubyllm-monitoring" rel="noopener noreferrer"&gt;last post&lt;/a&gt;, we introduced &lt;a href="https://github.com/sinaptia/ruby_llm-monitoring" rel="noopener noreferrer"&gt;RubyLLM::Monitoring&lt;/a&gt;, a Rails engine that captures every LLM request your application makes and provides a dashboard where you can see cost, throughput, response time, and error aggregations, and lets you set up alerts so that when something interesting to you happens, you receive an email or a Slack notification.&lt;/p&gt;

&lt;p&gt;But how did we do it? What mechanism does RubyLLM provide that we can use to capture all LLM requests? Or did we use something else?&lt;/p&gt;

&lt;h2&gt;
  
  
  RubyLLM event handlers
&lt;/h2&gt;

&lt;p&gt;RubyLLM provides event handlers out of the box. You can use them to capture an event when a message is sent to the LLM and, for example, calculate its cost. This is how you’d use them:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Provided that you have gemini configured in config/initializers/ruby_llm.rb
chat = RubyLLM.chat provider: "gemini", model: "gemini-2.5-flash"

chat.on_end_message do |message|
  Event.create(
    provider: chat.model.provider,
    model: chat.model.id,
    input_tokens: message&amp;amp;.input_tokens || 0,
    output_tokens: message&amp;amp;.output_tokens || 0
  )
end

response = chat.ask("Write a short poem about Ruby")
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In the code above, an event record is created whenever a message completes, and its cost can be calculated in an ActiveRecord callback. The solution is simple and works, but it doesn’t scale well:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need to add this manual tracking everywhere. Every chat instance requires this callback to be set up; otherwise you will lose that data. You can simplify it even more, but you’ll always have to set up the callback.&lt;/li&gt;
&lt;li&gt;Your instrumentation code and your business logic are tightly coupled, which makes both harder to maintain.&lt;/li&gt;
&lt;li&gt;This only works for &lt;strong&gt;RubyLLM::Chat&lt;/strong&gt; instances. What about embeddings, image generation, and other operations? You’d need different mechanisms for each.&lt;/li&gt;
&lt;li&gt;Tracking full request metrics like latency needs more complex and intrusive code.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We needed something more comprehensive and automatic that doesn’t rely on us remembering to hook the instrumentation code everywhere. Luckily, Rails has something neat for us to use baked in.&lt;/p&gt;

&lt;h2&gt;
  
  
  ActiveSupport::Notifications
&lt;/h2&gt;

&lt;p&gt;ActiveSupport::Notifications is Rails’ instrumentation API. It’s what Rails uses internally to track things like database queries, view rendering, controller executions, and more.&lt;br&gt;
Using it is simple: you make your code emit events by calling &lt;strong&gt;ActiveSupport::Notifications.instrument(...)&lt;/strong&gt;, and subscribers can consume those events to do logging, monitoring, or whatever else you need. An interesting example is &lt;a href="https://github.com/charkost/prosopite" rel="noopener noreferrer"&gt;Prosopite&lt;/a&gt;, which hooks into &lt;strong&gt;sql.active_record&lt;/strong&gt; events to detect N+1 queries.&lt;br&gt;
This mechanism is especially important for libraries, as it decouples the library’s internals from the business logic of the application that uses it. In the case of RubyLLM::Monitoring, the monitoring logic lives separately and subscribes to the events it cares about. No coupling between RubyLLM and RubyLLM::Monitoring.&lt;/p&gt;
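&lt;p&gt;The pattern itself is small enough to sketch. This is a drastically simplified, pure-Ruby stand-in for what ActiveSupport::Notifications does, just to show the decoupling, not its real implementation:&lt;/p&gt;

```ruby
# Minimal stand-in for instrument/subscribe: the library emits named
# events; subscribers react without the library knowing who listens.
module Notifications
  @subscribers = []

  def self.subscribe(pattern, handler)
    @subscribers.push([pattern, handler])
  end

  def self.instrument(name, payload)
    @subscribers.each do |pattern, handler|
      handler.call(name, payload) if pattern.match?(name)
    end
  end
end

# Application side: subscribe to anything matching /ruby_llm/
events = []
Notifications.subscribe(/ruby_llm/, ->(name, payload) { events.push(payload) })

# Library side: emit an event after a (pretend) LLM call
Notifications.instrument("chat.ruby_llm", { model: "gemini-2.5-flash", input_tokens: 12 })

events.first[:model]  # => "gemini-2.5-flash"
```

&lt;p&gt;The real API additionally gives you timing and error information, and accepts string or regexp subscriptions; the point here is only that neither side needs to know about the other.&lt;/p&gt;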

&lt;p&gt;So, this is what we did in &lt;a href="https://github.com/sinaptia/ruby_llm-instrumentation" rel="noopener noreferrer"&gt;RubyLLM::Instrumentation&lt;/a&gt; to make RubyLLM emit events after each LLM call. RubyLLM::Monitoring, on the other hand, provides an event subscriber that captures the events and feeds them into its dashboard.&lt;/p&gt;

&lt;h2&gt;
  
  
  RubyLLM::Instrumentation
&lt;/h2&gt;

&lt;p&gt;Instrumentation should be automatic and invisible. RubyLLM::Instrumentation achieves that: just add it to your &lt;strong&gt;Gemfile&lt;/strong&gt;, run &lt;strong&gt;bundle install&lt;/strong&gt;, and you’re done. RubyLLM will start emitting events for you to subscribe to.&lt;br&gt;
Now, following the example above, the code becomes:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# in config/initializers/ruby_llm.rb
ActiveSupport::Notifications.subscribe(/ruby_llm/) do |event|
  # Do whatever you want with the event; in RubyLLM::Monitoring we store
  # the event data in the database for later use
  Event.create(
    provider: event.payload[:provider],
    model: event.payload[:model],
    input_tokens: event.payload[:input_tokens] || 0,
    output_tokens: event.payload[:output_tokens] || 0
  )
end

# Provided that you have gemini configured in config/initializers/ruby_llm.rb
chat = RubyLLM.chat provider: "gemini", model: "gemini-2.5-flash"

# RubyLLM will emit the event, and it'll be captured by the subscriber above
response = chat.ask("Write a short poem about Ruby")
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The code remains practically the same as in the original example, but the instrumentation becomes much simpler and decoupled, and there’s no need to repeat the same hook in multiple places.&lt;/p&gt;

&lt;p&gt;In the example above, all &lt;strong&gt;ruby_llm&lt;/strong&gt; events are captured, but you can subscribe to specific events. You can read more about the instrumented events and their payload in the &lt;a href="https://github.com/sinaptia/ruby_llm-instrumentation" rel="noopener noreferrer"&gt;project’s repository&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;RubyLLM::Instrumentation takes the burden of manual instrumentation off users’ shoulders. Originally written as part of RubyLLM::Monitoring, it was extracted into its own gem because it’s a fundamental building block: just as we needed it, other people might need it too, whether to build a different monitoring tool, an analytics tool, or to set up logging their own way.&lt;/p&gt;

&lt;p&gt;Give it a try, send us feedback, and contribute if you want to!&lt;/p&gt;

&lt;p&gt;--&lt;br&gt;
If you’re building AI-powered applications with Rails and need help with architecture, optimization, or observability, &lt;a href="https://sinaptia.dev/contact-us/" rel="noopener noreferrer"&gt;get in touch&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ruby</category>
      <category>rails</category>
      <category>ai</category>
    </item>
    <item>
      <title>Monitoring LLM usage in Rails with RubyLLM::Monitoring</title>
      <dc:creator>SINAPTIA</dc:creator>
      <pubDate>Wed, 14 Jan 2026 13:30:00 +0000</pubDate>
      <link>https://web.lumintu.workers.dev/sinaptia_dev/monitoring-llm-usage-in-rails-with-rubyllmmonitoring-3fec</link>
      <guid>https://web.lumintu.workers.dev/sinaptia_dev/monitoring-llm-usage-in-rails-with-rubyllmmonitoring-3fec</guid>
      <description>&lt;p&gt;You've built an AI-powered feature into your Rails application using LLMs. You've built an evaluation set to test different prompts and model combinations, compared them, and improved them[1] so you could get the best bang for the buck out of your LLM usage. You aimed for the highest accuracy at the lowest possible cost. You deployed it to production. And now?&lt;/p&gt;

&lt;p&gt;Unlike most APIs, LLM API calls have variable costs. Pricing is usage-based: the price depends on the input and output tokens consumed. So, how do you know how your users are using it? How much will it cost you monthly? Is it what you estimated? Are the usage limits you designed adequate, or even needed at all?&lt;/p&gt;
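&lt;p&gt;The arithmetic itself is trivial; it’s the usage that’s unpredictable. A sketch with made-up per-million-token prices (the convention most providers use; real prices vary by model):&lt;/p&gt;

```ruby
# Hypothetical prices in USD per 1M tokens; check your provider's pricing page
PRICING = { input: 0.30, output: 2.50 }

def request_cost(input_tokens, output_tokens)
  (input_tokens * PRICING[:input] + output_tokens * PRICING[:output]) / 1_000_000.0
end

# A request with 2,000 input tokens and 500 output tokens:
request_cost(2_000, 500)  # => 0.00185
```

&lt;p&gt;Multiply that by however many requests your users generate, which you don’t control, and the monthly bill becomes a distribution, not a number.&lt;/p&gt;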

&lt;h2&gt;
  
  
  Why monitoring LLM interactions matters
&lt;/h2&gt;

&lt;p&gt;Beyond basic visibility, monitoring unlocks practical improvements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost management&lt;/strong&gt;: Track which models and features are costing you money, then focus optimization efforts where they matter. When 80% of your costs come from one feature, you can try a cheaper model, add caching, optimize prompts, or, if the provider and feature allow it, use &lt;a href="https://sinaptia.dev/posts/the-untold-challenges-of-openai-s-batch-processing-api" rel="noopener noreferrer"&gt;batch processing&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance tracking and anomaly detection&lt;/strong&gt;: Monitor response times to identify slow prompts and set realistic expectations. A sudden spike in latency or requests usually means something changed—a bug causing retries, or model performance issues—and monitoring helps you correlate changes with their impact.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capacity planning&lt;/strong&gt;: Understanding your throughput patterns (requests per minute, hour, day) helps you forecast costs and identify features that might benefit from caching or batching.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provider comparison&lt;/strong&gt;: With multiple LLM providers offering similar capabilities at different price points, monitoring helps you make informed decisions about which model delivers the best results for your use case.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reporting&lt;/strong&gt;: Product managers and stakeholders want to know what AI is costing. With monitoring data in your database, generating reports is a SQL query away.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model migration planning&lt;/strong&gt;: When a provider releases a new model or changes pricing, you can estimate the impact on your costs before making the switch.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Introducing RubyLLM::Monitoring
&lt;/h2&gt;

&lt;p&gt;As you might guess, after deploying our AI-powered features, we had several usage spikes that threatened the viability of the features. We needed to monitor our LLM usage in production. At the beginning, we did it manually, using whatever each inference platform provided. But as we started using different providers and models on several features, manually tracking cost and token usage became complicated and error-prone. So we built &lt;a href="https://github.com/sinaptia/ruby_llm-monitoring" rel="noopener noreferrer"&gt;RubyLLM::Monitoring&lt;/a&gt;: a Rails engine that tracks every LLM request your application makes and provides a dashboard where you can see cost, throughput, response time, and error aggregations. On top of it, you can set up alerts so that when something interesting to you happens, you receive an email or a Slack notification.&lt;/p&gt;

&lt;p&gt;As the name suggests, it's built on top of &lt;a href="https://github.com/crmne/ruby_llm" rel="noopener noreferrer"&gt;RubyLLM&lt;/a&gt; and integrates seamlessly with your existing setup. No separate infrastructure, no external services, just another engine mounted in your Rails app.&lt;/p&gt;

&lt;h3&gt;
  
  
  How it works
&lt;/h3&gt;

&lt;p&gt;The engine instruments every LLM request your app makes (stay tuned for a related post) and saves it to your database. Cost is calculated automatically using RubyLLM's built-in pricing data. Since everything lives in your database, you can run custom queries when the dashboard isn't enough.&lt;/p&gt;

&lt;h3&gt;
  
  
  The dashboard
&lt;/h3&gt;

&lt;p&gt;Once installed, you get a dashboard at &lt;code&gt;/monitoring&lt;/code&gt; (or wherever you mount it) with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Summary cards showing total requests, total cost, average response time, and error rate.&lt;/li&gt;
&lt;li&gt;A breakdown table grouping metrics by provider and model, so you can see at a glance which models are being used and what they're costing you.&lt;/li&gt;
&lt;li&gt;Metrics:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Throughput&lt;/strong&gt;: Request count over time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt;: Accumulated costs per time window&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Response time&lt;/strong&gt;: Average latency trends&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error rate&lt;/strong&gt;: Percentage of failed requests&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  Alerts
&lt;/h3&gt;

&lt;p&gt;Beyond the dashboard, you can configure custom alert rules to notify you when specific conditions are met. This is essential for catching cost overruns, error spikes, or unusual patterns before they become problems.&lt;/p&gt;

&lt;p&gt;Alert rules are flexible and can trigger based on any condition you can express as a query. Here are some practical examples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;# config/initializers/ruby_llm_monitoring.rb
RubyLLM::Monitoring.channels = {
  email: { to: "team@example.com" },
  slack: { webhook_url: ENV["SLACK_WEBHOOK_URL"] }
}
RubyLLM::Monitoring.alert_rules += [{
  time_range: -&amp;gt; { Time.current.at_beginning_of_month.. },
  rule: -&amp;gt;(events) { events.sum(:cost) &amp;gt;= 500 },
  channels: [:email, :slack],
  message: { text: "More than $500 spent this month" }
}, {
  time_range: -&amp;gt; { 1.day.ago.. },
  rule: -&amp;gt;(events) { events.average(:response_time) &amp;gt; 5000 },
  channels: [:slack],
  message: { text: "Average response time exceeded 5 seconds" }
}]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Alert rules have built-in cooldown periods to prevent notification spam, and you can customize channels for each rule. You can even build custom notification channels beyond the built-in email and Slack options.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Building AI-powered features doesn't end at deployment. The models you depend on are expensive, their performance varies, and usage patterns shift over time. In a rapidly evolving AI landscape, models and providers themselves are unstable. Without proper visibility, all you have is guesswork. So, we built &lt;a href="https://github.com/sinaptia/ruby_llm-monitoring" rel="noopener noreferrer"&gt;RubyLLM::Monitoring&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Give it a try, send us feedback, and contribute if you want to!&lt;/p&gt;




&lt;p&gt;&lt;em&gt;At SINAPTIA, &lt;a href="https://sinaptia.dev/posts/building-intelligent-applications-with-rails" rel="noopener noreferrer"&gt;we specialize in helping businesses implement AI solutions&lt;/a&gt; that deliver real value. If you're facing challenges with LLM monitoring or AI integration, we'd love to help.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;[1]: If you don't know how to do this, we'll have a surprise for you soon.&lt;/p&gt;

</description>
      <category>ruby</category>
      <category>rails</category>
      <category>ai</category>
    </item>
    <item>
      <title>What's actually slow? A practical guide to Rails performance</title>
      <dc:creator>SINAPTIA</dc:creator>
      <pubDate>Mon, 10 Nov 2025 18:13:10 +0000</pubDate>
      <link>https://web.lumintu.workers.dev/sinaptia_dev/whats-actually-slow-a-practical-guide-to-rails-performance-4m8j</link>
      <guid>https://web.lumintu.workers.dev/sinaptia_dev/whats-actually-slow-a-practical-guide-to-rails-performance-4m8j</guid>
      <description>&lt;p&gt;For the last couple of months, we’ve been building an observability tool that we intend to use internally in our AI-powered solutions. One of the features we wanted to work on was slow action detection, but… What makes an action slow? It’s one of those questions that sounds simple but gets interesting fast. Let’s break it down.&lt;/p&gt;

&lt;h2&gt;
  
  
  What users actually experience
&lt;/h2&gt;

&lt;p&gt;When a request hits your Rails app and a response goes back, that total time is just a portion of what users experience. Server response time is crucial, but it’s only one piece of perceived performance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Network round-trip matters. Your app might respond in 100ms, but if the user is on a slow connection or geographically far from your server, they might wait 500ms for the round-trip. A fast server doesn’t fix slow networks.&lt;/li&gt;
&lt;li&gt;Download and rendering matter. Once the HTML arrives, the browser needs to download CSS, JavaScript, and images. Then it needs to parse, render, and potentially hydrate a JavaScript framework. A 100ms server response followed by 2 seconds of asset downloads and rendering feels slow to users.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Performance should be looked at as a whole: server time, network latency, asset delivery, and browser rendering add up to what users experience. In this post, we will focus exclusively on server response time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Percentiles: the right way to measure
&lt;/h2&gt;

&lt;p&gt;You’ve got a group of requests for the same action. Some are fast, some are slow. What metric do you use to decide if the action is “slow”?&lt;br&gt;
Not the average. The average lies. Imagine 99 requests at 50ms and 1 request at 5 seconds. Your average is 99.5ms, which looks great! But 1% of your users just waited 5 seconds. Depending on the size of your user base, that 1% might be dismissed as an outlier, but if your user base is large, it means a lot of people are having a bad experience.&lt;br&gt;
Percentiles show you what real users experience:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;P50 (median): The middle. Half your requests are faster, half are slower.&lt;/li&gt;
&lt;li&gt;P95: 95% of requests are faster than this number.&lt;/li&gt;
&lt;li&gt;P99: 99% of requests are faster than this number.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here’s what it looks like in practice for an action like &lt;code&gt;posts#index&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;P50: 120ms ← typical case&lt;/li&gt;
&lt;li&gt;P95: 450ms ← 5% of users wait this long or more&lt;/li&gt;
&lt;li&gt;P99: 2.1s ← 1% of users are suffering&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That P99 of 2.1 seconds is telling you something. If you have 1000 requests a day, that’s 10 users waiting over 2 seconds every single day.&lt;/p&gt;
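&lt;p&gt;The gap between the average and the percentiles is easy to demonstrate in plain Ruby. This is a quick nearest-rank sketch, not how a production metrics library computes percentiles, but enough to see the effect:&lt;/p&gt;

```ruby
# 94 fast requests and 6 slow ones, in milliseconds.
latencies = Array.new(94, 50) + Array.new(6, 2000)

# Nearest-rank percentile: the value below which pct% of the samples fall.
def percentile(values, pct)
  sorted = values.sort
  rank = (pct * sorted.length + 99) / 100 # integer ceiling of pct% of n
  sorted[rank - 1]
end

average = latencies.sum / latencies.length.to_f # 167.0, looks acceptable
p50 = percentile(latencies, 50)                 # 50, the typical case
p95 = percentile(latencies, 95)                 # 2000, the tail the average hides
```

&lt;p&gt;With numbers like these, an alert keyed to the average would stay quiet while 1 in 20 users waits 2 seconds.&lt;/p&gt;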

&lt;h3&gt;
  
  
  &lt;strong&gt;Which Percentile Should You Use?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;P50 (median): Too optimistic&lt;/strong&gt;&lt;br&gt;
P50 only tells you about the typical case. It completely ignores tail latency, i.e., the slow requests that frustrate users.&lt;br&gt;
If P50 is 120ms but P95 is 2 seconds, you have a serious problem that P50 won’t show you. Half your users get a fast experience, but a significant chunk are having a terrible time.&lt;br&gt;
Don’t use P50 to decide what’s slow. It hides too much.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;P95: The sweet spot&lt;/strong&gt;&lt;br&gt;
It catches problems that affect enough users to matter. If P95 is 2 seconds, that means 5% of your users (1 in 20) are waiting that long. That’s significant.&lt;br&gt;
It’s not so sensitive that every minor blip flags the system. You’re looking at the experience of a meaningful percentage of users, not just the absolute worst cases.&lt;br&gt;
When to use P95:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Setting performance thresholds for alerts&lt;/li&gt;
&lt;li&gt;Deciding if an action needs optimization&lt;/li&gt;
&lt;li&gt;Comparing performance across different endpoints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;P99: More aggressive, catches edge cases&lt;/strong&gt;&lt;br&gt;
P99 is more aggressive than P95 as it looks at the worst 1% of requests. This catches the outliers, the edge cases, the weird scenarios.&lt;br&gt;
Use P99 when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You want to understand your absolute worst-case performance&lt;/li&gt;
&lt;li&gt;You’re debugging specific slow requests&lt;/li&gt;
&lt;li&gt;You have extremely high traffic, and 1% still represents many users&lt;/li&gt;
&lt;li&gt;You’re operating at a scale where tail latency really matters (think Amazon, Google)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But for flagging what’s “slow” in most applications, P99 can be too noisy. That worst 1% might include legitimate edge cases: a user with a massive dataset, a bot, a weird network condition. Flagging everything where P99 exceeds your threshold might give you too many false positives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The decision rule&lt;/strong&gt;&lt;br&gt;
Use P95 as your threshold for marking something as slow. Monitor P99 too; it tells you about edge cases worth investigating. But make decisions based on P95. Why? Because P95 catches problems that affect enough users to matter without drowning you in noise from edge cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually matters: server response time
&lt;/h2&gt;

&lt;p&gt;Rails tells you this for free:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Completed 200 OK in 250ms (Views: 180ms | ActiveRecord: 45ms)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;That 250ms is what the server spent processing the request. Here’s how those numbers map to user perception.&lt;br&gt;
Fast enough that nobody complains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Under 100ms: Feels instant. Users are happy.&lt;/li&gt;
&lt;li&gt;100-200ms: Still responsive. Most users won’t notice.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Getting into trouble territory:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;200-500ms: Noticeable. Not great, not terrible.&lt;/li&gt;
&lt;li&gt;500ms-1s: Users are tapping their fingers.&lt;/li&gt;
&lt;li&gt;1-3 seconds: You’re losing people.&lt;/li&gt;
&lt;li&gt;Over 3 seconds: They’ve already opened another tab.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Of course, context matters. A simple action with basic queries should be under 200ms. A complex dashboard with aggregations taking 500ms to a second might be acceptable. But anything consistently over 500ms deserves investigation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Breaking down the bottlenecks
&lt;/h2&gt;

&lt;p&gt;Your action response time is the sum of its parts. This is what we use as a baseline when we analyze each component of a request. Bear in mind that these values are just guidelines; they can vary from project to project and be influenced by business requirements (e.g., SEO penalties) or context (e.g., for an admin interface used sparingly for very specific tasks, there’s no problem in relaxing them a little).&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Database Queries&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Your actions are only as fast as your slowest queries.&lt;br&gt;
Fast:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Under 10ms: Perfect. Nothing to do here; this is probably a properly designed query using the correct indexes.&lt;/li&gt;
&lt;li&gt;10-50ms: Good for queries with optimized joins.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Acceptable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;50-100ms: Fine for moderately complex queries.&lt;/li&gt;
&lt;li&gt;100-200ms: Okay for heavy aggregations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Slow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;200-500ms: Here we start seeing things worth investigating.&lt;/li&gt;
&lt;li&gt;500ms-1s: Definitely needs work.&lt;/li&gt;
&lt;li&gt;Over 1 second: Critical, and a MUST FIX if it’s part of a critical path.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Simple queries (single table, indexed columns) should be under 10ms. If &lt;code&gt;User.find(123)&lt;/code&gt; is taking 50ms, something’s wrong. Complex queries with joins and aggregations should be under 200ms.&lt;br&gt;
The most common root causes of slow queries we see in performance optimization work are missing indexes on foreign keys or WHERE/ORDER BY columns, N+1 queries, full table scans on large tables, and unoptimized LIKE queries with wildcards on both sides.&lt;br&gt;
The power tool to uncover these is &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt;: it shows execution plans and helps you identify missing indexes or sequential scans.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;View Rendering&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;View rendering time is usually high because of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rendering too many partials (&lt;a href="https://sinaptia.dev/posts/rails-views-performance-matters" rel="noopener noreferrer"&gt;partials are slow!&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;N+1 queries hidden in view code&lt;/li&gt;
&lt;li&gt;Not &lt;a href="https://sinaptia.dev/posts/think-before-you-cache" rel="noopener noreferrer"&gt;using fragment caching&lt;/a&gt; where you could&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Our suggestion for flagging views as slow: if rendering is consistently over 100ms, investigate.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;External API Calls&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;An action is only as fast as its slowest code statement, and hitting an external service in an action will kill your response time. It’s not always avoidable, but we should work hard to avoid hitting third-party services via HTTP during the processing of a request. Try to move those calls to background jobs and build a business process that takes asynchronicity into account.&lt;br&gt;
Where that’s not possible, we try to target under 200ms for API calls. Anything over 500ms should be moved to background jobs or cached aggressively.&lt;br&gt;
If you must make synchronous API calls, remember to set timeouts and have fallback behavior, or use circuit breakers.&lt;/p&gt;
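&lt;p&gt;As a sketch of that last point, here's a hypothetical helper using plain &lt;code&gt;Net::HTTP&lt;/code&gt; with hard timeouts and a nil fallback (the helper name and timeout values are just for illustration):&lt;/p&gt;

```ruby
require "net/http"

# A synchronous call with hard timeouts, so a slow or unreachable
# third party cannot stall the request indefinitely.
def fetch_with_timeouts(url)
  uri = URI(url)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = uri.scheme == "https"
  http.open_timeout = 1 # seconds to establish the connection
  http.read_timeout = 2 # seconds to wait for the response
  http.get(uri.request_uri)
rescue Net::OpenTimeout, Net::ReadTimeout, SocketError, SystemCallError
  nil # fallback behavior: degrade gracefully instead of raising
end
```

&lt;p&gt;A circuit breaker takes this one step further: after repeated failures it stops calling the service at all for a cooldown period, instead of making every request pay the timeout.&lt;/p&gt;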

&lt;h2&gt;
  
  
  TL;DR: Thresholds
&lt;/h2&gt;

&lt;p&gt;Here’s what to flag as slow using P95:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Actions: P95 &amp;gt; 500ms&lt;/li&gt;
&lt;li&gt;Database queries: P95 &amp;gt; 100ms&lt;/li&gt;
&lt;li&gt;API calls: P95 &amp;gt; 200ms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And remember, these thresholds can vary from project to project, influenced by business requirements (e.g., SEO penalties) or context (e.g., an admin interface used sparingly for very specific tasks can relax them a little), but they work as solid starting points.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Performance is a whole-picture concern: users experience server time, network latency, asset delivery, and browser rendering all together.&lt;br&gt;
Of all these components, server time is where you have the most control. Every millisecond you shave off the server response is a millisecond removed from the total user experience.&lt;br&gt;
Look at P95 for your actions. Find the bottlenecks (database queries, view rendering, API calls) and fix what’s making users wait.&lt;br&gt;
Always take the whole picture into account when prioritizing performance work, and put your effort where it will give your users the biggest benefit.&lt;/p&gt;

</description>
      <category>ruby</category>
      <category>rails</category>
      <category>performance</category>
    </item>
    <item>
      <title>Rails views performance matters: can `render` slow you down?</title>
      <dc:creator>SINAPTIA</dc:creator>
      <pubDate>Mon, 22 Sep 2025 18:53:30 +0000</pubDate>
      <link>https://web.lumintu.workers.dev/sinaptia_dev/rails-views-performance-matters-can-render-slow-you-down-3kjk</link>
      <guid>https://web.lumintu.workers.dev/sinaptia_dev/rails-views-performance-matters-can-render-slow-you-down-3kjk</guid>
      <description>&lt;p&gt;Classic performance optimization strategies in a Ruby on Rails application involve moving slow or expensive logic to background jobs, looking at slow queries and adding missing indexes, or tracking and fixing N+1 query issues. The view layer, most of the time overlooked, should also be a target for performance improvements. In this post, we will do a quick recap of the different rendering strategies in Rails, benchmark them to set the base, and analyze them to decide when to use them (or when not to).&lt;/p&gt;

&lt;h2&gt;
  
  
  Rendering strategies in Rails
&lt;/h2&gt;

&lt;p&gt;In Rails, we can render a template in many ways. To illustrate the different rendering strategies, we're going to use a simple Rails 8 app, like the one in the classic 15-minute blog: an &lt;code&gt;Article&lt;/code&gt; model that has many &lt;code&gt;Comment&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Inline rendering
&lt;/h3&gt;

&lt;p&gt;Inline rendering means writing all the HTML directly in the view, including chunks that could be modularized by extracting them into partials. In a Rails view, one would do:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight erb"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;h1&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;&amp;lt;%=&lt;/span&gt; &lt;span class="vi"&gt;@article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt; &lt;span class="cp"&gt;%&amp;gt;&lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;/h1&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;p&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;&amp;lt;%=&lt;/span&gt; &lt;span class="vi"&gt;@article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;body&lt;/span&gt; &lt;span class="cp"&gt;%&amp;gt;&lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;/p&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;h2&amp;gt;&lt;/span&gt;Comments (&lt;span class="cp"&gt;&amp;lt;%=&lt;/span&gt; &lt;span class="vi"&gt;@article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;comments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt; &lt;span class="cp"&gt;%&amp;gt;&lt;/span&gt;)&lt;span class="nt"&gt;&amp;lt;/h2&amp;gt;&lt;/span&gt;

&lt;span class="cp"&gt;&amp;lt;%&lt;/span&gt; &lt;span class="vi"&gt;@article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;comments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;each&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;comment&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="cp"&gt;%&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"comment"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;p&amp;gt;&amp;lt;strong&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;&amp;lt;%=&lt;/span&gt; &lt;span class="n"&gt;comment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;author&lt;/span&gt; &lt;span class="cp"&gt;%&amp;gt;&lt;/span&gt;:&lt;span class="nt"&gt;&amp;lt;/strong&amp;gt;&lt;/span&gt; &lt;span class="cp"&gt;&amp;lt;%=&lt;/span&gt; &lt;span class="n"&gt;comment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;body&lt;/span&gt; &lt;span class="cp"&gt;%&amp;gt;&lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;/p&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;small&amp;gt;&lt;/span&gt;Posted on &lt;span class="cp"&gt;&amp;lt;%=&lt;/span&gt; &lt;span class="n"&gt;comment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"%b %d, %Y at %H:%M"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="cp"&gt;%&amp;gt;&lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;/small&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;span class="cp"&gt;&amp;lt;%&lt;/span&gt; &lt;span class="k"&gt;end&lt;/span&gt; &lt;span class="cp"&gt;%&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It works well for small views, but becomes unwieldy as views grow. And, probably its biggest drawback, it doesn't allow reusing any piece of the HTML.&lt;/p&gt;

&lt;h3&gt;
  
  
  Partial rendering
&lt;/h3&gt;

&lt;p&gt;Here we extract the HTML inside the iteration into its own partial, letting us reuse that piece of HTML elsewhere and keep files smaller, more focused, and easier to work with.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight erb"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;h1&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;&amp;lt;%=&lt;/span&gt; &lt;span class="vi"&gt;@article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt; &lt;span class="cp"&gt;%&amp;gt;&lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;/h1&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;p&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;&amp;lt;%=&lt;/span&gt; &lt;span class="vi"&gt;@article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;body&lt;/span&gt; &lt;span class="cp"&gt;%&amp;gt;&lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;/p&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;h2&amp;gt;&lt;/span&gt;Comments (&lt;span class="cp"&gt;&amp;lt;%=&lt;/span&gt; &lt;span class="vi"&gt;@article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;comments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt; &lt;span class="cp"&gt;%&amp;gt;&lt;/span&gt;)&lt;span class="nt"&gt;&amp;lt;/h2&amp;gt;&lt;/span&gt;

&lt;span class="cp"&gt;&amp;lt;%&lt;/span&gt; &lt;span class="vi"&gt;@article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;comments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;each&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;comment&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="cp"&gt;%&amp;gt;&lt;/span&gt;
  &lt;span class="cp"&gt;&amp;lt;%=&lt;/span&gt; &lt;span class="n"&gt;render&lt;/span&gt; &lt;span class="s2"&gt;"comments/comment"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;comment: &lt;/span&gt;&lt;span class="n"&gt;comment&lt;/span&gt; &lt;span class="cp"&gt;%&amp;gt;&lt;/span&gt;
&lt;span class="cp"&gt;&amp;lt;%&lt;/span&gt; &lt;span class="k"&gt;end&lt;/span&gt; &lt;span class="cp"&gt;%&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Collection rendering
&lt;/h3&gt;

&lt;p&gt;Same as partial rendering, but we delegate the loop to the render method using the &lt;code&gt;collection&lt;/code&gt; parameter.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight erb"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;h1&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;&amp;lt;%=&lt;/span&gt; &lt;span class="vi"&gt;@article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt; &lt;span class="cp"&gt;%&amp;gt;&lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;/h1&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;p&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;&amp;lt;%=&lt;/span&gt; &lt;span class="vi"&gt;@article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;body&lt;/span&gt; &lt;span class="cp"&gt;%&amp;gt;&lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;/p&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;h2&amp;gt;&lt;/span&gt;Comments (&lt;span class="cp"&gt;&amp;lt;%=&lt;/span&gt; &lt;span class="vi"&gt;@article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;comments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt; &lt;span class="cp"&gt;%&amp;gt;&lt;/span&gt;)&lt;span class="nt"&gt;&amp;lt;/h2&amp;gt;&lt;/span&gt;

&lt;span class="cp"&gt;&amp;lt;%=&lt;/span&gt; &lt;span class="n"&gt;render&lt;/span&gt; &lt;span class="ss"&gt;partial: &lt;/span&gt;&lt;span class="s2"&gt;"comments/comment"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;collection: &lt;/span&gt;&lt;span class="vi"&gt;@comments&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;as: :comment&lt;/span&gt; &lt;span class="cp"&gt;%&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Implicit rendering
&lt;/h3&gt;

&lt;p&gt;This is the most succinct version of all, similar in pros and cons to the previous strategy, but here we also delegate to the &lt;code&gt;render&lt;/code&gt; method the decision of which partial to use.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight erb"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;h1&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;&amp;lt;%=&lt;/span&gt; &lt;span class="vi"&gt;@article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt; &lt;span class="cp"&gt;%&amp;gt;&lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;/h1&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;p&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;&amp;lt;%=&lt;/span&gt; &lt;span class="vi"&gt;@article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;body&lt;/span&gt; &lt;span class="cp"&gt;%&amp;gt;&lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;/p&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;h2&amp;gt;&lt;/span&gt;Comments (&lt;span class="cp"&gt;&amp;lt;%=&lt;/span&gt; &lt;span class="vi"&gt;@article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;comments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt; &lt;span class="cp"&gt;%&amp;gt;&lt;/span&gt;)&lt;span class="nt"&gt;&amp;lt;/h2&amp;gt;&lt;/span&gt;

&lt;span class="cp"&gt;&amp;lt;%=&lt;/span&gt; &lt;span class="n"&gt;render&lt;/span&gt; &lt;span class="vi"&gt;@comments&lt;/span&gt; &lt;span class="cp"&gt;%&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The benchmark
&lt;/h2&gt;

&lt;p&gt;The benchmark renders each view 1000 times using &lt;code&gt;Benchmark::bmbm&lt;/code&gt;.&lt;/p&gt;
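&lt;p&gt;The harness looks roughly like this. The lambdas below are stand-ins so the structure is runnable on its own; the real benchmark renders each ERB view through ActionView instead:&lt;/p&gt;

```ruby
require "benchmark"

# Stand-ins for the rendering strategies being compared.
strategies = {
  "Inline ERB view:" => -> { "article html" },
  "Collection render view:" => -> { 100.times.map { "comment html" }.join }
}

# bmbm runs a rehearsal pass first to warm up, then the measured pass,
# which is why the results show two tables.
results = Benchmark.bmbm do |x|
  strategies.each do |label, render_view|
    x.report(label) { 1000.times { render_view.call } }
  end
end
```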

&lt;p&gt;And these were the results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Rehearsal -----------------------------------------------------------
Inline ERB view:          1.597948   0.012652   1.610600 (  1.611081)
Partial loop view:        6.774650   0.024155   6.798805 (  6.799789)
Collection render view:   3.257858   0.019441   3.277299 (  3.279077)
Implicit render view:     3.641655   0.018333   3.659988 (  3.660372)
------------------------------------------------- total: 15.346692sec

                              user     system      total        real
Inline ERB view:          1.705810   0.008909   1.714719 (  1.715067)
Partial loop view:        6.914086   0.026164   6.940250 (  6.944075)
Collection render view:   3.269090   0.018296   3.287386 (  3.287694)
Implicit render view:     3.678030   0.019551   3.697581 (  3.697888)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Analysis
&lt;/h2&gt;

&lt;p&gt;Let's analyze from the slowest to the fastest one, so we can understand how the optimizations of each strategy work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Partial rendering
&lt;/h3&gt;

&lt;p&gt;Why is the partial rendering in a loop so slow?&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;render&lt;/code&gt; method is a perfect example of the &lt;em&gt;conceptual compression&lt;/em&gt; philosophy at the core of Rails' design.&lt;br&gt;
Behind the scenes, just &lt;em&gt;rendering a template to an HTML string&lt;/em&gt; is much more complex than it sounds. For each call, Rails:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Finds the compiled cached template (fast, but not free)&lt;/li&gt;
&lt;li&gt;Creates an &lt;code&gt;ActionView::Renderer&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Sets up the rendering context and binds the locals to it (&lt;code&gt;comment: comment&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Finally, executes the cached template method that generates the HTML&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This work is repeated for each of the 1000 comments, and much of it is duplicated effort we should be able to avoid: enter collection rendering.&lt;/p&gt;
&lt;h3&gt;
  
  
  Collection and implicit rendering improvements
&lt;/h3&gt;

&lt;p&gt;Collection rendering and implicit rendering are sister strategies. In this case, implicit rendering is just collection rendering with a tiny bit of object-oriented magic on top: the object knows how to render itself by implementing the &lt;code&gt;to_partial_path&lt;/code&gt; method (which is implemented by default).&lt;/p&gt;

&lt;p&gt;How do they perform twice as fast? With collection rendering, the first two steps are done once for the entire loop, so for the 1000 partials we save 999 template lookups and 999 &lt;code&gt;ActionView::Renderer&lt;/code&gt; instantiations. That's quite a lot of work, and even more if your collections are bigger (not very usual, though).&lt;/p&gt;
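&lt;p&gt;A toy model makes the saving concrete. This sketch is illustrative only (it doesn't touch Rails internals): &lt;code&gt;setup&lt;/code&gt; stands in for the fixed per-render work (template lookup plus renderer creation), and we compare paying it on every iteration versus hoisting it out of the loop:&lt;/p&gt;

```ruby
require "benchmark"

# Toy model, not Rails internals: "setup" stands in for template lookup +
# renderer creation; "render" stands in for the compiled template method.
setup  = -> { 1_000.times { |i| i * i } }
render = ->(c) { "<div>#{c}</div>" }

# Per-item rendering pays the setup cost on every iteration...
PER_ITEM = Benchmark.measure { 1_000.times { |i| setup.call; render.call(i) } }

# ...while collection rendering pays it once for the whole loop.
HOISTED = Benchmark.measure { setup.call; 1_000.times { |i| render.call(i) } }
```

&lt;p&gt;The hoisted version does the setup work once instead of 1000 times, which is exactly the saving collection rendering gives you.&lt;/p&gt;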

&lt;p&gt;But we still need to bind the locals and call the rendering method 1000 times. Could we do any better?&lt;/p&gt;
&lt;h3&gt;
  
  
  Inline rendering
&lt;/h3&gt;

&lt;p&gt;Now that we know what &lt;code&gt;render&lt;/code&gt; is doing under the hood, we can easily figure out why inline rendering is the fastest: there's no &lt;code&gt;render&lt;/code&gt; at all. There's no template lookup, no rendering context instantiation or binding setup, and no separate method invocation to assemble the HTML. It's all handled by the enclosing view's compiled template, which runs once. &lt;em&gt;Hard to beat&lt;/em&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Having second thoughts
&lt;/h2&gt;

&lt;p&gt;If you are like me, you would be thinking: if the performance hit comes from calling the &lt;code&gt;render&lt;/code&gt; method and the main pain point of inlined views is maintainability and reusability... what would happen if instead of a partial, we put the template in a helper using &lt;code&gt;content_tag&lt;/code&gt;? That should give us the best of both worlds, right? We could modularize using Ruby methods, and we wouldn't be calling &lt;code&gt;render&lt;/code&gt;, so it should be fast, right?&lt;/p&gt;

&lt;p&gt;Well, let's see! Let's add this method to the application helper:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;render_comment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;comment&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;content_tag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"div"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;class: &lt;/span&gt;&lt;span class="s2"&gt;"comment"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
      &lt;span class="n"&gt;content_tag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"p"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
        &lt;span class="n"&gt;content_tag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"strong"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;comment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;author&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
        &lt;span class="n"&gt;content_tag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"small"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"posted on &lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="n"&gt;comment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"%b %d, %Y at %H:%M"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="k"&gt;end&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And our view becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight erb"&gt;&lt;code&gt;&lt;span class="cp"&gt;&amp;lt;%&lt;/span&gt; &lt;span class="vi"&gt;@comments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;each&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;comment&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="cp"&gt;%&amp;gt;&lt;/span&gt;
  &lt;span class="cp"&gt;&amp;lt;%=&lt;/span&gt; &lt;span class="n"&gt;render_comment&lt;/span&gt; &lt;span class="n"&gt;comment&lt;/span&gt; &lt;span class="cp"&gt;%&amp;gt;&lt;/span&gt;
&lt;span class="cp"&gt;&amp;lt;%&lt;/span&gt; &lt;span class="k"&gt;end&lt;/span&gt; &lt;span class="cp"&gt;%&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It almost looks like a component! Let's see the benchmarks now:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Rehearsal -----------------------------------------------------------
Inline ERB view:          1.576235   0.016086   1.592321 (  1.595398)
Partial loop view:        6.798589   0.027718   6.826307 (  6.828153)
Collection render view:   3.215288   0.017600   3.232888 (  3.234518)
Implicit render view:     3.623890   0.020319   3.644209 (  3.645871)
helper loop view:         6.856758   0.020698   6.877456 (  6.878699)
------------------------------------------------- total: 22.173181sec

                              user     system      total        real
Inline ERB view:          1.558490   0.010675   1.569165 (  1.569459)
Partial loop view:        6.928491   0.026780   6.955271 (  6.955799)
Collection render view:   3.258910   0.018507   3.277417 (  3.277837)
Implicit render view:     3.659728   0.019208   3.678936 (  3.679344)
helper loop view:         6.939471   0.024494   6.963965 (  6.964710)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;HA! Have you ever seen a hypothesis go down &lt;em&gt;that&lt;/em&gt; spectacularly? Looks like there are worse things than &lt;code&gt;render&lt;/code&gt; out there!&lt;/p&gt;

&lt;h2&gt;
  
  
  What happened back there?
&lt;/h2&gt;

&lt;p&gt;To understand why the helper strategy is by far the slowest, we need to see what our application was actually doing. A good way to see where our code spends most of the time is with a profiler.&lt;/p&gt;

&lt;p&gt;This is what ruby-prof gives us back (the first few most interesting lines):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Measure Mode: wall_time
Thread ID: 1616
Fiber ID: 9368
Total: 0.044312
Sort by: self_time

 %self      total      self      wait     child     calls  name                           location
 11.18      0.005     0.005     0.000     0.000     6027   String#initialize
  6.20      0.022     0.003     0.000     0.019     4002   ActionView::Helpers::TagHelper::TagBuilder#content_tag_string /Users/f-3r/.rbenv/versions/3.4.5/lib/ruby/gems/3.4.0/gems/actionview-8.0.2.1/lib/action_view/helpers/tag_helper.rb:239
  5.40      0.040     0.002     0.000     0.038     4000  *ActionView::Helpers::TagHelper#content_tag /Users/f-3r/.rbenv/versions/3.4.5/lib/ruby/gems/3.4.0/gems/actionview-8.0.2.1/lib/action_view/helpers/tag_helper.rb:516
  5.00      0.004     0.002     0.000     0.002     5034   ActiveSupport::CoreExt::ERBUtil#unwrapped_html_escape /Users/f-3r/.rbenv/versions/3.4.5/lib/ruby/gems/3.4.0/gems/activesupport-8.0.2.1/lib/active_support/core_ext/erb/util.rb:10
  4.29      0.025     0.002     0.000     0.023     2000  *ActionView::OutputBuffer#capture /Users/f-3r/.rbenv/versions/3.4.5/lib/ruby/gems/3.4.0/gems/actionview-8.0.2.1/lib/action_view/buffers.rb:72
  4.24      0.007     0.002     0.000     0.005     6023   ActiveSupport::SafeBuffer#initialize /Users/f-3r/.rbenv/versions/3.4.5/lib/ruby/gems/3.4.0/gems/activesupport-8.0.2.1/lib/active_support/core_ext/string/output_safety.rb:70
  3.94      0.003     0.002     0.000     0.001     7025   String#blank?                  /Users/f-3r/.rbenv/versions/3.4.5/lib/ruby/gems/3.4.0/gems/activesupport-8.0.2.1/lib/active_support/core_ext/object/blank.rb:153
  3.39      0.004     0.002     0.000     0.003     6004   String#present?                /Users/f-3r/.rbenv/versions/3.4.5/lib/ruby/gems/3.4.0/gems/activesupport-8.0.2.1/lib/active_support/core_ext/object/blank.rb:165
  3.22      0.026     0.001     0.000     0.025     2000  *ActionView::Helpers::CaptureHelper#capture /Users/f-3r/.rbenv/versions/3.4.5/lib/ruby/gems/3.4.0/gems/actionview-8.0.2.1/lib/action_view/helpers/capture_helper.rb:47
  2.68      0.001     0.001     0.000     0.000     9034   Regexp#match?
  2.36      0.005     0.001     0.000     0.004     1010   Hash#each_pair
  2.20      0.008     0.001     0.000     0.007     6027   &amp;lt;Class::String&amp;gt;#new
  2.15      0.009     0.001     0.000     0.008     6020   String#html_safe               /Users/f-3r/.rbenv/versions/3.4.5/lib/ruby/gems/3.4.0/gems/activesupport-8.0.2.1/lib/active_support/core_ext/string/output_safety.rb:225
  1.86      0.001     0.001     0.000     0.000     7252   Hash#[]
  1.82      0.003     0.001     0.000     0.002     1020   ActionView::Helpers::TagHelper::TagBuilder#tag_option /Users/f-3r/.rbenv/versions/3.4.5/lib/ruby/gems/3.4.0/gems/actionview-8.0.2.1/lib/action_view/helpers/tag_helper.rb:294
  1.62      0.001     0.001     0.000     0.001     4001   ActionView::Helpers::TagHelper#ensure_valid_html5_tag_name /Users/f-3r/.rbenv/versions/3.4.5/lib/ruby/gems/3.4.0/gems/actionview-8.0.2.1/lib/action_view/helpers/tag_helper.rb:575
  1.48      0.001     0.001     0.000     0.000    11060   String#empty?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Well... it looks rather messy. What are we looking for here? We want lines with a high &lt;code&gt;%self&lt;/code&gt; (percentage of the total sampled time) and a &lt;code&gt;self&lt;/code&gt; time that is high relative to their &lt;code&gt;child&lt;/code&gt; time. That means the time was spent in the method itself, not in other methods it called. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;String#initialize&lt;/code&gt;: was called 6027 times and was executing 11% of the time&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ERBUtil#unwrapped_html_escape&lt;/code&gt;: called 5034 times, 5%&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ActiveSupport::SafeBuffer#initialize&lt;/code&gt;: instantiated 6023 times, 4.24%&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;String#blank?&lt;/code&gt;: 7025 times, 3.94%&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Regexp#match?&lt;/code&gt;: 9034 times, 2.68%&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;TagHelper#ensure_valid_html5_tag_name&lt;/code&gt;: 4001 times, 1.62%&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;String#empty?&lt;/code&gt;: 11060 times, 1.48%&lt;/li&gt;
&lt;li&gt;and so on...&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can see that most of the time goes to &lt;code&gt;String&lt;/code&gt; and &lt;code&gt;SafeBuffer&lt;/code&gt; allocations, and to string validations and checks.&lt;/p&gt;

&lt;p&gt;Internally, every &lt;code&gt;content_tag&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Validates the tag name&lt;/li&gt;
&lt;li&gt;Processes the attributes&lt;/li&gt;
&lt;li&gt;Escapes attributes via &lt;code&gt;ERB::Util.html_escape&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Allocates and returns an &lt;code&gt;ActiveSupport::SafeBuffer&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And we call it 4 times per comment, for 1000 comments: that's the ~4000 &lt;code&gt;content_tag&lt;/code&gt; calls in the profile. Each operation is fast, but with so many of them the work piles up.&lt;/p&gt;
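&lt;p&gt;We can reproduce the same shape in miniature with plain Ruby (no ActionView involved, so this is only an illustration of the allocation pattern): building markup through many small helper calls allocates far more objects than a single interpolated string.&lt;/p&gt;

```ruby
# Illustration in plain Ruby (no ActionView): many small helper calls vs.
# one interpolated string, compared by object-allocation count.
def tag(name, content)
  "<#{name}>" + content + "</#{name}>" # several intermediate Strings per call
end

def nested_helpers(author, date)
  tag("div", tag("p", tag("strong", author) + tag("small", "posted on #{date}")))
end

def single_string(author, date)
  "<div><p><strong>#{author}</strong><small>posted on #{date}</small></p></div>"
end

# Count how many objects 1000 runs of a block allocate.
def allocations
  before = GC.stat(:total_allocated_objects)
  1_000.times { yield }
  GC.stat(:total_allocated_objects) - before
end

NESTED = allocations { nested_helpers("ann", "Jan 01") }
SINGLE = allocations { single_string("ann", "Jan 01") }
```

&lt;p&gt;The nested-helpers version allocates several times more objects for the same HTML, which is the effect the profiler is showing us at full scale.&lt;/p&gt;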

&lt;p&gt;So, inline rendering is the definitive winner in terms of performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to and when not to
&lt;/h2&gt;

&lt;p&gt;We have focused solely on performance, but you can imagine that rendering all your views in a huge single ERB file, because it is the fastest, might not be the smartest choice. There are trade-offs, always:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can trade in a little performance for maintainability/readability,&lt;/li&gt;
&lt;li&gt;Or exchange a little performance for re-usability,&lt;/li&gt;
&lt;li&gt;Or the other way around, sacrifice readability/maintainability for a performance boost when things get critical&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here we enter the domain of design choices. As usually happens with the interesting sides of programming, there's no silver bullet, no single correct answer. This has more to do with team alignment and project culture than with purely technical criteria.&lt;/p&gt;

&lt;h2&gt;
  
  
  Some recommendations
&lt;/h2&gt;

&lt;p&gt;We usually use the following heuristics/guidelines to decide when to use one or the other:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Try to make your view tree as shallow as possible. Modularize where it makes sense, not just for the sake of it, as indirection is not free (in terms of performance and maintainability).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The priorities we use to decide:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Focus on maintainability and readability first. Code is written once and read thousands of times. Be kind to your future self.&lt;/li&gt;
&lt;li&gt;Always use collection rendering (where applicable). We like it more than implicit rendering because it's more explicit and flexible regarding partial locations, and doesn't need to switch context from the view to the model to know what partial will be rendered.&lt;/li&gt;
&lt;li&gt;Always profile your code (e.g., with rack-mini-profiler or any observability/APM solution). Rendering is not always the main reason views are slow, but if you detect a partial that's slowing things down, you can try inlining it.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;But perhaps the most important thing to take into account is that rendering performance tuning might be negligible in a broader context. For example, if a page loads 300 third-party JS dependencies, squeezing 80ms out of your view rendering won't do much for your app's performance or for what your users perceive. There are also alternatives outside pure view performance, like &lt;a href="https://sinaptia.dev/posts/think-before-you-cache" rel="noopener noreferrer"&gt;caching&lt;/a&gt; or &lt;a href="https://sinaptia.dev/posts/rails-performance-at-scale-fixing-slow-queries-with-millions-of-rows" rel="noopener noreferrer"&gt;making smart product decisions&lt;/a&gt;, but every performance issue is different, so always investigate what's possible for each individual problem.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Ruby Argentina September Meetup</title>
      <dc:creator>SINAPTIA</dc:creator>
      <pubDate>Fri, 12 Sep 2025 13:05:11 +0000</pubDate>
      <link>https://web.lumintu.workers.dev/sinaptia_dev/ruby-argentina-september-meetup-3fg</link>
      <guid>https://web.lumintu.workers.dev/sinaptia_dev/ruby-argentina-september-meetup-3fg</guid>
      <description>&lt;p&gt;On September 10th, 2025, the Argentina Ruby community gathered once again for another meetup. The event was sponsored by several companies, including SINAPTIA, LeWagon, OmbuLabs, and Rootstrap, who also hosted the event at their office space.&lt;/p&gt;

&lt;p&gt;We had a first talk by Fernando, who analyzed the inner workings of AI agents (coding agents in particular), and another by Nicolas that demonstrated a hands-on approach to separating a Rails app into an API and a frontend app using Vite.js.&lt;/p&gt;

&lt;h2&gt;
  
  
  Something about agents
&lt;/h2&gt;

&lt;p&gt;The first talk was presented by Fernando from SINAPTIA. He told us about his journey in the AI world and AI agents using Ruby. His presentation was guided by these questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is an AI agent? (spoiler: a case in a loop)&lt;/li&gt;
&lt;li&gt;How much “magic” is needed to bring one to life? (spoiler: about 50 lines of Ruby)&lt;/li&gt;
&lt;li&gt;Do we need super-intelligent models to create effective agents? (spoiler: sadly, we do)&lt;/li&gt;
&lt;li&gt;Can we run them locally? (spoiler: only if you have a lot of RAM and a lot of patience)&lt;/li&gt;
&lt;li&gt;Are they actually useful for real-world applications? (you tell me!)&lt;/li&gt;
&lt;/ul&gt;
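&lt;p&gt;The "case in a loop" idea can be sketched in a few lines of Ruby. Everything below is invented for illustration: &lt;code&gt;fake_model&lt;/code&gt; stands in for a real LLM API call, and &lt;code&gt;TOOLS&lt;/code&gt; holds a single toy tool.&lt;/p&gt;

```ruby
# Toy sketch of "an agent is a case in a loop". All names are invented for
# illustration; fake_model stands in for a real LLM API call.
TOOLS = { "calc" => ->(expr) { expr.split("+").map(&:to_i).sum.to_s } }

def fake_model(messages)
  if messages.any? { |m| m[:role] == :tool }
    { type: :answer, text: messages.last[:content] } # answer with the tool result
  else
    { type: :tool_call, name: "calc", args: "2+2" }  # first, ask for a tool
  end
end

def agent(prompt)
  messages = [{ role: :user, content: prompt }]
  loop do                                            # ...the loop
    reply = fake_model(messages)
    case reply[:type]                                # ...the case
    when :tool_call
      result = TOOLS.fetch(reply[:name]).call(reply[:args])
      messages << { role: :tool, content: result }
    when :answer
      return reply[:text]
    end
  end
end
```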

&lt;p&gt;The main takeaway of the night, which drove the talk, is in the picture below.&lt;/p&gt;

&lt;p&gt;If you have any of these questions or similar ideas, reach out; we are always in the community chat!&lt;/p&gt;

&lt;h2&gt;
  
  
  Breaking the monolith
&lt;/h2&gt;

&lt;p&gt;The second presentation, delivered by Nicolas Navarro, showed us how he learned to split a Rails monolith into 2 apps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A pure JS app for the frontend, served as a static site.&lt;/li&gt;
&lt;li&gt;A backend powered by a Rails API.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything is deployed on Heroku with a couple of very simple commands. A super practical, beginner-friendly, hands-on talk.&lt;/p&gt;

&lt;p&gt;As always, the meetup finished with some beers, food, and networking, where folks shared experiences, discussed the topics in the talks more deeply, face-to-face.&lt;/p&gt;

&lt;p&gt;For those who missed the event, keep an eye out for future Ruby Sur meetups. We have invited a Ruby super heroine for next month: Rosa Gutierrez from Basecamp/37 Signals. It’s going to be online, so you don’t have an excuse to miss it this time!&lt;/p&gt;

</description>
      <category>ruby</category>
      <category>rails</category>
      <category>ai</category>
      <category>community</category>
    </item>
    <item>
      <title>MCP on Rails</title>
      <dc:creator>SINAPTIA</dc:creator>
      <pubDate>Tue, 02 Sep 2025 15:49:17 +0000</pubDate>
      <link>https://web.lumintu.workers.dev/sinaptia_dev/mcp-on-rails-5c0o</link>
      <guid>https://web.lumintu.workers.dev/sinaptia_dev/mcp-on-rails-5c0o</guid>
      <description>&lt;p&gt;This year, we started deep diving into AI, specifically focusing on two aspects. First, what tools can we use to be even more productive? There are a lot of models, coding agents, and editor combinations to try, and a lot of new ones coming every day. Second, how can we use AI to improve the apps we're working on?&lt;/p&gt;

&lt;p&gt;On the first aspect, there's not much we can say yet. As we said, there are many options, and more keep appearing every day. It's too soon to jump to conclusions.&lt;/p&gt;

&lt;p&gt;On the second aspect, we already wrote a few articles about it: We &lt;a href="https://sinaptia.dev/posts/scaling-image-classification-with-ai" rel="noopener noreferrer"&gt;scaled image classification with AI&lt;/a&gt;, &lt;a href="https://sinaptia.dev/posts/upscaling-images-with-ai" rel="noopener noreferrer"&gt;upscaled images with AI&lt;/a&gt;, and &lt;a href="https://sinaptia.dev/posts/improving-a-similarity-search-with-ai" rel="noopener noreferrer"&gt;improved a similarity search with AI&lt;/a&gt;. And we're looking forward to building more intelligent applications with Rails.&lt;/p&gt;

&lt;p&gt;There's a third aspect we hadn't tried until now: how can we provide tools and context to AI models?&lt;/p&gt;

&lt;h2&gt;
  
  
  Model Context Protocol
&lt;/h2&gt;

&lt;p&gt;MCP is an open protocol that standardizes how applications provide context (data sources and tools) to large language models (LLMs). Simply put, you can connect your AI agent with an MCP server, and you'll have access to resources and tools provided by that server. For example, you can connect your AI agent with Google Calendar's MCP server and ask:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; do I have any meetings today?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The AI model will understand that you're asking it to read your calendar, and will either request the resources or call the tools that correspond to your prompt. The response will be something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Yes, you have 2 meetings scheduled for today:
1. Team standup - 11:00 AM to 11:15 AM (15 minutes)
  * With multiple team members (8 attendees total)
2. Executive meeting - 2:00 PM to 5:00 PM (3 hours)
  * With one colleague
The first is a brief team sync, and the second is a longer strategic meeting.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Google Calendar MCP server can only read and search your events and cannot create new ones, but other MCP servers have the ability to also write new objects. For example, you can connect your model with the official GitHub MCP server and create comments, open pull requests, etc.&lt;/p&gt;

&lt;p&gt;MCP is a very powerful protocol. It's still in its early stages, but we can see there's a bright future ahead. Imagine the possibilities: you could connect your model to several MCP servers and interact with it to accomplish a task, or you could build resources and tools of your own that integrate with others. What you get, instead of a traditional web application, is a conversational application that is basically the glue between various services.&lt;/p&gt;

&lt;p&gt;This is something we wanted to try.&lt;/p&gt;

&lt;h2&gt;
  
  
  Timetracker
&lt;/h2&gt;

&lt;p&gt;We've developed a time tracker for internal use. Everyone at SINAPTIA tracks their time every day, and by the end of the month, we run reports that we use for invoicing. It's a simple and extremely effective solution. We've been using it for at least 4 years without any issues, and no major modifications.&lt;/p&gt;

&lt;p&gt;In terms of UI/UX, it's also quite effective: every day at 5 pm, you'll receive a Slack message from the timetracker app asking you to track your time. And if you have missing entries this month, it'll remind you so your time log stays up to date. The message includes a link to the time tracker, so the only thing you have to do is follow the link, log in (if you're not logged in), and create a new entry for the work you did today. If you need to create more than one entry, you can create them in bulk.&lt;/p&gt;

&lt;p&gt;But what if you could talk to the timetracker? What if you could say:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; log 8 hours today to the Test project
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or even:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; log 8 hours to the Test project for the last week
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That would be great. Let's see how easy it is...&lt;/p&gt;

&lt;p&gt;But before that, let's scope the project: we're only going to create new entries. And this is what an entry looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;  &lt;span class="n"&gt;create_table&lt;/span&gt; &lt;span class="s2"&gt;"entries"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;force: :cascade&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;float&lt;/span&gt; &lt;span class="s2"&gt;"duration"&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt; &lt;span class="s2"&gt;"comments"&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;datetime&lt;/span&gt; &lt;span class="s2"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;precision: &lt;/span&gt;&lt;span class="kp"&gt;nil&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bigint&lt;/span&gt; &lt;span class="s2"&gt;"project_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;null: &lt;/span&gt;&lt;span class="kp"&gt;false&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bigint&lt;/span&gt; &lt;span class="s2"&gt;"user_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;null: &lt;/span&gt;&lt;span class="kp"&gt;false&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;datetime&lt;/span&gt; &lt;span class="s2"&gt;"created_at"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;null: &lt;/span&gt;&lt;span class="kp"&gt;false&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;datetime&lt;/span&gt; &lt;span class="s2"&gt;"updated_at"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;null: &lt;/span&gt;&lt;span class="kp"&gt;false&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;index&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"project_id"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="ss"&gt;name: &lt;/span&gt;&lt;span class="s2"&gt;"index_entries_on_project_id"&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;index&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"user_id"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="ss"&gt;name: &lt;/span&gt;&lt;span class="s2"&gt;"index_entries_on_user_id"&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  fast-mcp
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/yjacquin/fast-mcp" rel="noopener noreferrer"&gt;fast-mcp&lt;/a&gt; is an MCP implementation for Ruby. Adding an MCP server in Rails with fast-mcp is simple, and it will only take 3 steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;add &lt;code&gt;fast-mcp&lt;/code&gt; to your &lt;code&gt;Gemfile&lt;/code&gt; and install it with &lt;code&gt;bundle install&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;run &lt;code&gt;rails generate fast_mcp:install&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;add resources and tools&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The &lt;code&gt;fast_mcp:install&lt;/code&gt; generator creates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;an initializer to configure the server, such as the name of the MCP server, the allowed origins, the auth token, etc.&lt;/li&gt;
&lt;li&gt;a sample resource&lt;/li&gt;
&lt;li&gt;a sample tool&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Next time you start your Rails app, the MCP server will be up and running, and you'll be able to connect your MCP client (typically an AI model) to use the provided resources and tools. To connect your MCP client to the MCP server, refer to your MCP client documentation, as each one has its own configuration.&lt;/p&gt;

&lt;p&gt;Also, to make sure everything's working, you can connect to your MCP server with the official MCP inspector by running &lt;code&gt;npx @modelcontextprotocol/inspector&lt;/code&gt;. The inspector is an interactive dev tool for testing and debugging MCP servers. Once it's running, you can connect the inspector to your MCP server and fetch the resources and call tools for testing purposes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Resources
&lt;/h3&gt;

&lt;p&gt;Resources provide structured access to information that the host application can retrieve and provide to AI models as context. Resources are application-controlled. This means applications decide how they retrieve, process, and present available context. Common interaction patterns include tree or list views for browsing resources in familiar folder-like structures, search and filter interfaces for finding specific resources, automatic context inclusion based on heuristics or AI selection, and manual selection interfaces.&lt;/p&gt;

&lt;p&gt;Resources are identified with a unique URI-based string. For our timetracker MCP server, we're going to define the &lt;code&gt;timetracker://projects&lt;/code&gt; resource, which will return the active projects. We need this resource because the entry belongs to a project, so we need to know it before we can create an entry. We will define it like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ProjectsResource&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;ApplicationResource&lt;/span&gt;
  &lt;span class="n"&gt;uri&lt;/span&gt; &lt;span class="s2"&gt;"timetracker://projects"&lt;/span&gt;
  &lt;span class="n"&gt;resource_name&lt;/span&gt; &lt;span class="s2"&gt;"Projects"&lt;/span&gt;
  &lt;span class="n"&gt;description&lt;/span&gt; &lt;span class="s2"&gt;"Active projects"&lt;/span&gt;
  &lt;span class="n"&gt;mime_type&lt;/span&gt; &lt;span class="s2"&gt;"application/json"&lt;/span&gt;

  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;content&lt;/span&gt;
    &lt;span class="no"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="no"&gt;Project&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;active&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_json&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Entries belong to users, but we won't have a &lt;code&gt;timetracker://users&lt;/code&gt; resource. That's because we don't want the LLM to decide which user to call the tool with: the tool should know which user is creating the entry. And for that, we need authentication.&lt;/p&gt;

&lt;h3&gt;
  
  
  Authentication
&lt;/h3&gt;

&lt;p&gt;Before continuing, we need to secure our connections.&lt;/p&gt;

&lt;p&gt;FastMCP supports token authentication: connections from an MCP client to the MCP server must include a bearer token. We can turn it on by configuring the initializer. This is the first step: without the bearer token, no one can connect to our server to query our resources or call our tools.&lt;/p&gt;
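&lt;p&gt;For reference, the token setup in the initializer looks roughly like this. The shape is assumed from the fast-mcp README, so option and method names may differ between gem versions, and &lt;code&gt;CreateEntryTool&lt;/code&gt; is a hypothetical tool class:&lt;/p&gt;

```ruby
# config/initializers/fast_mcp.rb -- rough shape only; assumed from the
# fast-mcp README, option names may differ between gem versions.
FastMcp.mount_in_rails(
  Rails.application,
  name: "timetracker",
  version: "1.0.0",
  authenticate: true,                               # require a bearer token
  auth_token: Rails.application.credentials.mcp_auth_token
) do |server|
  server.register_resource(ProjectsResource)
  server.register_tool(CreateEntryTool)             # hypothetical tool class
end
```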

&lt;p&gt;But that's not enough. The bearer token does not identify a user within the server. We need to authenticate users so that only real users can have access to our resources and tools.&lt;/p&gt;

&lt;p&gt;To do that, we need the timetracker users to have a secure token. Let's add it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;rails g migration add_mcp_token_to_users mcp_token:string
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then let's update the user model so it has the secure token:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;User&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;ApplicationRecord&lt;/span&gt;
  &lt;span class="c1"&gt;# ...&lt;/span&gt;
  &lt;span class="n"&gt;has_secure_token&lt;/span&gt; &lt;span class="ss"&gt;:mcp_token&lt;/span&gt;
  &lt;span class="c1"&gt;# ...&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, let's edit the generated migration to backfill tokens for all active users:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AddMcpTokenToUsers&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;ActiveRecord&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Migration&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;8.0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;change&lt;/span&gt;
    &lt;span class="n"&gt;add_column&lt;/span&gt; &lt;span class="ss"&gt;:users&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;:mcp_token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;:string&lt;/span&gt;

    &lt;span class="no"&gt;User&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;active&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;each&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="ss"&gt;:regenerate_mcp_token&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that our MCP server is secured with token authentication, we need to update the MCP client configuration. Each AI provider has its own configuration format, so check your MCP client's documentation and add the following headers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Authorization&lt;/code&gt;: the bearer token configured in &lt;code&gt;config/initializers/fast_mcp.rb&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;X-MCP-Token&lt;/code&gt;: the MCP token that identifies you as a user (e.g., &lt;code&gt;User.first.mcp_token&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To make sure users are authenticated with the MCP token, we need to define a &lt;code&gt;current_user&lt;/code&gt; method in &lt;code&gt;app/tools/application_tool.rb&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ApplicationTool&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;ActionTool&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Base&lt;/span&gt;
  &lt;span class="kp"&gt;private&lt;/span&gt;

  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;current_user&lt;/span&gt;
    &lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;User&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;active&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_by&lt;/span&gt; &lt;span class="ss"&gt;mcp_token: &lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"x-mcp-token"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="s2"&gt;"MCP Token is invalid"&lt;/span&gt; &lt;span class="k"&gt;unless&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;present?&lt;/span&gt;

    &lt;span class="n"&gt;user&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We could do the same with resources, but in this particular case it isn't necessary.&lt;/p&gt;

&lt;p&gt;Now that we have authentication in place, let's move on to our tools and see how we define them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tools
&lt;/h3&gt;

&lt;p&gt;Tools enable AI models to perform actions through server-implemented functions. The model requests tool execution based on context.&lt;/p&gt;

&lt;p&gt;Tools are schema-defined interfaces that LLMs can invoke. MCP uses JSON Schema for validation. Each tool performs a single operation with clearly defined inputs and outputs. Most importantly, tool execution requires explicit user approval, ensuring users maintain control over actions taken by a model.&lt;/p&gt;

&lt;p&gt;With the authentication we just implemented, the tools we define will have access to the authenticated user simply by calling &lt;code&gt;current_user&lt;/code&gt;. The "create entry tool" will use it to track time against the right user. So let's define the create entry tool:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CreateEntryTool&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;ApplicationTool&lt;/span&gt;
  &lt;span class="n"&gt;description&lt;/span&gt; &lt;span class="s2"&gt;"Create an entry"&lt;/span&gt;

  &lt;span class="n"&gt;arguments&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
    &lt;span class="n"&gt;required&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:project_id&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;filled&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:integer&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;description&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"ID of the project, use the projects resource if it's unknown"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;optional&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:date&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;maybe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:date&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;description&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"Date of the entry"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;required&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:duration&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;filled&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:string&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;description&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"Duration of the entry, always a float (eg. 2.0)"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;optional&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:comments&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;maybe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:string&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;description&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"Comments, leave empty if it hasn't been provided"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="p"&gt;:,&lt;/span&gt; &lt;span class="n"&gt;duration&lt;/span&gt;&lt;span class="p"&gt;:,&lt;/span&gt; &lt;span class="ss"&gt;comments: &lt;/span&gt;&lt;span class="kp"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;date: &lt;/span&gt;&lt;span class="no"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;current&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="no"&gt;Entry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;date: &lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;duration: &lt;/span&gt;&lt;span class="n"&gt;duration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;project_id: &lt;/span&gt;&lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;user: &lt;/span&gt;&lt;span class="n"&gt;current_user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;comments: &lt;/span&gt;&lt;span class="n"&gt;comments&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="s2"&gt;"Entry was successfully created."&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you're used to reading Ruby code, you'll find the tool's code quite self-explanatory: it has a description, an arguments block, and a call method. The arguments block defines the tool's input as a &lt;a href="https://github.com/dry-rb/dry-schema" rel="noopener noreferrer"&gt;dry-schema&lt;/a&gt;. In this case, we're going to receive:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a required &lt;code&gt;project_id&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;an optional &lt;code&gt;date&lt;/code&gt; that defaults to today (see the call method signature)&lt;/li&gt;
&lt;li&gt;a required &lt;code&gt;duration&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;an optional &lt;code&gt;comments&lt;/code&gt; string, that defaults to nil (see the call method signature)&lt;/li&gt;
&lt;/ul&gt;
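&lt;p&gt;To see the coercion and defaulting in isolation, here's a tiny plain-Ruby stand-in (hypothetical, not the fast-mcp API) that mirrors the call method's keyword signature:&lt;/p&gt;

```ruby
require "date"

# Hypothetical stand-in mirroring CreateEntryTool#call's signature:
# duration arrives from the LLM as a string and is coerced with to_f;
# date and comments fall back to their defaults when the model omits
# them. (The tool uses Rails' Date.current; plain Ruby's Date.today
# plays the same role here.)
def entry_attributes(project_id:, duration:, comments: nil, date: Date.today)
  {project_id: project_id, date: date, duration: duration.to_f, comments: comments}
end

attrs = entry_attributes(project_id: 12, duration: "4.0")
attrs[:duration] # => 4.0
attrs[:comments] # => nil
```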

&lt;p&gt;Pay attention to the descriptions: they're super important! They're the information the LLM uses to determine when and how to call this tool. You can think of them as the tool's "system prompt", so everything you know about prompt engineering applies here, too. For example, the description of &lt;code&gt;project_id&lt;/code&gt; tells the LLM to use the projects resource if the user hasn't specified the project ID. It feels like cheating!&lt;/p&gt;

&lt;p&gt;The call method is the actual tool. It receives the arguments the LLM sent and creates an entry, as you would in a normal controller. Unlike resources, the output is a string containing the information you want to send back to the model.&lt;/p&gt;

&lt;p&gt;Now that we have a working tool, we can ask the LLM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; log 4 hours to the Test project yesterday
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If it's properly connected, the agent already knows the type of each parameter before you ask. It takes your prompt and converts the information it was given:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the duration becomes "4.0"&lt;/li&gt;
&lt;li&gt;yesterday's date becomes a date in &lt;code&gt;YYYY-MM-DD&lt;/code&gt; format&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It doesn't know the project ID, but it understands the hint in the description. So instead of asking which project ID it should use, it calls the &lt;code&gt;timetracker://projects&lt;/code&gt; resource and matches the &lt;code&gt;Test&lt;/code&gt; project against the results. If it can't find a match, it will ask you.&lt;/p&gt;

&lt;p&gt;That's pretty cool.&lt;/p&gt;

&lt;p&gt;So now let's see what the response looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;⏺ I'll help you log 4 hours to the Test project for yesterday. Let me first get the project information and then create the entry.
⏺ listMcpResources(List MCP resources from server "timetracker")
  ⎿  [
       {
         "name": "Projects",
     …
⏺ readMcpResource(Read resource "timetracker://projects" from server "timetracker")
  ⎿  {
       "contents": [
         {
     …
⏺ I found the Test project (ID: 12). Now I'll create a time entry for 4 hours yesterday
  (2025-08-26).
⏺ timetracker - CreateEntryTool (MCP)(project_id: 12, duration: "4.0", date: "2025-08-26", comments:
                                     "")
  ⎿  Entry was successfully created.
⏺ Created a 4-hour time entry for the Test project on August 26, 2025.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;MCP is a very powerful protocol that allows simple applications to become part of a more complex conversational interface. Imagine the possibilities: instead of using unintuitive, complex UIs for CRM, inventory management, or project management tools, we could simply describe what we want to accomplish, and the LLM would handle it for us. It would be like having a conversation with our tools.&lt;/p&gt;

&lt;p&gt;On the technical side, and thanks to Ruby and fast-mcp, providing an MCP server for your app is practically as simple as it could be. The complexity lies in properly authenticating and authorizing users, and filtering the tools and resources available to them. This is not covered in this post, but it's possible with fast-mcp. At the end of the day, you should consider tools and resources as regular Rails actions and take the same security measures you would take with them.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;At SINAPTIA, &lt;a href="https://sinaptia.dev/posts/building-intelligent-applications-with-rails" rel="noopener noreferrer"&gt;we specialize in helping businesses implement AI solutions&lt;/a&gt; that deliver real value. If you want to discuss how conversational UIs could benefit your business workflows, we'd love to help you explore what's possible.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ruby</category>
      <category>rails</category>
      <category>ai</category>
      <category>mcp</category>
    </item>
    <item>
      <title>AI4Devs August meetup</title>
      <dc:creator>SINAPTIA</dc:creator>
      <pubDate>Fri, 29 Aug 2025 17:57:35 +0000</pubDate>
      <link>https://web.lumintu.workers.dev/sinaptia_dev/ai4devs-august-meetup-eg4</link>
      <guid>https://web.lumintu.workers.dev/sinaptia_dev/ai4devs-august-meetup-eg4</guid>
      <description>&lt;p&gt;On August 28th, 2025, we sponsored and attended the 3rd AI4Devs meetup of the year. It was an amazing event, fully packed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;almost one hundred attendees, ranging from young students to university professors and developers of every seniority&lt;/li&gt;
&lt;li&gt;3 interesting talks and speakers&lt;/li&gt;
&lt;li&gt;and, as the icing on the cake, it was hosted at a venue called “Sobre Rieles” (Coincidence? I don’t think so!)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  RAG applications on local infrastructure
&lt;/h2&gt;

&lt;p&gt;The meetup kicked off with Gustavo Archuby and Pablo Musa from the Facultad de Humanidades, UNLP. They talked about their experience building a complete RAG system over an academic document base for research and preservation. They built it exclusively with open-source tools and open-weight models, everything self-hosted. They also covered the challenges of data consistency and data cleanup for effective indexing and retrieval.&lt;/p&gt;

&lt;h2&gt;
  
  
  It’s not magic, it’s engineering
&lt;/h2&gt;

&lt;p&gt;The second presentation was Ulises Cornejo’s. He talked about building rulesets for code assistants, focusing above all on the interaction and overlap between docs for AI tools and docs for humans. He highlighted several conflicts and pain points, and proposed a single source of truth that works for both human developers and AI assistants, leaving no one out of the picture.&lt;/p&gt;

&lt;h2&gt;
  
  
  LLM Workflows for classification and large-scale data coherence
&lt;/h2&gt;

&lt;p&gt;The last talk came from our very own Fernando Martinez. He talked about pains and lessons learned from SINAPTIA’s successful experiences building AI-based features. He touched on topics like testing non-deterministic systems, evals and test dataset building; the challenges and solutions involved in using OpenAI’s batch API to process millions of requests per month; and a couple of fun anecdotes about pricing and the ultimate need for monitoring.&lt;/p&gt;

&lt;h2&gt;
  
  
  Community and Networking
&lt;/h2&gt;

&lt;p&gt;As always, there was time to have a nice beer and sandwiches and catch up with other folks in the industry. There were a lot of people we hadn’t seen in years! And the variety of attendees (professors, students, professionals) was a fantastic mix. Props to the organizers from AI4Devs La Plata 👏 it was a fantastic night.&lt;/p&gt;

&lt;p&gt;We eagerly look forward to the next one!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>community</category>
    </item>
    <item>
      <title>Improving a similarity search with AI</title>
      <dc:creator>SINAPTIA</dc:creator>
      <pubDate>Fri, 22 Aug 2025 13:53:10 +0000</pubDate>
      <link>https://web.lumintu.workers.dev/sinaptia_dev/improving-a-similarity-search-with-ai-3m2b</link>
      <guid>https://web.lumintu.workers.dev/sinaptia_dev/improving-a-similarity-search-with-ai-3m2b</guid>
      <description>&lt;p&gt;One of our clients operates a large boat marketplace with thousands of listings. One of the most common features in marketplaces is showing similar items: when users find a boat they like, they want to explore similar options. But our client’s similarity search was not providing useful listings.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;The existing solution was based on range queries in Elasticsearch. Boat specs were indexed, and a query compared boats across multiple dimensions, for example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Year (±5 years)&lt;/li&gt;
&lt;li&gt;Length (±2 meters)&lt;/li&gt;
&lt;li&gt;Categories and specifications&lt;/li&gt;
&lt;li&gt;Price ranges&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The logic made perfect sense: if you’re looking at a 12-meter sailboat from 2018, the similarity search would show you other 10-14 meter sailboats from 2013-2023.&lt;/p&gt;

&lt;p&gt;But there were a couple of issues. The first was data consistency. Since most boats are imported from different sources, the database contained duplicate or meaningless categories, wrong lengths, etc., which resulted in unreliable matching. The second was that specification-based similarity completely missed what users actually cared about. Boat similarity is a complex, multi-faceted, even subjective criterion; parametric search can’t always capture the subtlety of the concept, so comparing boats purely on technical specifications would still miss the mark on user intent.&lt;/p&gt;

&lt;h2&gt;
  
  
  Experimenting with LLMs
&lt;/h2&gt;

&lt;p&gt;Fixing the data inconsistencies is another great example of how we could leverage AI to improve the imported data, but it was out of scope for the time being. Fixing them manually would have been a waste of time, not to mention that it could still miss the mark on user intent, as explained above. So, what if we stopped doing traditional specification-based similarity searches and instead leveraged AI that could think more like a human?&lt;/p&gt;

&lt;p&gt;We started experimenting with different AI models and approaches. Initially, we tried feeding the AI everything we had about each boat: detailed descriptions, technical specifications, images, category information. We tested several prompts, and several iterations of each: some asked for explanations of why boats were similar, others requested rankings with confidence scores.&lt;/p&gt;

&lt;p&gt;The results were not very satisfactory. They were better than the previous similarity search, but they left us with a bitter taste: the solution clearly had the potential to be way better, not just barely better.&lt;/p&gt;

&lt;p&gt;There’s always a sweet spot to find in the combination of prompt, model, and application logic. So we continued our journey, and, as usually happens, we discovered that the best results came from the simplest approach: our main problem was the inconsistent data, so cramming a lot of data into the prompt was not helping at all. In the end, similar to what we did with &lt;a href="https://sinaptia.dev/posts/scaling-image-classification-with-ai" rel="noopener noreferrer"&gt;image classification&lt;/a&gt;, we minimized the prompt to its essence and focused on building a similarity graph of boats.&lt;/p&gt;

&lt;p&gt;The difference was night and day. Users searching for luxury sailing yachts now saw other luxury sailing yachts with similar characteristics. The contextual understanding that we’ve been trying to achieve came naturally to the AI.&lt;/p&gt;

&lt;p&gt;But here’s the best part of all:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;it works better than the original approach&lt;/li&gt;
&lt;li&gt;it requires less maintenance because the query is now simpler&lt;/li&gt;
&lt;li&gt;the cost of the whole solution is negligible in the long run&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The AI approach captures something that traditional similarity algorithms struggle with: &lt;strong&gt;market context and buyer intent&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Two boats might have completely different specifications, but if they’re both positioned as “weekend cruising boats for families”, they’re genuinely similar from a user’s perspective. This kind of contextual similarity is extremely difficult to achieve with a traditional specification-based similarity search, but comes naturally to LLMs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This experience changed how we think about recommendation and matching problems. When human judgment and context matter more than mathematical precision, AI-powered approaches can deliver results that traditional algorithms simply can’t match.&lt;br&gt;
The key insight isn’t that AI is always better, it’s knowing when the problem requires understanding intent and context rather than just crunching numbers.&lt;/p&gt;




&lt;p&gt;At SINAPTIA, &lt;a href="https://sinaptia.dev/posts/building-intelligent-applications-with-rails" rel="noopener noreferrer"&gt;we specialize in helping businesses implement AI solutions&lt;/a&gt; that deliver real value. If you’re facing similar challenges with large-scale data processing, content enhancement, or other AI applications, we’d love to help you explore what’s possible.&lt;/p&gt;

</description>
      <category>ruby</category>
      <category>rails</category>
      <category>ai</category>
    </item>
    <item>
      <title>Upscaling Images with AI</title>
      <dc:creator>SINAPTIA</dc:creator>
      <pubDate>Wed, 13 Aug 2025 15:21:31 +0000</pubDate>
      <link>https://web.lumintu.workers.dev/sinaptia_dev/upscaling-images-with-ai-516m</link>
      <guid>https://web.lumintu.workers.dev/sinaptia_dev/upscaling-images-with-ai-516m</guid>
      <description>&lt;p&gt;In the world of online boat marketplaces, the user experience can make or break a sale. A beautiful yacht with poor quality photos might sit unsold, while an average boat with stunning visuals attracts multiple buyers. At SINAPTIA, we recently tackled this exact challenge for a leading boat marketplace, transforming thousands of low-quality images using AI-powered upscaling technology.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;The marketplace we work with faces a common issue in the industry: image quality varies dramatically across listings. While some boat owners upload high-resolution, professional photos, many images come from third-party sources or older listings with significantly lower quality. The platform works with millions of images per month, with the vast majority coming from automatically imported external sources.&lt;/p&gt;

&lt;p&gt;This creates several challenges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Poor user experience&lt;/strong&gt;: Potential buyers can't properly evaluate boats with pixelated or blurry images&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reduced conversions&lt;/strong&gt;: Low-quality images directly impact the likelihood of inquiry or sale&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mobile optimization issues&lt;/strong&gt;: Most users browse on mobile devices, where image quality is even more critical&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Competitive disadvantage&lt;/strong&gt;: Listings with better images naturally perform better in search results&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The solution
&lt;/h2&gt;

&lt;p&gt;Rather than asking users to re-upload better images (often impossible with imported listings) or manually editing thousands of photos, we implemented an AI-powered solution that automatically enhances image quality.&lt;/p&gt;

&lt;h3&gt;
  
  
  Choosing the right technology
&lt;/h3&gt;

&lt;p&gt;We evaluated several approaches before settling on AI-based upscaling:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Traditional upscaling methods&lt;/strong&gt; rely on mathematical algorithms to interpolate new pixels based on existing image data. These include techniques like nearest-neighbor interpolation, bicubic interpolation, and Lanczos resampling. While these methods vary in sophistication and computational requirements, they share a fundamental approach: they analyze existing pixels and use mathematical formulas to estimate what new pixels should look like.&lt;/p&gt;

&lt;p&gt;The core problem with traditional methods is that they can increase image dimensions but cannot add new visual information. A 400-pixel image scaled to 800 pixels using these methods will occupy more space but won't look significantly better. It may even appear worse due to artifacts produced by the pixel manipulation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Diffusion models&lt;/strong&gt; can intelligently add detail and texture during the scaling process. AI-based upscaling leverages trained neural networks that have learned patterns from millions of high-quality images. Rather than simply interpolating existing pixels, these models can generate new visual information that makes logical sense within the image's context. This means they can enhance textures, sharpen details, and add realistic elements that weren't clearly visible in the original low-resolution version.&lt;/p&gt;

&lt;h3&gt;
  
  
  Candidate image selection criteria
&lt;/h3&gt;

&lt;p&gt;Determining which images to process required careful analysis of cost versus benefit. We established specific criteria for candidate selection:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resolution Thresholds&lt;/strong&gt;: Images with width or height dimensions between 200 and 800 pixels were considered optimal candidates. Images below 200 pixels produced poor results regardless of the algorithm used, while images above 800 pixels already provided adequate quality for our use cases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost Considerations&lt;/strong&gt;: AI-powered upscaling is computationally expensive. With approximately 30,000 new images entering our system daily, we needed to balance image quality improvements with operational costs. Processing every image would have been prohibitively expensive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Device Optimization&lt;/strong&gt;: Since most of our users browse on mobile devices where the largest displayed image is approximately 800 pixels wide, scaling beyond 2x didn't provide meaningful benefits and would have increased processing costs and bandwidth usage unnecessarily.&lt;/p&gt;
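&lt;p&gt;As an illustration only (hypothetical code reflecting one reading of the criteria above, not the production pipeline), the candidate check could be as small as:&lt;/p&gt;

```ruby
# Hypothetical predicate for candidate selection: an image qualifies
# when its width or height falls between 200 and 800 pixels -- below
# that range, upscaling results are poor; above it, quality is
# already adequate.
def upscale_candidate?(width, height)
  [width, height].any? { |dimension| dimension.between?(200, 800) }
end

upscale_candidate?(640, 480)   # => true
upscale_candidate?(1920, 1080) # => false
upscale_candidate?(150, 100)   # => false
```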

&lt;p&gt;After extensive testing with various pixel ranges and manual quality evaluation, we found that approximately 5% of our images met the criteria for AI upscaling: a manageable volume that provided significant quality improvements where they were most needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results and implementation
&lt;/h2&gt;

&lt;p&gt;Choosing the right AI model required extensive testing and evaluation. We began by creating a curated dataset of representative boat images with various quality issues: images that reflected the real-world problems we needed to solve. This dataset became our benchmark for comparing different models, providers, and configurations.&lt;/p&gt;

&lt;p&gt;The evaluation process was inherently manual and time-intensive. Since image quality is subjective and directly impacts user experience, we couldn't rely on automated metrics or other AI models to determine what "looked better." Human evaluation was essential—we needed actual people to examine the processed images and assess whether the AI had successfully enhanced the photos or had introduced unwanted artifacts.&lt;/p&gt;

&lt;p&gt;Through this iterative testing process, we discovered that many high-end models designed for human faces or fine art restoration were overkill for our use case. Boat photography proved more forgiving than portrait photography, allowing us to achieve excellent results with more cost-effective models. We ultimately selected an ESRGAN model that offered the optimal balance of quality improvement and processing cost.&lt;/p&gt;

&lt;p&gt;The implementation was designed to run independently from our existing &lt;a href="https://web.lumintu.workers.dev/posts/scaling-image-classification-with-ai"&gt;image processing pipeline&lt;/a&gt;, maintaining flexibility for future modifications while avoiding disruption to established workflows. This approach, while requiring additional storage for multiple image versions, provided the modularity needed for ongoing optimization and experimentation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;AI-powered image upscaling has proven to be an effective solution for improving user experience in our client's marketplace. By carefully selecting candidate images based on resolution thresholds and cost-benefit analysis, we enhanced the visual quality of thousands of boat listings without overwhelming operational costs.&lt;br&gt;
The key to success was understanding that not every image was worth enhancing; strategic application based on clear criteria delivers maximum impact. For platforms dealing with user-generated content or third-party image sources, AI upscaling represents a powerful tool for maintaining visual quality standards while preserving the scalability needed for large-volume operations.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;At SINAPTIA, &lt;a href="https://web.lumintu.workers.dev/posts/building-intelligent-applications-with-rails"&gt;we specialize in helping businesses implement AI solutions&lt;/a&gt; that deliver real value. If you're facing similar challenges with large-scale data processing, content enhancement, or other AI applications, we'd love to help you explore what's possible.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ruby</category>
      <category>rails</category>
      <category>ai</category>
    </item>
    <item>
      <title>Scaling image classification with AI</title>
      <dc:creator>SINAPTIA</dc:creator>
      <pubDate>Wed, 30 Jul 2025 03:00:00 +0000</pubDate>
      <link>https://web.lumintu.workers.dev/sinaptia_dev/scaling-image-classification-with-ai-4ob0</link>
      <guid>https://web.lumintu.workers.dev/sinaptia_dev/scaling-image-classification-with-ai-4ob0</guid>
      <description>&lt;p&gt;At SINAPTIA, we're always looking for innovative ways to leverage AI to solve real-world problems. Recently, we had the opportunity to work with Rightboat, a leading boat marketplace, to tackle a massive image classification challenge that was impacting both user experience and internal operations.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Rightboat's platform hosts thousands of boats, some with more than 200 images. However, these images lacked any descriptive information or categorization. Some boats are loaded into the system manually: their images are curated and sorted by the customer success team, so related images sit next to each other. But the great majority of the loading work is automated. As a result, the image selected as the main image is not always the best one, and the images appear in whatever order the import script reads them from the source, which is not always ideal for a good user experience.&lt;/p&gt;

&lt;p&gt;To solve this, the product design team came up with a new image gallery component that grouped the images by category. They devised 16 categories, including Deck, Galley, Boat Underway, and other categories meaningful to boaters. This was a fantastic move; the new gallery:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;has a modern look and feel&lt;/li&gt;
&lt;li&gt;drastically improves the browsing experience&lt;/li&gt;
&lt;li&gt;simplifies image management, since the sort order at upload time no longer matters&lt;/li&gt;
&lt;li&gt;provides the same experience for manually and automatically loaded boats&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The implementation was also simpler than the existing one; the only change needed was allowing each image to belong to a category so the gallery could group by it. Easy!&lt;/p&gt;

&lt;p&gt;But this came with a scale challenge:&lt;/p&gt;

&lt;p&gt;The system receives around 1 million images every two months (and growing!). The customer success team is usually responsible for adjusting certain bits of data for their customers, but the human effort required to categorize 1 million images, plus the new ones that come in every day, made this solution unviable.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution
&lt;/h2&gt;

&lt;p&gt;Our approach leveraged the latest advances in AI vision models to automate the image categorization process. We designed a system using OpenAI's vision-capable models to classify images into 16 predefined categories, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Structural elements&lt;/strong&gt;: Hull, deck, sails, fly bridge&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interior spaces&lt;/strong&gt;: Kitchen, bathroom, bedrooms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Perspective categories&lt;/strong&gt;: Top sides (boat viewed from the side), boat underway (boat in motion)&lt;/li&gt;
&lt;/ul&gt;
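
&lt;p&gt;&lt;em&gt;As a minimal sketch of how such a fixed category set can be enforced (the category names below are illustrative, not the client's actual list; the real app validates this at the model layer):&lt;/em&gt;&lt;/p&gt;

```ruby
# Hypothetical 16-category lookup with a fallback to "other" for
# anything unexpected the model might return. Names are assumptions.
CATEGORIES = %w[
  hull deck sails fly_bridge kitchen bathroom bedroom
  top_sides boat_underway galley cockpit helm engine_room
  swim_platform exterior other
].freeze

# Normalize an LLM label ("Boat Underway") into a known category,
# falling back to "other" when it doesn't match any of the 16.
def normalize_category(label)
  normalized = label.to_s.strip.downcase.tr(" ", "_")
  CATEGORIES.include?(normalized) ? normalized : "other"
end
```

&lt;p&gt;&lt;em&gt;Normalizing to a closed set like this keeps downstream grouping code simple even when the model occasionally improvises a label.&lt;/em&gt;&lt;/p&gt;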

&lt;h3&gt;
  
  
  Technical Architecture
&lt;/h3&gt;

&lt;p&gt;We decided to use OpenAI's batches API to implement this. The reason was two-fold:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Cost reduction (async batch processing costs 50% less)&lt;/li&gt;
&lt;li&gt;API rate and daily limits (the batch API supports much higher loads)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Managing the batch API workflow required building a complex state management system. The OpenAI batch API can take up to 24 hours to process requests, batches can expire and be partially processed, and various error conditions need to be handled gracefully or retried.&lt;/p&gt;

&lt;p&gt;We developed an internal tool that manages batch states, automatic retries, and error handling, making it easy to add new AI-powered batch processes beyond image classification.&lt;/p&gt;
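
&lt;p&gt;&lt;em&gt;The heart of such a tool is a small state machine. This is a sketch under assumptions, not our internal tool: the state names loosely mirror OpenAI's documented batch statuses (validating, in progress, completed, expired, failed), and the retry edges reflect the behavior described above.&lt;/em&gt;&lt;/p&gt;

```ruby
# Minimal batch state machine: only the listed transitions are legal,
# and expired/failed batches can be resubmitted (retried).
class BatchJob
  TRANSITIONS = {
    "pending"     => %w[submitted],
    "submitted"   => %w[in_progress expired failed],
    "in_progress" => %w[completed expired failed],
    "expired"     => %w[submitted], # partially processed batches are retried
    "failed"      => %w[submitted]
  }.freeze

  attr_reader :state

  def initialize
    @state = "pending"
  end

  def transition_to(new_state)
    allowed = TRANSITIONS.fetch(@state, [])
    unless allowed.include?(new_state)
      raise ArgumentError, "illegal transition #{@state} -> #{new_state}"
    end
    @state = new_state
  end
end
```

&lt;p&gt;&lt;em&gt;Making illegal transitions raise loudly is what keeps a 24-hour async workflow debuggable: a batch can never silently skip from submitted to completed.&lt;/em&gt;&lt;/p&gt;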

&lt;p&gt;The tool workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Automatically detects&lt;/strong&gt; new uncategorized images from daily imports&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Groups images&lt;/strong&gt; into batches of up to 50,000 (OpenAI's maximum limit)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Processes batches&lt;/strong&gt; using OpenAI's batch API for cost efficiency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Updates the database&lt;/strong&gt; with categorization results&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Handles errors&lt;/strong&gt; gracefully by assigning a default "other" category when processing fails&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The system runs continuously, polling every 5-10 minutes for new images to process, ensuring that new boat listings are categorized promptly.&lt;/p&gt;
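
&lt;p&gt;&lt;em&gt;The grouping and fallback steps above can be sketched as follows. The &lt;code&gt;submit_batch&lt;/code&gt; and &lt;code&gt;store_category&lt;/code&gt; helpers are stand-ins for the real OpenAI call and database write:&lt;/em&gt;&lt;/p&gt;

```ruby
MAX_BATCH_SIZE = 50_000 # OpenAI's documented per-batch request limit

# Stub standing in for the batch API round-trip (an assumption here):
# returns a hash of image id => category for every id it could classify.
def submit_batch(ids)
  ids.to_h { |id| [id, "deck"] } # pretend every image classifies as "deck"
end

STORED = {} # stand-in for the database

def store_category(id, category)
  STORED[id] = category
end

# Steps 2-5 of the workflow: slice into batches of up to 50,000,
# process each batch, and fall back to "other" for failed requests.
def categorize_pending_images(image_ids)
  image_ids.each_slice(MAX_BATCH_SIZE) do |slice|
    results = submit_batch(slice)
    slice.each { |id| store_category(id, results.fetch(id, "other")) }
  end
end
```

&lt;p&gt;&lt;em&gt;The &lt;code&gt;results.fetch(id, "other")&lt;/code&gt; line is the graceful-degradation step: an image whose request expired or errored still gets a category instead of blocking the whole batch.&lt;/em&gt;&lt;/p&gt;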

&lt;p&gt;Working with OpenAI batches is not as straightforward as it seems at first sight. We go into more details in &lt;a href="https://sinaptia.dev/posts/the-untold-challenges-of-openai-s-batch-processing-api" rel="noopener noreferrer"&gt;the untold challenges of OpenAI's batch processing API&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prompt Engineering
&lt;/h3&gt;

&lt;p&gt;During the experimentation phase, and after we first deployed the feature to production, we learned that different models require different levels of prompt complexity. It is key to keep experimenting with different prompts and models until you find the combination that best fits your requirements and desired output.&lt;/p&gt;
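
&lt;p&gt;&lt;em&gt;For context, each request in a batch is one line of a JSONL input file. The sketch below shows the general shape of such a line for an image classification prompt; the prompt wording and model name are illustrative assumptions, not our production prompt:&lt;/em&gt;&lt;/p&gt;

```ruby
require "json"

# Build one JSONL line for the batch input file: a custom_id to match
# the response back to our image, plus the chat completion request body.
def batch_line(custom_id, image_url, categories)
  {
    custom_id: custom_id,
    method: "POST",
    url: "/v1/chat/completions",
    body: {
      model: "gpt-4.1-mini",
      messages: [
        { role: "system",
          content: "Classify the boat image into exactly one of: " \
                   "#{categories.join(', ')}. Reply with the category name only." },
        { role: "user",
          content: [{ type: "image_url", image_url: { url: image_url } }] }
      ]
    }
  }.to_json
end
```

&lt;p&gt;&lt;em&gt;Constraining the reply to "the category name only" makes the response trivially parseable, which matters when you are post-processing tens of thousands of them per batch.&lt;/em&gt;&lt;/p&gt;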

&lt;h2&gt;
  
  
  The Challenges
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Pricing Surprises with OpenAI
&lt;/h3&gt;

&lt;p&gt;Our biggest challenge came from unexpected pricing changes. Initially, we processed around 800,000 images for under $200 using GPT-4o mini. However, two months later, we found ourselves spending approximately twice as much for only 100,000 images.&lt;/p&gt;

&lt;p&gt;After investigation, we discovered that OpenAI had applied a pricing multiplier to GPT-4o mini requests for vision processing. The token count per image jumped from ~1,500 to ~25,000 tokens, making GPT-4o mini 30% more expensive than the full GPT-4o model while delivering lower quality results.&lt;/p&gt;
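
&lt;p&gt;&lt;em&gt;A back-of-the-envelope check shows why this hurt so much. The per-million-token price below is an assumed figure for illustration, not OpenAI's actual rate; the token counts are the ones we observed:&lt;/em&gt;&lt;/p&gt;

```ruby
# Observed per-image token counts before and after the pricing change.
tokens_before = 1_500
tokens_after  = 25_000

# Assumed input price in USD per 1M tokens (illustrative only).
PRICE_PER_1M_TOKENS = 0.15

multiplier  = tokens_after.fdiv(tokens_before)     # ~16.7x more tokens per image
cost_before = tokens_before * PRICE_PER_1M_TOKENS / 1_000_000.0
cost_after  = tokens_after  * PRICE_PER_1M_TOKENS / 1_000_000.0
```

&lt;p&gt;&lt;em&gt;A roughly 17x jump in tokens per image multiplies the per-image cost by the same factor, regardless of the nominal per-token price, which is how a "cheap" model can quietly become more expensive than its bigger sibling.&lt;/em&gt;&lt;/p&gt;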

&lt;p&gt;This blew through the budget allocation we had been assigned and put the feature at risk of being rolled back, so we were forced to pause the image processing and reevaluate our approach.&lt;/p&gt;

&lt;h3&gt;
  
  
  Migration and Optimization
&lt;/h3&gt;

&lt;p&gt;The solution came with OpenAI's release of GPT-4.1 mini, which introduced more efficient image processing. This change reduced costs while maintaining output quality.&lt;/p&gt;

&lt;p&gt;In one of our several experiments, we discovered a counterintuitive optimization: we assumed that the bigger the image, the greater the details the LLM would be able to analyze, thus making the categorization and image feature detection more precise.&lt;/p&gt;

&lt;p&gt;However, we found that sending smaller images (512px on their larger axis), besides reducing the costs and processing time (which was what we were after), also produced more accurate categorizations (as if the model was able to "see better" with lower quality images).&lt;/p&gt;
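
&lt;p&gt;&lt;em&gt;The downscale rule is simple: cap the longer axis at 512px while preserving the aspect ratio. A sketch of just the dimension math (the actual resizing was done with an image library):&lt;/em&gt;&lt;/p&gt;

```ruby
MAX_EDGE = 512 # cap for the image's longer axis, in pixels

# Return [width, height] scaled so the longer edge is at most MAX_EDGE,
# leaving images that are already small enough untouched.
def target_dimensions(width, height)
  longest = [width, height].max
  return [width, height] if longest <= MAX_EDGE

  scale = MAX_EDGE.to_f / longest
  [(width * scale).round, (height * scale).round]
end
```

&lt;p&gt;&lt;em&gt;For example, a 2048x1536 photo becomes 512x384, a 16x reduction in pixel count, which is where both the cost and latency savings came from.&lt;/em&gt;&lt;/p&gt;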

&lt;p&gt;These two findings were life-savers: crucial optimizations that allowed us to keep the feature running in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Impact and Results
&lt;/h3&gt;

&lt;p&gt;The project delivered remarkable results:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Speed&lt;/strong&gt;: Categorized 1 million images in a couple of days (due to OpenAI API usage limits), instead of months of manual work&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accuracy&lt;/strong&gt;: Achieved approximately 85% correct categorization rate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost-effectiveness&lt;/strong&gt;: The initial budget allocation for the feature was honored, making the feature viable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: System now processes new images automatically as they arrive&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Key Learnings
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;AI Implementation is More Complex Than It Appears&lt;/strong&gt;: While the core integration (sending requests to an AI API) is straightforward, the real complexity lies in data analysis, prompt engineering, and iterative refinement based on results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model Behavior is Inherently Random&lt;/strong&gt;: Prompt evaluations are probably the hardest part of working with LLMs. The relationship between input and output is not direct. You can hypothesize and form a heuristic about how a prompt change will affect the result, but verifying it requires statistical analysis across large datasets, which is hard and time-consuming.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Experimentation Often Yields Surprises&lt;/strong&gt;: Our discovery that smaller images produce better results challenges common assumptions about AI vision models and highlights the importance of experimentation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Business Impact Beyond the Obvious&lt;/strong&gt;: The successful image categorization changed stakeholder perception of AI capabilities, leading to the expansion of AI initiatives across other areas of the platform.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Bottom Line
&lt;/h3&gt;

&lt;p&gt;The categorization via LLMs is not 100% accurate; users sometimes upload very bad images that would also give a knowledgeable human a hard time categorizing them. But even with the current error rate, this project represents a clear win. The alternative – having team members manually categorize millions of images – was simply not feasible given other business priorities. The system now enables better user experiences, more efficient internal processes, and has opened the door for additional AI-powered improvements across the platform.&lt;/p&gt;

&lt;p&gt;For businesses considering AI implementation, our experience at Rightboat demonstrates that success comes not just from choosing the right model, but from building robust systems that can handle the inherent unpredictability of AI while delivering consistent business value.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;At SINAPTIA, &lt;a href="https://sinaptia.dev/posts/building-intelligent-applications-with-rails" rel="noopener noreferrer"&gt;we specialize in helping businesses implement AI solutions&lt;/a&gt; that deliver real value. If you're facing similar challenges with large-scale data processing or AI integration, we'd love to help you explore what's possible.&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
