DEV Community

AranaDeDoros

What Happens When One Parallel Call Fails? Structured Concurrency in Scala

When building backend systems, we often fan out to multiple services in parallel:

  • Price providers
  • Recommendation engines
  • Search indexes
  • Payment gateways

The real question isn't "How do I run these in parallel?"

It's:

  • What happens if one fails?
  • What happens if one times out?
  • Do retries leak work?
  • Can I keep partial results?

I explored this while building a small price aggregation simulator in Scala 3 using Cats Effect.


The Scenario

The setup was simple:

  • Multiple providers
  • Each returns a price
  • We call them in parallel
  • We aggregate the results

But the interesting part wasn't parallelism.

It was failure semantics.

The Core Abstraction

trait PriceProvider[F[_]]:
  def name: ProviderName
  def fetchPrice(productId: ProductId): F[Price]
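The trait references `ProviderName`, `ProductId`, and `Price`, which aren't shown in the post. A minimal sketch of what they might look like (the field names are my assumption):

```scala
// Hypothetical domain types; the original post does not show them.
final case class ProviderName(value: String)
final case class ProductId(value: String)
final case class Price(amount: BigDecimal, currency: String = "USD")
```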

Providers may fail.

One specific failure is modeled explicitly:

sealed abstract class ProviderError(message: String) extends Throwable(message)

final case class TimeOutError(provider: String, time: LocalDateTime)
    extends ProviderError(s"$provider timed out")

Timeouts are not transport hacks.

They are part of the domain.
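One way to enforce that is at the call site: convert Cats Effect's generic timeout into the domain error with `timeoutTo`. A sketch, assuming the error type above (the helper name and the time-capture logic are illustrative):

```scala
//> using dep org.typelevel::cats-effect:3.5.4
import cats.effect.IO
import scala.concurrent.duration.*
import java.time.LocalDateTime

// Redefined here so the sketch is self-contained.
sealed abstract class ProviderError(message: String) extends Throwable(message)
final case class TimeOutError(provider: String, time: LocalDateTime)
    extends ProviderError(s"$provider timed out")

// If `raw` doesn't finish within the budget, fail with the domain error
// instead of Cats Effect's generic TimeoutException.
def withDomainTimeout[A](provider: String, budget: FiniteDuration)(raw: IO[A]): IO[A] =
  raw.timeoutTo(
    budget,
    IO(LocalDateTime.now()).flatMap(t => IO.raiseError(TimeOutError(provider, t)))
  )
```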

Retry as a Decorator

Instead of baking retry into the provider, I wrapped providers with a retry policy:

def retryOnTimeout[A](
  fa: IO[A],
  maxRetries: Int,
  delay: FiniteDuration
): IO[A] =

  def loop(attempt: Int): IO[A] =
    fa.handleErrorWith {
      case _: TimeOutError if attempt < maxRetries =>
        IO.sleep(delay) *> loop(attempt + 1)
      case e =>
        IO.raiseError(e)
    }

  loop(0)
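To see the policy in action, here's a self-contained sketch that simulates a provider timing out twice before succeeding (the `Ref`-based counter and the values are illustrative):

```scala
//> using dep org.typelevel::cats-effect:3.5.4
import cats.effect.{IO, Ref}
import scala.concurrent.duration.*
import java.time.LocalDateTime

sealed abstract class ProviderError(message: String) extends Throwable(message)
final case class TimeOutError(provider: String, time: LocalDateTime)
    extends ProviderError(s"$provider timed out")

def retryOnTimeout[A](fa: IO[A], maxRetries: Int, delay: FiniteDuration): IO[A] =
  def loop(attempt: Int): IO[A] =
    fa.handleErrorWith {
      case _: TimeOutError if attempt < maxRetries =>
        IO.sleep(delay) *> loop(attempt + 1)
      case e => IO.raiseError(e)
    }
  loop(0)

val program: IO[(Int, Int)] =
  for
    calls <- Ref.of[IO, Int](0)
    // Fails with a timeout on the first two calls, then returns 42.
    flaky = calls.updateAndGet(_ + 1).flatMap { n =>
              if n < 3 then IO.raiseError(TimeOutError("flaky", LocalDateTime.now()))
              else IO.pure(42)
            }
    price <- retryOnTimeout(flaky, maxRetries = 3, delay = 10.millis)
    n     <- calls.get
  yield (price, n)
```

Running `program` yields the price together with the call count: three calls, one result.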

Then composed it:

final class RetryingProvider(
  underlying: PriceProvider[IO],
  maxRetries: Int,
  delay: FiniteDuration
) extends PriceProvider[IO]:

  override def name: ProviderName = underlying.name

  override def fetchPrice(productId: ProductId): IO[Price] =
    retryOnTimeout(
      underlying.fetchPrice(productId),
      maxRetries,
      delay
    )

Important detail:

  • Only timeouts retry
  • Other failures propagate immediately
  • The underlying provider remains untouched

Retry is a policy layer, not embedded behavior.

Aggregation Strategy #1 — Fail Fast

def fetchAll(
  providers: List[PriceProvider[IO]],
  productId: ProductId
): IO[List[Price]] =
  providers.parTraverse(_.fetchPrice(productId))

Semantics:

  • All providers run in parallel
  • If one fails, the whole operation fails
  • Remaining providers are cancelled

This is strict and correct when all results are required.
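A sketch of that cancellation behavior, using two stand-in tasks instead of real providers (names and durations are illustrative): one task fails immediately, and the slow sibling is cancelled before it can record its side effect.

```scala
//> using dep org.typelevel::cats-effect:3.5.4
import cats.effect.{IO, Ref}
import cats.syntax.all.*
import scala.concurrent.duration.*

val demo: IO[(Boolean, Boolean)] =
  for
    ran   <- Ref.of[IO, Boolean](false)
    tasks  = List(
               IO.raiseError[Int](new RuntimeException("provider down")),
               // The slow sibling: cancelled by parTraverse before it finishes.
               IO.sleep(1.second) *> ran.set(true).as(1)
             )
    result <- tasks.parTraverse(identity).attempt
    flag   <- ran.get
  yield (result.isLeft, flag)
```

`demo` produces `(true, false)`: the aggregate operation failed, and the slow task's flag was never set.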

Aggregation Strategy #2 — Keep Partial Results

Sometimes partial success is acceptable.

So instead of failing the entire computation, we capture results explicitly:

def fetchSafe(
  provider: PriceProvider[IO],
  productId: ProductId
): IO[ProviderResult] =
  provider.fetchPrice(productId)
    .map(price =>
      ProviderResult.Success(provider.name, price)
    )
    .handleError { e =>
      // Any failure (timeout or otherwise) becomes a Failure result,
      // so the effect itself always completes.
      ProviderResult.Failure(provider.name, e)
    }

Then aggregate:

def fetchAllPartial(
  providers: List[PriceProvider[IO]],
  productId: ProductId
): IO[List[ProviderResult]] =
  providers.parTraverse(p => fetchSafe(p, productId))

Now:

  • All providers run in parallel
  • Failures don't cancel siblings
  • Partial results are preserved
  • The operation always completes

Different policy. Same building blocks.
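A sketch of the contrast, with stand-in providers and a `ProviderResult` enum mirroring the one used above (the names are illustrative):

```scala
//> using dep org.typelevel::cats-effect:3.5.4
import cats.effect.IO
import cats.syntax.all.*

enum ProviderResult:
  case Success(provider: String, price: BigDecimal)
  case Failure(provider: String, error: Throwable)

// Capture success and failure as values instead of letting errors propagate.
def fetchSafe(name: String, fetch: IO[BigDecimal]): IO[ProviderResult] =
  fetch
    .map(price => ProviderResult.Success(name, price))
    .handleError(e => ProviderResult.Failure(name, e))

val demo: IO[List[ProviderResult]] =
  List(
    "a" -> IO.pure(BigDecimal(10)),
    "b" -> IO.raiseError[BigDecimal](new RuntimeException("down")),
    "c" -> IO.pure(BigDecimal(12))
  ).parTraverse { case (name, fetch) => fetchSafe(name, fetch) }
```

Here `demo` completes with three results: two successes and one captured failure, with "b"'s error cancelling nothing.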

Optional: Enforcing a Quorum

You can even require a minimum number of successes:

def requireAtLeast(
  n: Int,
  results: List[ProviderResult]
): IO[List[(ProviderName, Price)]] =
  val successes = results.collect {
    case ProviderResult.Success(p, price) => (p, price)
  }

  if successes.size >= n then IO.pure(successes)
  else IO.raiseError(new RuntimeException("Not enough providers"))
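The quorum check itself is pure list logic, so it can be exercised without `IO` at all. A self-contained sketch (the enum and helper names are illustrative):

```scala
enum ProviderResult:
  case Success(provider: String, price: BigDecimal)
  case Failure(provider: String, error: String)

// Extract only the successful (provider, price) pairs.
def successesOf(results: List[ProviderResult]): List[(String, BigDecimal)] =
  results.collect { case ProviderResult.Success(p, price) => (p, price) }

// True when at least n providers answered successfully.
def meetsQuorum(n: Int, results: List[ProviderResult]): Boolean =
  successesOf(results).size >= n

val results = List(
  ProviderResult.Success("a", BigDecimal(10)),
  ProviderResult.Failure("b", "timed out"),
  ProviderResult.Success("c", BigDecimal(12))
)
```

With two of three providers succeeding, a quorum of 2 is met and a quorum of 3 is not.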

Now the system supports:

  • Fail-fast
  • Partial aggregation
  • Quorum-based acceptance

All composed explicitly.

What This Experiment Revealed

The interesting part wasn't syntax. It was separation of concerns.

The system has three independent policy layers:

1. Failure classification (timeouts vs other errors)

2. Retry behavior (decorator)

3. Aggregation strategy (fail-fast vs partial vs quorum)

None of them are tangled together.
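Stacked together, the three layers compose in one expression. A condensed, self-contained sketch over plain `(name, IO[BigDecimal])` tasks rather than the full trait (the names and retry settings are illustrative):

```scala
//> using dep org.typelevel::cats-effect:3.5.4
import cats.effect.IO
import cats.syntax.all.*
import scala.concurrent.duration.*

// Layer 1: failure classification.
final case class TimeOutError(provider: String) extends Throwable(s"$provider timed out")

// Layer 2: retry policy — only timeouts retry.
def retryOnTimeout[A](fa: IO[A], maxRetries: Int, delay: FiniteDuration): IO[A] =
  def loop(attempt: Int): IO[A] =
    fa.handleErrorWith {
      case _: TimeOutError if attempt < maxRetries => IO.sleep(delay) *> loop(attempt + 1)
      case e                                       => IO.raiseError(e)
    }
  loop(0)

// Layer 3: partial aggregation plus a quorum check.
def quorum(n: Int, tasks: List[(String, IO[BigDecimal])]): IO[List[(String, BigDecimal)]] =
  tasks
    .parTraverse { case (name, fetch) =>
      retryOnTimeout(fetch, maxRetries = 2, delay = 10.millis)
        .map(p => name -> (Right(p): Either[Throwable, BigDecimal]))
        .handleError(e => name -> Left(e))
    }
    .flatMap { results =>
      val ok = results.collect { case (name, Right(p)) => name -> p }
      if ok.size >= n then IO.pure(ok)
      else IO.raiseError(new RuntimeException("Not enough providers"))
    }
```

Swapping any one layer (a different retry policy, a different quorum) leaves the other two untouched.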

That's the real value.

  • When to retry
  • When to cancel
  • When partial results are acceptable
  • When to fail the whole operation

The abstraction you choose determines how clearly you can express those decisions.

And that's a backend engineering concern, not a language feature.
