Put the timeout on the connection

Timeout.timeout(3) do
  make_anthropic_call
end

How does this code kill your Rails app’s scaling?

The war story

Last month, a Solid Queue job that called the Anthropic Ruby SDK started taking the app down under load. Normal completions came back in a second or two. A fraction took longer. Once in a while the connection just sat there, open, silent, forever.

The symptom wasn’t the vendor. It was the database. ActiveRecord::ConnectionTimeoutError everywhere. Pool exhausted.

First theory: LLM calls were pinning a DB connection for the whole job

A 10-second completion would hold a checkout for 10 seconds. Rails 7.2 made this behavior explicit by renaming ActiveRecord::Base.connection to .lease_connection: the connection is leased for the entire job unless you scope it. So I wrapped the ActiveRecord parts in ActiveRecord::Base.with_connection blocks to release the checkout before the HTTP call. Pool still drained. Ruled out.

Second theory: a thread leak

Nope. The threads were alive. They just weren’t doing anything. Every Solid Queue worker was sitting inside an HTTP call to the vendor, waiting for bytes that weren’t coming.

Then I found the code at the top of this article in one of the jobs. A previous version of me had wrapped the Anthropic call in Timeout.timeout(3), thinking that was a 3-second budget. rescue Timeout::Error was firing in the logs, so the timeout was “working.” But the incident didn’t stop.

That’s when I learned Timeout.timeout is not what it looks like. Mike Perham called it Ruby’s most dangerous API back in 2015. Plenty of other Rubyists have been saying the same thing for years. I just wasn’t listening.

What Timeout.timeout actually does

It spawns a second thread that sleeps for N seconds, then calls Thread#raise(Timeout::Error) on the caller. That’s the whole mechanism.
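Stripped down to a sketch, the mechanism looks something like this. naive_timeout and NaiveTimeoutError are names I made up for illustration. This is a simplified model, not the stdlib source, and real versions of the timeout library differ in the details.

```ruby
# Simplified model of the mechanism described above.
# NOT the stdlib implementation; all names here are hypothetical.
class NaiveTimeoutError < StandardError; end

def naive_timeout(seconds)
  caller_thread = Thread.current

  # The watcher does nothing but sleep, then throw an exception at the caller.
  watcher = Thread.new do
    sleep seconds
    caller_thread.raise(NaiveTimeoutError, "execution expired")
  end

  yield
ensure
  # Stop the watcher if the block finished (or failed) first.
  # There is a small race here: the watcher can fire while we are in ensure.
  watcher&.kill
  watcher&.join
end

# sleep is interruptible from Ruby, so here the deadline appears to work.
outcome =
  begin
    naive_timeout(0.1) { sleep 2; :finished }
  rescue NaiveTimeoutError
    :interrupted
  end
```

Against sleep, which the interpreter knows how to interrupt, outcome ends up as :interrupted and the timer looks trustworthy. Nothing in this code touches the socket, though, so it has no way to bound a call the interpreter cannot interrupt.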

Thread#raise can only deliver an exception when the target thread reaches an interrupt check, and the interpreter only runs those checks at safe points while executing Ruby code. If the target thread is down inside a C extension blocked on recv(), and that extension never told the VM how to interrupt the call, no check runs. The exception sits in a queue and doesn’t fire until the syscall returns.

When the vendor froze, the kernel had no reason to give up. SO_KEEPALIVE is off by default, so a dead connection can sit open indefinitely. Turn it on and Linux’s defaults still wait over two hours before declaring the socket dead. Only when the syscall finally returned did the thread unblock, the Timeout::Error get raised, and the log message show up. For the whole time in between, the thread was gone and the pool was wedged.
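Both halves of that claim are easy to check. The sketch below opens a throwaway loopback connection and reads the keepalive flag off a fresh socket, then does the arithmetic with the stock Linux sysctl defaults (tcp_keepalive_time 7200, tcp_keepalive_intvl 75, tcp_keepalive_probes 9). Those three numbers are hard-coded as an assumption; your hosts may be tuned differently.

```ruby
require "socket"

# Throwaway loopback connection, just to get a fresh TCP socket to inspect.
server = TCPServer.new("127.0.0.1", 0)
client = TCPSocket.new("127.0.0.1", server.addr[1])

# Keepalive is opt-in: a new socket has it switched off.
keepalive_default = client.getsockopt(Socket::SOL_SOCKET, Socket::SO_KEEPALIVE).bool

# Assumed stock Linux defaults (check /proc/sys/net/ipv4/tcp_keepalive_* on your hosts).
idle_seconds   = 7200 # quiet time before the first probe
probe_interval = 75   # seconds between unanswered probes
probe_count    = 9    # unanswered probes before the kernel gives up

seconds_until_dead = idle_seconds + probe_interval * probe_count

puts "keepalive on by default? #{keepalive_default}"
puts "seconds before a silent peer is declared dead: #{seconds_until_dead}"

client.close
server.close
```

With those defaults the total is 7875 seconds, a little over two hours and eleven minutes.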

Timeout.timeout doesn’t bound execution. It schedules an exception and hopes the blocking call cooperates.

The real fix

Every real network client exposes its own timeouts, and those push the deadline down to the socket layer, where the wait itself is bounded instead of interrupted from outside.

require "net/http"

# uri is assumed to be a URI already built for the vendor endpoint
http = Net::HTTP.new(uri.host, uri.port)
http.open_timeout  = 2  # TCP handshake
http.read_timeout  = 3  # each recv()
http.write_timeout = 3  # each send()
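To watch those settings fail fast, here is a small sketch that stands up a local server which accepts connections and then never answers, my stand-in for the frozen vendor. The port, the one-second budgets, and the variable names are all made up for the demo.

```ruby
require "net/http"
require "socket"

# Stand-in for a frozen vendor: accept the connection, never send a byte.
server = TCPServer.new("127.0.0.1", 0)
port = server.addr[1]
held_sockets = []
acceptor = Thread.new { loop { held_sockets << server.accept } }

# Third argument nil: ignore any proxy configured in the environment.
http = Net::HTTP.new("127.0.0.1", port, nil)
http.open_timeout = 1
http.read_timeout = 1
# Net::HTTP retries an idempotent request once after a timeout by default,
# which would double the wait. Turn that off for the demo.
http.max_retries = 0

started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
outcome =
  begin
    http.get("/")
    :responded
  rescue Net::ReadTimeout
    :read_timeout
  end
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started

puts "#{outcome} after #{elapsed.round(1)}s"

acceptor.kill
held_sockets.each(&:close)
server.close
```

The max_retries line is worth noticing: with the default retry in place, the worst case for a GET is roughly twice the read timeout you thought you configured.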

Under the hood, Net::HTTP reads with read_nonblock and, whenever no data is ready, parks in IO#wait_readable(timeout), which uses select()/poll(). No read ever blocks without a deadline, and if the wait times out, Net::HTTP raises Net::ReadTimeout from within Ruby. The thread is free, ensure runs, the connection returns to the pool, the request fails fast.
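The bounded-wait pattern fits in a few lines. This is a sketch in the spirit of what Net::HTTP does, not its actual source; read_with_deadline and ReadDeadlineExceeded are hypothetical names, and the deadline applies to each wait rather than to the whole response.

```ruby
require "socket"
require "io/wait"

# Hypothetical error class for this sketch.
class ReadDeadlineExceeded < StandardError; end

# Read up to max_bytes from io, waiting at most `timeout` seconds
# each time the socket has nothing to offer.
def read_with_deadline(io, max_bytes, timeout)
  loop do
    chunk = io.read_nonblock(max_bytes, exception: false)

    case chunk
    when :wait_readable
      # Nothing buffered yet: let the kernel wake us on data OR on the deadline.
      ready = io.wait_readable(timeout)
      raise ReadDeadlineExceeded, "no data within #{timeout}s" unless ready
    when nil
      raise EOFError, "peer closed the connection"
    else
      return chunk
    end
  end
end

# Usage with a local socket pair standing in for a network connection.
local, remote = UNIXSocket.pair
remote.write("hello")
first_read = read_with_deadline(local, 16, 1)

second_read =
  begin
    read_with_deadline(local, 16, 0.1) # remote stays silent this time
  rescue ReadDeadlineExceeded
    :timed_out
  end
```

The thread never enters a read that can block forever. The only place it waits is wait_readable, and that call takes the deadline as an argument.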

Most clients have the same shape: Faraday, HTTP.rb, pg, Redis, Anthropic’s SDK. Two numbers. How long to wait for the handshake, and how long to wait for a byte. If your client doesn’t expose them, that’s a bug in the client.

Is this only a Ruby problem?

No. It’s a problem in any language where the runtime can’t cancel a blocking syscall out from under a C library. Ruby just has the worst reputation for it because Timeout.timeout is in the standard library and looks friendly.

Language	Does wrapping the call in a timer work?	The real answer
Ruby (threaded)	No	Client/connection timeouts
Python (threaded)	No	`socket.settimeout`, `requests.get(..., timeout=...)`
Java	Sometimes	`Socket#setSoTimeout`, client timeouts
Go	Yes	`context.WithTimeout`, honored by stdlib
Node	Yes	`AbortController`, async by default
Rust (async)	Yes	`tokio::time::timeout`
Rust (sync)	No	`TcpStream::set_read_timeout`

Ruby, Python, and sync Java share the hazard because the languages were designed first and “cancel a blocking I/O call” was retrofitted. Go, Node, and async Rust share the answer because they designed for it on day one.

Could fibers fix this?

Mostly yes, with one big asterisk.

Wire Timeout.timeout to the fiber scheduler

Since Ruby 3.0 there’s Fiber::Scheduler, and runtimes like Falcon implement one. The interface defines timeout_after, and Timeout.timeout already delegates to that hook when the current thread has a scheduler that implements it. The scheduler owns the I/O reactor, which means it can actually cancel a blocked read: the fiber is parked in the reactor rather than stuck in a syscall, so the scheduler can resume it with the exception the moment the deadline passes.

With the scheduler enforcing the deadline instead of a watcher thread calling Thread#raise, the timeout works on every piece of I/O that goes through Ruby’s IO class. That’s most of the stdlib: Net::HTTP, Socket, file reads, subprocess pipes. On Falcon, Timeout.timeout can finally mean what it says.

The C-extension asterisk

A gem that drops to C and calls recv() directly, bypassing Ruby’s IO layer, will block the whole thread the fiber is running on. No scheduler hook fires. Other fibers on that thread are stuck behind it. Every Ruby stack has a handful of these: some database drivers historically, some FFI-backed clients, anything where the author wanted to avoid Ruby’s IO overhead.

Going all-in on fibers fixes timeouts for the ecosystem that lives on top of IO, which is most of it. But the moment a C binding does its own socket handling, you’re back to the same problem. That’s a social fix, not a technical one: pressure gems to route through IO, or keep a list of the ones that don’t.

Set IO#timeout= across the stdlib

There’s a simpler win available today. Ruby 3.2 added IO#timeout and IO#timeout=. Under a fiber scheduler, they already do the right thing: reads and writes past the deadline raise IO::TimeoutError. The stdlib mostly doesn’t set them. Net::HTTP has its own timeouts that predate this. Most gems don’t set one. A pass across the stdlib to set IO#timeout at the socket layer so every client inherits it would give Ruby a Go-style default without any language changes.
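Here is a minimal sketch of the setter in action, assuming Ruby 3.2 or newer. read_byte_with_io_timeout is a hypothetical helper and the pipe is a stand-in for a silent socket. As far as I can tell, the deadline also applies to an ordinary blocking read on a plain thread, which is how this example runs.

```ruby
# Requires Ruby 3.2+ for IO#timeout= and IO::TimeoutError.
# Hypothetical helper: try to read one byte from a pipe nobody writes to.
def read_byte_with_io_timeout(seconds)
  reader, writer = IO.pipe
  reader.timeout = seconds # deadline for blocking operations on this IO
  reader.read(1)           # the writer stays silent, so this has to give up
  :got_data
rescue IO::TimeoutError
  :timed_out
ensure
  reader&.close
  writer&.close
end

# Example call:
#   read_byte_with_io_timeout(0.2)
```

The deadline lives on the IO object itself, so any code that later reads from that object inherits it without knowing it was set.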

Another reason to go all-in on fibers

I’ve been banging this drum because the shape of a fiber-scheduled Ruby is just better: blocked I/O doesn’t burn a thread, concurrency gets cheap, and the cooperative scheduling model shakes a lot of accidental complexity out of the stack. “Timeouts that actually work” is one more thing that falls out of that model for free. Meanwhile on thread-centric Ruby, you’re on your own at the connection layer, forever.

That’s long-term. What did I do about it today?

Set a timeout on every external call. Two numbers, both short. A vendor that can’t answer in 3 seconds is already failing you. Don’t let them fail you for 30.

Cap the pool checkout wait. checkout_timeout: 3 on your connection pools. If the pool is empty because upstream is slow, new requests should fail fast instead of piling up behind the slow ones.

Add a circuit breaker on the flaky dependency. Semian or Stoplight. After N failures in a window, stop calling the vendor for a cooldown. Your p99 drops immediately because you’re no longer paying full timeout cost per request while they’re down.

Monitor pool wait time and downstream p99 separately. If you only have “request latency,” you can’t tell “app is slow” from “app is fine, Redis is slow and everyone is waiting for a connection.”

Treat rack-timeout as a canary, not a safety belt. It’s the same mechanism as Timeout.timeout, with the same limitation. It won’t rescue you from a hung socket.
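If the circuit breaker item is the unfamiliar one, the idea fits in a small class. What follows is a bare-bones sketch of the technique, not Semian’s or Stoplight’s API; SketchBreaker, its keyword arguments, and the injectable clock are all invented for illustration.

```ruby
# Bare-bones circuit breaker sketch. All names are hypothetical.
class SketchBreaker
  class OpenError < StandardError; end

  MONOTONIC = -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }

  def initialize(threshold:, window:, cooldown:, clock: MONOTONIC)
    @threshold = threshold # failures needed to open the circuit
    @window    = window    # seconds over which failures are counted
    @cooldown  = cooldown  # seconds to reject calls once open
    @clock     = clock
    @failures  = []
    @opened_at = nil
    @mutex     = Mutex.new
  end

  def call
    @mutex.synchronize do
      # While cooling down, fail immediately without touching the vendor.
      if @opened_at && @clock.call - @opened_at < @cooldown
        raise OpenError, "circuit open, skipping call"
      end
    end

    result = yield # the vendor call runs outside the lock

    # Success: forget past failures and close the circuit.
    @mutex.synchronize do
      @failures.clear
      @opened_at = nil
    end
    result
  rescue OpenError
    raise
  rescue StandardError
    record_failure
    raise
  end

  private

  def record_failure
    @mutex.synchronize do
      now = @clock.call
      @failures << now
      @failures.reject! { |at| now - at > @window }
      # Open on too many recent failures, or re-open if a trial call failed.
      @opened_at = now if @opened_at || @failures.size >= @threshold
    end
  end
end
```

You would wrap the vendor call in breaker.call { ... } and keep the connection-level timeouts inside the block. The breaker saves you from paying the timeout on every request. It does not replace the timeout.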

The rule we added

The ultimate fix wasn’t any single timeout. It was turning “does this connection have a timeout?” into a PR-review rule. New HTTP client, new background job talking to a vendor, new Redis connection, new SDK. The PR doesn’t merge until the timeout is set. Five-second review item that catches the one mistake that cost us a weekend.

Put the timeout on the connection. Every time.