A Ruby Timeout that works

TCP_USER_TIMEOUT, exposed to Ruby

Last month a Solid Queue job at Beautiful Ruby hung on a slow Anthropic API call. Workers wedged. ActiveRecord pool drained. Pager going off. The fix should have been obvious: tell the kernel to give up after a deadline. Ruby doesn’t expose that knob. So I built it.

If you’ve got a Sidekiq job hitting Stripe, a Solid Queue worker sending email through Mailgun, or any background processor talking to any vendor, this is your future too. You just haven’t lived it yet.

If you think the :timeout on your job is going to save you, sit down, friend. We need to talk.

Timeout.timeout is a polite suggestion

Timeout.timeout doesn’t actually interrupt a blocking syscall. Mike Perham called it Ruby’s most dangerous API back in 2015. I walked the mechanism end to end in “Put the timeout on the connection.” Plenty of other Rubyists have written the same warning in the years since.

The standard library still ships it. Nobody has fixed the underlying problem.

So if your HTTP client didn’t have a read_timeout set, the thread is gone for as long as the kernel feels like waiting. That can be hours. The Sidekiq worker is gone with it. If the job grabbed an ActiveRecord connection on the way in, that’s gone too. Other jobs back up behind an empty pool. The queue stops moving and the pager goes off.
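For reference, this is what the per-client fix looks like with Net::HTTP. A minimal sketch: the host and the specific values are placeholders I picked for illustration, not recommendations from this post.

```ruby
require "net/http"

# Build a Net::HTTP client with every timeout set explicitly.
# Host and values are illustrative placeholders.
def bounded_http(host, port = 443)
  http = Net::HTTP.new(host, port)
  http.use_ssl = true
  http.open_timeout  = 2 # seconds allowed for opening the connection
  http.write_timeout = 5 # seconds allowed for each write of the request
  http.read_timeout  = 5 # seconds allowed for each read of the response
  http
end

# No network traffic happens until a request is actually made.
http = bounded_http("api.example.com")
```

Note that read_timeout applies to each individual read, not to the request as a whole, and that every client in the app needs the same treatment.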

Yes, it happens to your queue too

Sidekiq, Solid Queue, GoodJob, Resque, DelayedJob, your hand-rolled forking worker. Same story. Job processors run jobs in Ruby threads, and a Ruby thread can’t be interrupted out of a blocking syscall in a C extension. There is no holy queue. Every Ruby job processor in production is one wedged Stripe call away from the same stalled queue.

Linux has the right knob

It’s called TCP_USER_TIMEOUT. It’s been in the kernel since Linux 2.6.37, which shipped in early 2011. The contract is short: if you’ve sent data and none of it has been acknowledged in N milliseconds, the kernel kills the connection. No retransmission patience. No keepalive math. Deadline expires, socket dies.

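Ruby doesn’t expose it as a timeout. You can reach it with a raw setsockopt on a socket you hold, but nothing sets it on the sockets your libraries open for you.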

So I wrote tcp_user_timeout.

# Gemfile
gem "tcp_user_timeout"

# In your job or service code
TcpUserTimeout.with_timeout(3) do
  Anthropic::Client.new.messages.create(...)
end

Inside that block, every TCP socket Ruby opens has the option set via setsockopt. Three seconds with no acknowledged byte and the kernel tears the connection down. Your code gets Errno::ETIMEDOUT. No second thread. No Thread#raise. No politely waiting for a C extension to come back to its senses.

This is the thing that makes per-job timeouts actually fire. The :timeout on your Sidekiq job. The retry_on on your ActiveJob. None of it works without something at the kernel layer enforcing the deadline. Now there is.
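Stripped of the gem, the per-socket operation described above is one setsockopt call. Here is a minimal sketch using only Ruby's standard socket library. The helper name apply_user_timeout is mine, not part of the gem, and the local listener exists only so the example has something to connect to.

```ruby
require "socket"

# Set TCP_USER_TIMEOUT on a connected socket and read it back.
# Returns the value the kernel reports, in milliseconds, or nil on
# platforms where Ruby does not define the constant.
def apply_user_timeout(sock, seconds)
  return nil unless Socket.const_defined?(:TCP_USER_TIMEOUT)

  millis = (seconds * 1000).to_i # the kernel option is in milliseconds
  sock.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_USER_TIMEOUT, millis)
  sock.getsockopt(Socket::IPPROTO_TCP, Socket::TCP_USER_TIMEOUT).int
end

server = TCPServer.new("127.0.0.1", 0) # throwaway local listener
client = TCPSocket.new("127.0.0.1", server.addr[1])
reported_ms = apply_user_timeout(client, 3)
client.close
server.close
```

The hard part is not this call. It is getting it onto sockets that are opened deep inside someone else's library.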

How it’s wired

Small mechanism, big surface area. The gem prepends a module onto Socket and stashes a deadline in fiber-inheritable storage. When with_timeout is active, every connect attaches the option:

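# Lives in a module prepended onto Socket, so super is the real connect.
def connect(*)
  super.tap do
    # Deadline in milliseconds, or nil outside a with_timeout block.
    if (deadline = TcpUserTimeout.current_deadline_ms)
      setsockopt(Socket::IPPROTO_TCP, Socket::TCP_USER_TIMEOUT, deadline)
    end
  end
end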

Wrap once at the top of a job, every downstream socket is bounded. Threads and fibers spawned inside the block inherit the deadline.
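To make the inheritance concrete, here is a rough sketch of how a deadline can ride along in fiber storage. This is not the gem's source. DeadlineSketch and its storage key are hypothetical names, and it assumes Ruby 3.2 or newer, where Fiber[] storage is copied into child fibers and threads.

```ruby
# Hypothetical stand-in for the deadline bookkeeping; requires Ruby 3.2+.
module DeadlineSketch
  KEY = :deadline_sketch_ms

  # Store the deadline for the duration of the block, then restore
  # whatever was there before so nested calls unwind correctly.
  def self.with_timeout(seconds)
    previous = Fiber[KEY]
    Fiber[KEY] = (seconds * 1000).to_i
    yield
  ensure
    Fiber[KEY] = previous
  end

  def self.current_deadline_ms
    Fiber[KEY]
  end
end

inside = in_thread = in_fiber = nil
DeadlineSketch.with_timeout(3) do
  inside    = DeadlineSketch.current_deadline_ms
  in_thread = Thread.new { DeadlineSketch.current_deadline_ms }.value
  in_fiber  = Fiber.new { DeadlineSketch.current_deadline_ms }.resume
end
after = DeadlineSketch.current_deadline_ms
```

Threads and fibers created before the block started do not see the deadline, since the copy happens at creation time.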

What’s covered out of the box

| Layer | Coverage |
| --- | --- |
| Web | Rack middleware bounds the request (Puma, Falcon, Passenger, Thin, Unicorn) |
| Jobs | Sidekiq server middleware, Solid Queue, GoodJob via ActiveJob concern |
| HTTP | Net::HTTP, httpx, Faraday (Net::HTTP / httpx adapters), Excon, RestClient |
| Postgres | libpq’s native tcp_user_timeout connection param |
| MySQL | Trilogy via the socket hook |
| Cache | Redis (redis-rb), Memcached (dalli) |
| Email | Net::SMTP, Mail, ActionMailer |
| Mongo | Ruby driver via the socket hook |
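The Postgres row is different from the others because libpq takes the setting as an ordinary connection parameter, in milliseconds. A small sketch of building a connection URI that carries it. The host, database name, and 3000 ms value are placeholders, and as far as I know the parameter needs libpq 12 or newer.

```ruby
require "uri"

# Build a libpq connection URI that carries tcp_user_timeout (milliseconds).
# Host, database, and timeout value are illustrative placeholders.
def pg_url(host:, dbname:, user_timeout_ms:)
  query = URI.encode_www_form(tcp_user_timeout: user_timeout_ms)
  "postgresql://#{host}/#{dbname}?#{query}"
end

url = pg_url(host: "db.internal", dbname: "app_production", user_timeout_ms: 3000)
```

libpq ignores the parameter for Unix-domain socket connections and on systems without TCP_USER_TIMEOUT.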

The Rack, Sidekiq, and ActiveJob integrations are thin glue around with_timeout. Call the block directly anywhere you want. The integrations are conveniences, not requirements.
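To show how thin that glue can be, here is a sketch of a Rack middleware that does nothing but wrap the request in with_timeout. DeadlineMiddleware is a hypothetical name, not the class the gem ships, and the stand-in module at the top exists only so the sketch runs without the gem installed.

```ruby
# Stand-in so the sketch runs without the gem; the real
# TcpUserTimeout.with_timeout comes from the gem itself.
unless defined?(TcpUserTimeout)
  module TcpUserTimeout
    def self.with_timeout(_seconds)
      yield
    end
  end
end

# Hypothetical Rack middleware: bound every socket opened during a request.
class DeadlineMiddleware
  def initialize(app, seconds: 3)
    @app = app
    @seconds = seconds
  end

  def call(env)
    TcpUserTimeout.with_timeout(@seconds) { @app.call(env) }
  end
end

app = ->(_env) { [200, { "content-type" => "text/plain" }, ["ok"]] }
status, _headers, body = DeadlineMiddleware.new(app, seconds: 2).call({})
```

A Sidekiq server middleware or an ActiveJob around_perform callback would have the same shape: take the block, wrap it, yield.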

What it can’t fix

Being honest about the edges so nobody pages me at 2am.

DNS. getaddrinfo(3) runs before the socket exists. Configure resolv.conf (options timeout:1 attempts:2) or use a DNS client with its own timeout.

TCP connect. The gem sets TCP_USER_TIMEOUT after connect returns, so the handshake itself isn’t covered. Use the host library’s open_timeout for the handshake.
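If you are opening sockets yourself, Ruby's Socket.tcp takes separate limits for both of those gaps. A minimal sketch: open_bounded and its default values are my own illustration, and how strictly resolv_timeout is enforced depends on your Ruby version and platform.

```ruby
require "socket"

# Open a TCP connection with separate limits on name resolution and on
# the handshake. Both keyword arguments are part of Socket.tcp.
def open_bounded(host, port, resolve_within: 1, connect_within: 2)
  Socket.tcp(host, port, resolv_timeout: resolve_within, connect_timeout: connect_within)
end

server = TCPServer.new("127.0.0.1", 0) # throwaway local listener
port = server.addr[1]
sock = open_bounded("127.0.0.1", port)
connected_port = sock.remote_address.ip_port
sock.close
server.close
```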

C-extension clients. curb (libcurl) and mysql2 (libmysqlclient) open their sockets in C and never go through Ruby’s socket classes. The hook never sees them. Use the client’s own timeouts, or move to a pure-Ruby alternative like httpx or Trilogy.

CPU-bound code. A pure-Ruby infinite loop isn’t a socket problem. Different gem, different day.

macOS, BSD, Windows. The kernel option doesn’t exist there. The gem silently no-ops so your dev box keeps working. Production needs to be on Linux for the deadline to actually fire.

Retries. When the kernel kills the connection your code sees Errno::ETIMEDOUT. Retry policy is yours.
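One way to own that retry policy is a small wrapper that rescues Errno::ETIMEDOUT and tries again with backoff. A sketch only: with_timeout_retries and its defaults are hypothetical, and it should wrap only operations that are safe to repeat.

```ruby
# Retry a block a bounded number of times when the kernel reports ETIMEDOUT.
# `attempts` and `base_delay` are illustrative defaults.
def with_timeout_retries(attempts: 3, base_delay: 0.1)
  tries = 0
  begin
    tries += 1
    yield tries
  rescue Errno::ETIMEDOUT
    raise if tries >= attempts # out of attempts: let the caller see it
    sleep(base_delay * (2**(tries - 1))) # exponential backoff
    retry
  end
end

calls = 0
result = with_timeout_retries(base_delay: 0.01) do |attempt|
  calls += 1
  raise Errno::ETIMEDOUT if attempt < 3 # simulate two dead sockets
  :ok
end
```

Inside a job, letting the exception propagate and leaning on the queue's own retry mechanism is often the simpler choice.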

Status

Pre-1.0. The core (one setsockopt, one Module#prepend) is unit-tested and verified end-to-end on Linux. Rack middleware and ActiveJob concern have tests. Sidekiq middleware is tested with synthesized job hashes; running it inside a real Sidekiq server is on me to validate. If any integration misbehaves, falling back to a literal with_timeout block always works.

A timeout that fires

Timeout.timeout is a wish. The :timeout on your Sidekiq job is a wish. The read_timeout on your HTTP client is a wish you have to set on every client and remember to set right.

TCP_USER_TIMEOUT is a contract with the kernel. Three seconds. No byte. Dead socket.
