* Debugging Deadlocks?
From: Sargun Dhillon
Date: 2017-05-30 16:12 UTC
To: BTRFS ML

We've been running BtrFS for a couple of months now in production on
several clusters. We're running on Canonical's 4.8 kernel, and are
currently in the process of moving to our own patchset atop vanilla
4.10+. I'm glad to say it's been a fairly good experience for us. Bar
some performance issues, it's been largely smooth sailing.

One class of persistent issue has been plaguing our clusters:
deadlocks. We've seen a fair number of cases where several background
and user threads are performing operations, some waiting to start a
transaction while at least one background or user thread is in the
process of committing a transaction. Unfortunately, these situations
end in deadlock, with no thread making progress.

We've talked about a couple of ideas internally, like adding the
ability to time out transactions, abort commits or start_transaction
calls that are taking too long, and adding more debugging to get
insight into the state of the filesystem. Unfortunately, since our
usage and knowledge of BtrFS is still somewhat nascent, we're unsure
what the right investment is.

I'm curious: are other people seeing deadlocks crop up in production
often? How are you going about debugging them, and is there any good
advice on avoiding them for production workloads?
* Re: Debugging Deadlocks?
From: Duncan
Date: 2017-05-31 6:47 UTC
To: linux-btrfs

Sargun Dhillon posted on Tue, 30 May 2017 09:12:39 -0700 as excerpted:

> We've been running BtrFS for a couple of months now in production on
> several clusters. We're running on Canonical's 4.8 kernel, and are
> currently in the process of moving to our own patchset atop vanilla
> 4.10+. I'm glad to say it's been a fairly good experience for us. Bar
> some performance issues, it's been largely smooth sailing.
>
> One class of persistent issue has been plaguing our clusters:
> deadlocks.

Being just a list regular and btrfs (personal) user, not a dev or
big-time production user, I can't say I've seen a deadlock problem
either here or reported in significant numbers on-list, but beyond
that I can't help there.

I'm replying, however, regarding your kernel choices.

Good for getting off kernel 4.8, as in mainline kernel terms that's
only a short-term stable release and support has now ended. But I'm
slightly concerned with your kernel 4.10+ choice on production
clusters. 4.9 is the most recent mainline and therefore btrfs LTS
kernel series, and as such, what I was expecting.

Now don't get me wrong, 4.10+ is appropriate ATM as well, and if
you're planning to stay current, within the
2-latest-current-kernel-cycles list recommendation, I'd consider it
preferred. However, most large-scale production deployments tend to
prefer a somewhat slower upgrade cycle than that, in which case 4.9 is
preferred as the latest mainline LTS series.

As far as LTS series go, this list tries to support the latest two LTS
series, as it does the latest two current stable series.
While that's rather shorter than the LTS series support in general,
it's in keeping with the fact that btrfs remains still stabilizing and
as such under heavy development, tho it's far more stable than it was
back in the kernel 3.x or early 4.x era. At present that means 4.9 and
the previous 4.4, altho in practice 4.4 was long enough ago that we
prefer 4.9 unless there's some definite reason it's not going to work
for you.

But you're not talking as old as 4.4 in any case, so it's a question
of 4.9 LTS, staying with that series for awhile, or 4.10+, upgrading
every 10 weeks or so as a new kernel series is released and the
second-back (now 4.10, as 4.11 is the newest) becomes the third back
and thus slips out of both mainline stable release and btrfs list
primary support range.

If you're comfortable with a ten-week upgrade cycle at the scale
you're running in production, then by all means go 4.10 or 4.11 at
this point and do the upgrades, as that's preferable here for those
where it's acceptable. If not, then I'd strongly recommend the 4.9 LTS
series for now, upgrading LTS kernel series once a year or so, after
the next LTS series comes out and has had a release or two to shake
out the early bugs.

OTOH, if there's something you really need 4.10 for but would
otherwise prefer LTS, then yes, go current now and try to do the
10-week cycle until the next LTS, then if desired stick with it and
drop back to annual (or whatever) LTS series upgrades.

-- 
Duncan - List replies preferred.  No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
* Re: Debugging Deadlocks?
From: Adam Borowski
Date: 2017-05-31 20:29 UTC
To: linux-btrfs

On Wed, May 31, 2017 at 06:47:09AM +0000, Duncan wrote:
> Sargun Dhillon posted on Tue, 30 May 2017 09:12:39 -0700 as excerpted:
> > currently, in the process of moving to our own patchset atop vanilla
> > 4.10+
>
> Good for getting off kernel 4.8, as in mainline kernel terms that's only
> a short-term stable release and support has now ended.

In fact, support for 4.10 has already ended too. You either stick with
LTS (currently 4.9) or keep upgrading to the newest-and-greatest
bleeding edge as soon as it's released. For production machines, the
choice is obvious.

-- 
Don't be racist.  White, amber or black, all beers should be judged
based solely on their merits.  Heck, even if occasionally a cider
applies for a beer's job, why not?  On the other hand, corpo lager is
not a race.
* Re: Debugging Deadlocks?
From: David Sterba
Date: 2017-05-31 12:54 UTC
To: Sargun Dhillon; Cc: BTRFS ML

On Tue, May 30, 2017 at 09:12:39AM -0700, Sargun Dhillon wrote:
> We've been running BtrFS for a couple of months now in production on
> several clusters. We're running on Canonical's 4.8 kernel, and are
> currently in the process of moving to our own patchset atop vanilla
> 4.10+. I'm glad to say it's been a fairly good experience for us. Bar
> some performance issues, it's been largely smooth sailing.

Yay, thanks for the feedback.

> One class of persistent issue has been plaguing our clusters:
> deadlocks. We've seen a fair number of cases where several background
> and user threads are performing operations, some waiting to start a
> transaction while at least one background or user thread is in the
> process of committing a transaction. Unfortunately, these situations
> end in deadlock, with no thread making progress.

In such situations, save the stacks of all processes
(/proc/PID/stack). I don't want to play with terminology here, so by a
deadlock I could also understand a system making progress so slowly
that it's effectively stuck. This could happen if the files are
fragmented, so e.g. traversing extents takes locks and has a lot of
work to do before it unlocks. Add some extent sharing and updating of
references, and this adds points where the threads just wait.

The stack traces could give an idea of what kind of hang it is.

> We've talked about a couple of ideas internally, like adding the
> ability to time out transactions, abort commits or start_transaction
> calls that are taking too long, and adding more debugging to get
> insight into the state of the filesystem.
> Unfortunately, since our usage and knowledge of BtrFS is still
> somewhat nascent, we're unsure what the right investment is.

There's kernel-wide hung task detection, but I think a similar
mechanism around just the transaction commits would be useful, as a
debugging option.

There are a number of ways a transaction can be blocked, though, so
we'd need to choose the starting point: extent-related locks, waiting
for writes, other locks, the internal transactional logic (and
possibly more).

> I'm curious: are other people seeing deadlocks crop up in production
> often? How are you going about debugging them, and is there any good
> advice on avoiding them for production workloads?

I have seen hangs with kernel 4.9 a while back, triggered by a
long-running iozone stress test, but 4.8 was not affected, and 4.10+
worked fine again. I don't know if/which btrfs patches the 'canonical
4.8' kernel has, so this might not be related.

As for deadlocks (double-taken lock, lock inversion), I haven't seen
them for a long time. The testing kernels run with lockdep, so we
should be able to see them early. You could try to turn lockdep on if
the performance penalty is still acceptable for you. But there are
still cases that lockdep does not cover, IIRC, due to the higher-level
semantics of the various btrfs trees and the locking of extent
buffers.
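[David's suggestion to save the stacks of all processes can be
scripted. A minimal sketch, assuming a Linux host where
/proc/PID/stack is readable (reading it generally requires root); the
output file name is an arbitrary choice:]

```shell
#!/bin/sh
# Snapshot the kernel stack of every task on the system, so a hang can
# be analyzed after the fact. /proc/<pid>/task/<tid>/stack usually
# requires root to read; unreadable entries are skipped silently.
out="stacks-$(date +%Y%m%d-%H%M%S).txt"
for task in /proc/[0-9]*/task/[0-9]*; do
    tid=${task##*/}
    comm=$(cat "$task/comm" 2>/dev/null)
    printf '=== tid %s (%s) ===\n' "$tid" "$comm" >> "$out"
    cat "$task/stack" >> "$out" 2>/dev/null
done
echo "wrote $out"
```

[Taking two snapshots a few seconds apart and diffing them helps
distinguish a true deadlock, where the stacks are identical, from very
slow progress, where they change.]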
* Re: Debugging Deadlocks?
From: Sargun Dhillon
Date: 2017-06-01 0:32 UTC
To: dsterba, Sargun Dhillon, BTRFS ML

On Wed, May 31, 2017 at 5:54 AM, David Sterba <dsterba@suse.cz> wrote:
> On Tue, May 30, 2017 at 09:12:39AM -0700, Sargun Dhillon wrote:
>> We've been running BtrFS for a couple of months now in production on
>> several clusters. We're running on Canonical's 4.8 kernel, and are
>> currently in the process of moving to our own patchset atop vanilla
>> 4.10+. I'm glad to say it's been a fairly good experience for us.
>> Bar some performance issues, it's been largely smooth sailing.
>
> Yay, thanks for the feedback.
>
>> One class of persistent issue has been plaguing our clusters:
>> deadlocks. We've seen a fair number of cases where several
>> background and user threads are performing operations, some waiting
>> to start a transaction while at least one background or user thread
>> is in the process of committing a transaction. Unfortunately, these
>> situations end in deadlock, with no thread making progress.
>
> In such situations, save the stacks of all processes
> (/proc/PID/stack). I don't want to play with terminology here, so by
> a deadlock I could also understand a system making progress so slowly
> that it's effectively stuck. This could happen if the files are
> fragmented, so e.g. traversing extents takes locks and has a lot of
> work to do before it unlocks. Add some extent sharing and updating of
> references, and this adds points where the threads just wait.
>
> The stack traces could give an idea of what kind of hang it is.

We're saving a dump of the tasks currently running. A recent dump can
be found here: http://cwillu.com:8080/50.19.255.106/1. This is the
only snapshot I have from a node that's not making any progress.
We also see the other case, where tasks are making progress, but very
slowly, causing the kernel hung task detector to kick in. This happens
pretty often, and it's difficult to catch in the act, but the symptoms
can be frustrating, including failed instance healthchecks, poor
performance, and high latency for interactive services. Some of the
traces we've gotten from the stuck task detector include:
https://gist.github.com/sargun/9643c0c380d27a147ef3486e1d51dbdb
https://gist.github.com/sargun/8858263b8d04c8ab726738022725ec12

>> We've talked about a couple of ideas internally, like adding the
>> ability to time out transactions, abort commits or start_transaction
>> calls that are taking too long, and adding more debugging to get
>> insight into the state of the filesystem. Unfortunately, since our
>> usage and knowledge of BtrFS is still somewhat nascent, we're unsure
>> what the right investment is.
>
> There's kernel-wide hung task detection, but I think a similar
> mechanism around just the transaction commits would be useful, as a
> debugging option.
>
> There are a number of ways a transaction can be blocked, though, so
> we'd need to choose the starting point: extent-related locks, waiting
> for writes, other locks, the internal transactional logic (and
> possibly more).

As a first step, it'd be nice to have the transaction wrapped in a
stack frame. We could then instrument it much more easily with
off-the-shelf tools like simple BPF-based kprobes/kretprobes, or
ftrace, rather than having to write a custom probe that's familiar
with the innards of the transaction data structure and does its own
accounting to keep track of what's in flight.

I'll take a cut at something as simple as an in-memory list of
transactions which is periodically scanned for transactions that are
taking too long, logging whether they're stuck starting, committing,
or in-flight and uncommitted.

>> I'm curious: are other people seeing deadlocks crop up in production
>> often?
>> How are you going about debugging them, and is there any good advice
>> on avoiding them for production workloads?
>
> I have seen hangs with kernel 4.9 a while back, triggered by a
> long-running iozone stress test, but 4.8 was not affected, and 4.10+
> worked fine again. I don't know if/which btrfs patches the 'canonical
> 4.8' kernel has, so this might not be related.
>
> As for deadlocks (double-taken lock, lock inversion), I haven't seen
> them for a long time. The testing kernels run with lockdep, so we
> should be able to see them early. You could try to turn lockdep on if
> the performance penalty is still acceptable for you. But there are
> still cases that lockdep does not cover, IIRC, due to the
> higher-level semantics of the various btrfs trees and the locking of
> extent buffers.

For some of these use cases, we can pretty easily recreate the access
pattern on the machine. For others, it's more complicated to find out
which containers and datasets were scheduled to be processed on the
machine. We've run some sanity and stress tests, but we can rarely get
the filesystem to fall over in a predictable way in these tests,
compared to some production workloads.
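[The in-memory list of long-running transactions proposed above can be
illustrated in userspace terms. This is only a sketch of the
accounting logic under a timeout-based scan; the names (Txn,
TxnWatchdog, the stage strings) are hypothetical and not btrfs
internals:]

```python
import threading
import time

class Txn:
    """One in-flight transaction: when it started and what stage it is in."""
    def __init__(self, txid):
        self.txid = txid
        self.started = time.monotonic()
        self.stage = "starting"   # later: "in-flight", "committing"

class TxnWatchdog:
    """Registry of in-flight transactions, scanned for long-runners."""
    def __init__(self, timeout_secs):
        self.timeout = timeout_secs
        self.lock = threading.Lock()
        self.inflight = {}

    def begin(self, txid):
        with self.lock:
            self.inflight[txid] = Txn(txid)

    def advance(self, txid, stage):
        with self.lock:
            self.inflight[txid].stage = stage

    def end(self, txid):
        with self.lock:
            self.inflight.pop(txid, None)

    def scan(self):
        """Return (txid, stage, age_secs) for transactions over the timeout."""
        now = time.monotonic()
        with self.lock:
            return [(t.txid, t.stage, now - t.started)
                    for t in self.inflight.values()
                    if now - t.started > self.timeout]

# Simulate a transaction that gets stuck while committing.
wd = TxnWatchdog(timeout_secs=0.05)
wd.begin(1)
wd.advance(1, "committing")
time.sleep(0.1)
stuck = wd.scan()   # expect txid 1 reported as stuck in "committing"
print(stuck)
```

[In a kernel implementation the scan would run from a periodic worker
and print to dmesg; the point here is just that the per-transaction
state needed to say "stuck starting, committing, or in-flight" is one
timestamp and one stage field.]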