On Thu, Nov 08 2018, J. Bruce Fields wrote: > On Fri, Nov 09, 2018 at 11:38:19AM +1100, NeilBrown wrote: >> On Thu, Nov 08 2018, J. Bruce Fields wrote: >> >> > On Mon, Nov 05, 2018 at 12:30:48PM +1100, NeilBrown wrote: >> >> When we find an existing lock which conflicts with a request, >> >> and the request wants to wait, we currently add the request >> >> to a list. When the lock is removed, the whole list is woken. >> >> This can cause the thundering-herd problem. >> >> To reduce the problem, we make use of the (new) fact that >> >> a pending request can itself have a list of blocked requests. >> >> When we find a conflict, we look through the existing blocked requests. >> >> If any one of them blocks the new request, the new request is attached >> >> below that request, otherwise it is added to the list of blocked >> >> requests, which are now known to be mutually non-conflicting. >> >> >> >> This way, when the lock is released, only a set of non-conflicting >> >> locks will be woken, the rest can stay asleep. >> >> If the lock request cannot be granted and the request needs to be >> >> requeued, all the other requests it blocks will then be woken >> > >> > So, to make sure I understand: the tree of blocking locks only ever has >> > three levels (the active lock, the locks blocking on it, and their >> > children?) >> >> Not correct. >> Blocks is only vertical, never horizontal. Siblings never block each >> other. >> So one process hold a lock on a byte, and 27 other process want a lock >> on that byte, then there will be 28 levels in a narrow tree - it is >> effectively a queue. >> Branching (via siblings) only happens when a child conflict with only >> part of the lock held by the parent. >> So if one process locks 32K, then two other processes request locks on >> the 2 16K halves, then 4 processes request locks on the 8K quarters, and >> so-on, then you could end up with 32767 processes in a binary tree, with >> half of them all waiting on different individual bytes. > > Maybe I should actually read the code carefully instead of just skimming > the changelog and jumping to conclusions. > > I think this is correct, but I wish we had an actual written-out > argument that it's correct, because intuition isn't a great guide for > posix file locks. > > Maybe: > > Waiting and applied locks are all kept in trees whose properties are: > > - the root of a tree may be an applied or unapplied lock. > - every other node in the tree is an unapplied lock that > conflicts with every ancestor of that node. > > Every such tree begins life as an unapplied singleton which obviously > satisfies the above properties. > > The only ways we modify trees preserve these properties: > > 1. We may add a new child, but only after first verifying that it > conflicts with all of its ancestors. > 2. We may remove the root of a tree, creating a new singleton > tree from the root and N new trees rooted in the immediate > children. > 3. If the root of a tree is not currently an applied lock, we may > apply it (if possible). > 4. We may upgrade the root of the tree (either extend its range, > or upgrade its entire range from read to write). > > When an applied lock is modified in a way that reduces or downgrades any > part of its range, we remove all its children (2 above). > > For each of those child trees: if the root of the tree applies, we do so > (3). If it doesn't, it must conflict with some applied lock. We remove > all of its children (2), and add it is a new leaf to the tree rooted in > the applied lock (1). We then repeat the process recursively with those > children. > Thanks pretty thorough - and even looks correct. I'll re-reading some time when it isn't late, and maybe make it into a comment in the code. I agree, this sort of documentation can be quite helpful. Thanks, NeilBrown