LKML Archive on lore.kernel.org
 help / Atom feed
* [PATCH 0/5 - V2] locks: avoid thundering-herd wake-ups
@ 2018-08-09  2:04 NeilBrown
  2018-08-09  2:04 ` [PATCH 1/5] fs/locks: rename some lists and pointers NeilBrown
                   ` (5 more replies)
  0 siblings, 6 replies; 27+ messages in thread
From: NeilBrown @ 2018-08-09  2:04 UTC (permalink / raw)
  To: Jeff Layton, Alexander Viro
  Cc: J. Bruce Fields, Martin Wilck, linux-fsdevel, Frank Filz, linux-kernel

This series adds "wake_non_conflicts()" to wake up
any waiters that are being added beneath a lock that they
don't actually conflict with, even though they conflict with a parent
which conflict with the top-level blocker.
Thanks for Bruce for highlighting this issue.

This series hasn't been tested beyond compile-test.

Original description:
If you have a many-core machine, and have many threads all wanting to
briefly lock a give file (udev is known to do this), you can get quite
poor performance.

When one thread releases a lock, it wakes up all other threads that
are waiting (classic thundering-herd) - one will get the lock and the
others go to sleep.
When you have few cores, this is not very noticeable: by the time the
4th or 5th thread gets enough CPU time to try to claim the lock, the
earlier threads have claimed it, done what was needed, and released.
With 50+ cores, the contention can easily be measured.

This patchset creates a tree of pending lock request in which siblings
don't conflict and each lock request does conflict with its parent.
When a lock is released, only requests which don't conflict with each
other a woken.

Testing shows that lock-acquisitions-per-second is now fairly stable even
as number of contending process goes to 1000.  Without this patch,
locks-per-second drops off steeply after a few 10s of processes.

There is a small cost to this extra complexity.
At 20 processes running a particular test on 72 cores, the lock
acquisitions per second drops from 1.8 million to 1.4 million with
this patch.  For 100 processes, this patch still provides 1.4 million
while without this patch there are about 700,000.

NeilBrown


---

NeilBrown (5):
      fs/locks: rename some lists and pointers.
      fs/locks: allow a lock request to block other requests.
      fs/locks: change all *_conflict() functions to return a new enum.
      fs/locks: split out __locks_wake_one()
      fs/locks: create a tree of dependent requests.


 fs/cifs/file.c                  |    2 
 fs/locks.c                      |  228 ++++++++++++++++++++++++++++++---------
 include/linux/fs.h              |    5 +
 include/trace/events/filelock.h |   16 +--
 4 files changed, 186 insertions(+), 65 deletions(-)

--
Signature


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 1/5] fs/locks: rename some lists and pointers.
  2018-08-09  2:04 [PATCH 0/5 - V2] locks: avoid thundering-herd wake-ups NeilBrown
@ 2018-08-09  2:04 ` NeilBrown
  2018-08-09  2:04 ` [PATCH 2/5] fs/locks: allow a lock request to block other requests NeilBrown
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 27+ messages in thread
From: NeilBrown @ 2018-08-09  2:04 UTC (permalink / raw)
  To: Jeff Layton, Alexander Viro
  Cc: J. Bruce Fields, Martin Wilck, linux-fsdevel, Frank Filz, linux-kernel

struct file lock contains an 'fl_next' pointer which
is used to point to the lock that this request is blocked
waiting for.  So rename it to fl_blocker.

The fl_blocked list_head in an active lock is the head of a list of
blocked requests.  In a request it is a node in that list.
These are two distinct uses, so replace with two list_heads
with different names.
fl_blocked is the head of a list of blocked requests
fl_block is a node on that list.

This separation will be used in the next patch.

Note that a tracepoint is changed to report fl_blocker instead
of fl_next.  Might this be an ABI breakage?

Signed-off-by: NeilBrown <neilb@suse.com>
---
 fs/cifs/file.c                  |    2 +-
 fs/locks.c                      |   40 ++++++++++++++++++++-------------------
 include/linux/fs.h              |    5 +++--
 include/trace/events/filelock.h |   16 ++++++++--------
 4 files changed, 33 insertions(+), 30 deletions(-)

diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 8d41ca7bfcf1..066ed2e4ba96 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -1092,7 +1092,7 @@ cifs_posix_lock_set(struct file *file, struct file_lock *flock)
 	rc = posix_lock_file(file, flock, NULL);
 	up_write(&cinode->lock_sem);
 	if (rc == FILE_LOCK_DEFERRED) {
-		rc = wait_event_interruptible(flock->fl_wait, !flock->fl_next);
+		rc = wait_event_interruptible(flock->fl_wait, !flock->fl_blocker);
 		if (!rc)
 			goto try_again;
 		posix_unblock_lock(flock);
diff --git a/fs/locks.c b/fs/locks.c
index db7b6917d9c5..322491e70e41 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -194,7 +194,7 @@ static DEFINE_HASHTABLE(blocked_hash, BLOCKED_HASH_BITS);
  * This lock protects the blocked_hash. Generally, if you're accessing it, you
  * want to be holding this lock.
  *
- * In addition, it also protects the fl->fl_block list, and the fl->fl_next
+ * In addition, it also protects the fl->fl_blocked list, and the fl->fl_blocker
  * pointer for file_lock structures that are acting as lock requests (in
  * contrast to those that are acting as records of acquired locks).
  *
@@ -203,7 +203,7 @@ static DEFINE_HASHTABLE(blocked_hash, BLOCKED_HASH_BITS);
  * protected by this lock, we can skip acquiring it iff we already hold the
  * flc_lock.
  *
- * In particular, adding an entry to the fl_block list requires that you hold
+ * In particular, adding an entry to the fl_blocked list requires that you hold
  * both the flc_lock and the blocked_lock_lock (acquired in that order).
  * Deleting an entry from the list however only requires the file_lock_lock.
  */
@@ -302,6 +302,7 @@ static void locks_init_lock_heads(struct file_lock *fl)
 {
 	INIT_HLIST_NODE(&fl->fl_link);
 	INIT_LIST_HEAD(&fl->fl_list);
+	INIT_LIST_HEAD(&fl->fl_blocked);
 	INIT_LIST_HEAD(&fl->fl_block);
 	init_waitqueue_head(&fl->fl_wait);
 }
@@ -341,6 +342,7 @@ void locks_free_lock(struct file_lock *fl)
 {
 	BUG_ON(waitqueue_active(&fl->fl_wait));
 	BUG_ON(!list_empty(&fl->fl_list));
+	BUG_ON(!list_empty(&fl->fl_blocked));
 	BUG_ON(!list_empty(&fl->fl_block));
 	BUG_ON(!hlist_unhashed(&fl->fl_link));
 
@@ -676,7 +678,7 @@ static void __locks_delete_block(struct file_lock *waiter)
 {
 	locks_delete_global_blocked(waiter);
 	list_del_init(&waiter->fl_block);
-	waiter->fl_next = NULL;
+	waiter->fl_blocker = NULL;
 }
 
 static void locks_delete_block(struct file_lock *waiter)
@@ -692,16 +694,16 @@ static void locks_delete_block(struct file_lock *waiter)
  * it seems like the reasonable thing to do.
  *
  * Must be called with both the flc_lock and blocked_lock_lock held. The
- * fl_block list itself is protected by the blocked_lock_lock, but by ensuring
+ * fl_blocked list itself is protected by the blocked_lock_lock, but by ensuring
  * that the flc_lock is also held on insertions we can avoid taking the
- * blocked_lock_lock in some cases when we see that the fl_block list is empty.
+ * blocked_lock_lock in some cases when we see that the fl_blocked list is empty.
  */
 static void __locks_insert_block(struct file_lock *blocker,
 					struct file_lock *waiter)
 {
 	BUG_ON(!list_empty(&waiter->fl_block));
-	waiter->fl_next = blocker;
-	list_add_tail(&waiter->fl_block, &blocker->fl_block);
+	waiter->fl_blocker = blocker;
+	list_add_tail(&waiter->fl_block, &blocker->fl_blocked);
 	if (IS_POSIX(blocker) && !IS_OFDLCK(blocker))
 		locks_insert_global_blocked(waiter);
 }
@@ -725,18 +727,18 @@ static void locks_wake_up_blocks(struct file_lock *blocker)
 	/*
 	 * Avoid taking global lock if list is empty. This is safe since new
 	 * blocked requests are only added to the list under the flc_lock, and
-	 * the flc_lock is always held here. Note that removal from the fl_block
+	 * the flc_lock is always held here. Note that removal from the fl_blocked
 	 * list does not require the flc_lock, so we must recheck list_empty()
 	 * after acquiring the blocked_lock_lock.
 	 */
-	if (list_empty(&blocker->fl_block))
+	if (list_empty(&blocker->fl_blocked))
 		return;
 
 	spin_lock(&blocked_lock_lock);
-	while (!list_empty(&blocker->fl_block)) {
+	while (!list_empty(&blocker->fl_blocked)) {
 		struct file_lock *waiter;
 
-		waiter = list_first_entry(&blocker->fl_block,
+		waiter = list_first_entry(&blocker->fl_blocked,
 				struct file_lock, fl_block);
 		__locks_delete_block(waiter);
 		if (waiter->fl_lmops && waiter->fl_lmops->lm_notify)
@@ -887,7 +889,7 @@ static struct file_lock *what_owner_is_waiting_for(struct file_lock *block_fl)
 
 	hash_for_each_possible(blocked_hash, fl, fl_link, posix_owner_key(block_fl)) {
 		if (posix_same_owner(fl, block_fl))
-			return fl->fl_next;
+			return fl->fl_blocker;
 	}
 	return NULL;
 }
@@ -1245,7 +1247,7 @@ static int posix_lock_inode_wait(struct inode *inode, struct file_lock *fl)
 		error = posix_lock_inode(inode, fl, NULL);
 		if (error != FILE_LOCK_DEFERRED)
 			break;
-		error = wait_event_interruptible(fl->fl_wait, !fl->fl_next);
+		error = wait_event_interruptible(fl->fl_wait, !fl->fl_blocker);
 		if (!error)
 			continue;
 
@@ -1332,7 +1334,7 @@ int locks_mandatory_area(struct inode *inode, struct file *filp, loff_t start,
 		error = posix_lock_inode(inode, &fl, NULL);
 		if (error != FILE_LOCK_DEFERRED)
 			break;
-		error = wait_event_interruptible(fl.fl_wait, !fl.fl_next);
+		error = wait_event_interruptible(fl.fl_wait, !fl.fl_blocker);
 		if (!error) {
 			/*
 			 * If we've been sleeping someone might have
@@ -1526,7 +1528,7 @@ int __break_lease(struct inode *inode, unsigned int mode, unsigned int type)
 
 	locks_dispose_list(&dispose);
 	error = wait_event_interruptible_timeout(new_fl->fl_wait,
-						!new_fl->fl_next, break_time);
+						!new_fl->fl_blocker, break_time);
 
 	percpu_down_read_preempt_disable(&file_rwsem);
 	spin_lock(&ctx->flc_lock);
@@ -1940,7 +1942,7 @@ static int flock_lock_inode_wait(struct inode *inode, struct file_lock *fl)
 		error = flock_lock_inode(inode, fl);
 		if (error != FILE_LOCK_DEFERRED)
 			break;
-		error = wait_event_interruptible(fl->fl_wait, !fl->fl_next);
+		error = wait_event_interruptible(fl->fl_wait, !fl->fl_blocker);
 		if (!error)
 			continue;
 
@@ -2212,7 +2214,7 @@ static int do_lock_file_wait(struct file *filp, unsigned int cmd,
 		error = vfs_lock_file(filp, cmd, fl, NULL);
 		if (error != FILE_LOCK_DEFERRED)
 			break;
-		error = wait_event_interruptible(fl->fl_wait, !fl->fl_next);
+		error = wait_event_interruptible(fl->fl_wait, !fl->fl_blocker);
 		if (!error)
 			continue;
 
@@ -2583,7 +2585,7 @@ posix_unblock_lock(struct file_lock *waiter)
 	int status = 0;
 
 	spin_lock(&blocked_lock_lock);
-	if (waiter->fl_next)
+	if (waiter->fl_blocker)
 		__locks_delete_block(waiter);
 	else
 		status = -ENOENT;
@@ -2711,7 +2713,7 @@ static int locks_show(struct seq_file *f, void *v)
 
 	lock_get_status(f, fl, iter->li_pos, "");
 
-	list_for_each_entry(bfl, &fl->fl_block, fl_block)
+	list_for_each_entry(bfl, &fl->fl_blocked, fl_block)
 		lock_get_status(f, bfl, iter->li_pos, " ->");
 
 	return 0;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 805bf22898cf..48f7f36c3772 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1002,10 +1002,11 @@ bool opens_in_grace(struct net *);
  * Obviously, the last two criteria only matter for POSIX locks.
  */
 struct file_lock {
-	struct file_lock *fl_next;	/* singly linked list for this inode  */
+	struct file_lock *fl_blocker;	/* The lock, that is blocking us */
 	struct list_head fl_list;	/* link into file_lock_context */
 	struct hlist_node fl_link;	/* node in global lists */
-	struct list_head fl_block;	/* circular list of blocked processes */
+	struct list_head fl_blocked;	/* list of requests with ->fl_blocker pointing here */
+	struct list_head fl_block;	/* node in ->fl_blocker->fl_blocked */
 	fl_owner_t fl_owner;
 	unsigned int fl_flags;
 	unsigned char fl_type;
diff --git a/include/trace/events/filelock.h b/include/trace/events/filelock.h
index d1faf3597b9d..0e50f985a390 100644
--- a/include/trace/events/filelock.h
+++ b/include/trace/events/filelock.h
@@ -68,7 +68,7 @@ DECLARE_EVENT_CLASS(filelock_lock,
 		__field(struct file_lock *, fl)
 		__field(unsigned long, i_ino)
 		__field(dev_t, s_dev)
-		__field(struct file_lock *, fl_next)
+		__field(struct file_lock *, fl_blocker)
 		__field(fl_owner_t, fl_owner)
 		__field(unsigned int, fl_pid)
 		__field(unsigned int, fl_flags)
@@ -82,7 +82,7 @@ DECLARE_EVENT_CLASS(filelock_lock,
 		__entry->fl = fl ? fl : NULL;
 		__entry->s_dev = inode->i_sb->s_dev;
 		__entry->i_ino = inode->i_ino;
-		__entry->fl_next = fl ? fl->fl_next : NULL;
+		__entry->fl_blocker = fl ? fl->fl_blocker : NULL;
 		__entry->fl_owner = fl ? fl->fl_owner : NULL;
 		__entry->fl_pid = fl ? fl->fl_pid : 0;
 		__entry->fl_flags = fl ? fl->fl_flags : 0;
@@ -92,9 +92,9 @@ DECLARE_EVENT_CLASS(filelock_lock,
 		__entry->ret = ret;
 	),
 
-	TP_printk("fl=0x%p dev=0x%x:0x%x ino=0x%lx fl_next=0x%p fl_owner=0x%p fl_pid=%u fl_flags=%s fl_type=%s fl_start=%lld fl_end=%lld ret=%d",
+	TP_printk("fl=0x%p dev=0x%x:0x%x ino=0x%lx fl_blocker=0x%p fl_owner=0x%p fl_pid=%u fl_flags=%s fl_type=%s fl_start=%lld fl_end=%lld ret=%d",
 		__entry->fl, MAJOR(__entry->s_dev), MINOR(__entry->s_dev),
-		__entry->i_ino, __entry->fl_next, __entry->fl_owner,
+		__entry->i_ino, __entry->fl_blocker, __entry->fl_owner,
 		__entry->fl_pid, show_fl_flags(__entry->fl_flags),
 		show_fl_type(__entry->fl_type),
 		__entry->fl_start, __entry->fl_end, __entry->ret)
@@ -122,7 +122,7 @@ DECLARE_EVENT_CLASS(filelock_lease,
 		__field(struct file_lock *, fl)
 		__field(unsigned long, i_ino)
 		__field(dev_t, s_dev)
-		__field(struct file_lock *, fl_next)
+		__field(struct file_lock *, fl_blocker)
 		__field(fl_owner_t, fl_owner)
 		__field(unsigned int, fl_flags)
 		__field(unsigned char, fl_type)
@@ -134,7 +134,7 @@ DECLARE_EVENT_CLASS(filelock_lease,
 		__entry->fl = fl ? fl : NULL;
 		__entry->s_dev = inode->i_sb->s_dev;
 		__entry->i_ino = inode->i_ino;
-		__entry->fl_next = fl ? fl->fl_next : NULL;
+		__entry->fl_blocker = fl ? fl->fl_blocker : NULL;
 		__entry->fl_owner = fl ? fl->fl_owner : NULL;
 		__entry->fl_flags = fl ? fl->fl_flags : 0;
 		__entry->fl_type = fl ? fl->fl_type : 0;
@@ -142,9 +142,9 @@ DECLARE_EVENT_CLASS(filelock_lease,
 		__entry->fl_downgrade_time = fl ? fl->fl_downgrade_time : 0;
 	),
 
-	TP_printk("fl=0x%p dev=0x%x:0x%x ino=0x%lx fl_next=0x%p fl_owner=0x%p fl_flags=%s fl_type=%s fl_break_time=%lu fl_downgrade_time=%lu",
+	TP_printk("fl=0x%p dev=0x%x:0x%x ino=0x%lx fl_blocker=0x%p fl_owner=0x%p fl_flags=%s fl_type=%s fl_break_time=%lu fl_downgrade_time=%lu",
 		__entry->fl, MAJOR(__entry->s_dev), MINOR(__entry->s_dev),
-		__entry->i_ino, __entry->fl_next, __entry->fl_owner,
+		__entry->i_ino, __entry->fl_blocker, __entry->fl_owner,
 		show_fl_flags(__entry->fl_flags),
 		show_fl_type(__entry->fl_type),
 		__entry->fl_break_time, __entry->fl_downgrade_time)



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 2/5] fs/locks: allow a lock request to block other requests.
  2018-08-09  2:04 [PATCH 0/5 - V2] locks: avoid thundering-herd wake-ups NeilBrown
  2018-08-09  2:04 ` [PATCH 1/5] fs/locks: rename some lists and pointers NeilBrown
@ 2018-08-09  2:04 ` NeilBrown
  2018-08-09  2:04 ` [PATCH 3/5] fs/locks: change all *_conflict() functions to return a new enum NeilBrown
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 27+ messages in thread
From: NeilBrown @ 2018-08-09  2:04 UTC (permalink / raw)
  To: Jeff Layton, Alexander Viro
  Cc: J. Bruce Fields, Martin Wilck, linux-fsdevel, Frank Filz, linux-kernel

Currently, a lock can block pending requests, but all pending
requests are equal.  If lots of pending requests are
mutually exclusive, this means they will all be woken up
and all but one will fail.  This can hurt performance.

So we will allow pending requests to block other requests.
Only the first request will be woken, and it will wake the others.

This patch doesn't implement this fully, but prepares the way.

- It acknowledges that a request might be blocking other requests,
  and when the request is converted to a lock, those blocked
  requests are moved across.
- When a request is removed, all blocked requests are woken.
- When deadlock-detection looks for the lock which blocks a
  given request, we follow the chain of ->fl_blocker all
  the way to the top.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 fs/locks.c |   58 ++++++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 40 insertions(+), 18 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index 322491e70e41..b4812da2a374 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -408,9 +408,19 @@ void locks_copy_lock(struct file_lock *new, struct file_lock *fl)
 			fl->fl_ops->fl_copy_lock(new, fl);
 	}
 }
-
 EXPORT_SYMBOL(locks_copy_lock);
 
+static void locks_move_blocks(struct file_lock *new, struct file_lock *fl)
+{
+	struct file_lock *f;
+
+	spin_lock(&blocked_lock_lock);
+	list_splice_init(&fl->fl_blocked, &new->fl_blocked);
+	list_for_each_entry(f, &fl->fl_blocked, fl_block)
+		f->fl_blocker = new;
+	spin_unlock(&blocked_lock_lock);
+}
+
 static inline int flock_translate_cmd(int cmd) {
 	if (cmd & LOCK_MAND)
 		return cmd & (LOCK_MAND | LOCK_RW);
@@ -681,9 +691,25 @@ static void __locks_delete_block(struct file_lock *waiter)
 	waiter->fl_blocker = NULL;
 }
 
+static void __locks_wake_up_blocks(struct file_lock *blocker)
+{
+	while (!list_empty(&blocker->fl_blocked)) {
+		struct file_lock *waiter;
+
+		waiter = list_first_entry(&blocker->fl_blocked,
+				struct file_lock, fl_block);
+		__locks_delete_block(waiter);
+		if (waiter->fl_lmops && waiter->fl_lmops->lm_notify)
+			waiter->fl_lmops->lm_notify(waiter);
+		else
+			wake_up(&waiter->fl_wait);
+	}
+}
+
 static void locks_delete_block(struct file_lock *waiter)
 {
 	spin_lock(&blocked_lock_lock);
+	__locks_wake_up_blocks(waiter);
 	__locks_delete_block(waiter);
 	spin_unlock(&blocked_lock_lock);
 }
@@ -735,17 +761,7 @@ static void locks_wake_up_blocks(struct file_lock *blocker)
 		return;
 
 	spin_lock(&blocked_lock_lock);
-	while (!list_empty(&blocker->fl_blocked)) {
-		struct file_lock *waiter;
-
-		waiter = list_first_entry(&blocker->fl_blocked,
-				struct file_lock, fl_block);
-		__locks_delete_block(waiter);
-		if (waiter->fl_lmops && waiter->fl_lmops->lm_notify)
-			waiter->fl_lmops->lm_notify(waiter);
-		else
-			wake_up(&waiter->fl_wait);
-	}
+	__locks_wake_up_blocks(blocker);
 	spin_unlock(&blocked_lock_lock);
 }
 
@@ -888,8 +904,11 @@ static struct file_lock *what_owner_is_waiting_for(struct file_lock *block_fl)
 	struct file_lock *fl;
 
 	hash_for_each_possible(blocked_hash, fl, fl_link, posix_owner_key(block_fl)) {
-		if (posix_same_owner(fl, block_fl))
-			return fl->fl_blocker;
+		if (posix_same_owner(fl, block_fl)) {
+			while (fl->fl_blocker)
+				fl = fl->fl_blocker;
+			return fl;
+		}
 	}
 	return NULL;
 }
@@ -982,6 +1001,7 @@ static int flock_lock_inode(struct inode *inode, struct file_lock *request)
 	if (request->fl_flags & FL_ACCESS)
 		goto out;
 	locks_copy_lock(new_fl, request);
+	locks_move_blocks(new_fl, request);
 	locks_insert_lock_ctx(new_fl, &ctx->flc_flock);
 	new_fl = NULL;
 	error = 0;
@@ -1174,6 +1194,7 @@ static int posix_lock_inode(struct inode *inode, struct file_lock *request,
 			goto out;
 		}
 		locks_copy_lock(new_fl, request);
+		locks_move_blocks(new_fl, request);
 		locks_insert_lock_ctx(new_fl, &fl->fl_list);
 		fl = new_fl;
 		new_fl = NULL;
@@ -2582,13 +2603,14 @@ void locks_remove_file(struct file *filp)
 int
 posix_unblock_lock(struct file_lock *waiter)
 {
-	int status = 0;
+	int status = -ENOENT;
 
 	spin_lock(&blocked_lock_lock);
-	if (waiter->fl_blocker)
+	if (waiter->fl_blocker) {
+		__locks_wake_up_blocks(waiter);
 		__locks_delete_block(waiter);
-	else
-		status = -ENOENT;
+		status = 0;
+	}
 	spin_unlock(&blocked_lock_lock);
 	return status;
 }



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 3/5] fs/locks: change all *_conflict() functions to return a new enum.
  2018-08-09  2:04 [PATCH 0/5 - V2] locks: avoid thundering-herd wake-ups NeilBrown
  2018-08-09  2:04 ` [PATCH 1/5] fs/locks: rename some lists and pointers NeilBrown
  2018-08-09  2:04 ` [PATCH 2/5] fs/locks: allow a lock request to block other requests NeilBrown
@ 2018-08-09  2:04 ` NeilBrown
  2018-08-09 11:09   ` Jeff Layton
  2018-08-09 13:09   ` J. Bruce Fields
  2018-08-09  2:04 ` [PATCH 4/5] fs/locks: split out __locks_wake_one() NeilBrown
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 27+ messages in thread
From: NeilBrown @ 2018-08-09  2:04 UTC (permalink / raw)
  To: Jeff Layton, Alexander Viro
  Cc: J. Bruce Fields, Martin Wilck, linux-fsdevel, Frank Filz, linux-kernel

In a future patch we will need to differentiate between conflicts that
are "transitive" and those that aren't.
A "transitive" conflict is defined as one where any lock that
conflicts with the first (newly requested) lock would conflict with
the existing lock.

So change posix_locks_conflict(), flock_locks_conflict() (both
currently returning int) and leases_conflict() (currently returning
bool) to return "enum conflict".
Add locks_transitive_overlap() to make it possible to compute the
correct conflict for posix locks.

The FL_NO_CONFLICT value is zero, so current code which only tests the
truth value of these functions will still work the same way.

And convert some
   return (foo);
to
   return foo;

Signed-off-by: NeilBrown <neilb@suse.com>
---
 fs/locks.c |   64 ++++++++++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 49 insertions(+), 15 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index b4812da2a374..d06658b2dc7a 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -139,6 +139,16 @@
 #define IS_OFDLCK(fl)	(fl->fl_flags & FL_OFDLCK)
 #define IS_REMOTELCK(fl)	(fl->fl_pid <= 0)
 
+/* A transitive conflict is one where the first lock conflicts with
+ * the second lock, and any other lock that conflicts with the
+ * first lock, also conflicts with the second lock.
+ */
+enum conflict {
+	FL_NO_CONFLICT = 0,
+	FL_CONFLICT,
+	FL_TRANSITIVE_CONFLICT,
+};
+
 static inline bool is_remote_lock(struct file *filp)
 {
 	return likely(!(filp->f_path.dentry->d_sb->s_flags & SB_NOREMOTELOCK));
@@ -612,6 +622,15 @@ static inline int locks_overlap(struct file_lock *fl1, struct file_lock *fl2)
 		(fl2->fl_end >= fl1->fl_start));
 }
 
+/* Check for transitive-overlap - true if any lock that overlaps
+ * the first lock must overlap the seconds
+ */
+static inline bool locks_transitive_overlap(struct file_lock *fl1,
+					    struct file_lock *fl2)
+{
+	return (fl1->fl_start >= fl2->fl_start) &&
+		(fl1->fl_end <= fl2->fl_end);
+}
 /*
  * Check whether two locks have the same owner.
  */
@@ -793,47 +812,61 @@ locks_delete_lock_ctx(struct file_lock *fl, struct list_head *dispose)
 /* Determine if lock sys_fl blocks lock caller_fl. Common functionality
  * checks for shared/exclusive status of overlapping locks.
  */
-static int locks_conflict(struct file_lock *caller_fl, struct file_lock *sys_fl)
+static enum conflict locks_conflict(struct file_lock *caller_fl,
+				    struct file_lock *sys_fl)
 {
 	if (sys_fl->fl_type == F_WRLCK)
-		return 1;
+		return FL_TRANSITIVE_CONFLICT;
 	if (caller_fl->fl_type == F_WRLCK)
-		return 1;
-	return 0;
+		return FL_CONFLICT;
+	return FL_NO_CONFLICT;
 }
 
 /* Determine if lock sys_fl blocks lock caller_fl. POSIX specific
  * checking before calling the locks_conflict().
  */
-static int posix_locks_conflict(struct file_lock *caller_fl, struct file_lock *sys_fl)
+static enum conflict posix_locks_conflict(struct file_lock *caller_fl,
+					  struct file_lock *sys_fl)
 {
 	/* POSIX locks owned by the same process do not conflict with
 	 * each other.
 	 */
 	if (posix_same_owner(caller_fl, sys_fl))
-		return (0);
+		return FL_NO_CONFLICT;
 
 	/* Check whether they overlap */
 	if (!locks_overlap(caller_fl, sys_fl))
-		return 0;
+		return FL_NO_CONFLICT;
 
-	return (locks_conflict(caller_fl, sys_fl));
+	switch (locks_conflict(caller_fl, sys_fl)) {
+	default:
+	case FL_NO_CONFLICT:
+		return FL_NO_CONFLICT;
+	case FL_CONFLICT:
+		return FL_CONFLICT;
+	case FL_TRANSITIVE_CONFLICT:
+		if (locks_transitive_overlap(caller_fl, sys_fl))
+			return FL_TRANSITIVE_CONFLICT;
+		else
+			return FL_CONFLICT;
+	}
 }
 
 /* Determine if lock sys_fl blocks lock caller_fl. FLOCK specific
  * checking before calling the locks_conflict().
  */
-static int flock_locks_conflict(struct file_lock *caller_fl, struct file_lock *sys_fl)
+static enum conflict flock_locks_conflict(struct file_lock *caller_fl,
+					  struct file_lock *sys_fl)
 {
 	/* FLOCK locks referring to the same filp do not conflict with
 	 * each other.
 	 */
 	if (caller_fl->fl_file == sys_fl->fl_file)
-		return (0);
+		return FL_NO_CONFLICT;
 	if ((caller_fl->fl_type & LOCK_MAND) || (sys_fl->fl_type & LOCK_MAND))
-		return 0;
+		return FL_NO_CONFLICT;
 
-	return (locks_conflict(caller_fl, sys_fl));
+	return locks_conflict(caller_fl, sys_fl);
 }
 
 void
@@ -1435,12 +1468,13 @@ static void time_out_leases(struct inode *inode, struct list_head *dispose)
 	}
 }
 
-static bool leases_conflict(struct file_lock *lease, struct file_lock *breaker)
+static enum conflict leases_conflict(struct file_lock *lease,
+				     struct file_lock *breaker)
 {
 	if ((breaker->fl_flags & FL_LAYOUT) != (lease->fl_flags & FL_LAYOUT))
-		return false;
+		return FL_NO_CONFLICT;
 	if ((breaker->fl_flags & FL_DELEG) && (lease->fl_flags & FL_LEASE))
-		return false;
+		return FL_NO_CONFLICT;
 	return locks_conflict(breaker, lease);
 }
 



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 4/5] fs/locks: split out __locks_wake_one()
  2018-08-09  2:04 [PATCH 0/5 - V2] locks: avoid thundering-herd wake-ups NeilBrown
                   ` (2 preceding siblings ...)
  2018-08-09  2:04 ` [PATCH 3/5] fs/locks: change all *_conflict() functions to return a new enum NeilBrown
@ 2018-08-09  2:04 ` NeilBrown
  2018-08-09  2:04 ` [PATCH 5/5] fs/locks: create a tree of dependent requests NeilBrown
  2018-08-09 17:32 ` [PATCH 0/5 - V2] locks: avoid thundering-herd wake-ups J. Bruce Fields
  5 siblings, 0 replies; 27+ messages in thread
From: NeilBrown @ 2018-08-09  2:04 UTC (permalink / raw)
  To: Jeff Layton, Alexander Viro
  Cc: J. Bruce Fields, Martin Wilck, linux-fsdevel, Frank Filz, linux-kernel

This functionality will be useful in the next patch, so
split it out from __locks_wake_up_blocks().

Signed-off-by: NeilBrown <neilb@suse.com>
---
 fs/locks.c |   15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index d06658b2dc7a..fc64016d01ee 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -710,6 +710,15 @@ static void __locks_delete_block(struct file_lock *waiter)
 	waiter->fl_blocker = NULL;
 }
 
+static void __locks_wake_one(struct file_lock *waiter)
+{
+	__locks_delete_block(waiter);
+	if (waiter->fl_lmops && waiter->fl_lmops->lm_notify)
+		waiter->fl_lmops->lm_notify(waiter);
+	else
+		wake_up(&waiter->fl_wait);
+}
+
 static void __locks_wake_up_blocks(struct file_lock *blocker)
 {
 	while (!list_empty(&blocker->fl_blocked)) {
@@ -717,11 +726,7 @@ static void __locks_wake_up_blocks(struct file_lock *blocker)
 
 		waiter = list_first_entry(&blocker->fl_blocked,
 				struct file_lock, fl_block);
-		__locks_delete_block(waiter);
-		if (waiter->fl_lmops && waiter->fl_lmops->lm_notify)
-			waiter->fl_lmops->lm_notify(waiter);
-		else
-			wake_up(&waiter->fl_wait);
+		__locks_wake_one(waiter);
 	}
 }
 



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 5/5] fs/locks: create a tree of dependent requests.
  2018-08-09  2:04 [PATCH 0/5 - V2] locks: avoid thundering-herd wake-ups NeilBrown
                   ` (3 preceding siblings ...)
  2018-08-09  2:04 ` [PATCH 4/5] fs/locks: split out __locks_wake_one() NeilBrown
@ 2018-08-09  2:04 ` NeilBrown
  2018-08-09 11:17   ` Jeff Layton
  2018-08-09 14:13   ` J. Bruce Fields
  2018-08-09 17:32 ` [PATCH 0/5 - V2] locks: avoid thundering-herd wake-ups J. Bruce Fields
  5 siblings, 2 replies; 27+ messages in thread
From: NeilBrown @ 2018-08-09  2:04 UTC (permalink / raw)
  To: Jeff Layton, Alexander Viro
  Cc: J. Bruce Fields, Martin Wilck, linux-fsdevel, Frank Filz, linux-kernel

When we find an existing lock which conflicts with a request,
and the request wants to wait, we currently add the request
to a list.  When the lock is removed, the whole list is woken.
This can cause the thundering-herd problem.
To reduce the problem, we make use of the (new) fact that
a pending request can itself have a list of blocked requests.
When we find a conflict, we look through the existing blocked requests.
If any one of them blocks the new request, the new request is attached
below that request.
This way, when the lock is released, only a set of non-conflicting
locks will be woken.  The rest of the herd can stay asleep.

Reported-and-tested-by: Martin Wilck <mwilck@suse.de>
Signed-off-by: NeilBrown <neilb@suse.com>
---
 fs/locks.c |   69 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 63 insertions(+), 6 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index fc64016d01ee..17843feb6f5b 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -738,6 +738,39 @@ static void locks_delete_block(struct file_lock *waiter)
 	spin_unlock(&blocked_lock_lock);
 }
 
+static void wake_non_conflicts(struct file_lock *waiter, struct file_lock *blocker,
+			       enum conflict conflict(struct file_lock *,
+						      struct file_lock *))
+{
+	struct file_lock *parent = waiter;
+	struct file_lock *fl;
+	struct file_lock  *t;
+
+	fl = list_entry(&parent->fl_blocked, struct file_lock, fl_block);
+restart:
+	list_for_each_entry_safe_continue(fl, t, &parent->fl_blocked, fl_block) {
+		switch (conflict(fl, blocker)) {
+		default:
+		case FL_NO_CONFLICT:
+			__locks_wake_one(fl);
+			break;
+		case FL_CONFLICT:
+			/* Need to check children */
+			parent = fl;
+			fl = list_entry(&parent->fl_blocked, struct file_lock, fl_block);
+			goto restart;
+		case FL_TRANSITIVE_CONFLICT:
+			/* all children must also conflict, no need to check */
+			continue;
+		}
+	}
+	if (parent != waiter) {
+		parent = parent->fl_blocker;
+		fl = parent;
+		goto restart;
+	}
+}
+
 /* Insert waiter into blocker's block list.
  * We use a circular list so that processes can be easily woken up in
  * the order they blocked. The documentation doesn't require this but
@@ -747,11 +780,32 @@ static void locks_delete_block(struct file_lock *waiter)
  * fl_blocked list itself is protected by the blocked_lock_lock, but by ensuring
  * that the flc_lock is also held on insertions we can avoid taking the
  * blocked_lock_lock in some cases when we see that the fl_blocked list is empty.
+ *
+ * Rather than just adding to the list, we check for conflicts with any existing
+ * waiter, and add to that waiter instead.
+ * Thus wakeups don't happen until needed.
  */
 static void __locks_insert_block(struct file_lock *blocker,
-					struct file_lock *waiter)
+				 struct file_lock *waiter,
+				 enum conflict conflict(struct file_lock *,
+							struct file_lock *))
 {
+	struct file_lock *fl;
 	BUG_ON(!list_empty(&waiter->fl_block));
+
+	/* Any request in waiter->fl_blocked is know to conflict with
+	 * waiter, but it might not conflict with blocker.
+	 * If it doesn't, it needs to be woken now so it can find
+	 * somewhere else to wait, or possible it can get granted.
+	 */
+	if (conflict(waiter, blocker) != FL_TRANSITIVE_CONFLICT)
+		wake_non_conflicts(waiter, blocker, conflict);
+new_blocker:
+	list_for_each_entry(fl, &blocker->fl_blocked, fl_block)
+		if (conflict(fl, waiter)) {
+			blocker =  fl;
+			goto new_blocker;
+		}
 	waiter->fl_blocker = blocker;
 	list_add_tail(&waiter->fl_block, &blocker->fl_blocked);
 	if (IS_POSIX(blocker) && !IS_OFDLCK(blocker))
@@ -760,10 +814,12 @@ static void __locks_insert_block(struct file_lock *blocker,
 
 /* Must be called with flc_lock held. */
 static void locks_insert_block(struct file_lock *blocker,
-					struct file_lock *waiter)
+			       struct file_lock *waiter,
+			       enum conflict conflict(struct file_lock *,
+						      struct file_lock *))
 {
 	spin_lock(&blocked_lock_lock);
-	__locks_insert_block(blocker, waiter);
+	__locks_insert_block(blocker, waiter, conflict);
 	spin_unlock(&blocked_lock_lock);
 }
 
@@ -1033,7 +1089,7 @@ static int flock_lock_inode(struct inode *inode, struct file_lock *request)
 		if (!(request->fl_flags & FL_SLEEP))
 			goto out;
 		error = FILE_LOCK_DEFERRED;
-		locks_insert_block(fl, request);
+		locks_insert_block(fl, request, flock_locks_conflict);
 		goto out;
 	}
 	if (request->fl_flags & FL_ACCESS)
@@ -1107,7 +1163,8 @@ static int posix_lock_inode(struct inode *inode, struct file_lock *request,
 			spin_lock(&blocked_lock_lock);
 			if (likely(!posix_locks_deadlock(request, fl))) {
 				error = FILE_LOCK_DEFERRED;
-				__locks_insert_block(fl, request);
+				__locks_insert_block(fl, request,
+						     posix_locks_conflict);
 			}
 			spin_unlock(&blocked_lock_lock);
 			goto out;
@@ -1581,7 +1638,7 @@ int __break_lease(struct inode *inode, unsigned int mode, unsigned int type)
 		break_time -= jiffies;
 	if (break_time == 0)
 		break_time++;
-	locks_insert_block(fl, new_fl);
+	locks_insert_block(fl, new_fl, leases_conflict);
 	trace_break_lease_block(inode, new_fl);
 	spin_unlock(&ctx->flc_lock);
 	percpu_up_read_preempt_enable(&file_rwsem);



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/5] fs/locks: change all *_conflict() functions to return a new enum.
  2018-08-09  2:04 ` [PATCH 3/5] fs/locks: change all *_conflict() functions to return a new enum NeilBrown
@ 2018-08-09 11:09   ` Jeff Layton
  2018-08-09 13:09   ` J. Bruce Fields
  1 sibling, 0 replies; 27+ messages in thread
From: Jeff Layton @ 2018-08-09 11:09 UTC (permalink / raw)
  To: NeilBrown, Alexander Viro
  Cc: J. Bruce Fields, Martin Wilck, linux-fsdevel, Frank Filz, linux-kernel

On Thu, 2018-08-09 at 12:04 +1000, NeilBrown wrote:
> In a future patch we will need to differentiate between conflicts that
> are "transitive" and those that aren't.
> A "transitive" conflict is defined as one where any lock that
> conflicts with the first (newly requested) lock would conflict with
> the existing lock.
> 
> So change posix_locks_conflict(), flock_locks_conflict() (both
> currently returning int) and leases_conflict() (currently returning
> bool) to return "enum conflict".
> Add locks_transitive_overlap() to make it possible to compute the
> correct conflict for posix locks.
> 
> The FL_NO_CONFLICT value is zero, so current code which only tests the
> truth value of these functions will still work the same way.
> 
> And convert some
>    return (foo);
> to
>    return foo;
> 
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  fs/locks.c |   64 ++++++++++++++++++++++++++++++++++++++++++++++--------------
>  1 file changed, 49 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/locks.c b/fs/locks.c
> index b4812da2a374..d06658b2dc7a 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -139,6 +139,16 @@
>  #define IS_OFDLCK(fl)	(fl->fl_flags & FL_OFDLCK)
>  #define IS_REMOTELCK(fl)	(fl->fl_pid <= 0)
>  
> +/* A transitive conflict is one where the first lock conflicts with
> + * the second lock, and any other lock that conflicts with the
> + * first lock, also conflicts with the second lock.
> + */
> +enum conflict {
> +	FL_NO_CONFLICT = 0,
> +	FL_CONFLICT,
> +	FL_TRANSITIVE_CONFLICT,
> +};
> +
>  static inline bool is_remote_lock(struct file *filp)
>  {
>  	return likely(!(filp->f_path.dentry->d_sb->s_flags & SB_NOREMOTELOCK));
> @@ -612,6 +622,15 @@ static inline int locks_overlap(struct file_lock *fl1, struct file_lock *fl2)
>  		(fl2->fl_end >= fl1->fl_start));
>  }
>  
> +/* Check for transitive-overlap - true if any lock that overlaps
> + * the first lock must overlap the seconds
> + */
> +static inline bool locks_transitive_overlap(struct file_lock *fl1,
> +					    struct file_lock *fl2)
> +{
> +	return (fl1->fl_start >= fl2->fl_start) &&
> +		(fl1->fl_end <= fl2->fl_end);
> +}
>  /*
>   * Check whether two locks have the same owner.
>   */
> @@ -793,47 +812,61 @@ locks_delete_lock_ctx(struct file_lock *fl, struct list_head *dispose)
>  /* Determine if lock sys_fl blocks lock caller_fl. Common functionality
>   * checks for shared/exclusive status of overlapping locks.
>   */
> -static int locks_conflict(struct file_lock *caller_fl, struct file_lock *sys_fl)
> +static enum conflict locks_conflict(struct file_lock *caller_fl,
> +				    struct file_lock *sys_fl)
>  {
>  	if (sys_fl->fl_type == F_WRLCK)
> -		return 1;
> +		return FL_TRANSITIVE_CONFLICT;
>  	if (caller_fl->fl_type == F_WRLCK)
> -		return 1;
> -	return 0;
> +		return FL_CONFLICT;
> +	return FL_NO_CONFLICT;
>  }
>  
>  /* Determine if lock sys_fl blocks lock caller_fl. POSIX specific
>   * checking before calling the locks_conflict().
>   */
> -static int posix_locks_conflict(struct file_lock *caller_fl, struct file_lock *sys_fl)
> +static enum conflict posix_locks_conflict(struct file_lock *caller_fl,
> +					  struct file_lock *sys_fl)
>  {
>  	/* POSIX locks owned by the same process do not conflict with
>  	 * each other.
>  	 */
>  	if (posix_same_owner(caller_fl, sys_fl))
> -		return (0);
> +		return FL_NO_CONFLICT;
>  
>  	/* Check whether they overlap */
>  	if (!locks_overlap(caller_fl, sys_fl))
> -		return 0;
> +		return FL_NO_CONFLICT;
>  
> -	return (locks_conflict(caller_fl, sys_fl));
> +	switch (locks_conflict(caller_fl, sys_fl)) {
> +	default:

Maybe BUG or WARN here or something? locks_conflict should never return
values that aren't in the enum.

> +	case FL_NO_CONFLICT:
> +		return FL_NO_CONFLICT;
> +	case FL_CONFLICT:
> +		return FL_CONFLICT;
> +	case FL_TRANSITIVE_CONFLICT:
> +		if (locks_transitive_overlap(caller_fl, sys_fl))
> +			return FL_TRANSITIVE_CONFLICT;
> +		else
> +			return FL_CONFLICT;
> +	}
>  }
>  
>  /* Determine if lock sys_fl blocks lock caller_fl. FLOCK specific
>   * checking before calling the locks_conflict().
>   */
> -static int flock_locks_conflict(struct file_lock *caller_fl, struct file_lock *sys_fl)
> +static enum conflict flock_locks_conflict(struct file_lock *caller_fl,
> +					  struct file_lock *sys_fl)
>  {
>  	/* FLOCK locks referring to the same filp do not conflict with
>  	 * each other.
>  	 */
>  	if (caller_fl->fl_file == sys_fl->fl_file)
> -		return (0);
> +		return FL_NO_CONFLICT;
>  	if ((caller_fl->fl_type & LOCK_MAND) || (sys_fl->fl_type & LOCK_MAND))
> -		return 0;
> +		return FL_NO_CONFLICT;
>  
> -	return (locks_conflict(caller_fl, sys_fl));
> +	return locks_conflict(caller_fl, sys_fl);
>  }
>  
>  void
> @@ -1435,12 +1468,13 @@ static void time_out_leases(struct inode *inode, struct list_head *dispose)
>  	}
>  }
>  
> -static bool leases_conflict(struct file_lock *lease, struct file_lock *breaker)
> +static enum conflict leases_conflict(struct file_lock *lease,
> +				     struct file_lock *breaker)
>  {
>  	if ((breaker->fl_flags & FL_LAYOUT) != (lease->fl_flags & FL_LAYOUT))
> -		return false;
> +		return FL_NO_CONFLICT;
>  	if ((breaker->fl_flags & FL_DELEG) && (lease->fl_flags & FL_LEASE))
> -		return false;
> +		return FL_NO_CONFLICT;
>  	return locks_conflict(breaker, lease);
>  }
>  
> 
> 
-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 5/5] fs/locks: create a tree of dependent requests.
  2018-08-09  2:04 ` [PATCH 5/5] fs/locks: create a tree of dependent requests NeilBrown
@ 2018-08-09 11:17   ` Jeff Layton
  2018-08-09 23:25     ` NeilBrown
  2018-08-09 14:13   ` J. Bruce Fields
  1 sibling, 1 reply; 27+ messages in thread
From: Jeff Layton @ 2018-08-09 11:17 UTC (permalink / raw)
  To: NeilBrown, Alexander Viro
  Cc: J. Bruce Fields, Martin Wilck, linux-fsdevel, Frank Filz, linux-kernel

On Thu, 2018-08-09 at 12:04 +1000, NeilBrown wrote:
> When we find an existing lock which conflicts with a request,
> and the request wants to wait, we currently add the request
> to a list.  When the lock is removed, the whole list is woken.
> This can cause the thundering-herd problem.
> To reduce the problem, we make use of the (new) fact that
> a pending request can itself have a list of blocked requests.
> When we find a conflict, we look through the existing blocked requests.
> If any one of them blocks the new request, the new request is attached
> below that request.
> This way, when the lock is released, only a set of non-conflicting
> locks will be woken.  The rest of the herd can stay asleep.
> 
> Reported-and-tested-by: Martin Wilck <mwilck@suse.de>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  fs/locks.c |   69 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 63 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/locks.c b/fs/locks.c
> index fc64016d01ee..17843feb6f5b 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -738,6 +738,39 @@ static void locks_delete_block(struct file_lock *waiter)
>  	spin_unlock(&blocked_lock_lock);
>  }
>  
> +static void wake_non_conflicts(struct file_lock *waiter, struct file_lock *blocker,
> +			       enum conflict conflict(struct file_lock *,
> +						      struct file_lock *))
> +{
> +	struct file_lock *parent = waiter;
> +	struct file_lock *fl;
> +	struct file_lock  *t;
> +
> +	fl = list_entry(&parent->fl_blocked, struct file_lock, fl_block);
> +restart:
> +	list_for_each_entry_safe_continue(fl, t, &parent->fl_blocked, fl_block) {
> +		switch (conflict(fl, blocker)) {
> +		default:

BUG or WARN here too please.

> +		case FL_NO_CONFLICT:
> +			__locks_wake_one(fl);
> +			break;
> +		case FL_CONFLICT:
> +			/* Need to check children */
> +			parent = fl;
> +			fl = list_entry(&parent->fl_blocked, struct file_lock, fl_block);
> +			goto restart;
> +		case FL_TRANSITIVE_CONFLICT:
> +			/* all children must also conflict, no need to check */
> +			continue;
> +		}
> +	}
> +	if (parent != waiter) {
> +		parent = parent->fl_blocker;
> +		fl = parent;
> +		goto restart;
> +	}
> +}
> +
>  /* Insert waiter into blocker's block list.
>   * We use a circular list so that processes can be easily woken up in
>   * the order they blocked. The documentation doesn't require this but
> @@ -747,11 +780,32 @@ static void locks_delete_block(struct file_lock *waiter)
>   * fl_blocked list itself is protected by the blocked_lock_lock, but by ensuring
>   * that the flc_lock is also held on insertions we can avoid taking the
>   * blocked_lock_lock in some cases when we see that the fl_blocked list is empty.
> + *
> + * Rather than just adding to the list, we check for conflicts with any existing
> + * waiter, and add to that waiter instead.
> + * Thus wakeups don't happen until needed.
>   */
>  static void __locks_insert_block(struct file_lock *blocker,
> -					struct file_lock *waiter)
> +				 struct file_lock *waiter,
> +				 enum conflict conflict(struct file_lock *,
> +							struct file_lock *))
>  {
> +	struct file_lock *fl;
>  	BUG_ON(!list_empty(&waiter->fl_block));
> +
> +	/* Any request in waiter->fl_blocked is know to conflict with

"known"

> +	 * waiter, but it might not conflict with blocker.
> +	 * If it doesn't, it needs to be woken now so it can find
> +	 * somewhere else to wait, or possible it can get granted.

"possibly it can be"

> +	 */
> +	if (conflict(waiter, blocker) != FL_TRANSITIVE_CONFLICT)
> +		wake_non_conflicts(waiter, blocker, conflict);
> +new_blocker:
> +	list_for_each_entry(fl, &blocker->fl_blocked, fl_block)
> +		if (conflict(fl, waiter)) {
> +			blocker =  fl;
> +			goto new_blocker;
> +		}
>  
> > 	waiter->fl_blocker = blocker;
>  	list_add_tail(&waiter->fl_block, &blocker->fl_blocked);
>  	if (IS_POSIX(blocker) && !IS_OFDLCK(blocker))

I wonder if it might be better to insert the blocker first before waking
up other waiters? Consider that anything awoken will end up contending
for the flc_lock that is held by "current" at this point. Doing most of
what you need to get done before waking them might mean less spinning in
other tasks.

> @@ -760,10 +814,12 @@ static void __locks_insert_block(struct file_lock *blocker,
>  
>  /* Must be called with flc_lock held. */
>  static void locks_insert_block(struct file_lock *blocker,
> -					struct file_lock *waiter)
> +			       struct file_lock *waiter,
> +			       enum conflict conflict(struct file_lock *,
> +						      struct file_lock *))
>  {
>  	spin_lock(&blocked_lock_lock);
> -	__locks_insert_block(blocker, waiter);
> +	__locks_insert_block(blocker, waiter, conflict);
>  	spin_unlock(&blocked_lock_lock);
>  }
>  
> @@ -1033,7 +1089,7 @@ static int flock_lock_inode(struct inode *inode, struct file_lock *request)
>  		if (!(request->fl_flags & FL_SLEEP))
>  			goto out;
>  		error = FILE_LOCK_DEFERRED;
> -		locks_insert_block(fl, request);
> +		locks_insert_block(fl, request, flock_locks_conflict);
>  		goto out;
>  	}
>  	if (request->fl_flags & FL_ACCESS)
> @@ -1107,7 +1163,8 @@ static int posix_lock_inode(struct inode *inode, struct file_lock *request,
>  			spin_lock(&blocked_lock_lock);
>  			if (likely(!posix_locks_deadlock(request, fl))) {
>  				error = FILE_LOCK_DEFERRED;
> -				__locks_insert_block(fl, request);
> +				__locks_insert_block(fl, request,
> +						     posix_locks_conflict);
>  			}
>  			spin_unlock(&blocked_lock_lock);
>  			goto out;
> @@ -1581,7 +1638,7 @@ int __break_lease(struct inode *inode, unsigned int mode, unsigned int type)
>  		break_time -= jiffies;
>  	if (break_time == 0)
>  		break_time++;
> -	locks_insert_block(fl, new_fl);
> +	locks_insert_block(fl, new_fl, leases_conflict);
>  	trace_break_lease_block(inode, new_fl);
>  	spin_unlock(&ctx->flc_lock);
>  	percpu_up_read_preempt_enable(&file_rwsem);
> 
> 

-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/5] fs/locks: change all *_conflict() functions to return a new enum.
  2018-08-09  2:04 ` [PATCH 3/5] fs/locks: change all *_conflict() functions to return a new enum NeilBrown
  2018-08-09 11:09   ` Jeff Layton
@ 2018-08-09 13:09   ` J. Bruce Fields
  2018-08-09 23:40     ` NeilBrown
  1 sibling, 1 reply; 27+ messages in thread
From: J. Bruce Fields @ 2018-08-09 13:09 UTC (permalink / raw)
  To: NeilBrown
  Cc: Jeff Layton, Alexander Viro, Martin Wilck, linux-fsdevel,
	Frank Filz, linux-kernel

On Thu, Aug 09, 2018 at 12:04:41PM +1000, NeilBrown wrote:
> In a future patch we will need to differentiate between conflicts that
> are "transitive" and those that aren't.
> A "transitive" conflict is defined as one where any lock that
> conflicts with the first (newly requested) lock would conflict with
> the existing lock.
> 
> So change posix_locks_conflict(), flock_locks_conflict() (both
> currently returning int) and leases_conflict() (currently returning
> bool) to return "enum conflict".
> Add locks_transitive_overlap() to make it possible to compute the
> correct conflict for posix locks.
> 
> The FL_NO_CONFLICT value is zero, so current code which only tests the
> truth value of these functions will still work the same way.
> 
> And convert some
>    return (foo);
> to
>    return foo;
> 
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  fs/locks.c |   64 ++++++++++++++++++++++++++++++++++++++++++++++--------------
>  1 file changed, 49 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/locks.c b/fs/locks.c
> index b4812da2a374..d06658b2dc7a 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -139,6 +139,16 @@
>  #define IS_OFDLCK(fl)	(fl->fl_flags & FL_OFDLCK)
>  #define IS_REMOTELCK(fl)	(fl->fl_pid <= 0)
>  
> +/* A transitive conflict is one where the first lock conflicts with
> + * the second lock, and any other lock that conflicts with the
> + * first lock, also conflicts with the second lock.
> + */
> +enum conflict {
> +	FL_NO_CONFLICT = 0,
> +	FL_CONFLICT,
> +	FL_TRANSITIVE_CONFLICT,
> +};
> +
>  static inline bool is_remote_lock(struct file *filp)
>  {
>  	return likely(!(filp->f_path.dentry->d_sb->s_flags & SB_NOREMOTELOCK));
> @@ -612,6 +622,15 @@ static inline int locks_overlap(struct file_lock *fl1, struct file_lock *fl2)
>  		(fl2->fl_end >= fl1->fl_start));
>  }
>  
> +/* Check for transitive-overlap - true if any lock that overlaps
> + * the first lock must overlap the seconds
> + */
> +static inline bool locks_transitive_overlap(struct file_lock *fl1,
> +					    struct file_lock *fl2)
> +{
> +	return (fl1->fl_start >= fl2->fl_start) &&
> +		(fl1->fl_end <= fl2->fl_end);
> +}
>  /*
>   * Check whether two locks have the same owner.
>   */
> @@ -793,47 +812,61 @@ locks_delete_lock_ctx(struct file_lock *fl, struct list_head *dispose)
>  /* Determine if lock sys_fl blocks lock caller_fl. Common functionality
>   * checks for shared/exclusive status of overlapping locks.
>   */
> -static int locks_conflict(struct file_lock *caller_fl, struct file_lock *sys_fl)
> +static enum conflict locks_conflict(struct file_lock *caller_fl,
> +				    struct file_lock *sys_fl)
>  {
>  	if (sys_fl->fl_type == F_WRLCK)
> -		return 1;
> +		return FL_TRANSITIVE_CONFLICT;
>  	if (caller_fl->fl_type == F_WRLCK)
> -		return 1;
> -	return 0;
> +		return FL_CONFLICT;
> +	return FL_NO_CONFLICT;
>  }
>  
>  /* Determine if lock sys_fl blocks lock caller_fl. POSIX specific
>   * checking before calling the locks_conflict().
>   */
> -static int posix_locks_conflict(struct file_lock *caller_fl, struct file_lock *sys_fl)
> +static enum conflict posix_locks_conflict(struct file_lock *caller_fl,
> +					  struct file_lock *sys_fl)
>  {
>  	/* POSIX locks owned by the same process do not conflict with
>  	 * each other.
>  	 */
>  	if (posix_same_owner(caller_fl, sys_fl))
> -		return (0);
> +		return FL_NO_CONFLICT;
>  
>  	/* Check whether they overlap */
>  	if (!locks_overlap(caller_fl, sys_fl))
> -		return 0;
> +		return FL_NO_CONFLICT;
>  
> -	return (locks_conflict(caller_fl, sys_fl));
> +	switch (locks_conflict(caller_fl, sys_fl)) {
> +	default:
> +	case FL_NO_CONFLICT:
> +		return FL_NO_CONFLICT;
> +	case FL_CONFLICT:
> +		return FL_CONFLICT;

If I'm understanding the logic here and in locks_conflict correctly,
you're telling me that in the case where sys_fl is a read lock, and
caller_fl is a write lock, then any lock which conflicts with sys_fl
must conflict with caller_fl?  Or do I have that backwards?  It doesn't
sound right, in any case.

--b.

> +	case FL_TRANSITIVE_CONFLICT:
> +		if (locks_transitive_overlap(caller_fl, sys_fl))
> +			return FL_TRANSITIVE_CONFLICT;
> +		else
> +			return FL_CONFLICT;
> +	}
>  }
>  
>  /* Determine if lock sys_fl blocks lock caller_fl. FLOCK specific
>   * checking before calling the locks_conflict().
>   */
> -static int flock_locks_conflict(struct file_lock *caller_fl, struct file_lock *sys_fl)
> +static enum conflict flock_locks_conflict(struct file_lock *caller_fl,
> +					  struct file_lock *sys_fl)
>  {
>  	/* FLOCK locks referring to the same filp do not conflict with
>  	 * each other.
>  	 */
>  	if (caller_fl->fl_file == sys_fl->fl_file)
> -		return (0);
> +		return FL_NO_CONFLICT;
>  	if ((caller_fl->fl_type & LOCK_MAND) || (sys_fl->fl_type & LOCK_MAND))
> -		return 0;
> +		return FL_NO_CONFLICT;
>  
> -	return (locks_conflict(caller_fl, sys_fl));
> +	return locks_conflict(caller_fl, sys_fl);
>  }
>  
>  void
> @@ -1435,12 +1468,13 @@ static void time_out_leases(struct inode *inode, struct list_head *dispose)
>  	}
>  }
>  
> -static bool leases_conflict(struct file_lock *lease, struct file_lock *breaker)
> +static enum conflict leases_conflict(struct file_lock *lease,
> +				     struct file_lock *breaker)
>  {
>  	if ((breaker->fl_flags & FL_LAYOUT) != (lease->fl_flags & FL_LAYOUT))
> -		return false;
> +		return FL_NO_CONFLICT;
>  	if ((breaker->fl_flags & FL_DELEG) && (lease->fl_flags & FL_LEASE))
> -		return false;
> +		return FL_NO_CONFLICT;
>  	return locks_conflict(breaker, lease);
>  }
>  
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 5/5] fs/locks: create a tree of dependent requests.
  2018-08-09  2:04 ` [PATCH 5/5] fs/locks: create a tree of dependent requests NeilBrown
  2018-08-09 11:17   ` Jeff Layton
@ 2018-08-09 14:13   ` J. Bruce Fields
  2018-08-09 22:19     ` NeilBrown
  1 sibling, 1 reply; 27+ messages in thread
From: J. Bruce Fields @ 2018-08-09 14:13 UTC (permalink / raw)
  To: NeilBrown
  Cc: Jeff Layton, Alexander Viro, Martin Wilck, linux-fsdevel,
	Frank Filz, linux-kernel

On Thu, Aug 09, 2018 at 12:04:41PM +1000, NeilBrown wrote:
> When we find an existing lock which conflicts with a request,
> and the request wants to wait, we currently add the request
> to a list.  When the lock is removed, the whole list is woken.
> This can cause the thundering-herd problem.
> To reduce the problem, we make use of the (new) fact that
> a pending request can itself have a list of blocked requests.
> When we find a conflict, we look through the existing blocked requests.
> If any one of them blocks the new request, the new request is attached
> below that request.
> This way, when the lock is released, only a set of non-conflicting
> locks will be woken.  The rest of the herd can stay asleep.

That that's not true any more--some of the locks you wake may conflict
with each other.  Is that right?  Which is fine (the possibility of
thundering herds in weird overlapping-range cases probably isn't a big
deal).  I just want to make sure I understand....

I think you could simplify the code a lot by maintaining the tree so
that it always satisfies the condition that waiters are always strictly
"weaker" than their descendents, so that finding a conflict with a
waiter is always enough to know that the descendents also conflict.

So, when you put a waiter to sleep, you don't add it below a child
unless it's "stronger" than the child.

You give up the property that siblings don't conflict, but again that
just means thundering herds in weird cases, which is OK.

--b.

> 
> Reported-and-tested-by: Martin Wilck <mwilck@suse.de>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  fs/locks.c |   69 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 63 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/locks.c b/fs/locks.c
> index fc64016d01ee..17843feb6f5b 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -738,6 +738,39 @@ static void locks_delete_block(struct file_lock *waiter)
>  	spin_unlock(&blocked_lock_lock);
>  }
>  
> +static void wake_non_conflicts(struct file_lock *waiter, struct file_lock *blocker,
> +			       enum conflict conflict(struct file_lock *,
> +						      struct file_lock *))
> +{
> +	struct file_lock *parent = waiter;
> +	struct file_lock *fl;
> +	struct file_lock  *t;
> +
> +	fl = list_entry(&parent->fl_blocked, struct file_lock, fl_block);
> +restart:
> +	list_for_each_entry_safe_continue(fl, t, &parent->fl_blocked, fl_block) {
> +		switch (conflict(fl, blocker)) {
> +		default:
> +		case FL_NO_CONFLICT:
> +			__locks_wake_one(fl);
> +			break;
> +		case FL_CONFLICT:
> +			/* Need to check children */
> +			parent = fl;
> +			fl = list_entry(&parent->fl_blocked, struct file_lock, fl_block);
> +			goto restart;
> +		case FL_TRANSITIVE_CONFLICT:
> +			/* all children must also conflict, no need to check */
> +			continue;
> +		}
> +	}
> +	if (parent != waiter) {
> +		parent = parent->fl_blocker;
> +		fl = parent;
> +		goto restart;
> +	}
> +}
> +
>  /* Insert waiter into blocker's block list.
>   * We use a circular list so that processes can be easily woken up in
>   * the order they blocked. The documentation doesn't require this but
> @@ -747,11 +780,32 @@ static void locks_delete_block(struct file_lock *waiter)
>   * fl_blocked list itself is protected by the blocked_lock_lock, but by ensuring
>   * that the flc_lock is also held on insertions we can avoid taking the
>   * blocked_lock_lock in some cases when we see that the fl_blocked list is empty.
> + *
> + * Rather than just adding to the list, we check for conflicts with any existing
> + * waiter, and add to that waiter instead.
> + * Thus wakeups don't happen until needed.
>   */
>  static void __locks_insert_block(struct file_lock *blocker,
> -					struct file_lock *waiter)
> +				 struct file_lock *waiter,
> +				 enum conflict conflict(struct file_lock *,
> +							struct file_lock *))
>  {
> +	struct file_lock *fl;
>  	BUG_ON(!list_empty(&waiter->fl_block));
> +
> +	/* Any request in waiter->fl_blocked is know to conflict with
> +	 * waiter, but it might not conflict with blocker.
> +	 * If it doesn't, it needs to be woken now so it can find
> +	 * somewhere else to wait, or possible it can get granted.
> +	 */
> +	if (conflict(waiter, blocker) != FL_TRANSITIVE_CONFLICT)
> +		wake_non_conflicts(waiter, blocker, conflict);
> +new_blocker:
> +	list_for_each_entry(fl, &blocker->fl_blocked, fl_block)
> +		if (conflict(fl, waiter)) {
> +			blocker =  fl;
> +			goto new_blocker;
> +		}
>  	waiter->fl_blocker = blocker;
>  	list_add_tail(&waiter->fl_block, &blocker->fl_blocked);
>  	if (IS_POSIX(blocker) && !IS_OFDLCK(blocker))
> @@ -760,10 +814,12 @@ static void __locks_insert_block(struct file_lock *blocker,
>  
>  /* Must be called with flc_lock held. */
>  static void locks_insert_block(struct file_lock *blocker,
> -					struct file_lock *waiter)
> +			       struct file_lock *waiter,
> +			       enum conflict conflict(struct file_lock *,
> +						      struct file_lock *))
>  {
>  	spin_lock(&blocked_lock_lock);
> -	__locks_insert_block(blocker, waiter);
> +	__locks_insert_block(blocker, waiter, conflict);
>  	spin_unlock(&blocked_lock_lock);
>  }
>  
> @@ -1033,7 +1089,7 @@ static int flock_lock_inode(struct inode *inode, struct file_lock *request)
>  		if (!(request->fl_flags & FL_SLEEP))
>  			goto out;
>  		error = FILE_LOCK_DEFERRED;
> -		locks_insert_block(fl, request);
> +		locks_insert_block(fl, request, flock_locks_conflict);
>  		goto out;
>  	}
>  	if (request->fl_flags & FL_ACCESS)
> @@ -1107,7 +1163,8 @@ static int posix_lock_inode(struct inode *inode, struct file_lock *request,
>  			spin_lock(&blocked_lock_lock);
>  			if (likely(!posix_locks_deadlock(request, fl))) {
>  				error = FILE_LOCK_DEFERRED;
> -				__locks_insert_block(fl, request);
> +				__locks_insert_block(fl, request,
> +						     posix_locks_conflict);
>  			}
>  			spin_unlock(&blocked_lock_lock);
>  			goto out;
> @@ -1581,7 +1638,7 @@ int __break_lease(struct inode *inode, unsigned int mode, unsigned int type)
>  		break_time -= jiffies;
>  	if (break_time == 0)
>  		break_time++;
> -	locks_insert_block(fl, new_fl);
> +	locks_insert_block(fl, new_fl, leases_conflict);
>  	trace_break_lease_block(inode, new_fl);
>  	spin_unlock(&ctx->flc_lock);
>  	percpu_up_read_preempt_enable(&file_rwsem);
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/5 - V2] locks: avoid thundering-herd wake-ups
  2018-08-09  2:04 [PATCH 0/5 - V2] locks: avoid thundering-herd wake-ups NeilBrown
                   ` (4 preceding siblings ...)
  2018-08-09  2:04 ` [PATCH 5/5] fs/locks: create a tree of dependent requests NeilBrown
@ 2018-08-09 17:32 ` J. Bruce Fields
  2018-08-09 22:12   ` NeilBrown
  5 siblings, 1 reply; 27+ messages in thread
From: J. Bruce Fields @ 2018-08-09 17:32 UTC (permalink / raw)
  To: NeilBrown
  Cc: Jeff Layton, Alexander Viro, Martin Wilck, linux-fsdevel,
	Frank Filz, linux-kernel

I think there's also a problem with multiple tasks sharing the same
lock owner.

So, all locks are exclusive locks for the same range.  We have four
tasks.  Tasks 1 and 4 share the same owner, the others' owners are
distinct.

	- Task 1 gets a lock.
	- Task 2 gets a conflicting lock.
	- Task 3 gets another conflicting lock.  So now we the tree is
		3->2->1.
	- Task 1's lock is released.
	- Before task 2 is scheduled, task 4 acquires a new lock.
	- Task 2 waits on task 4's lock, we now have
		3->2->4.

Task 3 shouldn't be waiting--the lock it's requesting has the same owner
as the lock task 4 holds--but we fail to wake up task 3.

--b.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/5 - V2] locks: avoid thundering-herd wake-ups
  2018-08-09 17:32 ` [PATCH 0/5 - V2] locks: avoid thundering-herd wake-ups J. Bruce Fields
@ 2018-08-09 22:12   ` NeilBrown
  2018-08-10  0:29     ` J. Bruce Fields
  0 siblings, 1 reply; 27+ messages in thread
From: NeilBrown @ 2018-08-09 22:12 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Jeff Layton, Alexander Viro, Martin Wilck, linux-fsdevel,
	Frank Filz, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1103 bytes --]

On Thu, Aug 09 2018, J. Bruce Fields wrote:

> I think there's also a problem with multiple tasks sharing the same
> lock owner.
>
> So, all locks are exclusive locks for the same range.  We have four
> tasks.  Tasks 1 and 4 share the same owner, the others' owners are
> distinct.
>
> 	- Task 1 gets a lock.
> 	- Task 2 gets a conflicting lock.
> 	- Task 3 gets another conflicting lock.  So now we the tree is
> 		3->2->1.
> 	- Task 1's lock is released.
> 	- Before task 2 is scheduled, task 4 acquires a new lock.
> 	- Task 2 waits on task 4's lock, we now have
> 		3->2->4.
>
> Task 3 shouldn't be waiting--the lock it's requesting has the same owner
> as the lock task 4 holds--but we fail to wake up task 3.

So task 1 and task 4 are threads in the one process - OK.
Tasks 2 and 3 are threads in two other processes.

So 2 and 3 conflict with either 1 or 4 equally - why should task 3 be
woken?

I suspect you got the numbers bit mixed up, but in any case, the
"conflict()" function that is passed around takes ownership into account
when assessing if one lock conflicts with another.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 5/5] fs/locks: create a tree of dependent requests.
  2018-08-09 14:13   ` J. Bruce Fields
@ 2018-08-09 22:19     ` NeilBrown
  2018-08-10  0:36       ` J. Bruce Fields
  0 siblings, 1 reply; 27+ messages in thread
From: NeilBrown @ 2018-08-09 22:19 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Jeff Layton, Alexander Viro, Martin Wilck, linux-fsdevel,
	Frank Filz, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 7178 bytes --]

On Thu, Aug 09 2018, J. Bruce Fields wrote:

> On Thu, Aug 09, 2018 at 12:04:41PM +1000, NeilBrown wrote:
>> When we find an existing lock which conflicts with a request,
>> and the request wants to wait, we currently add the request
>> to a list.  When the lock is removed, the whole list is woken.
>> This can cause the thundering-herd problem.
>> To reduce the problem, we make use of the (new) fact that
>> a pending request can itself have a list of blocked requests.
>> When we find a conflict, we look through the existing blocked requests.
>> If any one of them blocks the new request, the new request is attached
>> below that request.
>> This way, when the lock is released, only a set of non-conflicting
>> locks will be woken.  The rest of the herd can stay asleep.
>
> That that's not true any more--some of the locks you wake may conflict
> with each other.  Is that right?  Which is fine (the possibility of
> thundering herds in weird overlapping-range cases probably isn't a big
> deal).  I just want to make sure I understand....

Yes, you are correct.
Lock waiters will be woken if they were directly blocked by a lock that
has been released, if they were blocked (directly or indirectly) by a
lock which is now blocked by a lock that they don't conflict with.
The first set will be mutually non-conflicting.

>
> I think you could simplify the code a lot by maintaining the tree so
> that it always satisfies the condition that waiters are always strictly
> "weaker" than their descendents, so that finding a conflict with a
> waiter is always enough to know that the descendents also conflict.

Can you define "weaker" please.
I suspect it is a partial ordering, in which case a tree would normally
be more appropriate than trying to find a total ordering.

Thanks,
NeilBrown

>
> So, when you put a waiter to sleep, you don't add it below a child
> unless it's "stronger" than the child.
>
> You give up the property that siblings don't conflict, but again that
> just means thundering herds in weird cases, which is OK.
>
> --b.
>
>> 
>> Reported-and-tested-by: Martin Wilck <mwilck@suse.de>
>> Signed-off-by: NeilBrown <neilb@suse.com>
>> ---
>>  fs/locks.c |   69 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
>>  1 file changed, 63 insertions(+), 6 deletions(-)
>> 
>> diff --git a/fs/locks.c b/fs/locks.c
>> index fc64016d01ee..17843feb6f5b 100644
>> --- a/fs/locks.c
>> +++ b/fs/locks.c
>> @@ -738,6 +738,39 @@ static void locks_delete_block(struct file_lock *waiter)
>>  	spin_unlock(&blocked_lock_lock);
>>  }
>>  
>> +static void wake_non_conflicts(struct file_lock *waiter, struct file_lock *blocker,
>> +			       enum conflict conflict(struct file_lock *,
>> +						      struct file_lock *))
>> +{
>> +	struct file_lock *parent = waiter;
>> +	struct file_lock *fl;
>> +	struct file_lock  *t;
>> +
>> +	fl = list_entry(&parent->fl_blocked, struct file_lock, fl_block);
>> +restart:
>> +	list_for_each_entry_safe_continue(fl, t, &parent->fl_blocked, fl_block) {
>> +		switch (conflict(fl, blocker)) {
>> +		default:
>> +		case FL_NO_CONFLICT:
>> +			__locks_wake_one(fl);
>> +			break;
>> +		case FL_CONFLICT:
>> +			/* Need to check children */
>> +			parent = fl;
>> +			fl = list_entry(&parent->fl_blocked, struct file_lock, fl_block);
>> +			goto restart;
>> +		case FL_TRANSITIVE_CONFLICT:
>> +			/* all children must also conflict, no need to check */
>> +			continue;
>> +		}
>> +	}
>> +	if (parent != waiter) {
>> +		parent = parent->fl_blocker;
>> +		fl = parent;
>> +		goto restart;
>> +	}
>> +}
>> +
>>  /* Insert waiter into blocker's block list.
>>   * We use a circular list so that processes can be easily woken up in
>>   * the order they blocked. The documentation doesn't require this but
>> @@ -747,11 +780,32 @@ static void locks_delete_block(struct file_lock *waiter)
>>   * fl_blocked list itself is protected by the blocked_lock_lock, but by ensuring
>>   * that the flc_lock is also held on insertions we can avoid taking the
>>   * blocked_lock_lock in some cases when we see that the fl_blocked list is empty.
>> + *
>> + * Rather than just adding to the list, we check for conflicts with any existing
>> + * waiter, and add to that waiter instead.
>> + * Thus wakeups don't happen until needed.
>>   */
>>  static void __locks_insert_block(struct file_lock *blocker,
>> -					struct file_lock *waiter)
>> +				 struct file_lock *waiter,
>> +				 enum conflict conflict(struct file_lock *,
>> +							struct file_lock *))
>>  {
>> +	struct file_lock *fl;
>>  	BUG_ON(!list_empty(&waiter->fl_block));
>> +
>> +	/* Any request in waiter->fl_blocked is know to conflict with
>> +	 * waiter, but it might not conflict with blocker.
>> +	 * If it doesn't, it needs to be woken now so it can find
>> +	 * somewhere else to wait, or possible it can get granted.
>> +	 */
>> +	if (conflict(waiter, blocker) != FL_TRANSITIVE_CONFLICT)
>> +		wake_non_conflicts(waiter, blocker, conflict);
>> +new_blocker:
>> +	list_for_each_entry(fl, &blocker->fl_blocked, fl_block)
>> +		if (conflict(fl, waiter)) {
>> +			blocker =  fl;
>> +			goto new_blocker;
>> +		}
>>  	waiter->fl_blocker = blocker;
>>  	list_add_tail(&waiter->fl_block, &blocker->fl_blocked);
>>  	if (IS_POSIX(blocker) && !IS_OFDLCK(blocker))
>> @@ -760,10 +814,12 @@ static void __locks_insert_block(struct file_lock *blocker,
>>  
>>  /* Must be called with flc_lock held. */
>>  static void locks_insert_block(struct file_lock *blocker,
>> -					struct file_lock *waiter)
>> +			       struct file_lock *waiter,
>> +			       enum conflict conflict(struct file_lock *,
>> +						      struct file_lock *))
>>  {
>>  	spin_lock(&blocked_lock_lock);
>> -	__locks_insert_block(blocker, waiter);
>> +	__locks_insert_block(blocker, waiter, conflict);
>>  	spin_unlock(&blocked_lock_lock);
>>  }
>>  
>> @@ -1033,7 +1089,7 @@ static int flock_lock_inode(struct inode *inode, struct file_lock *request)
>>  		if (!(request->fl_flags & FL_SLEEP))
>>  			goto out;
>>  		error = FILE_LOCK_DEFERRED;
>> -		locks_insert_block(fl, request);
>> +		locks_insert_block(fl, request, flock_locks_conflict);
>>  		goto out;
>>  	}
>>  	if (request->fl_flags & FL_ACCESS)
>> @@ -1107,7 +1163,8 @@ static int posix_lock_inode(struct inode *inode, struct file_lock *request,
>>  			spin_lock(&blocked_lock_lock);
>>  			if (likely(!posix_locks_deadlock(request, fl))) {
>>  				error = FILE_LOCK_DEFERRED;
>> -				__locks_insert_block(fl, request);
>> +				__locks_insert_block(fl, request,
>> +						     posix_locks_conflict);
>>  			}
>>  			spin_unlock(&blocked_lock_lock);
>>  			goto out;
>> @@ -1581,7 +1638,7 @@ int __break_lease(struct inode *inode, unsigned int mode, unsigned int type)
>>  		break_time -= jiffies;
>>  	if (break_time == 0)
>>  		break_time++;
>> -	locks_insert_block(fl, new_fl);
>> +	locks_insert_block(fl, new_fl, leases_conflict);
>>  	trace_break_lease_block(inode, new_fl);
>>  	spin_unlock(&ctx->flc_lock);
>>  	percpu_up_read_preempt_enable(&file_rwsem);
>> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 5/5] fs/locks: create a tree of dependent requests.
  2018-08-09 11:17   ` Jeff Layton
@ 2018-08-09 23:25     ` NeilBrown
  0 siblings, 0 replies; 27+ messages in thread
From: NeilBrown @ 2018-08-09 23:25 UTC (permalink / raw)
  To: Jeff Layton, Alexander Viro
  Cc: J. Bruce Fields, Martin Wilck, linux-fsdevel, Frank Filz, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 7806 bytes --]

On Thu, Aug 09 2018, Jeff Layton wrote:

> On Thu, 2018-08-09 at 12:04 +1000, NeilBrown wrote:
>> When we find an existing lock which conflicts with a request,
>> and the request wants to wait, we currently add the request
>> to a list.  When the lock is removed, the whole list is woken.
>> This can cause the thundering-herd problem.
>> To reduce the problem, we make use of the (new) fact that
>> a pending request can itself have a list of blocked requests.
>> When we find a conflict, we look through the existing blocked requests.
>> If any one of them blocks the new request, the new request is attached
>> below that request.
>> This way, when the lock is released, only a set of non-conflicting
>> locks will be woken.  The rest of the herd can stay asleep.
>> 
>> Reported-and-tested-by: Martin Wilck <mwilck@suse.de>
>> Signed-off-by: NeilBrown <neilb@suse.com>
>> ---
>>  fs/locks.c |   69 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
>>  1 file changed, 63 insertions(+), 6 deletions(-)
>> 
>> diff --git a/fs/locks.c b/fs/locks.c
>> index fc64016d01ee..17843feb6f5b 100644
>> --- a/fs/locks.c
>> +++ b/fs/locks.c
>> @@ -738,6 +738,39 @@ static void locks_delete_block(struct file_lock *waiter)
>>  	spin_unlock(&blocked_lock_lock);
>>  }
>>  
>> +static void wake_non_conflicts(struct file_lock *waiter, struct file_lock *blocker,
>> +			       enum conflict conflict(struct file_lock *,
>> +						      struct file_lock *))
>> +{
>> +	struct file_lock *parent = waiter;
>> +	struct file_lock *fl;
>> +	struct file_lock  *t;
>> +
>> +	fl = list_entry(&parent->fl_blocked, struct file_lock, fl_block);
>> +restart:
>> +	list_for_each_entry_safe_continue(fl, t, &parent->fl_blocked, fl_block) {
>> +		switch (conflict(fl, blocker)) {
>> +		default:
>
> BUG or WARN here too please.

Maybe .... I'd rather not have the default case at all.
I can remove this one, but if I remove the next one, gcc complains
../fs/locks.c: In function ‘posix_locks_conflict’:
../fs/locks.c:912:1: warning: control reaches end of non-void function [-Wreturn-type]

event though control cannot reach the end of the function.
Maybe:
	switch (locks_conflict(caller_fl, sys_fl)) {
	default:
		WARN(1, "locks_conflict returned impossible value");
		/* fallthrough */
	case FL_NO_CONFLICT:


>
>> +		case FL_NO_CONFLICT:
>> +			__locks_wake_one(fl);
>> +			break;
>> +		case FL_CONFLICT:
>> +			/* Need to check children */
>> +			parent = fl;
>> +			fl = list_entry(&parent->fl_blocked, struct file_lock, fl_block);
>> +			goto restart;
>> +		case FL_TRANSITIVE_CONFLICT:
>> +			/* all children must also conflict, no need to check */
>> +			continue;
>> +		}
>> +	}
>> +	if (parent != waiter) {
>> +		parent = parent->fl_blocker;
>> +		fl = parent;
>> +		goto restart;
>> +	}
>> +}
>> +
>>  /* Insert waiter into blocker's block list.
>>   * We use a circular list so that processes can be easily woken up in
>>   * the order they blocked. The documentation doesn't require this but
>> @@ -747,11 +780,32 @@ static void locks_delete_block(struct file_lock *waiter)
>>   * fl_blocked list itself is protected by the blocked_lock_lock, but by ensuring
>>   * that the flc_lock is also held on insertions we can avoid taking the
>>   * blocked_lock_lock in some cases when we see that the fl_blocked list is empty.
>> + *
>> + * Rather than just adding to the list, we check for conflicts with any existing
>> + * waiter, and add to that waiter instead.
>> + * Thus wakeups don't happen until needed.
>>   */
>>  static void __locks_insert_block(struct file_lock *blocker,
>> -					struct file_lock *waiter)
>> +				 struct file_lock *waiter,
>> +				 enum conflict conflict(struct file_lock *,
>> +							struct file_lock *))
>>  {
>> +	struct file_lock *fl;
>>  	BUG_ON(!list_empty(&waiter->fl_block));
>> +
>> +	/* Any request in waiter->fl_blocked is know to conflict with
>
> "known"
>
>> +	 * waiter, but it might not conflict with blocker.
>> +	 * If it doesn't, it needs to be woken now so it can find
>> +	 * somewhere else to wait, or possible it can get granted.
>
> "possibly it can be"

Both fixed, thanks.

>
>> +	 */
>> +	if (conflict(waiter, blocker) != FL_TRANSITIVE_CONFLICT)
>> +		wake_non_conflicts(waiter, blocker, conflict);
>> +new_blocker:
>> +	list_for_each_entry(fl, &blocker->fl_blocked, fl_block)
>> +		if (conflict(fl, waiter)) {
>> +			blocker =  fl;
>> +			goto new_blocker;
>> +		}
>>  
>> > 	waiter->fl_blocker = blocker;
>>  	list_add_tail(&waiter->fl_block, &blocker->fl_blocked);
>>  	if (IS_POSIX(blocker) && !IS_OFDLCK(blocker))
>
> I wonder if it might be better to insert the blocker first before waking
> up other waiters? Consider that anything awoken will end up contending
> for the flc_lock that is held by "current" at this point. Doing most of
> what you need to get done before waking them might mean less spinning in
> other tasks.
>

Maybe.
I think you are suggesting we move the call to wake_non_conflicts() to
the end of the function.
The main reason I put it at the top is to use the original value of
"blocker" before it gets changed.
Even if we move it to the end, there is still quite a few other little
tasks to be performed before the lock is dropped.

Will all this get done before some other processor has a chance to wake
up a process, and for that process to get a to spinlock ???

Maybe - though the first spinlock would be blocked_lock_lock in
locks_delete_block(), and that is dropped almost immediately.

I don't know ... it seems much of a muchness.
If the process will be woken that quickly, then some other processor
must be idle, and does it matter much if it spends a microsecond
spinning on a lock rather than being idle a bit longer?

Thanks.
I won't to do a least a little testing before I repost with any of these
changes.

NeilBrown


>> @@ -760,10 +814,12 @@ static void __locks_insert_block(struct file_lock *blocker,
>>  
>>  /* Must be called with flc_lock held. */
>>  static void locks_insert_block(struct file_lock *blocker,
>> -					struct file_lock *waiter)
>> +			       struct file_lock *waiter,
>> +			       enum conflict conflict(struct file_lock *,
>> +						      struct file_lock *))
>>  {
>>  	spin_lock(&blocked_lock_lock);
>> -	__locks_insert_block(blocker, waiter);
>> +	__locks_insert_block(blocker, waiter, conflict);
>>  	spin_unlock(&blocked_lock_lock);
>>  }
>>  
>> @@ -1033,7 +1089,7 @@ static int flock_lock_inode(struct inode *inode, struct file_lock *request)
>>  		if (!(request->fl_flags & FL_SLEEP))
>>  			goto out;
>>  		error = FILE_LOCK_DEFERRED;
>> -		locks_insert_block(fl, request);
>> +		locks_insert_block(fl, request, flock_locks_conflict);
>>  		goto out;
>>  	}
>>  	if (request->fl_flags & FL_ACCESS)
>> @@ -1107,7 +1163,8 @@ static int posix_lock_inode(struct inode *inode, struct file_lock *request,
>>  			spin_lock(&blocked_lock_lock);
>>  			if (likely(!posix_locks_deadlock(request, fl))) {
>>  				error = FILE_LOCK_DEFERRED;
>> -				__locks_insert_block(fl, request);
>> +				__locks_insert_block(fl, request,
>> +						     posix_locks_conflict);
>>  			}
>>  			spin_unlock(&blocked_lock_lock);
>>  			goto out;
>> @@ -1581,7 +1638,7 @@ int __break_lease(struct inode *inode, unsigned int mode, unsigned int type)
>>  		break_time -= jiffies;
>>  	if (break_time == 0)
>>  		break_time++;
>> -	locks_insert_block(fl, new_fl);
>> +	locks_insert_block(fl, new_fl, leases_conflict);
>>  	trace_break_lease_block(inode, new_fl);
>>  	spin_unlock(&ctx->flc_lock);
>>  	percpu_up_read_preempt_enable(&file_rwsem);
>> 
>> 
>
> -- 
> Jeff Layton <jlayton@kernel.org>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/5] fs/locks: change all *_conflict() functions to return a new enum.
  2018-08-09 13:09   ` J. Bruce Fields
@ 2018-08-09 23:40     ` NeilBrown
  2018-08-10  0:56       ` J. Bruce Fields
  0 siblings, 1 reply; 27+ messages in thread
From: NeilBrown @ 2018-08-09 23:40 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Jeff Layton, Alexander Viro, Martin Wilck, linux-fsdevel,
	Frank Filz, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 5497 bytes --]

On Thu, Aug 09 2018, J. Bruce Fields wrote:

> On Thu, Aug 09, 2018 at 12:04:41PM +1000, NeilBrown wrote:
>> In a future patch we will need to differentiate between conflicts that
>> are "transitive" and those that aren't.
>> A "transitive" conflict is defined as one where any lock that
>> conflicts with the first (newly requested) lock would conflict with
>> the existing lock.
>> 
>> So change posix_locks_conflict(), flock_locks_conflict() (both
>> currently returning int) and leases_conflict() (currently returning
>> bool) to return "enum conflict".
>> Add locks_transitive_overlap() to make it possible to compute the
>> correct conflict for posix locks.
>> 
>> The FL_NO_CONFLICT value is zero, so current code which only tests the
>> truth value of these functions will still work the same way.
>> 
>> And convert some
>>    return (foo);
>> to
>>    return foo;
>> 
>> Signed-off-by: NeilBrown <neilb@suse.com>
>> ---
>>  fs/locks.c |   64 ++++++++++++++++++++++++++++++++++++++++++++++--------------
>>  1 file changed, 49 insertions(+), 15 deletions(-)
>> 
>> diff --git a/fs/locks.c b/fs/locks.c
>> index b4812da2a374..d06658b2dc7a 100644
>> --- a/fs/locks.c
>> +++ b/fs/locks.c
>> @@ -139,6 +139,16 @@
>>  #define IS_OFDLCK(fl)	(fl->fl_flags & FL_OFDLCK)
>>  #define IS_REMOTELCK(fl)	(fl->fl_pid <= 0)
>>  
>> +/* A transitive conflict is one where the first lock conflicts with
>> + * the second lock, and any other lock that conflicts with the
>> + * first lock, also conflicts with the second lock.
>> + */
>> +enum conflict {
>> +	FL_NO_CONFLICT = 0,
>> +	FL_CONFLICT,
>> +	FL_TRANSITIVE_CONFLICT,
>> +};
>> +
>>  static inline bool is_remote_lock(struct file *filp)
>>  {
>>  	return likely(!(filp->f_path.dentry->d_sb->s_flags & SB_NOREMOTELOCK));
>> @@ -612,6 +622,15 @@ static inline int locks_overlap(struct file_lock *fl1, struct file_lock *fl2)
>>  		(fl2->fl_end >= fl1->fl_start));
>>  }
>>  
>> +/* Check for transitive-overlap - true if any lock that overlaps
>> + * the first lock must overlap the seconds
>> + */
>> +static inline bool locks_transitive_overlap(struct file_lock *fl1,
>> +					    struct file_lock *fl2)
>> +{
>> +	return (fl1->fl_start >= fl2->fl_start) &&
>> +		(fl1->fl_end <= fl2->fl_end);
>> +}
>>  /*
>>   * Check whether two locks have the same owner.
>>   */
>> @@ -793,47 +812,61 @@ locks_delete_lock_ctx(struct file_lock *fl, struct list_head *dispose)
>>  /* Determine if lock sys_fl blocks lock caller_fl. Common functionality
>>   * checks for shared/exclusive status of overlapping locks.
>>   */
>> -static int locks_conflict(struct file_lock *caller_fl, struct file_lock *sys_fl)
>> +static enum conflict locks_conflict(struct file_lock *caller_fl,
>> +				    struct file_lock *sys_fl)
>>  {
>>  	if (sys_fl->fl_type == F_WRLCK)
>> -		return 1;
>> +		return FL_TRANSITIVE_CONFLICT;
>>  	if (caller_fl->fl_type == F_WRLCK)
>> -		return 1;
>> -	return 0;
>> +		return FL_CONFLICT;
>> +	return FL_NO_CONFLICT;
>>  }
>>  
>>  /* Determine if lock sys_fl blocks lock caller_fl. POSIX specific
>>   * checking before calling the locks_conflict().
>>   */
>> -static int posix_locks_conflict(struct file_lock *caller_fl, struct file_lock *sys_fl)
>> +static enum conflict posix_locks_conflict(struct file_lock *caller_fl,
>> +					  struct file_lock *sys_fl)
>>  {
>>  	/* POSIX locks owned by the same process do not conflict with
>>  	 * each other.
>>  	 */
>>  	if (posix_same_owner(caller_fl, sys_fl))
>> -		return (0);
>> +		return FL_NO_CONFLICT;
>>  
>>  	/* Check whether they overlap */
>>  	if (!locks_overlap(caller_fl, sys_fl))
>> -		return 0;
>> +		return FL_NO_CONFLICT;
>>  
>> -	return (locks_conflict(caller_fl, sys_fl));
>> +	switch (locks_conflict(caller_fl, sys_fl)) {
>> +	default:
>> +	case FL_NO_CONFLICT:
>> +		return FL_NO_CONFLICT;
>> +	case FL_CONFLICT:
>> +		return FL_CONFLICT;
>
> If I'm understanding the logic here and in locks_conflict correctly,
> you're telling me that in the case where sys_fl is a read lock, and
> caller_fl is a write lock, then any lock which conflicts with sys_fl
> must conflict with caller_fl?  Or do I have that backwards?  It doesn't
> sound right, in any case.

As I was writing this code, I was thinking that I'd probably end up
getting something backwards....
Let's see.  I wrote:

>> +/* A transitive conflict is one where the first lock conflicts with
>> + * the second lock, and any other lock that conflicts with the
>> + * first lock, also conflicts with the second lock.
>> + */

caller_fl is first and sys_fl is second.

if sys_fl, the second, is a read lock, and caller_fl, the first, is a
write lock, they clearly conflict but any other lock that conflict
with caller_fl (The write lock) would *not* necessarily conflict with
the read lock.  So this situation is *not*  FL_TRANSITIVE_CONFLICT.

locks_conflict() only returns FL_TRANSITIVE_CONFLICT when sys_fl (the
second) is a write lock, which it isn't in this case.  So I think that
this case is handled correctly.
posix_locks_conflict() will return FL_CONFLICT, but not
FL_TRANSITIVE_CONFLICT.

Have I convinced you, or have I missed your point?

Thanks,
NeilBrown


>
> --b.
>
>> +	case FL_TRANSITIVE_CONFLICT:
>> +		if (locks_transitive_overlap(caller_fl, sys_fl))
>> +			return FL_TRANSITIVE_CONFLICT;
>> +		else
>> +			return FL_CONFLICT;
>> +	}
>>  }

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/5 - V2] locks: avoid thundering-herd wake-ups
  2018-08-09 22:12   ` NeilBrown
@ 2018-08-10  0:29     ` J. Bruce Fields
  2018-08-10  1:50       ` NeilBrown
  2018-08-11 11:51       ` Jeff Layton
  0 siblings, 2 replies; 27+ messages in thread
From: J. Bruce Fields @ 2018-08-10  0:29 UTC (permalink / raw)
  To: NeilBrown
  Cc: Jeff Layton, Alexander Viro, Martin Wilck, linux-fsdevel,
	Frank Filz, linux-kernel

On Fri, Aug 10, 2018 at 08:12:43AM +1000, NeilBrown wrote:
> On Thu, Aug 09 2018, J. Bruce Fields wrote:
> 
> > I think there's also a problem with multiple tasks sharing the same
> > lock owner.
> >
> > So, all locks are exclusive locks for the same range.  We have four
> > tasks.  Tasks 1 and 4 share the same owner, the others' owners are
> > distinct.
> >
> > 	- Task 1 gets a lock.
> > 	- Task 2 gets a conflicting lock.
> > 	- Task 3 gets another conflicting lock.  So now we the tree is
> > 		3->2->1.
> > 	- Task 1's lock is released.
> > 	- Before task 2 is scheduled, task 4 acquires a new lock.
> > 	- Task 2 waits on task 4's lock, we now have
> > 		3->2->4.
> >
> > Task 3 shouldn't be waiting--the lock it's requesting has the same owner
> > as the lock task 4 holds--but we fail to wake up task 3.
> 
> So task 1 and task 4 are threads in the one process - OK.
> Tasks 2 and 3 are threads in two other processes.
> 
> So 2 and 3 conflict with either 1 or 4 equally - why should task 3 be
> woken?
> 
> I suspect you got the numbers bit mixed up,

Whoops.

> but in any case, the "conflict()" function that is passed around takes
> ownership into account when assessing if one lock conflicts with
> another.

Right, I know, but, let me try again:

All locks are exclusive locks for the same range.  Only tasks 3 and 4
share the the same owner.

	- Task 1 gets a lock.
	- Task 2 requests a conflicting lock, so we have    2->1.
	- Task 3 requests a conflicting lock, so we have 3->2->1.
	- Task 1 unlocks.  We wake up task 2, but it isn't scheduled yet.
	- Task 4 gets a new lock.
	- Task 2 runs, discovers the conflict, and waits.  Now we have:
		3->2->4.

There is no conflict between the lock 3 requested and the lock 4 holds,
but 3 is not woken up.

This is another version of the first problem: there's information we
need (the owners of the waiting locks in the tree) that we can't
determine just from looking at the root of the tree.

I'm not sure what to do about that.

--b.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 5/5] fs/locks: create a tree of dependent requests.
  2018-08-09 22:19     ` NeilBrown
@ 2018-08-10  0:36       ` J. Bruce Fields
  0 siblings, 0 replies; 27+ messages in thread
From: J. Bruce Fields @ 2018-08-10  0:36 UTC (permalink / raw)
  To: NeilBrown
  Cc: Jeff Layton, Alexander Viro, Martin Wilck, linux-fsdevel,
	Frank Filz, linux-kernel

On Fri, Aug 10, 2018 at 08:19:26AM +1000, NeilBrown wrote:
> On Thu, Aug 09 2018, J. Bruce Fields wrote:
> > I think you could simplify the code a lot by maintaining the tree so
> > that it always satisfies the condition that waiters are always strictly
> > "weaker" than their descendents, so that finding a conflict with a
> > waiter is always enough to know that the descendents also conflict.
> 
> Can you define "weaker" please.
> I suspect it is a partial ordering, in which case a tree would normally
> be more appropriate than trying to find a total ordering.

Lock X is stronger than lock Y if any lock that would conflict with lock
Y would also conflict with lock X.

Equivalently, X is stronger than Y if lock X's range is a superset of
lock Y's and if X is a write lock whenever Y is.  Well, I *thought* that
was equivalent until I thought about the owner problem.  Ugh.

--b.

> 
> Thanks,
> NeilBrown
> 
> >
> > So, when you put a waiter to sleep, you don't add it below a child
> > unless it's "stronger" than the child.
> >
> > You give up the property that siblings don't conflict, but again that
> > just means thundering herds in weird cases, which is OK.
> >
> > --b.
> >
> >> 
> >> Reported-and-tested-by: Martin Wilck <mwilck@suse.de>
> >> Signed-off-by: NeilBrown <neilb@suse.com>
> >> ---
> >>  fs/locks.c |   69 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
> >>  1 file changed, 63 insertions(+), 6 deletions(-)
> >> 
> >> diff --git a/fs/locks.c b/fs/locks.c
> >> index fc64016d01ee..17843feb6f5b 100644
> >> --- a/fs/locks.c
> >> +++ b/fs/locks.c
> >> @@ -738,6 +738,39 @@ static void locks_delete_block(struct file_lock *waiter)
> >>  	spin_unlock(&blocked_lock_lock);
> >>  }
> >>  
> >> +static void wake_non_conflicts(struct file_lock *waiter, struct file_lock *blocker,
> >> +			       enum conflict conflict(struct file_lock *,
> >> +						      struct file_lock *))
> >> +{
> >> +	struct file_lock *parent = waiter;
> >> +	struct file_lock *fl;
> >> +	struct file_lock  *t;
> >> +
> >> +	fl = list_entry(&parent->fl_blocked, struct file_lock, fl_block);
> >> +restart:
> >> +	list_for_each_entry_safe_continue(fl, t, &parent->fl_blocked, fl_block) {
> >> +		switch (conflict(fl, blocker)) {
> >> +		default:
> >> +		case FL_NO_CONFLICT:
> >> +			__locks_wake_one(fl);
> >> +			break;
> >> +		case FL_CONFLICT:
> >> +			/* Need to check children */
> >> +			parent = fl;
> >> +			fl = list_entry(&parent->fl_blocked, struct file_lock, fl_block);
> >> +			goto restart;
> >> +		case FL_TRANSITIVE_CONFLICT:
> >> +			/* all children must also conflict, no need to check */
> >> +			continue;
> >> +		}
> >> +	}
> >> +	if (parent != waiter) {
> >> +		parent = parent->fl_blocker;
> >> +		fl = parent;
> >> +		goto restart;
> >> +	}
> >> +}
> >> +
> >>  /* Insert waiter into blocker's block list.
> >>   * We use a circular list so that processes can be easily woken up in
> >>   * the order they blocked. The documentation doesn't require this but
> >> @@ -747,11 +780,32 @@ static void locks_delete_block(struct file_lock *waiter)
> >>   * fl_blocked list itself is protected by the blocked_lock_lock, but by ensuring
> >>   * that the flc_lock is also held on insertions we can avoid taking the
> >>   * blocked_lock_lock in some cases when we see that the fl_blocked list is empty.
> >> + *
> >> + * Rather than just adding to the list, we check for conflicts with any existing
> >> + * waiter, and add to that waiter instead.
> >> + * Thus wakeups don't happen until needed.
> >>   */
> >>  static void __locks_insert_block(struct file_lock *blocker,
> >> -					struct file_lock *waiter)
> >> +				 struct file_lock *waiter,
> >> +				 enum conflict conflict(struct file_lock *,
> >> +							struct file_lock *))
> >>  {
> >> +	struct file_lock *fl;
> >>  	BUG_ON(!list_empty(&waiter->fl_block));
> >> +
> >> +	/* Any request in waiter->fl_blocked is know to conflict with
> >> +	 * waiter, but it might not conflict with blocker.
> >> +	 * If it doesn't, it needs to be woken now so it can find
> >> +	 * somewhere else to wait, or possible it can get granted.
> >> +	 */
> >> +	if (conflict(waiter, blocker) != FL_TRANSITIVE_CONFLICT)
> >> +		wake_non_conflicts(waiter, blocker, conflict);
> >> +new_blocker:
> >> +	list_for_each_entry(fl, &blocker->fl_blocked, fl_block)
> >> +		if (conflict(fl, waiter)) {
> >> +			blocker =  fl;
> >> +			goto new_blocker;
> >> +		}
> >>  	waiter->fl_blocker = blocker;
> >>  	list_add_tail(&waiter->fl_block, &blocker->fl_blocked);
> >>  	if (IS_POSIX(blocker) && !IS_OFDLCK(blocker))
> >> @@ -760,10 +814,12 @@ static void __locks_insert_block(struct file_lock *blocker,
> >>  
> >>  /* Must be called with flc_lock held. */
> >>  static void locks_insert_block(struct file_lock *blocker,
> >> -					struct file_lock *waiter)
> >> +			       struct file_lock *waiter,
> >> +			       enum conflict conflict(struct file_lock *,
> >> +						      struct file_lock *))
> >>  {
> >>  	spin_lock(&blocked_lock_lock);
> >> -	__locks_insert_block(blocker, waiter);
> >> +	__locks_insert_block(blocker, waiter, conflict);
> >>  	spin_unlock(&blocked_lock_lock);
> >>  }
> >>  
> >> @@ -1033,7 +1089,7 @@ static int flock_lock_inode(struct inode *inode, struct file_lock *request)
> >>  		if (!(request->fl_flags & FL_SLEEP))
> >>  			goto out;
> >>  		error = FILE_LOCK_DEFERRED;
> >> -		locks_insert_block(fl, request);
> >> +		locks_insert_block(fl, request, flock_locks_conflict);
> >>  		goto out;
> >>  	}
> >>  	if (request->fl_flags & FL_ACCESS)
> >> @@ -1107,7 +1163,8 @@ static int posix_lock_inode(struct inode *inode, struct file_lock *request,
> >>  			spin_lock(&blocked_lock_lock);
> >>  			if (likely(!posix_locks_deadlock(request, fl))) {
> >>  				error = FILE_LOCK_DEFERRED;
> >> -				__locks_insert_block(fl, request);
> >> +				__locks_insert_block(fl, request,
> >> +						     posix_locks_conflict);
> >>  			}
> >>  			spin_unlock(&blocked_lock_lock);
> >>  			goto out;
> >> @@ -1581,7 +1638,7 @@ int __break_lease(struct inode *inode, unsigned int mode, unsigned int type)
> >>  		break_time -= jiffies;
> >>  	if (break_time == 0)
> >>  		break_time++;
> >> -	locks_insert_block(fl, new_fl);
> >> +	locks_insert_block(fl, new_fl, leases_conflict);
> >>  	trace_break_lease_block(inode, new_fl);
> >>  	spin_unlock(&ctx->flc_lock);
> >>  	percpu_up_read_preempt_enable(&file_rwsem);
> >> 



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/5] fs/locks: change all *_conflict() functions to return a new enum.
  2018-08-09 23:40     ` NeilBrown
@ 2018-08-10  0:56       ` J. Bruce Fields
  0 siblings, 0 replies; 27+ messages in thread
From: J. Bruce Fields @ 2018-08-10  0:56 UTC (permalink / raw)
  To: NeilBrown
  Cc: Jeff Layton, Alexander Viro, Martin Wilck, linux-fsdevel,
	Frank Filz, linux-kernel

On Fri, Aug 10, 2018 at 09:40:35AM +1000, NeilBrown wrote:
> caller_fl is first and sys_fl is second.
> 
> if sys_fl, the second, is a read lock, and caller_fl, the first, is a
> write lock, they clearly conflict but any other lock that conflict
> with caller_fl (The write lock) would *not* necessarily conflict with
> the read lock.  So this situation is *not*  FL_TRANSITIVE_CONFLICT.
> 
> locks_conflict() only returns FL_TRANSITIVE_CONFLICT when sys_fl (the
> second) is a write lock, which it isn't in this case.  So I think that
> this case is handled correctly.
> posix_locks_conflict() will return FL_CONFLICT, but not
> FL_TRANSITIVE_CONFLICT.
> 
> Have I convinced you, or have I missed your point?

Eh, I was just confused.

And now I'm tempted to blame you for confusing me, but maybe that's just
my ego going defensive.

(My bruised ego suggests leaving locks_conflict and its callers alone,
and having an entirely separate function that checks this when we need
it.)

--b.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/5 - V2] locks: avoid thundering-herd wake-ups
  2018-08-10  0:29     ` J. Bruce Fields
@ 2018-08-10  1:50       ` NeilBrown
  2018-08-10  2:52         ` J. Bruce Fields
  2018-08-11 11:51       ` Jeff Layton
  1 sibling, 1 reply; 27+ messages in thread
From: NeilBrown @ 2018-08-10  1:50 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Jeff Layton, Alexander Viro, Martin Wilck, linux-fsdevel,
	Frank Filz, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2685 bytes --]

On Thu, Aug 09 2018, J. Bruce Fields wrote:

> On Fri, Aug 10, 2018 at 08:12:43AM +1000, NeilBrown wrote:
>> On Thu, Aug 09 2018, J. Bruce Fields wrote:
>> 
>> > I think there's also a problem with multiple tasks sharing the same
>> > lock owner.
>> >
>> > So, all locks are exclusive locks for the same range.  We have four
>> > tasks.  Tasks 1 and 4 share the same owner, the others' owners are
>> > distinct.
>> >
>> > 	- Task 1 gets a lock.
>> > 	- Task 2 gets a conflicting lock.
>> > 	- Task 3 gets another conflicting lock.  So now we the tree is
>> > 		3->2->1.
>> > 	- Task 1's lock is released.
>> > 	- Before task 2 is scheduled, task 4 acquires a new lock.
>> > 	- Task 2 waits on task 4's lock, we now have
>> > 		3->2->4.
>> >
>> > Task 3 shouldn't be waiting--the lock it's requesting has the same owner
>> > as the lock task 4 holds--but we fail to wake up task 3.
>> 
>> So task 1 and task 4 are threads in the one process - OK.
>> Tasks 2 and 3 are threads in two other processes.
>> 
>> So 2 and 3 conflict with either 1 or 4 equally - why should task 3 be
>> woken?
>> 
>> I suspect you got the numbers bit mixed up,
>
> Whoops.
>
>> but in any case, the "conflict()" function that is passed around takes
>> ownership into account when assessing if one lock conflicts with
>> another.
>
> Right, I know, but, let me try again:
>
> All locks are exclusive locks for the same range.  Only tasks 3 and 4
> share the the same owner.
>
> 	- Task 1 gets a lock.
> 	- Task 2 requests a conflicting lock, so we have    2->1.
> 	- Task 3 requests a conflicting lock, so we have 3->2->1.
> 	- Task 1 unlocks.  We wake up task 2, but it isn't scheduled yet.
> 	- Task 4 gets a new lock.
> 	- Task 2 runs, discovers the conflict, and waits.  Now we have:
> 		3->2->4.
>
> There is no conflict between the lock 3 requested and the lock 4 holds,
> but 3 is not woken up.
>
> This is another version of the first problem: there's information we
> need (the owners of the waiting locks in the tree) that we can't
> determine just from looking at the root of the tree.
>
> I'm not sure what to do about that.

You're good at this game!

So, because a locker with the same "owner" gets a free pass, you can
*never* say that any lock which conflicts with A also conflicts with B,
as a lock with the same owner as B will never conflict with B, even
though it conflicts with A.

I think there is still value in having the tree, but when a waiter is
attached under a new blocker, we need to walk the whole tree beneath the
waiter and detach/wake anything that is not blocked by the new blocker.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/5 - V2] locks: avoid thundering-herd wake-ups
  2018-08-10  1:50       ` NeilBrown
@ 2018-08-10  2:52         ` J. Bruce Fields
  2018-08-10  3:17           ` NeilBrown
  0 siblings, 1 reply; 27+ messages in thread
From: J. Bruce Fields @ 2018-08-10  2:52 UTC (permalink / raw)
  To: NeilBrown
  Cc: Jeff Layton, Alexander Viro, Martin Wilck, linux-fsdevel,
	Frank Filz, linux-kernel

On Fri, Aug 10, 2018 at 11:50:58AM +1000, NeilBrown wrote:
> You're good at this game!

Everybody's got to have a hobby, mine is pathological posix locking
cases....

> So, because a locker with the same "owner" gets a free pass, you can
> *never* say that any lock which conflicts with A also conflicts with B,
> as a lock with the same owner as B will never conflict with B, even
> though it conflicts with A.
> 
> I think there is still value in having the tree, but when a waiter is
> attached under a new blocker, we need to walk the whole tree beneath the
> waiter and detach/wake anything that is not blocked by the new blocker.

If you're walking the whole tree every time then it might as well be a
flat list, I think?

--b.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/5 - V2] locks: avoid thundering-herd wake-ups
  2018-08-10  2:52         ` J. Bruce Fields
@ 2018-08-10  3:17           ` NeilBrown
  2018-08-10 15:47             ` J. Bruce Fields
  0 siblings, 1 reply; 27+ messages in thread
From: NeilBrown @ 2018-08-10  3:17 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Jeff Layton, Alexander Viro, Martin Wilck, linux-fsdevel,
	Frank Filz, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1447 bytes --]

On Thu, Aug 09 2018, J. Bruce Fields wrote:

> On Fri, Aug 10, 2018 at 11:50:58AM +1000, NeilBrown wrote:
>> You're good at this game!
>
> Everybody's got to have a hobby, mine is pathological posix locking
> cases....
>
>> So, because a locker with the same "owner" gets a free pass, you can
>> *never* say that any lock which conflicts with A also conflicts with B,
>> as a lock with the same owner as B will never conflict with B, even
>> though it conflicts with A.
>> 
>> I think there is still value in having the tree, but when a waiter is
>> attached under a new blocker, we need to walk the whole tree beneath the
>> waiter and detach/wake anything that is not blocked by the new blocker.
>
> If you're walking the whole tree every time then it might as well be a
> flat list, I think?

The advantage of a tree is that it keeps over-lapping locks closer
together.
For it to make a difference you would need a load where lots of threads
were locking several different small ranges, and other threads were
locking large ranges that cover all the small ranges.
I doubt this is common, but it doesn't seem as strange as other things
I've seen in the wild.
The other advantage, of course, is that I've already written the code,
and I like it.

Maybe I'll do a simple-list version, then a patch to convert that to the
clever-tree version, and we can then have something concrete to assess.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/5 - V2] locks: avoid thundering-herd wake-ups
  2018-08-10  3:17           ` NeilBrown
@ 2018-08-10 15:47             ` J. Bruce Fields
  2018-08-11 11:56               ` Jeff Layton
  0 siblings, 1 reply; 27+ messages in thread
From: J. Bruce Fields @ 2018-08-10 15:47 UTC (permalink / raw)
  To: NeilBrown
  Cc: Jeff Layton, Alexander Viro, Martin Wilck, linux-fsdevel,
	Frank Filz, linux-kernel

On Fri, Aug 10, 2018 at 01:17:14PM +1000, NeilBrown wrote:
> On Thu, Aug 09 2018, J. Bruce Fields wrote:
> 
> > On Fri, Aug 10, 2018 at 11:50:58AM +1000, NeilBrown wrote:
> >> You're good at this game!
> >
> > Everybody's got to have a hobby, mine is pathological posix locking
> > cases....
> >
> >> So, because a locker with the same "owner" gets a free pass, you can
> >> *never* say that any lock which conflicts with A also conflicts with B,
> >> as a lock with the same owner as B will never conflict with B, even
> >> though it conflicts with A.
> >> 
> >> I think there is still value in having the tree, but when a waiter is
> >> attached under a new blocker, we need to walk the whole tree beneath the
> >> waiter and detach/wake anything that is not blocked by the new blocker.
> >
> > If you're walking the whole tree every time then it might as well be a
> > flat list, I think?
> 
> The advantage of a tree is that it keeps over-lapping locks closer
> together.
> For it to make a difference you would need a load where lots of threads
> were locking several different small ranges, and other threads were
> locking large ranges that cover all the small ranges.

OK, I'm not sure I understand, but I'll give another look at the next
version....

> I doubt this is common, but it doesn't seem as strange as other things
> I've seen in the wild.
> The other advantage, of course, is that I've already written the code,
> and I like it.
> 
> Maybe I'll do a simple-list version, then a patch to convert that to the
> clever-tree version, and we can then have something concrete to assess.

That might help, thanks.

--b.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/5 - V2] locks: avoid thundering-herd wake-ups
  2018-08-10  0:29     ` J. Bruce Fields
  2018-08-10  1:50       ` NeilBrown
@ 2018-08-11 11:51       ` Jeff Layton
  2018-08-11 12:21         ` J. Bruce Fields
  1 sibling, 1 reply; 27+ messages in thread
From: Jeff Layton @ 2018-08-11 11:51 UTC (permalink / raw)
  To: J. Bruce Fields, NeilBrown
  Cc: Alexander Viro, Martin Wilck, linux-fsdevel, Frank Filz, linux-kernel

On Thu, 2018-08-09 at 20:29 -0400, J. Bruce Fields wrote:
> On Fri, Aug 10, 2018 at 08:12:43AM +1000, NeilBrown wrote:
> > On Thu, Aug 09 2018, J. Bruce Fields wrote:
> > 
> > > I think there's also a problem with multiple tasks sharing the same
> > > lock owner.
> > > 
> > > So, all locks are exclusive locks for the same range.  We have four
> > > tasks.  Tasks 1 and 4 share the same owner, the others' owners are
> > > distinct.
> > > 
> > > 	- Task 1 gets a lock.
> > > 	- Task 2 gets a conflicting lock.
> > > 	- Task 3 gets another conflicting lock.  So now we the tree is
> > > 		3->2->1.
> > > 	- Task 1's lock is released.
> > > 	- Before task 2 is scheduled, task 4 acquires a new lock.
> > > 	- Task 2 waits on task 4's lock, we now have
> > > 		3->2->4.
> > > 
> > > Task 3 shouldn't be waiting--the lock it's requesting has the same owner
> > > as the lock task 4 holds--but we fail to wake up task 3.
> > 
> > So task 1 and task 4 are threads in the one process - OK.
> > Tasks 2 and 3 are threads in two other processes.
> > 
> > So 2 and 3 conflict with either 1 or 4 equally - why should task 3 be
> > woken?
> > 
> > I suspect you got the numbers bit mixed up,
> 
> Whoops.
> 
> > but in any case, the "conflict()" function that is passed around takes
> > ownership into account when assessing if one lock conflicts with
> > another.
> 
> Right, I know, but, let me try again:
> 
> All locks are exclusive locks for the same range.  Only tasks 3 and 4
> share the the same owner.
> 
> 	- Task 1 gets a lock.
> 	- Task 2 requests a conflicting lock, so we have    2->1.
> 	- Task 3 requests a conflicting lock, so we have 3->2->1.
> 	- Task 1 unlocks.  We wake up task 2, but it isn't scheduled yet.
> 	- Task 4 gets a new lock.
> 	- Task 2 runs, discovers the conflict, and waits.  Now we have:
> 		3->2->4.
> 
> There is no conflict between the lock 3 requested and the lock 4 holds,
> but 3 is not woken up.
> 
> This is another version of the first problem: there's information we
> need (the owners of the waiting locks in the tree) that we can't
> determine just from looking at the root of the tree.
> 
> I'm not sure what to do about that.
> 

Is this still a problem in the v2 set?

wake_non_conflicts walks the whole tree of requests that were blocked on
it, so a. After task 2 discovers the conflict, it should wake any of its
children that don't conflict. So in that last step, task 3 would be
awoken before task 2 goes back to sleep.

-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/5 - V2] locks: avoid thundering-herd wake-ups
  2018-08-10 15:47             ` J. Bruce Fields
@ 2018-08-11 11:56               ` Jeff Layton
  2018-08-11 12:35                 ` J. Bruce Fields
  0 siblings, 1 reply; 27+ messages in thread
From: Jeff Layton @ 2018-08-11 11:56 UTC (permalink / raw)
  To: J. Bruce Fields, NeilBrown
  Cc: Alexander Viro, Martin Wilck, linux-fsdevel, Frank Filz, linux-kernel

On Fri, 2018-08-10 at 11:47 -0400, J. Bruce Fields wrote:
> On Fri, Aug 10, 2018 at 01:17:14PM +1000, NeilBrown wrote:
> > On Thu, Aug 09 2018, J. Bruce Fields wrote:
> > 
> > > On Fri, Aug 10, 2018 at 11:50:58AM +1000, NeilBrown wrote:
> > > > You're good at this game!
> > > 
> > > Everybody's got to have a hobby, mine is pathological posix locking
> > > cases....
> > > 
> > > > So, because a locker with the same "owner" gets a free pass, you can
> > > > *never* say that any lock which conflicts with A also conflicts with B,
> > > > as a lock with the same owner as B will never conflict with B, even
> > > > though it conflicts with A.
> > > > 
> > > > I think there is still value in having the tree, but when a waiter is
> > > > attached under a new blocker, we need to walk the whole tree beneath the
> > > > waiter and detach/wake anything that is not blocked by the new blocker.
> > > 
> > > If you're walking the whole tree every time then it might as well be a
> > > flat list, I think?
> > 
> > The advantage of a tree is that it keeps over-lapping locks closer
> > together.
> > For it to make a difference you would need a load where lots of threads
> > were locking several different small ranges, and other threads were
> > locking large ranges that cover all the small ranges.
> 
> OK, I'm not sure I understand, but I'll give another look at the next
> version....
> 
> > I doubt this is common, but it doesn't seem as strange as other things
> > I've seen in the wild.
> > The other advantage, of course, is that I've already written the code,
> > and I like it.
> > 
> > Maybe I'll do a simple-list version, then a patch to convert that to the
> > clever-tree version, and we can then have something concrete to assess.
> 
> That might help, thanks.
> 

FWIW, I did a bit of testing with lockperf tests that I had written on
an earlier rework of this code:

    https://git.samba.org/jlayton/linux.git/?p=jlayton/lockperf.git;a=summary


The posix01 and flock01 tests in there show about a 10x speedup with
this set in place.

I think something closer to Neil's design will end up being what we want
here. Consider the relatively common case where you have a whole-file
POSIX write lock held with a bunch of different waiters blocked on it
(all whole file write locks with different owners):

With Neil's patches, you will just wake up a single waiter when the
blocked lock is released, as they would all be in a long chain of
waiters.

If you keep all the locks in a single list, you'll either have to:

a) wake up all the waiters on the list when the lock comes free: no lock
is held at that point so none of them will conflict.

...or...

b) keep track of what waiters have already been awoken, and compare any
further candidate for waking against the current set of held locks and
any lock requests by waiters that you just woke.

b seems more expensive as you have to walk over a larger set of locks
on every change.
-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/5 - V2] locks: avoid thundering-herd wake-ups
  2018-08-11 11:51       ` Jeff Layton
@ 2018-08-11 12:21         ` J. Bruce Fields
  2018-08-11 13:15           ` Jeff Layton
  0 siblings, 1 reply; 27+ messages in thread
From: J. Bruce Fields @ 2018-08-11 12:21 UTC (permalink / raw)
  To: Jeff Layton
  Cc: NeilBrown, Alexander Viro, Martin Wilck, linux-fsdevel,
	Frank Filz, linux-kernel

On Sat, Aug 11, 2018 at 07:51:13AM -0400, Jeff Layton wrote:
> On Thu, 2018-08-09 at 20:29 -0400, J. Bruce Fields wrote:
> > On Fri, Aug 10, 2018 at 08:12:43AM +1000, NeilBrown wrote:
> > > On Thu, Aug 09 2018, J. Bruce Fields wrote:
> > > 
> > > > I think there's also a problem with multiple tasks sharing the same
> > > > lock owner.
> > > > 
> > > > So, all locks are exclusive locks for the same range.  We have four
> > > > tasks.  Tasks 1 and 4 share the same owner, the others' owners are
> > > > distinct.
> > > > 
> > > > 	- Task 1 gets a lock.
> > > > 	- Task 2 gets a conflicting lock.
> > > > 	- Task 3 gets another conflicting lock.  So now we the tree is
> > > > 		3->2->1.
> > > > 	- Task 1's lock is released.
> > > > 	- Before task 2 is scheduled, task 4 acquires a new lock.
> > > > 	- Task 2 waits on task 4's lock, we now have
> > > > 		3->2->4.
> > > > 
> > > > Task 3 shouldn't be waiting--the lock it's requesting has the same owner
> > > > as the lock task 4 holds--but we fail to wake up task 3.
> > > 
> > > So task 1 and task 4 are threads in the one process - OK.
> > > Tasks 2 and 3 are threads in two other processes.
> > > 
> > > So 2 and 3 conflict with either 1 or 4 equally - why should task 3 be
> > > woken?
> > > 
> > > I suspect you got the numbers bit mixed up,
> > 
> > Whoops.
> > 
> > > but in any case, the "conflict()" function that is passed around takes
> > > ownership into account when assessing if one lock conflicts with
> > > another.
> > 
> > Right, I know, but, let me try again:
> > 
> > All locks are exclusive locks for the same range.  Only tasks 3 and 4
> > share the the same owner.
> > 
> > 	- Task 1 gets a lock.
> > 	- Task 2 requests a conflicting lock, so we have    2->1.
> > 	- Task 3 requests a conflicting lock, so we have 3->2->1.
> > 	- Task 1 unlocks.  We wake up task 2, but it isn't scheduled yet.
> > 	- Task 4 gets a new lock.
> > 	- Task 2 runs, discovers the conflict, and waits.  Now we have:
> > 		3->2->4.
> > 
> > There is no conflict between the lock 3 requested and the lock 4 holds,
> > but 3 is not woken up.
> > 
> > This is another version of the first problem: there's information we
> > need (the owners of the waiting locks in the tree) that we can't
> > determine just from looking at the root of the tree.
> > 
> > I'm not sure what to do about that.
> > 
> 
> Is this still a problem in the v2 set?
> 
> wake_non_conflicts walks the whole tree of requests that were blocked on
> it,

Not in the FL_TRANSITIVE_CONFLICT case, which is the case here.

--b.

> so a. After task 2 discovers the conflict, it should wake any of its
> children that don't conflict. So in that last step, task 3 would be
> awoken before task 2 goes back to sleep.
> 
> -- 
> Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/5 - V2] locks: avoid thundering-herd wake-ups
  2018-08-11 11:56               ` Jeff Layton
@ 2018-08-11 12:35                 ` J. Bruce Fields
  0 siblings, 0 replies; 27+ messages in thread
From: J. Bruce Fields @ 2018-08-11 12:35 UTC (permalink / raw)
  To: Jeff Layton
  Cc: NeilBrown, Alexander Viro, Martin Wilck, linux-fsdevel,
	Frank Filz, linux-kernel

On Sat, Aug 11, 2018 at 07:56:25AM -0400, Jeff Layton wrote:
> FWIW, I did a bit of testing with lockperf tests that I had written on
> an earlier rework of this code:
> 
>     https://git.samba.org/jlayton/linux.git/?p=jlayton/lockperf.git;a=summary
> 
> 
> The posix01 and flock01 tests in there show about a 10x speedup with
> this set in place.
> 
> I think something closer to Neil's design will end up being what we want
> here. Consider the relatively common case where you have a whole-file
> POSIX write lock held with a bunch of different waiters blocked on it
> (all whole file write locks with different owners):
> 
> With Neil's patches, you will just wake up a single waiter when the
> blocked lock is released, as they would all be in a long chain of
> waiters.

Right, but you still need to walk the whole tree to make sure that it's
the only one you need to wake.  The tree structure means that you know
all the other locks have non-overlapping ranges, but it doesn't tell you
the lock owners.

Maybe there's some reasonable way to rule out the shared-lockowner case
more quickly too.  I haven't thought about that much.

> If you keep all the locks in a single list, you'll either have to:
> 
> a) wake up all the waiters on the list when the lock comes free: no lock
> is held at that point so none of them will conflict.
> 
> ...or...
> 
> b) keep track of what waiters have already been awoken, and compare any
> further candidate for waking against the current set of held locks and
> any lock requests by waiters that you just woke.

Instead of keeping track of *every* waiter that you've woken, you could
keep track of some subset.  Worst case that just means waking more
processes than you need to, which is wasteful but correct.

In the common case that you give, that subset could just be "the first
waiter you wake".  You'd get the same result.

The every-waiter-a-whole-file-write-lock case is pretty easy.  To
benefit from the tree you need a case where some of the waiters overlap
and some don't.

Might be worth it, sure.

--b.

> b seems more expensive as you have to walk over a larger set of locks
> on every change.
> -- 
> Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/5 - V2] locks: avoid thundering-herd wake-ups
  2018-08-11 12:21         ` J. Bruce Fields
@ 2018-08-11 13:15           ` Jeff Layton
  0 siblings, 0 replies; 27+ messages in thread
From: Jeff Layton @ 2018-08-11 13:15 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: NeilBrown, Alexander Viro, Martin Wilck, linux-fsdevel,
	Frank Filz, linux-kernel

On Sat, 2018-08-11 at 08:21 -0400, J. Bruce Fields wrote:
> On Sat, Aug 11, 2018 at 07:51:13AM -0400, Jeff Layton wrote:
> > On Thu, 2018-08-09 at 20:29 -0400, J. Bruce Fields wrote:
> > > On Fri, Aug 10, 2018 at 08:12:43AM +1000, NeilBrown wrote:
> > > > On Thu, Aug 09 2018, J. Bruce Fields wrote:
> > > > 
> > > > > I think there's also a problem with multiple tasks sharing the same
> > > > > lock owner.
> > > > > 
> > > > > So, all locks are exclusive locks for the same range.  We have four
> > > > > tasks.  Tasks 1 and 4 share the same owner, the others' owners are
> > > > > distinct.
> > > > > 
> > > > > 	- Task 1 gets a lock.
> > > > > 	- Task 2 gets a conflicting lock.
> > > > > 	- Task 3 gets another conflicting lock.  So now we the tree is
> > > > > 		3->2->1.
> > > > > 	- Task 1's lock is released.
> > > > > 	- Before task 2 is scheduled, task 4 acquires a new lock.
> > > > > 	- Task 2 waits on task 4's lock, we now have
> > > > > 		3->2->4.
> > > > > 
> > > > > Task 3 shouldn't be waiting--the lock it's requesting has the same owner
> > > > > as the lock task 4 holds--but we fail to wake up task 3.
> > > > 
> > > > So task 1 and task 4 are threads in the one process - OK.
> > > > Tasks 2 and 3 are threads in two other processes.
> > > > 
> > > > So 2 and 3 conflict with either 1 or 4 equally - why should task 3 be
> > > > woken?
> > > > 
> > > > I suspect you got the numbers bit mixed up,
> > > 
> > > Whoops.
> > > 
> > > > but in any case, the "conflict()" function that is passed around takes
> > > > ownership into account when assessing if one lock conflicts with
> > > > another.
> > > 
> > > Right, I know, but, let me try again:
> > > 
> > > All locks are exclusive locks for the same range.  Only tasks 3 and 4
> > > share the the same owner.
> > > 
> > > 	- Task 1 gets a lock.
> > > 	- Task 2 requests a conflicting lock, so we have    2->1.
> > > 	- Task 3 requests a conflicting lock, so we have 3->2->1.
> > > 	- Task 1 unlocks.  We wake up task 2, but it isn't scheduled yet.
> > > 	- Task 4 gets a new lock.
> > > 	- Task 2 runs, discovers the conflict, and waits.  Now we have:
> > > 		3->2->4.
> > > 
> > > There is no conflict between the lock 3 requested and the lock 4 holds,
> > > but 3 is not woken up.
> > > 
> > > This is another version of the first problem: there's information we
> > > need (the owners of the waiting locks in the tree) that we can't
> > > determine just from looking at the root of the tree.
> > > 
> > > I'm not sure what to do about that.
> > > 
> > 
> > Is this still a problem in the v2 set?
> > 
> > wake_non_conflicts walks the whole tree of requests that were blocked on
> > it,
> 
> Not in the FL_TRANSITIVE_CONFLICT case, which is the case here.
> 

Got it. Yeah, I'm not sure that the idea of FL_TRANSITIVE_CONFLICT
really works. I think you could fix this by just getting rid of that and
folding it into the FL_CONFLICT case.

In more complex situations (like the one you describe), you may end up
cycling through several blocked requests before you hit one that can get
the lock. It may slow things down some for those cases. In the more
common locking scenarios (whole file locks, different owners), I think
this will be much more efficient by avoiding so many wakeups.

-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, back to index

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-08-09  2:04 [PATCH 0/5 - V2] locks: avoid thundering-herd wake-ups NeilBrown
2018-08-09  2:04 ` [PATCH 1/5] fs/locks: rename some lists and pointers NeilBrown
2018-08-09  2:04 ` [PATCH 2/5] fs/locks: allow a lock request to block other requests NeilBrown
2018-08-09  2:04 ` [PATCH 3/5] fs/locks: change all *_conflict() functions to return a new enum NeilBrown
2018-08-09 11:09   ` Jeff Layton
2018-08-09 13:09   ` J. Bruce Fields
2018-08-09 23:40     ` NeilBrown
2018-08-10  0:56       ` J. Bruce Fields
2018-08-09  2:04 ` [PATCH 4/5] fs/locks: split out __locks_wake_one() NeilBrown
2018-08-09  2:04 ` [PATCH 5/5] fs/locks: create a tree of dependent requests NeilBrown
2018-08-09 11:17   ` Jeff Layton
2018-08-09 23:25     ` NeilBrown
2018-08-09 14:13   ` J. Bruce Fields
2018-08-09 22:19     ` NeilBrown
2018-08-10  0:36       ` J. Bruce Fields
2018-08-09 17:32 ` [PATCH 0/5 - V2] locks: avoid thundering-herd wake-ups J. Bruce Fields
2018-08-09 22:12   ` NeilBrown
2018-08-10  0:29     ` J. Bruce Fields
2018-08-10  1:50       ` NeilBrown
2018-08-10  2:52         ` J. Bruce Fields
2018-08-10  3:17           ` NeilBrown
2018-08-10 15:47             ` J. Bruce Fields
2018-08-11 11:56               ` Jeff Layton
2018-08-11 12:35                 ` J. Bruce Fields
2018-08-11 11:51       ` Jeff Layton
2018-08-11 12:21         ` J. Bruce Fields
2018-08-11 13:15           ` Jeff Layton

LKML Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/lkml/0 lkml/git/0.git
	git clone --mirror https://lore.kernel.org/lkml/1 lkml/git/1.git
	git clone --mirror https://lore.kernel.org/lkml/2 lkml/git/2.git
	git clone --mirror https://lore.kernel.org/lkml/3 lkml/git/3.git
	git clone --mirror https://lore.kernel.org/lkml/4 lkml/git/4.git
	git clone --mirror https://lore.kernel.org/lkml/5 lkml/git/5.git
	git clone --mirror https://lore.kernel.org/lkml/6 lkml/git/6.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 lkml lkml/ https://lore.kernel.org/lkml \
		linux-kernel@vger.kernel.org linux-kernel@archiver.kernel.org
	public-inbox-index lkml


Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-kernel


AGPL code for this site: git clone https://public-inbox.org/ public-inbox