LKML Archive on lore.kernel.org
* [PATCH 0/4] locks: avoid thundering-herd wake-ups
@ 2018-08-08  1:51 NeilBrown
  2018-08-08  1:51 ` [PATCH 1/4] fs/locks: rename some lists and pointers NeilBrown
                   ` (5 more replies)
  0 siblings, 6 replies; 26+ messages in thread
From: NeilBrown @ 2018-08-08  1:51 UTC (permalink / raw)
  To: Jeff Layton, Alexander Viro; +Cc: J. Bruce Fields, linux-fsdevel, linux-kernel

If you have a many-core machine, and have many threads all wanting to
briefly lock a given file (udev is known to do this), you can get quite
poor performance.

When one thread releases a lock, it wakes up all other threads that
are waiting (classic thundering-herd) - one will get the lock and the
others go to sleep.
When you have few cores, this is not very noticeable: by the time the
4th or 5th thread gets enough CPU time to try to claim the lock, the
earlier threads have claimed it, done what was needed, and released.
With 50+ cores, the contention can easily be measured.

This patchset creates a tree of pending lock requests in which siblings
don't conflict and each lock request does conflict with its parent.
When a lock is released, only requests which don't conflict with each
other are woken.
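The idea can be sketched in a minimal userspace toy model (hypothetical names such as `req` and `insert_block`; this is an illustration of the scheme, not the kernel code): waiters that conflict with an existing waiter are attached below it, so releasing a lock wakes only its direct, mutually non-conflicting children.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the waiter tree: each request records the request it
 * is blocked behind (blocker) and keeps a singly linked list of
 * requests blocked behind it (blocked). */
struct req {
	int start, end;		/* byte range the request covers   */
	bool exclusive;		/* write lock?                     */
	struct req *blocker;	/* request/lock we are waiting on  */
	struct req *blocked;	/* first request waiting behind us */
	struct req *next;	/* sibling link in blocker's list  */
};

static bool conflict(const struct req *a, const struct req *b)
{
	if (a->end < b->start || b->end < a->start)
		return false;			/* no overlap        */
	return a->exclusive || b->exclusive;	/* both shared is ok */
}

/* Attach 'waiter' below the first existing waiter it conflicts with,
 * so siblings on any 'blocked' list never conflict with each other. */
static void insert_block(struct req *blocker, struct req *waiter)
{
	struct req *fl;
again:
	for (fl = blocker->blocked; fl; fl = fl->next)
		if (conflict(fl, waiter)) {
			blocker = fl;
			goto again;
		}
	waiter->blocker = blocker;
	waiter->next = blocker->blocked;
	blocker->blocked = waiter;
}

/* On unlock, only the direct children are woken; their subtrees stay
 * asleep until their own blocker goes away. */
static int wake_direct(struct req *blocker)
{
	int n = 0;

	while (blocker->blocked) {
		struct req *w = blocker->blocked;

		blocker->blocked = w->next;
		w->blocker = NULL;
		w->next = NULL;
		n++;
	}
	return n;
}
```

With two exclusive requests on the same range and one on a disjoint range, the second exclusive request nests under the first, and a release wakes only two waiters instead of all three.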

Testing shows that lock-acquisitions-per-second is now fairly stable even
as the number of contending processes grows to 1000.  Without this patch,
locks-per-second drops off steeply after a few tens of processes.

There is a small cost to this extra complexity.
At 20 processes running a particular test on 72 cores, the lock-acquisition
rate drops from 1.8 million to 1.4 million per second with
this patch.  At 100 processes, this patch still provides 1.4 million
per second, while without it the rate is about 700,000.

NeilBrown

---

NeilBrown (4):
      fs/locks: rename some lists and pointers.
      fs/locks: allow a lock request to block other requests.
      fs/locks: change all *_conflict() functions to return bool.
      fs/locks: create a tree of dependent requests.


 fs/cifs/file.c                  |    2 -
 fs/locks.c                      |  142 +++++++++++++++++++++++++--------------
 include/linux/fs.h              |    5 +
 include/trace/events/filelock.h |   16 ++--
 4 files changed, 103 insertions(+), 62 deletions(-)

--
Signature


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 1/4] fs/locks: rename some lists and pointers.
  2018-08-08  1:51 [PATCH 0/4] locks: avoid thundering-herd wake-ups NeilBrown
@ 2018-08-08  1:51 ` NeilBrown
  2018-08-08 10:47   ` Jeff Layton
  2018-08-08  1:51 ` [PATCH 3/4] fs/locks: change all *_conflict() functions to return bool NeilBrown
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 26+ messages in thread
From: NeilBrown @ 2018-08-08  1:51 UTC (permalink / raw)
  To: Jeff Layton, Alexander Viro; +Cc: J. Bruce Fields, linux-fsdevel, linux-kernel

struct file_lock contains an 'fl_next' pointer which
is used to point to the lock that this request is blocked
waiting for.  So rename it to fl_blocker.

The fl_blocked list_head in an active lock is the head of a list of
blocked requests.  In a request it is a node in that list.
These are two distinct uses, so replace with two list_heads
with different names.
fl_blocked is the head of a list of blocked requests;
fl_block is a node on that list.

This separation will be used in the next patch.

Note that a tracepoint is changed to report fl_blocker instead
of fl_next.  Might this be an ABI breakage?

Signed-off-by: NeilBrown <neilb@suse.com>
---
 fs/cifs/file.c                  |    2 +-
 fs/locks.c                      |   40 ++++++++++++++++++++-------------------
 include/linux/fs.h              |    5 +++--
 include/trace/events/filelock.h |   16 ++++++++--------
 4 files changed, 33 insertions(+), 30 deletions(-)

diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 8d41ca7bfcf1..066ed2e4ba96 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -1092,7 +1092,7 @@ cifs_posix_lock_set(struct file *file, struct file_lock *flock)
 	rc = posix_lock_file(file, flock, NULL);
 	up_write(&cinode->lock_sem);
 	if (rc == FILE_LOCK_DEFERRED) {
-		rc = wait_event_interruptible(flock->fl_wait, !flock->fl_next);
+		rc = wait_event_interruptible(flock->fl_wait, !flock->fl_blocker);
 		if (!rc)
 			goto try_again;
 		posix_unblock_lock(flock);
diff --git a/fs/locks.c b/fs/locks.c
index db7b6917d9c5..322491e70e41 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -194,7 +194,7 @@ static DEFINE_HASHTABLE(blocked_hash, BLOCKED_HASH_BITS);
  * This lock protects the blocked_hash. Generally, if you're accessing it, you
  * want to be holding this lock.
  *
- * In addition, it also protects the fl->fl_block list, and the fl->fl_next
+ * In addition, it also protects the fl->fl_blocked list, and the fl->fl_blocker
  * pointer for file_lock structures that are acting as lock requests (in
  * contrast to those that are acting as records of acquired locks).
  *
@@ -203,7 +203,7 @@ static DEFINE_HASHTABLE(blocked_hash, BLOCKED_HASH_BITS);
  * protected by this lock, we can skip acquiring it iff we already hold the
  * flc_lock.
  *
- * In particular, adding an entry to the fl_block list requires that you hold
+ * In particular, adding an entry to the fl_blocked list requires that you hold
  * both the flc_lock and the blocked_lock_lock (acquired in that order).
  * Deleting an entry from the list however only requires the file_lock_lock.
  */
@@ -302,6 +302,7 @@ static void locks_init_lock_heads(struct file_lock *fl)
 {
 	INIT_HLIST_NODE(&fl->fl_link);
 	INIT_LIST_HEAD(&fl->fl_list);
+	INIT_LIST_HEAD(&fl->fl_blocked);
 	INIT_LIST_HEAD(&fl->fl_block);
 	init_waitqueue_head(&fl->fl_wait);
 }
@@ -341,6 +342,7 @@ void locks_free_lock(struct file_lock *fl)
 {
 	BUG_ON(waitqueue_active(&fl->fl_wait));
 	BUG_ON(!list_empty(&fl->fl_list));
+	BUG_ON(!list_empty(&fl->fl_blocked));
 	BUG_ON(!list_empty(&fl->fl_block));
 	BUG_ON(!hlist_unhashed(&fl->fl_link));
 
@@ -676,7 +678,7 @@ static void __locks_delete_block(struct file_lock *waiter)
 {
 	locks_delete_global_blocked(waiter);
 	list_del_init(&waiter->fl_block);
-	waiter->fl_next = NULL;
+	waiter->fl_blocker = NULL;
 }
 
 static void locks_delete_block(struct file_lock *waiter)
@@ -692,16 +694,16 @@ static void locks_delete_block(struct file_lock *waiter)
  * it seems like the reasonable thing to do.
  *
  * Must be called with both the flc_lock and blocked_lock_lock held. The
- * fl_block list itself is protected by the blocked_lock_lock, but by ensuring
+ * fl_blocked list itself is protected by the blocked_lock_lock, but by ensuring
  * that the flc_lock is also held on insertions we can avoid taking the
- * blocked_lock_lock in some cases when we see that the fl_block list is empty.
+ * blocked_lock_lock in some cases when we see that the fl_blocked list is empty.
  */
 static void __locks_insert_block(struct file_lock *blocker,
 					struct file_lock *waiter)
 {
 	BUG_ON(!list_empty(&waiter->fl_block));
-	waiter->fl_next = blocker;
-	list_add_tail(&waiter->fl_block, &blocker->fl_block);
+	waiter->fl_blocker = blocker;
+	list_add_tail(&waiter->fl_block, &blocker->fl_blocked);
 	if (IS_POSIX(blocker) && !IS_OFDLCK(blocker))
 		locks_insert_global_blocked(waiter);
 }
@@ -725,18 +727,18 @@ static void locks_wake_up_blocks(struct file_lock *blocker)
 	/*
 	 * Avoid taking global lock if list is empty. This is safe since new
 	 * blocked requests are only added to the list under the flc_lock, and
-	 * the flc_lock is always held here. Note that removal from the fl_block
+	 * the flc_lock is always held here. Note that removal from the fl_blocked
 	 * list does not require the flc_lock, so we must recheck list_empty()
 	 * after acquiring the blocked_lock_lock.
 	 */
-	if (list_empty(&blocker->fl_block))
+	if (list_empty(&blocker->fl_blocked))
 		return;
 
 	spin_lock(&blocked_lock_lock);
-	while (!list_empty(&blocker->fl_block)) {
+	while (!list_empty(&blocker->fl_blocked)) {
 		struct file_lock *waiter;
 
-		waiter = list_first_entry(&blocker->fl_block,
+		waiter = list_first_entry(&blocker->fl_blocked,
 				struct file_lock, fl_block);
 		__locks_delete_block(waiter);
 		if (waiter->fl_lmops && waiter->fl_lmops->lm_notify)
@@ -887,7 +889,7 @@ static struct file_lock *what_owner_is_waiting_for(struct file_lock *block_fl)
 
 	hash_for_each_possible(blocked_hash, fl, fl_link, posix_owner_key(block_fl)) {
 		if (posix_same_owner(fl, block_fl))
-			return fl->fl_next;
+			return fl->fl_blocker;
 	}
 	return NULL;
 }
@@ -1245,7 +1247,7 @@ static int posix_lock_inode_wait(struct inode *inode, struct file_lock *fl)
 		error = posix_lock_inode(inode, fl, NULL);
 		if (error != FILE_LOCK_DEFERRED)
 			break;
-		error = wait_event_interruptible(fl->fl_wait, !fl->fl_next);
+		error = wait_event_interruptible(fl->fl_wait, !fl->fl_blocker);
 		if (!error)
 			continue;
 
@@ -1332,7 +1334,7 @@ int locks_mandatory_area(struct inode *inode, struct file *filp, loff_t start,
 		error = posix_lock_inode(inode, &fl, NULL);
 		if (error != FILE_LOCK_DEFERRED)
 			break;
-		error = wait_event_interruptible(fl.fl_wait, !fl.fl_next);
+		error = wait_event_interruptible(fl.fl_wait, !fl.fl_blocker);
 		if (!error) {
 			/*
 			 * If we've been sleeping someone might have
@@ -1526,7 +1528,7 @@ int __break_lease(struct inode *inode, unsigned int mode, unsigned int type)
 
 	locks_dispose_list(&dispose);
 	error = wait_event_interruptible_timeout(new_fl->fl_wait,
-						!new_fl->fl_next, break_time);
+						!new_fl->fl_blocker, break_time);
 
 	percpu_down_read_preempt_disable(&file_rwsem);
 	spin_lock(&ctx->flc_lock);
@@ -1940,7 +1942,7 @@ static int flock_lock_inode_wait(struct inode *inode, struct file_lock *fl)
 		error = flock_lock_inode(inode, fl);
 		if (error != FILE_LOCK_DEFERRED)
 			break;
-		error = wait_event_interruptible(fl->fl_wait, !fl->fl_next);
+		error = wait_event_interruptible(fl->fl_wait, !fl->fl_blocker);
 		if (!error)
 			continue;
 
@@ -2212,7 +2214,7 @@ static int do_lock_file_wait(struct file *filp, unsigned int cmd,
 		error = vfs_lock_file(filp, cmd, fl, NULL);
 		if (error != FILE_LOCK_DEFERRED)
 			break;
-		error = wait_event_interruptible(fl->fl_wait, !fl->fl_next);
+		error = wait_event_interruptible(fl->fl_wait, !fl->fl_blocker);
 		if (!error)
 			continue;
 
@@ -2583,7 +2585,7 @@ posix_unblock_lock(struct file_lock *waiter)
 	int status = 0;
 
 	spin_lock(&blocked_lock_lock);
-	if (waiter->fl_next)
+	if (waiter->fl_blocker)
 		__locks_delete_block(waiter);
 	else
 		status = -ENOENT;
@@ -2711,7 +2713,7 @@ static int locks_show(struct seq_file *f, void *v)
 
 	lock_get_status(f, fl, iter->li_pos, "");
 
-	list_for_each_entry(bfl, &fl->fl_block, fl_block)
+	list_for_each_entry(bfl, &fl->fl_blocked, fl_block)
 		lock_get_status(f, bfl, iter->li_pos, " ->");
 
 	return 0;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 805bf22898cf..48f7f36c3772 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1002,10 +1002,11 @@ bool opens_in_grace(struct net *);
  * Obviously, the last two criteria only matter for POSIX locks.
  */
 struct file_lock {
-	struct file_lock *fl_next;	/* singly linked list for this inode  */
+	struct file_lock *fl_blocker;	/* The lock, that is blocking us */
 	struct list_head fl_list;	/* link into file_lock_context */
 	struct hlist_node fl_link;	/* node in global lists */
-	struct list_head fl_block;	/* circular list of blocked processes */
+	struct list_head fl_blocked;	/* list of requests with ->fl_blocker pointing here */
+	struct list_head fl_block;	/* node in ->fl_blocker->fl_blocked */
 	fl_owner_t fl_owner;
 	unsigned int fl_flags;
 	unsigned char fl_type;
diff --git a/include/trace/events/filelock.h b/include/trace/events/filelock.h
index d1faf3597b9d..0e50f985a390 100644
--- a/include/trace/events/filelock.h
+++ b/include/trace/events/filelock.h
@@ -68,7 +68,7 @@ DECLARE_EVENT_CLASS(filelock_lock,
 		__field(struct file_lock *, fl)
 		__field(unsigned long, i_ino)
 		__field(dev_t, s_dev)
-		__field(struct file_lock *, fl_next)
+		__field(struct file_lock *, fl_blocker)
 		__field(fl_owner_t, fl_owner)
 		__field(unsigned int, fl_pid)
 		__field(unsigned int, fl_flags)
@@ -82,7 +82,7 @@ DECLARE_EVENT_CLASS(filelock_lock,
 		__entry->fl = fl ? fl : NULL;
 		__entry->s_dev = inode->i_sb->s_dev;
 		__entry->i_ino = inode->i_ino;
-		__entry->fl_next = fl ? fl->fl_next : NULL;
+		__entry->fl_blocker = fl ? fl->fl_blocker : NULL;
 		__entry->fl_owner = fl ? fl->fl_owner : NULL;
 		__entry->fl_pid = fl ? fl->fl_pid : 0;
 		__entry->fl_flags = fl ? fl->fl_flags : 0;
@@ -92,9 +92,9 @@ DECLARE_EVENT_CLASS(filelock_lock,
 		__entry->ret = ret;
 	),
 
-	TP_printk("fl=0x%p dev=0x%x:0x%x ino=0x%lx fl_next=0x%p fl_owner=0x%p fl_pid=%u fl_flags=%s fl_type=%s fl_start=%lld fl_end=%lld ret=%d",
+	TP_printk("fl=0x%p dev=0x%x:0x%x ino=0x%lx fl_blocker=0x%p fl_owner=0x%p fl_pid=%u fl_flags=%s fl_type=%s fl_start=%lld fl_end=%lld ret=%d",
 		__entry->fl, MAJOR(__entry->s_dev), MINOR(__entry->s_dev),
-		__entry->i_ino, __entry->fl_next, __entry->fl_owner,
+		__entry->i_ino, __entry->fl_blocker, __entry->fl_owner,
 		__entry->fl_pid, show_fl_flags(__entry->fl_flags),
 		show_fl_type(__entry->fl_type),
 		__entry->fl_start, __entry->fl_end, __entry->ret)
@@ -122,7 +122,7 @@ DECLARE_EVENT_CLASS(filelock_lease,
 		__field(struct file_lock *, fl)
 		__field(unsigned long, i_ino)
 		__field(dev_t, s_dev)
-		__field(struct file_lock *, fl_next)
+		__field(struct file_lock *, fl_blocker)
 		__field(fl_owner_t, fl_owner)
 		__field(unsigned int, fl_flags)
 		__field(unsigned char, fl_type)
@@ -134,7 +134,7 @@ DECLARE_EVENT_CLASS(filelock_lease,
 		__entry->fl = fl ? fl : NULL;
 		__entry->s_dev = inode->i_sb->s_dev;
 		__entry->i_ino = inode->i_ino;
-		__entry->fl_next = fl ? fl->fl_next : NULL;
+		__entry->fl_blocker = fl ? fl->fl_blocker : NULL;
 		__entry->fl_owner = fl ? fl->fl_owner : NULL;
 		__entry->fl_flags = fl ? fl->fl_flags : 0;
 		__entry->fl_type = fl ? fl->fl_type : 0;
@@ -142,9 +142,9 @@ DECLARE_EVENT_CLASS(filelock_lease,
 		__entry->fl_downgrade_time = fl ? fl->fl_downgrade_time : 0;
 	),
 
-	TP_printk("fl=0x%p dev=0x%x:0x%x ino=0x%lx fl_next=0x%p fl_owner=0x%p fl_flags=%s fl_type=%s fl_break_time=%lu fl_downgrade_time=%lu",
+	TP_printk("fl=0x%p dev=0x%x:0x%x ino=0x%lx fl_blocker=0x%p fl_owner=0x%p fl_flags=%s fl_type=%s fl_break_time=%lu fl_downgrade_time=%lu",
 		__entry->fl, MAJOR(__entry->s_dev), MINOR(__entry->s_dev),
-		__entry->i_ino, __entry->fl_next, __entry->fl_owner,
+		__entry->i_ino, __entry->fl_blocker, __entry->fl_owner,
 		show_fl_flags(__entry->fl_flags),
 		show_fl_type(__entry->fl_type),
 		__entry->fl_break_time, __entry->fl_downgrade_time)




* [PATCH 2/4] fs/locks: allow a lock request to block other requests.
  2018-08-08  1:51 [PATCH 0/4] locks: avoid thundering-herd wake-ups NeilBrown
  2018-08-08  1:51 ` [PATCH 1/4] fs/locks: rename some lists and pointers NeilBrown
  2018-08-08  1:51 ` [PATCH 3/4] fs/locks: change all *_conflict() functions to return bool NeilBrown
@ 2018-08-08  1:51 ` NeilBrown
  2018-08-08  1:51 ` [PATCH 4/4] fs/locks: create a tree of dependent requests NeilBrown
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 26+ messages in thread
From: NeilBrown @ 2018-08-08  1:51 UTC (permalink / raw)
  To: Jeff Layton, Alexander Viro; +Cc: J. Bruce Fields, linux-fsdevel, linux-kernel

Currently, a lock can block pending requests, but all pending
requests are equal.  If lots of pending requests are
mutually exclusive, this means they will all be woken up
and all but one will fail.  This can hurt performance.

So we will allow pending requests to block other requests.
Only the first request will be woken, and it will wake the others.

This patch doesn't implement this fully, but prepares the way.

- It acknowledges that a request might be blocking other requests,
  and when the request is converted to a lock, those blocked
  requests are moved across.
- When a request is removed, all blocked requests are woken.
- When deadlock-detection looks for the lock which blocks a
  given request, we follow the chain of ->fl_blocker all
  the way to the top.
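The first point above, moving blocked requests across when a request becomes a lock, can be sketched like this (a simplified userspace model with hypothetical names, not the kernel's locks_move_blocks(): it splices the old request's waiter list onto the new lock and re-points each waiter's blocker).

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for struct file_lock's blocking fields. */
struct node {
	struct node *blocker;	/* what we wait behind         */
	struct node *blocked;	/* head of waiters behind us   */
	struct node *next;	/* sibling link                */
};

/* Move the waiters queued behind 'old' so they now wait behind
 * 'newl': splice the list across, then repoint each blocker. */
static void move_blocks(struct node *newl, struct node *old)
{
	struct node *f;

	newl->blocked = old->blocked;
	old->blocked = NULL;
	for (f = newl->blocked; f; f = f->next)
		f->blocker = newl;
}
```

In the kernel this happens under blocked_lock_lock so waiters never observe a request with no blocker mid-move.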

Signed-off-by: NeilBrown <neilb@suse.com>
---
 fs/locks.c |   58 ++++++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 40 insertions(+), 18 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index 322491e70e41..b4812da2a374 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -408,9 +408,19 @@ void locks_copy_lock(struct file_lock *new, struct file_lock *fl)
 			fl->fl_ops->fl_copy_lock(new, fl);
 	}
 }
-
 EXPORT_SYMBOL(locks_copy_lock);
 
+static void locks_move_blocks(struct file_lock *new, struct file_lock *fl)
+{
+	struct file_lock *f;
+
+	spin_lock(&blocked_lock_lock);
+	list_splice_init(&fl->fl_blocked, &new->fl_blocked);
+	list_for_each_entry(f, &fl->fl_blocked, fl_block)
+		f->fl_blocker = new;
+	spin_unlock(&blocked_lock_lock);
+}
+
 static inline int flock_translate_cmd(int cmd) {
 	if (cmd & LOCK_MAND)
 		return cmd & (LOCK_MAND | LOCK_RW);
@@ -681,9 +691,25 @@ static void __locks_delete_block(struct file_lock *waiter)
 	waiter->fl_blocker = NULL;
 }
 
+static void __locks_wake_up_blocks(struct file_lock *blocker)
+{
+	while (!list_empty(&blocker->fl_blocked)) {
+		struct file_lock *waiter;
+
+		waiter = list_first_entry(&blocker->fl_blocked,
+				struct file_lock, fl_block);
+		__locks_delete_block(waiter);
+		if (waiter->fl_lmops && waiter->fl_lmops->lm_notify)
+			waiter->fl_lmops->lm_notify(waiter);
+		else
+			wake_up(&waiter->fl_wait);
+	}
+}
+
 static void locks_delete_block(struct file_lock *waiter)
 {
 	spin_lock(&blocked_lock_lock);
+	__locks_wake_up_blocks(waiter);
 	__locks_delete_block(waiter);
 	spin_unlock(&blocked_lock_lock);
 }
@@ -735,17 +761,7 @@ static void locks_wake_up_blocks(struct file_lock *blocker)
 		return;
 
 	spin_lock(&blocked_lock_lock);
-	while (!list_empty(&blocker->fl_blocked)) {
-		struct file_lock *waiter;
-
-		waiter = list_first_entry(&blocker->fl_blocked,
-				struct file_lock, fl_block);
-		__locks_delete_block(waiter);
-		if (waiter->fl_lmops && waiter->fl_lmops->lm_notify)
-			waiter->fl_lmops->lm_notify(waiter);
-		else
-			wake_up(&waiter->fl_wait);
-	}
+	__locks_wake_up_blocks(blocker);
 	spin_unlock(&blocked_lock_lock);
 }
 
@@ -888,8 +904,11 @@ static struct file_lock *what_owner_is_waiting_for(struct file_lock *block_fl)
 	struct file_lock *fl;
 
 	hash_for_each_possible(blocked_hash, fl, fl_link, posix_owner_key(block_fl)) {
-		if (posix_same_owner(fl, block_fl))
-			return fl->fl_blocker;
+		if (posix_same_owner(fl, block_fl)) {
+			while (fl->fl_blocker)
+				fl = fl->fl_blocker;
+			return fl;
+		}
 	}
 	return NULL;
 }
@@ -982,6 +1001,7 @@ static int flock_lock_inode(struct inode *inode, struct file_lock *request)
 	if (request->fl_flags & FL_ACCESS)
 		goto out;
 	locks_copy_lock(new_fl, request);
+	locks_move_blocks(new_fl, request);
 	locks_insert_lock_ctx(new_fl, &ctx->flc_flock);
 	new_fl = NULL;
 	error = 0;
@@ -1174,6 +1194,7 @@ static int posix_lock_inode(struct inode *inode, struct file_lock *request,
 			goto out;
 		}
 		locks_copy_lock(new_fl, request);
+		locks_move_blocks(new_fl, request);
 		locks_insert_lock_ctx(new_fl, &fl->fl_list);
 		fl = new_fl;
 		new_fl = NULL;
@@ -2582,13 +2603,14 @@ void locks_remove_file(struct file *filp)
 int
 posix_unblock_lock(struct file_lock *waiter)
 {
-	int status = 0;
+	int status = -ENOENT;
 
 	spin_lock(&blocked_lock_lock);
-	if (waiter->fl_blocker)
+	if (waiter->fl_blocker) {
+		__locks_wake_up_blocks(waiter);
 		__locks_delete_block(waiter);
-	else
-		status = -ENOENT;
+		status = 0;
+	}
 	spin_unlock(&blocked_lock_lock);
 	return status;
 }




* [PATCH 3/4] fs/locks: change all *_conflict() functions to return bool.
  2018-08-08  1:51 [PATCH 0/4] locks: avoid thundering-herd wake-ups NeilBrown
  2018-08-08  1:51 ` [PATCH 1/4] fs/locks: rename some lists and pointers NeilBrown
@ 2018-08-08  1:51 ` NeilBrown
  2018-08-08  1:51 ` [PATCH 2/4] fs/locks: allow a lock request to block other requests NeilBrown
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 26+ messages in thread
From: NeilBrown @ 2018-08-08  1:51 UTC (permalink / raw)
  To: Jeff Layton, Alexander Viro; +Cc: J. Bruce Fields, linux-fsdevel, linux-kernel

posix_locks_conflict() and flock_locks_conflict() both return int.
leases_conflict() returns bool.

This inconsistency will cause problems for the next patch if not
fixed.

So change posix_locks_conflict() and flock_locks_conflict() to return
bool.
Also change the locks_conflict() helper.

And convert some
   return (foo);
to
   return foo;

Signed-off-by: NeilBrown <neilb@suse.com>
---
 fs/locks.c |   24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index b4812da2a374..aaa55925c788 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -793,47 +793,47 @@ locks_delete_lock_ctx(struct file_lock *fl, struct list_head *dispose)
 /* Determine if lock sys_fl blocks lock caller_fl. Common functionality
  * checks for shared/exclusive status of overlapping locks.
  */
-static int locks_conflict(struct file_lock *caller_fl, struct file_lock *sys_fl)
+static bool locks_conflict(struct file_lock *caller_fl, struct file_lock *sys_fl)
 {
 	if (sys_fl->fl_type == F_WRLCK)
-		return 1;
+		return true;
 	if (caller_fl->fl_type == F_WRLCK)
-		return 1;
-	return 0;
+		return true;
+	return false;
 }
 
 /* Determine if lock sys_fl blocks lock caller_fl. POSIX specific
  * checking before calling the locks_conflict().
  */
-static int posix_locks_conflict(struct file_lock *caller_fl, struct file_lock *sys_fl)
+static bool posix_locks_conflict(struct file_lock *caller_fl, struct file_lock *sys_fl)
 {
 	/* POSIX locks owned by the same process do not conflict with
 	 * each other.
 	 */
 	if (posix_same_owner(caller_fl, sys_fl))
-		return (0);
+		return false;
 
 	/* Check whether they overlap */
 	if (!locks_overlap(caller_fl, sys_fl))
-		return 0;
+		return false;
 
-	return (locks_conflict(caller_fl, sys_fl));
+	return locks_conflict(caller_fl, sys_fl);
 }
 
 /* Determine if lock sys_fl blocks lock caller_fl. FLOCK specific
  * checking before calling the locks_conflict().
  */
-static int flock_locks_conflict(struct file_lock *caller_fl, struct file_lock *sys_fl)
+static bool flock_locks_conflict(struct file_lock *caller_fl, struct file_lock *sys_fl)
 {
 	/* FLOCK locks referring to the same filp do not conflict with
 	 * each other.
 	 */
 	if (caller_fl->fl_file == sys_fl->fl_file)
-		return (0);
+		return false;
 	if ((caller_fl->fl_type & LOCK_MAND) || (sys_fl->fl_type & LOCK_MAND))
-		return 0;
+		return false;
 
-	return (locks_conflict(caller_fl, sys_fl));
+	return locks_conflict(caller_fl, sys_fl);
 }
 
 void




* [PATCH 4/4] fs/locks: create a tree of dependent requests.
  2018-08-08  1:51 [PATCH 0/4] locks: avoid thundering-herd wake-ups NeilBrown
                   ` (2 preceding siblings ...)
  2018-08-08  1:51 ` [PATCH 2/4] fs/locks: allow a lock request to block other requests NeilBrown
@ 2018-08-08  1:51 ` NeilBrown
  2018-08-08 16:47 ` [PATCH 0/4] locks: avoid thundering-herd wake-ups Jeff Layton
  2018-08-08 19:54 ` J. Bruce Fields
  5 siblings, 0 replies; 26+ messages in thread
From: NeilBrown @ 2018-08-08  1:51 UTC (permalink / raw)
  To: Jeff Layton, Alexander Viro; +Cc: J. Bruce Fields, linux-fsdevel, linux-kernel

When we find an existing lock which conflicts with a request,
and the request wants to wait, we currently add the request
to a list.  When the lock is removed, the whole list is woken.
This can cause the thundering-herd problem.
To reduce the problem, we make use of the (new) fact that
a pending request can itself have a list of blocked requests.
When we find a conflict, we look through the existing blocked requests.
If any one of them blocks the new request, the new request is attached
below that request.
This way, when the lock is released, only a set of non-conflicting
locks will be woken.  The rest of the herd can stay asleep.
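The insertion walk described above is parameterized by a conflict predicate, since POSIX, flock and lease conflicts differ. A hedged sketch of that plumbing (hypothetical names; the real __locks_insert_block() takes the predicate the same way):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Just enough state to show the conflict-callback plumbing. */
struct lk {
	int owner;
	struct lk *parent;	/* who we are blocked behind     */
	struct lk *child;	/* first waiter blocked behind us */
	struct lk *sib;		/* sibling link                   */
};

/* Walk down through conflicting waiters and attach below the
 * deepest one found, using the caller-supplied predicate. */
static void insert_below(struct lk *blocker, struct lk *waiter,
			 bool (*conflict)(struct lk *, struct lk *))
{
	struct lk *fl;
again:
	for (fl = blocker->child; fl; fl = fl->sib)
		if (conflict(fl, waiter)) {
			blocker = fl;
			goto again;
		}
	waiter->parent = blocker;
	waiter->sib = blocker->child;
	blocker->child = waiter;
}

/* Example predicate: requests from different owners conflict. */
static bool owners_differ(struct lk *a, struct lk *b)
{
	return a->owner != b->owner;
}
```

Passing the predicate as a function pointer is why the previous patch made all *_conflict() helpers return bool: callers can hand any of them to the same insertion routine.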

Reported-and-tested-by: Martin Wilck <mwilck@suse.de>
Signed-off-by: NeilBrown <neilb@suse.com>
---
 fs/locks.c |   28 ++++++++++++++++++++++------
 1 file changed, 22 insertions(+), 6 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index aaa55925c788..69a30421218b 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -723,11 +723,24 @@ static void locks_delete_block(struct file_lock *waiter)
  * fl_blocked list itself is protected by the blocked_lock_lock, but by ensuring
  * that the flc_lock is also held on insertions we can avoid taking the
  * blocked_lock_lock in some cases when we see that the fl_blocked list is empty.
+ *
+ * Rather than just adding to the list, we check for conflicts with any existing
+ * waiter, and add to that waiter instead.
+ * Thus wakeups don't happen until needed.
  */
 static void __locks_insert_block(struct file_lock *blocker,
-					struct file_lock *waiter)
+				 struct file_lock *waiter,
+				 bool conflict(struct file_lock *,
+					       struct file_lock *))
 {
+	struct file_lock *fl;
 	BUG_ON(!list_empty(&waiter->fl_block));
+new_blocker:
+	list_for_each_entry(fl, &blocker->fl_blocked, fl_block)
+		if (conflict(fl, waiter)) {
+			blocker =  fl;
+			goto new_blocker;
+		}
 	waiter->fl_blocker = blocker;
 	list_add_tail(&waiter->fl_block, &blocker->fl_blocked);
 	if (IS_POSIX(blocker) && !IS_OFDLCK(blocker))
@@ -736,10 +749,12 @@ static void __locks_insert_block(struct file_lock *blocker,
 
 /* Must be called with flc_lock held. */
 static void locks_insert_block(struct file_lock *blocker,
-					struct file_lock *waiter)
+			       struct file_lock *waiter,
+			       bool conflict(struct file_lock *,
+					     struct file_lock *))
 {
 	spin_lock(&blocked_lock_lock);
-	__locks_insert_block(blocker, waiter);
+	__locks_insert_block(blocker, waiter, conflict);
 	spin_unlock(&blocked_lock_lock);
 }
 
@@ -995,7 +1010,7 @@ static int flock_lock_inode(struct inode *inode, struct file_lock *request)
 		if (!(request->fl_flags & FL_SLEEP))
 			goto out;
 		error = FILE_LOCK_DEFERRED;
-		locks_insert_block(fl, request);
+		locks_insert_block(fl, request, flock_locks_conflict);
 		goto out;
 	}
 	if (request->fl_flags & FL_ACCESS)
@@ -1069,7 +1084,8 @@ static int posix_lock_inode(struct inode *inode, struct file_lock *request,
 			spin_lock(&blocked_lock_lock);
 			if (likely(!posix_locks_deadlock(request, fl))) {
 				error = FILE_LOCK_DEFERRED;
-				__locks_insert_block(fl, request);
+				__locks_insert_block(fl, request,
+						     posix_locks_conflict);
 			}
 			spin_unlock(&blocked_lock_lock);
 			goto out;
@@ -1542,7 +1558,7 @@ int __break_lease(struct inode *inode, unsigned int mode, unsigned int type)
 		break_time -= jiffies;
 	if (break_time == 0)
 		break_time++;
-	locks_insert_block(fl, new_fl);
+	locks_insert_block(fl, new_fl, leases_conflict);
 	trace_break_lease_block(inode, new_fl);
 	spin_unlock(&ctx->flc_lock);
 	percpu_up_read_preempt_enable(&file_rwsem);




* Re: [PATCH 1/4] fs/locks: rename some lists and pointers.
  2018-08-08  1:51 ` [PATCH 1/4] fs/locks: rename some lists and pointers NeilBrown
@ 2018-08-08 10:47   ` Jeff Layton
  2018-08-08 19:07     ` J. Bruce Fields
  0 siblings, 1 reply; 26+ messages in thread
From: Jeff Layton @ 2018-08-08 10:47 UTC (permalink / raw)
  To: NeilBrown, Alexander Viro
  Cc: J. Bruce Fields, linux-fsdevel, linux-kernel, Martin Wilck

On Wed, 2018-08-08 at 11:51 +1000, NeilBrown wrote:
> struct file lock contains an 'fl_next' pointer which
> is used to point to the lock that this request is blocked
> waiting for.  So rename it to fl_blocker.
> 
> The fl_blocked list_head in an active lock is the head of a list of
> blocked requests.  In a request it is a node in that list.
> These are two distinct uses, so replace with two list_heads
> with different names.
> fl_blocked is the head of a list of blocked requests
> fl_block is a node on that list.
> 
> This separation will be used in the next patch.
> 
> Note that a tracepoint is changed to report fl_blocker instead
> of fl_next.  Might this be an ABI breakage?
> 

My understanding (possibly wrong) is that because tracepoints are
exposed via debugfs, they aren't considered part of the ABI.

Thanks for the patches, Neil. I'll go over them in detail soon.

> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  fs/cifs/file.c                  |    2 +-
>  fs/locks.c                      |   40 ++++++++++++++++++++-------------------
>  include/linux/fs.h              |    5 +++--
>  include/trace/events/filelock.h |   16 ++++++++--------
>  4 files changed, 33 insertions(+), 30 deletions(-)
> 
> diff --git a/fs/cifs/file.c b/fs/cifs/file.c
> index 8d41ca7bfcf1..066ed2e4ba96 100644
> --- a/fs/cifs/file.c
> +++ b/fs/cifs/file.c
> @@ -1092,7 +1092,7 @@ cifs_posix_lock_set(struct file *file, struct file_lock *flock)
>  	rc = posix_lock_file(file, flock, NULL);
>  	up_write(&cinode->lock_sem);
>  	if (rc == FILE_LOCK_DEFERRED) {
> -		rc = wait_event_interruptible(flock->fl_wait, !flock->fl_next);
> +		rc = wait_event_interruptible(flock->fl_wait, !flock->fl_blocker);
>  		if (!rc)
>  			goto try_again;
>  		posix_unblock_lock(flock);
> diff --git a/fs/locks.c b/fs/locks.c
> index db7b6917d9c5..322491e70e41 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -194,7 +194,7 @@ static DEFINE_HASHTABLE(blocked_hash, BLOCKED_HASH_BITS);
>   * This lock protects the blocked_hash. Generally, if you're accessing it, you
>   * want to be holding this lock.
>   *
> - * In addition, it also protects the fl->fl_block list, and the fl->fl_next
> + * In addition, it also protects the fl->fl_blocked list, and the fl->fl_blocker
>   * pointer for file_lock structures that are acting as lock requests (in
>   * contrast to those that are acting as records of acquired locks).
>   *
> @@ -203,7 +203,7 @@ static DEFINE_HASHTABLE(blocked_hash, BLOCKED_HASH_BITS);
>   * protected by this lock, we can skip acquiring it iff we already hold the
>   * flc_lock.
>   *
> - * In particular, adding an entry to the fl_block list requires that you hold
> + * In particular, adding an entry to the fl_blocked list requires that you hold
>   * both the flc_lock and the blocked_lock_lock (acquired in that order).
>   * Deleting an entry from the list however only requires the file_lock_lock.
>   */
> @@ -302,6 +302,7 @@ static void locks_init_lock_heads(struct file_lock *fl)
>  {
>  	INIT_HLIST_NODE(&fl->fl_link);
>  	INIT_LIST_HEAD(&fl->fl_list);
> +	INIT_LIST_HEAD(&fl->fl_blocked);
>  	INIT_LIST_HEAD(&fl->fl_block);
>  	init_waitqueue_head(&fl->fl_wait);
>  }
> @@ -341,6 +342,7 @@ void locks_free_lock(struct file_lock *fl)
>  {
>  	BUG_ON(waitqueue_active(&fl->fl_wait));
>  	BUG_ON(!list_empty(&fl->fl_list));
> +	BUG_ON(!list_empty(&fl->fl_blocked));
>  	BUG_ON(!list_empty(&fl->fl_block));
>  	BUG_ON(!hlist_unhashed(&fl->fl_link));
>  
> @@ -676,7 +678,7 @@ static void __locks_delete_block(struct file_lock *waiter)
>  {
>  	locks_delete_global_blocked(waiter);
>  	list_del_init(&waiter->fl_block);
> -	waiter->fl_next = NULL;
> +	waiter->fl_blocker = NULL;
>  }
>  
>  static void locks_delete_block(struct file_lock *waiter)
> @@ -692,16 +694,16 @@ static void locks_delete_block(struct file_lock *waiter)
>   * it seems like the reasonable thing to do.
>   *
>   * Must be called with both the flc_lock and blocked_lock_lock held. The
> - * fl_block list itself is protected by the blocked_lock_lock, but by ensuring
> + * fl_blocked list itself is protected by the blocked_lock_lock, but by ensuring
>   * that the flc_lock is also held on insertions we can avoid taking the
> - * blocked_lock_lock in some cases when we see that the fl_block list is empty.
> + * blocked_lock_lock in some cases when we see that the fl_blocked list is empty.
>   */
>  static void __locks_insert_block(struct file_lock *blocker,
>  					struct file_lock *waiter)
>  {
>  	BUG_ON(!list_empty(&waiter->fl_block));
> -	waiter->fl_next = blocker;
> -	list_add_tail(&waiter->fl_block, &blocker->fl_block);
> +	waiter->fl_blocker = blocker;
> +	list_add_tail(&waiter->fl_block, &blocker->fl_blocked);
>  	if (IS_POSIX(blocker) && !IS_OFDLCK(blocker))
>  		locks_insert_global_blocked(waiter);
>  }
> @@ -725,18 +727,18 @@ static void locks_wake_up_blocks(struct file_lock *blocker)
>  	/*
>  	 * Avoid taking global lock if list is empty. This is safe since new
>  	 * blocked requests are only added to the list under the flc_lock, and
> -	 * the flc_lock is always held here. Note that removal from the fl_block
> +	 * the flc_lock is always held here. Note that removal from the fl_blocked
>  	 * list does not require the flc_lock, so we must recheck list_empty()
>  	 * after acquiring the blocked_lock_lock.
>  	 */
> -	if (list_empty(&blocker->fl_block))
> +	if (list_empty(&blocker->fl_blocked))
>  		return;
>  
>  	spin_lock(&blocked_lock_lock);
> -	while (!list_empty(&blocker->fl_block)) {
> +	while (!list_empty(&blocker->fl_blocked)) {
>  		struct file_lock *waiter;
>  
> -		waiter = list_first_entry(&blocker->fl_block,
> +		waiter = list_first_entry(&blocker->fl_blocked,
>  				struct file_lock, fl_block);
>  		__locks_delete_block(waiter);
>  		if (waiter->fl_lmops && waiter->fl_lmops->lm_notify)
> @@ -887,7 +889,7 @@ static struct file_lock *what_owner_is_waiting_for(struct file_lock *block_fl)
>  
>  	hash_for_each_possible(blocked_hash, fl, fl_link, posix_owner_key(block_fl)) {
>  		if (posix_same_owner(fl, block_fl))
> -			return fl->fl_next;
> +			return fl->fl_blocker;
>  	}
>  	return NULL;
>  }
> @@ -1245,7 +1247,7 @@ static int posix_lock_inode_wait(struct inode *inode, struct file_lock *fl)
>  		error = posix_lock_inode(inode, fl, NULL);
>  		if (error != FILE_LOCK_DEFERRED)
>  			break;
> -		error = wait_event_interruptible(fl->fl_wait, !fl->fl_next);
> +		error = wait_event_interruptible(fl->fl_wait, !fl->fl_blocker);
>  		if (!error)
>  			continue;
>  
> @@ -1332,7 +1334,7 @@ int locks_mandatory_area(struct inode *inode, struct file *filp, loff_t start,
>  		error = posix_lock_inode(inode, &fl, NULL);
>  		if (error != FILE_LOCK_DEFERRED)
>  			break;
> -		error = wait_event_interruptible(fl.fl_wait, !fl.fl_next);
> +		error = wait_event_interruptible(fl.fl_wait, !fl.fl_blocker);
>  		if (!error) {
>  			/*
>  			 * If we've been sleeping someone might have
> @@ -1526,7 +1528,7 @@ int __break_lease(struct inode *inode, unsigned int mode, unsigned int type)
>  
>  	locks_dispose_list(&dispose);
>  	error = wait_event_interruptible_timeout(new_fl->fl_wait,
> -						!new_fl->fl_next, break_time);
> +						!new_fl->fl_blocker, break_time);
>  
>  	percpu_down_read_preempt_disable(&file_rwsem);
>  	spin_lock(&ctx->flc_lock);
> @@ -1940,7 +1942,7 @@ static int flock_lock_inode_wait(struct inode *inode, struct file_lock *fl)
>  		error = flock_lock_inode(inode, fl);
>  		if (error != FILE_LOCK_DEFERRED)
>  			break;
> -		error = wait_event_interruptible(fl->fl_wait, !fl->fl_next);
> +		error = wait_event_interruptible(fl->fl_wait, !fl->fl_blocker);
>  		if (!error)
>  			continue;
>  
> @@ -2212,7 +2214,7 @@ static int do_lock_file_wait(struct file *filp, unsigned int cmd,
>  		error = vfs_lock_file(filp, cmd, fl, NULL);
>  		if (error != FILE_LOCK_DEFERRED)
>  			break;
> -		error = wait_event_interruptible(fl->fl_wait, !fl->fl_next);
> +		error = wait_event_interruptible(fl->fl_wait, !fl->fl_blocker);
>  		if (!error)
>  			continue;
>  
> @@ -2583,7 +2585,7 @@ posix_unblock_lock(struct file_lock *waiter)
>  	int status = 0;
>  
>  	spin_lock(&blocked_lock_lock);
> -	if (waiter->fl_next)
> +	if (waiter->fl_blocker)
>  		__locks_delete_block(waiter);
>  	else
>  		status = -ENOENT;
> @@ -2711,7 +2713,7 @@ static int locks_show(struct seq_file *f, void *v)
>  
>  	lock_get_status(f, fl, iter->li_pos, "");
>  
> -	list_for_each_entry(bfl, &fl->fl_block, fl_block)
> +	list_for_each_entry(bfl, &fl->fl_blocked, fl_block)
>  		lock_get_status(f, bfl, iter->li_pos, " ->");
>  
>  	return 0;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 805bf22898cf..48f7f36c3772 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1002,10 +1002,11 @@ bool opens_in_grace(struct net *);
>   * Obviously, the last two criteria only matter for POSIX locks.
>   */
>  struct file_lock {
> -	struct file_lock *fl_next;	/* singly linked list for this inode  */
> +	struct file_lock *fl_blocker;	/* The lock, that is blocking us */
>  	struct list_head fl_list;	/* link into file_lock_context */
>  	struct hlist_node fl_link;	/* node in global lists */
> -	struct list_head fl_block;	/* circular list of blocked processes */
> +	struct list_head fl_blocked;	/* list of requests with ->fl_blocker pointing here */
> +	struct list_head fl_block;	/* node in ->fl_blocker->fl_blocked */
>  	fl_owner_t fl_owner;
>  	unsigned int fl_flags;
>  	unsigned char fl_type;
> diff --git a/include/trace/events/filelock.h b/include/trace/events/filelock.h
> index d1faf3597b9d..0e50f985a390 100644
> --- a/include/trace/events/filelock.h
> +++ b/include/trace/events/filelock.h
> @@ -68,7 +68,7 @@ DECLARE_EVENT_CLASS(filelock_lock,
>  		__field(struct file_lock *, fl)
>  		__field(unsigned long, i_ino)
>  		__field(dev_t, s_dev)
> -		__field(struct file_lock *, fl_next)
> +		__field(struct file_lock *, fl_blocker)
>  		__field(fl_owner_t, fl_owner)
>  		__field(unsigned int, fl_pid)
>  		__field(unsigned int, fl_flags)
> @@ -82,7 +82,7 @@ DECLARE_EVENT_CLASS(filelock_lock,
>  		__entry->fl = fl ? fl : NULL;
>  		__entry->s_dev = inode->i_sb->s_dev;
>  		__entry->i_ino = inode->i_ino;
> -		__entry->fl_next = fl ? fl->fl_next : NULL;
> +		__entry->fl_blocker = fl ? fl->fl_blocker : NULL;
>  		__entry->fl_owner = fl ? fl->fl_owner : NULL;
>  		__entry->fl_pid = fl ? fl->fl_pid : 0;
>  		__entry->fl_flags = fl ? fl->fl_flags : 0;
> @@ -92,9 +92,9 @@ DECLARE_EVENT_CLASS(filelock_lock,
>  		__entry->ret = ret;
>  	),
>  
> -	TP_printk("fl=0x%p dev=0x%x:0x%x ino=0x%lx fl_next=0x%p fl_owner=0x%p fl_pid=%u fl_flags=%s fl_type=%s fl_start=%lld fl_end=%lld ret=%d",
> +	TP_printk("fl=0x%p dev=0x%x:0x%x ino=0x%lx fl_blocker=0x%p fl_owner=0x%p fl_pid=%u fl_flags=%s fl_type=%s fl_start=%lld fl_end=%lld ret=%d",
>  		__entry->fl, MAJOR(__entry->s_dev), MINOR(__entry->s_dev),
> -		__entry->i_ino, __entry->fl_next, __entry->fl_owner,
> +		__entry->i_ino, __entry->fl_blocker, __entry->fl_owner,
>  		__entry->fl_pid, show_fl_flags(__entry->fl_flags),
>  		show_fl_type(__entry->fl_type),
>  		__entry->fl_start, __entry->fl_end, __entry->ret)
> @@ -122,7 +122,7 @@ DECLARE_EVENT_CLASS(filelock_lease,
>  		__field(struct file_lock *, fl)
>  		__field(unsigned long, i_ino)
>  		__field(dev_t, s_dev)
> -		__field(struct file_lock *, fl_next)
> +		__field(struct file_lock *, fl_blocker)
>  		__field(fl_owner_t, fl_owner)
>  		__field(unsigned int, fl_flags)
>  		__field(unsigned char, fl_type)
> @@ -134,7 +134,7 @@ DECLARE_EVENT_CLASS(filelock_lease,
>  		__entry->fl = fl ? fl : NULL;
>  		__entry->s_dev = inode->i_sb->s_dev;
>  		__entry->i_ino = inode->i_ino;
> -		__entry->fl_next = fl ? fl->fl_next : NULL;
> +		__entry->fl_blocker = fl ? fl->fl_blocker : NULL;
>  		__entry->fl_owner = fl ? fl->fl_owner : NULL;
>  		__entry->fl_flags = fl ? fl->fl_flags : 0;
>  		__entry->fl_type = fl ? fl->fl_type : 0;
> @@ -142,9 +142,9 @@ DECLARE_EVENT_CLASS(filelock_lease,
>  		__entry->fl_downgrade_time = fl ? fl->fl_downgrade_time : 0;
>  	),
>  
> -	TP_printk("fl=0x%p dev=0x%x:0x%x ino=0x%lx fl_next=0x%p fl_owner=0x%p fl_flags=%s fl_type=%s fl_break_time=%lu fl_downgrade_time=%lu",
> +	TP_printk("fl=0x%p dev=0x%x:0x%x ino=0x%lx fl_blocker=0x%p fl_owner=0x%p fl_flags=%s fl_type=%s fl_break_time=%lu fl_downgrade_time=%lu",
>  		__entry->fl, MAJOR(__entry->s_dev), MINOR(__entry->s_dev),
> -		__entry->i_ino, __entry->fl_next, __entry->fl_owner,
> +		__entry->i_ino, __entry->fl_blocker, __entry->fl_owner,
>  		show_fl_flags(__entry->fl_flags),
>  		show_fl_type(__entry->fl_type),
>  		__entry->fl_break_time, __entry->fl_downgrade_time)
> 
> 

-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/4] locks: avoid thundering-herd wake-ups
  2018-08-08  1:51 [PATCH 0/4] locks: avoid thundering-herd wake-ups NeilBrown
                   ` (3 preceding siblings ...)
  2018-08-08  1:51 ` [PATCH 4/4] fs/locks: create a tree of dependent requests NeilBrown
@ 2018-08-08 16:47 ` Jeff Layton
  2018-08-08 18:29   ` J. Bruce Fields
  2018-08-08 19:54 ` J. Bruce Fields
  5 siblings, 1 reply; 26+ messages in thread
From: Jeff Layton @ 2018-08-08 16:47 UTC (permalink / raw)
  To: NeilBrown, Alexander Viro
  Cc: J. Bruce Fields, linux-fsdevel, linux-kernel, Martin Wilck

On Wed, 2018-08-08 at 11:51 +1000, NeilBrown wrote:
> If you have a many-core machine, and have many threads all wanting to
> briefly lock a given file (udev is known to do this), you can get quite
> poor performance.
> 
> When one thread releases a lock, it wakes up all other threads that
> are waiting (classic thundering-herd) - one will get the lock and the
> others go to sleep.
> When you have few cores, this is not very noticeable: by the time the
> 4th or 5th thread gets enough CPU time to try to claim the lock, the
> earlier threads have claimed it, done what was needed, and released.
> With 50+ cores, the contention can easily be measured.
> 
> This patchset creates a tree of pending lock requests in which siblings
> don't conflict and each lock request does conflict with its parent.
> When a lock is released, only requests which don't conflict with each
> other are woken.
> 
> Testing shows that lock-acquisitions-per-second is now fairly stable even
> as the number of contending processes goes to 1000.  Without this patch,
> locks-per-second drops off steeply after a few 10s of processes.
> 
> There is a small cost to this extra complexity.
> At 20 processes running a particular test on 72 cores, the lock
> acquisitions per second drops from 1.8 million to 1.4 million with
> this patch.  For 100 processes, this patch still provides 1.4 million
> while without this patch there are about 700,000.
> 
> NeilBrown
> 
> ---
> 
> NeilBrown (4):
>       fs/locks: rename some lists and pointers.
>       fs/locks: allow a lock request to block other requests.
>       fs/locks: change all *_conflict() functions to return bool.
>       fs/locks: create a tree of dependent requests.
> 
> 
>  fs/cifs/file.c                  |    2 -
>  fs/locks.c                      |  142 +++++++++++++++++++++++++--------------
>  include/linux/fs.h              |    5 +
>  include/trace/events/filelock.h |   16 ++--
>  4 files changed, 103 insertions(+), 62 deletions(-)
> 

Nice work! I looked over this and I think it looks good.

I made an attempt to fix this issue several years ago, but my method
sucked as it ended up penalizing the unlocking task too much. This is
much cleaner and should scale well overall, I think.

I'll put this in -next soon and we can aim for merge in v4.20.

Thanks,
-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/4] locks: avoid thundering-herd wake-ups
  2018-08-08 16:47 ` [PATCH 0/4] locks: avoid thundering-herd wake-ups Jeff Layton
@ 2018-08-08 18:29   ` J. Bruce Fields
  2018-08-09  0:58     ` NeilBrown
  2018-08-20 11:02     ` Martin Wilck
  0 siblings, 2 replies; 26+ messages in thread
From: J. Bruce Fields @ 2018-08-08 18:29 UTC (permalink / raw)
  To: Jeff Layton
  Cc: NeilBrown, Alexander Viro, linux-fsdevel, linux-kernel, Martin Wilck

On Wed, Aug 08, 2018 at 12:47:22PM -0400, Jeff Layton wrote:
> On Wed, 2018-08-08 at 11:51 +1000, NeilBrown wrote:
> > If you have a many-core machine, and have many threads all wanting to
> > briefly lock a given file (udev is known to do this), you can get quite
> > poor performance.
> > 
> > When one thread releases a lock, it wakes up all other threads that
> > are waiting (classic thundering-herd) - one will get the lock and the
> > others go to sleep.
> > When you have few cores, this is not very noticeable: by the time the
> > 4th or 5th thread gets enough CPU time to try to claim the lock, the
> > earlier threads have claimed it, done what was needed, and released.
> > With 50+ cores, the contention can easily be measured.
> > 
> > This patchset creates a tree of pending lock requests in which siblings
> > don't conflict and each lock request does conflict with its parent.
> > When a lock is released, only requests which don't conflict with each
> > other are woken.
> > 
> > Testing shows that lock-acquisitions-per-second is now fairly stable even
> > as the number of contending processes goes to 1000.  Without this patch,
> > locks-per-second drops off steeply after a few 10s of processes.
> > 
> > There is a small cost to this extra complexity.
> > At 20 processes running a particular test on 72 cores, the lock
> > acquisitions per second drops from 1.8 million to 1.4 million with
> > this patch.  For 100 processes, this patch still provides 1.4 million
> > while without this patch there are about 700,000.
> > 
> > NeilBrown
> > 
> > ---
> > 
> > NeilBrown (4):
> >       fs/locks: rename some lists and pointers.
> >       fs/locks: allow a lock request to block other requests.
> >       fs/locks: change all *_conflict() functions to return bool.
> >       fs/locks: create a tree of dependent requests.
> > 
> > 
> >  fs/cifs/file.c                  |    2 -
> >  fs/locks.c                      |  142 +++++++++++++++++++++++++--------------
> >  include/linux/fs.h              |    5 +
> >  include/trace/events/filelock.h |   16 ++--
> >  4 files changed, 103 insertions(+), 62 deletions(-)
> > 
> 
> Nice work! I looked over this and I think it looks good.
> 
> I made an attempt to fix this issue several years ago, but my method
> sucked as it ended up penalizing the unlocking task too much. This is
> much cleaner and should scale well overall, I think.

I think I also took a crack at this at one point while I was at UM/CITI
and never got anything I was happy with.  Looks like good work!

I remember one main obstacle was that I felt like I never had a good
benchmark....

How did you choose this workload and hardware?  Was it in fact udev
(booting a large machine?), or was there some other motivation?

Not that I'm likely to do it any time soon, but could you share
sufficient details for someone else to reproduce your results?

--b.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/4] fs/locks: rename some lists and pointers.
  2018-08-08 10:47   ` Jeff Layton
@ 2018-08-08 19:07     ` J. Bruce Fields
  0 siblings, 0 replies; 26+ messages in thread
From: J. Bruce Fields @ 2018-08-08 19:07 UTC (permalink / raw)
  To: Jeff Layton
  Cc: NeilBrown, Alexander Viro, linux-fsdevel, linux-kernel, Martin Wilck

On Wed, Aug 08, 2018 at 06:47:34AM -0400, Jeff Layton wrote:
> On Wed, 2018-08-08 at 11:51 +1000, NeilBrown wrote:
> > struct file_lock contains an 'fl_next' pointer which
> > is used to point to the lock that this request is blocked
> > waiting for.  So rename it to fl_blocker.
> > 
> > The fl_blocked list_head in an active lock is the head of a list of
> > blocked requests.  In a request it is a node in that list.
> > These are two distinct uses, so replace with two list_heads
> > with different names.
> > fl_blocked is the head of a list of blocked requests
> > fl_block is a node on that list.
> > 
> > This separation will be used in the next patch.
> > 
> > Note that a tracepoint is changed to report fl_blocker instead
> > of fl_next.  Might this be an ABI breakage?
> > 
> 
> My understanding (possibly wrong) is that because tracepoints are
> exposed via debugfs, they aren't considered part of the ABI.

Wasn't there an incident where tracepoint changes had to be reverted for
powertop?

The chances of breaking an equivalent utility in this case are small, I
think, so I'm fine with trying it.

--b.

> 
> Thanks for the patches, Neil. I'll go over them in detail soon.
> 
> > Signed-off-by: NeilBrown <neilb@suse.com>
> > ---
> >  fs/cifs/file.c                  |    2 +-
> >  fs/locks.c                      |   40 ++++++++++++++++++++-------------------
> >  include/linux/fs.h              |    5 +++--
> >  include/trace/events/filelock.h |   16 ++++++++--------
> >  4 files changed, 33 insertions(+), 30 deletions(-)
> > 
> > diff --git a/fs/cifs/file.c b/fs/cifs/file.c
> > index 8d41ca7bfcf1..066ed2e4ba96 100644
> > --- a/fs/cifs/file.c
> > +++ b/fs/cifs/file.c
> > @@ -1092,7 +1092,7 @@ cifs_posix_lock_set(struct file *file, struct file_lock *flock)
> >  	rc = posix_lock_file(file, flock, NULL);
> >  	up_write(&cinode->lock_sem);
> >  	if (rc == FILE_LOCK_DEFERRED) {
> > -		rc = wait_event_interruptible(flock->fl_wait, !flock->fl_next);
> > +		rc = wait_event_interruptible(flock->fl_wait, !flock->fl_blocker);
> >  		if (!rc)
> >  			goto try_again;
> >  		posix_unblock_lock(flock);
> > diff --git a/fs/locks.c b/fs/locks.c
> > index db7b6917d9c5..322491e70e41 100644
> > --- a/fs/locks.c
> > +++ b/fs/locks.c
> > @@ -194,7 +194,7 @@ static DEFINE_HASHTABLE(blocked_hash, BLOCKED_HASH_BITS);
> >   * This lock protects the blocked_hash. Generally, if you're accessing it, you
> >   * want to be holding this lock.
> >   *
> > - * In addition, it also protects the fl->fl_block list, and the fl->fl_next
> > + * In addition, it also protects the fl->fl_blocked list, and the fl->fl_blocker
> >   * pointer for file_lock structures that are acting as lock requests (in
> >   * contrast to those that are acting as records of acquired locks).
> >   *
> > @@ -203,7 +203,7 @@ static DEFINE_HASHTABLE(blocked_hash, BLOCKED_HASH_BITS);
> >   * protected by this lock, we can skip acquiring it iff we already hold the
> >   * flc_lock.
> >   *
> > - * In particular, adding an entry to the fl_block list requires that you hold
> > + * In particular, adding an entry to the fl_blocked list requires that you hold
> >   * both the flc_lock and the blocked_lock_lock (acquired in that order).
> >   * Deleting an entry from the list however only requires the file_lock_lock.
> >   */
> > @@ -302,6 +302,7 @@ static void locks_init_lock_heads(struct file_lock *fl)
> >  {
> >  	INIT_HLIST_NODE(&fl->fl_link);
> >  	INIT_LIST_HEAD(&fl->fl_list);
> > +	INIT_LIST_HEAD(&fl->fl_blocked);
> >  	INIT_LIST_HEAD(&fl->fl_block);
> >  	init_waitqueue_head(&fl->fl_wait);
> >  }
> > @@ -341,6 +342,7 @@ void locks_free_lock(struct file_lock *fl)
> >  {
> >  	BUG_ON(waitqueue_active(&fl->fl_wait));
> >  	BUG_ON(!list_empty(&fl->fl_list));
> > +	BUG_ON(!list_empty(&fl->fl_blocked));
> >  	BUG_ON(!list_empty(&fl->fl_block));
> >  	BUG_ON(!hlist_unhashed(&fl->fl_link));
> >  
> > @@ -676,7 +678,7 @@ static void __locks_delete_block(struct file_lock *waiter)
> >  {
> >  	locks_delete_global_blocked(waiter);
> >  	list_del_init(&waiter->fl_block);
> > -	waiter->fl_next = NULL;
> > +	waiter->fl_blocker = NULL;
> >  }
> >  
> >  static void locks_delete_block(struct file_lock *waiter)
> > @@ -692,16 +694,16 @@ static void locks_delete_block(struct file_lock *waiter)
> >   * it seems like the reasonable thing to do.
> >   *
> >   * Must be called with both the flc_lock and blocked_lock_lock held. The
> > - * fl_block list itself is protected by the blocked_lock_lock, but by ensuring
> > + * fl_blocked list itself is protected by the blocked_lock_lock, but by ensuring
> >   * that the flc_lock is also held on insertions we can avoid taking the
> > - * blocked_lock_lock in some cases when we see that the fl_block list is empty.
> > + * blocked_lock_lock in some cases when we see that the fl_blocked list is empty.
> >   */
> >  static void __locks_insert_block(struct file_lock *blocker,
> >  					struct file_lock *waiter)
> >  {
> >  	BUG_ON(!list_empty(&waiter->fl_block));
> > -	waiter->fl_next = blocker;
> > -	list_add_tail(&waiter->fl_block, &blocker->fl_block);
> > +	waiter->fl_blocker = blocker;
> > +	list_add_tail(&waiter->fl_block, &blocker->fl_blocked);
> >  	if (IS_POSIX(blocker) && !IS_OFDLCK(blocker))
> >  		locks_insert_global_blocked(waiter);
> >  }
> > @@ -725,18 +727,18 @@ static void locks_wake_up_blocks(struct file_lock *blocker)
> >  	/*
> >  	 * Avoid taking global lock if list is empty. This is safe since new
> >  	 * blocked requests are only added to the list under the flc_lock, and
> > -	 * the flc_lock is always held here. Note that removal from the fl_block
> > +	 * the flc_lock is always held here. Note that removal from the fl_blocked
> >  	 * list does not require the flc_lock, so we must recheck list_empty()
> >  	 * after acquiring the blocked_lock_lock.
> >  	 */
> > -	if (list_empty(&blocker->fl_block))
> > +	if (list_empty(&blocker->fl_blocked))
> >  		return;
> >  
> >  	spin_lock(&blocked_lock_lock);
> > -	while (!list_empty(&blocker->fl_block)) {
> > +	while (!list_empty(&blocker->fl_blocked)) {
> >  		struct file_lock *waiter;
> >  
> > -		waiter = list_first_entry(&blocker->fl_block,
> > +		waiter = list_first_entry(&blocker->fl_blocked,
> >  				struct file_lock, fl_block);
> >  		__locks_delete_block(waiter);
> >  		if (waiter->fl_lmops && waiter->fl_lmops->lm_notify)
> > @@ -887,7 +889,7 @@ static struct file_lock *what_owner_is_waiting_for(struct file_lock *block_fl)
> >  
> >  	hash_for_each_possible(blocked_hash, fl, fl_link, posix_owner_key(block_fl)) {
> >  		if (posix_same_owner(fl, block_fl))
> > -			return fl->fl_next;
> > +			return fl->fl_blocker;
> >  	}
> >  	return NULL;
> >  }
> > @@ -1245,7 +1247,7 @@ static int posix_lock_inode_wait(struct inode *inode, struct file_lock *fl)
> >  		error = posix_lock_inode(inode, fl, NULL);
> >  		if (error != FILE_LOCK_DEFERRED)
> >  			break;
> > -		error = wait_event_interruptible(fl->fl_wait, !fl->fl_next);
> > +		error = wait_event_interruptible(fl->fl_wait, !fl->fl_blocker);
> >  		if (!error)
> >  			continue;
> >  
> > @@ -1332,7 +1334,7 @@ int locks_mandatory_area(struct inode *inode, struct file *filp, loff_t start,
> >  		error = posix_lock_inode(inode, &fl, NULL);
> >  		if (error != FILE_LOCK_DEFERRED)
> >  			break;
> > -		error = wait_event_interruptible(fl.fl_wait, !fl.fl_next);
> > +		error = wait_event_interruptible(fl.fl_wait, !fl.fl_blocker);
> >  		if (!error) {
> >  			/*
> >  			 * If we've been sleeping someone might have
> > @@ -1526,7 +1528,7 @@ int __break_lease(struct inode *inode, unsigned int mode, unsigned int type)
> >  
> >  	locks_dispose_list(&dispose);
> >  	error = wait_event_interruptible_timeout(new_fl->fl_wait,
> > -						!new_fl->fl_next, break_time);
> > +						!new_fl->fl_blocker, break_time);
> >  
> >  	percpu_down_read_preempt_disable(&file_rwsem);
> >  	spin_lock(&ctx->flc_lock);
> > @@ -1940,7 +1942,7 @@ static int flock_lock_inode_wait(struct inode *inode, struct file_lock *fl)
> >  		error = flock_lock_inode(inode, fl);
> >  		if (error != FILE_LOCK_DEFERRED)
> >  			break;
> > -		error = wait_event_interruptible(fl->fl_wait, !fl->fl_next);
> > +		error = wait_event_interruptible(fl->fl_wait, !fl->fl_blocker);
> >  		if (!error)
> >  			continue;
> >  
> > @@ -2212,7 +2214,7 @@ static int do_lock_file_wait(struct file *filp, unsigned int cmd,
> >  		error = vfs_lock_file(filp, cmd, fl, NULL);
> >  		if (error != FILE_LOCK_DEFERRED)
> >  			break;
> > -		error = wait_event_interruptible(fl->fl_wait, !fl->fl_next);
> > +		error = wait_event_interruptible(fl->fl_wait, !fl->fl_blocker);
> >  		if (!error)
> >  			continue;
> >  
> > @@ -2583,7 +2585,7 @@ posix_unblock_lock(struct file_lock *waiter)
> >  	int status = 0;
> >  
> >  	spin_lock(&blocked_lock_lock);
> > -	if (waiter->fl_next)
> > +	if (waiter->fl_blocker)
> >  		__locks_delete_block(waiter);
> >  	else
> >  		status = -ENOENT;
> > @@ -2711,7 +2713,7 @@ static int locks_show(struct seq_file *f, void *v)
> >  
> >  	lock_get_status(f, fl, iter->li_pos, "");
> >  
> > -	list_for_each_entry(bfl, &fl->fl_block, fl_block)
> > +	list_for_each_entry(bfl, &fl->fl_blocked, fl_block)
> >  		lock_get_status(f, bfl, iter->li_pos, " ->");
> >  
> >  	return 0;
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 805bf22898cf..48f7f36c3772 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -1002,10 +1002,11 @@ bool opens_in_grace(struct net *);
> >   * Obviously, the last two criteria only matter for POSIX locks.
> >   */
> >  struct file_lock {
> > -	struct file_lock *fl_next;	/* singly linked list for this inode  */
> > +	struct file_lock *fl_blocker;	/* The lock, that is blocking us */
> >  	struct list_head fl_list;	/* link into file_lock_context */
> >  	struct hlist_node fl_link;	/* node in global lists */
> > -	struct list_head fl_block;	/* circular list of blocked processes */
> > +	struct list_head fl_blocked;	/* list of requests with ->fl_blocker pointing here */
> > +	struct list_head fl_block;	/* node in ->fl_blocker->fl_blocked */
> >  	fl_owner_t fl_owner;
> >  	unsigned int fl_flags;
> >  	unsigned char fl_type;
> > diff --git a/include/trace/events/filelock.h b/include/trace/events/filelock.h
> > index d1faf3597b9d..0e50f985a390 100644
> > --- a/include/trace/events/filelock.h
> > +++ b/include/trace/events/filelock.h
> > @@ -68,7 +68,7 @@ DECLARE_EVENT_CLASS(filelock_lock,
> >  		__field(struct file_lock *, fl)
> >  		__field(unsigned long, i_ino)
> >  		__field(dev_t, s_dev)
> > -		__field(struct file_lock *, fl_next)
> > +		__field(struct file_lock *, fl_blocker)
> >  		__field(fl_owner_t, fl_owner)
> >  		__field(unsigned int, fl_pid)
> >  		__field(unsigned int, fl_flags)
> > @@ -82,7 +82,7 @@ DECLARE_EVENT_CLASS(filelock_lock,
> >  		__entry->fl = fl ? fl : NULL;
> >  		__entry->s_dev = inode->i_sb->s_dev;
> >  		__entry->i_ino = inode->i_ino;
> > -		__entry->fl_next = fl ? fl->fl_next : NULL;
> > +		__entry->fl_blocker = fl ? fl->fl_blocker : NULL;
> >  		__entry->fl_owner = fl ? fl->fl_owner : NULL;
> >  		__entry->fl_pid = fl ? fl->fl_pid : 0;
> >  		__entry->fl_flags = fl ? fl->fl_flags : 0;
> > @@ -92,9 +92,9 @@ DECLARE_EVENT_CLASS(filelock_lock,
> >  		__entry->ret = ret;
> >  	),
> >  
> > -	TP_printk("fl=0x%p dev=0x%x:0x%x ino=0x%lx fl_next=0x%p fl_owner=0x%p fl_pid=%u fl_flags=%s fl_type=%s fl_start=%lld fl_end=%lld ret=%d",
> > +	TP_printk("fl=0x%p dev=0x%x:0x%x ino=0x%lx fl_blocker=0x%p fl_owner=0x%p fl_pid=%u fl_flags=%s fl_type=%s fl_start=%lld fl_end=%lld ret=%d",
> >  		__entry->fl, MAJOR(__entry->s_dev), MINOR(__entry->s_dev),
> > -		__entry->i_ino, __entry->fl_next, __entry->fl_owner,
> > +		__entry->i_ino, __entry->fl_blocker, __entry->fl_owner,
> >  		__entry->fl_pid, show_fl_flags(__entry->fl_flags),
> >  		show_fl_type(__entry->fl_type),
> >  		__entry->fl_start, __entry->fl_end, __entry->ret)
> > @@ -122,7 +122,7 @@ DECLARE_EVENT_CLASS(filelock_lease,
> >  		__field(struct file_lock *, fl)
> >  		__field(unsigned long, i_ino)
> >  		__field(dev_t, s_dev)
> > -		__field(struct file_lock *, fl_next)
> > +		__field(struct file_lock *, fl_blocker)
> >  		__field(fl_owner_t, fl_owner)
> >  		__field(unsigned int, fl_flags)
> >  		__field(unsigned char, fl_type)
> > @@ -134,7 +134,7 @@ DECLARE_EVENT_CLASS(filelock_lease,
> >  		__entry->fl = fl ? fl : NULL;
> >  		__entry->s_dev = inode->i_sb->s_dev;
> >  		__entry->i_ino = inode->i_ino;
> > -		__entry->fl_next = fl ? fl->fl_next : NULL;
> > +		__entry->fl_blocker = fl ? fl->fl_blocker : NULL;
> >  		__entry->fl_owner = fl ? fl->fl_owner : NULL;
> >  		__entry->fl_flags = fl ? fl->fl_flags : 0;
> >  		__entry->fl_type = fl ? fl->fl_type : 0;
> > @@ -142,9 +142,9 @@ DECLARE_EVENT_CLASS(filelock_lease,
> >  		__entry->fl_downgrade_time = fl ? fl->fl_downgrade_time : 0;
> >  	),
> >  
> > -	TP_printk("fl=0x%p dev=0x%x:0x%x ino=0x%lx fl_next=0x%p fl_owner=0x%p fl_flags=%s fl_type=%s fl_break_time=%lu fl_downgrade_time=%lu",
> > +	TP_printk("fl=0x%p dev=0x%x:0x%x ino=0x%lx fl_blocker=0x%p fl_owner=0x%p fl_flags=%s fl_type=%s fl_break_time=%lu fl_downgrade_time=%lu",
> >  		__entry->fl, MAJOR(__entry->s_dev), MINOR(__entry->s_dev),
> > -		__entry->i_ino, __entry->fl_next, __entry->fl_owner,
> > +		__entry->i_ino, __entry->fl_blocker, __entry->fl_owner,
> >  		show_fl_flags(__entry->fl_flags),
> >  		show_fl_type(__entry->fl_type),
> >  		__entry->fl_break_time, __entry->fl_downgrade_time)
> > 
> > 
> 
> -- 
> Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/4] locks: avoid thundering-herd wake-ups
  2018-08-08  1:51 [PATCH 0/4] locks: avoid thundering-herd wake-ups NeilBrown
                   ` (4 preceding siblings ...)
  2018-08-08 16:47 ` [PATCH 0/4] locks: avoid thundering-herd wake-ups Jeff Layton
@ 2018-08-08 19:54 ` J. Bruce Fields
  2018-08-08 20:09   ` J. Bruce Fields
  5 siblings, 1 reply; 26+ messages in thread
From: J. Bruce Fields @ 2018-08-08 19:54 UTC (permalink / raw)
  To: NeilBrown
  Cc: Jeff Layton, Alexander Viro, linux-fsdevel, linux-kernel, Martin Wilck

On Wed, Aug 08, 2018 at 11:51:07AM +1000, NeilBrown wrote:
> If you have a many-core machine, and have many threads all wanting to
> briefly lock a given file (udev is known to do this), you can get quite
> poor performance.
> 
> When one thread releases a lock, it wakes up all other threads that
> are waiting (classic thundering-herd) - one will get the lock and the
> others go to sleep.
> When you have few cores, this is not very noticeable: by the time the
> 4th or 5th thread gets enough CPU time to try to claim the lock, the
> earlier threads have claimed it, done what was needed, and released.
> With 50+ cores, the contention can easily be measured.
> 
> This patchset creates a tree of pending lock requests in which siblings
> don't conflict and each lock request does conflict with its parent.
> When a lock is released, only requests which don't conflict with each
> other are woken.

Are you sure you aren't depending on the (incorrect) assumption that "X
blocks Y" is a transitive relation?

OK I should be able to answer that question myself, my patience for
code-reading is at a real low this afternoon....

--b.

> 
> Testing shows that lock-acquisitions-per-second is now fairly stable even
> as the number of contending processes goes to 1000.  Without this patch,
> locks-per-second drops off steeply after a few 10s of processes.
> 
> There is a small cost to this extra complexity.
> At 20 processes running a particular test on 72 cores, the lock
> acquisitions per second drops from 1.8 million to 1.4 million with
> this patch.  For 100 processes, this patch still provides 1.4 million
> while without this patch there are about 700,000.
> 
> NeilBrown
> 
> ---
> 
> NeilBrown (4):
>       fs/locks: rename some lists and pointers.
>       fs/locks: allow a lock request to block other requests.
>       fs/locks: change all *_conflict() functions to return bool.
>       fs/locks: create a tree of dependent requests.
> 
> 
>  fs/cifs/file.c                  |    2 -
>  fs/locks.c                      |  142 +++++++++++++++++++++++++--------------
>  include/linux/fs.h              |    5 +
>  include/trace/events/filelock.h |   16 ++--
>  4 files changed, 103 insertions(+), 62 deletions(-)
> 
> --
> Signature

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/4] locks: avoid thundering-herd wake-ups
  2018-08-08 19:54 ` J. Bruce Fields
@ 2018-08-08 20:09   ` J. Bruce Fields
  2018-08-08 21:15     ` Frank Filz
  2018-08-08 21:28     ` J. Bruce Fields
  0 siblings, 2 replies; 26+ messages in thread
From: J. Bruce Fields @ 2018-08-08 20:09 UTC (permalink / raw)
  To: NeilBrown
  Cc: Jeff Layton, Alexander Viro, linux-fsdevel, linux-kernel, Martin Wilck

On Wed, Aug 08, 2018 at 03:54:45PM -0400, J. Bruce Fields wrote:
> On Wed, Aug 08, 2018 at 11:51:07AM +1000, NeilBrown wrote:
> > If you have a many-core machine, and have many threads all wanting to
> > briefly lock a given file (udev is known to do this), you can get quite
> > poor performance.
> > 
> > When one thread releases a lock, it wakes up all other threads that
> > are waiting (classic thundering-herd) - one will get the lock and the
> > others go to sleep.
> > When you have few cores, this is not very noticeable: by the time the
> > 4th or 5th thread gets enough CPU time to try to claim the lock, the
> > earlier threads have claimed it, done what was needed, and released.
> > With 50+ cores, the contention can easily be measured.
> > 
> > This patchset creates a tree of pending lock requests in which siblings
> > don't conflict and each lock request does conflict with its parent.
> > When a lock is released, only requests which don't conflict with each
> > other are woken.
> 
> Are you sure you aren't depending on the (incorrect) assumption that "X
> blocks Y" is a transitive relation?
> 
> OK I should be able to answer that question myself, my patience for
> code-reading is at a real low this afternoon....

In other words, is there the possibility of a tree of, say, exclusive
locks with (offset, length) like:

	(0, 2) waiting on (1, 2) waiting on (2, 2) waiting on (0, 4)

and when waking (0, 4) you could wake up (2, 2) but not (0, 2), leaving
a process waiting without there being an actual conflict.

--b.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* RE: [PATCH 0/4] locks: avoid thundering-herd wake-ups
  2018-08-08 20:09   ` J. Bruce Fields
@ 2018-08-08 21:15     ` Frank Filz
  2018-08-08 22:34       ` NeilBrown
  2018-08-08 21:28     ` J. Bruce Fields
  1 sibling, 1 reply; 26+ messages in thread
From: Frank Filz @ 2018-08-08 21:15 UTC (permalink / raw)
  To: J. Bruce Fields, NeilBrown
  Cc: Jeff Layton, Alexander Viro, linux-fsdevel, linux-kernel, Martin Wilck

> On Wed, Aug 08, 2018 at 03:54:45PM -0400, J. Bruce Fields wrote:
> > On Wed, Aug 08, 2018 at 11:51:07AM +1000, NeilBrown wrote:
> > > If you have a many-core machine, and have many threads all wanting
> > > to briefly lock a given file (udev is known to do this), you can get
> > > quite poor performance.
> > >
> > > When one thread releases a lock, it wakes up all other threads that
> > > are waiting (classic thundering-herd) - one will get the lock and
> > > the others go to sleep.
> > > When you have few cores, this is not very noticeable: by the time
> > > the 4th or 5th thread gets enough CPU time to try to claim the lock,
> > > the earlier threads have claimed it, done what was needed, and
> > > released.
> > > With 50+ cores, the contention can easily be measured.
> > >
> > > This patchset creates a tree of pending lock requests in which
> > > siblings don't conflict and each lock request does conflict with its
> > > parent.
> > > When a lock is released, only requests which don't conflict with
> > > each other are woken.
> >
> > Are you sure you aren't depending on the (incorrect) assumption that
> > "X blocks Y" is a transitive relation?
> >
> > OK I should be able to answer that question myself, my patience for
> > code-reading is at a real low this afternoon....
> 
> In other words, is there the possibility of a tree of, say, exclusive
> locks with (offset, length) like:
> 
> 	(0, 2) waiting on (1, 2) waiting on (2, 2) waiting on (0, 4)
> 
> and when waking (0, 4) you could wake up (2, 2) but not (0, 2), leaving a
> process waiting without there being an actual conflict.

That implies that the order the locks were received in was:

(0,4)
(2,2)
(1,2)
(0,2)

But couldn't (0,2) have been made only dependent on (0,4)? Of course then
(1,2) is dependent on BOTH (2,2) and (0,2). Does this tree logic handle that
case?

On the other hand, there might be a fairness reason to make (0,2) wait for
(1,2) even though it could have been granted concurrently with (2,2) since
this dependency tree also preserves some of the order of lock requests.

Frank


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/4] locks: avoid thundering-herd wake-ups
  2018-08-08 20:09   ` J. Bruce Fields
  2018-08-08 21:15     ` Frank Filz
@ 2018-08-08 21:28     ` J. Bruce Fields
  2018-08-08 22:39       ` NeilBrown
  2018-08-08 22:50       ` Jeff Layton
  1 sibling, 2 replies; 26+ messages in thread
From: J. Bruce Fields @ 2018-08-08 21:28 UTC (permalink / raw)
  To: NeilBrown
  Cc: Jeff Layton, Alexander Viro, linux-fsdevel, linux-kernel, Martin Wilck

On Wed, Aug 08, 2018 at 04:09:12PM -0400, J. Bruce Fields wrote:
> On Wed, Aug 08, 2018 at 03:54:45PM -0400, J. Bruce Fields wrote:
> > On Wed, Aug 08, 2018 at 11:51:07AM +1000, NeilBrown wrote:
> > > If you have a many-core machine, and have many threads all wanting to
> > > briefly lock a given file (udev is known to do this), you can get quite
> > > poor performance.
> > > 
> > > When one thread releases a lock, it wakes up all other threads that
> > > are waiting (classic thundering-herd) - one will get the lock and the
> > > others go to sleep.
> > > When you have few cores, this is not very noticeable: by the time the
> > > 4th or 5th thread gets enough CPU time to try to claim the lock, the
> > > earlier threads have claimed it, done what was needed, and released.
> > > With 50+ cores, the contention can easily be measured.
> > > 
> > > This patchset creates a tree of pending lock requests in which siblings
> > > don't conflict and each lock request does conflict with its parent.
> > > When a lock is released, only requests which don't conflict with each
> > > other are woken.
> > 
> > Are you sure you aren't depending on the (incorrect) assumption that "X
> > blocks Y" is a transitive relation?
> > 
> > OK I should be able to answer that question myself, my patience for
> > code-reading is at a real low this afternoon....
> 
> In other words, is there the possibility of a tree of, say, exclusive
> locks with (offset, length) like:
> 
> 	(0, 2) waiting on (1, 2) waiting on (2, 2) waiting on (0, 4)
> 
> and when waking (0, 4) you could wake up (2, 2) but not (0, 2), leaving
> a process waiting without there being an actual conflict.

After batting it back and forth with Jeff on IRC....  So do I understand
right that when we wake a waiter, we leave its own tree of waiters
intact, and when it wakes, if it finds a conflict, it just adds its lock
(with tree of waiters) into the tree of the conflicting lock?

If so then yes I think that depends on the transitivity
assumption--you're assuming that finding a conflict between the root of
the tree and a lock proves that all the other members of the tree also
conflict.

So maybe this example works.  (All locks are exclusive and written
(offset, length), X->Y means X is waiting on Y.)

	process acquires (0,3)
	2nd process requests (1,2), is put to sleep.
	3rd process requests (0,2), is put to sleep.

	The tree of waiters now looks like (0,2)->(1,2)->(0,3)

	(0,3) is unlocked.
	A 4th process races in and locks (2,2).
	The 2nd process wakes up, sees this new conflict, and waits on
	(2,2).  Now the tree looks like (0,2)->(1,2)->(2,2), and (0,2)
	is waiting for no reason.

?

--b.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* RE: [PATCH 0/4] locks: avoid thundering-herd wake-ups
  2018-08-08 21:15     ` Frank Filz
@ 2018-08-08 22:34       ` NeilBrown
  0 siblings, 0 replies; 26+ messages in thread
From: NeilBrown @ 2018-08-08 22:34 UTC (permalink / raw)
  To: Frank Filz, J. Bruce Fields
  Cc: Jeff Layton, Alexander Viro, linux-fsdevel, linux-kernel, Martin Wilck

[-- Attachment #1: Type: text/plain, Size: 2843 bytes --]

On Wed, Aug 08 2018, Frank Filz wrote:

>> On Wed, Aug 08, 2018 at 03:54:45PM -0400, J. Bruce Fields wrote:
>> > On Wed, Aug 08, 2018 at 11:51:07AM +1000, NeilBrown wrote:
>> > > If you have a many-core machine, and have many threads all wanting
>> > > to briefly lock a given file (udev is known to do this), you can get
>> > > quite poor performance.
>> > >
>> > > When one thread releases a lock, it wakes up all other threads that
>> > > are waiting (classic thundering-herd) - one will get the lock and
>> > > the others go to sleep.
>> > > When you have few cores, this is not very noticeable: by the time
>> > > the 4th or 5th thread gets enough CPU time to try to claim the lock,
>> > > the earlier threads have claimed it, done what was needed, and
>> > > released.
>> > > With 50+ cores, the contention can easily be measured.
>> > >
>> > > This patchset creates a tree of pending lock requests in which
>> > > siblings don't conflict and each lock request does conflict with its
>> > > parent.
>> > > When a lock is released, only requests which don't conflict with
>> > > each other are woken.
>> >
>> > Are you sure you aren't depending on the (incorrect) assumption that
>> > "X blocks Y" is a transitive relation?
>> >
>> > OK I should be able to answer that question myself, my patience for
>> > code-reading is at a real low this afternoon....
>> 
>> In other words, is there the possibility of a tree of, say, exclusive
>> locks with (offset, length) like:
>> 
>> 	(0, 2) waiting on (1, 2) waiting on (2, 2) waiting on (0, 4)
>> 
>> and when waking (0, 4) you could wake up (2, 2) but not (0, 2), leaving a
>> process waiting without there being an actual conflict.
>
> That implies that the order the locks were received in was:
>
> (0,4)
> (2,2)
> (1,2)
> (0,2)
>
> But couldn't (0,2) have been made only dependent on (0,4)?

Correct.  (0,2) would be a child of (0,4), but a sibling of (2,2).

>    Of course then
> (1,2) is dependent on BOTH (2,2) and (0,2). Does this tree logic handle that
> case?

No, there is no support for double dependencies.  It is still possible
for a lock request to be woken up which still cannot be satisfied.
When one blocking lock is unlocked, a pending request might then be
queued against a different blocking lock.

>
> On the other hand, there might be a fairness reason to make (0,2) wait for
> (1,2) even though it could have been granted concurrently with (2,2) since
> this dependency tree also preserves some of the order of lock requests.

The locking API doesn't promise fairness, and I don't think we should
try too hard to achieve it.  Certainly we shouldn't actively fight it
(so no LIFO queuing) but if we try harder than that I suspect it would
just be needless complexity.

Thanks,
NeilBrown


>
> Frank

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/4] locks: avoid thundering-herd wake-ups
  2018-08-08 21:28     ` J. Bruce Fields
@ 2018-08-08 22:39       ` NeilBrown
  2018-08-08 22:50       ` Jeff Layton
  1 sibling, 0 replies; 26+ messages in thread
From: NeilBrown @ 2018-08-08 22:39 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Jeff Layton, Alexander Viro, linux-fsdevel, linux-kernel, Martin Wilck

[-- Attachment #1: Type: text/plain, Size: 3427 bytes --]

On Wed, Aug 08 2018, J. Bruce Fields wrote:

> On Wed, Aug 08, 2018 at 04:09:12PM -0400, J. Bruce Fields wrote:
>> On Wed, Aug 08, 2018 at 03:54:45PM -0400, J. Bruce Fields wrote:
>> > On Wed, Aug 08, 2018 at 11:51:07AM +1000, NeilBrown wrote:
>> > > If you have a many-core machine, and have many threads all wanting to
>> > > briefly lock a given file (udev is known to do this), you can get quite
>> > > poor performance.
>> > > 
>> > > When one thread releases a lock, it wakes up all other threads that
>> > > are waiting (classic thundering-herd) - one will get the lock and the
>> > > others go to sleep.
>> > > When you have few cores, this is not very noticeable: by the time the
>> > > 4th or 5th thread gets enough CPU time to try to claim the lock, the
>> > > earlier threads have claimed it, done what was needed, and released.
>> > > With 50+ cores, the contention can easily be measured.
>> > > 
>> > > This patchset creates a tree of pending lock requests in which siblings
>> > > don't conflict and each lock request does conflict with its parent.
>> > > When a lock is released, only requests which don't conflict with each
>> > > other are woken.
>> > 
>> > Are you sure you aren't depending on the (incorrect) assumption that "X
>> > blocks Y" is a transitive relation?
>> > 
>> > OK I should be able to answer that question myself, my patience for
>> > code-reading is at a real low this afternoon....
>> 
>> In other words, is there the possibility of a tree of, say, exclusive
>> locks with (offset, length) like:
>> 
>> 	(0, 2) waiting on (1, 2) waiting on (2, 2) waiting on (0, 4)
>> 
>> and when waking (0, 4) you could wake up (2, 2) but not (0, 2), leaving
>> a process waiting without there being an actual conflict.
>
> After batting it back and forth with Jeff on IRC....  So do I understand
> right that when we wake a waiter, we leave its own tree of waiters
> intact, and when it wakes, if it finds a conflict, it just adds its lock
> (with tree of waiters) into the tree of the conflicting lock?
>
> If so then yes I think that depends on the transitivity
> assumption--you're assuming that finding a conflict between the root of
> the tree and a lock proves that all the other members of the tree also
> conflict.

Ahhh... I see what you are getting at.  When lock requests are added
individually, they will always be blocked by all ancestors in the tree.
But when they are added as a group, that might not be the case.
So we might need to re-add every request individually.
In the (common) case where a lock request is blocked across its whole
range, we can just attach the whole tree beneath the blocker.  In other
cases we need a finer grained approach.

I'll have a look and see how to make the code work for this case.

Thanks for the thorough review!

NeilBrown

>
> So maybe this example works.  (All locks are exclusive and written
> (offset, length), X->Y means X is waiting on Y.)
>
> 	process acquires (0,3)
> 	2nd process requests (1,2), is put to sleep.
> 	3rd process requests (0,2), is put to sleep.
>
> 	The tree of waiters now looks like (0,2)->(1,2)->(0,3)
>
> 	(0,3) is unlocked.
> 	A 4th process races in and locks (2,2).
> 	The 2nd process wakes up, sees this new conflict, and waits on
> 	(2,2).  Now the tree looks like (0,2)->(1,2)->(2,2), and (0,2)
> 	is waiting for no reason.
>
> ?
>
> --b.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/4] locks: avoid thundering-herd wake-ups
  2018-08-08 21:28     ` J. Bruce Fields
  2018-08-08 22:39       ` NeilBrown
@ 2018-08-08 22:50       ` Jeff Layton
  2018-08-08 23:34         ` Frank Filz
  2018-08-09 13:00         ` J. Bruce Fields
  1 sibling, 2 replies; 26+ messages in thread
From: Jeff Layton @ 2018-08-08 22:50 UTC (permalink / raw)
  To: J. Bruce Fields, NeilBrown
  Cc: Alexander Viro, linux-fsdevel, linux-kernel, Martin Wilck

On Wed, 2018-08-08 at 17:28 -0400, J. Bruce Fields wrote:
> On Wed, Aug 08, 2018 at 04:09:12PM -0400, J. Bruce Fields wrote:
> > On Wed, Aug 08, 2018 at 03:54:45PM -0400, J. Bruce Fields wrote:
> > > On Wed, Aug 08, 2018 at 11:51:07AM +1000, NeilBrown wrote:
> > > > If you have a many-core machine, and have many threads all wanting to
> > > > briefly lock a given file (udev is known to do this), you can get quite
> > > > poor performance.
> > > > 
> > > > When one thread releases a lock, it wakes up all other threads that
> > > > are waiting (classic thundering-herd) - one will get the lock and the
> > > > others go to sleep.
> > > > When you have few cores, this is not very noticeable: by the time the
> > > > 4th or 5th thread gets enough CPU time to try to claim the lock, the
> > > > earlier threads have claimed it, done what was needed, and released.
> > > > With 50+ cores, the contention can easily be measured.
> > > > 
> > > > This patchset creates a tree of pending lock requests in which siblings
> > > > don't conflict and each lock request does conflict with its parent.
> > > > When a lock is released, only requests which don't conflict with each
> > > > other are woken.
> > > 
> > > Are you sure you aren't depending on the (incorrect) assumption that "X
> > > blocks Y" is a transitive relation?
> > > 
> > > OK I should be able to answer that question myself, my patience for
> > > code-reading is at a real low this afternoon....
> > 
> > In other words, is there the possibility of a tree of, say, exclusive
> > locks with (offset, length) like:
> > 
> > 	(0, 2) waiting on (1, 2) waiting on (2, 2) waiting on (0, 4)
> > 
> > and when waking (0, 4) you could wake up (2, 2) but not (0, 2), leaving
> > a process waiting without there being an actual conflict.
> 
> After batting it back and forth with Jeff on IRC....  So do I understand
> right that when we wake a waiter, we leave its own tree of waiters
> intact, and when it wakes, if it finds a conflict, it just adds its lock
> (with tree of waiters) into the tree of the conflicting lock?
> 
> If so then yes I think that depends on the transitivity
> assumption--you're assuming that finding a conflict between the root of
> the tree and a lock proves that all the other members of the tree also
> conflict.
> 
> So maybe this example works.  (All locks are exclusive and written
> (offset, length), X->Y means X is waiting on Y.)
> 
> 	process acquires (0,3)
> 	2nd process requests (1,2), is put to sleep.
> 	3rd process requests (0,2), is put to sleep.
> 
> 	The tree of waiters now looks like (0,2)->(1,2)->(0,3)
> 
> 	(0,3) is unlocked.
> 	A 4th process races in and locks (2,2).
> 	The 2nd process wakes up, sees this new conflict, and waits on
> 	(2,2).  Now the tree looks like (0,2)->(1,2)->(2,2), and (0,2)
> 	is waiting for no reason.
> 

That seems like a legit problem.

One possible fix might be to have the waiter on (1,2) walk down the
entire subtree and wake up any waiter that is waiting on a lock that
doesn't conflict with the lock on which it's waiting.

So, before the task waiting on 1,2 goes back to sleep to wait on 2,2, it
could walk down its entire fl_blocked subtree and wake up anything
waiting on a lock that doesn't conflict with (2,2).

That's potentially an expensive operation, but:

a) the task is going back to sleep anyway, so letting it do a little
extra work before that should be no big deal

b) it's probably still cheaper than waking up the whole herd

-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 26+ messages in thread

* RE: [PATCH 0/4] locks: avoid thundering-herd wake-ups
  2018-08-08 22:50       ` Jeff Layton
@ 2018-08-08 23:34         ` Frank Filz
  2018-08-09  2:52           ` NeilBrown
  2018-08-09 13:00         ` J. Bruce Fields
  1 sibling, 1 reply; 26+ messages in thread
From: Frank Filz @ 2018-08-08 23:34 UTC (permalink / raw)
  To: Jeff Layton, J. Bruce Fields, NeilBrown
  Cc: Alexander Viro, linux-fsdevel, linux-kernel, Martin Wilck

> On Wed, 2018-08-08 at 17:28 -0400, J. Bruce Fields wrote:
> > On Wed, Aug 08, 2018 at 04:09:12PM -0400, J. Bruce Fields wrote:
> > > On Wed, Aug 08, 2018 at 03:54:45PM -0400, J. Bruce Fields wrote:
> > > > On Wed, Aug 08, 2018 at 11:51:07AM +1000, NeilBrown wrote:
> > > > > If you have a many-core machine, and have many threads all
> > > > > wanting to briefly lock a given file (udev is known to do this),
> > > > > you can get quite poor performance.
> > > > >
> > > > > When one thread releases a lock, it wakes up all other threads
> > > > > that are waiting (classic thundering-herd) - one will get the
> > > > > lock and the others go to sleep.
> > > > > When you have few cores, this is not very noticeable: by the
> > > > > time the 4th or 5th thread gets enough CPU time to try to claim
> > > > > the lock, the earlier threads have claimed it, done what was needed, and
> > > > > released.
> > > > > With 50+ cores, the contention can easily be measured.
> > > > >
> > > > > This patchset creates a tree of pending lock requests in which
> > > > > siblings don't conflict and each lock request does conflict with its parent.
> > > > > When a lock is released, only requests which don't conflict with
> > > > > each other are woken.
> > > >
> > > > Are you sure you aren't depending on the (incorrect) assumption
> > > > that "X blocks Y" is a transitive relation?
> > > >
> > > > OK I should be able to answer that question myself, my patience
> > > > for code-reading is at a real low this afternoon....
> > >
> > > In other words, is there the possibility of a tree of, say,
> > > exclusive locks with (offset, length) like:
> > >
> > > 	(0, 2) waiting on (1, 2) waiting on (2, 2) waiting on (0, 4)
> > >
> > > and when waking (0, 4) you could wake up (2, 2) but not (0, 2),
> > > leaving a process waiting without there being an actual conflict.
> >
> > After batting it back and forth with Jeff on IRC....  So do I
> > understand right that when we wake a waiter, we leave its own tree of
> > waiters intact, and when it wakes, if it finds a conflict, it just adds
> > its lock (with tree of waiters) into the tree of the conflicting lock?
> >
> > If so then yes I think that depends on the transitivity
> > assumption--you're assuming that finding a conflict between the root
> > of the tree and a lock proves that all the other members of the tree
> > also conflict.
> >
> > So maybe this example works.  (All locks are exclusive and written
> > (offset, length), X->Y means X is waiting on Y.)
> >
> > 	process acquires (0,3)
> > 	2nd process requests (1,2), is put to sleep.
> > 	3rd process requests (0,2), is put to sleep.
> >
> > 	The tree of waiters now looks like (0,2)->(1,2)->(0,3)
> >
> > 	(0,3) is unlocked.
> > 	A 4th process races in and locks (2,2).
> > 	The 2nd process wakes up, sees this new conflict, and waits on
> > 	(2,2).  Now the tree looks like (0,2)->(1,2)->(2,2), and (0,2)
> > 	is waiting for no reason.
> >
> 
> That seems like a legit problem.
> 
> One possible fix might be to have the waiter on (1,2) walk down the entire
> subtree and wake up any waiter that is waiting on a lock that doesn't conflict
> with the lock on which it's waiting.
> 
> So, before the task waiting on 1,2 goes back to sleep to wait on 2,2, it could
> walk down its entire fl_blocked subtree and wake up anything waiting on a lock
> that doesn't conflict with (2,2).
> 
> That's potentially an expensive operation, but:
> 
> a) the task is going back to sleep anyway, so letting it do a little extra work
> before that should be no big deal
> 
> b) it's probably still cheaper than waking up the whole herd

Yea, I think so.

Now here's another question... How does this new logic play with Open File Description Locks? Should still be ok since there's a thread waiting on each of those.

Frank


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/4] locks: avoid thundering-herd wake-ups
  2018-08-08 18:29   ` J. Bruce Fields
@ 2018-08-09  0:58     ` NeilBrown
  2018-08-20 11:02     ` Martin Wilck
  1 sibling, 0 replies; 26+ messages in thread
From: NeilBrown @ 2018-08-09  0:58 UTC (permalink / raw)
  To: J. Bruce Fields, Jeff Layton
  Cc: Alexander Viro, linux-fsdevel, linux-kernel, Martin Wilck

[-- Attachment #1: Type: text/plain, Size: 3184 bytes --]

On Wed, Aug 08 2018, J. Bruce Fields wrote:

> On Wed, Aug 08, 2018 at 12:47:22PM -0400, Jeff Layton wrote:
>> On Wed, 2018-08-08 at 11:51 +1000, NeilBrown wrote:
>> > If you have a many-core machine, and have many threads all wanting to
>> > briefly lock a given file (udev is known to do this), you can get quite
>> > poor performance.
>> > 
>> > When one thread releases a lock, it wakes up all other threads that
>> > are waiting (classic thundering-herd) - one will get the lock and the
>> > others go to sleep.
>> > When you have few cores, this is not very noticeable: by the time the
>> > 4th or 5th thread gets enough CPU time to try to claim the lock, the
>> > earlier threads have claimed it, done what was needed, and released.
>> > With 50+ cores, the contention can easily be measured.
>> > 
>> > This patchset creates a tree of pending lock requests in which siblings
>> > don't conflict and each lock request does conflict with its parent.
>> > When a lock is released, only requests which don't conflict with each
>> > other are woken.
>> > 
>> > Testing shows that lock-acquisitions-per-second is now fairly stable even
>> > as the number of contending processes goes to 1000.  Without this patch,
>> > locks-per-second drops off steeply after a few tens of processes.
>> > 
>> > There is a small cost to this extra complexity.
>> > At 20 processes running a particular test on 72 cores, the lock
>> > acquisitions per second drops from 1.8 million to 1.4 million with
>> > this patch.  For 100 processes, this patch still provides 1.4 million
>> > while without this patch there are about 700,000.
>> > 
>> > NeilBrown
>> > 
>> > ---
>> > 
>> > NeilBrown (4):
>> >       fs/locks: rename some lists and pointers.
>> >       fs/locks: allow a lock request to block other requests.
>> >       fs/locks: change all *_conflict() functions to return bool.
>> >       fs/locks: create a tree of dependent requests.
>> > 
>> > 
>> >  fs/cifs/file.c                  |    2 -
>> >  fs/locks.c                      |  142 +++++++++++++++++++++++++--------------
>> >  include/linux/fs.h              |    5 +
>> >  include/trace/events/filelock.h |   16 ++--
>> >  4 files changed, 103 insertions(+), 62 deletions(-)
>> > 
>> 
>> Nice work! I looked over this and I think it looks good.
>> 
>> I made an attempt to fix this issue several years ago, but my method
>> sucked as it ended up penalizing the unlocking task too much. This is
>> much cleaner and should scale well overall, I think.
>
> I think I also took a crack at this at one point while I was at UM/CITI
> and never got anything I was happy with.  Looks like good work!
>
> I remember one main obstacle was that I felt like I never had a good
> benchmark....
>
> How did you choose this workload and hardware?  Was it in fact udev
> (booting a large machine?), or was there some other motivation?

I'm hoping Martin will chime in here - he identified the problem and
did most of the testing...

NeilBrown


>
> Not that I'm likely to do it any time soon, but could you share
> sufficient details for someone else to reproduce your results?
>
> --b.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* RE: [PATCH 0/4] locks: avoid thundering-herd wake-ups
  2018-08-08 23:34         ` Frank Filz
@ 2018-08-09  2:52           ` NeilBrown
  0 siblings, 0 replies; 26+ messages in thread
From: NeilBrown @ 2018-08-09  2:52 UTC (permalink / raw)
  To: Frank Filz, Jeff Layton, J. Bruce Fields
  Cc: Alexander Viro, linux-fsdevel, linux-kernel, Martin Wilck

[-- Attachment #1: Type: text/plain, Size: 407 bytes --]

On Wed, Aug 08 2018, Frank Filz wrote:

>
> Now here's another question... How does this new logic play with Open
> File Description Locks? Should still be ok since there's a thread
> waiting on each of those.

At this level Posix locks and OFD locks are almost identical - just a
different owner.
So this enhancement should work equally for posix, flock, ofd and even
leases.

Thanks,
NeilBrown

>
> Frank

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/4] locks: avoid thundering-herd wake-ups
  2018-08-08 22:50       ` Jeff Layton
  2018-08-08 23:34         ` Frank Filz
@ 2018-08-09 13:00         ` J. Bruce Fields
  2018-08-09 14:49           ` Jeff Layton
  2018-08-09 23:56           ` NeilBrown
  1 sibling, 2 replies; 26+ messages in thread
From: J. Bruce Fields @ 2018-08-09 13:00 UTC (permalink / raw)
  To: Jeff Layton
  Cc: NeilBrown, Alexander Viro, linux-fsdevel, linux-kernel, Martin Wilck

On Wed, Aug 08, 2018 at 06:50:06PM -0400, Jeff Layton wrote:
> That seems like a legit problem.
> 
> One possible fix might be to have the waiter on (1,2) walk down the
> entire subtree and wake up any waiter that is waiting on a lock that
> doesn't conflict with the lock on which it's waiting.
> 
> So, before the task waiting on 1,2 goes back to sleep to wait on 2,2, it
> could walk down its entire fl_blocked subtree and wake up anything
> waiting on a lock that doesn't conflict with (2,2).
> 
> That's potentially an expensive operation, but:
> 
> a) the task is going back to sleep anyway, so letting it do a little
> extra work before that should be no big deal

I don't understand why cpu used by a process going to sleep is cheaper
than cpu used in any other situation.

> b) it's probably still cheaper than waking up the whole herd

Yeah, I'd like to understand this.

I feel like Neil's addressing two different performance costs:

	- the cost of waking up all the waiters
	- the cost of walking the list of waiters

Are they equally important?

If we only cared about the former, and only in simple cases, we could
walk the entire list and skip waking up only the locks that conflict
with the first one we wake.  We wouldn't need the tree.

--b.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/4] locks: avoid thundering-herd wake-ups
  2018-08-09 13:00         ` J. Bruce Fields
@ 2018-08-09 14:49           ` Jeff Layton
  2018-08-09 23:56           ` NeilBrown
  1 sibling, 0 replies; 26+ messages in thread
From: Jeff Layton @ 2018-08-09 14:49 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: NeilBrown, Alexander Viro, linux-fsdevel, linux-kernel, Martin Wilck

On Thu, 2018-08-09 at 09:00 -0400, J. Bruce Fields wrote:
> On Wed, Aug 08, 2018 at 06:50:06PM -0400, Jeff Layton wrote:
> > That seems like a legit problem.
> > 
> > One possible fix might be to have the waiter on (1,2) walk down the
> > entire subtree and wake up any waiter that is waiting on a lock that
> > doesn't conflict with the lock on which it's waiting.
> > 
> > So, before the task waiting on 1,2 goes back to sleep to wait on 2,2, it
> > could walk down its entire fl_blocked subtree and wake up anything
> > waiting on a lock that doesn't conflict with (2,2).
> > 
> > That's potentially an expensive operation, but:
> > 
> > a) the task is going back to sleep anyway, so letting it do a little
> > extra work before that should be no big deal
> 
> I don't understand why cpu used by a process going to sleep is cheaper
> than cpu used in any other situation.
> 

It's not any cheaper in that sense. It's just that this task is not
slated to be doing any useful work anyway as it's going to sleep, so we
aren't delaying any "real work" by this task by having it do this 
before returning to userland. It's already scheduled and holds the
appropriate lock.

The alternative would be to do this in the context of a different task,
but that means extra context switching and spinlocking, etc.

> > b) it's probably still cheaper than waking up the whole herd
> 
> Yeah, I'd like to understand this.
> 
> I feel like Neil's addressing two different performance costs:
> 
> 	- the cost of waking up all the waiters
> 	- the cost of walking the list of waiters
> 
> Are they equally important?
> 
> If we only cared about the former, and only in simple cases, we could
> walk the entire list and skip waking up only the locks that conflict
> with the first one we wake.  We wouldn't need the tree.

-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/4] locks: avoid thundering-herd wake-ups
  2018-08-09 13:00         ` J. Bruce Fields
  2018-08-09 14:49           ` Jeff Layton
@ 2018-08-09 23:56           ` NeilBrown
  2018-08-10  1:05             ` J. Bruce Fields
  1 sibling, 1 reply; 26+ messages in thread
From: NeilBrown @ 2018-08-09 23:56 UTC (permalink / raw)
  To: J. Bruce Fields, Jeff Layton
  Cc: Alexander Viro, linux-fsdevel, linux-kernel, Martin Wilck

[-- Attachment #1: Type: text/plain, Size: 2157 bytes --]

On Thu, Aug 09 2018, J. Bruce Fields wrote:

> On Wed, Aug 08, 2018 at 06:50:06PM -0400, Jeff Layton wrote:
>> That seems like a legit problem.
>> 
>> One possible fix might be to have the waiter on (1,2) walk down the
>> entire subtree and wake up any waiter that is waiting on a lock that
>> doesn't conflict with the lock on which it's waiting.
>> 
>> So, before the task waiting on 1,2 goes back to sleep to wait on 2,2, it
>> could walk down its entire fl_blocked subtree and wake up anything
>> waiting on a lock that doesn't conflict with (2,2).
>> 
>> That's potentially an expensive operation, but:
>> 
>> a) the task is going back to sleep anyway, so letting it do a little
>> extra work before that should be no big deal
>
> I don't understand why cpu used by a process going to sleep is cheaper
> than cpu used in any other situation.
>
>> b) it's probably still cheaper than waking up the whole herd
>
> Yeah, I'd like to understand this.
>
> I feel like Neil's addressing two different performance costs:
>
> 	- the cost of waking up all the waiters
> 	- the cost of walking the list of waiters
>
> Are they equally important?

Probably not.  Reducing wakeups is likely the most important.

>
> If we only cared about the former, and only in simple cases, we could
> walk the entire list and skip waking up only the locks that conflict
> with the first one we wake.  We wouldn't need the tree.

Yes, we could do that. It would probably make the code simpler.
It would mean doing the conflicts() tests when performing wake-up rather
than prior to going to sleep.  In general it would mean doing the tests
more often, as the tree effectively records the results of the previous
time conflicts() was run.
You might get a quadratic effect though ... wouldn't you want to
skip anything that conflicts with anything that has been woken?

If the tree-management code turns out to be too complex, it would
certainly be an option.  I think the tree approach should result in less
total CPU usage.

Thanks for the thought - taking this simple approach really hadn't
occurred to me. :-(

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/4] locks: avoid thundering-herd wake-ups
  2018-08-09 23:56           ` NeilBrown
@ 2018-08-10  1:05             ` J. Bruce Fields
  0 siblings, 0 replies; 26+ messages in thread
From: J. Bruce Fields @ 2018-08-10  1:05 UTC (permalink / raw)
  To: NeilBrown
  Cc: Jeff Layton, Alexander Viro, linux-fsdevel, linux-kernel, Martin Wilck

On Fri, Aug 10, 2018 at 09:56:07AM +1000, NeilBrown wrote:
> On Thu, Aug 09 2018, J. Bruce Fields wrote:
> > If we only cared about the former, and only in simple cases, we could
> > walk the entire list and skip waking up only the locks that conflict
> > with the first one we wake.  We wouldn't need the tree.
> 
> Yes, we could do that. It would probably make the code simpler.
> It would mean doing the conflicts() tests when performing wake-up rather
> than prior to going to sleep.  In general it would mean doing the tests
> more often, as the tree effectively records the results of the previous
> time conflicts() was run.
> You might get a quadratic effect though ... wouldn't you want to
> skip anything that conflicts with anything that has been woken?

I was thinking it'd be simplest just to check for conflict with the
first thing that you decide to wake.  That might be all that's necessary
for typical cases.

If you wanted to keep a running list of the locks you've chosen to wake
so far and check each one against that list, I guess you could.

> If the tree-management code turns out to be too complex, it would
> certainly be an option.  I think the tree approach should result in less
> total CPU usage..

Have we ever looked into the CPU usage of deadlock detection?  Might
experiment with turning that off.  But I may be biased by my white-hot
hatred of it.

--b.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/4] locks: avoid thundering-herd wake-ups
  2018-08-08 18:29   ` J. Bruce Fields
  2018-08-09  0:58     ` NeilBrown
@ 2018-08-20 11:02     ` Martin Wilck
  2018-08-20 20:02       ` J. Bruce Fields
  1 sibling, 1 reply; 26+ messages in thread
From: Martin Wilck @ 2018-08-20 11:02 UTC (permalink / raw)
  To: J. Bruce Fields, Jeff Layton
  Cc: NeilBrown, Alexander Viro, linux-fsdevel, linux-kernel

On Wed, 2018-08-08 at 14:29 -0400, J. Bruce Fields wrote:
> On Wed, Aug 08, 2018 at 12:47:22PM -0400, Jeff Layton wrote:
> > On Wed, 2018-08-08 at 11:51 +1000, NeilBrown wrote:
> > > If you have a many-core machine, and have many threads all
> > > wanting to
> > > briefly lock a given file (udev is known to do this), you can get
> > > quite
> > > poor performance.
> > > 
> > > When one thread releases a lock, it wakes up all other threads
> > > that
> > > are waiting (classic thundering-herd) - one will get the lock and
> > > the
> > > others go to sleep.
> > > When you have few cores, this is not very noticeable: by the time
> > > the
> > > 4th or 5th thread gets enough CPU time to try to claim the lock,
> > > the
> > > earlier threads have claimed it, done what was needed, and
> > > released.
> > > With 50+ cores, the contention can easily be measured.
> > > 
> > > This patchset creates a tree of pending lock requests in which
> > > siblings
> > > don't conflict and each lock request does conflict with its
> > > parent.
> > > When a lock is released, only requests which don't conflict with
> > > each
> > > other are woken.
> > > 
> > > Testing shows that lock-acquisitions-per-second is now fairly
> > > stable even
> > > as the number of contending processes goes to 1000.  Without this
> > > patch,
> > > locks-per-second drops off steeply after a few 10s of processes.
> > > 
> > > There is a small cost to this extra complexity.
> > > At 20 processes running a particular test on 72 cores, the lock
> > > acquisitions per second drops from 1.8 million to 1.4 million
> > > with
> > > this patch.  For 100 processes, this patch still provides 1.4
> > > million
> > > while without this patch there are about 700,000.
> > > 
> > > NeilBrown
> > > 
> > > ---
> > > 
> > > NeilBrown (4):
> > >       fs/locks: rename some lists and pointers.
> > >       fs/locks: allow a lock request to block other requests.
> > >       fs/locks: change all *_conflict() functions to return bool.
> > >       fs/locks: create a tree of dependent requests.
> > > 
> > > 
> > >  fs/cifs/file.c                  |    2 -
> > >  fs/locks.c                      |  142
> > > +++++++++++++++++++++++++--------------
> > >  include/linux/fs.h              |    5 +
> > >  include/trace/events/filelock.h |   16 ++--
> > >  4 files changed, 103 insertions(+), 62 deletions(-)
> > > 
> > 
> > Nice work! I looked over this and I think it looks good.
> > 
> > I made an attempt to fix this issue several years ago, but my
> > method
> > sucked as it ended up penalizing the unlocking task too much. This
> > is
> > much cleaner and should scale well overall, I think.
> 
> I think I also took a crack at this at one point while I was at
> UM/CITI
> and never got anything I was happy with.  Looks like good work!
> 
> I remember one main obstacle that I felt like I never had a good
> benchmark....
> 
> How did you choose this workload and hardware?  Was it in fact udev
> (booting a large machine?), or was there some other motivation?

Some details can be found here:

https://github.com/systemd/systemd/pull/9551

https://github.com/systemd/systemd/pull/8667#issuecomment-385520335
and comments further down. 8667 has been superseded by 9551.

The original problem was that the symlink "/dev/disk/by-
partlabel/primary" may be claimed by _many_ devices on big systems
under certain distributions, which use older versions of parted for
partition creation on GPT disk labels. I've seen systems with literally
thousands of contenders for this symlink. 

We found that with current systemd, this can cause a boot-time race
where the wrong device eventually wins as the "best" contender for
the symlink (e.g. a partition on multipath member rather than a
partition on the multipath map itself). I extended the udev test suite,
creating a test that makes this race easily reproducible, at least on
systems with many CPUs (the test host I used most had 72 cores).

I created a udev patch that would use systemd's built-in fcntl-based
locking to avoid this race, but I found that it would massively slow
down the system, and found the contention to be in the spin locks in
posix_lock_common(). (I therefore added more patches to systemd to
make the locking scale better, but that's irrelevant for the kernel-
side discussion).

I further created an artificial test just for the scaling of
fcntl(F_OFD_SETLKW) and flock(), with which I could reproduce the
scaling problem easily, and do some quantitative experiments. My tests
didn't use any byte ranges, only "full" locking of 0-byte files.

> Not that I'm likely to do it any time soon, but could you share
> sufficient details for someone else to reproduce your results?
> 
> --b.

The udev test code can be found in the above links. It adds a new
script "test/sd-script.py" that would be run after "test/sys-script.py" 
using a numeric argument indicating the number of contenders for the
test link to be created, such as "python test/sd-script.py test 1000".
Next step would be running "test/udev-test.pl 152" e.g. under perf (152
is the test ID of the scaling test).

Of course I can also share my other test program if you'd like.

Regards,
Martin



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/4] locks: avoid thundering-herd wake-ups
  2018-08-20 11:02     ` Martin Wilck
@ 2018-08-20 20:02       ` J. Bruce Fields
  2018-08-20 20:06         ` Martin Wilck
  0 siblings, 1 reply; 26+ messages in thread
From: J. Bruce Fields @ 2018-08-20 20:02 UTC (permalink / raw)
  To: Martin Wilck
  Cc: Jeff Layton, NeilBrown, Alexander Viro, linux-fsdevel, linux-kernel

On Mon, Aug 20, 2018 at 01:02:21PM +0200, Martin Wilck wrote:
> On Wed, 2018-08-08 at 14:29 -0400, J. Bruce Fields wrote:
> > On Wed, Aug 08, 2018 at 12:47:22PM -0400, Jeff Layton wrote:
> > > On Wed, 2018-08-08 at 11:51 +1000, NeilBrown wrote:
> > > > If you have a many-core machine, and have many threads all
> > > > wanting to
> > > > briefly lock a given file (udev is known to do this), you can get
> > > > quite
> > > > poor performance.
> > > > 
> > > > When one thread releases a lock, it wakes up all other threads
> > > > that
> > > > are waiting (classic thundering-herd) - one will get the lock and
> > > > the
> > > > others go to sleep.
> > > > When you have few cores, this is not very noticeable: by the time
> > > > the
> > > > 4th or 5th thread gets enough CPU time to try to claim the lock,
> > > > the
> > > > earlier threads have claimed it, done what was needed, and
> > > > released.
> > > > With 50+ cores, the contention can easily be measured.
> > > > 
> > > > This patchset creates a tree of pending lock requests in which
> > > > siblings
> > > > don't conflict and each lock request does conflict with its
> > > > parent.
> > > > When a lock is released, only requests which don't conflict with
> > > > each
> > > > other are woken.
> > > > 
> > > > Testing shows that lock-acquisitions-per-second is now fairly
> > > > stable even
> > > > as the number of contending processes goes to 1000.  Without this
> > > > patch,
> > > > locks-per-second drops off steeply after a few 10s of processes.
> > > > 
> > > > There is a small cost to this extra complexity.
> > > > At 20 processes running a particular test on 72 cores, the lock
> > > > acquisitions per second drops from 1.8 million to 1.4 million
> > > > with
> > > > this patch.  For 100 processes, this patch still provides 1.4
> > > > million
> > > > while without this patch there are about 700,000.
> > > > 
> > > > NeilBrown
> > > > 
> > > > ---
> > > > 
> > > > NeilBrown (4):
> > > >       fs/locks: rename some lists and pointers.
> > > >       fs/locks: allow a lock request to block other requests.
> > > >       fs/locks: change all *_conflict() functions to return bool.
> > > >       fs/locks: create a tree of dependent requests.
> > > > 
> > > > 
> > > >  fs/cifs/file.c                  |    2 -
> > > >  fs/locks.c                      |  142
> > > > +++++++++++++++++++++++++--------------
> > > >  include/linux/fs.h              |    5 +
> > > >  include/trace/events/filelock.h |   16 ++--
> > > >  4 files changed, 103 insertions(+), 62 deletions(-)
> > > > 
> > > 
> > > Nice work! I looked over this and I think it looks good.
> > > 
> > > I made an attempt to fix this issue several years ago, but my
> > > method
> > > sucked as it ended up penalizing the unlocking task too much. This
> > > is
> > > much cleaner and should scale well overall, I think.
> > 
> > I think I also took a crack at this at one point while I was at
> > UM/CITI
> > and never got anything I was happy with.  Looks like good work!
> > 
> > I remember one main obstacle that I felt like I never had a good
> > benchmark....
> > 
> > How did you choose this workload and hardware?  Was it in fact udev
> > (booting a large machine?), or was there some other motivation?
> 
> Some details can be found here:
> 
> https://github.com/systemd/systemd/pull/9551
> 
> https://github.com/systemd/systemd/pull/8667#issuecomment-385520335
> and comments further down. 8667 has been superseded by 9551.
> 
> The original problem was that the symlink "/dev/disk/by-
> partlabel/primary" may be claimed by _many_ devices on big systems
> under certain distributions, which use older versions of parted for
> partition creation on GPT disk labels. I've seen systems with literally
> thousands of contenders for this symlink. 
> 
> We found that with current systemd, this can cause a boot-time race
> where the wrong device eventually wins as the "best" contender for
> the symlink (e.g. a partition on multipath member rather than a
> partition on the multipath map itself). I extended the udev test suite,
> creating a test that makes this race easily reproducible, at least on
> systems with many CPUs (the test host I used most had 72 cores).
> 
> I created a udev patch that would use systemd's built-in fcntl-based
> locking to avoid this race, but I found that it would massively slow
> down the system, and found the contention to be in the spin locks in
> posix_lock_common(). (I therefore added more patches to systemd to
> make the locking scale better, but that's irrelevant for the kernel-
> side discussion).
> 
> I further created an artificial test just for the scaling of
> fcntl(F_OFD_SETLKW) and flock(), with which I could reproduce the
> scaling problem easily, and do some quantitative experiments. My tests
> didn't use any byte ranges, only "full" locking of 0-byte files.

Thanks for the explanation!

I wonder whether there's also anything we could do to keep every waiter
from having to take the same spinlock.

--b.

> 
> > Not that I'm likely to do it any time soon, but could you share
> > sufficient details for someone else to reproduce your results?
> > 
> > --b.
> 
> The udev test code can be found in the above links. It adds a new
> script "test/sd-script.py" that would be run after "test/sys-script.py" 
> using a numeric argument indicating the number of contenders for the
> test link to be created, such as "python test/sd-script.py test 1000".
> Next step would be running "test/udev-test.pl 152" e.g. under perf (152
> is the test ID of the scaling test).
> 
> Of course I can also share my other test program if you'd like.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/4] locks: avoid thundering-herd wake-ups
  2018-08-20 20:02       ` J. Bruce Fields
@ 2018-08-20 20:06         ` Martin Wilck
  0 siblings, 0 replies; 26+ messages in thread
From: Martin Wilck @ 2018-08-20 20:06 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Jeff Layton, NeilBrown, Alexander Viro, linux-fsdevel, linux-kernel

On Mon, 2018-08-20 at 16:02 -0400, J. Bruce Fields wrote:
> On Mon, Aug 20, 2018 at 01:02:21PM +0200, Martin Wilck wrote:
> > On Wed, 2018-08-08 at 14:29 -0400, J. Bruce Fields wrote:
> > > On Wed, Aug 08, 2018 at 12:47:22PM -0400, Jeff Layton wrote:
> > > > On Wed, 2018-08-08 at 11:51 +1000, NeilBrown wrote:
> > > > > If you have a many-core machine, and have many threads all
> > > > > wanting to
> > > > > briefly lock a given file (udev is known to do this), you can
> > > > > get
> > > > > quite
> > > > > poor performance.
> > > > > 
> > > > > When one thread releases a lock, it wakes up all other
> > > > > threads
> > > > > that
> > > > > are waiting (classic thundering-herd) - one will get the lock
> > > > > and
> > > > > the
> > > > > others go to sleep.
> > > > > When you have few cores, this is not very noticeable: by the
> > > > > time
> > > > > the
> > > > > 4th or 5th thread gets enough CPU time to try to claim the
> > > > > lock,
> > > > > the
> > > > > earlier threads have claimed it, done what was needed, and
> > > > > released.
> > > > > With 50+ cores, the contention can easily be measured.
> > > > > 
> > > > > This patchset creates a tree of pending lock requests in which
> > > > > siblings
> > > > > don't conflict and each lock request does conflict with its
> > > > > parent.
> > > > > When a lock is released, only requests which don't conflict
> > > > > with
> > > > > each
> > > > > other are woken.
> > > > > 
> > > > > Testing shows that lock-acquisitions-per-second is now fairly
> > > > > stable even
> > > > > as the number of contending processes goes to 1000.  Without this
> > > > > patch,
> > > > > locks-per-second drops off steeply after a few 10s of
> > > > > processes.
> > > > > 
> > > > > There is a small cost to this extra complexity.
> > > > > At 20 processes running a particular test on 72 cores, the
> > > > > lock
> > > > > acquisitions per second drops from 1.8 million to 1.4 million
> > > > > with
> > > > > this patch.  For 100 processes, this patch still provides 1.4
> > > > > million
> > > > > while without this patch there are about 700,000.
> > > > > 
> > > > > NeilBrown
> > > > > 
> > > > > ---
> > > > > 
> > > > > NeilBrown (4):
> > > > >       fs/locks: rename some lists and pointers.
> > > > >       fs/locks: allow a lock request to block other requests.
> > > > >       fs/locks: change all *_conflict() functions to return
> > > > > bool.
> > > > >       fs/locks: create a tree of dependent requests.
> > > > > 
> > > > > 
> > > > >  fs/cifs/file.c                  |    2 -
> > > > >  fs/locks.c                      |  142
> > > > > +++++++++++++++++++++++++--------------
> > > > >  include/linux/fs.h              |    5 +
> > > > >  include/trace/events/filelock.h |   16 ++--
> > > > >  4 files changed, 103 insertions(+), 62 deletions(-)
> > > > > 
> > > > 
> > > > Nice work! I looked over this and I think it looks good.
> > > > 
> > > > I made an attempt to fix this issue several years ago, but my
> > > > method
> > > > sucked as it ended up penalizing the unlocking task too much.
> > > > This
> > > > is
> > > > much cleaner and should scale well overall, I think.
> > > 
> > > I think I also took a crack at this at one point while I was at
> > > UM/CITI
> > > and never got anything I was happy with.  Looks like good work!
> > > 
> > > I remember one main obstacle that I felt like I never had a good
> > > benchmark....
> > > 
> > > How did you choose this workload and hardware?  Was it in fact
> > > udev
> > > (booting a large machine?), or was there some other motivation?
> > 
> > Some details can be found here:
> > 
> > https://github.com/systemd/systemd/pull/9551
> > 
> > https://github.com/systemd/systemd/pull/8667#issuecomment-385520335
> > and comments further down. 8667 has been superseded by 9551.
> > 
> > The original problem was that the symlink "/dev/disk/by-
> > partlabel/primary" may be claimed by _many_ devices on big systems
> > under certain distributions, which use older versions of parted for
> > partition creation on GPT disk labels. I've seen systems with
> > literally
> > thousands of contenders for this symlink. 
> > 
> > We found that with current systemd, this can cause a boot-time race
> > where the wrong device eventually wins as the "best" contender
> > for
> > the symlink (e.g. a partition on multipath member rather than a
> > partition on the multipath map itself). I extended the udev test
> > suite,
> > creating a test that makes this race easily reproducible, at least
> > on
> > systems with many CPUs (the test host I used most had 72 cores).
> > 
> > I created a udev patch that would use systemd's built-in fcntl-
> > based
> > locking to avoid this race, but I found that it would massively
> > slow
> > down the system, and found the contention to be in the spin locks
> > in
> > posix_lock_common(). (I therefore added more patches to systemd to
> > make the locking scale better, but that's irrelevant for the
> > kernel-
> > side discussion).
> > 
> > I further created an artificial test just for the scaling of
> > fcntl(F_OFD_SETLKW) and flock(), with which I could reproduce the
> > scaling problem easily, and do some quantitative experiments. My
> > tests
> > didn't use any byte ranges, only "full" locking of 0-byte files.
> 
> Thanks for the explanation!
> 
> I wonder whether there's also anything we could do to keep every
> waiter
> from having to take the same spinlock.

That was my line of thinking initially, but Neil's patches that just
avoid the thundering herd wakeup made both tests scale quite nicely, so
at least for my use case, no further optimizations are required.

Martin



^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, back to index

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-08-08  1:51 [PATCH 0/4] locks: avoid thundering-herd wake-ups NeilBrown
2018-08-08  1:51 ` [PATCH 1/4] fs/locks: rename some lists and pointers NeilBrown
2018-08-08 10:47   ` Jeff Layton
2018-08-08 19:07     ` J. Bruce Fields
2018-08-08  1:51 ` [PATCH 3/4] fs/locks: change all *_conflict() functions to return bool NeilBrown
2018-08-08  1:51 ` [PATCH 2/4] fs/locks: allow a lock request to block other requests NeilBrown
2018-08-08  1:51 ` [PATCH 4/4] fs/locks: create a tree of dependent requests NeilBrown
2018-08-08 16:47 ` [PATCH 0/4] locks: avoid thundering-herd wake-ups Jeff Layton
2018-08-08 18:29   ` J. Bruce Fields
2018-08-09  0:58     ` NeilBrown
2018-08-20 11:02     ` Martin Wilck
2018-08-20 20:02       ` J. Bruce Fields
2018-08-20 20:06         ` Martin Wilck
2018-08-08 19:54 ` J. Bruce Fields
2018-08-08 20:09   ` J. Bruce Fields
2018-08-08 21:15     ` Frank Filz
2018-08-08 22:34       ` NeilBrown
2018-08-08 21:28     ` J. Bruce Fields
2018-08-08 22:39       ` NeilBrown
2018-08-08 22:50       ` Jeff Layton
2018-08-08 23:34         ` Frank Filz
2018-08-09  2:52           ` NeilBrown
2018-08-09 13:00         ` J. Bruce Fields
2018-08-09 14:49           ` Jeff Layton
2018-08-09 23:56           ` NeilBrown
2018-08-10  1:05             ` J. Bruce Fields
