linux-btrfs.vger.kernel.org archive mirror
* [PATCH 0/5] Extent buffer locking and documentation
@ 2019-10-17 19:38 David Sterba
  2019-10-17 19:38 ` [PATCH 1/5] btrfs: merge blocking_writers branches in btrfs_tree_read_lock David Sterba
                   ` (4 more replies)
  0 siblings, 5 replies; 20+ messages in thread
From: David Sterba @ 2019-10-17 19:38 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba

I've spent a lot of time staring at the locking code and speculating
about all sorts of weird problems that could happen due to memory
ordering or lost wakeups, or whether the custom locking is safe at all,
also with regard to the recent changes.

Inevitably I found something, but I also wrote documentation. Please read
it, and if you see a need for more clarifications, I'm happy to add them
while I'm in a state where these things seem temporarily obvious and
trivial.

I've tested it in fstests with KCSAN (the new concurrency sanitizer); no
problems were found, but this is not considered sufficient and more tests
will follow.

David Sterba (5):
  btrfs: merge blocking_writers branches in btrfs_tree_read_lock
  btrfs: set blocking_writers directly, no increment or decrement
  btrfs: access eb::blocking_writers according to ACCESS_ONCE policies
  btrfs: serialize blocking_writers updates
  btrfs: document extent buffer locking

 fs/btrfs/locking.c | 184 +++++++++++++++++++++++++++++++++++----------
 1 file changed, 144 insertions(+), 40 deletions(-)

-- 
2.23.0


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 1/5] btrfs: merge blocking_writers branches in btrfs_tree_read_lock
  2019-10-17 19:38 [PATCH 0/5] Extent buffer locking and documentation David Sterba
@ 2019-10-17 19:38 ` David Sterba
  2019-10-17 19:38 ` [PATCH 2/5] btrfs: set blocking_writers directly, no increment or decrement David Sterba
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 20+ messages in thread
From: David Sterba @ 2019-10-17 19:38 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba

There are two ifs that use eb::blocking_writers. As this is a variable
modified inside and outside of locks, we could minimize the number of
accesses to avoid problems with getting different results at different
times.

The access here is locked, so this can only race with btrfs_tree_unlock,
which sets blocking_writers to 0 without the lock and unsets the lock owner.

The first branch is taken only if the same thread already holds the
lock, the second if checks for blocking writers. Here we'd either unlock
and wait, or proceed. Both are valid states of the locking protocol.

Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/locking.c | 27 ++++++++++++++-------------
 1 file changed, 14 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/locking.c b/fs/btrfs/locking.c
index 93146b495276..c84c650e56c7 100644
--- a/fs/btrfs/locking.c
+++ b/fs/btrfs/locking.c
@@ -128,20 +128,21 @@ void btrfs_tree_read_lock(struct extent_buffer *eb)
 	read_lock(&eb->lock);
 	BUG_ON(eb->blocking_writers == 0 &&
 	       current->pid == eb->lock_owner);
-	if (eb->blocking_writers && current->pid == eb->lock_owner) {
-		/*
-		 * This extent is already write-locked by our thread. We allow
-		 * an additional read lock to be added because it's for the same
-		 * thread. btrfs_find_all_roots() depends on this as it may be
-		 * called on a partly (write-)locked tree.
-		 */
-		BUG_ON(eb->lock_nested);
-		eb->lock_nested = true;
-		read_unlock(&eb->lock);
-		trace_btrfs_tree_read_lock(eb, start_ns);
-		return;
-	}
 	if (eb->blocking_writers) {
+		if (current->pid == eb->lock_owner) {
+			/*
+			 * This extent is already write-locked by our thread.
+			 * We allow an additional read lock to be added because
+			 * it's for the same thread. btrfs_find_all_roots()
+			 * depends on this as it may be called on a partly
+			 * (write-)locked tree.
+			 */
+			BUG_ON(eb->lock_nested);
+			eb->lock_nested = true;
+			read_unlock(&eb->lock);
+			trace_btrfs_tree_read_lock(eb, start_ns);
+			return;
+		}
 		read_unlock(&eb->lock);
 		wait_event(eb->write_lock_wq,
 			   eb->blocking_writers == 0);
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 2/5] btrfs: set blocking_writers directly, no increment or decrement
  2019-10-17 19:38 [PATCH 0/5] Extent buffer locking and documentation David Sterba
  2019-10-17 19:38 ` [PATCH 1/5] btrfs: merge blocking_writers branches in btrfs_tree_read_lock David Sterba
@ 2019-10-17 19:38 ` David Sterba
  2019-10-18 12:08   ` Nikolay Borisov
  2019-10-17 19:38 ` [PATCH 3/5] btrfs: access eb::blocking_writers according to ACCESS_ONCE policies David Sterba
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 20+ messages in thread
From: David Sterba @ 2019-10-17 19:38 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba

The increment and decrement were inherited from the previous version that
used atomics, switched in commit 06297d8cefca ("btrfs: switch
extent_buffer blocking_writers from atomic to int"). The only possible
values are 0 and 1 so we can set them directly.

The generated assembly (gcc 9.x) did the direct value assignment in
btrfs_set_lock_blocking_write:

     5d:   test   %eax,%eax
     5f:   je     62 <btrfs_set_lock_blocking_write+0x22>
     61:   retq

  -  62:   lock incl 0x44(%rdi)
  -  66:   add    $0x50,%rdi
  -  6a:   jmpq   6f <btrfs_set_lock_blocking_write+0x2f>

  +  62:   movl   $0x1,0x44(%rdi)
  +  69:   add    $0x50,%rdi
  +  6d:   jmpq   72 <btrfs_set_lock_blocking_write+0x32>

The part in btrfs_tree_unlock did a decrement because
BUG_ON(blockers > 1) is probably not a strong hint for the compiler, but
otherwise the output looks safe:

  - lock decl 0x44(%rdi)

  + sub    $0x1,%eax
  + mov    %eax,0x44(%rdi)

Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/locking.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/locking.c b/fs/btrfs/locking.c
index c84c650e56c7..00edf91c3d1c 100644
--- a/fs/btrfs/locking.c
+++ b/fs/btrfs/locking.c
@@ -109,7 +109,7 @@ void btrfs_set_lock_blocking_write(struct extent_buffer *eb)
 	if (eb->blocking_writers == 0) {
 		btrfs_assert_spinning_writers_put(eb);
 		btrfs_assert_tree_locked(eb);
-		eb->blocking_writers++;
+		eb->blocking_writers = 1;
 		write_unlock(&eb->lock);
 	}
 }
@@ -305,7 +305,7 @@ void btrfs_tree_unlock(struct extent_buffer *eb)
 
 	if (blockers) {
 		btrfs_assert_no_spinning_writers(eb);
-		eb->blocking_writers--;
+		eb->blocking_writers = 0;
 		/*
 		 * We need to order modifying blocking_writers above with
 		 * actually waking up the sleepers to ensure they see the
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 3/5] btrfs: access eb::blocking_writers according to ACCESS_ONCE policies
  2019-10-17 19:38 [PATCH 0/5] Extent buffer locking and documentation David Sterba
  2019-10-17 19:38 ` [PATCH 1/5] btrfs: merge blocking_writers branches in btrfs_tree_read_lock David Sterba
  2019-10-17 19:38 ` [PATCH 2/5] btrfs: set blocking_writers directly, no increment or decrement David Sterba
@ 2019-10-17 19:38 ` David Sterba
  2019-10-23  8:42   ` Nikolay Borisov
  2019-10-17 19:39 ` [PATCH 4/5] btrfs: serialize blocking_writers updates David Sterba
  2019-10-17 19:39 ` [PATCH 5/5] btrfs: document extent buffer locking David Sterba
  4 siblings, 1 reply; 20+ messages in thread
From: David Sterba @ 2019-10-17 19:38 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba

A nice writeup of the LKMM (Linux Kernel Memory Model) rules for access
once policies can be found here
https://lwn.net/Articles/799218/#Access-Marking%20Policies .

The locked and unlocked access to eb::blocking_writers should be
annotated accordingly, following this:

Writes:

- locked write must use ONCE, may use plain read
- unlocked write must use ONCE

Reads:

- unlocked read must use ONCE
- locked read may use plain read iff not mixed with unlocked read
- unlocked read then locked must use ONCE
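
Condensed to a sketch (not the actual diff, which follows below), the rules
map to annotations like this:

    /* Unlocked (optimistic) read, must be marked */
    if (READ_ONCE(eb->blocking_writers))
            return 0;

    read_lock(&eb->lock);
    /* Locked read following an unlocked read of the same variable, must be marked */
    if (READ_ONCE(eb->blocking_writers)) {
            read_unlock(&eb->lock);
            return 0;
    }

    ...

    /* Unlocked write (in btrfs_tree_unlock), must be marked */
    WRITE_ONCE(eb->blocking_writers, 0);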

There's one difference on the assembly level, where
btrfs_tree_read_lock_atomic and btrfs_try_tree_read_lock used the cached
value and did not reevaluate it after taking the lock. This could have
missed some opportunities to take the lock in case blocking writers
changed between the calls, but the window is just a few instructions
long. As this is in try-lock, the callers handle that.

Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/locking.c | 31 +++++++++++++++++++------------
 1 file changed, 19 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/locking.c b/fs/btrfs/locking.c
index 00edf91c3d1c..a12f3edc3505 100644
--- a/fs/btrfs/locking.c
+++ b/fs/btrfs/locking.c
@@ -109,7 +109,7 @@ void btrfs_set_lock_blocking_write(struct extent_buffer *eb)
 	if (eb->blocking_writers == 0) {
 		btrfs_assert_spinning_writers_put(eb);
 		btrfs_assert_tree_locked(eb);
-		eb->blocking_writers = 1;
+		WRITE_ONCE(eb->blocking_writers, 1);
 		write_unlock(&eb->lock);
 	}
 }
@@ -145,7 +145,7 @@ void btrfs_tree_read_lock(struct extent_buffer *eb)
 		}
 		read_unlock(&eb->lock);
 		wait_event(eb->write_lock_wq,
-			   eb->blocking_writers == 0);
+			   READ_ONCE(eb->blocking_writers) == 0);
 		goto again;
 	}
 	btrfs_assert_tree_read_locks_get(eb);
@@ -160,11 +160,12 @@ void btrfs_tree_read_lock(struct extent_buffer *eb)
  */
 int btrfs_tree_read_lock_atomic(struct extent_buffer *eb)
 {
-	if (eb->blocking_writers)
+	if (READ_ONCE(eb->blocking_writers))
 		return 0;
 
 	read_lock(&eb->lock);
-	if (eb->blocking_writers) {
+	/* Refetch value after lock */
+	if (READ_ONCE(eb->blocking_writers)) {
 		read_unlock(&eb->lock);
 		return 0;
 	}
@@ -180,13 +181,14 @@ int btrfs_tree_read_lock_atomic(struct extent_buffer *eb)
  */
 int btrfs_try_tree_read_lock(struct extent_buffer *eb)
 {
-	if (eb->blocking_writers)
+	if (READ_ONCE(eb->blocking_writers))
 		return 0;
 
 	if (!read_trylock(&eb->lock))
 		return 0;
 
-	if (eb->blocking_writers) {
+	/* Refetch value after lock */
+	if (READ_ONCE(eb->blocking_writers)) {
 		read_unlock(&eb->lock);
 		return 0;
 	}
@@ -202,11 +204,12 @@ int btrfs_try_tree_read_lock(struct extent_buffer *eb)
  */
 int btrfs_try_tree_write_lock(struct extent_buffer *eb)
 {
-	if (eb->blocking_writers || atomic_read(&eb->blocking_readers))
+	if (READ_ONCE(eb->blocking_writers) || atomic_read(&eb->blocking_readers))
 		return 0;
 
 	write_lock(&eb->lock);
-	if (eb->blocking_writers || atomic_read(&eb->blocking_readers)) {
+	/* Refetch value after lock */
+	if (READ_ONCE(eb->blocking_writers) || atomic_read(&eb->blocking_readers)) {
 		write_unlock(&eb->lock);
 		return 0;
 	}
@@ -277,9 +280,11 @@ void btrfs_tree_lock(struct extent_buffer *eb)
 	WARN_ON(eb->lock_owner == current->pid);
 again:
 	wait_event(eb->read_lock_wq, atomic_read(&eb->blocking_readers) == 0);
-	wait_event(eb->write_lock_wq, eb->blocking_writers == 0);
+	wait_event(eb->write_lock_wq, READ_ONCE(eb->blocking_writers) == 0);
 	write_lock(&eb->lock);
-	if (atomic_read(&eb->blocking_readers) || eb->blocking_writers) {
+	/* Refetch value after lock */
+	if (atomic_read(&eb->blocking_readers) ||
+	    READ_ONCE(eb->blocking_writers)) {
 		write_unlock(&eb->lock);
 		goto again;
 	}
@@ -294,7 +299,8 @@ void btrfs_tree_lock(struct extent_buffer *eb)
  */
 void btrfs_tree_unlock(struct extent_buffer *eb)
 {
-	int blockers = eb->blocking_writers;
+	/* This is read both locked and unlocked */
+	int blockers = READ_ONCE(eb->blocking_writers);
 
 	BUG_ON(blockers > 1);
 
@@ -305,7 +311,8 @@ void btrfs_tree_unlock(struct extent_buffer *eb)
 
 	if (blockers) {
 		btrfs_assert_no_spinning_writers(eb);
-		eb->blocking_writers = 0;
+		/* Unlocked write */
+		WRITE_ONCE(eb->blocking_writers, 0);
 		/*
 		 * We need to order modifying blocking_writers above with
 		 * actually waking up the sleepers to ensure they see the
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 4/5] btrfs: serialize blocking_writers updates
  2019-10-17 19:38 [PATCH 0/5] Extent buffer locking and documentation David Sterba
                   ` (2 preceding siblings ...)
  2019-10-17 19:38 ` [PATCH 3/5] btrfs: access eb::blocking_writers according to ACCESS_ONCE policies David Sterba
@ 2019-10-17 19:39 ` David Sterba
  2019-10-23  9:57   ` Nikolay Borisov
  2019-10-17 19:39 ` [PATCH 5/5] btrfs: document extent buffer locking David Sterba
  4 siblings, 1 reply; 20+ messages in thread
From: David Sterba @ 2019-10-17 19:39 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba

There's one potential memory ordering problem regarding
eb::blocking_writers, although this hasn't been hit in practice. On TSO
(total store ordering) architectures, the writes will always be seen in
the right order.

In case the calls in btrfs_set_lock_blocking_write are seen in this
(wrong) order on another CPU:

0:  /* blocking_writers == 0 */
1:  write_unlock()
2:  blocking_writers = 1

btrfs_try_tree_read_lock, btrfs_tree_read_lock_atomic or
btrfs_try_tree_write_lock would have to squeeze all of this between 1:
and 2:

- check if blocking_writers is 0 (it is, continue)
- try read lock, read lock or write lock (succeeds after 1:)
- check blocking_writers again (continue)
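
An annotated sketch of that interleaving (condensed from the functions in
this series, not verbatim code):

    /* CPU0 in btrfs_set_lock_blocking_write, stores observed out of order: */
    write_unlock(&eb->lock);                /* 1: seen first by CPU1 */

            /* CPU1 in btrfs_try_tree_read_lock squeezes in here: */
            if (READ_ONCE(eb->blocking_writers))    /* sees 0, continue */
                    return 0;
            if (!read_trylock(&eb->lock))           /* succeeds, rwlock was just released */
                    return 0;
            if (READ_ONCE(eb->blocking_writers)) {  /* still sees 0, continue */
                    read_unlock(&eb->lock);
                    return 0;
            }
            /* CPU1 now holds a spinning read lock ... */

    WRITE_ONCE(eb->blocking_writers, 1);    /* 2: ... while CPU0 assumes it has a
                                               blocking write lock */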

All of that assumes that the reordered writes can survive for quite some
time (unlikely if it's in the internal store buffer).

Another point against observing that in practice is that
blocking_writers and the lock are on the same cacheline (64B), so it's
more likely both values are stored in order, or some sort of pending
write flush will update blocking_writers, rwlock before the try lock
happens. Assuming the CPUs work like that and don't hide other
surprises.

struct extent_buffer {
	u64                        start;                /*     0     8 */
	long unsigned int          len;                  /*     8     8 */
	long unsigned int          bflags;               /*    16     8 */
	struct btrfs_fs_info *     fs_info;              /*    24     8 */
	spinlock_t                 refs_lock;            /*    32     4 */
	atomic_t                   refs;                 /*    36     4 */
	atomic_t                   io_pages;             /*    40     4 */
	int                        read_mirror;          /*    44     4 */
	struct callback_head callback_head __attribute__((__aligned__(8))); /*    48    16 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	pid_t                      lock_owner;           /*    64     4 */
	int                        blocking_writers;     /*    68     4 */
	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

	atomic_t                   blocking_readers;     /*    72     4 */
	bool                       lock_nested;          /*    76     1 */

	/* XXX 1 byte hole, try to pack */

	short int                  log_index;            /*    78     2 */
	rwlock_t                   lock;                 /*    80     8 */
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

	wait_queue_head_t          write_lock_wq;        /*    88    24 */
	wait_queue_head_t          read_lock_wq;         /*   112    24 */
	/* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
	struct page *              pages[16];            /*   136   128 */

	/* size: 264, cachelines: 5, members: 18 */
	/* sum members: 263, holes: 1, sum holes: 1 */
	/* forced alignments: 1 */
	/* last cacheline: 8 bytes */
} __attribute__((__aligned__(8)));

Add the barriers for correctness' sake.

Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/locking.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/locking.c b/fs/btrfs/locking.c
index a12f3edc3505..e0e0430577aa 100644
--- a/fs/btrfs/locking.c
+++ b/fs/btrfs/locking.c
@@ -110,6 +110,18 @@ void btrfs_set_lock_blocking_write(struct extent_buffer *eb)
 		btrfs_assert_spinning_writers_put(eb);
 		btrfs_assert_tree_locked(eb);
 		WRITE_ONCE(eb->blocking_writers, 1);
+		/*
+		 * Writers must be updated before the lock is released so
+		 * that other threads see that in order. Otherwise this could
+		 * theoretically lead to blocking_writers still being set to 1 but
+		 * this would be unexpected eg. for spinning locks.
+		 *
+		 * There are no pairing read barriers for unlocked access as we
+		 * don't strictly need them for correctness (eg. in
+		 * btrfs_try_tree_read_lock), and the unlock/lock is an implied
+		 * barrier in other cases.
+		 */
+		smp_wmb();
 		write_unlock(&eb->lock);
 	}
 }
@@ -316,7 +328,9 @@ void btrfs_tree_unlock(struct extent_buffer *eb)
 		/*
 		 * We need to order modifying blocking_writers above with
 		 * actually waking up the sleepers to ensure they see the
-		 * updated value of blocking_writers
+		 * updated value of blocking_writers.
+		 *
+		 * cond_wake_up calls smp_mb
 		 */
 		cond_wake_up(&eb->write_lock_wq);
 	} else {
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 5/5] btrfs: document extent buffer locking
  2019-10-17 19:38 [PATCH 0/5] Extent buffer locking and documentation David Sterba
                   ` (3 preceding siblings ...)
  2019-10-17 19:39 ` [PATCH 4/5] btrfs: serialize blocking_writers updates David Sterba
@ 2019-10-17 19:39 ` David Sterba
  2019-10-18  0:17   ` Qu Wenruo
                     ` (2 more replies)
  4 siblings, 3 replies; 20+ messages in thread
From: David Sterba @ 2019-10-17 19:39 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba

Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/locking.c | 110 +++++++++++++++++++++++++++++++++++++++------
 1 file changed, 96 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/locking.c b/fs/btrfs/locking.c
index e0e0430577aa..2a0e828b4470 100644
--- a/fs/btrfs/locking.c
+++ b/fs/btrfs/locking.c
@@ -13,6 +13,48 @@
 #include "extent_io.h"
 #include "locking.h"
 
+/*
+ * Extent buffer locking
+ * ~~~~~~~~~~~~~~~~~~~~~
+ *
+ * The locks use a custom scheme that allows doing more operations than are
+ * available from current locking primitives. The building blocks are still
+ * rwlock and wait queues.
+ *
+ * Required semantics:
+ *
+ * - reader/writer exclusion
+ * - writer/writer exclusion
+ * - reader/reader sharing
+ * - spinning lock semantics
+ * - blocking lock semantics
+ * - try-lock semantics for readers and writers
+ * - one level nesting, allowing read lock to be taken by the same thread that
+ *   already has write lock
+ *
+ * The extent buffer locks (also called tree locks) manage access to eb data.
+ * We want concurrency of many readers and safe updates. The underlying locking
+ * is done by read-write spinlock and the blocking part is implemented using
+ * counters and wait queues.
+ *
+ * spinning semantics - the low-level rwlock is held so all other threads that
+ *                      want to take it are spinning on it.
+ *
+ * blocking semantics - the low-level rwlock is not held but the counter
+ *                      denotes how many times the blocking lock was held;
+ *                      sleeping is possible
+ *
+ * Write lock always allows only one thread to access the data.
+ *
+ *
+ * Debugging
+ * ~~~~~~~~~
+ *
+ * There are additional state counters that are asserted in various contexts,
+ * removed from non-debug build to reduce extent_buffer size and for
+ * performance reasons.
+ */
+
 #ifdef CONFIG_BTRFS_DEBUG
 static inline void btrfs_assert_spinning_writers_get(struct extent_buffer *eb)
 {
@@ -80,6 +122,15 @@ static void btrfs_assert_tree_write_locks_get(struct extent_buffer *eb) { }
 static void btrfs_assert_tree_write_locks_put(struct extent_buffer *eb) { }
 #endif
 
+/*
+ * Mark already held read lock as blocking. Can be nested in write lock by the
+ * same thread.
+ *
+ * Use when there are potentially long operations ahead so other threads waiting
+ * on the lock will not actively spin but sleep instead.
+ *
+ * The rwlock is released and blocking reader counter is increased.
+ */
 void btrfs_set_lock_blocking_read(struct extent_buffer *eb)
 {
 	trace_btrfs_set_lock_blocking_read(eb);
@@ -96,6 +147,14 @@ void btrfs_set_lock_blocking_read(struct extent_buffer *eb)
 	read_unlock(&eb->lock);
 }
 
+/*
+ * Mark already held write lock as blocking.
+ *
+ * Use when there are potentially long operations ahead so other threads
+ * waiting on the lock will not actively spin but sleep instead.
+ *
+ * The rwlock is released and blocking writers is set.
+ */
 void btrfs_set_lock_blocking_write(struct extent_buffer *eb)
 {
 	trace_btrfs_set_lock_blocking_write(eb);
@@ -127,8 +186,13 @@ void btrfs_set_lock_blocking_write(struct extent_buffer *eb)
 }
 
 /*
- * take a spinning read lock.  This will wait for any blocking
- * writers
+ * Lock the extent buffer for read. Wait for any writers (spinning or blocking).
+ * Can be nested in write lock by the same thread.
+ *
+ * Use when the locked section does only lightweight actions and busy waiting
+ * would be cheaper than making other threads do the wait/wake loop.
+ *
+ * The rwlock is held upon exit.
  */
 void btrfs_tree_read_lock(struct extent_buffer *eb)
 {
@@ -166,9 +230,10 @@ void btrfs_tree_read_lock(struct extent_buffer *eb)
 }
 
 /*
- * take a spinning read lock.
- * returns 1 if we get the read lock and 0 if we don't
- * this won't wait for blocking writers
+ * Lock extent buffer for read, optimistically expecting that there are no
+ * contending blocking writers. If there are, don't wait.
+ *
+ * Return 1 if the rwlock has been taken, 0 otherwise
  */
 int btrfs_tree_read_lock_atomic(struct extent_buffer *eb)
 {
@@ -188,8 +253,9 @@ int btrfs_tree_read_lock_atomic(struct extent_buffer *eb)
 }
 
 /*
- * returns 1 if we get the read lock and 0 if we don't
- * this won't wait for blocking writers
+ * Try-lock for read. Don't block or wait for contending writers.
+ *
+ * Return 1 if the rwlock has been taken, 0 otherwise
  */
 int btrfs_try_tree_read_lock(struct extent_buffer *eb)
 {
@@ -211,8 +277,10 @@ int btrfs_try_tree_read_lock(struct extent_buffer *eb)
 }
 
 /*
- * returns 1 if we get the read lock and 0 if we don't
- * this won't wait for blocking writers or readers
+ * Try-lock for write. May block until the lock is uncontended, but does not
+ * wait until it is free.
+ *
+ * Return 1 if the rwlock has been taken, 0 otherwise
  */
 int btrfs_try_tree_write_lock(struct extent_buffer *eb)
 {
@@ -233,7 +301,10 @@ int btrfs_try_tree_write_lock(struct extent_buffer *eb)
 }
 
 /*
- * drop a spinning read lock
+ * Release read lock. Must be used only if the lock is in spinning mode.  If
+ * the read lock is nested, must pair with read lock before the write unlock.
+ *
+ * The rwlock is not held upon exit.
  */
 void btrfs_tree_read_unlock(struct extent_buffer *eb)
 {
@@ -255,7 +326,11 @@ void btrfs_tree_read_unlock(struct extent_buffer *eb)
 }
 
 /*
- * drop a blocking read lock
+ * Release read lock, previously set to blocking by a pairing call to
+ * btrfs_set_lock_blocking_read(). Can be nested in write lock by the same
+ * thread.
+ *
+ * State of rwlock is unchanged, last reader wakes waiting threads.
  */
 void btrfs_tree_read_unlock_blocking(struct extent_buffer *eb)
 {
@@ -279,8 +354,10 @@ void btrfs_tree_read_unlock_blocking(struct extent_buffer *eb)
 }
 
 /*
- * take a spinning write lock.  This will wait for both
- * blocking readers or writers
+ * Lock for write. Wait for all blocking and spinning readers and writers. This
+ * starts context where reader lock could be nested by the same thread.
+ *
+ * The rwlock is held for write upon exit.
  */
 void btrfs_tree_lock(struct extent_buffer *eb)
 {
@@ -307,7 +384,12 @@ void btrfs_tree_lock(struct extent_buffer *eb)
 }
 
 /*
- * drop a spinning or a blocking write lock.
+ * Release the write lock, either blocking or spinning (ie. there's no need
+ * for an explicit blocking unlock, like btrfs_tree_read_unlock_blocking).
+ * This also ends the context for nesting, the read lock must have been
+ * released already.
+ *
+ * Tasks blocked and waiting are woken, rwlock is not held upon exit.
  */
 void btrfs_tree_unlock(struct extent_buffer *eb)
 {
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH 5/5] btrfs: document extent buffer locking
  2019-10-17 19:39 ` [PATCH 5/5] btrfs: document extent buffer locking David Sterba
@ 2019-10-18  0:17   ` Qu Wenruo
  2019-10-18 11:56     ` David Sterba
  2019-10-22  9:53   ` Nikolay Borisov
  2019-10-23 11:16   ` Nikolay Borisov
  2 siblings, 1 reply; 20+ messages in thread
From: Qu Wenruo @ 2019-10-18  0:17 UTC (permalink / raw)
  To: David Sterba, linux-btrfs





On 2019/10/18 3:39 AM, David Sterba wrote:
> Signed-off-by: David Sterba <dsterba@suse.com>

Great document.

Some questions inlined below.
> ---
>  fs/btrfs/locking.c | 110 +++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 96 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/btrfs/locking.c b/fs/btrfs/locking.c
> index e0e0430577aa..2a0e828b4470 100644
> --- a/fs/btrfs/locking.c
> +++ b/fs/btrfs/locking.c
> @@ -13,6 +13,48 @@
>  #include "extent_io.h"
>  #include "locking.h"
>  
> +/*
> + * Extent buffer locking
> + * ~~~~~~~~~~~~~~~~~~~~~
> + *
> + * The locks use a custom scheme that allows doing more operations than are
> + * available from current locking primitives. The building blocks are still
> + * rwlock and wait queues.
> + *
> + * Required semantics:
> + *
> + * - reader/writer exclusion
> + * - writer/writer exclusion
> + * - reader/reader sharing
> + * - spinning lock semantics
> + * - blocking lock semantics
> + * - try-lock semantics for readers and writers
> + * - one level nesting, allowing read lock to be taken by the same thread that
> + *   already has write lock

Any example of this scenario? IIRC there is only one user of the nested lock.
Although we've known about it for a long time, I guess it would be better
to try to remove such call sites?

> + *
> + * The extent buffer locks (also called tree locks) manage access to eb data.

One of my concerns related to "access to eb data" is that, to access some
members, we don't really need any lock at all.

Some members should never change during the lifespan of an eb. E.g.
bytenr, transid.

Some code is already taking advantage of this, like the tree-checker
checking the transid without holding a lock.
Not sure if we should make use of this.

> + * We want concurrency of many readers and safe updates. The underlying locking
> + * is done by read-write spinlock and the blocking part is implemented using
> + * counters and wait queues.
> + *
> + * spinning semantics - the low-level rwlock is held so all other threads that
> + *                      want to take it are spinning on it.
> + *
> + * blocking semantics - the low-level rwlock is not held but the counter
> + *                      denotes how many times the blocking lock was held;
> + *                      sleeping is possible

What about an example/state machine of all the read/write and
spinning/blocking combinations?

Thanks,
Qu

> + *
> + * Write lock always allows only one thread to access the data.
> + *
> + *
> + * Debugging
> + * ~~~~~~~~~
> + *
> + * There are additional state counters that are asserted in various contexts,
> + * removed from non-debug build to reduce extent_buffer size and for
> + * performance reasons.
> + */
> +
>  #ifdef CONFIG_BTRFS_DEBUG
>  static inline void btrfs_assert_spinning_writers_get(struct extent_buffer *eb)
>  {
> @@ -80,6 +122,15 @@ static void btrfs_assert_tree_write_locks_get(struct extent_buffer *eb) { }
>  static void btrfs_assert_tree_write_locks_put(struct extent_buffer *eb) { }
>  #endif
>  
> +/*
> + * Mark already held read lock as blocking. Can be nested in write lock by the
> + * same thread.
> + *
> + * Use when there are potentially long operations ahead so other threads waiting
> + * on the lock will not actively spin but sleep instead.
> + *
> + * The rwlock is released and blocking reader counter is increased.
> + */
>  void btrfs_set_lock_blocking_read(struct extent_buffer *eb)
>  {
>  	trace_btrfs_set_lock_blocking_read(eb);
> @@ -96,6 +147,14 @@ void btrfs_set_lock_blocking_read(struct extent_buffer *eb)
>  	read_unlock(&eb->lock);
>  }
>  
> +/*
> + * Mark already held write lock as blocking.
> + *
> + * Use when there are potentially long operations ahead so other threads
> + * waiting on the lock will not actively spin but sleep instead.
> + *
> + * The rwlock is released and blocking writers is set.
> + */
>  void btrfs_set_lock_blocking_write(struct extent_buffer *eb)
>  {
>  	trace_btrfs_set_lock_blocking_write(eb);
> @@ -127,8 +186,13 @@ void btrfs_set_lock_blocking_write(struct extent_buffer *eb)
>  }
>  
>  /*
> - * take a spinning read lock.  This will wait for any blocking
> - * writers
> + * Lock the extent buffer for read. Wait for any writers (spinning or blocking).
> + * Can be nested in write lock by the same thread.
> + *
> + * Use when the locked section does only lightweight actions and busy waiting
> + * would be cheaper than making other threads do the wait/wake loop.
> + *
> + * The rwlock is held upon exit.
>   */
>  void btrfs_tree_read_lock(struct extent_buffer *eb)
>  {
> @@ -166,9 +230,10 @@ void btrfs_tree_read_lock(struct extent_buffer *eb)
>  }
>  
>  /*
> - * take a spinning read lock.
> - * returns 1 if we get the read lock and 0 if we don't
> - * this won't wait for blocking writers
> + * Lock extent buffer for read, optimistically expecting that there are no
> + * contending blocking writers. If there are, don't wait.
> + *
> + * Return 1 if the rwlock has been taken, 0 otherwise
>   */
>  int btrfs_tree_read_lock_atomic(struct extent_buffer *eb)
>  {
> @@ -188,8 +253,9 @@ int btrfs_tree_read_lock_atomic(struct extent_buffer *eb)
>  }
>  
>  /*
> - * returns 1 if we get the read lock and 0 if we don't
> - * this won't wait for blocking writers
> + * Try-lock for read. Don't block or wait for contending writers.
> + *
> + * Return 1 if the rwlock has been taken, 0 otherwise
>   */
>  int btrfs_try_tree_read_lock(struct extent_buffer *eb)
>  {
> @@ -211,8 +277,10 @@ int btrfs_try_tree_read_lock(struct extent_buffer *eb)
>  }
>  
>  /*
> - * returns 1 if we get the read lock and 0 if we don't
> - * this won't wait for blocking writers or readers
> + * Try-lock for write. May block until the lock is uncontended, but does not
> + * wait until it is free.
> + *
> + * Return 1 if the rwlock has been taken, 0 otherwise
>   */
>  int btrfs_try_tree_write_lock(struct extent_buffer *eb)
>  {
> @@ -233,7 +301,10 @@ int btrfs_try_tree_write_lock(struct extent_buffer *eb)
>  }
>  
>  /*
> - * drop a spinning read lock
> + * Release read lock. Must be used only if the lock is in spinning mode.  If
> + * the read lock is nested, must pair with read lock before the write unlock.
> + *
> + * The rwlock is not held upon exit.
>   */
>  void btrfs_tree_read_unlock(struct extent_buffer *eb)
>  {
> @@ -255,7 +326,11 @@ void btrfs_tree_read_unlock(struct extent_buffer *eb)
>  }
>  
>  /*
> - * drop a blocking read lock
> + * Release read lock, previously set to blocking by a pairing call to
> + * btrfs_set_lock_blocking_read(). Can be nested in write lock by the same
> + * thread.
> + *
> + * State of rwlock is unchanged, last reader wakes waiting threads.
>   */
>  void btrfs_tree_read_unlock_blocking(struct extent_buffer *eb)
>  {
> @@ -279,8 +354,10 @@ void btrfs_tree_read_unlock_blocking(struct extent_buffer *eb)
>  }
>  
>  /*
> - * take a spinning write lock.  This will wait for both
> - * blocking readers or writers
> + * Lock for write. Wait for all blocking and spinning readers and writers. This
> + * starts context where reader lock could be nested by the same thread.
> + *
> + * The rwlock is held for write upon exit.
>   */
>  void btrfs_tree_lock(struct extent_buffer *eb)
>  {
> @@ -307,7 +384,12 @@ void btrfs_tree_lock(struct extent_buffer *eb)
>  }
>  
>  /*
> - * drop a spinning or a blocking write lock.
> + * Release the write lock, either blocking or spinning (ie. there's no need
> + * for an explicit blocking unlock, like btrfs_tree_read_unlock_blocking).
> + * This also ends the context for nesting, the read lock must have been
> + * released already.
> + *
> + * Tasks blocked and waiting are woken, rwlock is not held upon exit.
>   */
>  void btrfs_tree_unlock(struct extent_buffer *eb)
>  {
> 



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 5/5] btrfs: document extent buffer locking
  2019-10-18  0:17   ` Qu Wenruo
@ 2019-10-18 11:56     ` David Sterba
  0 siblings, 0 replies; 20+ messages in thread
From: David Sterba @ 2019-10-18 11:56 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: David Sterba, linux-btrfs

On Fri, Oct 18, 2019 at 08:17:44AM +0800, Qu Wenruo wrote:
> > + * Required semantics:
> > + *
> > + * - reader/writer exclusion
> > + * - writer/writer exclusion
> > + * - reader/reader sharing
> > + * - spinning lock semantics
> > + * - blocking lock semantics
> > + * - try-lock semantics for readers and writers
> > + * - one level nesting, allowing read lock to be taken by the same thread that
> > + *   already has write lock
> 
> Any example about this scenario? IIRC there is only one user of nested lock.
> Although we know it exists for a long time, I guess it would be better
> trying to remove such call sites?

It could make a few things easier on the locking side, but I don't know
how intrusive it would be in the scenario where it happens. The
stacktrace is quite long so some sort of propagation of the locked
context would have to happen.

> 
> > + *
> > + * The extent buffer locks (also called tree locks) manage access to eb data.
> 
> One of my concern related to "access to eb data" is, to access some
> member, we don't really need any lock at all.
> 
> Some members should never change during the lifespan of an eb. E.g.
> bytenr, transid.

Ok, so the paragraph can be more specific that the locking concerns the
b-tree data, ie. item keys and the corresponding item data.

> Some code is already taking advantage of this, like tree-checker
> checking the transid without holding a lock.
> Not sure if we should take use of this.
> 
> > + * We want concurrency of many readers and safe updates. The underlying locking
> > + * is done by read-write spinlock and the blocking part is implemented using
> > + * counters and wait queues.
> > + *
> > + * spinning semantics - the low-level rwlock is held so all other threads that
> > + *                      want to take it are spinning on it.
> > + *
> > + * blocking semantics - the low-level rwlock is not held but the counter
> > + *                      denotes how many times the blocking lock was held;
> > + *                      sleeping is possible
> 
> What about an example/state machine of all read/write and
> spinning/blocking combination?

The state machine is really trivial: either use just the spinning locks,
or at some point set the lock to blocking and unblock/unlock at the end. The
main difference is whether the rwlock is held or not.
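
Roughly, the two patterns of a (hypothetical) caller look like this:

    /* Spinning only, short critical section */
    btrfs_tree_lock(eb);
    /* ... short update of the eb ... */
    btrfs_tree_unlock(eb);

    /* Spinning lock converted to blocking before a long operation */
    btrfs_tree_lock(eb);
    btrfs_set_lock_blocking_write(eb);
    /* ... potentially sleeping operations ... */
    btrfs_tree_unlock(eb);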

A few more words about that are in
https://github.com/btrfs/btrfs-dev-docs/blob/master/extent-buffer-locking.txt

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/5] btrfs: set blocking_writers directly, no increment or decrement
  2019-10-17 19:38 ` [PATCH 2/5] btrfs: set blocking_writers directly, no increment or decrement David Sterba
@ 2019-10-18 12:08   ` Nikolay Borisov
  2019-10-18 17:31     ` David Sterba
  0 siblings, 1 reply; 20+ messages in thread
From: Nikolay Borisov @ 2019-10-18 12:08 UTC (permalink / raw)
  To: David Sterba, linux-btrfs



On 17.10.19 22:38, David Sterba wrote:
> The increment and decrement were inherited from the previous version that
> used atomics, switched in commit 06297d8cefca ("btrfs: switch
> extent_buffer blocking_writers from atomic to int"). The only possible
> values are 0 and 1 so we can set them directly.
> 
> The generated assembly (gcc 9.x) did the direct value assignment in
> btrfs_set_lock_blocking_write:
> 
>      5d:   test   %eax,%eax
>      5f:   je     62 <btrfs_set_lock_blocking_write+0x22>
>      61:   retq
> 
>   -  62:   lock incl 0x44(%rdi)
>   -  66:   add    $0x50,%rdi
>   -  6a:   jmpq   6f <btrfs_set_lock_blocking_write+0x2f>
> 
>   +  62:   movl   $0x1,0x44(%rdi)
>   +  69:   add    $0x50,%rdi
>   +  6d:   jmpq   72 <btrfs_set_lock_blocking_write+0x32>
> 
> The part in btrfs_tree_unlock did a decrement because
> BUG_ON(blockers > 1) is probably not a strong hint for the compiler, but
> otherwise the output looks safe:
> 
>   - lock decl 0x44(%rdi)
> 
>   + sub    $0x1,%eax
>   + mov    %eax,0x44(%rdi)

I find it strange that the plain inc/dec had the lock prefix in the
resulting assembly. The sub instruction can have the lock prefix, but
here the compiler chooses not to use it. While this doesn't mean it's
buggy, I'm just curious what's happening.

> 
> Signed-off-by: David Sterba <dsterba@suse.com>
> ---
>  fs/btrfs/locking.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/locking.c b/fs/btrfs/locking.c
> index c84c650e56c7..00edf91c3d1c 100644
> --- a/fs/btrfs/locking.c
> +++ b/fs/btrfs/locking.c
> @@ -109,7 +109,7 @@ void btrfs_set_lock_blocking_write(struct extent_buffer *eb)
>  	if (eb->blocking_writers == 0) {
>  		btrfs_assert_spinning_writers_put(eb);
>  		btrfs_assert_tree_locked(eb);
> -		eb->blocking_writers++;
> +		eb->blocking_writers = 1;
>  		write_unlock(&eb->lock);
>  	}
>  }
> @@ -305,7 +305,7 @@ void btrfs_tree_unlock(struct extent_buffer *eb)
>  
>  	if (blockers) {
>  		btrfs_assert_no_spinning_writers(eb);
> -		eb->blocking_writers--;
> +		eb->blocking_writers = 0;
>  		/*
>  		 * We need to order modifying blocking_writers above with
>  		 * actually waking up the sleepers to ensure they see the
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/5] btrfs: set blocking_writers directly, no increment or decrement
  2019-10-18 12:08   ` Nikolay Borisov
@ 2019-10-18 17:31     ` David Sterba
  0 siblings, 0 replies; 20+ messages in thread
From: David Sterba @ 2019-10-18 17:31 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: David Sterba, linux-btrfs

On Fri, Oct 18, 2019 at 03:08:43PM +0300, Nikolay Borisov wrote:
> 
> 
> On 17.10.19 22:38, David Sterba wrote:
> > The increment and decrement were inherited from the previous version that
> > used atomics, switched in commit 06297d8cefca ("btrfs: switch
> > extent_buffer blocking_writers from atomic to int"). The only possible
> > values are 0 and 1 so we can set them directly.
> > 
> > The generated assembly (gcc 9.x) did the direct value assignment in
> > btrfs_set_lock_blocking_write:
> > 
> >      5d:   test   %eax,%eax
> >      5f:   je     62 <btrfs_set_lock_blocking_write+0x22>
> >      61:   retq
> > 
> >   -  62:   lock incl 0x44(%rdi)
> >   -  66:   add    $0x50,%rdi
> >   -  6a:   jmpq   6f <btrfs_set_lock_blocking_write+0x2f>
> > 
> >   +  62:   movl   $0x1,0x44(%rdi)
> >   +  69:   add    $0x50,%rdi
> >   +  6d:   jmpq   72 <btrfs_set_lock_blocking_write+0x32>
> > 
> > The part in btrfs_tree_unlock did a decrement because
> > BUG_ON(blockers > 1) is probably not a strong hint for the compiler, but
> > otherwise the output looks safe:
> > 
> >   - lock decl 0x44(%rdi)
> > 
> >   + sub    $0x1,%eax
> >   + mov    %eax,0x44(%rdi)
> 
> I find it strange that plain inc/decs had the lock prefix in the
> resulting assembly. The sub instruction can have the lock prefix but
> here the compiler chooses not to. While this doesn't mean it's buggy I'm
> just curious what's happening.

The asm snippet is from the patch that did atomics -> int, which was
atomic_dec(blocking_writers) -> blocking_writers--, yet the result was
a plain store.

It's probably not too clear from the context that it refers to the
mentioned commit 06297d8cefca.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 5/5] btrfs: document extent buffer locking
  2019-10-17 19:39 ` [PATCH 5/5] btrfs: document extent buffer locking David Sterba
  2019-10-18  0:17   ` Qu Wenruo
@ 2019-10-22  9:53   ` Nikolay Borisov
  2019-10-22 10:41     ` David Sterba
  2019-10-23 11:16   ` Nikolay Borisov
  2 siblings, 1 reply; 20+ messages in thread
From: Nikolay Borisov @ 2019-10-22  9:53 UTC (permalink / raw)
  To: David Sterba, linux-btrfs



On 17.10.19 22:39, David Sterba wrote:
> Signed-off-by: David Sterba <dsterba@suse.com>
> ---
>  fs/btrfs/locking.c | 110 +++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 96 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/btrfs/locking.c b/fs/btrfs/locking.c
> index e0e0430577aa..2a0e828b4470 100644
> --- a/fs/btrfs/locking.c
> +++ b/fs/btrfs/locking.c
> @@ -13,6 +13,48 @@
>  #include "extent_io.h"
>  #include "locking.h"
>  
> +/*
> + * Extent buffer locking
> + * ~~~~~~~~~~~~~~~~~~~~~
> + *
> + * The locks use a custom scheme that allows doing more operations than are
> + * available from current locking primitives. The building blocks are still
> + * rwlock and wait queues.
> + *
> + * Required semantics:

Thinking out loud here..

> + *
> + * - reader/writer exclusion

RWSEM has that

> + * - writer/writer exclusion

RWSEM has that

> + * - reader/reader sharing

RWSEM has that
> + * - spinning lock semantics

RWSEM has that in the form of optimistic spinning which would be shorter
than what the custom implementation provides.

> + * - blocking lock semantics

RWSEM has that

> + * - try-lock semantics for readers and writers

down_write_trylock, down_read_trylock.

> + * - one level nesting, allowing read lock to be taken by the same thread that
> + *   already has write lock

this might be somewhat problematic, but there is downgrade_write which
could be used (see the sketch below); I'll have to check.
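
(Rough sketch of what I mean, with a hypothetical eb->sem rwsem, not a
concrete proposal:)

    down_write(&eb->sem);       /* writer/writer and reader/writer exclusion */
    /* ... */
    downgrade_write(&eb->sem);  /* write -> read without dropping the lock */
    /* ... the section that currently relies on the nested read lock ... */
    up_read(&eb->sem);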

> + *
> + * The extent buffer locks (also called tree locks) manage access to eb data.
> + * We want concurrency of many readers and safe updates. The underlying locking
> + * is done by read-write spinlock and the blocking part is implemented using
> + * counters and wait queues.
> + *
> + * spinning semantics - the low-level rwlock is held so all other threads that
> + *                      want to take it are spinning on it.
> + *
> + * blocking semantics - the low-level rwlock is not held but the counter
> + *                      denotes how many times the blocking lock was held;
> + *                      sleeping is possible
> + *
> + * Write lock always allows only one thread to access the data.

<snip>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 5/5] btrfs: document extent buffer locking
  2019-10-22  9:53   ` Nikolay Borisov
@ 2019-10-22 10:41     ` David Sterba
  0 siblings, 0 replies; 20+ messages in thread
From: David Sterba @ 2019-10-22 10:41 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: David Sterba, linux-btrfs

On Tue, Oct 22, 2019 at 12:53:29PM +0300, Nikolay Borisov wrote:
> 
> 
> On 17.10.19 22:39, David Sterba wrote:
> > Signed-off-by: David Sterba <dsterba@suse.com>
> > ---
> >  fs/btrfs/locking.c | 110 +++++++++++++++++++++++++++++++++++++++------
> >  1 file changed, 96 insertions(+), 14 deletions(-)
> > 
> > diff --git a/fs/btrfs/locking.c b/fs/btrfs/locking.c
> > index e0e0430577aa..2a0e828b4470 100644
> > --- a/fs/btrfs/locking.c
> > +++ b/fs/btrfs/locking.c
> > @@ -13,6 +13,48 @@
> >  #include "extent_io.h"
> >  #include "locking.h"
> >  
> > +/*
> > + * Extent buffer locking
> > + * ~~~~~~~~~~~~~~~~~~~~~
> > + *
> > + * The locks use a custom scheme that allows doing more operations than are
> > + * available from current locking primitives. The building blocks are still
> > + * rwlock and wait queues.
> > + *
> > + * Required semantics:
> 
> Thinking out loud here..
> 
> > + *
> > + * - reader/writer exclusion
> 
> RWSEM has that
> 
> > + * - writer/writer exclusion
> 
> RWSEM has that
> 
> > + * - reader/reader sharing
> 
> RWSEM has that
> > + * - spinning lock semantics
> 
> RWSEM has that in the form of optimistic spinning which would be shorter
> than what the custom implementation provides.
> 
> > + * - blocking lock semantics
> 
> RWSEM has that

'blocking' in the btrfs lock does not hold the rwlock, does rwsem do it
the same way? The term might be confusing: a thread that tries to get a
lock is also blocking (and spinning while trying), but the point here is
that the lock is not held, so any sleeping operation can happen. And with
the locking status offloaded, we're not nesting the locking primitives,
only the big tree locks.

> > + * - try-lock semantics for readers and writers
> 
> down_write_trylock, down_read_trylock.
> 
> > + * - one level nesting, allowing read lock to be taken by the same thread that
> > + *   already has write lock
> 
> this might be somewhat problematic but there is downgrade_write which
> could be used, I'll have to check.

The read lock happens transparently from the POV of the thread that
takes it. As there's only one known context where the write->read
nesting happens, it would be possible to do some magic to convert it,
but it's still problematic.

The read function is also called without the write lock held. So it would
have to atomically check if the lock is held for write AND by the same
task. I don't know if this is feasible.
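
For reference, the nesting check as of patch 1 in this series (trimmed):

	read_lock(&eb->lock);
	if (eb->blocking_writers) {
		if (current->pid == eb->lock_owner) {
			/* Write-locked by this thread, allow the nested read lock */
			BUG_ON(eb->lock_nested);
			eb->lock_nested = true;
			read_unlock(&eb->lock);
			trace_btrfs_tree_read_lock(eb, start_ns);
			return;
		}
		...
	}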

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/5] btrfs: access eb::blocking_writers according to ACCESS_ONCE policies
  2019-10-17 19:38 ` [PATCH 3/5] btrfs: access eb::blocking_writers according to ACCESS_ONCE policies David Sterba
@ 2019-10-23  8:42   ` Nikolay Borisov
  2019-10-29 21:33     ` David Sterba
  0 siblings, 1 reply; 20+ messages in thread
From: Nikolay Borisov @ 2019-10-23  8:42 UTC (permalink / raw)
  To: David Sterba, linux-btrfs



On 17.10.19 22:38, David Sterba wrote:
> A nice writeup of the LKMM (Linux Kernel Memory Model) rules for access
> once policies can be found here
> https://lwn.net/Articles/799218/#Access-Marking%20Policies .
> 
> The locked and unlocked access to eb::blocking_writers should be
> annotated accordingly, following this:
> 
> Writes:
> 
> - locked write must use ONCE, may use plain read
> - unlocked write must use ONCE
> 
> Reads:
> 
> - unlocked read must use ONCE
> - locked read may use plain read iff not mixed with unlocked read
> - unlocked read then locked must use ONCE
> 
> There's one difference on the assembly level, where
> btrfs_tree_read_lock_atomic and btrfs_try_tree_read_lock used the cached
> value and did not reevaluate it after taking the lock. This could have
> missed some opportunities to take the lock in case blocking writers
> changed between the calls, but the window is just a few instructions
> long. As this is in try-lock, the callers handle that.

Looking at blocking_writers, it seems this variable is modified under
eb->lock (in set lock blocking, which requires holding the spinning lock
first, meaning the setting thread is the owner). At this point the
owning thread will eventually have to unlock it; this happens in
btrfs_tree_unlock. There the read/write of the variable is either locked
or unlocked, depending on whether the lock is blocking (unlocked) or
spinning (locked).

OTOH other readers of the lock might access it unlocked, namely:
tree_read_lock (in case wait_event is called), tree_read_lock_atomic
(where the first access is unlocked), and the optimistic unlocked checks
in try_tree_(read|write)_lock.

So I believe this puts us in scenario 3 in the referenced section of
that LWN article:

A shared variable (blocking_writers) is only modified while holding a
given lock by a given owning CPU or thread (that's the thread which
first took a spinning lock and eventually converted it to a blocking
one), but is read by other CPUs or threads or by code not holding that
lock (that's other threads that might want to acquire the lock on the
given eb).

Apart from one minor nit below (and provided my understanding is correct
as per above exposé) you can add:

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

> 
> Signed-off-by: David Sterba <dsterba@suse.com>
> ---
>  fs/btrfs/locking.c | 31 +++++++++++++++++++------------
>  1 file changed, 19 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/btrfs/locking.c b/fs/btrfs/locking.c
> index 00edf91c3d1c..a12f3edc3505 100644
> --- a/fs/btrfs/locking.c
> +++ b/fs/btrfs/locking.c
> @@ -109,7 +109,7 @@ void btrfs_set_lock_blocking_write(struct extent_buffer *eb)
>  	if (eb->blocking_writers == 0) {
>  		btrfs_assert_spinning_writers_put(eb);
>  		btrfs_assert_tree_locked(eb);
> -		eb->blocking_writers = 1;
> +		WRITE_ONCE(eb->blocking_writers, 1);
>  		write_unlock(&eb->lock);
>  	}
>  }

<snip>


> @@ -294,7 +299,8 @@ void btrfs_tree_lock(struct extent_buffer *eb)
>   */
>  void btrfs_tree_unlock(struct extent_buffer *eb)
>  {
> -	int blockers = eb->blocking_writers;
> +	/* This is read both locked and unlocked */
> +	int blockers = READ_ONCE(eb->blocking_writers);

Actually aren't we guaranteed that btrfs_tree_unlock's caller is the
owner of blocking_writers meaning we can use plain loads as per:

"The owning CPU or thread may use plain loads..."

>  
>  	BUG_ON(blockers > 1);
>  
> @@ -305,7 +311,8 @@ void btrfs_tree_unlock(struct extent_buffer *eb)
>  
>  	if (blockers) {
>  		btrfs_assert_no_spinning_writers(eb);
> -		eb->blocking_writers = 0;
> +		/* Unlocked write */
> +		WRITE_ONCE(eb->blocking_writers, 0);
>  		/*
>  		 * We need to order modifying blocking_writers above with
>  		 * actually waking up the sleepers to ensure they see the
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/5] btrfs: serialize blocking_writers updates
  2019-10-17 19:39 ` [PATCH 4/5] btrfs: serialize blocking_writers updates David Sterba
@ 2019-10-23  9:57   ` Nikolay Borisov
  2019-10-29 17:51     ` David Sterba
  0 siblings, 1 reply; 20+ messages in thread
From: Nikolay Borisov @ 2019-10-23  9:57 UTC (permalink / raw)
  To: David Sterba, linux-btrfs



On 17.10.19 22:39, David Sterba wrote:
> There's one potential memory ordering problem regarding
> eb::blocking_writers, although this hasn't been hit in practice. On TSO
> (total store ordering) architectures, the writes will always be seen in
> the right order.
> 
> In case the calls in btrfs_set_lock_blocking_write are seen in this
> (wrong) order on another CPU:
> 
> 0:  /* blocking_writers == 0 */
> 1:  write_unlock()
> 2:  blocking_writers = 1
> 
> btrfs_try_tree_read_lock, btrfs_tree_read_lock_atomic or
> btrfs_try_tree_write_lock would have to squeeze all of this between 1:
> and 2:

This is only a problem for the unlocked (optimistic) accesses in those
functions, simply because from their POV the eb->lock doesn't play any
role, i.e. they don't know about it at all.

This implies there needs to be yet another synchronization/ordering
mechanism only for blocking_writers. But then further down you say that
there are no read-side barriers because observing the accesses in a
particular order is not important for correctness.

Given this, I fail to see what ordering this smp_wmb affects.
> 
> - check if blocking_writers is 0 (it is, continue)
> - try read lock, read lock or write lock (succeeds after 1:)
> - check blocking_writers again (continue)
> 
> All of that assumes that the reordered writes can survive for quite some
> time (unlikely if it's in the internal store buffer).
> 
> Another point against observing that in practice is that
> blocking_writers and the lock are on the same cacheline (64B), so it's
> more likely both values are stored in order, or some sort of pending
> write flush will update blocking_writers, rwlock before the try lock
> happens. Assuming the CPUs work like that and don't hide other
> surprises.
> 
> struct extent_buffer {
> 	u64                        start;                /*     0     8 */
> 	long unsigned int          len;                  /*     8     8 */
> 	long unsigned int          bflags;               /*    16     8 */
> 	struct btrfs_fs_info *     fs_info;              /*    24     8 */
> 	spinlock_t                 refs_lock;            /*    32     4 */
> 	atomic_t                   refs;                 /*    36     4 */
> 	atomic_t                   io_pages;             /*    40     4 */
> 	int                        read_mirror;          /*    44     4 */
> 	struct callback_head callback_head __attribute__((__aligned__(8))); /*    48    16 */
> 	/* --- cacheline 1 boundary (64 bytes) --- */
> 	pid_t                      lock_owner;           /*    64     4 */
> 	int                        blocking_writers;     /*    68     4 */
> 	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> 	atomic_t                   blocking_readers;     /*    72     4 */
> 	bool                       lock_nested;          /*    76     1 */
> 
> 	/* XXX 1 byte hole, try to pack */
> 
> 	short int                  log_index;            /*    78     2 */
> 	rwlock_t                   lock;                 /*    80     8 */
>         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> 	wait_queue_head_t          write_lock_wq;        /*    88    24 */
> 	wait_queue_head_t          read_lock_wq;         /*   112    24 */
> 	/* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
> 	struct page *              pages[16];            /*   136   128 */
> 
> 	/* size: 264, cachelines: 5, members: 18 */
> 	/* sum members: 263, holes: 1, sum holes: 1 */
> 	/* forced alignments: 1 */
> 	/* last cacheline: 8 bytes */
> } __attribute__((__aligned__(8)));
> 
> Add the barriers for correctness' sake.
> 
> Signed-off-by: David Sterba <dsterba@suse.com>
> ---
>  fs/btrfs/locking.c | 16 +++++++++++++++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/locking.c b/fs/btrfs/locking.c
> index a12f3edc3505..e0e0430577aa 100644
> --- a/fs/btrfs/locking.c
> +++ b/fs/btrfs/locking.c
> @@ -110,6 +110,18 @@ void btrfs_set_lock_blocking_write(struct extent_buffer *eb)
>  		btrfs_assert_spinning_writers_put(eb);
>  		btrfs_assert_tree_locked(eb);
>  		WRITE_ONCE(eb->blocking_writers, 1);
> +		/*
> +		 * Writers must be updated before the lock is released so
> +		 * that other threads see that in order. Otherwise this could
> +		 * theoretically lead to blocking_writers still being set to 1 but
> +		 * this would be unexpected eg. for spinning locks.
> +		 *
> +		 * There are no pairing read barriers for unlocked access as we
> +		 * don't strictly need them for correctness (eg. in
> +		 * btrfs_try_tree_read_lock), and the unlock/lock is an implied
> +		 * barrier in other cases.
> +		 */
> +		smp_wmb();
>  		write_unlock(&eb->lock);

That comment sounds to me as if you only care about the readers of
blocking_writers _after_ they have acquired the eb::lock for reading. In
this case this smp_wmb is pointless because write_unlock/write_lock
imply release/acquire semantics.

Unlocked readers are not affected by this wmb.
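
IOW, for the locked readers (sketch):

    WRITE_ONCE(eb->blocking_writers, 1);
    write_unlock(&eb->lock);   /* RELEASE: paired with the ACQUIRE of the next
                                  read_lock/write_lock of eb->lock, so locked
                                  readers see the store above without an extra
                                  smp_wmb */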

>  	}
>  }
> @@ -316,7 +328,9 @@ void btrfs_tree_unlock(struct extent_buffer *eb)
>  		/*
>  		 * We need to order modifying blocking_writers above with
>  		 * actually waking up the sleepers to ensure they see the
> -		 * updated value of blocking_writers
> +		 * updated value of blocking_writers.
> +		 *
> +		 * cond_wake_up calls smp_mb
>  		 */
>  		cond_wake_up(&eb->write_lock_wq);
>  	} else {
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 5/5] btrfs: document extent buffer locking
  2019-10-17 19:39 ` [PATCH 5/5] btrfs: document extent buffer locking David Sterba
  2019-10-18  0:17   ` Qu Wenruo
  2019-10-22  9:53   ` Nikolay Borisov
@ 2019-10-23 11:16   ` Nikolay Borisov
  2019-10-29 17:33     ` David Sterba
  2 siblings, 1 reply; 20+ messages in thread
From: Nikolay Borisov @ 2019-10-23 11:16 UTC (permalink / raw)
  To: David Sterba, linux-btrfs



On 17.10.19 22:39, David Sterba wrote:
> Signed-off-by: David Sterba <dsterba@suse.com>
> ---
>  fs/btrfs/locking.c | 110 +++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 96 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/btrfs/locking.c b/fs/btrfs/locking.c
> index e0e0430577aa..2a0e828b4470 100644
> --- a/fs/btrfs/locking.c
> +++ b/fs/btrfs/locking.c
> @@ -13,6 +13,48 @@
>  #include "extent_io.h"
>  #include "locking.h"
>  
> +/*
> + * Extent buffer locking
> + * ~~~~~~~~~~~~~~~~~~~~~
> + *
> + * The locks use a custom scheme that allows doing more operations than are
> + * available from current locking primitives. The building blocks are still
> + * rwlock and wait queues.
> + *
> + * Required semantics:
> + *
> + * - reader/writer exclusion
> + * - writer/writer exclusion
> + * - reader/reader sharing
> + * - spinning lock semantics
> + * - blocking lock semantics
> + * - try-lock semantics for readers and writers
> + * - one level nesting, allowing read lock to be taken by the same thread that
> + *   already has write lock
> + *
> + * The extent buffer locks (also called tree locks) manage access to eb data.
> + * We want concurrency of many readers and safe updates. The underlying locking
> + * is done by read-write spinlock and the blocking part is implemented using
> + * counters and wait queues.
> + *
> + * spinning semantics - the low-level rwlock is held so all other threads that
> + *                      want to take it are spinning on it.
> + *
> + * blocking semantics - the low-level rwlock is not held but the counter
> + *                      denotes how many times the blocking lock was held;
> + *                      sleeping is possible
> + *
> + * Write lock always allows only one thread to access the data.
> + *
> + *
> + * Debugging
> + * ~~~~~~~~~
> + *
> + * There are additional state counters that are asserted in various contexts,
> + * removed from non-debug build to reduce extent_buffer size and for
> + * performance reasons.
> + */
> +
>  #ifdef CONFIG_BTRFS_DEBUG
>  static inline void btrfs_assert_spinning_writers_get(struct extent_buffer *eb)
>  {
> @@ -80,6 +122,15 @@ static void btrfs_assert_tree_write_locks_get(struct extent_buffer *eb) { }
>  static void btrfs_assert_tree_write_locks_put(struct extent_buffer *eb) { }
>  #endif
>  
> +/*
> + * Mark already held read lock as blocking. Can be nested in write lock by the
> + * same thread.
> + *
> + * Use when there are potentially long operations ahead so other threads waiting
> + * on the lock will not actively spin but sleep instead.
> + *
> + * The rwlock is released and blocking reader counter is increased.
> + */
>  void btrfs_set_lock_blocking_read(struct extent_buffer *eb)
>  {

I think for this and its write counterpart a
lockdep_assert_held(eb->lock) would be a better way to document the fact
that the lock needs to be held when those functions are called.
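Something like the following at the top of both helpers, just a sketch of
the idea (exact placement and wording up to you):

	/* The caller must already hold eb->lock, read or write. */
	lockdep_assert_held(&eb->lock);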

>  	trace_btrfs_set_lock_blocking_read(eb);

<snip>

>  /*
> - * drop a blocking read lock
> + * Release read lock, previously set to blocking by a pairing call to
> + * btrfs_set_lock_blocking_read(). Can be nested in write lock by the same
> + * thread.
> + *
> + * State of rwlock is unchanged, last reader wakes waiting threads.
>   */
>  void btrfs_tree_read_unlock_blocking(struct extent_buffer *eb)
>  {
> @@ -279,8 +354,10 @@ void btrfs_tree_read_unlock_blocking(struct extent_buffer *eb)
>  }
>  
>  /*
> - * take a spinning write lock.  This will wait for both
> - * blocking readers or writers
> + * Lock for write. Wait for all blocking and spinning readers and writers. This
> + * starts context where reader lock could be nested by the same thread.

Imo you shouldn't omit the explicit mention that this takes a spinning lock.

> + *
> + * The rwlock is held for write upon exit.
>   */
>  void btrfs_tree_lock(struct extent_buffer *eb)
>  {
> @@ -307,7 +384,12 @@ void btrfs_tree_lock(struct extent_buffer *eb)
>  }
>  
>  /*
> - * drop a spinning or a blocking write lock.
> + * Release the write lock, either blocking or spinning (ie. there's no need
> + * for an explicit blocking unlock, like btrfs_tree_read_unlock_blocking).
> + * This also ends the context for nesting, the read lock must have been
> + * released already.
> + *
> + * Tasks blocked and waiting are woken, rwlock is not held upon exit.
>   */
>  void btrfs_tree_unlock(struct extent_buffer *eb)
>  {
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 5/5] btrfs: document extent buffer locking
  2019-10-23 11:16   ` Nikolay Borisov
@ 2019-10-29 17:33     ` David Sterba
  0 siblings, 0 replies; 20+ messages in thread
From: David Sterba @ 2019-10-29 17:33 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: David Sterba, linux-btrfs

On Wed, Oct 23, 2019 at 02:16:26PM +0300, Nikolay Borisov wrote:
> >  void btrfs_set_lock_blocking_read(struct extent_buffer *eb)
> >  {
> 
> I think for this and its write counterpart a
> lockdep_assert_held(eb->lock) would be a better way to document the fact
> that the lock needs to be held when those functions are called.

Ok, will add it.

> >  	trace_btrfs_set_lock_blocking_read(eb);
> 
> <snip>
> 
> >  /*
> > - * drop a blocking read lock
> > + * Release read lock, previously set to blocking by a pairing call to
> > + * btrfs_set_lock_blocking_read(). Can be nested in write lock by the same
> > + * thread.
> > + *
> > + * State of rwlock is unchanged, last reader wakes waiting threads.
> >   */
> >  void btrfs_tree_read_unlock_blocking(struct extent_buffer *eb)
> >  {
> > @@ -279,8 +354,10 @@ void btrfs_tree_read_unlock_blocking(struct extent_buffer *eb)
> >  }
> >  
> >  /*
> > - * take a spinning write lock.  This will wait for both
> > - * blocking readers or writers
> > + * Lock for write. Wait for all blocking and spinning readers and writers. This
> > + * starts context where reader lock could be nested by the same thread.
> 
> Imo you shouldn't omit the explicit mention that this takes a spinning lock.

But

> > + * The rwlock is held for write upon exit.

the next line in the comment says that. There's no explicit 'spinning'
but I find it sufficient. Feel free to suggest better wording.
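For completeness, this is the calling pattern the comment describes, as a
sketch (the usual sequence when a long or sleeping operation is ahead, not
new API):

	btrfs_tree_lock(eb);			/* spinning write lock */
	btrfs_set_lock_blocking_write(eb);	/* switch to blocking before long work */
	/* ... potentially sleeping operations on the eb ... */
	btrfs_tree_unlock(eb);			/* handles both spinning and blocking state */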

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/5] btrfs: serialize blocking_writers updates
  2019-10-23  9:57   ` Nikolay Borisov
@ 2019-10-29 17:51     ` David Sterba
  2019-10-29 18:48       ` Nikolay Borisov
  0 siblings, 1 reply; 20+ messages in thread
From: David Sterba @ 2019-10-29 17:51 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: David Sterba, linux-btrfs

On Wed, Oct 23, 2019 at 12:57:00PM +0300, Nikolay Borisov wrote:
> 
> 
> On 17.10.19 22:39, David Sterba wrote:
> > There's one potential memory ordering problem regarding
> > eb::blocking_writers, although this hasn't been hit in practice. On TSO
> > (total store ordering) architectures, the writes will always be seen in
> > right order.
> > 
> > In case the calls in btrfs_set_lock_blocking_write are seen in this
> > (wrong) order on another CPU:
> > 
> > 0:  /* blocking_writers == 0 */
> > 1:  write_unlock()
> > 2:  blocking_writers = 1
> > 
> > btrfs_try_tree_read_lock, btrfs_tree_read_lock_atomic or
> > btrfs_try_tree_write_lock would have to squeeze all of this between 1:
> > and 2:
> 
> This is only a problem for unlocked (optimistic) accesses in those
> functions. Simply because from their POV the eb->lock doesn't play any
> role, i.e. they don't know about it at all.

No, the lock is important here. Let's take btrfs_try_tree_read_lock as an
example for the 2nd thread:

T1:                            T2:
write_unlock
                               blocking_writers == 0
			       try_readlock (success, locked)
			       if (blocking_writers) return -> not taken
blocking_writers = 1

Same for btrfs_tree_read_lock_atomic (with read_lock) and
btrfs_try_tree_write_lock (with write_lock)

IMO it's the other way around from what you say, the optimistic lock is
irrelevant. We need the blocking_writers update to sneak into a locked
section.
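For reference, this is roughly the control flow in question (debug counters
and tracepoints left out, so not the verbatim source):

	int btrfs_try_tree_read_lock(struct extent_buffer *eb)
	{
		if (eb->blocking_writers)	/* unlocked fast path check */
			return 0;

		if (!read_trylock(&eb->lock))
			return 0;

		if (eb->blocking_writers) {	/* re-check inside the lock */
			read_unlock(&eb->lock);
			return 0;
		}
		return 1;
	}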

> This implies there needs to be yet another synchronization/ordering
> mechanism only for blocking_writers. But then further down you say that
> there is no read side barrier because observing the accesses in a
> particular order is not important for correctness.
> 
> Given this I fail to see how this smp_wmb affects ordering.

So in the steps above:

T1:                            T2:
blocking_writers = 1
smp_mb()
write_unlock
                               blocking_writers == 1
			       	-> take the fast path and return

> > +		/*
> > +		 * Writers must be updated before the lock is released so
> > +		 * that other threads see that in order. Otherwise this could
> > +		 * theoretically lead to blocking_writers still being set to 1,
> > +		 * which would be unexpected e.g. for spinning locks.
> > +		 *
> > +		 * There are no pairing read barriers for unlocked access as we
> > +		 * don't strictly need them for correctness (eg. in
> > +		 * btrfs_try_tree_read_lock), and the unlock/lock is an implied
> > +		 * barrier in other cases.
> > +		 */
> > +		smp_wmb();
> >  		write_unlock(&eb->lock);
> 
> That comment sounds to me as if you only care about the readers of
> blocking_writers _after_ they have acquired the eb::lock for reading. In
> this case this smp_wmb is pointless because write_unlock/write_lock
> imply release/acquire semantics.

The pairing read side barrier would have to be before all reads of
blocking_writers, eg.

180 int btrfs_try_tree_read_lock(struct extent_buffer *eb)
181 {

            smp_rmb();

182         if (eb->blocking_writers)
183                 return 0;

and it won't be wrong, the value would be as up to date as it could be. And
given that this is an optimistic shortcut, lack of a synchronized read might
pessimize that. I.e. blocking_writers is already 0 but the update is not
propagated and btrfs_try_tree_read_lock returns 0.

This is in principle indistinguishable from a state where the lock is
still locked. Hence the 'not needed for correctness' part. And then the
barrier can be avoided.

The write side needs the barrier to order write and unlock. The
unlock/lock implied barrier won't help as the read-side check happens in
the middle of that.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/5] btrfs: serialize blocking_writers updates
  2019-10-29 17:51     ` David Sterba
@ 2019-10-29 18:48       ` Nikolay Borisov
  2019-10-29 21:15         ` David Sterba
  0 siblings, 1 reply; 20+ messages in thread
From: Nikolay Borisov @ 2019-10-29 18:48 UTC (permalink / raw)
  To: dsterba, David Sterba, linux-btrfs



On 29.10.19 19:51, David Sterba wrote:
> On Wed, Oct 23, 2019 at 12:57:00PM +0300, Nikolay Borisov wrote:
>>
>>
>> On 17.10.19 22:39, David Sterba wrote:
>>> There's one potential memory ordering problem regarding
>>> eb::blocking_writers, although this hasn't been hit in practice. On TSO
>>> (total store ordering) architectures, the writes will always be seen in
>>> right order.
>>>
>>> In case the calls in btrfs_set_lock_blocking_write are seen in this
>>> (wrong) order on another CPU:
>>>
>>> 0:  /* blocking_writers == 0 */
>>> 1:  write_unlock()
>>> 2:  blocking_writers = 1
>>>
>>> btrfs_try_tree_read_lock, btrfs_tree_read_lock_atomic or
>>> btrfs_try_tree_write_lock would have to squeeze all of this between 1:
>>> and 2:
>>
>> This is only a problem for unlocked (optimistic) accesses in those
> >> functions. Simply because from their POV the eb->lock doesn't play any
> >> role, i.e. they don't know about it at all.
> 
> No, the lock is important here. Let's take btrfs_try_tree_read_lock as an
> example for the 2nd thread:
> 
> T1:                            T2:
> write_unlock
>                                blocking_writers == 0
> 			       try_readlock (success, locked)
> 			       if (blocking_writers) return -> not taken
> blocking_writers = 1
> 
> Same for btrfs_tree_read_lock_atomic (with read_lock) and
> btrfs_try_tree_write_lock (with write_lock)
> 
> IMO it's the other way around from what you say, the optimistic lock is
> irrelevant. We need the blocking_writers update to sneak into a locked
> section.

But write_unlock is a release operation, as per memory-barriers.txt:

(6) RELEASE operations.



     This also acts as a one-way permeable barrier.  It guarantees that
all memory operations before the RELEASE operation will appear to
happen before the RELEASE operation with respect to the other
components of the system.  RELEASE operations include UNLOCK operations
and smp_store_release() operations.

...

The use of ACQUIRE and RELEASE operations generally precludes the need
for other sorts of memory barrier (but note the exceptions mentioned in
the subsection "MMIO write barrier").  In addition, a RELEASE+ACQUIRE
pair is -not- guaranteed to act as a full memory barrier.  However,
after an ACQUIRE on a given variable, all memory accesses preceding any
prior RELEASE on that same variable are guaranteed to be visible.

try_readlock is an ACQUIRE operation w.r.t. the lock. So if it succeeds
then all previous accesses, e.g. blocking_writers = 1, are guaranteed to be
visible because we have RELEASE (write_unlock) + ACQUIRE (try_readlock
in case of success).

So the blocking_writers check AFTER try_readlock has succeeded is
guaranteed to see the update to blocking_writers in
btrfs_set_lock_blocking_write.
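If it helps, the same pairing can be modeled in userspace, since pthread
rwlock lock/unlock operations also synchronize memory. This is only an
illustration of the argument, not kernel code:

#include <pthread.h>
#include <stdio.h>

static pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;
static int blocking_writers;	/* stands in for eb->blocking_writers */

/* models btrfs_set_lock_blocking_write() with the write lock held */
static void *writer(void *arg)
{
	pthread_rwlock_wrlock(&lock);
	blocking_writers = 1;		/* plain store before the unlock */
	pthread_rwlock_unlock(&lock);	/* RELEASE */
	return NULL;
}

/* models the locked part of btrfs_try_tree_read_lock() */
static void *reader(void *arg)
{
	if (pthread_rwlock_tryrdlock(&lock) == 0) {	/* ACQUIRE on success */
		/*
		 * A trylock that succeeds after the writer's unlock is
		 * guaranteed to see the store above.
		 */
		if (blocking_writers)
			printf("reader sees blocking_writers == 1\n");
		pthread_rwlock_unlock(&lock);
	}
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, writer, NULL);
	pthread_create(&t2, NULL, reader, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return 0;
}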

> 
>> This implies there needs to be yet another synchronization/ordering
>> mechanism only for blocking_writers. But then further down you say that
>> there is no read side barrier because observing the accesses in a
>> particular order is not important for correctness.
>>
>> Given this I fail to see how this smp_wmb affects ordering.
> 
> So in the steps above:
> 
> T1:                            T2:
> blocking_writers = 1
> smp_mb()
> write_unlock
>                                blocking_writers == 1
> 			       	-> take the fast path and return
> 
>>> +		/*
>>> +		 * Writers must be updated before the lock is released so
>>> +		 * that other threads see that in order. Otherwise this could
>>> +		 * theoretically lead to blocking_writers still being set to 1,
>>> +		 * which would be unexpected e.g. for spinning locks.
>>> +		 *
>>> +		 * There are no pairing read barriers for unlocked access as we
>>> +		 * don't strictly need them for correctness (eg. in
>>> +		 * btrfs_try_tree_read_lock), and the unlock/lock is an implied
>>> +		 * barrier in other cases.
>>> +		 */
>>> +		smp_wmb();
>>>  		write_unlock(&eb->lock);
>>
>> That comment sounds to me as if you only care about the readers of
>> blocking_writers _after_ they have acquired the eb::lock for reading. In
>> this case this smp_wmb is pointless because write_unlock/write_lock
>> imply release/acquire semantics.
> 
> The pairing read side barrier would have to be before all reads of
> blocking_writers, eg.
> 
> 180 int btrfs_try_tree_read_lock(struct extent_buffer *eb)
> 181 {
> 
>             smp_rmb();
> 
> 182         if (eb->blocking_writers)
> 183                 return 0;
> 
> and it won't be wrong, the value would be as up to date as it could be. And
> given that this is an optimistic shortcut, lack of a synchronized read might
> pessimize that. I.e. blocking_writers is already 0 but the update is not
> propagated and btrfs_try_tree_read_lock returns 0.
> 
> This is in principle indistinguishable from a state where the lock is
> still locked. Hence the 'not needed for correctness' part. And then the
> barrier can be avoided.

Nod

> 
> The write side needs the barrier to order write and unlock. The
> unlock/lock implied barrier won't help as the read-side check happens in
> the middle of that.
> 

But the unlock (as explained earlier) won't allow the write to 'leak'
past it.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/5] btrfs: serialize blocking_writers updates
  2019-10-29 18:48       ` Nikolay Borisov
@ 2019-10-29 21:15         ` David Sterba
  0 siblings, 0 replies; 20+ messages in thread
From: David Sterba @ 2019-10-29 21:15 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: dsterba, David Sterba, linux-btrfs

On Tue, Oct 29, 2019 at 08:48:15PM +0200, Nikolay Borisov wrote:
> On 29.10.19 19:51, David Sterba wrote:
> > On Wed, Oct 23, 2019 at 12:57:00PM +0300, Nikolay Borisov wrote:
> >> On 17.10.19 22:39, David Sterba wrote:
> >>> There's one potential memory ordering problem regarding
> >>> eb::blocking_writers, although this hasn't been hit in practice. On TSO
> >>> (total store ordering) architectures, the writes will always be seen in
> >>> right order.
> >>>
> >>> In case the calls in btrfs_set_lock_blocking_write are seen in this
> >>> (wrong) order on another CPU:
> >>>
> >>> 0:  /* blocking_writers == 0 */
> >>> 1:  write_unlock()
> >>> 2:  blocking_writers = 1
> >>>
> >>> btrfs_try_tree_read_lock, btrfs_tree_read_lock_atomic or
> >>> btrfs_try_tree_write_lock would have to squeeze all of this between 1:
> >>> and 2:
> >>
> >> This is only a problem for unlocked (optimistic) accesses in those
> >> functions. Simply because from their POV the eb->lock doesn't play any
> >> role, i.e. they don't know about it at all.
> > 
> > No, the lock is important here. Let's take btrfs_try_tree_read_lock as an
> > example for the 2nd thread:
> > 
> > T1:                            T2:
> > write_unlock
> >                                blocking_writers == 0
> > 			       try_readlock (success, locked)
> > 			       if (blocking_writers) return -> not taken
> > blocking_writers = 1

Ok, that follows from what you write below and I got it wrong, as the above
can't happen because of the simple unlock/lock sequence. I don't know
why I missed that, the original changelog has enough clues to see that
too.

> > Same for btrfs_tree_read_lock_atomic (with read_lock) and
> > btrfs_try_tree_write_lock (with write_lock)
> > 
> > IMO it's the other way around from what you say, the optimistic lock is
> > irrelevant. We need the blocking_writers update to sneak into a locked
> > section.
> 
> But write_unlock is a release operation, as per memory-barriers.txt:
> 
> (6) RELEASE operations.
> 
>      This also acts as a one-way permeable barrier.  It guarantees that
> all memory operations before the RELEASE operation will appear to
> happen before the RELEASE operation with respect to the other
> components of the system.  RELEASE operations include UNLOCK operations
> and smp_store_release() operations.
> 
> ...
> 
> The use of ACQUIRE and RELEASE operations generally precludes the need
> for other sorts of memory barrier (but note the exceptions mentioned in
> the subsection "MMIO write barrier").  In addition, a RELEASE+ACQUIRE
> pair is -not- guaranteed to act as a full memory barrier.  However,
> after an ACQUIRE on a given variable, all memory accesses preceding any
> prior RELEASE on that same variable are guaranteed to be visible.
> 
> try_readlock is an ACQUIRE operation w.r.t. the lock. So if it succeeds
> then all previous accesses, e.g. blocking_writers = 1, are guaranteed to be
> visible because we have RELEASE (write_unlock) + ACQUIRE (try_readlock
> in case of success).
> 
> So the blocking_writers check AFTER try_readlock has succeeded is
> guaranteed to see the update to blocking_writers in
> btrfs_set_lock_blocking_write.

Yeah. As long as the slow paths check the value inside locks, we're
safe. And this is done in all the functions AFAICS.
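E.g. btrfs_tree_read_lock_atomic does, roughly (sketch from memory, debug
counters and tracepoint omitted):

	int btrfs_tree_read_lock_atomic(struct extent_buffer *eb)
	{
		if (eb->blocking_writers)
			return 0;

		read_lock(&eb->lock);
		if (eb->blocking_writers) {	/* the check inside the lock */
			read_unlock(&eb->lock);
			return 0;
		}
		return 1;
	}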

So far we haven't found a problem with the locks on TSO and non-TSO
arches. Thanks for the barrier reasoning exercise, I'll drop the patch.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/5] btrfs: access eb::blocking_writers according to ACCESS_ONCE policies
  2019-10-23  8:42   ` Nikolay Borisov
@ 2019-10-29 21:33     ` David Sterba
  0 siblings, 0 replies; 20+ messages in thread
From: David Sterba @ 2019-10-29 21:33 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: David Sterba, linux-btrfs

On Wed, Oct 23, 2019 at 11:42:29AM +0300, Nikolay Borisov wrote:
> On 17.10.19 22:38, David Sterba wrote:
> > @@ -294,7 +299,8 @@ void btrfs_tree_lock(struct extent_buffer *eb)
> >   */
> >  void btrfs_tree_unlock(struct extent_buffer *eb)
> >  {
> > -	int blockers = eb->blocking_writers;
> > +	/* This is read both locked and unlocked */
> > +	int blockers = READ_ONCE(eb->blocking_writers);
> 
> Actually aren't we guaranteed that btrfs_tree_unlock's caller is the
> owner of blocking_writers meaning we can use plain loads as per:
> 
> "The owning CPU or thread may use plain loads..."

That's right, I'll remove READ_ONCE and leave a comment as it's not
obvious from the context.
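Something like this for those two lines (sketch only, the final comment
wording may differ):

	/*
	 * This is read both locked and unlocked but it's always changed by
	 * the thread that owns the write lock, so a plain load is enough.
	 */
	int blockers = eb->blocking_writers;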

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2019-10-29 21:32 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-17 19:38 [PATCH 0/5] Extent buffer locking and documentation David Sterba
2019-10-17 19:38 ` [PATCH 1/5] btrfs: merge blocking_writers branches in btrfs_tree_read_lock David Sterba
2019-10-17 19:38 ` [PATCH 2/5] btrfs: set blocking_writers directly, no increment or decrement David Sterba
2019-10-18 12:08   ` Nikolay Borisov
2019-10-18 17:31     ` David Sterba
2019-10-17 19:38 ` [PATCH 3/5] btrfs: access eb::blocking_writers according to ACCESS_ONCE policies David Sterba
2019-10-23  8:42   ` Nikolay Borisov
2019-10-29 21:33     ` David Sterba
2019-10-17 19:39 ` [PATCH 4/5] btrfs: serialize blocking_writers updates David Sterba
2019-10-23  9:57   ` Nikolay Borisov
2019-10-29 17:51     ` David Sterba
2019-10-29 18:48       ` Nikolay Borisov
2019-10-29 21:15         ` David Sterba
2019-10-17 19:39 ` [PATCH 5/5] btrfs: document extent buffer locking David Sterba
2019-10-18  0:17   ` Qu Wenruo
2019-10-18 11:56     ` David Sterba
2019-10-22  9:53   ` Nikolay Borisov
2019-10-22 10:41     ` David Sterba
2019-10-23 11:16   ` Nikolay Borisov
2019-10-29 17:33     ` David Sterba
