* [PATCH 0/3] dm snapshot: Improve performance using a more fine-grained locking scheme
@ 2018-12-20 18:06 Nikos Tsironis
  2018-12-20 18:06 ` [PATCH 1/3] list_bl: Add hlist_bl_add_before/behind helpers Nikos Tsironis
                   ` (3 more replies)
  0 siblings, 4 replies; 26+ messages in thread
From: Nikos Tsironis @ 2018-12-20 18:06 UTC (permalink / raw)
  To: snitzer, agk, dm-devel; +Cc: mpatocka, iliastsi

dm-snapshot uses a single mutex to serialize every access to the
snapshot state, including accesses to the exception hash tables. This
mutex is a bottleneck that prevents dm-snapshot from scaling as the number
of threads doing IO increases.

The major contention points are __origin_write()/snapshot_map() and
pending_complete(), i.e., the submission and completion of pending
exceptions.

This patchset substitutes the single mutex with:

  * A read-write semaphore, which protects the mostly read fields of the
    snapshot structure.

  * Per-bucket bit spinlocks, that protect accesses to the exception
    hash tables.

fio benchmarks using the null_blk device show significant performance
improvements as the number of worker processes increases. With 8 workers,
write latency drops by roughly 40-45% and write IOPS increase by 80-100%.

The relevant patch provides detailed benchmark results.

A summary of the patchset follows:

  1. The first patch adds two helper functions to linux/list_bl.h, which
     dm-snapshot uses to implement the per-bucket bit spinlocks.

  2. The second patch removes the need to sleep holding the snapshot
     lock in pending_complete(), thus allowing us to replace the mutex
     with the per-bucket bit spinlocks.

  3. The third patch changes the locking scheme, as described above. A
     minimal sketch of the underlying list_bl locking pattern follows.
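
For readers unfamiliar with list_bl, here is a minimal, illustrative sketch of
the hashed bit-spinlock pattern the series builds on. The names my_item,
my_table and my_lookup are hypothetical and not part of the patches; the
actual dm-snapshot exception tables appear in PATCH 3/3:

  #include <linux/list_bl.h>

  struct my_item {
          struct hlist_bl_node hash_list;
          unsigned long key;
  };

  struct my_table {
          unsigned int hash_mask;
          struct hlist_bl_head *buckets;  /* bit 0 of ->first is the lock */
  };

  static struct my_item *my_lookup(struct my_table *t, unsigned long key)
  {
          struct hlist_bl_head *slot = &t->buckets[key & t->hash_mask];
          struct hlist_bl_node *pos;
          struct my_item *it, *found = NULL;

          hlist_bl_lock(slot);            /* serializes only this bucket */
          hlist_bl_for_each_entry(it, pos, slot, hash_list) {
                  if (it->key == key) {
                          found = it;
                          break;
                  }
          }
          hlist_bl_unlock(slot);

          return found;
  }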

Nikos Tsironis (3):
  list_bl: Add hlist_bl_add_before/behind helpers
  dm snapshot: Don't sleep holding the snapshot lock
  dm snapshot: Use fine-grained locking scheme

 drivers/md/dm-exception-store.h |   3 +-
 drivers/md/dm-snap.c            | 359 +++++++++++++++++++++++++++-------------
 include/linux/list_bl.h         |  27 +++
 3 files changed, 269 insertions(+), 120 deletions(-)

-- 
2.11.0


* [PATCH 1/3] list_bl: Add hlist_bl_add_before/behind helpers
  2018-12-20 18:06 [PATCH 0/3] dm snapshot: Improve performance using a more fine-grained locking scheme Nikos Tsironis
@ 2018-12-20 18:06 ` Nikos Tsironis
  2019-02-28 21:32   ` Mike Snitzer
  2018-12-20 18:06 ` [PATCH 2/3] dm snapshot: Don't sleep holding the snapshot lock Nikos Tsironis
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 26+ messages in thread
From: Nikos Tsironis @ 2018-12-20 18:06 UTC (permalink / raw)
  To: snitzer, agk, dm-devel; +Cc: mpatocka, iliastsi

Add hlist_bl_add_before/behind helpers to add an element before/after an
existing element in a bl_list.

Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
---
 include/linux/list_bl.h | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/include/linux/list_bl.h b/include/linux/list_bl.h
index 3fc2cc57ba1b..2fd918e5fd48 100644
--- a/include/linux/list_bl.h
+++ b/include/linux/list_bl.h
@@ -86,6 +86,33 @@ static inline void hlist_bl_add_head(struct hlist_bl_node *n,
 	hlist_bl_set_first(h, n);
 }
 
+static inline void hlist_bl_add_before(struct hlist_bl_node *n,
+				       struct hlist_bl_node *next)
+{
+	struct hlist_bl_node **pprev = next->pprev;
+
+	n->pprev = pprev;
+	n->next = next;
+	next->pprev = &n->next;
+
+	/* pprev may be `first`, so be careful not to lose the lock bit */
+	WRITE_ONCE(*pprev,
+		   (struct hlist_bl_node *)
+			((unsigned long)n |
+			 ((unsigned long)*pprev & LIST_BL_LOCKMASK)));
+}
+
+static inline void hlist_bl_add_behind(struct hlist_bl_node *n,
+				       struct hlist_bl_node *prev)
+{
+	n->next = prev->next;
+	n->pprev = &prev->next;
+	WRITE_ONCE(prev->next, n);
+
+	if (n->next)
+		n->next->pprev = &n->next;
+}
+
 static inline void __hlist_bl_del(struct hlist_bl_node *n)
 {
 	struct hlist_bl_node *next = n->next;
-- 
2.11.0
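
As a usage illustration (not part of the patch), the new helpers allow ordered
insertion into a bucket while its bit spinlock is held, which is how
dm_insert_exception() uses them in PATCH 3/3. The item layout below is
hypothetical; only the hlist_bl_* calls are real list_bl API:

  struct my_item {
          struct hlist_bl_node hash_list;
          unsigned long key;
  };

  /* Keep a bucket sorted by ->key; the caller's hashing is omitted. */
  static void my_insert_sorted(struct hlist_bl_head *slot,
                               struct my_item *new_item)
  {
          struct hlist_bl_node *pos;
          struct my_item *it, *last = NULL;

          hlist_bl_lock(slot);

          hlist_bl_for_each_entry(it, pos, slot, hash_list) {
                  if (new_item->key < it->key) {
                          /* Insert in front of the first larger element */
                          hlist_bl_add_before(&new_item->hash_list,
                                              &it->hash_list);
                          goto out_unlock;
                  }
                  last = it;
          }

          if (last)
                  /* All elements are smaller: append after the last one */
                  hlist_bl_add_behind(&new_item->hash_list, &last->hash_list);
          else
                  /* Empty bucket */
                  hlist_bl_add_head(&new_item->hash_list, slot);

  out_unlock:
          hlist_bl_unlock(slot);
  }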


* [PATCH 2/3] dm snapshot: Don't sleep holding the snapshot lock
  2018-12-20 18:06 [PATCH 0/3] dm snapshot: Improve performance using a more fine-grained locking scheme Nikos Tsironis
  2018-12-20 18:06 ` [PATCH 1/3] list_bl: Add hlist_bl_add_before/behind helpers Nikos Tsironis
@ 2018-12-20 18:06 ` Nikos Tsironis
  2018-12-20 18:06 ` [PATCH 3/3] dm snapshot: Use fine-grained locking scheme Nikos Tsironis
  2019-02-18 14:22 ` [PATCH 0/3] dm snapshot: Improve performance using a more " Nikos Tsironis
  3 siblings, 0 replies; 26+ messages in thread
From: Nikos Tsironis @ 2018-12-20 18:06 UTC (permalink / raw)
  To: snitzer, agk, dm-devel; +Cc: mpatocka, iliastsi

When completing a pending exception, pending_complete() waits for all
conflicting reads to drain before inserting the final, completed
exception. Conflicting reads are snapshot reads that have been redirected
to the origin, because the relevant chunk was not remapped to the COW
device at the time the read was received.

The completed exception must be inserted into the exception table after
all conflicting reads drain to ensure snapshot reads don't return
corrupted data. This is required because inserting the completed
exception into the exception table signals that the relevant chunk is
remapped and both origin writes and snapshot merging will now overwrite
the chunk in origin.

This wait is done holding the snapshot lock to ensure that
pending_complete() doesn't starve if new snapshot reads keep coming for
this chunk.

In preparation for the next commit, where we use a spinlock instead of a
mutex to protect the exception tables, we remove the need for holding
the lock while waiting for conflicting reads to drain.

We achieve this in two steps:

1. pending_complete() inserts the completed exception before waiting for
   conflicting reads to drain and removes the pending exception after
   all conflicting reads drain.

   This ensures that new snapshot reads will be redirected to the COW
   device, instead of the origin, and thus pending_complete() will not
   starve. Moreover, we use the existence of both a completed and
   a pending exception to signify that the COW is done but there are
   conflicting reads in flight.

2. In __origin_write() we first check whether there is a pending exception
   and then whether there is a completed exception. If there is a pending
   exception, any submitted BIO is delayed on the pe->origin_bios list and
   DM_MAPIO_SUBMITTED is returned. This ensures that neither writes to the
   origin nor snapshot merging can overwrite the origin chunk, until all
   conflicting reads drain, and thus snapshot reads will not return
   corrupted data.

Summarizing, we now have the following possible combinations of pending
and completed exceptions for a chunk, along with their meaning:

A. No exceptions exist: The chunk has not been remapped yet.
B. Only a pending exception exists: The chunk is currently being copied
   to the COW device.
C. Both a pending and a completed exception exist: COW for this chunk
   has completed but there are snapshot reads in flight which had been
   redirected to the origin before the chunk was remapped.
D. Only the completed exception exists: COW has been completed and there
   are no conflicting reads in flight.
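
To make the resulting check order concrete, the following is a condensed,
illustrative sketch of what __origin_write() does per chunk after this change
(locking, allocation races and error handling omitted; this is not a literal
excerpt of the code):

  pe = __lookup_pending_exception(snap, chunk);
  if (pe) {
          /*
           * Case B or C: COW is in progress, or COW is done but conflicting
           * reads are still draining. Delay the bio; it is resubmitted from
           * pending_complete() once the pending exception is removed.
           */
          bio_list_add(&pe->origin_bios, bio);
          r = DM_MAPIO_SUBMITTED;
  } else if (dm_lookup_exception(&snap->complete, chunk)) {
          /* Case D: chunk already remapped, no reads in flight - proceed. */
  } else {
          /* Case A: allocate and insert a new pending exception to start
           * the COW; the bio is delayed on it as well. */
  }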

Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
---
 drivers/md/dm-snap.c | 102 ++++++++++++++++++++++++++++++++-------------------
 1 file changed, 65 insertions(+), 37 deletions(-)

diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
index 36805b12661e..4b34bfa0900a 100644
--- a/drivers/md/dm-snap.c
+++ b/drivers/md/dm-snap.c
@@ -1501,16 +1501,24 @@ static void pending_complete(void *context, int success)
 		goto out;
 	}
 
-	/* Check for conflicting reads */
-	__check_for_conflicting_io(s, pe->e.old_chunk);
-
 	/*
-	 * Add a proper exception, and remove the
-	 * in-flight exception from the list.
+	 * Add a proper exception. After inserting the completed exception all
+	 * subsequent snapshot reads to this chunk will be redirected to the
+	 * COW device.  This ensures that we do not starve. Moreover, as long
+	 * as the pending exception exists, neither origin writes nor snapshot
+	 * merging can overwrite the chunk in origin.
 	 */
 	dm_insert_exception(&s->complete, e);
 
+	/* Wait for conflicting reads to drain */
+	if (__chunk_is_tracked(s, pe->e.old_chunk)) {
+		mutex_unlock(&s->lock);
+		__check_for_conflicting_io(s, pe->e.old_chunk);
+		mutex_lock(&s->lock);
+	}
+
 out:
+	/* Remove the in-flight exception from the list */
 	dm_remove_exception(&pe->e);
 	snapshot_bios = bio_list_get(&pe->snapshot_bios);
 	origin_bios = bio_list_get(&pe->origin_bios);
@@ -1660,25 +1668,15 @@ __lookup_pending_exception(struct dm_snapshot *s, chunk_t chunk)
 }
 
 /*
- * Looks to see if this snapshot already has a pending exception
- * for this chunk, otherwise it allocates a new one and inserts
- * it into the pending table.
+ * Inserts a pending exception into the pending table.
  *
  * NOTE: a write lock must be held on snap->lock before calling
  * this.
  */
 static struct dm_snap_pending_exception *
-__find_pending_exception(struct dm_snapshot *s,
-			 struct dm_snap_pending_exception *pe, chunk_t chunk)
+__insert_pending_exception(struct dm_snapshot *s,
+			   struct dm_snap_pending_exception *pe, chunk_t chunk)
 {
-	struct dm_snap_pending_exception *pe2;
-
-	pe2 = __lookup_pending_exception(s, chunk);
-	if (pe2) {
-		free_pending_exception(pe);
-		return pe2;
-	}
-
 	pe->e.old_chunk = chunk;
 	bio_list_init(&pe->origin_bios);
 	bio_list_init(&pe->snapshot_bios);
@@ -1697,6 +1695,29 @@ __find_pending_exception(struct dm_snapshot *s,
 	return pe;
 }
 
+/*
+ * Looks to see if this snapshot already has a pending exception
+ * for this chunk, otherwise it allocates a new one and inserts
+ * it into the pending table.
+ *
+ * NOTE: a write lock must be held on snap->lock before calling
+ * this.
+ */
+static struct dm_snap_pending_exception *
+__find_pending_exception(struct dm_snapshot *s,
+			 struct dm_snap_pending_exception *pe, chunk_t chunk)
+{
+	struct dm_snap_pending_exception *pe2;
+
+	pe2 = __lookup_pending_exception(s, chunk);
+	if (pe2) {
+		free_pending_exception(pe);
+		return pe2;
+	}
+
+	return __insert_pending_exception(s, pe, chunk);
+}
+
 static void remap_exception(struct dm_snapshot *s, struct dm_exception *e,
 			    struct bio *bio, chunk_t chunk)
 {
@@ -2107,7 +2128,7 @@ static int __origin_write(struct list_head *snapshots, sector_t sector,
 	int r = DM_MAPIO_REMAPPED;
 	struct dm_snapshot *snap;
 	struct dm_exception *e;
-	struct dm_snap_pending_exception *pe;
+	struct dm_snap_pending_exception *pe, *pe2;
 	struct dm_snap_pending_exception *pe_to_start_now = NULL;
 	struct dm_snap_pending_exception *pe_to_start_last = NULL;
 	chunk_t chunk;
@@ -2137,17 +2158,17 @@ static int __origin_write(struct list_head *snapshots, sector_t sector,
 		 */
 		chunk = sector_to_chunk(snap->store, sector);
 
-		/*
-		 * Check exception table to see if block
-		 * is already remapped in this snapshot
-		 * and trigger an exception if not.
-		 */
-		e = dm_lookup_exception(&snap->complete, chunk);
-		if (e)
-			goto next_snapshot;
-
 		pe = __lookup_pending_exception(snap, chunk);
 		if (!pe) {
+			/*
+			 * Check exception table to see if block is already
+			 * remapped in this snapshot and trigger an exception
+			 * if not.
+			 */
+			e = dm_lookup_exception(&snap->complete, chunk);
+			if (e)
+				goto next_snapshot;
+
 			mutex_unlock(&snap->lock);
 			pe = alloc_pending_exception(snap);
 			mutex_lock(&snap->lock);
@@ -2157,16 +2178,23 @@ static int __origin_write(struct list_head *snapshots, sector_t sector,
 				goto next_snapshot;
 			}
 
-			e = dm_lookup_exception(&snap->complete, chunk);
-			if (e) {
+			pe2 = __lookup_pending_exception(snap, chunk);
+
+			if (!pe2) {
+				e = dm_lookup_exception(&snap->complete, chunk);
+				if (e) {
+					free_pending_exception(pe);
+					goto next_snapshot;
+				}
+
+				pe = __insert_pending_exception(snap, pe, chunk);
+				if (!pe) {
+					__invalidate_snapshot(snap, -ENOMEM);
+					goto next_snapshot;
+				}
+			} else {
 				free_pending_exception(pe);
-				goto next_snapshot;
-			}
-
-			pe = __find_pending_exception(snap, pe, chunk);
-			if (!pe) {
-				__invalidate_snapshot(snap, -ENOMEM);
-				goto next_snapshot;
+				pe = pe2;
 			}
 		}
 
-- 
2.11.0


* [PATCH 3/3] dm snapshot: Use fine-grained locking scheme
  2018-12-20 18:06 [PATCH 0/3] dm snapshot: Improve performance using a more fine-grained locking scheme Nikos Tsironis
  2018-12-20 18:06 ` [PATCH 1/3] list_bl: Add hlist_bl_add_before/behind helpers Nikos Tsironis
  2018-12-20 18:06 ` [PATCH 2/3] dm snapshot: Don't sleep holding the snapshot lock Nikos Tsironis
@ 2018-12-20 18:06 ` Nikos Tsironis
  2019-02-28 21:37   ` Mikulas Patocka
  2019-02-18 14:22 ` [PATCH 0/3] dm snapshot: Improve performance using a more " Nikos Tsironis
  3 siblings, 1 reply; 26+ messages in thread
From: Nikos Tsironis @ 2018-12-20 18:06 UTC (permalink / raw)
  To: snitzer, agk, dm-devel; +Cc: mpatocka, iliastsi

dm-snapshot uses a single mutex to serialize every access to the
snapshot state. This includes all accesses to the complete and pending
exception tables, which occur at every origin write, every snapshot
read/write and every exception completion.

The lock statistics indicate that this mutex is a bottleneck (average
wait time ~480 usecs for 8 processes doing random 4K writes to the
origin device), preventing dm-snapshot from scaling as the number of
threads doing IO increases.

The major contention points are __origin_write()/snapshot_map() and
pending_complete(), i.e., the submission and completion of pending
exceptions.

Substitute the single mutex with a fine-grained locking scheme. Use a
read-write semaphore to protect the mostly read fields of the snapshot
structure, e.g., valid, active, etc., and per-bucket bit spinlocks to
protect accesses to the complete and pending exception tables.

This scheme allows dm-snapshot to scale better, resulting in increased
IOPS and reduced latency.
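
The resulting lock ordering on the I/O path looks roughly as follows
(condensed from the patch; not a literal excerpt):

  chunk = sector_to_chunk(s->store, bio->bi_iter.bi_sector);
  dm_exception_table_lock_init(s, chunk, &lock);

  down_read(&s->lock);            /* snapshot-wide state: valid, active, ... */
  dm_exception_table_lock(&lock); /* bit spinlocks of this chunk's buckets */

  /* ... look up / insert pending and completed exceptions for 'chunk' ... */

  dm_exception_table_unlock(&lock);
  up_read(&s->lock);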

Following are some benchmark results using the null_blk device:

  modprobe null_blk gb=1024 bs=512 submit_queues=8 hw_queue_depth=4096 \
   queue_mode=2 irqmode=1 completion_nsec=1 nr_devices=1
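
For reference, a standalone fio invocation roughly equivalent to the
throughput benchmark would look like the following. This is only an
approximation of what the test suite runs; the device path, queue depth and
runtime are example values:

  fio --name=origin_randwrite --filename=/dev/mapper/origin-with-snapshot \
      --rw=randwrite --bs=4k --direct=1 --ioengine=libaio --iodepth=32 \
      --numjobs=8 --time_based --runtime=60 --group_reporting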

* Benchmark fio_origin_randwrite_throughput_N, from the device mapper
  test suite [1] (direct IO, random 4K writes to origin device, IO
  engine libaio):

  +--------------+-------------+------------+
  | # of workers | IOPS Before | IOPS After |
  +--------------+-------------+------------+
  |      1       |    57708    |   66421    |
  |      2       |    63415    |   77589    |
  |      4       |    67276    |   98839    |
  |      8       |    60564    |   109258   |
  +--------------+-------------+------------+

* Benchmark fio_origin_randwrite_latency_N, from the device mapper test
  suite [1] (direct IO, random 4K writes to origin device, IO engine
  psync):

  +--------------+-----------------------+----------------------+
  | # of workers | Latency (usec) Before | Latency (usec) After |
  +--------------+-----------------------+----------------------+
  |      1       |         16.25         |        13.27         |
  |      2       |         31.65         |        25.08         |
  |      4       |         55.28         |        41.08         |
  |      8       |         121.47        |        74.44         |
  +--------------+-----------------------+----------------------+

* Benchmark fio_snapshot_randwrite_throughput_N, from the device mapper
  test suite [1] (direct IO, random 4K writes to snapshot device, IO
  engine libaio):

  +--------------+-------------+------------+
  | # of workers | IOPS Before | IOPS After |
  +--------------+-------------+------------+
  |      1       |    72593    |   84938    |
  |      2       |    97379    |   134973   |
  |      4       |    90610    |   143077   |
  |      8       |    90537    |   180085   |
  +--------------+-------------+------------+

* Benchmark fio_snapshot_randwrite_latency_N, from the device mapper
  test suite [1] (direct IO, random 4K writes to snapshot device, IO
  engine psync):

  +--------------+-----------------------+----------------------+
  | # of workers | Latency (usec) Before | Latency (usec) After |
  +--------------+-----------------------+----------------------+
  |      1       |         12.53         |         10.6         |
  |      2       |         19.78         |        14.89         |
  |      4       |         40.37         |        23.47         |
  |      8       |         89.32         |        48.48         |
  +--------------+-----------------------+----------------------+

[1] https://github.com/jthornber/device-mapper-test-suite

Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
---
 drivers/md/dm-exception-store.h |   3 +-
 drivers/md/dm-snap.c            | 273 +++++++++++++++++++++++++++-------------
 2 files changed, 185 insertions(+), 91 deletions(-)

diff --git a/drivers/md/dm-exception-store.h b/drivers/md/dm-exception-store.h
index 12b5216c2cfe..5a3c696c057f 100644
--- a/drivers/md/dm-exception-store.h
+++ b/drivers/md/dm-exception-store.h
@@ -11,6 +11,7 @@
 #define _LINUX_DM_EXCEPTION_STORE
 
 #include <linux/blkdev.h>
+#include <linux/list_bl.h>
 #include <linux/device-mapper.h>
 
 /*
@@ -27,7 +28,7 @@ typedef sector_t chunk_t;
  * chunk within the device.
  */
 struct dm_exception {
-	struct list_head hash_list;
+	struct hlist_bl_node hash_list;
 
 	chunk_t old_chunk;
 	chunk_t new_chunk;
diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
index 4b34bfa0900a..4530d0620a72 100644
--- a/drivers/md/dm-snap.c
+++ b/drivers/md/dm-snap.c
@@ -13,6 +13,7 @@
 #include <linux/init.h>
 #include <linux/kdev_t.h>
 #include <linux/list.h>
+#include <linux/list_bl.h>
 #include <linux/mempool.h>
 #include <linux/module.h>
 #include <linux/slab.h>
@@ -44,11 +45,11 @@ static const char dm_snapshot_merge_target_name[] = "snapshot-merge";
 struct dm_exception_table {
 	uint32_t hash_mask;
 	unsigned hash_shift;
-	struct list_head *table;
+	struct hlist_bl_head *table;
 };
 
 struct dm_snapshot {
-	struct mutex lock;
+	struct rw_semaphore lock;
 
 	struct dm_dev *origin;
 	struct dm_dev *cow;
@@ -76,7 +77,9 @@ struct dm_snapshot {
 
 	atomic_t pending_exceptions_count;
 
-	/* Protected by "lock" */
+	spinlock_t pe_allocation_lock;
+
+	/* Protected by "pe_allocation_lock" */
 	sector_t exception_start_sequence;
 
 	/* Protected by kcopyd single-threaded callback */
@@ -457,9 +460,9 @@ static int __find_snapshots_sharing_cow(struct dm_snapshot *snap,
 		if (!bdev_equal(s->cow->bdev, snap->cow->bdev))
 			continue;
 
-		mutex_lock(&s->lock);
+		down_read(&s->lock);
 		active = s->active;
-		mutex_unlock(&s->lock);
+		up_read(&s->lock);
 
 		if (active) {
 			if (snap_src)
@@ -618,6 +621,36 @@ static void unregister_snapshot(struct dm_snapshot *s)
  * The lowest hash_shift bits of the chunk number are ignored, allowing
  * some consecutive chunks to be grouped together.
  */
+static uint32_t exception_hash(struct dm_exception_table *et, chunk_t chunk);
+
+/* Lock to protect access to the completed and pending exception hash tables. */
+struct dm_exception_table_lock {
+	struct hlist_bl_head *complete_slot;
+	struct hlist_bl_head *pending_slot;
+};
+
+static void dm_exception_table_lock_init(struct dm_snapshot *s, chunk_t chunk,
+					 struct dm_exception_table_lock *lock)
+{
+	struct dm_exception_table *complete = &s->complete;
+	struct dm_exception_table *pending = &s->pending;
+
+	lock->complete_slot = &complete->table[exception_hash(complete, chunk)];
+	lock->pending_slot = &pending->table[exception_hash(pending, chunk)];
+}
+
+static void dm_exception_table_lock(struct dm_exception_table_lock *lock)
+{
+	hlist_bl_lock(lock->complete_slot);
+	hlist_bl_lock(lock->pending_slot);
+}
+
+static void dm_exception_table_unlock(struct dm_exception_table_lock *lock)
+{
+	hlist_bl_unlock(lock->pending_slot);
+	hlist_bl_unlock(lock->complete_slot);
+}
+
 static int dm_exception_table_init(struct dm_exception_table *et,
 				   uint32_t size, unsigned hash_shift)
 {
@@ -625,12 +658,12 @@ static int dm_exception_table_init(struct dm_exception_table *et,
 
 	et->hash_shift = hash_shift;
 	et->hash_mask = size - 1;
-	et->table = dm_vcalloc(size, sizeof(struct list_head));
+	et->table = dm_vcalloc(size, sizeof(struct hlist_bl_head));
 	if (!et->table)
 		return -ENOMEM;
 
 	for (i = 0; i < size; i++)
-		INIT_LIST_HEAD(et->table + i);
+		INIT_HLIST_BL_HEAD(et->table + i);
 
 	return 0;
 }
@@ -638,15 +671,16 @@ static int dm_exception_table_init(struct dm_exception_table *et,
 static void dm_exception_table_exit(struct dm_exception_table *et,
 				    struct kmem_cache *mem)
 {
-	struct list_head *slot;
-	struct dm_exception *ex, *next;
+	struct hlist_bl_head *slot;
+	struct dm_exception *ex;
+	struct hlist_bl_node *pos, *n;
 	int i, size;
 
 	size = et->hash_mask + 1;
 	for (i = 0; i < size; i++) {
 		slot = et->table + i;
 
-		list_for_each_entry_safe (ex, next, slot, hash_list)
+		hlist_bl_for_each_entry_safe(ex, pos, n, slot, hash_list)
 			kmem_cache_free(mem, ex);
 	}
 
@@ -660,7 +694,7 @@ static uint32_t exception_hash(struct dm_exception_table *et, chunk_t chunk)
 
 static void dm_remove_exception(struct dm_exception *e)
 {
-	list_del(&e->hash_list);
+	hlist_bl_del(&e->hash_list);
 }
 
 /*
@@ -670,11 +704,12 @@ static void dm_remove_exception(struct dm_exception *e)
 static struct dm_exception *dm_lookup_exception(struct dm_exception_table *et,
 						chunk_t chunk)
 {
-	struct list_head *slot;
+	struct hlist_bl_head *slot;
+	struct hlist_bl_node *pos;
 	struct dm_exception *e;
 
 	slot = &et->table[exception_hash(et, chunk)];
-	list_for_each_entry (e, slot, hash_list)
+	hlist_bl_for_each_entry(e, pos, slot, hash_list)
 		if (chunk >= e->old_chunk &&
 		    chunk <= e->old_chunk + dm_consecutive_chunk_count(e))
 			return e;
@@ -721,7 +756,8 @@ static void free_pending_exception(struct dm_snap_pending_exception *pe)
 static void dm_insert_exception(struct dm_exception_table *eh,
 				struct dm_exception *new_e)
 {
-	struct list_head *l;
+	struct hlist_bl_head *l;
+	struct hlist_bl_node *pos;
 	struct dm_exception *e = NULL;
 
 	l = &eh->table[exception_hash(eh, new_e->old_chunk)];
@@ -731,7 +767,7 @@ static void dm_insert_exception(struct dm_exception_table *eh,
 		goto out;
 
 	/* List is ordered by old_chunk */
-	list_for_each_entry_reverse(e, l, hash_list) {
+	hlist_bl_for_each_entry(e, pos, l, hash_list) {
 		/* Insert after an existing chunk? */
 		if (new_e->old_chunk == (e->old_chunk +
 					 dm_consecutive_chunk_count(e) + 1) &&
@@ -752,12 +788,24 @@ static void dm_insert_exception(struct dm_exception_table *eh,
 			return;
 		}
 
-		if (new_e->old_chunk > e->old_chunk)
+		if (new_e->old_chunk < e->old_chunk)
 			break;
 	}
 
 out:
-	list_add(&new_e->hash_list, e ? &e->hash_list : l);
+	if (!e) {
+		/*
+		 * Either the table doesn't support consecutive chunks or slot
+		 * l is empty.
+		 */
+		hlist_bl_add_head(&new_e->hash_list, l);
+	} else if (new_e->old_chunk < e->old_chunk) {
+		/* Add before an existing exception */
+		hlist_bl_add_before(&new_e->hash_list, &e->hash_list);
+	} else {
+		/* Add to l's tail: e is the last exception in this slot */
+		hlist_bl_add_behind(&new_e->hash_list, &e->hash_list);
+	}
 }
 
 /*
@@ -766,6 +814,7 @@ static void dm_insert_exception(struct dm_exception_table *eh,
  */
 static int dm_add_exception(void *context, chunk_t old, chunk_t new)
 {
+	struct dm_exception_table_lock lock;
 	struct dm_snapshot *s = context;
 	struct dm_exception *e;
 
@@ -778,7 +827,17 @@ static int dm_add_exception(void *context, chunk_t old, chunk_t new)
 	/* Consecutive_count is implicitly initialised to zero */
 	e->new_chunk = new;
 
+	/*
+	 * Although there is no need to lock access to the exception tables
+	 * here, if we don't then hlist_bl_add_head(), called by
+	 * dm_insert_exception(), will complain about accessing the
+	 * corresponding list without locking it first.
+	 */
+	dm_exception_table_lock_init(s, old, &lock);
+
+	dm_exception_table_lock(&lock);
 	dm_insert_exception(&s->complete, e);
+	dm_exception_table_unlock(&lock);
 
 	return 0;
 }
@@ -807,7 +866,7 @@ static int calc_max_buckets(void)
 {
 	/* use a fixed size of 2MB */
 	unsigned long mem = 2 * 1024 * 1024;
-	mem /= sizeof(struct list_head);
+	mem /= sizeof(struct hlist_bl_head);
 
 	return mem;
 }
@@ -927,7 +986,7 @@ static int remove_single_exception_chunk(struct dm_snapshot *s)
 	int r;
 	chunk_t old_chunk = s->first_merging_chunk + s->num_merging_chunks - 1;
 
-	mutex_lock(&s->lock);
+	down_write(&s->lock);
 
 	/*
 	 * Process chunks (and associated exceptions) in reverse order
@@ -942,7 +1001,7 @@ static int remove_single_exception_chunk(struct dm_snapshot *s)
 	b = __release_queued_bios_after_merge(s);
 
 out:
-	mutex_unlock(&s->lock);
+	up_write(&s->lock);
 	if (b)
 		flush_bios(b);
 
@@ -1001,9 +1060,9 @@ static void snapshot_merge_next_chunks(struct dm_snapshot *s)
 		if (linear_chunks < 0) {
 			DMERR("Read error in exception store: "
 			      "shutting down merge");
-			mutex_lock(&s->lock);
+			down_write(&s->lock);
 			s->merge_failed = 1;
-			mutex_unlock(&s->lock);
+			up_write(&s->lock);
 		}
 		goto shut;
 	}
@@ -1044,10 +1103,10 @@ static void snapshot_merge_next_chunks(struct dm_snapshot *s)
 		previous_count = read_pending_exceptions_done_count();
 	}
 
-	mutex_lock(&s->lock);
+	down_write(&s->lock);
 	s->first_merging_chunk = old_chunk;
 	s->num_merging_chunks = linear_chunks;
-	mutex_unlock(&s->lock);
+	up_write(&s->lock);
 
 	/* Wait until writes to all 'linear_chunks' drain */
 	for (i = 0; i < linear_chunks; i++)
@@ -1089,10 +1148,10 @@ static void merge_callback(int read_err, unsigned long write_err, void *context)
 	return;
 
 shut:
-	mutex_lock(&s->lock);
+	down_write(&s->lock);
 	s->merge_failed = 1;
 	b = __release_queued_bios_after_merge(s);
-	mutex_unlock(&s->lock);
+	up_write(&s->lock);
 	error_bios(b);
 
 	merge_shutdown(s);
@@ -1188,10 +1247,11 @@ static int snapshot_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 	s->snapshot_overflowed = 0;
 	s->active = 0;
 	atomic_set(&s->pending_exceptions_count, 0);
+	spin_lock_init(&s->pe_allocation_lock);
 	s->exception_start_sequence = 0;
 	s->exception_complete_sequence = 0;
 	s->out_of_order_tree = RB_ROOT;
-	mutex_init(&s->lock);
+	init_rwsem(&s->lock);
 	INIT_LIST_HEAD(&s->list);
 	spin_lock_init(&s->pe_lock);
 	s->state_bits = 0;
@@ -1357,9 +1417,9 @@ static void snapshot_dtr(struct dm_target *ti)
 	/* Check whether exception handover must be cancelled */
 	(void) __find_snapshots_sharing_cow(s, &snap_src, &snap_dest, NULL);
 	if (snap_src && snap_dest && (s == snap_src)) {
-		mutex_lock(&snap_dest->lock);
+		down_write(&snap_dest->lock);
 		snap_dest->valid = 0;
-		mutex_unlock(&snap_dest->lock);
+		up_write(&snap_dest->lock);
 		DMERR("Cancelling snapshot handover.");
 	}
 	up_read(&_origins_lock);
@@ -1390,8 +1450,6 @@ static void snapshot_dtr(struct dm_target *ti)
 
 	dm_exception_store_destroy(s->store);
 
-	mutex_destroy(&s->lock);
-
 	dm_put_device(ti, s->cow);
 
 	dm_put_device(ti, s->origin);
@@ -1467,6 +1525,13 @@ static void __invalidate_snapshot(struct dm_snapshot *s, int err)
 	dm_table_event(s->ti->table);
 }
 
+static void invalidate_snapshot(struct dm_snapshot *s, int err)
+{
+	down_write(&s->lock);
+	__invalidate_snapshot(s, err);
+	up_write(&s->lock);
+}
+
 static void pending_complete(void *context, int success)
 {
 	struct dm_snap_pending_exception *pe = context;
@@ -1475,29 +1540,37 @@ static void pending_complete(void *context, int success)
 	struct bio *origin_bios = NULL;
 	struct bio *snapshot_bios = NULL;
 	struct bio *full_bio = NULL;
+	struct dm_exception_table_lock lock;
 	int error = 0;
 
+	dm_exception_table_lock_init(s, pe->e.old_chunk, &lock);
+
 	if (!success) {
 		/* Read/write error - snapshot is unusable */
-		mutex_lock(&s->lock);
-		__invalidate_snapshot(s, -EIO);
+		invalidate_snapshot(s, -EIO);
 		error = 1;
+
+		dm_exception_table_lock(&lock);
 		goto out;
 	}
 
 	e = alloc_completed_exception(GFP_NOIO);
 	if (!e) {
-		mutex_lock(&s->lock);
-		__invalidate_snapshot(s, -ENOMEM);
+		invalidate_snapshot(s, -ENOMEM);
 		error = 1;
+
+		dm_exception_table_lock(&lock);
 		goto out;
 	}
 	*e = pe->e;
 
-	mutex_lock(&s->lock);
+	down_read(&s->lock);
+	dm_exception_table_lock(&lock);
 	if (!s->valid) {
+		up_read(&s->lock);
 		free_completed_exception(e);
 		error = 1;
+
 		goto out;
 	}
 
@@ -1509,17 +1582,21 @@ static void pending_complete(void *context, int success)
 	 * merging can overwrite the chunk in origin.
 	 */
 	dm_insert_exception(&s->complete, e);
+	up_read(&s->lock);
 
 	/* Wait for conflicting reads to drain */
 	if (__chunk_is_tracked(s, pe->e.old_chunk)) {
-		mutex_unlock(&s->lock);
+		dm_exception_table_unlock(&lock);
 		__check_for_conflicting_io(s, pe->e.old_chunk);
-		mutex_lock(&s->lock);
+		dm_exception_table_lock(&lock);
 	}
 
 out:
 	/* Remove the in-flight exception from the list */
 	dm_remove_exception(&pe->e);
+
+	dm_exception_table_unlock(&lock);
+
 	snapshot_bios = bio_list_get(&pe->snapshot_bios);
 	origin_bios = bio_list_get(&pe->origin_bios);
 	full_bio = pe->full_bio;
@@ -1527,8 +1604,6 @@ static void pending_complete(void *context, int success)
 		full_bio->bi_end_io = pe->full_bio_end_io;
 	increment_pending_exceptions_done_count();
 
-	mutex_unlock(&s->lock);
-
 	/* Submit any pending write bios */
 	if (error) {
 		if (full_bio)
@@ -1670,8 +1745,8 @@ __lookup_pending_exception(struct dm_snapshot *s, chunk_t chunk)
 /*
  * Inserts a pending exception into the pending table.
  *
- * NOTE: a write lock must be held on snap->lock before calling
- * this.
+ * NOTE: a write lock must be held on the chunk's pending exception table slot
+ * before calling this.
  */
 static struct dm_snap_pending_exception *
 __insert_pending_exception(struct dm_snapshot *s,
@@ -1683,12 +1758,15 @@ __insert_pending_exception(struct dm_snapshot *s,
 	pe->started = 0;
 	pe->full_bio = NULL;
 
+	spin_lock(&s->pe_allocation_lock);
 	if (s->store->type->prepare_exception(s->store, &pe->e)) {
+		spin_unlock(&s->pe_allocation_lock);
 		free_pending_exception(pe);
 		return NULL;
 	}
 
 	pe->exception_sequence = s->exception_start_sequence++;
+	spin_unlock(&s->pe_allocation_lock);
 
 	dm_insert_exception(&s->pending, &pe->e);
 
@@ -1700,8 +1778,8 @@ __insert_pending_exception(struct dm_snapshot *s,
  * for this chunk, otherwise it allocates a new one and inserts
  * it into the pending table.
  *
- * NOTE: a write lock must be held on snap->lock before calling
- * this.
+ * NOTE: a write lock must be held on the chunk's pending exception table slot
+ * before calling this.
  */
 static struct dm_snap_pending_exception *
 __find_pending_exception(struct dm_snapshot *s,
@@ -1735,6 +1813,7 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio)
 	int r = DM_MAPIO_REMAPPED;
 	chunk_t chunk;
 	struct dm_snap_pending_exception *pe = NULL;
+	struct dm_exception_table_lock lock;
 
 	init_tracked_chunk(bio);
 
@@ -1744,13 +1823,15 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio)
 	}
 
 	chunk = sector_to_chunk(s->store, bio->bi_iter.bi_sector);
+	dm_exception_table_lock_init(s, chunk, &lock);
 
 	/* Full snapshots are not usable */
 	/* To get here the table must be live so s->active is always set. */
 	if (!s->valid)
 		return DM_MAPIO_KILL;
 
-	mutex_lock(&s->lock);
+	down_read(&s->lock);
+	dm_exception_table_lock(&lock);
 
 	if (!s->valid || (unlikely(s->snapshot_overflowed) &&
 	    bio_data_dir(bio) == WRITE)) {
@@ -1773,15 +1854,9 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio)
 	if (bio_data_dir(bio) == WRITE) {
 		pe = __lookup_pending_exception(s, chunk);
 		if (!pe) {
-			mutex_unlock(&s->lock);
+			dm_exception_table_unlock(&lock);
 			pe = alloc_pending_exception(s);
-			mutex_lock(&s->lock);
-
-			if (!s->valid || s->snapshot_overflowed) {
-				free_pending_exception(pe);
-				r = DM_MAPIO_KILL;
-				goto out_unlock;
-			}
+			dm_exception_table_lock(&lock);
 
 			e = dm_lookup_exception(&s->complete, chunk);
 			if (e) {
@@ -1792,13 +1867,22 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio)
 
 			pe = __find_pending_exception(s, pe, chunk);
 			if (!pe) {
+				dm_exception_table_unlock(&lock);
+				up_read(&s->lock);
+
+				down_write(&s->lock);
+
 				if (s->store->userspace_supports_overflow) {
-					s->snapshot_overflowed = 1;
-					DMERR("Snapshot overflowed: Unable to allocate exception.");
+					if (s->valid && !s->snapshot_overflowed) {
+						s->snapshot_overflowed = 1;
+						DMERR("Snapshot overflowed: Unable to allocate exception.");
+					}
 				} else
 					__invalidate_snapshot(s, -ENOMEM);
+				up_write(&s->lock);
+
 				r = DM_MAPIO_KILL;
-				goto out_unlock;
+				goto out;
 			}
 		}
 
@@ -1810,7 +1894,10 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio)
 		    bio->bi_iter.bi_size ==
 		    (s->store->chunk_size << SECTOR_SHIFT)) {
 			pe->started = 1;
-			mutex_unlock(&s->lock);
+
+			dm_exception_table_unlock(&lock);
+			up_read(&s->lock);
+
 			start_full_bio(pe, bio);
 			goto out;
 		}
@@ -1818,9 +1905,12 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio)
 		bio_list_add(&pe->snapshot_bios, bio);
 
 		if (!pe->started) {
-			/* this is protected by snap->lock */
+			/* this is protected by the exception table lock */
 			pe->started = 1;
-			mutex_unlock(&s->lock);
+
+			dm_exception_table_unlock(&lock);
+			up_read(&s->lock);
+
 			start_copy(pe);
 			goto out;
 		}
@@ -1830,7 +1920,8 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio)
 	}
 
 out_unlock:
-	mutex_unlock(&s->lock);
+	dm_exception_table_unlock(&lock);
+	up_read(&s->lock);
 out:
 	return r;
 }
@@ -1866,7 +1957,7 @@ static int snapshot_merge_map(struct dm_target *ti, struct bio *bio)
 
 	chunk = sector_to_chunk(s->store, bio->bi_iter.bi_sector);
 
-	mutex_lock(&s->lock);
+	down_write(&s->lock);
 
 	/* Full merging snapshots are redirected to the origin */
 	if (!s->valid)
@@ -1897,12 +1988,12 @@ static int snapshot_merge_map(struct dm_target *ti, struct bio *bio)
 	bio_set_dev(bio, s->origin->bdev);
 
 	if (bio_data_dir(bio) == WRITE) {
-		mutex_unlock(&s->lock);
+		up_write(&s->lock);
 		return do_origin(s->origin, bio);
 	}
 
 out_unlock:
-	mutex_unlock(&s->lock);
+	up_write(&s->lock);
 
 	return r;
 }
@@ -1934,7 +2025,7 @@ static int snapshot_preresume(struct dm_target *ti)
 	down_read(&_origins_lock);
 	(void) __find_snapshots_sharing_cow(s, &snap_src, &snap_dest, NULL);
 	if (snap_src && snap_dest) {
-		mutex_lock(&snap_src->lock);
+		down_read(&snap_src->lock);
 		if (s == snap_src) {
 			DMERR("Unable to resume snapshot source until "
 			      "handover completes.");
@@ -1944,7 +2035,7 @@ static int snapshot_preresume(struct dm_target *ti)
 			      "source is suspended.");
 			r = -EINVAL;
 		}
-		mutex_unlock(&snap_src->lock);
+		up_read(&snap_src->lock);
 	}
 	up_read(&_origins_lock);
 
@@ -1990,11 +2081,11 @@ static void snapshot_resume(struct dm_target *ti)
 
 	(void) __find_snapshots_sharing_cow(s, &snap_src, &snap_dest, NULL);
 	if (snap_src && snap_dest) {
-		mutex_lock(&snap_src->lock);
-		mutex_lock_nested(&snap_dest->lock, SINGLE_DEPTH_NESTING);
+		down_write(&snap_src->lock);
+		down_write_nested(&snap_dest->lock, SINGLE_DEPTH_NESTING);
 		__handover_exceptions(snap_src, snap_dest);
-		mutex_unlock(&snap_dest->lock);
-		mutex_unlock(&snap_src->lock);
+		up_write(&snap_dest->lock);
+		up_write(&snap_src->lock);
 	}
 
 	up_read(&_origins_lock);
@@ -2009,9 +2100,9 @@ static void snapshot_resume(struct dm_target *ti)
 	/* Now we have correct chunk size, reregister */
 	reregister_snapshot(s);
 
-	mutex_lock(&s->lock);
+	down_write(&s->lock);
 	s->active = 1;
-	mutex_unlock(&s->lock);
+	up_write(&s->lock);
 }
 
 static uint32_t get_origin_minimum_chunksize(struct block_device *bdev)
@@ -2051,7 +2142,7 @@ static void snapshot_status(struct dm_target *ti, status_type_t type,
 	switch (type) {
 	case STATUSTYPE_INFO:
 
-		mutex_lock(&snap->lock);
+		down_write(&snap->lock);
 
 		if (!snap->valid)
 			DMEMIT("Invalid");
@@ -2076,7 +2167,7 @@ static void snapshot_status(struct dm_target *ti, status_type_t type,
 				DMEMIT("Unknown");
 		}
 
-		mutex_unlock(&snap->lock);
+		up_write(&snap->lock);
 
 		break;
 
@@ -2131,6 +2222,7 @@ static int __origin_write(struct list_head *snapshots, sector_t sector,
 	struct dm_snap_pending_exception *pe, *pe2;
 	struct dm_snap_pending_exception *pe_to_start_now = NULL;
 	struct dm_snap_pending_exception *pe_to_start_last = NULL;
+	struct dm_exception_table_lock lock;
 	chunk_t chunk;
 
 	/* Do all the snapshots on this origin */
@@ -2142,21 +2234,23 @@ static int __origin_write(struct list_head *snapshots, sector_t sector,
 		if (dm_target_is_snapshot_merge(snap->ti))
 			continue;
 
-		mutex_lock(&snap->lock);
-
-		/* Only deal with valid and active snapshots */
-		if (!snap->valid || !snap->active)
-			goto next_snapshot;
-
 		/* Nothing to do if writing beyond end of snapshot */
 		if (sector >= dm_table_get_size(snap->ti->table))
-			goto next_snapshot;
+			continue;
 
 		/*
 		 * Remember, different snapshots can have
 		 * different chunk sizes.
 		 */
 		chunk = sector_to_chunk(snap->store, sector);
+		dm_exception_table_lock_init(snap, chunk, &lock);
+
+		down_read(&snap->lock);
+		dm_exception_table_lock(&lock);
+
+		/* Only deal with valid and active snapshots */
+		if (!snap->valid || !snap->active)
+			goto next_snapshot;
 
 		pe = __lookup_pending_exception(snap, chunk);
 		if (!pe) {
@@ -2169,14 +2263,9 @@ static int __origin_write(struct list_head *snapshots, sector_t sector,
 			if (e)
 				goto next_snapshot;
 
-			mutex_unlock(&snap->lock);
+			dm_exception_table_unlock(&lock);
 			pe = alloc_pending_exception(snap);
-			mutex_lock(&snap->lock);
-
-			if (!snap->valid) {
-				free_pending_exception(pe);
-				goto next_snapshot;
-			}
+			dm_exception_table_lock(&lock);
 
 			pe2 = __lookup_pending_exception(snap, chunk);
 
@@ -2189,8 +2278,11 @@ static int __origin_write(struct list_head *snapshots, sector_t sector,
 
 				pe = __insert_pending_exception(snap, pe, chunk);
 				if (!pe) {
-					__invalidate_snapshot(snap, -ENOMEM);
-					goto next_snapshot;
+					dm_exception_table_unlock(&lock);
+					up_read(&snap->lock);
+
+					invalidate_snapshot(snap, -ENOMEM);
+					continue;
 				}
 			} else {
 				free_pending_exception(pe);
@@ -2221,7 +2313,8 @@ static int __origin_write(struct list_head *snapshots, sector_t sector,
 		}
 
 next_snapshot:
-		mutex_unlock(&snap->lock);
+		dm_exception_table_unlock(&lock);
+		up_read(&snap->lock);
 
 		if (pe_to_start_now) {
 			start_copy(pe_to_start_now);
-- 
2.11.0


* Re: [PATCH 0/3] dm snapshot: Improve performance using a more fine-grained locking scheme
  2018-12-20 18:06 [PATCH 0/3] dm snapshot: Improve performance using a more fine-grained locking scheme Nikos Tsironis
                   ` (2 preceding siblings ...)
  2018-12-20 18:06 ` [PATCH 3/3] dm snapshot: Use fine-grained locking scheme Nikos Tsironis
@ 2019-02-18 14:22 ` Nikos Tsironis
  2019-02-18 14:37   ` Mikulas Patocka
  3 siblings, 1 reply; 26+ messages in thread
From: Nikos Tsironis @ 2019-02-18 14:22 UTC (permalink / raw)
  To: snitzer, agk, dm-devel; +Cc: mpatocka, iliastsi

Hello,

This is a kind reminder for this patch set. I'm bumping this thread to
solicit your feedback. Please let me know what else I may need to do for
these patches to land in the upcoming merge window for 5.1.

Looking forward to your feedback,

Nikos


* Re: [PATCH 0/3] dm snapshot: Improve performance using a more fine-grained locking scheme
  2019-02-18 14:22 ` [PATCH 0/3] dm snapshot: Improve performance using a more " Nikos Tsironis
@ 2019-02-18 14:37   ` Mikulas Patocka
  2019-02-18 15:15     ` Mike Snitzer
  0 siblings, 1 reply; 26+ messages in thread
From: Mikulas Patocka @ 2019-02-18 14:37 UTC (permalink / raw)
  To: Nikos Tsironis; +Cc: dm-devel, agk, snitzer, iliastsi



On Mon, 18 Feb 2019, Nikos Tsironis wrote:

> Hello,
> 
> This is a kind reminder for this patch set. I'm bumping this thread to
> solicit your feedback. Please let me know what else I may need to do for
> these patches to land in the upcoming merge window for 5.1.
> 
> Looking forward to your feedback,
> 
> Nikos

I don't know what to do with those patches.

The patches make it more complicated, so I can't really say if they are 
correct or not.

Mikulas


* Re: [PATCH 0/3] dm snapshot: Improve performance using a more fine-grained locking scheme
  2019-02-18 14:37   ` Mikulas Patocka
@ 2019-02-18 15:15     ` Mike Snitzer
  2019-02-19 11:04       ` Nikos Tsironis
  0 siblings, 1 reply; 26+ messages in thread
From: Mike Snitzer @ 2019-02-18 15:15 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: dm-devel, Nikos Tsironis, agk, iliastsi

On Mon, Feb 18 2019 at  9:37am -0500,
Mikulas Patocka <mpatocka@redhat.com> wrote:

> 
> 
> On Mon, 18 Feb 2019, Nikos Tsironis wrote:
> 
> > Hello,
> > 
> > This is a kind reminder for this patch set. I'm bumping this thread to
> > solicit your feedback. Please let me know what else I may need to do for
> > these patches to land in the upcoming merge window for 5.1.
> > 
> > Looking forward to your feedback,
> > 
> > Nikos
> 
> I don't know what to do with those patches.
> 
> The patches make it more complicated, so I can't really say if they are 
> correct or not.

As I mentioned to you before: please do the best you can to reason
through (and even quantify on your own) the performance reward these
patches provide.

Then we'll need to make a call on risk vs reward.

Nikos mentioned in another mail on Friday that this dm-snapshot patchset
is a prereq for more dm-snapshot changes he has.  Nikos, if you could
forecast what those additional changes to dm-snapshot are, that could
help inform the review process for this initial patchset.

Thanks,
Mike


* Re: [PATCH 0/3] dm snapshot: Improve performance using a more fine-grained locking scheme
  2019-02-18 15:15     ` Mike Snitzer
@ 2019-02-19 11:04       ` Nikos Tsironis
  0 siblings, 0 replies; 26+ messages in thread
From: Nikos Tsironis @ 2019-02-19 11:04 UTC (permalink / raw)
  To: Mike Snitzer, Mikulas Patocka; +Cc: dm-devel, agk, iliastsi

On 2/18/19 5:15 PM, Mike Snitzer wrote:
> On Mon, Feb 18 2019 at  9:37am -0500,
> Mikulas Patocka <mpatocka@redhat.com> wrote:
> 
>>
>>
>> On Mon, 18 Feb 2019, Nikos Tsironis wrote:
>>
>>> Hello,
>>>
>>> This is a kind reminder for this patch set. I'm bumping this thread to
>>> solicit your feedback. Please let me know what else I may need to do for
>>> these patches to land in the upcoming merge window for 5.1.
>>>
>>> Looking forward to your feedback,
>>>
>>> Nikos
>>
>> I don't know what to do with those patches.
>>
>> The patches make it more complicated, so I can't really say if they are 
>> correct or not.
> 
> As I mentioned to you before: please do the best you can reasoning
> through (and even quantifying on your own) the performance reward these
> patches provide.
> 
> Then we'll need to make a call on risk vs reward.
> 
> Nikos mentioned in another mail on Friday that this dm-snapshot patchset
> is a prereq for more dm-snapshot changes he has.  Nikos, if you could
> forecast what those additional changes to dm-snapshot are that could
> help inform the review process for this initial patchset.

Hello Mike, hello Mikulas,

Thank you both for your feedback.

I understand your desire to ensure the patches maintain the kernel's high
level of code quality, and I agree completely. As part of
this effort, I have been contributing to the device mapper test suite,
extending it with a test suite for dm-snapshot. Please see the relevant
Pull Requests [1], [2]. Actually, it was this process of me trying to
understand the current functioning of the code in depth and improve its
testing, which helped me identify some minor dm-snapshot bugs previously
[3], [4].

So, I definitely agree with your effort to maintain high quality for
kernel code;  extending the device mapper test suite was my very first
step in facilitating my own coding process for these patches, and I
would be happy to implement any further suggestions for testing you
may have.

The long-term goal of my current and upcoming patches is to make
dm-snapshot usable with modern, low-latency NVMe devices, make it
scalable on multi-core machines, and reduce its overhead significantly.

I understand a lot of effort has been put to make dm-thin performant,
but we are particularly interested in improving the performance of
dm-snapshot and its usability on modern hardware, because it has distinct
advantages over using dm-thin for our specific use cases:

  * Short-lived snapshots, for consistency, where I/O performance to the
    original device degrades only while the snapshot is active, and
    returns to its original level after the snapshot is deleted

  * Taking a consistent snapshot of any volume, while still being able
    to write to the origin volume, i.e., maintaining the snapshot in an
    external device, independently of the origin, without the need to
    make the original volume a dm-thin device (versus dm-thin’s external
    origin snapshots)

Actually, we are using dm-snapshot heavily right now for maintaining
ephemeral, transient snapshots for backups. This is the only way to have
close to bare metal performance, without dm-thin’s permanent degradation
(at least until the origin volume is completely overwritten).

So, to give you an overall idea of the end-to-end story and where the
current patch set fits:

My initial performance evaluation of dm-snapshot revealed a big
performance drop while a snapshot is active; a drop which is not
justified by COW alone. Using fio with blktrace I noticed that the
per-CPU I/O distribution was uneven. Although many threads were doing
I/O, only a couple of the CPUs ended up submitting I/O requests to the
underlying device.

The bottleneck here is kcopyd serializing all I/O. The recent work with
blk-mq towards increasing parallelism of I/O processing in modern
multi-cores is essentially going to waste for any users of kcopyd,
including dm-snapshot, because I/Os are issued only by a single CPU at a
time, the one on which kcopyd’s thread happens to be running.

So, the first step I took was to redesign kcopyd to prevent I/O
serialization by respecting thread locality for I/Os and their
completion. This made distribution of I/O processing uniform across
CPUs, but then dm-snapshot’s single mutex prevented it from exploiting
kcopyd’s improved scalability.

Further investigation revealed that dm-snapshot itself has significant
CPU overhead that increases I/O latency far beyond that of an NVMe
device. As an example, working with an NVMe array capable of 700K write
IOPS with an average latency of ~350us per request, I was seeing 65K
write IOPS with an average latency of ~4ms per request with dm-snapshot.
This was due to lock contention in dm-snapshot itself, even before it
hit kcopyd.

This is where the current patch set fits. Replacing the single mutex
with a fine-grained locking scheme reduces lock contention and enables
dm-snapshot to scale as the number of threads increases. For all the
details, please see the measurements I include as part of PATCH 3/3.
Also, replacing dm-snapshot's single mutex with fine-grained locking
makes it ready to exploit dm-kcopyd’s improved design, which I will
introduce in follow-up patches.

Finally, after these two sets of changes, I would like to improve performance
specifically for transient snapshots. In the case of transient snapshots,
there is no need to complete exceptions in order
(commit 230c83afdd9cd - "dm snapshot: avoid snapshot space leak on crash")
or even sequentially. By taking advantage of the new kcopyd design, we
can complete them out-of-order and in parallel. This will also increase
IOPS significantly and slash I/O latency.

So, all in all, my end goal is to bring performance of short-lived
snapshots based on dm-snapshot as close as possible to bare-metal
performance.

To summarize, I see the following progression of patches:

  * Scale dm-snapshot (current patch set)
  * Scale kcopyd
  * Further improve transient snapshots by lifting the limitation of
    in-order and sequential completion

My current measurements show that these patch series eventually lead to a
~300% increase in sustained throughput and an ~80% decrease in I/O latency
for transient snapshots over the null_blk device.

Especially for this patch set, please find detailed measurements as part
of PATCH 3/3.

Mike, it would be great, as you say, if you could run the test suite
yourselves, and reproduce the results.

This was a rather long email, but I hope it makes the end-to-end purpose
of these patches clearer, and quantifies the reward in increased
throughput, decreased latency, and improved efficiency of dm-snapshot on
modern hardware.

I would be more than happy to continue the conversation and focus on any
other questions you may have, as you continue with reviewing these
patches.

Thanks,
Nikos.

[1] https://github.com/jthornber/device-mapper-test-suite/pull/51
[2] https://github.com/jthornber/device-mapper-test-suite/pull/52
[3] https://www.redhat.com/archives/dm-devel/2018-October/msg00460.html
[4] https://www.redhat.com/archives/dm-devel/2018-October/msg00461.html


* Re: [PATCH 1/3] list_bl: Add hlist_bl_add_before/behind helpers
  2018-12-20 18:06 ` [PATCH 1/3] list_bl: Add hlist_bl_add_before/behind helpers Nikos Tsironis
@ 2019-02-28 21:32   ` Mike Snitzer
  2019-02-28 21:34     ` Mike Snitzer
                       ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Mike Snitzer @ 2019-02-28 21:32 UTC (permalink / raw)
  To: Paul E. McKenney, hch, Nikos Tsironis
  Cc: agk, dm-devel, mpatocka, iliastsi, linux-kernel

On Thu, Dec 20 2018 at  1:06pm -0500,
Nikos Tsironis <ntsironis@arrikto.com> wrote:

> Add hlist_bl_add_before/behind helpers to add an element before/after an
> existing element in a bl_list.
> 
> Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
> Signed-off-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
> ---
>  include/linux/list_bl.h | 27 +++++++++++++++++++++++++++
>  1 file changed, 27 insertions(+)
> 
> diff --git a/include/linux/list_bl.h b/include/linux/list_bl.h
> index 3fc2cc57ba1b..2fd918e5fd48 100644
> --- a/include/linux/list_bl.h
> +++ b/include/linux/list_bl.h
> @@ -86,6 +86,33 @@ static inline void hlist_bl_add_head(struct hlist_bl_node *n,
>  	hlist_bl_set_first(h, n);
>  }
>  
> +static inline void hlist_bl_add_before(struct hlist_bl_node *n,
> +				       struct hlist_bl_node *next)
> +{
> +	struct hlist_bl_node **pprev = next->pprev;
> +
> +	n->pprev = pprev;
> +	n->next = next;
> +	next->pprev = &n->next;
> +
> +	/* pprev may be `first`, so be careful not to lose the lock bit */
> +	WRITE_ONCE(*pprev,
> +		   (struct hlist_bl_node *)
> +			((unsigned long)n |
> +			 ((unsigned long)*pprev & LIST_BL_LOCKMASK)));
> +}
> +
> +static inline void hlist_bl_add_behind(struct hlist_bl_node *n,
> +				       struct hlist_bl_node *prev)
> +{
> +	n->next = prev->next;
> +	n->pprev = &prev->next;
> +	WRITE_ONCE(prev->next, n);
> +
> +	if (n->next)
> +		n->next->pprev = &n->next;
> +}
> +
>  static inline void __hlist_bl_del(struct hlist_bl_node *n)
>  {
>  	struct hlist_bl_node *next = n->next;
> -- 
> 2.11.0

Hi Paul and Christoph,

You've added your Signed-off-by to include/linux/list_bl.h commits in
the past.  I'm not sure how this proposed patch should be handled.

These new hlist_bl_add_{before,behind} changes are a prereq for
dm-snapshot changes that Nikos has proposed, please see:
https://patchwork.kernel.org/patch/10739265/

Any assistance/review you, or others on LKML, might be able to provide
would be appreciated.

Thanks,
Mike


* Re: [PATCH 1/3] list_bl: Add hlist_bl_add_before/behind helpers
  2019-02-28 21:32   ` Mike Snitzer
@ 2019-02-28 21:34     ` Mike Snitzer
  2019-03-11 18:16     ` Christoph Hellwig
  2019-03-13 23:48     ` Paul E. McKenney
  2 siblings, 0 replies; 26+ messages in thread
From: Mike Snitzer @ 2019-02-28 21:34 UTC (permalink / raw)
  To: Paul E. McKenney, hch, Nikos Tsironis
  Cc: agk, dm-devel, mpatocka, iliastsi, linux-kernel

On Thu, Feb 28 2019 at  4:32pm -0500,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Thu, Dec 20 2018 at  1:06pm -0500,
> Nikos Tsironis <ntsironis@arrikto.com> wrote:
> 
> > Add hlist_bl_add_before/behind helpers to add an element before/after an
> > existing element in a bl_list.
> > 
> > Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
> > Signed-off-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
> > ---
> >  include/linux/list_bl.h | 27 +++++++++++++++++++++++++++
> >  1 file changed, 27 insertions(+)
> > 
> > diff --git a/include/linux/list_bl.h b/include/linux/list_bl.h
> > index 3fc2cc57ba1b..2fd918e5fd48 100644
> > --- a/include/linux/list_bl.h
> > +++ b/include/linux/list_bl.h
> > @@ -86,6 +86,33 @@ static inline void hlist_bl_add_head(struct hlist_bl_node *n,
> >  	hlist_bl_set_first(h, n);
> >  }
> >  
> > +static inline void hlist_bl_add_before(struct hlist_bl_node *n,
> > +				       struct hlist_bl_node *next)
> > +{
> > +	struct hlist_bl_node **pprev = next->pprev;
> > +
> > +	n->pprev = pprev;
> > +	n->next = next;
> > +	next->pprev = &n->next;
> > +
> > +	/* pprev may be `first`, so be careful not to lose the lock bit */
> > +	WRITE_ONCE(*pprev,
> > +		   (struct hlist_bl_node *)
> > +			((unsigned long)n |
> > +			 ((unsigned long)*pprev & LIST_BL_LOCKMASK)));
> > +}
> > +
> > +static inline void hlist_bl_add_behind(struct hlist_bl_node *n,
> > +				       struct hlist_bl_node *prev)
> > +{
> > +	n->next = prev->next;
> > +	n->pprev = &prev->next;
> > +	WRITE_ONCE(prev->next, n);
> > +
> > +	if (n->next)
> > +		n->next->pprev = &n->next;
> > +}
> > +
> >  static inline void __hlist_bl_del(struct hlist_bl_node *n)
> >  {
> >  	struct hlist_bl_node *next = n->next;
> > -- 
> > 2.11.0
> 
> Hi Paul and Christoph,
> 
> You've added your Signed-off-by to include/linux/list_bl.h commits in
> the past.  I'm not sure how this proposed patch should be handled.
> 
> These new hlist_bl_add_{before,behind} changes are a prereq for
> dm-snapshot changes that Nikos has proposed, please see:
> https://patchwork.kernel.org/patch/10739265/
> 
> Any assistance/review you, or others on LKML, might be able to provide
> would be appreciated.

I should've clarified that: I'm asking for the purpose of getting these
changes staged for Linux 5.2.

Mike


* Re: [PATCH 3/3] dm snapshot: Use fine-grained locking scheme
  2018-12-20 18:06 ` [PATCH 3/3] dm snapshot: Use fine-grained locking scheme Nikos Tsironis
@ 2019-02-28 21:37   ` Mikulas Patocka
  2019-03-01  9:15     ` Nikos Tsironis
  0 siblings, 1 reply; 26+ messages in thread
From: Mikulas Patocka @ 2019-02-28 21:37 UTC (permalink / raw)
  To: Nikos Tsironis; +Cc: dm-devel, agk, snitzer, iliastsi

Hi

I looked at the patches, and they could be merged into kernel 5.2.

Please split this last patch - it is too big and too hard to understand.

For example, you can split it into three patches:
[PATCH 3/5] replace mutex with rw_semaphore (basically a revert of 
	ae1093be5a0ef997833e200a0dafb9ed0b1ff4fe)
[PATCH 4/5] replace normal lists with locked lists (but still leave the 
	global lock)
[PATCH 5/5] change down_write to down_read and the rest of the changes

... so that the resulting file is the same.

Mikulas


On Thu, 20 Dec 2018, Nikos Tsironis wrote:

> dm-snapshot uses a single mutex to serialize every access to the
> snapshot state. This includes all accesses to the complete and pending
> exception tables, which occur at every origin write, every snapshot
> read/write and every exception completion.
> 
> The lock statistics indicate that this mutex is a bottleneck (average
> wait time ~480 usecs for 8 processes doing random 4K writes to the
> origin device) preventing dm-snapshot from scaling as the number of
> threads doing IO increases.
> 
> The major contention points are __origin_write()/snapshot_map() and
> pending_complete(), i.e., the submission and completion of pending
> exceptions.
> 
> Substitute the single mutex with a fine-grained locking scheme. Use a
> read-write semaphore to protect the mostly read fields of the snapshot
> structure, e.g., valid, active, etc., and per-bucket bit spinlocks to
> protect accesses to the complete and pending exception tables.
> 
> This scheme allows dm-snapshot to scale better, resulting in increased
> IOPS and reduced latency.
> 
> Following are some benchmark results using the null_blk device:
> 
>   modprobe null_blk gb=1024 bs=512 submit_queues=8 hw_queue_depth=4096 \
>    queue_mode=2 irqmode=1 completion_nsec=1 nr_devices=1
> 
> * Benchmark fio_origin_randwrite_throughput_N, from the device mapper
>   test suite [1] (direct IO, random 4K writes to origin device, IO
>   engine libaio):
> 
>   +--------------+-------------+------------+
>   | # of workers | IOPS Before | IOPS After |
>   +--------------+-------------+------------+
>   |      1       |    57708    |   66421    |
>   |      2       |    63415    |   77589    |
>   |      4       |    67276    |   98839    |
>   |      8       |    60564    |   109258   |
>   +--------------+-------------+------------+
> 
> * Benchmark fio_origin_randwrite_latency_N, from the device mapper test
>   suite [1] (direct IO, random 4K writes to origin device, IO engine
>   psync):
> 
>   +--------------+-----------------------+----------------------+
>   | # of workers | Latency (usec) Before | Latency (usec) After |
>   +--------------+-----------------------+----------------------+
>   |      1       |         16.25         |        13.27         |
>   |      2       |         31.65         |        25.08         |
>   |      4       |         55.28         |        41.08         |
>   |      8       |         121.47        |        74.44         |
>   +--------------+-----------------------+----------------------+
> 
> * Benchmark fio_snapshot_randwrite_throughput_N, from the device mapper
>   test suite [1] (direct IO, random 4K writes to snapshot device, IO
>   engine libaio):
> 
>   +--------------+-------------+------------+
>   | # of workers | IOPS Before | IOPS After |
>   +--------------+-------------+------------+
>   |      1       |    72593    |   84938    |
>   |      2       |    97379    |   134973   |
>   |      4       |    90610    |   143077   |
>   |      8       |    90537    |   180085   |
>   +--------------+-------------+------------+
> 
> * Benchmark fio_snapshot_randwrite_latency_N, from the device mapper
>   test suite [1] (direct IO, random 4K writes to snapshot device, IO
>   engine psync):
> 
>   +--------------+-----------------------+----------------------+
>   | # of workers | Latency (usec) Before | Latency (usec) After |
>   +--------------+-----------------------+----------------------+
>   |      1       |         12.53         |         10.6         |
>   |      2       |         19.78         |        14.89         |
>   |      4       |         40.37         |        23.47         |
>   |      8       |         89.32         |        48.48         |
>   +--------------+-----------------------+----------------------+
> 
> [1] https://github.com/jthornber/device-mapper-test-suite
> 
> Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
> Signed-off-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
> ---
>  drivers/md/dm-exception-store.h |   3 +-
>  drivers/md/dm-snap.c            | 273 +++++++++++++++++++++++++++-------------
>  2 files changed, 185 insertions(+), 91 deletions(-)
> 
> diff --git a/drivers/md/dm-exception-store.h b/drivers/md/dm-exception-store.h
> index 12b5216c2cfe..5a3c696c057f 100644
> --- a/drivers/md/dm-exception-store.h
> +++ b/drivers/md/dm-exception-store.h
> @@ -11,6 +11,7 @@
>  #define _LINUX_DM_EXCEPTION_STORE
>  
>  #include <linux/blkdev.h>
> +#include <linux/list_bl.h>
>  #include <linux/device-mapper.h>
>  
>  /*
> @@ -27,7 +28,7 @@ typedef sector_t chunk_t;
>   * chunk within the device.
>   */
>  struct dm_exception {
> -	struct list_head hash_list;
> +	struct hlist_bl_node hash_list;
>  
>  	chunk_t old_chunk;
>  	chunk_t new_chunk;
> diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
> index 4b34bfa0900a..4530d0620a72 100644
> --- a/drivers/md/dm-snap.c
> +++ b/drivers/md/dm-snap.c
> @@ -13,6 +13,7 @@
>  #include <linux/init.h>
>  #include <linux/kdev_t.h>
>  #include <linux/list.h>
> +#include <linux/list_bl.h>
>  #include <linux/mempool.h>
>  #include <linux/module.h>
>  #include <linux/slab.h>
> @@ -44,11 +45,11 @@ static const char dm_snapshot_merge_target_name[] = "snapshot-merge";
>  struct dm_exception_table {
>  	uint32_t hash_mask;
>  	unsigned hash_shift;
> -	struct list_head *table;
> +	struct hlist_bl_head *table;
>  };
>  
>  struct dm_snapshot {
> -	struct mutex lock;
> +	struct rw_semaphore lock;
>  
>  	struct dm_dev *origin;
>  	struct dm_dev *cow;
> @@ -76,7 +77,9 @@ struct dm_snapshot {
>  
>  	atomic_t pending_exceptions_count;
>  
> -	/* Protected by "lock" */
> +	spinlock_t pe_allocation_lock;
> +
> +	/* Protected by "pe_allocation_lock" */
>  	sector_t exception_start_sequence;
>  
>  	/* Protected by kcopyd single-threaded callback */
> @@ -457,9 +460,9 @@ static int __find_snapshots_sharing_cow(struct dm_snapshot *snap,
>  		if (!bdev_equal(s->cow->bdev, snap->cow->bdev))
>  			continue;
>  
> -		mutex_lock(&s->lock);
> +		down_read(&s->lock);
>  		active = s->active;
> -		mutex_unlock(&s->lock);
> +		up_read(&s->lock);
>  
>  		if (active) {
>  			if (snap_src)
> @@ -618,6 +621,36 @@ static void unregister_snapshot(struct dm_snapshot *s)
>   * The lowest hash_shift bits of the chunk number are ignored, allowing
>   * some consecutive chunks to be grouped together.
>   */
> +static uint32_t exception_hash(struct dm_exception_table *et, chunk_t chunk);
> +
> +/* Lock to protect access to the completed and pending exception hash tables. */
> +struct dm_exception_table_lock {
> +	struct hlist_bl_head *complete_slot;
> +	struct hlist_bl_head *pending_slot;
> +};
> +
> +static void dm_exception_table_lock_init(struct dm_snapshot *s, chunk_t chunk,
> +					 struct dm_exception_table_lock *lock)
> +{
> +	struct dm_exception_table *complete = &s->complete;
> +	struct dm_exception_table *pending = &s->pending;
> +
> +	lock->complete_slot = &complete->table[exception_hash(complete, chunk)];
> +	lock->pending_slot = &pending->table[exception_hash(pending, chunk)];
> +}
> +
> +static void dm_exception_table_lock(struct dm_exception_table_lock *lock)
> +{
> +	hlist_bl_lock(lock->complete_slot);
> +	hlist_bl_lock(lock->pending_slot);
> +}
> +
> +static void dm_exception_table_unlock(struct dm_exception_table_lock *lock)
> +{
> +	hlist_bl_unlock(lock->pending_slot);
> +	hlist_bl_unlock(lock->complete_slot);
> +}
> +
>  static int dm_exception_table_init(struct dm_exception_table *et,
>  				   uint32_t size, unsigned hash_shift)
>  {
> @@ -625,12 +658,12 @@ static int dm_exception_table_init(struct dm_exception_table *et,
>  
>  	et->hash_shift = hash_shift;
>  	et->hash_mask = size - 1;
> -	et->table = dm_vcalloc(size, sizeof(struct list_head));
> +	et->table = dm_vcalloc(size, sizeof(struct hlist_bl_head));
>  	if (!et->table)
>  		return -ENOMEM;
>  
>  	for (i = 0; i < size; i++)
> -		INIT_LIST_HEAD(et->table + i);
> +		INIT_HLIST_BL_HEAD(et->table + i);
>  
>  	return 0;
>  }
> @@ -638,15 +671,16 @@ static int dm_exception_table_init(struct dm_exception_table *et,
>  static void dm_exception_table_exit(struct dm_exception_table *et,
>  				    struct kmem_cache *mem)
>  {
> -	struct list_head *slot;
> -	struct dm_exception *ex, *next;
> +	struct hlist_bl_head *slot;
> +	struct dm_exception *ex;
> +	struct hlist_bl_node *pos, *n;
>  	int i, size;
>  
>  	size = et->hash_mask + 1;
>  	for (i = 0; i < size; i++) {
>  		slot = et->table + i;
>  
> -		list_for_each_entry_safe (ex, next, slot, hash_list)
> +		hlist_bl_for_each_entry_safe(ex, pos, n, slot, hash_list)
>  			kmem_cache_free(mem, ex);
>  	}
>  
> @@ -660,7 +694,7 @@ static uint32_t exception_hash(struct dm_exception_table *et, chunk_t chunk)
>  
>  static void dm_remove_exception(struct dm_exception *e)
>  {
> -	list_del(&e->hash_list);
> +	hlist_bl_del(&e->hash_list);
>  }
>  
>  /*
> @@ -670,11 +704,12 @@ static void dm_remove_exception(struct dm_exception *e)
>  static struct dm_exception *dm_lookup_exception(struct dm_exception_table *et,
>  						chunk_t chunk)
>  {
> -	struct list_head *slot;
> +	struct hlist_bl_head *slot;
> +	struct hlist_bl_node *pos;
>  	struct dm_exception *e;
>  
>  	slot = &et->table[exception_hash(et, chunk)];
> -	list_for_each_entry (e, slot, hash_list)
> +	hlist_bl_for_each_entry(e, pos, slot, hash_list)
>  		if (chunk >= e->old_chunk &&
>  		    chunk <= e->old_chunk + dm_consecutive_chunk_count(e))
>  			return e;
> @@ -721,7 +756,8 @@ static void free_pending_exception(struct dm_snap_pending_exception *pe)
>  static void dm_insert_exception(struct dm_exception_table *eh,
>  				struct dm_exception *new_e)
>  {
> -	struct list_head *l;
> +	struct hlist_bl_head *l;
> +	struct hlist_bl_node *pos;
>  	struct dm_exception *e = NULL;
>  
>  	l = &eh->table[exception_hash(eh, new_e->old_chunk)];
> @@ -731,7 +767,7 @@ static void dm_insert_exception(struct dm_exception_table *eh,
>  		goto out;
>  
>  	/* List is ordered by old_chunk */
> -	list_for_each_entry_reverse(e, l, hash_list) {
> +	hlist_bl_for_each_entry(e, pos, l, hash_list) {
>  		/* Insert after an existing chunk? */
>  		if (new_e->old_chunk == (e->old_chunk +
>  					 dm_consecutive_chunk_count(e) + 1) &&
> @@ -752,12 +788,24 @@ static void dm_insert_exception(struct dm_exception_table *eh,
>  			return;
>  		}
>  
> -		if (new_e->old_chunk > e->old_chunk)
> +		if (new_e->old_chunk < e->old_chunk)
>  			break;
>  	}
>  
>  out:
> -	list_add(&new_e->hash_list, e ? &e->hash_list : l);
> +	if (!e) {
> +		/*
> +		 * Either the table doesn't support consecutive chunks or slot
> +		 * l is empty.
> +		 */
> +		hlist_bl_add_head(&new_e->hash_list, l);
> +	} else if (new_e->old_chunk < e->old_chunk) {
> +		/* Add before an existing exception */
> +		hlist_bl_add_before(&new_e->hash_list, &e->hash_list);
> +	} else {
> +		/* Add to l's tail: e is the last exception in this slot */
> +		hlist_bl_add_behind(&new_e->hash_list, &e->hash_list);
> +	}
>  }
>  
>  /*
> @@ -766,6 +814,7 @@ static void dm_insert_exception(struct dm_exception_table *eh,
>   */
>  static int dm_add_exception(void *context, chunk_t old, chunk_t new)
>  {
> +	struct dm_exception_table_lock lock;
>  	struct dm_snapshot *s = context;
>  	struct dm_exception *e;
>  
> @@ -778,7 +827,17 @@ static int dm_add_exception(void *context, chunk_t old, chunk_t new)
>  	/* Consecutive_count is implicitly initialised to zero */
>  	e->new_chunk = new;
>  
> +	/*
> +	 * Although there is no need to lock access to the exception tables
> +	 * here, if we don't then hlist_bl_add_head(), called by
> +	 * dm_insert_exception(), will complain about accessing the
> +	 * corresponding list without locking it first.
> +	 */
> +	dm_exception_table_lock_init(s, old, &lock);
> +
> +	dm_exception_table_lock(&lock);
>  	dm_insert_exception(&s->complete, e);
> +	dm_exception_table_unlock(&lock);
>  
>  	return 0;
>  }
> @@ -807,7 +866,7 @@ static int calc_max_buckets(void)
>  {
>  	/* use a fixed size of 2MB */
>  	unsigned long mem = 2 * 1024 * 1024;
> -	mem /= sizeof(struct list_head);
> +	mem /= sizeof(struct hlist_bl_head);
>  
>  	return mem;
>  }
> @@ -927,7 +986,7 @@ static int remove_single_exception_chunk(struct dm_snapshot *s)
>  	int r;
>  	chunk_t old_chunk = s->first_merging_chunk + s->num_merging_chunks - 1;
>  
> -	mutex_lock(&s->lock);
> +	down_write(&s->lock);
>  
>  	/*
>  	 * Process chunks (and associated exceptions) in reverse order
> @@ -942,7 +1001,7 @@ static int remove_single_exception_chunk(struct dm_snapshot *s)
>  	b = __release_queued_bios_after_merge(s);
>  
>  out:
> -	mutex_unlock(&s->lock);
> +	up_write(&s->lock);
>  	if (b)
>  		flush_bios(b);
>  
> @@ -1001,9 +1060,9 @@ static void snapshot_merge_next_chunks(struct dm_snapshot *s)
>  		if (linear_chunks < 0) {
>  			DMERR("Read error in exception store: "
>  			      "shutting down merge");
> -			mutex_lock(&s->lock);
> +			down_write(&s->lock);
>  			s->merge_failed = 1;
> -			mutex_unlock(&s->lock);
> +			up_write(&s->lock);
>  		}
>  		goto shut;
>  	}
> @@ -1044,10 +1103,10 @@ static void snapshot_merge_next_chunks(struct dm_snapshot *s)
>  		previous_count = read_pending_exceptions_done_count();
>  	}
>  
> -	mutex_lock(&s->lock);
> +	down_write(&s->lock);
>  	s->first_merging_chunk = old_chunk;
>  	s->num_merging_chunks = linear_chunks;
> -	mutex_unlock(&s->lock);
> +	up_write(&s->lock);
>  
>  	/* Wait until writes to all 'linear_chunks' drain */
>  	for (i = 0; i < linear_chunks; i++)
> @@ -1089,10 +1148,10 @@ static void merge_callback(int read_err, unsigned long write_err, void *context)
>  	return;
>  
>  shut:
> -	mutex_lock(&s->lock);
> +	down_write(&s->lock);
>  	s->merge_failed = 1;
>  	b = __release_queued_bios_after_merge(s);
> -	mutex_unlock(&s->lock);
> +	up_write(&s->lock);
>  	error_bios(b);
>  
>  	merge_shutdown(s);
> @@ -1188,10 +1247,11 @@ static int snapshot_ctr(struct dm_target *ti, unsigned int argc, char **argv)
>  	s->snapshot_overflowed = 0;
>  	s->active = 0;
>  	atomic_set(&s->pending_exceptions_count, 0);
> +	spin_lock_init(&s->pe_allocation_lock);
>  	s->exception_start_sequence = 0;
>  	s->exception_complete_sequence = 0;
>  	s->out_of_order_tree = RB_ROOT;
> -	mutex_init(&s->lock);
> +	init_rwsem(&s->lock);
>  	INIT_LIST_HEAD(&s->list);
>  	spin_lock_init(&s->pe_lock);
>  	s->state_bits = 0;
> @@ -1357,9 +1417,9 @@ static void snapshot_dtr(struct dm_target *ti)
>  	/* Check whether exception handover must be cancelled */
>  	(void) __find_snapshots_sharing_cow(s, &snap_src, &snap_dest, NULL);
>  	if (snap_src && snap_dest && (s == snap_src)) {
> -		mutex_lock(&snap_dest->lock);
> +		down_write(&snap_dest->lock);
>  		snap_dest->valid = 0;
> -		mutex_unlock(&snap_dest->lock);
> +		up_write(&snap_dest->lock);
>  		DMERR("Cancelling snapshot handover.");
>  	}
>  	up_read(&_origins_lock);
> @@ -1390,8 +1450,6 @@ static void snapshot_dtr(struct dm_target *ti)
>  
>  	dm_exception_store_destroy(s->store);
>  
> -	mutex_destroy(&s->lock);
> -
>  	dm_put_device(ti, s->cow);
>  
>  	dm_put_device(ti, s->origin);
> @@ -1467,6 +1525,13 @@ static void __invalidate_snapshot(struct dm_snapshot *s, int err)
>  	dm_table_event(s->ti->table);
>  }
>  
> +static void invalidate_snapshot(struct dm_snapshot *s, int err)
> +{
> +	down_write(&s->lock);
> +	__invalidate_snapshot(s, err);
> +	up_write(&s->lock);
> +}
> +
>  static void pending_complete(void *context, int success)
>  {
>  	struct dm_snap_pending_exception *pe = context;
> @@ -1475,29 +1540,37 @@ static void pending_complete(void *context, int success)
>  	struct bio *origin_bios = NULL;
>  	struct bio *snapshot_bios = NULL;
>  	struct bio *full_bio = NULL;
> +	struct dm_exception_table_lock lock;
>  	int error = 0;
>  
> +	dm_exception_table_lock_init(s, pe->e.old_chunk, &lock);
> +
>  	if (!success) {
>  		/* Read/write error - snapshot is unusable */
> -		mutex_lock(&s->lock);
> -		__invalidate_snapshot(s, -EIO);
> +		invalidate_snapshot(s, -EIO);
>  		error = 1;
> +
> +		dm_exception_table_lock(&lock);
>  		goto out;
>  	}
>  
>  	e = alloc_completed_exception(GFP_NOIO);
>  	if (!e) {
> -		mutex_lock(&s->lock);
> -		__invalidate_snapshot(s, -ENOMEM);
> +		invalidate_snapshot(s, -ENOMEM);
>  		error = 1;
> +
> +		dm_exception_table_lock(&lock);
>  		goto out;
>  	}
>  	*e = pe->e;
>  
> -	mutex_lock(&s->lock);
> +	down_read(&s->lock);
> +	dm_exception_table_lock(&lock);
>  	if (!s->valid) {
> +		up_read(&s->lock);
>  		free_completed_exception(e);
>  		error = 1;
> +
>  		goto out;
>  	}
>  
> @@ -1509,17 +1582,21 @@ static void pending_complete(void *context, int success)
>  	 * merging can overwrite the chunk in origin.
>  	 */
>  	dm_insert_exception(&s->complete, e);
> +	up_read(&s->lock);
>  
>  	/* Wait for conflicting reads to drain */
>  	if (__chunk_is_tracked(s, pe->e.old_chunk)) {
> -		mutex_unlock(&s->lock);
> +		dm_exception_table_unlock(&lock);
>  		__check_for_conflicting_io(s, pe->e.old_chunk);
> -		mutex_lock(&s->lock);
> +		dm_exception_table_lock(&lock);
>  	}
>  
>  out:
>  	/* Remove the in-flight exception from the list */
>  	dm_remove_exception(&pe->e);
> +
> +	dm_exception_table_unlock(&lock);
> +
>  	snapshot_bios = bio_list_get(&pe->snapshot_bios);
>  	origin_bios = bio_list_get(&pe->origin_bios);
>  	full_bio = pe->full_bio;
> @@ -1527,8 +1604,6 @@ static void pending_complete(void *context, int success)
>  		full_bio->bi_end_io = pe->full_bio_end_io;
>  	increment_pending_exceptions_done_count();
>  
> -	mutex_unlock(&s->lock);
> -
>  	/* Submit any pending write bios */
>  	if (error) {
>  		if (full_bio)
> @@ -1670,8 +1745,8 @@ __lookup_pending_exception(struct dm_snapshot *s, chunk_t chunk)
>  /*
>   * Inserts a pending exception into the pending table.
>   *
> - * NOTE: a write lock must be held on snap->lock before calling
> - * this.
> + * NOTE: a write lock must be held on the chunk's pending exception table slot
> + * before calling this.
>   */
>  static struct dm_snap_pending_exception *
>  __insert_pending_exception(struct dm_snapshot *s,
> @@ -1683,12 +1758,15 @@ __insert_pending_exception(struct dm_snapshot *s,
>  	pe->started = 0;
>  	pe->full_bio = NULL;
>  
> +	spin_lock(&s->pe_allocation_lock);
>  	if (s->store->type->prepare_exception(s->store, &pe->e)) {
> +		spin_unlock(&s->pe_allocation_lock);
>  		free_pending_exception(pe);
>  		return NULL;
>  	}
>  
>  	pe->exception_sequence = s->exception_start_sequence++;
> +	spin_unlock(&s->pe_allocation_lock);
>  
>  	dm_insert_exception(&s->pending, &pe->e);
>  
> @@ -1700,8 +1778,8 @@ __insert_pending_exception(struct dm_snapshot *s,
>   * for this chunk, otherwise it allocates a new one and inserts
>   * it into the pending table.
>   *
> - * NOTE: a write lock must be held on snap->lock before calling
> - * this.
> + * NOTE: a write lock must be held on the chunk's pending exception table slot
> + * before calling this.
>   */
>  static struct dm_snap_pending_exception *
>  __find_pending_exception(struct dm_snapshot *s,
> @@ -1735,6 +1813,7 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio)
>  	int r = DM_MAPIO_REMAPPED;
>  	chunk_t chunk;
>  	struct dm_snap_pending_exception *pe = NULL;
> +	struct dm_exception_table_lock lock;
>  
>  	init_tracked_chunk(bio);
>  
> @@ -1744,13 +1823,15 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio)
>  	}
>  
>  	chunk = sector_to_chunk(s->store, bio->bi_iter.bi_sector);
> +	dm_exception_table_lock_init(s, chunk, &lock);
>  
>  	/* Full snapshots are not usable */
>  	/* To get here the table must be live so s->active is always set. */
>  	if (!s->valid)
>  		return DM_MAPIO_KILL;
>  
> -	mutex_lock(&s->lock);
> +	down_read(&s->lock);
> +	dm_exception_table_lock(&lock);
>  
>  	if (!s->valid || (unlikely(s->snapshot_overflowed) &&
>  	    bio_data_dir(bio) == WRITE)) {
> @@ -1773,15 +1854,9 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio)
>  	if (bio_data_dir(bio) == WRITE) {
>  		pe = __lookup_pending_exception(s, chunk);
>  		if (!pe) {
> -			mutex_unlock(&s->lock);
> +			dm_exception_table_unlock(&lock);
>  			pe = alloc_pending_exception(s);
> -			mutex_lock(&s->lock);
> -
> -			if (!s->valid || s->snapshot_overflowed) {
> -				free_pending_exception(pe);
> -				r = DM_MAPIO_KILL;
> -				goto out_unlock;
> -			}
> +			dm_exception_table_lock(&lock);
>  
>  			e = dm_lookup_exception(&s->complete, chunk);
>  			if (e) {
> @@ -1792,13 +1867,22 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio)
>  
>  			pe = __find_pending_exception(s, pe, chunk);
>  			if (!pe) {
> +				dm_exception_table_unlock(&lock);
> +				up_read(&s->lock);
> +
> +				down_write(&s->lock);
> +
>  				if (s->store->userspace_supports_overflow) {
> -					s->snapshot_overflowed = 1;
> -					DMERR("Snapshot overflowed: Unable to allocate exception.");
> +					if (s->valid && !s->snapshot_overflowed) {
> +						s->snapshot_overflowed = 1;
> +						DMERR("Snapshot overflowed: Unable to allocate exception.");
> +					}
>  				} else
>  					__invalidate_snapshot(s, -ENOMEM);
> +				up_write(&s->lock);
> +
>  				r = DM_MAPIO_KILL;
> -				goto out_unlock;
> +				goto out;
>  			}
>  		}
>  
> @@ -1810,7 +1894,10 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio)
>  		    bio->bi_iter.bi_size ==
>  		    (s->store->chunk_size << SECTOR_SHIFT)) {
>  			pe->started = 1;
> -			mutex_unlock(&s->lock);
> +
> +			dm_exception_table_unlock(&lock);
> +			up_read(&s->lock);
> +
>  			start_full_bio(pe, bio);
>  			goto out;
>  		}
> @@ -1818,9 +1905,12 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio)
>  		bio_list_add(&pe->snapshot_bios, bio);
>  
>  		if (!pe->started) {
> -			/* this is protected by snap->lock */
> +			/* this is protected by the exception table lock */
>  			pe->started = 1;
> -			mutex_unlock(&s->lock);
> +
> +			dm_exception_table_unlock(&lock);
> +			up_read(&s->lock);
> +
>  			start_copy(pe);
>  			goto out;
>  		}
> @@ -1830,7 +1920,8 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio)
>  	}
>  
>  out_unlock:
> -	mutex_unlock(&s->lock);
> +	dm_exception_table_unlock(&lock);
> +	up_read(&s->lock);
>  out:
>  	return r;
>  }
> @@ -1866,7 +1957,7 @@ static int snapshot_merge_map(struct dm_target *ti, struct bio *bio)
>  
>  	chunk = sector_to_chunk(s->store, bio->bi_iter.bi_sector);
>  
> -	mutex_lock(&s->lock);
> +	down_write(&s->lock);
>  
>  	/* Full merging snapshots are redirected to the origin */
>  	if (!s->valid)
> @@ -1897,12 +1988,12 @@ static int snapshot_merge_map(struct dm_target *ti, struct bio *bio)
>  	bio_set_dev(bio, s->origin->bdev);
>  
>  	if (bio_data_dir(bio) == WRITE) {
> -		mutex_unlock(&s->lock);
> +		up_write(&s->lock);
>  		return do_origin(s->origin, bio);
>  	}
>  
>  out_unlock:
> -	mutex_unlock(&s->lock);
> +	up_write(&s->lock);
>  
>  	return r;
>  }
> @@ -1934,7 +2025,7 @@ static int snapshot_preresume(struct dm_target *ti)
>  	down_read(&_origins_lock);
>  	(void) __find_snapshots_sharing_cow(s, &snap_src, &snap_dest, NULL);
>  	if (snap_src && snap_dest) {
> -		mutex_lock(&snap_src->lock);
> +		down_read(&snap_src->lock);
>  		if (s == snap_src) {
>  			DMERR("Unable to resume snapshot source until "
>  			      "handover completes.");
> @@ -1944,7 +2035,7 @@ static int snapshot_preresume(struct dm_target *ti)
>  			      "source is suspended.");
>  			r = -EINVAL;
>  		}
> -		mutex_unlock(&snap_src->lock);
> +		up_read(&snap_src->lock);
>  	}
>  	up_read(&_origins_lock);
>  
> @@ -1990,11 +2081,11 @@ static void snapshot_resume(struct dm_target *ti)
>  
>  	(void) __find_snapshots_sharing_cow(s, &snap_src, &snap_dest, NULL);
>  	if (snap_src && snap_dest) {
> -		mutex_lock(&snap_src->lock);
> -		mutex_lock_nested(&snap_dest->lock, SINGLE_DEPTH_NESTING);
> +		down_write(&snap_src->lock);
> +		down_write_nested(&snap_dest->lock, SINGLE_DEPTH_NESTING);
>  		__handover_exceptions(snap_src, snap_dest);
> -		mutex_unlock(&snap_dest->lock);
> -		mutex_unlock(&snap_src->lock);
> +		up_write(&snap_dest->lock);
> +		up_write(&snap_src->lock);
>  	}
>  
>  	up_read(&_origins_lock);
> @@ -2009,9 +2100,9 @@ static void snapshot_resume(struct dm_target *ti)
>  	/* Now we have correct chunk size, reregister */
>  	reregister_snapshot(s);
>  
> -	mutex_lock(&s->lock);
> +	down_write(&s->lock);
>  	s->active = 1;
> -	mutex_unlock(&s->lock);
> +	up_write(&s->lock);
>  }
>  
>  static uint32_t get_origin_minimum_chunksize(struct block_device *bdev)
> @@ -2051,7 +2142,7 @@ static void snapshot_status(struct dm_target *ti, status_type_t type,
>  	switch (type) {
>  	case STATUSTYPE_INFO:
>  
> -		mutex_lock(&snap->lock);
> +		down_write(&snap->lock);
>  
>  		if (!snap->valid)
>  			DMEMIT("Invalid");
> @@ -2076,7 +2167,7 @@ static void snapshot_status(struct dm_target *ti, status_type_t type,
>  				DMEMIT("Unknown");
>  		}
>  
> -		mutex_unlock(&snap->lock);
> +		up_write(&snap->lock);
>  
>  		break;
>  
> @@ -2131,6 +2222,7 @@ static int __origin_write(struct list_head *snapshots, sector_t sector,
>  	struct dm_snap_pending_exception *pe, *pe2;
>  	struct dm_snap_pending_exception *pe_to_start_now = NULL;
>  	struct dm_snap_pending_exception *pe_to_start_last = NULL;
> +	struct dm_exception_table_lock lock;
>  	chunk_t chunk;
>  
>  	/* Do all the snapshots on this origin */
> @@ -2142,21 +2234,23 @@ static int __origin_write(struct list_head *snapshots, sector_t sector,
>  		if (dm_target_is_snapshot_merge(snap->ti))
>  			continue;
>  
> -		mutex_lock(&snap->lock);
> -
> -		/* Only deal with valid and active snapshots */
> -		if (!snap->valid || !snap->active)
> -			goto next_snapshot;
> -
>  		/* Nothing to do if writing beyond end of snapshot */
>  		if (sector >= dm_table_get_size(snap->ti->table))
> -			goto next_snapshot;
> +			continue;
>  
>  		/*
>  		 * Remember, different snapshots can have
>  		 * different chunk sizes.
>  		 */
>  		chunk = sector_to_chunk(snap->store, sector);
> +		dm_exception_table_lock_init(snap, chunk, &lock);
> +
> +		down_read(&snap->lock);
> +		dm_exception_table_lock(&lock);
> +
> +		/* Only deal with valid and active snapshots */
> +		if (!snap->valid || !snap->active)
> +			goto next_snapshot;
>  
>  		pe = __lookup_pending_exception(snap, chunk);
>  		if (!pe) {
> @@ -2169,14 +2263,9 @@ static int __origin_write(struct list_head *snapshots, sector_t sector,
>  			if (e)
>  				goto next_snapshot;
>  
> -			mutex_unlock(&snap->lock);
> +			dm_exception_table_unlock(&lock);
>  			pe = alloc_pending_exception(snap);
> -			mutex_lock(&snap->lock);
> -
> -			if (!snap->valid) {
> -				free_pending_exception(pe);
> -				goto next_snapshot;
> -			}
> +			dm_exception_table_lock(&lock);
>  
>  			pe2 = __lookup_pending_exception(snap, chunk);
>  
> @@ -2189,8 +2278,11 @@ static int __origin_write(struct list_head *snapshots, sector_t sector,
>  
>  				pe = __insert_pending_exception(snap, pe, chunk);
>  				if (!pe) {
> -					__invalidate_snapshot(snap, -ENOMEM);
> -					goto next_snapshot;
> +					dm_exception_table_unlock(&lock);
> +					up_read(&snap->lock);
> +
> +					invalidate_snapshot(snap, -ENOMEM);
> +					continue;
>  				}
>  			} else {
>  				free_pending_exception(pe);
> @@ -2221,7 +2313,8 @@ static int __origin_write(struct list_head *snapshots, sector_t sector,
>  		}
>  
>  next_snapshot:
> -		mutex_unlock(&snap->lock);
> +		dm_exception_table_unlock(&lock);
> +		up_read(&snap->lock);
>  
>  		if (pe_to_start_now) {
>  			start_copy(pe_to_start_now);
> -- 
> 2.11.0
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/3] dm snapshot: Use fine-grained locking scheme
  2019-02-28 21:37   ` Mikulas Patocka
@ 2019-03-01  9:15     ` Nikos Tsironis
  0 siblings, 0 replies; 26+ messages in thread
From: Nikos Tsironis @ 2019-03-01  9:15 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: dm-devel, snitzer, agk, iliastsi

On 2/28/19 11:37 PM, Mikulas Patocka wrote:
> Hi
> 
> I looked at the patches, and they could be merged into kernel 5.2.
> 
> Please split this last patch - it is too big and too hard to understand.
> 
> For example, you can split it into three patches:
> [PATCH 3/5] replace mutex with rw_semaphore (basically a revert of 
> 	ae1093be5a0ef997833e200a0dafb9ed0b1ff4fe)
> [PATCH 4/5] replace normal lists with locked lists (but still leave the 
> 	global lock)
> [PATCH 5/5] change down_write to down_read and the rest of the changes
> 
> ... so that the resulting file is the same.

Hi Mikulas,

Thanks for your feedback. I will split the third patch as you requested
and I will send a second version of these patches.

Thanks,
Nikos

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/3] list_bl: Add hlist_bl_add_before/behind helpers
  2019-02-28 21:32   ` Mike Snitzer
  2019-02-28 21:34     ` Mike Snitzer
@ 2019-03-11 18:16     ` Christoph Hellwig
  2019-03-11 22:13       ` Paul E. McKenney
  2019-03-13 23:48     ` Paul E. McKenney
  2 siblings, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2019-03-11 18:16 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Paul E. McKenney, hch, Nikos Tsironis, agk, dm-devel, mpatocka,
	iliastsi, linux-kernel

On Thu, Feb 28, 2019 at 04:32:02PM -0500, Mike Snitzer wrote:
> Hi Paul and Christoph,
> 
> You've added your Signed-off-by to include/linux/list_bl.h commits in
> the past.  I'm not sure how this proposed patch should be handled.
> 
> These new hlist_bl_add_{before,behind} changes are a prereq for
> dm-snapshot changes that Nikos has proposed, please see:
> https://patchwork.kernel.org/patch/10739265/
> 
> Any assistance/review you, or others on LKML, might be able to provide
> would be appreciated.

I just killed two helpers.  That being said, assuming that we only rely
on the next pointer for the lockless traversals, the changes look fine
to me, but the code might be beyond my pay grade.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/3] list_bl: Add hlist_bl_add_before/behind helpers
  2019-03-11 18:16     ` Christoph Hellwig
@ 2019-03-11 22:13       ` Paul E. McKenney
  2019-03-11 22:43         ` Mike Snitzer
  0 siblings, 1 reply; 26+ messages in thread
From: Paul E. McKenney @ 2019-03-11 22:13 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Mike Snitzer, Nikos Tsironis, agk, dm-devel, mpatocka, iliastsi,
	linux-kernel

On Mon, Mar 11, 2019 at 11:16:08AM -0700, Christoph Hellwig wrote:
> On Thu, Feb 28, 2019 at 04:32:02PM -0500, Mike Snitzer wrote:
> > Hi Paul and Christoph,
> > 
> > You've added your Signed-off-by to include/linux/list_bl.h commits in
> > the past.  I'm not sure how this proposed patch should be handled.
> > 
> > These new hlist_bl_add_{before,behind} changes are a prereq for
> > dm-snapshot changes that Nikos has proposed, please see:
> > https://patchwork.kernel.org/patch/10739265/
> > 
> > Any assistance/review you, or others on LKML, might be able to provide
> > would be appreciated.
> 
> I just killed two helpers.  That being said, assuming that we only rely
> on the next pointer for the lockless traversals, the changes look fine
> to me, but the code might be beyond my pay grade.

First, apologies for being slow on this one.

Second, were the two helpers hlist_bl_add_{before,behind}()?  If so, I
guess there is not much point in me looking them over.  Though perhaps
I should be looking something else over?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/3] list_bl: Add hlist_bl_add_before/behind helpers
  2019-03-11 22:13       ` Paul E. McKenney
@ 2019-03-11 22:43         ` Mike Snitzer
  2019-03-14  0:25           ` Paul E. McKenney
  0 siblings, 1 reply; 26+ messages in thread
From: Mike Snitzer @ 2019-03-11 22:43 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Christoph Hellwig, Nikos Tsironis, agk, dm-devel, mpatocka,
	iliastsi, linux-kernel

On Mon, Mar 11 2019 at  6:13pm -0400,
Paul E. McKenney <paulmck@linux.ibm.com> wrote:

> On Mon, Mar 11, 2019 at 11:16:08AM -0700, Christoph Hellwig wrote:
> > On Thu, Feb 28, 2019 at 04:32:02PM -0500, Mike Snitzer wrote:
> > > Hi Paul and Christoph,
> > > 
> > > You've added your Signed-off-by to include/linux/list_bl.h commits in
> > > the past.  I'm not sure how this proposed patch should be handled.
> > > 
> > > These new hlist_bl_add_{before,behind} changes are a prereq for
> > > dm-snapshot changes that Nikos has proposed, please see:
> > > https://patchwork.kernel.org/patch/10739265/
> > > 
> > > Any assistance/review you, or others on LKML, might be able to provide
> > > would be appreciated.
> > 
> > I just killed two helpers.  That being said, assuming that we only rely
> > on the next pointer for the lockless traversals, the changes look fine
> > to me, but the code might be beyond my pay grade.
> 
> First, apologies for being slow on this one.

No problem.
 
> Second, were the two helpers hlist_bl_add_{before,behind}()?  If so, I
> guess there is not much point in me looking them over.  Though perhaps
> I should be looking something else over?

No, I think Christoph was referring to his commit 1879fd6a26571fd4e8e1f
from 2011.

Anyway, I'd like you to look over this new proposed patch that
introduces hlist_bl_add_{before,behind}(), please see:
https://patchwork.kernel.org/patch/10835713/

If you're happy with the patch, and can provide your Reviewed-by or
Acked-by, I'll then pick it up as a prereq for the broader dm-snapshot
changes that Nikos has provided.

Thanks!
Mike

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/3] list_bl: Add hlist_bl_add_before/behind helpers
  2019-02-28 21:32   ` Mike Snitzer
  2019-02-28 21:34     ` Mike Snitzer
  2019-03-11 18:16     ` Christoph Hellwig
@ 2019-03-13 23:48     ` Paul E. McKenney
  2019-03-14  0:30       ` Mike Snitzer
  2 siblings, 1 reply; 26+ messages in thread
From: Paul E. McKenney @ 2019-03-13 23:48 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: hch, Nikos Tsironis, agk, dm-devel, mpatocka, iliastsi, linux-kernel

On Thu, Feb 28, 2019 at 04:32:02PM -0500, Mike Snitzer wrote:
> On Thu, Dec 20 2018 at  1:06pm -0500,
> Nikos Tsironis <ntsironis@arrikto.com> wrote:
> 
> > Add hlist_bl_add_before/behind helpers to add an element before/after an
> > existing element in a bl_list.
> > 
> > Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
> > Signed-off-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
> > ---
> >  include/linux/list_bl.h | 27 +++++++++++++++++++++++++++
> >  1 file changed, 27 insertions(+)
> > 
> > diff --git a/include/linux/list_bl.h b/include/linux/list_bl.h
> > index 3fc2cc57ba1b..2fd918e5fd48 100644
> > --- a/include/linux/list_bl.h
> > +++ b/include/linux/list_bl.h
> > @@ -86,6 +86,33 @@ static inline void hlist_bl_add_head(struct hlist_bl_node *n,
> >  	hlist_bl_set_first(h, n);
> >  }
> >  
> > +static inline void hlist_bl_add_before(struct hlist_bl_node *n,
> > +				       struct hlist_bl_node *next)
> > +{
> > +	struct hlist_bl_node **pprev = next->pprev;
> > +
> > +	n->pprev = pprev;
> > +	n->next = next;
> > +	next->pprev = &n->next;
> > +
> > +	/* pprev may be `first`, so be careful not to lose the lock bit */
> > +	WRITE_ONCE(*pprev,
> > +		   (struct hlist_bl_node *)
> > +			((unsigned long)n |
> > +			 ((unsigned long)*pprev & LIST_BL_LOCKMASK)));

A nit, but use of uintptr_t shrinks things a bit:

+		   (struct hlist_bl_node *)
+			((uintptr_t)n | ((uintptr_t)*pprev & LIST_BL_LOCKMASK)));

I am not too concerned about this, though.

The WRITE_ONCE() is to handle races with hlist_bl_empty() (which does contain
the corresponding READ_ONCE()) correct?

> > +}
> > +
> > +static inline void hlist_bl_add_behind(struct hlist_bl_node *n,
> > +				       struct hlist_bl_node *prev)
> > +{
> > +	n->next = prev->next;
> > +	n->pprev = &prev->next;
> > +	WRITE_ONCE(prev->next, n);

I don't see what this WRITE_ONCE() is interacting with.  The traversals
use plain C-language reads, and hlist_bl_empty() can't get here.  All
uses of hlist_bl_for_each_entry() invoke hlist_bl_lock() before starting
the traversal, and hlist_bl_for_each_entry_safe() looks to be unused.
(Perhaps it should be removed?  Or is there some anticipated use?)
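
For context, the traversal pattern I have in mind is roughly the
following -- just a sketch, borrowing the names used by the dm-snapshot
patches elsewhere in this thread:

	struct dm_exception *e;
	struct hlist_bl_node *pos;

	hlist_bl_lock(slot);		/* take the bucket's bit spinlock */
	hlist_bl_for_each_entry(e, pos, slot, hash_list) {
		/* ... examine e while the bucket is locked ... */
	}
	hlist_bl_unlock(slot);

That is, every reader I can find holds the bucket lock, so the writer's
plain store to prev->next is already serialized against them.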

I don't believe that the WRITE_ONCE() is needed.  What am I missing?

Other than that, looks good.

							Thanx, Paul

> > +
> > +	if (n->next)
> > +		n->next->pprev = &n->next;
> > +}
> > +
> >  static inline void __hlist_bl_del(struct hlist_bl_node *n)
> >  {
> >  	struct hlist_bl_node *next = n->next;
> > -- 
> > 2.11.0
> 
> Hi Paul and Christoph,
> 
> You've added your Signed-off-by to include/linux/list_bl.h commits in
> the past.  I'm not sure how this proposed patch should be handled.
> 
> These new hlist_bl_add_{before,behind} changes are a prereq for
> dm-snapshot changes that Nikos has proposed, please see:
> https://patchwork.kernel.org/patch/10739265/
> 
> Any assistance/review you, or others on LKML, might be able to provide
> would be appreciated.
> 
> Thanks,
> Mike
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/3] list_bl: Add hlist_bl_add_before/behind helpers
  2019-03-11 22:43         ` Mike Snitzer
@ 2019-03-14  0:25           ` Paul E. McKenney
  0 siblings, 0 replies; 26+ messages in thread
From: Paul E. McKenney @ 2019-03-14  0:25 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Christoph Hellwig, Nikos Tsironis, agk, dm-devel, mpatocka,
	iliastsi, linux-kernel

On Mon, Mar 11, 2019 at 06:43:33PM -0400, Mike Snitzer wrote:
> On Mon, Mar 11 2019 at  6:13pm -0400,
> Paul E. McKenney <paulmck@linux.ibm.com> wrote:
> 
> > On Mon, Mar 11, 2019 at 11:16:08AM -0700, Christoph Hellwig wrote:
> > > On Thu, Feb 28, 2019 at 04:32:02PM -0500, Mike Snitzer wrote:
> > > > Hi Paul and Christoph,
> > > > 
> > > > You've added your Signed-off-by to include/linux/list_bl.h commits in
> > > > the past.  I'm not sure how this proposed patch should be handled.
> > > > 
> > > > These new hlist_bl_add_{before,behind} changes are a prereq for
> > > > dm-snapshot changes that Nikos has proposed, please see:
> > > > https://patchwork.kernel.org/patch/10739265/
> > > > 
> > > > Any assistance/review you, or others on LKML, might be able to provide
> > > > would be appreciated.
> > > 
> > > I just killed two helpers.  That being said, assuming that we only rely
> > > on the next pointer for the lockless traversals, the changes look fine
> > > to me, but the code might be beyond my pay grade.
> > 
> > First, apologies for being slow on this one.
> 
> No problem.
>  
> > Second, were the two helpers hlist_bl_add_{before,behind}()?  If so, I
> > guess there is not much point in me looking them over.  Though perhaps
> > I should be looking something else over?
> 
> > No, I think Christoph was referring to his commit 1879fd6a26571fd4e8e1f
> from 2011.
> 
> Anyway, I'd like you to look over this new proposed patch that
> introduces hlist_bl_add_{before,behind}(), please see:
> https://patchwork.kernel.org/patch/10835713/
> 
> If you're happy with the patch, and can provide your Reviewed-by or
> Acked-by, I'll then pick it up as a prereq for the broader dm-snapshot
> changes that Nikos has provided.

I replied to the version earlier in this email thread.  Looks close,
but a couple of questions.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/3] list_bl: Add hlist_bl_add_before/behind helpers
  2019-03-13 23:48     ` Paul E. McKenney
@ 2019-03-14  0:30       ` Mike Snitzer
  2019-03-14 13:28           ` Nikos Tsironis
  0 siblings, 1 reply; 26+ messages in thread
From: Mike Snitzer @ 2019-03-14  0:30 UTC (permalink / raw)
  To: Paul E. McKenney, Nikos Tsironis
  Cc: hch, agk, dm-devel, mpatocka, iliastsi, linux-kernel

On Wed, Mar 13 2019 at  7:48pm -0400,
Paul E. McKenney <paulmck@linux.ibm.com> wrote:

> On Thu, Feb 28, 2019 at 04:32:02PM -0500, Mike Snitzer wrote:
> > On Thu, Dec 20 2018 at  1:06pm -0500,
> > Nikos Tsironis <ntsironis@arrikto.com> wrote:
> > 
> > > Add hlist_bl_add_before/behind helpers to add an element before/after an
> > > existing element in a bl_list.
> > > 
> > > Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
> > > Signed-off-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
> > > ---
> > >  include/linux/list_bl.h | 27 +++++++++++++++++++++++++++
> > >  1 file changed, 27 insertions(+)
> > > 
> > > diff --git a/include/linux/list_bl.h b/include/linux/list_bl.h
> > > index 3fc2cc57ba1b..2fd918e5fd48 100644
> > > --- a/include/linux/list_bl.h
> > > +++ b/include/linux/list_bl.h
> > > @@ -86,6 +86,33 @@ static inline void hlist_bl_add_head(struct hlist_bl_node *n,
> > >  	hlist_bl_set_first(h, n);
> > >  }
> > >  
> > > +static inline void hlist_bl_add_before(struct hlist_bl_node *n,
> > > +				       struct hlist_bl_node *next)
> > > +{
> > > +	struct hlist_bl_node **pprev = next->pprev;
> > > +
> > > +	n->pprev = pprev;
> > > +	n->next = next;
> > > +	next->pprev = &n->next;
> > > +
> > > +	/* pprev may be `first`, so be careful not to lose the lock bit */
> > > +	WRITE_ONCE(*pprev,
> > > +		   (struct hlist_bl_node *)
> > > +			((unsigned long)n |
> > > +			 ((unsigned long)*pprev & LIST_BL_LOCKMASK)));
> 
> A nit, but use of uintptr_t shrinks things a bit:
> 
> +		   (struct hlist_bl_node *)
> +			((uintptr_t)n | ((uintptr_t)*pprev & LIST_BL_LOCKMASK)));
> 
> I am not too concerned about this, though.

I'm fine with folding in your suggestion.

> The WRITE_ONCE() is to handle races with hlist_bl_empty() (which does contain
> the corresponding READ_ONCE()) correct?

Correct.

> > > +}
> > > +
> > > +static inline void hlist_bl_add_behind(struct hlist_bl_node *n,
> > > +				       struct hlist_bl_node *prev)
> > > +{
> > > +	n->next = prev->next;
> > > +	n->pprev = &prev->next;
> > > +	WRITE_ONCE(prev->next, n);
> 
> I don't see what this WRITE_ONCE() is interacting with.  The traversals
> use plain C-language reads, and hlist_bl_empty() can't get here.  All
> uses of hlist_bl_for_each_entry() invoke hlist_bl_lock() before starting
> the traversal, and hlist_bl_for_each_entry_safe() looks to be unused.
> (Perhaps it should be removed?  Or is there some anticipated use?)
> 
> I don't believe that the WRITE_ONCE() is needed.  What am I missing?
> 
> Other than that, looks good.
> 
> 							Thanx, Paul
> 

I'd imagine it was just born out of symmetry with hlist_bl_add_before()
and/or caution.  But let's see what Nikos has to say.

Thanks,
Mike

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [dm-devel] [PATCH 1/3] list_bl: Add hlist_bl_add_before/behind helpers
  2019-03-14  0:30       ` Mike Snitzer
@ 2019-03-14 13:28           ` Nikos Tsironis
  0 siblings, 0 replies; 26+ messages in thread
From: Nikos Tsironis @ 2019-03-14 13:28 UTC (permalink / raw)
  To: Mike Snitzer, Paul E. McKenney
  Cc: hch, agk, dm-devel, mpatocka, iliastsi, linux-kernel

On 3/14/19 2:30 AM, Mike Snitzer wrote:
> On Wed, Mar 13 2019 at  7:48pm -0400,
> Paul E. McKenney <paulmck@linux.ibm.com> wrote:
> 
Hi Paul,

Thanks a lot for your feedback!

>> On Thu, Feb 28, 2019 at 04:32:02PM -0500, Mike Snitzer wrote:
>>> On Thu, Dec 20 2018 at  1:06pm -0500,
>>> Nikos Tsironis <ntsironis@arrikto.com> wrote:
>>>
>>>> Add hlist_bl_add_before/behind helpers to add an element before/after an
>>>> existing element in a bl_list.
>>>>
>>>> Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
>>>> Signed-off-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
>>>> ---
>>>>  include/linux/list_bl.h | 27 +++++++++++++++++++++++++++
>>>>  1 file changed, 27 insertions(+)
>>>>
>>>> diff --git a/include/linux/list_bl.h b/include/linux/list_bl.h
>>>> index 3fc2cc57ba1b..2fd918e5fd48 100644
>>>> --- a/include/linux/list_bl.h
>>>> +++ b/include/linux/list_bl.h
>>>> @@ -86,6 +86,33 @@ static inline void hlist_bl_add_head(struct hlist_bl_node *n,
>>>>  	hlist_bl_set_first(h, n);
>>>>  }
>>>>  
>>>> +static inline void hlist_bl_add_before(struct hlist_bl_node *n,
>>>> +				       struct hlist_bl_node *next)
>>>> +{
>>>> +	struct hlist_bl_node **pprev = next->pprev;
>>>> +
>>>> +	n->pprev = pprev;
>>>> +	n->next = next;
>>>> +	next->pprev = &n->next;
>>>> +
>>>> +	/* pprev may be `first`, so be careful not to lose the lock bit */
>>>> +	WRITE_ONCE(*pprev,
>>>> +		   (struct hlist_bl_node *)
>>>> +			((unsigned long)n |
>>>> +			 ((unsigned long)*pprev & LIST_BL_LOCKMASK)));
>>
>> A nit, but use of uintptr_t shrinks things a bit:
>>
>> +		   (struct hlist_bl_node *)
>> +			((uintptr_t)n | ((uintptr_t)*pprev & LIST_BL_LOCKMASK)));
>>
>> I am not too concerned about this, though.
> 
> I'm fine with folding in your suggestion.
> 

Indeed, this looks better.

>> The WRITE_ONCE() is to handle races with hlist_bl_empty() (which does contain
>> the corresponding READ_ONCE()) correct?
> 
> Correct.

Yes that's correct.
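
For reference, hlist_bl_empty() reads the head pointer with READ_ONCE()
and only masks off the lock bit -- roughly (paraphrasing the current
include/linux/list_bl.h):

static inline bool hlist_bl_empty(const struct hlist_bl_head *h)
{
	return !((unsigned long)READ_ONCE(h->first) & ~LIST_BL_LOCKMASK);
}

So the WRITE_ONCE() in hlist_bl_add_before() is there to make sure a
concurrent hlist_bl_empty() sees either the old or the new first
pointer, with the lock bit intact, and never a torn value.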

> 
>>>> +}
>>>> +
>>>> +static inline void hlist_bl_add_behind(struct hlist_bl_node *n,
>>>> +				       struct hlist_bl_node *prev)
>>>> +{
>>>> +	n->next = prev->next;
>>>> +	n->pprev = &prev->next;
>>>> +	WRITE_ONCE(prev->next, n);
>>
>> I don't see what this WRITE_ONCE() is interacting with.  The traversals
>> use plain C-language reads, and hlist_bl_empty() can't get here.  All
>> uses of hlist_bl_for_each_entry() invoke hlist_bl_lock() before starting
>> the traversal, and hlist_bl_for_each_entry_safe() looks to be unused.
>> (Perhaps it should be removed?  Or is there some anticipated use?)

I am using hlist_bl_for_each_entry_safe() in this proposed patch for
dm-snapshot: https://patchwork.kernel.org/patch/10835709/

>>
>> I don't believe that the WRITE_ONCE() is needed.  What am I missing?
>>
>> Other than that, looks good.
>>
>> 							Thanx, Paul
>>
> 
> I'd imagine it was just born out of symmetry with hlist_bl_add_before()
> and/or caution.  But let's see what Nikos has to say.

I also don't believe that this WRITE_ONCE() is needed. However,
hlist_add_behind() in include/linux/list.h, which, if I am not missing
something, is used in the same way as hlist_bl_add_behind(), also uses
WRITE_ONCE() to update prev->next:

static inline void hlist_add_behind(struct hlist_node *n,
				    struct hlist_node *prev)
{
	n->next = prev->next;
	WRITE_ONCE(prev->next, n);
	n->pprev = &prev->next;

	if (n->next)
		n->next->pprev  = &n->next;
}

Could it be the case that the WRITE_ONCE() in hlist_add_behind() is also
not needed? This WRITE_ONCE() was introduced by commit 1c97be677f72b3
("list: Use WRITE_ONCE() when adding to lists and hlists").

But, since I am not an expert in lockless programming, I opted to be on
the safe side and followed the example of hlist_add_behind().

That said, I will follow up with a new version of the patch removing the
WRITE_ONCE() and using uintptr_t instead of unsigned long.
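
Concretely, I expect the v2 helpers to end up looking roughly like this
(sketch only, pending your review):

static inline void hlist_bl_add_before(struct hlist_bl_node *n,
				       struct hlist_bl_node *next)
{
	struct hlist_bl_node **pprev = next->pprev;

	n->pprev = pprev;
	n->next = next;
	next->pprev = &n->next;

	/* pprev may be `first`, so be careful not to lose the lock bit */
	WRITE_ONCE(*pprev,
		   (struct hlist_bl_node *)
			((uintptr_t)n | ((uintptr_t)*pprev & LIST_BL_LOCKMASK)));
}

static inline void hlist_bl_add_behind(struct hlist_bl_node *n,
				       struct hlist_bl_node *prev)
{
	n->next = prev->next;
	n->pprev = &prev->next;
	prev->next = n;	/* plain store: all readers hold the bucket lock */

	if (n->next)
		n->next->pprev = &n->next;
}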

Thanks,
Nikos

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [dm-devel] [PATCH 1/3] list_bl: Add hlist_bl_add_before/behind helpers
  2019-03-14 13:28           ` Nikos Tsironis
@ 2019-03-14 14:07           ` Paul E. McKenney
  2019-03-14 15:03             ` Paul E. McKenney
  2019-03-14 17:01             ` Nikos Tsironis
  -1 siblings, 2 replies; 26+ messages in thread
From: Paul E. McKenney @ 2019-03-14 14:07 UTC (permalink / raw)
  To: Nikos Tsironis
  Cc: Mike Snitzer, hch, agk, dm-devel, mpatocka, iliastsi, linux-kernel

On Thu, Mar 14, 2019 at 03:28:23PM +0200, Nikos Tsironis wrote:
> On 3/14/19 2:30 AM, Mike Snitzer wrote:
> > On Wed, Mar 13 2019 at  7:48pm -0400,
> > Paul E. McKenney <paulmck@linux.ibm.com> wrote:
> > 
> Hi Paul,
> 
> Thanks a lot for your feedback!

NP, and apologies for the delay.

> >> On Thu, Feb 28, 2019 at 04:32:02PM -0500, Mike Snitzer wrote:
> >>> On Thu, Dec 20 2018 at  1:06pm -0500,
> >>> Nikos Tsironis <ntsironis@arrikto.com> wrote:
> >>>
> >>>> Add hlist_bl_add_before/behind helpers to add an element before/after an
> >>>> existing element in a bl_list.
> >>>>
> >>>> Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
> >>>> Signed-off-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
> >>>> ---
> >>>>  include/linux/list_bl.h | 27 +++++++++++++++++++++++++++
> >>>>  1 file changed, 27 insertions(+)
> >>>>
> >>>> diff --git a/include/linux/list_bl.h b/include/linux/list_bl.h
> >>>> index 3fc2cc57ba1b..2fd918e5fd48 100644
> >>>> --- a/include/linux/list_bl.h
> >>>> +++ b/include/linux/list_bl.h
> >>>> @@ -86,6 +86,33 @@ static inline void hlist_bl_add_head(struct hlist_bl_node *n,
> >>>>  	hlist_bl_set_first(h, n);
> >>>>  }
> >>>>  
> >>>> +static inline void hlist_bl_add_before(struct hlist_bl_node *n,
> >>>> +				       struct hlist_bl_node *next)
> >>>> +{
> >>>> +	struct hlist_bl_node **pprev = next->pprev;
> >>>> +
> >>>> +	n->pprev = pprev;
> >>>> +	n->next = next;
> >>>> +	next->pprev = &n->next;
> >>>> +
> >>>> +	/* pprev may be `first`, so be careful not to lose the lock bit */
> >>>> +	WRITE_ONCE(*pprev,
> >>>> +		   (struct hlist_bl_node *)
> >>>> +			((unsigned long)n |
> >>>> +			 ((unsigned long)*pprev & LIST_BL_LOCKMASK)));
> >>
> >> A nit, but use of uintptr_t shrinks things a bit:
> >>
> >> +		   (struct hlist_bl_node *)
> >> +			((uintptr_t)n | ((uintptr_t)*pprev & LIST_BL_LOCKMASK)));
> >>
> >> I am not too concerned about this, though.
> > 
> > I'm fine with folding in your suggestion.
> 
> Indeed, this looks better.
> 
> >> The WRITE_ONCE() is to handle races with hlist_bl_empty() (which does contain
> >> the corresponding READ_ONCE()) correct?
> > 
> > Correct.
> 
> Yes that's correct.
> 
> >>>> +}
> >>>> +
> >>>> +static inline void hlist_bl_add_behind(struct hlist_bl_node *n,
> >>>> +				       struct hlist_bl_node *prev)
> >>>> +{
> >>>> +	n->next = prev->next;
> >>>> +	n->pprev = &prev->next;
> >>>> +	WRITE_ONCE(prev->next, n);
> >>
> >> I don't see what this WRITE_ONCE() is interacting with.  The traversals
> >> use plain C-language reads, and hlist_bl_empty() can't get here.  All
> >> uses of hlist_bl_for_each_entry() invoke hlist_bl_lock() before starting
> >> the traversal, and hlist_bl_for_each_entry_safe() looks to be unused.
> >> (Perhaps it should be removed?  Or is there some anticipated use?)
> 
> I am using hlist_bl_for_each_entry_safe() in this proposed patch for
> dm-snapshot: https://patchwork.kernel.org/patch/10835709/

Probably should keep it, then.  ;-)

> >>
> >> I don't believe that the WRITE_ONCE() is needed.  What am I missing?
> >>
> >> Other than that, looks good.
> >>
> >> 							Thanx, Paul
> >>
> > 
> > I'd imagine it was just born out of symmetry with hlist_bl_add_before()
> > and/or caution.  But let's see what Nikos has to say.
> 
> I also don't believe that this WRITE_ONCE() is needed. However,
> hlist_add_behind() in include/linux/list.h, which, if I am not missing
> something, is used in the same way as hlist_bl_add_behind(), also uses
> WRITE_ONCE() to update prev->next:
> 
> static inline void hlist_add_behind(struct hlist_node *n,
> 				    struct hlist_node *prev)
> {
> 	n->next = prev->next;
> 	WRITE_ONCE(prev->next, n);
> 	n->pprev = &prev->next;
> 
> 	if (n->next)
> 		n->next->pprev  = &n->next;
> }
> 
> Could it be the case that the WRITE_ONCE() in hlist_add_behind() is also
> not needed? This WRITE_ONCE() was introduced by commit 1c97be677f72b3
> ("list: Use WRITE_ONCE() when adding to lists and hlists").

Looks like I have no one to blame but myself!

Would you like to remove that as part of your patch series?

> But, since I am not an expert in lockless programming, I opted to be on
> the safe side and followed the example of hlist_add_behind().
> 
> That said, I will follow up with a new version of the patch removing the
> WRITE_ONCE() and using uintptr_t instead of unsigned long.

Sounds good!

							Thanx, Paul


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [dm-devel] [PATCH 1/3] list_bl: Add hlist_bl_add_before/behind helpers
  2019-03-14 14:07           ` [dm-devel] " Paul E. McKenney
@ 2019-03-14 15:03             ` Paul E. McKenney
  2019-03-17 11:52               ` Nikos Tsironis
  2019-03-14 17:01             ` Nikos Tsironis
  1 sibling, 1 reply; 26+ messages in thread
From: Paul E. McKenney @ 2019-03-14 15:03 UTC (permalink / raw)
  To: Nikos Tsironis
  Cc: Mike Snitzer, hch, agk, dm-devel, mpatocka, iliastsi, linux-kernel

On Thu, Mar 14, 2019 at 07:07:50AM -0700, Paul E. McKenney wrote:
> On Thu, Mar 14, 2019 at 03:28:23PM +0200, Nikos Tsironis wrote:
> > On 3/14/19 2:30 AM, Mike Snitzer wrote:
> > > On Wed, Mar 13 2019 at  7:48pm -0400,
> > > Paul E. McKenney <paulmck@linux.ibm.com> wrote:
> > > 
> > Hi Paul,
> > 
> > Thanks a lot for your feedback!
> 
> NP, and apologies for the delay.
> 
> > >> On Thu, Feb 28, 2019 at 04:32:02PM -0500, Mike Snitzer wrote:
> > >>> On Thu, Dec 20 2018 at  1:06pm -0500,
> > >>> Nikos Tsironis <ntsironis@arrikto.com> wrote:
> > >>>
> > >>>> Add hlist_bl_add_before/behind helpers to add an element before/after an
> > >>>> existing element in a bl_list.
> > >>>>
> > >>>> Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
> > >>>> Signed-off-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
> > >>>> ---
> > >>>>  include/linux/list_bl.h | 27 +++++++++++++++++++++++++++
> > >>>>  1 file changed, 27 insertions(+)
> > >>>>
> > >>>> diff --git a/include/linux/list_bl.h b/include/linux/list_bl.h
> > >>>> index 3fc2cc57ba1b..2fd918e5fd48 100644
> > >>>> --- a/include/linux/list_bl.h
> > >>>> +++ b/include/linux/list_bl.h
> > >>>> @@ -86,6 +86,33 @@ static inline void hlist_bl_add_head(struct hlist_bl_node *n,
> > >>>>  	hlist_bl_set_first(h, n);
> > >>>>  }
> > >>>>  
> > >>>> +static inline void hlist_bl_add_before(struct hlist_bl_node *n,
> > >>>> +				       struct hlist_bl_node *next)
> > >>>> +{
> > >>>> +	struct hlist_bl_node **pprev = next->pprev;
> > >>>> +
> > >>>> +	n->pprev = pprev;
> > >>>> +	n->next = next;
> > >>>> +	next->pprev = &n->next;
> > >>>> +
> > >>>> +	/* pprev may be `first`, so be careful not to lose the lock bit */
> > >>>> +	WRITE_ONCE(*pprev,
> > >>>> +		   (struct hlist_bl_node *)
> > >>>> +			((unsigned long)n |
> > >>>> +			 ((unsigned long)*pprev & LIST_BL_LOCKMASK)));
> > >>
> > >> A nit, but use of uintptr_t shrinks things a bit:
> > >>
> > >> +		   (struct hlist_bl_node *)
> > >> +			((uintptr_t)n | ((uintptr_t)*pprev & LIST_BL_LOCKMASK)));
> > >>
> > >> I am not too concerned about this, though.
> > > 
> > > I'm fine with folding in your suggestion.
> > 
> > Indeed, this looks better.
> > 
> > >> The WRITE_ONCE() is to handle races with hlist_bl_empty() (which does contain
> > >> the corresponding READ_ONCE()) correct?
> > > 
> > > Correct.
> > 
> > Yes that's correct.
> > 
> > >>>> +}
> > >>>> +
> > >>>> +static inline void hlist_bl_add_behind(struct hlist_bl_node *n,
> > >>>> +				       struct hlist_bl_node *prev)
> > >>>> +{
> > >>>> +	n->next = prev->next;
> > >>>> +	n->pprev = &prev->next;
> > >>>> +	WRITE_ONCE(prev->next, n);
> > >>
> > >> I don't see what this WRITE_ONCE() is interacting with.  The traversals
> > >> use plain C-language reads, and hlist_bl_empty() can't get here.  All
> > >> uses of hlist_bl_for_each_entry() invoke hlist_bl_lock() before starting
> > >> the traversal, and hlist_bl_for_each_entry_safe() looks to be unused.
> > >> (Perhaps it should be removed?  Or is there some anticipated use?)
> > 
> > I am using hlist_bl_for_each_entry_safe() in this proposed patch for
> > dm-snapshot: https://patchwork.kernel.org/patch/10835709/
> 
> Probably should keep it, then.  ;-)
> 
> > >>
> > >> I don't believe that the WRITE_ONCE() is needed.  What am I missing?
> > >>
> > >> Other than that, looks good.
> > >>
> > >> 							Thanx, Paul
> > >>
> > > 
> > > I'd imagine it was just born out of symmetry with hlist_bl_add_before()
> > > and/or caution.  But let's see what Nikos has to say.
> > 
> > I also don't believe that this WRITE_ONCE() is needed. But, looking at
> > hlist_add_behind() in include/linux/list.h, which, if I am not missing
> > something, is used in the same way as hlist_bl_add_behind(), it also
> > uses WRITE_ONCE() to update prev->next:
> > 
> > static inline void hlist_add_behind(struct hlist_node *n,
> > 				    struct hlist_node *prev)
> > {
> > 	n->next = prev->next;
> > 	WRITE_ONCE(prev->next, n);
> > 	n->pprev = &prev->next;
> > 
> > 	if (n->next)
> > 		n->next->pprev  = &n->next;
> > }
> > 
> > Could it be the case that the WRITE_ONCE() in hlist_add_behind() is also
> > not needed? This WRITE_ONCE() was introduced by commit 1c97be677f72b3
> > ("list: Use WRITE_ONCE() when adding to lists and hlists").
> 
> Looks like I have no one to blame but myself!
> 
> Would you like to remove that as part of your patch series?
> 
> > But, since I am not an expert in lockless programming, I opted to be on
> > the safe side and followed the example of hlist_add_behind().
> > 
> > That said, I will follow up with a new version of the patch removing the
> > WRITE_ONCE() and using uintptr_t instead of unsigned long.
> 
> Sounds good!

Oh, and of course intptr_t is one character shorter than uintptr_t, and
looks to work just as well in this context.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 26+ messages in thread
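
Folding the uintptr_t suggestion above into the helper from the patch gives roughly
the following; this is a sketch of the discussed end result, not the committed code.

static inline void hlist_bl_add_before(struct hlist_bl_node *n,
				       struct hlist_bl_node *next)
{
	struct hlist_bl_node **pprev = next->pprev;

	n->pprev = pprev;
	n->next = next;
	next->pprev = &n->next;

	/* pprev may be `first`, so be careful not to lose the lock bit */
	WRITE_ONCE(*pprev,
		   (struct hlist_bl_node *)
			((uintptr_t)n | ((uintptr_t)*pprev & LIST_BL_LOCKMASK)));
}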

* Re: [dm-devel] [PATCH 1/3] list_bl: Add hlist_bl_add_before/behind helpers
  2019-03-14 14:07           ` [dm-devel] " Paul E. McKenney
  2019-03-14 15:03             ` Paul E. McKenney
@ 2019-03-14 17:01             ` Nikos Tsironis
  1 sibling, 0 replies; 26+ messages in thread
From: Nikos Tsironis @ 2019-03-14 17:01 UTC (permalink / raw)
  To: paulmck
  Cc: Mike Snitzer, linux-kernel, iliastsi, hch, dm-devel, mpatocka, agk

On 3/14/19 4:07 PM, Paul E. McKenney wrote:
> On Thu, Mar 14, 2019 at 03:28:23PM +0200, Nikos Tsironis wrote:
>> On 3/14/19 2:30 AM, Mike Snitzer wrote:
>>> On Wed, Mar 13 2019 at  7:48pm -0400,
>>> Paul E. McKenney <paulmck@linux.ibm.com> wrote:
>>>
>> Hi Paul,
>>
>> Thanks a lot for your feedback!
> 
> NP, and apologies for the delay.
> 
>>>> On Thu, Feb 28, 2019 at 04:32:02PM -0500, Mike Snitzer wrote:
>>>>> On Thu, Dec 20 2018 at  1:06pm -0500,
>>>>> Nikos Tsironis <ntsironis@arrikto.com> wrote:
>>>>>
>>>>>> Add hlist_bl_add_before/behind helpers to add an element before/after an
>>>>>> existing element in a bl_list.
>>>>>>
>>>>>> Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
>>>>>> Signed-off-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
>>>>>> ---
>>>>>>  include/linux/list_bl.h | 27 +++++++++++++++++++++++++++
>>>>>>  1 file changed, 27 insertions(+)
>>>>>>
>>>>>> diff --git a/include/linux/list_bl.h b/include/linux/list_bl.h
>>>>>> index 3fc2cc57ba1b..2fd918e5fd48 100644
>>>>>> --- a/include/linux/list_bl.h
>>>>>> +++ b/include/linux/list_bl.h
>>>>>> @@ -86,6 +86,33 @@ static inline void hlist_bl_add_head(struct hlist_bl_node *n,
>>>>>>  	hlist_bl_set_first(h, n);
>>>>>>  }
>>>>>>  
>>>>>> +static inline void hlist_bl_add_before(struct hlist_bl_node *n,
>>>>>> +				       struct hlist_bl_node *next)
>>>>>> +{
>>>>>> +	struct hlist_bl_node **pprev = next->pprev;
>>>>>> +
>>>>>> +	n->pprev = pprev;
>>>>>> +	n->next = next;
>>>>>> +	next->pprev = &n->next;
>>>>>> +
>>>>>> +	/* pprev may be `first`, so be careful not to lose the lock bit */
>>>>>> +	WRITE_ONCE(*pprev,
>>>>>> +		   (struct hlist_bl_node *)
>>>>>> +			((unsigned long)n |
>>>>>> +			 ((unsigned long)*pprev & LIST_BL_LOCKMASK)));
>>>>
>>>> A nit, but use of uintptr_t shrinks things a bit:
>>>>
>>>> +		   (struct hlist_bl_node *)
>>>> +			((uintptr_t)n | ((uintptr_t)*pprev & LIST_BL_LOCKMASK)));
>>>>
>>>> I am not too concerned about this, though.
>>>
>>> I'm fine with folding in your suggestion.
>>
>> Indeed, this looks better.
>>
>>>> The WRITE_ONCE() is to handle races with hlist_bl_empty() (which does contain
>>>> the corresponding READ_ONCE()) correct?
>>>
>>> Correct.
>>
>> Yes that's correct.
>>
>>>>>> +}
>>>>>> +
>>>>>> +static inline void hlist_bl_add_behind(struct hlist_bl_node *n,
>>>>>> +				       struct hlist_bl_node *prev)
>>>>>> +{
>>>>>> +	n->next = prev->next;
>>>>>> +	n->pprev = &prev->next;
>>>>>> +	WRITE_ONCE(prev->next, n);
>>>>
>>>> I don't see what this WRITE_ONCE() is interacting with.  The traversals
>>>> use plain C-language reads, and hlist_bl_empty() can't get here.  All
>>>> uses of hlist_bl_for_each_entry() invoke hlist_bl_lock() before starting
>>>> the traversal, and hlist_bl_for_each_entry_safe() looks to be unused.
>>>> (Perhaps it should be removed?  Or is there some anticipated use?)
>>
>> I am using hlist_bl_for_each_entry_safe() in this proposed patch for
>> dm-snapshot: https://patchwork.kernel.org/patch/10835709/
> 
> Probably should keep it, then.  ;-)
> 
>>>>
>>>> I don't believe that the WRITE_ONCE() is needed.  What am I missing?
>>>>
>>>> Other than that, looks good.
>>>>
>>>> 							Thanx, Paul
>>>>
>>>
>>> I'd imagine it was just born out of symmetry with hlist_bl_add_before()
>>> and/or caution.  But let's see what Nikos has to say.
>>
>> I also don't believe that this WRITE_ONCE() is needed. But, looking at
>> hlist_add_behind() in include/linux/list.h, which, if I am not missing
>> something, is used in the same way as hlist_bl_add_behind(), it also
>> uses WRITE_ONCE() to update prev->next:
>>
>> static inline void hlist_add_behind(struct hlist_node *n,
>> 				    struct hlist_node *prev)
>> {
>> 	n->next = prev->next;
>> 	WRITE_ONCE(prev->next, n);
>> 	n->pprev = &prev->next;
>>
>> 	if (n->next)
>> 		n->next->pprev  = &n->next;
>> }
>>
>> Could it be the case that the WRITE_ONCE() in hlist_add_behind() is also
>> not needed? This WRITE_ONCE() was introduced by commit 1c97be677f72b3
>> ("list: Use WRITE_ONCE() when adding to lists and hlists").
> 
> Looks like I have no one to blame but myself!
> 
> Would you like to remove that as part of your patch series?

Yes, of course. I will add an extra patch removing the WRITE_ONCE() from
hlist_add_behind().

Thanks,
Nikos

> 
>> But, since I am not an expert in lockless programming, I opted to be on
>> the safe side and followed the example of hlist_add_behind().
>>
>> That said, I will follow up with a new version of the patch removing the
>> WRITE_ONCE() and using uintptr_t instead of unsigned long.
> 
> Sounds good!
> 
> 							Thanx, Paul
> 
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread
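
The extra patch promised above would amount to turning the WRITE_ONCE() in
hlist_add_behind() back into a plain store, roughly as sketched below. This is a
sketch of the proposed change, not a claim about what was eventually merged.

static inline void hlist_add_behind(struct hlist_node *n,
				    struct hlist_node *prev)
{
	n->next = prev->next;
	prev->next = n;		/* plain store, as discussed above */
	n->pprev = &prev->next;

	if (n->next)
		n->next->pprev = &n->next;
}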

* Re: [dm-devel] [PATCH 1/3] list_bl: Add hlist_bl_add_before/behind helpers
  2019-03-14 15:03             ` Paul E. McKenney
@ 2019-03-17 11:52               ` Nikos Tsironis
  2019-03-18 17:16                 ` Paul E. McKenney
  0 siblings, 1 reply; 26+ messages in thread
From: Nikos Tsironis @ 2019-03-17 11:52 UTC (permalink / raw)
  To: paulmck
  Cc: Mike Snitzer, hch, agk, dm-devel, mpatocka, iliastsi, linux-kernel

On 3/14/19 5:03 PM, Paul E. McKenney wrote:
> On Thu, Mar 14, 2019 at 07:07:50AM -0700, Paul E. McKenney wrote:
>> On Thu, Mar 14, 2019 at 03:28:23PM +0200, Nikos Tsironis wrote:
>>> On 3/14/19 2:30 AM, Mike Snitzer wrote:
>>>> On Wed, Mar 13 2019 at  7:48pm -0400,
>>>> Paul E. McKenney <paulmck@linux.ibm.com> wrote:
>>>>
>>> Hi Paul,
>>>
>>> Thanks a lot for your feedback!
>>
>> NP, and apologies for the delay.
>>
>>>>> On Thu, Feb 28, 2019 at 04:32:02PM -0500, Mike Snitzer wrote:
>>>>>> On Thu, Dec 20 2018 at  1:06pm -0500,
>>>>>> Nikos Tsironis <ntsironis@arrikto.com> wrote:
>>>>>>
>>>>>>> Add hlist_bl_add_before/behind helpers to add an element before/after an
>>>>>>> existing element in a bl_list.
>>>>>>>
>>>>>>> Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
>>>>>>> Signed-off-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
>>>>>>> ---
>>>>>>>  include/linux/list_bl.h | 27 +++++++++++++++++++++++++++
>>>>>>>  1 file changed, 27 insertions(+)
>>>>>>>
>>>>>>> diff --git a/include/linux/list_bl.h b/include/linux/list_bl.h
>>>>>>> index 3fc2cc57ba1b..2fd918e5fd48 100644
>>>>>>> --- a/include/linux/list_bl.h
>>>>>>> +++ b/include/linux/list_bl.h
>>>>>>> @@ -86,6 +86,33 @@ static inline void hlist_bl_add_head(struct hlist_bl_node *n,
>>>>>>>  	hlist_bl_set_first(h, n);
>>>>>>>  }
>>>>>>>  
>>>>>>> +static inline void hlist_bl_add_before(struct hlist_bl_node *n,
>>>>>>> +				       struct hlist_bl_node *next)
>>>>>>> +{
>>>>>>> +	struct hlist_bl_node **pprev = next->pprev;
>>>>>>> +
>>>>>>> +	n->pprev = pprev;
>>>>>>> +	n->next = next;
>>>>>>> +	next->pprev = &n->next;
>>>>>>> +
>>>>>>> +	/* pprev may be `first`, so be careful not to lose the lock bit */
>>>>>>> +	WRITE_ONCE(*pprev,
>>>>>>> +		   (struct hlist_bl_node *)
>>>>>>> +			((unsigned long)n |
>>>>>>> +			 ((unsigned long)*pprev & LIST_BL_LOCKMASK)));
>>>>>
>>>>> A nit, but use of uintptr_t shrinks things a bit:
>>>>>
>>>>> +		   (struct hlist_bl_node *)
>>>>> +			((uintptr_t)n | ((uintptr_t)*pprev & LIST_BL_LOCKMASK)));
>>>>>
>>>>> I am not too concerned about this, though.
>>>>
>>>> I'm fine with folding in your suggestion.
>>>
>>> Indeed, this looks better.
>>>
>>>>> The WRITE_ONCE() is to handle races with hlist_bl_empty() (which does contain
>>>>> the corresponding READ_ONCE()) correct?
>>>>
>>>> Correct.
>>>
>>> Yes that's correct.
>>>
>>>>>>> +}
>>>>>>> +
>>>>>>> +static inline void hlist_bl_add_behind(struct hlist_bl_node *n,
>>>>>>> +				       struct hlist_bl_node *prev)
>>>>>>> +{
>>>>>>> +	n->next = prev->next;
>>>>>>> +	n->pprev = &prev->next;
>>>>>>> +	WRITE_ONCE(prev->next, n);
>>>>>
>>>>> I don't see what this WRITE_ONCE() is interacting with.  The traversals
>>>>> use plain C-language reads, and hlist_bl_empty() can't get here.  All
>>>>> uses of hlist_bl_for_each_entry() invoke hlist_bl_lock() before starting
>>>>> the traversal, and hlist_bl_for_each_entry_safe() looks to be unused.
>>>>> (Perhaps it should be removed?  Or is there some anticipated use?)
>>>
>>> I am using hlist_bl_for_each_entry_safe() in this proposed patch for
>>> dm-snapshot: https://patchwork.kernel.org/patch/10835709/
>>
>> Probably should keep it, then.  ;-)
>>
>>>>>
>>>>> I don't believe that the WRITE_ONCE() is needed.  What am I missing?
>>>>>
>>>>> Other than that, looks good.
>>>>>
>>>>> 							Thanx, Paul
>>>>>
>>>>
>>>> I'd imagine it was just born out of symmetry with hlist_bl_add_before()
>>>> and/or caution.  But let's see what Nikos has to say.
>>>
>>> I also don't believe that this WRITE_ONCE() is needed. But, looking at
>>> hlist_add_behind() in include/linux/list.h, which, if I am not missing
>>> something, is used in the same way as hlist_bl_add_behind(), it also
>>> uses WRITE_ONCE() to update prev->next:
>>>
>>> static inline void hlist_add_behind(struct hlist_node *n,
>>> 				    struct hlist_node *prev)
>>> {
>>> 	n->next = prev->next;
>>> 	WRITE_ONCE(prev->next, n);
>>> 	n->pprev = &prev->next;
>>>
>>> 	if (n->next)
>>> 		n->next->pprev  = &n->next;
>>> }
>>>
>>> Could it be the case that the WRITE_ONCE() in hlist_add_behind() is also
>>> not needed? This WRITE_ONCE() was introduced by commit 1c97be677f72b3
>>> ("list: Use WRITE_ONCE() when adding to lists and hlists").
>>
>> Looks like I have no one to blame but myself!
>>
>> Would you like to remove that as part of your patch series?
>>
>>> But, since I am not an expert in lockless programming, I opted to be on
>>> the safe side and followed the example of hlist_add_behind().
>>>
>>> That said, I will follow up with a new version of the patch removing the
>>> WRITE_ONCE() and using uintptr_t instead of unsigned long.
>>
>> Sounds good!
> 
> Oh, and of course intptr_t is one character shorter than uintptr_t, and
> looks to work just as well in this context.  ;-)
> 
> 							Thanx, Paul
> 


Hi Paul,

Sorry for the late reply.

intptr_t seems to be defined only in a header file under arch/mips, so I
will stick to uintptr_t.

Thanks,
Nikos

^ permalink raw reply	[flat|nested] 26+ messages in thread
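
For reference, the hlist_bl_lock()/hlist_bl_unlock() calls referred to in the
discussion take a bit spinlock on bit 0 of the bucket head pointer; in
include/linux/list_bl.h they look roughly like this (shown for context, exact code
may differ between kernel versions):

static inline void hlist_bl_lock(struct hlist_bl_head *b)
{
	bit_spin_lock(0, (unsigned long *)b);
}

static inline void hlist_bl_unlock(struct hlist_bl_head *b)
{
	__bit_spin_unlock(0, (unsigned long *)b);
}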

* Re: [dm-devel] [PATCH 1/3] list_bl: Add hlist_bl_add_before/behind helpers
  2019-03-17 11:52               ` Nikos Tsironis
@ 2019-03-18 17:16                 ` Paul E. McKenney
  2019-03-20 20:25                   ` Nikos Tsironis
  0 siblings, 1 reply; 26+ messages in thread
From: Paul E. McKenney @ 2019-03-18 17:16 UTC (permalink / raw)
  To: Nikos Tsironis
  Cc: Mike Snitzer, hch, agk, dm-devel, mpatocka, iliastsi, linux-kernel

On Sun, Mar 17, 2019 at 01:52:50PM +0200, Nikos Tsironis wrote:
> On 3/14/19 5:03 PM, Paul E. McKenney wrote:
> > On Thu, Mar 14, 2019 at 07:07:50AM -0700, Paul E. McKenney wrote:
> >> On Thu, Mar 14, 2019 at 03:28:23PM +0200, Nikos Tsironis wrote:
> >>> On 3/14/19 2:30 AM, Mike Snitzer wrote:
> >>>> On Wed, Mar 13 2019 at  7:48pm -0400,
> >>>> Paul E. McKenney <paulmck@linux.ibm.com> wrote:
> >>>>
> >>> Hi Paul,
> >>>
> >>> Thanks a lot for your feedback!
> >>
> >> NP, and apologies for the delay.
> >>
> >>>>> On Thu, Feb 28, 2019 at 04:32:02PM -0500, Mike Snitzer wrote:
> >>>>>> On Thu, Dec 20 2018 at  1:06pm -0500,
> >>>>>> Nikos Tsironis <ntsironis@arrikto.com> wrote:
> >>>>>>
> >>>>>>> Add hlist_bl_add_before/behind helpers to add an element before/after an
> >>>>>>> existing element in a bl_list.
> >>>>>>>
> >>>>>>> Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
> >>>>>>> Signed-off-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
> >>>>>>> ---
> >>>>>>>  include/linux/list_bl.h | 27 +++++++++++++++++++++++++++
> >>>>>>>  1 file changed, 27 insertions(+)
> >>>>>>>
> >>>>>>> diff --git a/include/linux/list_bl.h b/include/linux/list_bl.h
> >>>>>>> index 3fc2cc57ba1b..2fd918e5fd48 100644
> >>>>>>> --- a/include/linux/list_bl.h
> >>>>>>> +++ b/include/linux/list_bl.h
> >>>>>>> @@ -86,6 +86,33 @@ static inline void hlist_bl_add_head(struct hlist_bl_node *n,
> >>>>>>>  	hlist_bl_set_first(h, n);
> >>>>>>>  }
> >>>>>>>  
> >>>>>>> +static inline void hlist_bl_add_before(struct hlist_bl_node *n,
> >>>>>>> +				       struct hlist_bl_node *next)
> >>>>>>> +{
> >>>>>>> +	struct hlist_bl_node **pprev = next->pprev;
> >>>>>>> +
> >>>>>>> +	n->pprev = pprev;
> >>>>>>> +	n->next = next;
> >>>>>>> +	next->pprev = &n->next;
> >>>>>>> +
> >>>>>>> +	/* pprev may be `first`, so be careful not to lose the lock bit */
> >>>>>>> +	WRITE_ONCE(*pprev,
> >>>>>>> +		   (struct hlist_bl_node *)
> >>>>>>> +			((unsigned long)n |
> >>>>>>> +			 ((unsigned long)*pprev & LIST_BL_LOCKMASK)));
> >>>>>
> >>>>> A nit, but use of uintptr_t shrinks things a bit:
> >>>>>
> >>>>> +		   (struct hlist_bl_node *)
> >>>>> +			((uintptr_t)n | ((uintptr_t)*pprev & LIST_BL_LOCKMASK)));
> >>>>>
> >>>>> I am not too concerned about this, though.
> >>>>
> >>>> I'm fine with folding in your suggestion.
> >>>
> >>> Indeed, this looks better.
> >>>
> >>>>> The WRITE_ONCE() is to handle races with hlist_bl_empty() (which does contain
> >>>>> the corresponding READ_ONCE()) correct?
> >>>>
> >>>> Correct.
> >>>
> >>> Yes that's correct.
> >>>
> >>>>>>> +}
> >>>>>>> +
> >>>>>>> +static inline void hlist_bl_add_behind(struct hlist_bl_node *n,
> >>>>>>> +				       struct hlist_bl_node *prev)
> >>>>>>> +{
> >>>>>>> +	n->next = prev->next;
> >>>>>>> +	n->pprev = &prev->next;
> >>>>>>> +	WRITE_ONCE(prev->next, n);
> >>>>>
> >>>>> I don't see what this WRITE_ONCE() is interacting with.  The traversals
> >>>>> use plain C-language reads, and hlist_bl_empty() can't get here.  All
> >>>>> uses of hlist_bl_for_each_entry() invoke hlist_bl_lock() before starting
> >>>>> the traversal, and hlist_bl_for_each_entry_safe() looks to be unused.
> >>>>> (Perhaps it should be removed?  Or is there some anticipated use?)
> >>>
> >>> I am using hlist_bl_for_each_entry_safe() in this proposed patch for
> >>> dm-snapshot: https://patchwork.kernel.org/patch/10835709/
> >>
> >> Probably should keep it, then.  ;-)
> >>
> >>>>>
> >>>>> I don't believe that the WRITE_ONCE() is needed.  What am I missing?
> >>>>>
> >>>>> Other than that, looks good.
> >>>>>
> >>>>> 							Thanx, Paul
> >>>>>
> >>>>
> >>>> I'd imagine it was just born out of symmetry with hlist_bl_add_before()
> >>>> and/or caution.  But let's see what Nikos has to say.
> >>>
> >>> I also don't believe that this WRITE_ONCE() is needed. But, looking at
> >>> hlist_add_behind() in include/linux/list.h, which, if I am not missing
> >>> something, is used in the same way as hlist_bl_add_behind(), it also
> >>> uses WRITE_ONCE() to update prev->next:
> >>>
> >>> static inline void hlist_add_behind(struct hlist_node *n,
> >>> 				    struct hlist_node *prev)
> >>> {
> >>> 	n->next = prev->next;
> >>> 	WRITE_ONCE(prev->next, n);
> >>> 	n->pprev = &prev->next;
> >>>
> >>> 	if (n->next)
> >>> 		n->next->pprev  = &n->next;
> >>> }
> >>>
> >>> Could it be the case that the WRITE_ONCE() in hlist_add_behind() is also
> >>> not needed? This WRITE_ONCE() was introduced by commit 1c97be677f72b3
> >>> ("list: Use WRITE_ONCE() when adding to lists and hlists").
> >>
> >> Looks like I have no one to blame but myself!
> >>
> >> Would you like to remove that as part of your patch series?
> >>
> >>> But, since I am not an expert in lockless programming, I opted to be on
> >>> the safe side and followed the example of hlist_add_behind().
> >>>
> >>> That said, I will follow up with a new version of the patch removing the
> >>> WRITE_ONCE() and using uintptr_t instead of unsigned long.
> >>
> >> Sounds good!
> > 
> > Oh, and of course intptr_t is one character shorter than uintptr_t, and
> > looks to work just as well in this context.  ;-)
> > 
> > 							Thanx, Paul
> > 
> 
> 
> Hi Paul,
> 
> Sorry for the late reply.
> 
> intptr_t seems to be defined only in a header file under arch/mips, so I
> will stick to uintptr_t.

Ah, apologies for the misdirection!  Hmmm...  Maybe intptr_t should be
added alongside uintptr_t?  Saving a character is saving a character.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 26+ messages in thread
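
The lockless reader that the WRITE_ONCE() in hlist_bl_add_before() pairs with is
hlist_bl_empty(), which reads the bucket head with READ_ONCE() and masks off the
lock bit. Roughly (again, exact code may differ between kernel versions):

static inline bool hlist_bl_empty(const struct hlist_bl_head *h)
{
	return !((unsigned long)READ_ONCE(h->first) & ~LIST_BL_LOCKMASK);
}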

* Re: [dm-devel] [PATCH 1/3] list_bl: Add hlist_bl_add_before/behind helpers
  2019-03-18 17:16                 ` Paul E. McKenney
@ 2019-03-20 20:25                   ` Nikos Tsironis
  0 siblings, 0 replies; 26+ messages in thread
From: Nikos Tsironis @ 2019-03-20 20:25 UTC (permalink / raw)
  To: paulmck
  Cc: Mike Snitzer, hch, agk, dm-devel, mpatocka, iliastsi, linux-kernel

On 3/18/19 7:16 PM, Paul E. McKenney wrote:
> On Sun, Mar 17, 2019 at 01:52:50PM +0200, Nikos Tsironis wrote:
>> On 3/14/19 5:03 PM, Paul E. McKenney wrote:
>>> On Thu, Mar 14, 2019 at 07:07:50AM -0700, Paul E. McKenney wrote:
>>>> On Thu, Mar 14, 2019 at 03:28:23PM +0200, Nikos Tsironis wrote:
>>>>> On 3/14/19 2:30 AM, Mike Snitzer wrote:
>>>>>> On Wed, Mar 13 2019 at  7:48pm -0400,
>>>>>> Paul E. McKenney <paulmck@linux.ibm.com> wrote:
>>>>>>
>>>>> Hi Paul,
>>>>>
>>>>> Thanks a lot for your feedback!
>>>>
>>>> NP, and apologies for the delay.
>>>>
>>>>>>> On Thu, Feb 28, 2019 at 04:32:02PM -0500, Mike Snitzer wrote:
>>>>>>>> On Thu, Dec 20 2018 at  1:06pm -0500,
>>>>>>>> Nikos Tsironis <ntsironis@arrikto.com> wrote:
>>>>>>>>
>>>>>>>>> Add hlist_bl_add_before/behind helpers to add an element before/after an
>>>>>>>>> existing element in a bl_list.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
>>>>>>>>> Signed-off-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
>>>>>>>>> ---
>>>>>>>>>  include/linux/list_bl.h | 27 +++++++++++++++++++++++++++
>>>>>>>>>  1 file changed, 27 insertions(+)
>>>>>>>>>
>>>>>>>>> diff --git a/include/linux/list_bl.h b/include/linux/list_bl.h
>>>>>>>>> index 3fc2cc57ba1b..2fd918e5fd48 100644
>>>>>>>>> --- a/include/linux/list_bl.h
>>>>>>>>> +++ b/include/linux/list_bl.h
>>>>>>>>> @@ -86,6 +86,33 @@ static inline void hlist_bl_add_head(struct hlist_bl_node *n,
>>>>>>>>>  	hlist_bl_set_first(h, n);
>>>>>>>>>  }
>>>>>>>>>  
>>>>>>>>> +static inline void hlist_bl_add_before(struct hlist_bl_node *n,
>>>>>>>>> +				       struct hlist_bl_node *next)
>>>>>>>>> +{
>>>>>>>>> +	struct hlist_bl_node **pprev = next->pprev;
>>>>>>>>> +
>>>>>>>>> +	n->pprev = pprev;
>>>>>>>>> +	n->next = next;
>>>>>>>>> +	next->pprev = &n->next;
>>>>>>>>> +
>>>>>>>>> +	/* pprev may be `first`, so be careful not to lose the lock bit */
>>>>>>>>> +	WRITE_ONCE(*pprev,
>>>>>>>>> +		   (struct hlist_bl_node *)
>>>>>>>>> +			((unsigned long)n |
>>>>>>>>> +			 ((unsigned long)*pprev & LIST_BL_LOCKMASK)));
>>>>>>>
>>>>>>> A nit, but use of uintptr_t shrinks things a bit:
>>>>>>>
>>>>>>> +		   (struct hlist_bl_node *)
>>>>>>> +			((uintptr_t)n | ((uintptr_t)*pprev & LIST_BL_LOCKMASK)));
>>>>>>>
>>>>>>> I am not too concerned about this, though.
>>>>>>
>>>>>> I'm fine with folding in your suggestion.
>>>>>
>>>>> Indeed, this looks better.
>>>>>
>>>>>>> The WRITE_ONCE() is to handle races with hlist_bl_empty() (which does contain
>>>>>>> the corresponding READ_ONCE()) correct?
>>>>>>
>>>>>> Correct.
>>>>>
>>>>> Yes that's correct.
>>>>>
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +static inline void hlist_bl_add_behind(struct hlist_bl_node *n,
>>>>>>>>> +				       struct hlist_bl_node *prev)
>>>>>>>>> +{
>>>>>>>>> +	n->next = prev->next;
>>>>>>>>> +	n->pprev = &prev->next;
>>>>>>>>> +	WRITE_ONCE(prev->next, n);
>>>>>>>
>>>>>>> I don't see what this WRITE_ONCE() is interacting with.  The traversals
>>>>>>> use plain C-language reads, and hlist_bl_empty() can't get here.  All
>>>>>>> uses of hlist_bl_for_each_entry() invoke hlist_bl_lock() before starting
>>>>>>> the traversal, and hlist_bl_for_each_entry_safe() looks to be unused.
>>>>>>> (Perhaps it should be removed?  Or is there some anticipated use?)
>>>>>
>>>>> I am using hlist_bl_for_each_entry_safe() in this proposed patch for
>>>>> dm-snapshot: https://patchwork.kernel.org/patch/10835709/
>>>>
>>>> Probably should keep it, then.  ;-)
>>>>
>>>>>>>
>>>>>>> I don't believe that the WRITE_ONCE() is needed.  What am I missing?
>>>>>>>
>>>>>>> Other than that, looks good.
>>>>>>>
>>>>>>> 							Thanx, Paul
>>>>>>>
>>>>>>
>>>>>> I'd imagine it was just born out of symmetry with hlist_bl_add_before()
>>>>>> and/or caution.  But let's see what Nikos has to say.
>>>>>
>>>>> I also don't believe that this WRITE_ONCE() is needed. But, looking at
>>>>> hlist_add_behind() in include/linux/list.h, which, if I am not missing
>>>>> something, is used in the same way as hlist_bl_add_behind(), it also
>>>>> uses WRITE_ONCE() to update prev->next:
>>>>>
>>>>> static inline void hlist_add_behind(struct hlist_node *n,
>>>>> 				    struct hlist_node *prev)
>>>>> {
>>>>> 	n->next = prev->next;
>>>>> 	WRITE_ONCE(prev->next, n);
>>>>> 	n->pprev = &prev->next;
>>>>>
>>>>> 	if (n->next)
>>>>> 		n->next->pprev  = &n->next;
>>>>> }
>>>>>
>>>>> Could it be the case that the WRITE_ONCE() in hlist_add_behind() is also
>>>>> not needed? This WRITE_ONCE() was introduced by commit 1c97be677f72b3
>>>>> ("list: Use WRITE_ONCE() when adding to lists and hlists").
>>>>
>>>> Looks like I have no one to blame but myself!
>>>>
>>>> Would you like to remove that as part of your patch series?
>>>>
>>>>> But, since I am not an expert in lockless programming, I opted to be on
>>>>> the safe side and followed the example of hlist_add_behind().
>>>>>
>>>>> That said, I will follow up with a new version of the patch removing the
>>>>> WRITE_ONCE() and using uintptr_t instead of unsigned long.
>>>>
>>>> Sounds good!
>>>
>>> Oh, and of course intptr_t is one character shorter than uintptr_t, and
>>> looks to work just as well in this context.  ;-)
>>>
>>> 							Thanx, Paul
>>>
>>
>>
>> Hi Paul,
>>
>> Sorry for the late reply.
>>
>> intptr_t seems to be defined only in a header file under arch/mips, so I
>> will stick to uintptr_t.
> 
> Ah, apologies for the misdirection!  Hmmm...  Maybe intptr_t should be
> added alongside uintptr_t?  Saving a character is saving a character.  ;-)
> 
> 							Thanx, Paul
> 

I will follow up with an unrelated patch to add intptr_t to
include/linux/types.h, so that it is available in subsequent patches
throughout the kernel.

Nikos

^ permalink raw reply	[flat|nested] 26+ messages in thread
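
The follow-up mentioned above would be a one-line addition next to the existing
uintptr_t typedef in include/linux/types.h; a sketch only, with the exact placement
and wording being assumptions rather than the actual patch:

typedef unsigned long	uintptr_t;	/* already present in types.h */
typedef long		intptr_t;	/* proposed addition */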

end of thread, other threads:[~2019-03-20 20:25 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-12-20 18:06 [PATCH 0/3] dm snapshot: Improve performance using a more fine-grained locking scheme Nikos Tsironis
2018-12-20 18:06 ` [PATCH 1/3] list_bl: Add hlist_bl_add_before/behind helpers Nikos Tsironis
2019-02-28 21:32   ` Mike Snitzer
2019-02-28 21:34     ` Mike Snitzer
2019-03-11 18:16     ` Christoph Hellwig
2019-03-11 22:13       ` Paul E. McKenney
2019-03-11 22:43         ` Mike Snitzer
2019-03-14  0:25           ` Paul E. McKenney
2019-03-13 23:48     ` Paul E. McKenney
2019-03-14  0:30       ` Mike Snitzer
2019-03-14 13:28         ` [dm-devel] " Nikos Tsironis
2019-03-14 13:28           ` Nikos Tsironis
2019-03-14 14:07           ` [dm-devel] " Paul E. McKenney
2019-03-14 15:03             ` Paul E. McKenney
2019-03-17 11:52               ` Nikos Tsironis
2019-03-18 17:16                 ` Paul E. McKenney
2019-03-20 20:25                   ` Nikos Tsironis
2019-03-14 17:01             ` Nikos Tsironis
2018-12-20 18:06 ` [PATCH 2/3] dm snapshot: Don't sleep holding the snapshot lock Nikos Tsironis
2018-12-20 18:06 ` [PATCH 3/3] dm snapshot: Use fine-grained locking scheme Nikos Tsironis
2019-02-28 21:37   ` Mikulas Patocka
2019-03-01  9:15     ` Nikos Tsironis
2019-02-18 14:22 ` [PATCH 0/3] dm snapshot: Improve performance using a more " Nikos Tsironis
2019-02-18 14:37   ` Mikulas Patocka
2019-02-18 15:15     ` Mike Snitzer
2019-02-19 11:04       ` Nikos Tsironis
