* [RFC PATCH 0/4] dm mpath: vastly improve blk-mq IO performance
@ 2016-03-31 20:04 Mike Snitzer
  2016-03-31 20:04 ` [RFC PATCH 1/4] dm mpath: switch to using bitops for state flags Mike Snitzer
                   ` (6 more replies)
  0 siblings, 7 replies; 26+ messages in thread
From: Mike Snitzer @ 2016-03-31 20:04 UTC (permalink / raw)
  To: dm-devel
  Cc: jmoyer, linux-block, linux-scsi, j-nomura, hare, sagig, Mike Snitzer

I developed these changes some weeks ago but have since focused on
regression and performance testing on larger NUMA systems.

For regression testing I've been using mptest:
https://github.com/snitm/mptest

For performance testing I've been using a null_blk device (with
various configuration permutations, e.g. pinning memory to a
particular NUMA node, and varied number of submit_queues).

By eliminating multipath's heavy use of the m->lock spinlock in the
fast IO paths serious performance improvements are realized.

Overview of performance test setup:
===================================

NULL_BLK_HW_QUEUES=12
NULL_BLK_QUEUE_DEPTH=4096

DM_MQ_HW_QUEUES=12
DM_MQ_QUEUE_DEPTH=2048

FIO_QUEUE_DEPTH=32
FIO_RUNTIME=10
FIO_NUMJOBS=12

NID=0

run_fio() {
    DEVICE=$1
    TASK_NAME=$(basename ${DEVICE})
    PERF_RECORD=$2
    RUN_CMD="${FIO} --numa_cpu_nodes=${NID} --numa_mem_policy=bind:${NID} --cpus_allowed_policy=split --group_reporting --rw=randread --bs=4k --numjobs=${FIO_NUMJOBS} \
              --iodepth=${FIO_QUEUE_DEPTH} --runtime=${FIO_RUNTIME} --time_based --loops=1 --ioengine=libaio \
              --direct=1 --invalidate=1 --randrepeat=1 --norandommap --exitall --name task_${TASK_NAME} --filename=${DEVICE}"
    ${RUN_CMD}
}

modprobe null_blk gb=4 bs=512 hw_queue_depth=${NULL_BLK_QUEUE_DEPTH} nr_devices=1 queue_mode=2 irqmode=1 completion_nsec=1 submit_queues=${NULL_BLK_HW_QUEUES}
run_fio /dev/nullb0

echo ${NID} > /sys/module/dm_mod/parameters/dm_numa_node
echo Y > /sys/module/dm_mod/parameters/use_blk_mq
echo ${DM_MQ_QUEUE_DEPTH} > /sys/module/dm_mod/parameters/dm_mq_queue_depth
echo ${DM_MQ_HW_QUEUES} > /sys/module/dm_mod/parameters/dm_mq_nr_hw_queues
echo "0 8388608 multipath 0 0 1 1 service-time 0 1 2 /dev/nullb0 1 1" | dmsetup create dm_mq
run_fio /dev/mapper/dm_mq
dmsetup remove dm_mq

echo "0 8388608 multipath 0 0 1 1 queue-length 0 1 1 /dev/nullb0 1" | dmsetup create dm_mq
run_fio /dev/mapper/dm_mq
dmsetup remove dm_mq

echo "0 8388608 multipath 0 0 1 1 round-robin 0 1 1 /dev/nullb0 1" | dmsetup create dm_mq
run_fio /dev/mapper/dm_mq
dmsetup remove dm_mq
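
For reference, the multipath tables above decode as follows per the
dm-mpath table format (shown for the service-time case; the queue-length
and round-robin tables differ only in the path selector and in having a
single path argument):

# 0 8388608         start sector, length in sectors (4GB null_blk / 512b sectors)
# multipath         target type
# 0 0               no features, no hardware handler args
# 1 1               one priority group, initial PG is #1
# service-time 0    path selector, no selector args
# 1 2               one path in the PG, two path arguments
# /dev/nullb0 1 1   the path: repeat_count=1, relative_throughput=1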

Test results on 4 NUMA node 192-way x86_64 system with 524G of memory:
======================================================================

The big picture is that the move to lockless really helps.

round-robin's repeat_count and percpu current_path code (which went
upstream during the 4.6 merge window) seems to _really_ help (even when
repeat_count is 1, as is the case in all these results).

Below, each named file (e.g. "result.lockless_pinned") holds a set of 4 results, in this order:
raw null_blk
service-time
queue-length
round-robin

The files with a trailing "_12" were run with:
NULL_BLK_HW_QUEUES=12
DM_MQ_HW_QUEUES=12
FIO_NUMJOBS=12

And the files without "_12" were run with:
NULL_BLK_HW_QUEUES=32
DM_MQ_HW_QUEUES=32
FIO_NUMJOBS=32

lockless: (this patchset applied)
*********
result.lockless_pinned:  read : io=236580MB, bw=23656MB/s, iops=6055.9K, runt= 10001msec
result.lockless_pinned:  read : io=108536MB, bw=10853MB/s, iops=2778.3K, runt= 10001msec
result.lockless_pinned:  read : io=106649MB, bw=10664MB/s, iops=2729.1K, runt= 10001msec
result.lockless_pinned:  read : io=162906MB, bw=16289MB/s, iops=4169.1K, runt= 10001msec

result.lockless_pinned_12:  read : io=165233MB, bw=16522MB/s, iops=4229.6K, runt= 10001msec
result.lockless_pinned_12:  read : io=96686MB, bw=9667.7MB/s, iops=2474.1K, runt= 10001msec
result.lockless_pinned_12:  read : io=97197MB, bw=9718.8MB/s, iops=2488.3K, runt= 10001msec
result.lockless_pinned_12:  read : io=104509MB, bw=10450MB/s, iops=2675.2K, runt= 10001msec

result.lockless_unpinned:  read : io=101525MB, bw=10151MB/s, iops=2598.8K, runt= 10001msec
result.lockless_unpinned:  read : io=61313MB, bw=6130.8MB/s, iops=1569.5K, runt= 10001msec
result.lockless_unpinned:  read : io=64892MB, bw=6488.6MB/s, iops=1661.8K, runt= 10001msec
result.lockless_unpinned:  read : io=78557MB, bw=7854.1MB/s, iops=2010.9K, runt= 10001msec

result.lockless_unpinned_12:  read : io=83455MB, bw=8344.7MB/s, iops=2136.3K, runt= 10001msec
result.lockless_unpinned_12:  read : io=50638MB, bw=5063.4MB/s, iops=1296.3K, runt= 10001msec
result.lockless_unpinned_12:  read : io=56103MB, bw=5609.8MB/s, iops=1436.1K, runt= 10001msec
result.lockless_unpinned_12:  read : io=56421MB, bw=5641.6MB/s, iops=1444.3K, runt= 10001msec

spinlock:
*********
result.spinlock_pinned:  read : io=236048MB, bw=23602MB/s, iops=6042.3K, runt= 10001msec
result.spinlock_pinned:  read : io=64657MB, bw=6465.4MB/s, iops=1655.5K, runt= 10001msec
result.spinlock_pinned:  read : io=67519MB, bw=6751.2MB/s, iops=1728.4K, runt= 10001msec
result.spinlock_pinned:  read : io=81409MB, bw=8140.4MB/s, iops=2083.9K, runt= 10001msec

result.spinlock_pinned_12:  read : io=159782MB, bw=15977MB/s, iops=4090.3K, runt= 10001msec
result.spinlock_pinned_12:  read : io=64368MB, bw=6436.2MB/s, iops=1647.7K, runt= 10001msec
result.spinlock_pinned_12:  read : io=67337MB, bw=6733.5MB/s, iops=1723.7K, runt= 10001msec
result.spinlock_pinned_12:  read : io=75453MB, bw=7544.6MB/s, iops=1931.5K, runt= 10001msec

result.spinlock_unpinned:  read : io=103267MB, bw=10326MB/s, iops=2643.4K, runt= 10001msec
result.spinlock_unpinned:  read : io=34751MB, bw=3474.8MB/s, iops=889526, runt= 10001msec
result.spinlock_unpinned:  read : io=34475MB, bw=3447.2MB/s, iops=882477, runt= 10001msec
result.spinlock_unpinned:  read : io=43793MB, bw=4378.1MB/s, iops=1121.0K, runt= 10001msec

result.spinlock_unpinned_12:  read : io=83573MB, bw=8356.5MB/s, iops=2139.3K, runt= 10001msec
result.spinlock_unpinned_12:  read : io=32715MB, bw=3271.2MB/s, iops=837414, runt= 10001msec
result.spinlock_unpinned_12:  read : io=34249MB, bw=3424.6MB/s, iops=876675, runt= 10001msec
result.spinlock_unpinned_12:  read : io=41486MB, bw=4148.3MB/s, iops=1061.1K, runt= 10001msec

Summary:
========
Pinning this test to a particular NUMA node helps, as does using more
queues/threads -- a nice advance, because DM mpath previously hit a
wall there.

What makes these favorable results possible is switching over to
bitops, atomic counters and lockless_dereference().
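
The resulting fast-path shape, condensed from patch 4 of this series
(nothing new here, just trimmed for illustration):

    /* __multipath_map() fast-path: no m->lock taken */
    pgpath = lockless_dereference(m->current_pgpath);
    if (!pgpath || !test_bit(MPATHF_QUEUE_IO, &m->flags))
        pgpath = choose_pgpath(m, nr_bytes);  /* slow-path; may take m->lock */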

Comparing result.lockless_pinned vs result.spinlock_pinned you can see
that this patchset delivers between a 40% and 50% improvement in IOPS
and bandwidth.

Jeff Moyer has been helping review these changes (and has graciously
labored over _really_ understanding all the concurrency at play in DM
mpath) -- his review isn't yet complete, but I wanted to get this
patchset out now to raise awareness about how I think DM multipath
will be changing (for inclusion during the Linux 4.7 merge window).

Mike Snitzer (4):
  dm mpath: switch to using bitops for state flags
  dm mpath: use atomic_t for counting members of 'struct multipath'
  dm mpath: move trigger_event member to the end of 'struct multipath'
  dm mpath: eliminate use of spinlock in IO fast-paths

 drivers/md/dm-mpath.c | 351 ++++++++++++++++++++++++++++----------------------
 1 file changed, 195 insertions(+), 156 deletions(-)

-- 
2.6.4 (Apple Git-63)


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC PATCH 1/4] dm mpath: switch to using bitops for state flags
  2016-03-31 20:04 [RFC PATCH 0/4] dm mpath: vastly improve blk-mq IO performance Mike Snitzer
@ 2016-03-31 20:04 ` Mike Snitzer
  2016-04-01  8:46   ` [dm-devel] " Johannes Thumshirn
  2016-04-07 14:59   ` Hannes Reinecke
  2016-03-31 20:04 ` [RFC PATCH 2/4] dm mpath: use atomic_t for counting members of 'struct multipath' Mike Snitzer
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 26+ messages in thread
From: Mike Snitzer @ 2016-03-31 20:04 UTC (permalink / raw)
  To: dm-devel
  Cc: jmoyer, linux-block, linux-scsi, j-nomura, hare, sagig, Mike Snitzer

Mechanical change that doesn't make any real effort to reduce the use of
m->lock; that will come later (once atomics are used for counters, etc).
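
To illustrate why this is a prerequisite (a sketch, not part of the
diff): the bool bitfields all share one machine word, so flipping any of
them safely requires m->lock, whereas the bitops are atomic
read-modify-write operations on a dedicated unsigned long:

    m->queue_io = true;                    /* before: only safe under m->lock */
    set_bit(MPATHF_QUEUE_IO, &m->flags);   /* after: atomic, lock not needed */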

Suggested-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 drivers/md/dm-mpath.c | 131 +++++++++++++++++++++++++++++---------------------
 1 file changed, 75 insertions(+), 56 deletions(-)

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 677ba22..598d4a1 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -83,13 +83,7 @@ struct multipath {
 	struct priority_group *current_pg;
 	struct priority_group *next_pg;	/* Switch to this PG if set */
 
-	bool queue_io:1;		/* Must we queue all I/O? */
-	bool queue_if_no_path:1;	/* Queue I/O if last path fails? */
-	bool saved_queue_if_no_path:1;	/* Saved state during suspension */
-	bool retain_attached_hw_handler:1; /* If there's already a hw_handler present, don't change it. */
-	bool pg_init_disabled:1;	/* pg_init is not currently allowed */
-	bool pg_init_required:1;	/* pg_init needs calling? */
-	bool pg_init_delay_retry:1;	/* Delay pg_init retry? */
+	unsigned long flags;		/* Multipath state flags */
 
 	unsigned pg_init_retries;	/* Number of times to retry pg_init */
 	unsigned pg_init_count;		/* Number of times pg_init called */
@@ -122,6 +116,17 @@ static struct workqueue_struct *kmultipathd, *kmpath_handlerd;
 static void trigger_event(struct work_struct *work);
 static void activate_path(struct work_struct *work);
 
+/*-----------------------------------------------
+ * Multipath state flags.
+ *-----------------------------------------------*/
+
+#define MPATHF_QUEUE_IO 0			/* Must we queue all I/O? */
+#define MPATHF_QUEUE_IF_NO_PATH 1		/* Queue I/O if last path fails? */
+#define MPATHF_SAVED_QUEUE_IF_NO_PATH 2		/* Saved state during suspension */
+#define MPATHF_RETAIN_ATTACHED_HW_HANDLER 3	/* If there's already a hw_handler present, don't change it. */
+#define MPATHF_PG_INIT_DISABLED 4		/* pg_init is not currently allowed */
+#define MPATHF_PG_INIT_REQUIRED 5		/* pg_init needs calling? */
+#define MPATHF_PG_INIT_DELAY_RETRY 6		/* Delay pg_init retry? */
 
 /*-----------------------------------------------
  * Allocation routines
@@ -189,7 +194,7 @@ static struct multipath *alloc_multipath(struct dm_target *ti, bool use_blk_mq)
 	if (m) {
 		INIT_LIST_HEAD(&m->priority_groups);
 		spin_lock_init(&m->lock);
-		m->queue_io = true;
+		set_bit(MPATHF_QUEUE_IO, &m->flags);
 		m->pg_init_delay_msecs = DM_PG_INIT_DELAY_DEFAULT;
 		INIT_WORK(&m->trigger_event, trigger_event);
 		init_waitqueue_head(&m->pg_init_wait);
@@ -274,17 +279,17 @@ static int __pg_init_all_paths(struct multipath *m)
 	struct pgpath *pgpath;
 	unsigned long pg_init_delay = 0;
 
-	if (m->pg_init_in_progress || m->pg_init_disabled)
+	if (m->pg_init_in_progress || test_bit(MPATHF_PG_INIT_DISABLED, &m->flags))
 		return 0;
 
 	m->pg_init_count++;
-	m->pg_init_required = false;
+	clear_bit(MPATHF_PG_INIT_REQUIRED, &m->flags);
 
 	/* Check here to reset pg_init_required */
 	if (!m->current_pg)
 		return 0;
 
-	if (m->pg_init_delay_retry)
+	if (test_bit(MPATHF_PG_INIT_DELAY_RETRY, &m->flags))
 		pg_init_delay = msecs_to_jiffies(m->pg_init_delay_msecs != DM_PG_INIT_DELAY_DEFAULT ?
 						 m->pg_init_delay_msecs : DM_PG_INIT_DELAY_MSECS);
 	list_for_each_entry(pgpath, &m->current_pg->pgpaths, list) {
@@ -304,11 +309,11 @@ static void __switch_pg(struct multipath *m, struct pgpath *pgpath)
 
 	/* Must we initialise the PG first, and queue I/O till it's ready? */
 	if (m->hw_handler_name) {
-		m->pg_init_required = true;
-		m->queue_io = true;
+		set_bit(MPATHF_PG_INIT_REQUIRED, &m->flags);
+		set_bit(MPATHF_QUEUE_IO, &m->flags);
 	} else {
-		m->pg_init_required = false;
-		m->queue_io = false;
+		clear_bit(MPATHF_PG_INIT_REQUIRED, &m->flags);
+		clear_bit(MPATHF_QUEUE_IO, &m->flags);
 	}
 
 	m->pg_init_count = 0;
@@ -337,7 +342,7 @@ static void __choose_pgpath(struct multipath *m, size_t nr_bytes)
 	bool bypassed = true;
 
 	if (!m->nr_valid_paths) {
-		m->queue_io = false;
+		clear_bit(MPATHF_QUEUE_IO, &m->flags);
 		goto failed;
 	}
 
@@ -365,7 +370,7 @@ static void __choose_pgpath(struct multipath *m, size_t nr_bytes)
 				continue;
 			if (!__choose_path_in_pg(m, pg, nr_bytes)) {
 				if (!bypassed)
-					m->pg_init_delay_retry = true;
+					set_bit(MPATHF_PG_INIT_DELAY_RETRY, &m->flags);
 				return;
 			}
 		}
@@ -389,8 +394,9 @@ failed:
  */
 static int __must_push_back(struct multipath *m)
 {
-	return (m->queue_if_no_path ||
-		(m->queue_if_no_path != m->saved_queue_if_no_path &&
+	return (test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags) ||
+		((test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags) !=
+		  test_bit(MPATHF_SAVED_QUEUE_IF_NO_PATH, &m->flags)) &&
 		 dm_noflush_suspending(m->ti)));
 }
 
@@ -411,7 +417,7 @@ static int __multipath_map(struct dm_target *ti, struct request *clone,
 	spin_lock_irq(&m->lock);
 
 	/* Do we need to select a new pgpath? */
-	if (!m->current_pgpath || !m->queue_io)
+	if (!m->current_pgpath || !test_bit(MPATHF_QUEUE_IO, &m->flags))
 		__choose_pgpath(m, nr_bytes);
 
 	pgpath = m->current_pgpath;
@@ -420,7 +426,8 @@ static int __multipath_map(struct dm_target *ti, struct request *clone,
 		if (!__must_push_back(m))
 			r = -EIO;	/* Failed */
 		goto out_unlock;
-	} else if (m->queue_io || m->pg_init_required) {
+	} else if (test_bit(MPATHF_QUEUE_IO, &m->flags) ||
+		   test_bit(MPATHF_PG_INIT_REQUIRED, &m->flags)) {
 		__pg_init_all_paths(m);
 		goto out_unlock;
 	}
@@ -503,11 +510,22 @@ static int queue_if_no_path(struct multipath *m, bool queue_if_no_path,
 
 	spin_lock_irqsave(&m->lock, flags);
 
-	if (save_old_value)
-		m->saved_queue_if_no_path = m->queue_if_no_path;
+	if (save_old_value) {
+		if (test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags))
+			set_bit(MPATHF_SAVED_QUEUE_IF_NO_PATH, &m->flags);
+		else
+			clear_bit(MPATHF_SAVED_QUEUE_IF_NO_PATH, &m->flags);
+	} else {
+		if (queue_if_no_path)
+			set_bit(MPATHF_SAVED_QUEUE_IF_NO_PATH, &m->flags);
+		else
+			clear_bit(MPATHF_SAVED_QUEUE_IF_NO_PATH, &m->flags);
+	}
+	if (queue_if_no_path)
+		set_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags);
 	else
-		m->saved_queue_if_no_path = queue_if_no_path;
-	m->queue_if_no_path = queue_if_no_path;
+		clear_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags);
+
 	spin_unlock_irqrestore(&m->lock, flags);
 
 	if (!queue_if_no_path)
@@ -600,10 +618,10 @@ static struct pgpath *parse_path(struct dm_arg_set *as, struct path_selector *ps
 		goto bad;
 	}
 
-	if (m->retain_attached_hw_handler || m->hw_handler_name)
+	if (test_bit(MPATHF_RETAIN_ATTACHED_HW_HANDLER, &m->flags) || m->hw_handler_name)
 		q = bdev_get_queue(p->path.dev->bdev);
 
-	if (m->retain_attached_hw_handler) {
+	if (test_bit(MPATHF_RETAIN_ATTACHED_HW_HANDLER, &m->flags)) {
 retain:
 		attached_handler_name = scsi_dh_attached_handler_name(q, GFP_KERNEL);
 		if (attached_handler_name) {
@@ -808,7 +826,7 @@ static int parse_features(struct dm_arg_set *as, struct multipath *m)
 		}
 
 		if (!strcasecmp(arg_name, "retain_attached_hw_handler")) {
-			m->retain_attached_hw_handler = true;
+			set_bit(MPATHF_RETAIN_ATTACHED_HW_HANDLER, &m->flags);
 			continue;
 		}
 
@@ -944,20 +962,16 @@ static void multipath_wait_for_pg_init_completion(struct multipath *m)
 
 static void flush_multipath_work(struct multipath *m)
 {
-	unsigned long flags;
-
-	spin_lock_irqsave(&m->lock, flags);
-	m->pg_init_disabled = true;
-	spin_unlock_irqrestore(&m->lock, flags);
+	set_bit(MPATHF_PG_INIT_DISABLED, &m->flags);
+	smp_mb__after_atomic();
 
 	flush_workqueue(kmpath_handlerd);
 	multipath_wait_for_pg_init_completion(m);
 	flush_workqueue(kmultipathd);
 	flush_work(&m->trigger_event);
 
-	spin_lock_irqsave(&m->lock, flags);
-	m->pg_init_disabled = false;
-	spin_unlock_irqrestore(&m->lock, flags);
+	clear_bit(MPATHF_PG_INIT_DISABLED, &m->flags);
+	smp_mb__after_atomic();
 }
 
 static void multipath_dtr(struct dm_target *ti)
@@ -1152,8 +1166,8 @@ static bool pg_init_limit_reached(struct multipath *m, struct pgpath *pgpath)
 
 	spin_lock_irqsave(&m->lock, flags);
 
-	if (m->pg_init_count <= m->pg_init_retries && !m->pg_init_disabled)
-		m->pg_init_required = true;
+	if (m->pg_init_count <= m->pg_init_retries && !test_bit(MPATHF_PG_INIT_DISABLED, &m->flags))
+		set_bit(MPATHF_PG_INIT_REQUIRED, &m->flags);
 	else
 		limit_reached = true;
 
@@ -1219,19 +1233,23 @@ static void pg_init_done(void *data, int errors)
 			m->current_pgpath = NULL;
 			m->current_pg = NULL;
 		}
-	} else if (!m->pg_init_required)
+	} else if (!test_bit(MPATHF_PG_INIT_REQUIRED, &m->flags))
 		pg->bypassed = false;
 
 	if (--m->pg_init_in_progress)
 		/* Activations of other paths are still on going */
 		goto out;
 
-	if (m->pg_init_required) {
-		m->pg_init_delay_retry = delay_retry;
+	if (test_bit(MPATHF_PG_INIT_REQUIRED, &m->flags)) {
+		if (delay_retry)
+			set_bit(MPATHF_PG_INIT_DELAY_RETRY, &m->flags);
+		else
+			clear_bit(MPATHF_PG_INIT_DELAY_RETRY, &m->flags);
+
 		if (__pg_init_all_paths(m))
 			goto out;
 	}
-	m->queue_io = false;
+	clear_bit(MPATHF_QUEUE_IO, &m->flags);
 
 	/*
 	 * Wake up any thread waiting to suspend.
@@ -1300,7 +1318,7 @@ static int do_end_io(struct multipath *m, struct request *clone,
 
 	spin_lock_irqsave(&m->lock, flags);
 	if (!m->nr_valid_paths) {
-		if (!m->queue_if_no_path) {
+		if (!test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags)) {
 			if (!__must_push_back(m))
 				r = -EIO;
 		} else {
@@ -1364,11 +1382,12 @@ static void multipath_postsuspend(struct dm_target *ti)
 static void multipath_resume(struct dm_target *ti)
 {
 	struct multipath *m = ti->private;
-	unsigned long flags;
 
-	spin_lock_irqsave(&m->lock, flags);
-	m->queue_if_no_path = m->saved_queue_if_no_path;
-	spin_unlock_irqrestore(&m->lock, flags);
+	if (test_bit(MPATHF_SAVED_QUEUE_IF_NO_PATH, &m->flags))
+		set_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags);
+	else
+		clear_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags);
+	smp_mb__after_atomic();
 }
 
 /*
@@ -1402,19 +1421,19 @@ static void multipath_status(struct dm_target *ti, status_type_t type,
 
 	/* Features */
 	if (type == STATUSTYPE_INFO)
-		DMEMIT("2 %u %u ", m->queue_io, m->pg_init_count);
+		DMEMIT("2 %u %u ", test_bit(MPATHF_QUEUE_IO, &m->flags), m->pg_init_count);
 	else {
-		DMEMIT("%u ", m->queue_if_no_path +
+		DMEMIT("%u ", test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags) +
 			      (m->pg_init_retries > 0) * 2 +
 			      (m->pg_init_delay_msecs != DM_PG_INIT_DELAY_DEFAULT) * 2 +
-			      m->retain_attached_hw_handler);
-		if (m->queue_if_no_path)
+			      test_bit(MPATHF_RETAIN_ATTACHED_HW_HANDLER, &m->flags));
+		if (test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags))
 			DMEMIT("queue_if_no_path ");
 		if (m->pg_init_retries)
 			DMEMIT("pg_init_retries %u ", m->pg_init_retries);
 		if (m->pg_init_delay_msecs != DM_PG_INIT_DELAY_DEFAULT)
 			DMEMIT("pg_init_delay_msecs %u ", m->pg_init_delay_msecs);
-		if (m->retain_attached_hw_handler)
+		if (test_bit(MPATHF_RETAIN_ATTACHED_HW_HANDLER, &m->flags))
 			DMEMIT("retain_attached_hw_handler ");
 	}
 
@@ -1572,7 +1591,7 @@ static int multipath_prepare_ioctl(struct dm_target *ti,
 		__choose_pgpath(m, 0);
 
 	if (m->current_pgpath) {
-		if (!m->queue_io) {
+		if (!test_bit(MPATHF_QUEUE_IO, &m->flags)) {
 			*bdev = m->current_pgpath->path.dev->bdev;
 			*mode = m->current_pgpath->path.dev->mode;
 			r = 0;
@@ -1582,7 +1601,7 @@ static int multipath_prepare_ioctl(struct dm_target *ti,
 		}
 	} else {
 		/* No path is available */
-		if (m->queue_if_no_path)
+		if (test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags))
 			r = -ENOTCONN;
 		else
 			r = -EIO;
@@ -1596,7 +1615,7 @@ static int multipath_prepare_ioctl(struct dm_target *ti,
 			/* Path status changed, redo selection */
 			__choose_pgpath(m, 0);
 		}
-		if (m->pg_init_required)
+		if (test_bit(MPATHF_PG_INIT_REQUIRED, &m->flags))
 			__pg_init_all_paths(m);
 		spin_unlock_irqrestore(&m->lock, flags);
 		dm_table_run_md_queue_async(m->ti->table);
@@ -1657,7 +1676,7 @@ static int multipath_busy(struct dm_target *ti)
 
 	/* pg_init in progress or no paths available */
 	if (m->pg_init_in_progress ||
-	    (!m->nr_valid_paths && m->queue_if_no_path)) {
+	    (!m->nr_valid_paths && test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags))) {
 		busy = true;
 		goto out;
 	}
-- 
2.6.4 (Apple Git-63)


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC PATCH 2/4] dm mpath: use atomic_t for counting members of 'struct multipath'
  2016-03-31 20:04 [RFC PATCH 0/4] dm mpath: vastly improve blk-mq IO performance Mike Snitzer
  2016-03-31 20:04 ` [RFC PATCH 1/4] dm mpath: switch to using bitops for state flags Mike Snitzer
@ 2016-03-31 20:04 ` Mike Snitzer
  2016-04-01  8:48   ` [dm-devel] " Johannes Thumshirn
  2016-04-07 15:02   ` Hannes Reinecke
  2016-03-31 20:04 ` [RFC PATCH 3/4] dm mpath: move trigger_event member to the end " Mike Snitzer
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 26+ messages in thread
From: Mike Snitzer @ 2016-03-31 20:04 UTC (permalink / raw)
  To: dm-devel
  Cc: jmoyer, linux-block, linux-scsi, j-nomura, hare, sagig, Mike Snitzer

The use of atomic_t for nr_valid_paths, pg_init_in_progress and
pg_init_count will allow relaxing the use of the m->lock spinlock.
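
One subtlety, visible in the reinstate_path() hunk below: the locked
pre-increment test is replaced with atomic_inc_return(), which preserves
the "act only on the 0 -> 1 transition" semantics:

    if (!m->nr_valid_paths++)                          /* before, under m->lock */
    if (atomic_inc_return(&m->nr_valid_paths) == 1)    /* after, atomic */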

Suggested-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 drivers/md/dm-mpath.c | 61 ++++++++++++++++++++++++++++-----------------------
 1 file changed, 33 insertions(+), 28 deletions(-)

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 598d4a1..780e5d0 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -76,9 +76,6 @@ struct multipath {
 
 	wait_queue_head_t pg_init_wait;	/* Wait for pg_init completion */
 
-	unsigned pg_init_in_progress;	/* Only one pg_init allowed at once */
-
-	unsigned nr_valid_paths;	/* Total number of usable paths */
 	struct pgpath *current_pgpath;
 	struct priority_group *current_pg;
 	struct priority_group *next_pg;	/* Switch to this PG if set */
@@ -86,9 +83,12 @@ struct multipath {
 	unsigned long flags;		/* Multipath state flags */
 
 	unsigned pg_init_retries;	/* Number of times to retry pg_init */
-	unsigned pg_init_count;		/* Number of times pg_init called */
 	unsigned pg_init_delay_msecs;	/* Number of msecs before pg_init retry */
 
+	atomic_t nr_valid_paths;	/* Total number of usable paths */
+	atomic_t pg_init_in_progress;	/* Only one pg_init allowed at once */
+	atomic_t pg_init_count;		/* Number of times pg_init called */
+
 	struct work_struct trigger_event;
 
 	/*
@@ -195,6 +195,9 @@ static struct multipath *alloc_multipath(struct dm_target *ti, bool use_blk_mq)
 		INIT_LIST_HEAD(&m->priority_groups);
 		spin_lock_init(&m->lock);
 		set_bit(MPATHF_QUEUE_IO, &m->flags);
+		atomic_set(&m->nr_valid_paths, 0);
+		atomic_set(&m->pg_init_in_progress, 0);
+		atomic_set(&m->pg_init_count, 0);
 		m->pg_init_delay_msecs = DM_PG_INIT_DELAY_DEFAULT;
 		INIT_WORK(&m->trigger_event, trigger_event);
 		init_waitqueue_head(&m->pg_init_wait);
@@ -279,10 +282,10 @@ static int __pg_init_all_paths(struct multipath *m)
 	struct pgpath *pgpath;
 	unsigned long pg_init_delay = 0;
 
-	if (m->pg_init_in_progress || test_bit(MPATHF_PG_INIT_DISABLED, &m->flags))
+	if (atomic_read(&m->pg_init_in_progress) || test_bit(MPATHF_PG_INIT_DISABLED, &m->flags))
 		return 0;
 
-	m->pg_init_count++;
+	atomic_inc(&m->pg_init_count);
 	clear_bit(MPATHF_PG_INIT_REQUIRED, &m->flags);
 
 	/* Check here to reset pg_init_required */
@@ -298,9 +301,9 @@ static int __pg_init_all_paths(struct multipath *m)
 			continue;
 		if (queue_delayed_work(kmpath_handlerd, &pgpath->activate_path,
 				       pg_init_delay))
-			m->pg_init_in_progress++;
+			atomic_inc(&m->pg_init_in_progress);
 	}
-	return m->pg_init_in_progress;
+	return atomic_read(&m->pg_init_in_progress);
 }
 
 static void __switch_pg(struct multipath *m, struct pgpath *pgpath)
@@ -316,7 +319,7 @@ static void __switch_pg(struct multipath *m, struct pgpath *pgpath)
 		clear_bit(MPATHF_QUEUE_IO, &m->flags);
 	}
 
-	m->pg_init_count = 0;
+	atomic_set(&m->pg_init_count, 0);
 }
 
 static int __choose_path_in_pg(struct multipath *m, struct priority_group *pg,
@@ -341,7 +344,7 @@ static void __choose_pgpath(struct multipath *m, size_t nr_bytes)
 	struct priority_group *pg;
 	bool bypassed = true;
 
-	if (!m->nr_valid_paths) {
+	if (!atomic_read(&m->nr_valid_paths)) {
 		clear_bit(MPATHF_QUEUE_IO, &m->flags);
 		goto failed;
 	}
@@ -902,6 +905,7 @@ static int multipath_ctr(struct dm_target *ti, unsigned int argc,
 	/* parse the priority groups */
 	while (as.argc) {
 		struct priority_group *pg;
+		unsigned nr_valid_paths = atomic_read(&m->nr_valid_paths);
 
 		pg = parse_priority_group(&as, m);
 		if (IS_ERR(pg)) {
@@ -909,7 +913,9 @@ static int multipath_ctr(struct dm_target *ti, unsigned int argc,
 			goto bad;
 		}
 
-		m->nr_valid_paths += pg->nr_pgpaths;
+		nr_valid_paths += pg->nr_pgpaths;
+		atomic_set(&m->nr_valid_paths, nr_valid_paths);
+
 		list_add_tail(&pg->list, &m->priority_groups);
 		pg_count++;
 		pg->pg_num = pg_count;
@@ -939,19 +945,14 @@ static int multipath_ctr(struct dm_target *ti, unsigned int argc,
 static void multipath_wait_for_pg_init_completion(struct multipath *m)
 {
 	DECLARE_WAITQUEUE(wait, current);
-	unsigned long flags;
 
 	add_wait_queue(&m->pg_init_wait, &wait);
 
 	while (1) {
 		set_current_state(TASK_UNINTERRUPTIBLE);
 
-		spin_lock_irqsave(&m->lock, flags);
-		if (!m->pg_init_in_progress) {
-			spin_unlock_irqrestore(&m->lock, flags);
+		if (!atomic_read(&m->pg_init_in_progress))
 			break;
-		}
-		spin_unlock_irqrestore(&m->lock, flags);
 
 		io_schedule();
 	}
@@ -1001,13 +1002,13 @@ static int fail_path(struct pgpath *pgpath)
 	pgpath->is_active = false;
 	pgpath->fail_count++;
 
-	m->nr_valid_paths--;
+	atomic_dec(&m->nr_valid_paths);
 
 	if (pgpath == m->current_pgpath)
 		m->current_pgpath = NULL;
 
 	dm_path_uevent(DM_UEVENT_PATH_FAILED, m->ti,
-		      pgpath->path.dev->name, m->nr_valid_paths);
+		       pgpath->path.dev->name, atomic_read(&m->nr_valid_paths));
 
 	schedule_work(&m->trigger_event);
 
@@ -1025,6 +1026,7 @@ static int reinstate_path(struct pgpath *pgpath)
 	int r = 0, run_queue = 0;
 	unsigned long flags;
 	struct multipath *m = pgpath->pg->m;
+	unsigned nr_valid_paths;
 
 	spin_lock_irqsave(&m->lock, flags);
 
@@ -1039,16 +1041,17 @@ static int reinstate_path(struct pgpath *pgpath)
 
 	pgpath->is_active = true;
 
-	if (!m->nr_valid_paths++) {
+	nr_valid_paths = atomic_inc_return(&m->nr_valid_paths);
+	if (nr_valid_paths == 1) {
 		m->current_pgpath = NULL;
 		run_queue = 1;
 	} else if (m->hw_handler_name && (m->current_pg == pgpath->pg)) {
 		if (queue_work(kmpath_handlerd, &pgpath->activate_path.work))
-			m->pg_init_in_progress++;
+			atomic_inc(&m->pg_init_in_progress);
 	}
 
 	dm_path_uevent(DM_UEVENT_PATH_REINSTATED, m->ti,
-		      pgpath->path.dev->name, m->nr_valid_paths);
+		       pgpath->path.dev->name, nr_valid_paths);
 
 	schedule_work(&m->trigger_event);
 
@@ -1166,7 +1169,8 @@ static bool pg_init_limit_reached(struct multipath *m, struct pgpath *pgpath)
 
 	spin_lock_irqsave(&m->lock, flags);
 
-	if (m->pg_init_count <= m->pg_init_retries && !test_bit(MPATHF_PG_INIT_DISABLED, &m->flags))
+	if (atomic_read(&m->pg_init_count) <= m->pg_init_retries &&
+	    !test_bit(MPATHF_PG_INIT_DISABLED, &m->flags))
 		set_bit(MPATHF_PG_INIT_REQUIRED, &m->flags);
 	else
 		limit_reached = true;
@@ -1236,7 +1240,7 @@ static void pg_init_done(void *data, int errors)
 	} else if (!test_bit(MPATHF_PG_INIT_REQUIRED, &m->flags))
 		pg->bypassed = false;
 
-	if (--m->pg_init_in_progress)
+	if (atomic_dec_return(&m->pg_init_in_progress) > 0)
 		/* Activations of other paths are still on going */
 		goto out;
 
@@ -1317,7 +1321,7 @@ static int do_end_io(struct multipath *m, struct request *clone,
 		fail_path(mpio->pgpath);
 
 	spin_lock_irqsave(&m->lock, flags);
-	if (!m->nr_valid_paths) {
+	if (!atomic_read(&m->nr_valid_paths)) {
 		if (!test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags)) {
 			if (!__must_push_back(m))
 				r = -EIO;
@@ -1421,7 +1425,8 @@ static void multipath_status(struct dm_target *ti, status_type_t type,
 
 	/* Features */
 	if (type == STATUSTYPE_INFO)
-		DMEMIT("2 %u %u ", test_bit(MPATHF_QUEUE_IO, &m->flags), m->pg_init_count);
+		DMEMIT("2 %u %u ", test_bit(MPATHF_QUEUE_IO, &m->flags),
+		       atomic_read(&m->pg_init_count));
 	else {
 		DMEMIT("%u ", test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags) +
 			      (m->pg_init_retries > 0) * 2 +
@@ -1675,8 +1680,8 @@ static int multipath_busy(struct dm_target *ti)
 	spin_lock_irqsave(&m->lock, flags);
 
 	/* pg_init in progress or no paths available */
-	if (m->pg_init_in_progress ||
-	    (!m->nr_valid_paths && test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags))) {
+	if (atomic_read(&m->pg_init_in_progress) ||
+	    (!atomic_read(&m->nr_valid_paths) && test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags))) {
 		busy = true;
 		goto out;
 	}
-- 
2.6.4 (Apple Git-63)


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC PATCH 3/4] dm mpath: move trigger_event member to the end of 'struct multipath'
  2016-03-31 20:04 [RFC PATCH 0/4] dm mpath: vastly improve blk-mq IO performance Mike Snitzer
  2016-03-31 20:04 ` [RFC PATCH 1/4] dm mpath: switch to using bitops for state flags Mike Snitzer
  2016-03-31 20:04 ` [RFC PATCH 2/4] dm mpath: use atomic_t for counting members of 'struct multipath' Mike Snitzer
@ 2016-03-31 20:04 ` Mike Snitzer
  2016-04-01  8:50   ` [dm-devel] " Johannes Thumshirn
  2016-04-07 15:03   ` Hannes Reinecke
  2016-03-31 20:04 ` [RFC PATCH 4/4] dm mpath: eliminate use of spinlock in IO fast-paths Mike Snitzer
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 26+ messages in thread
From: Mike Snitzer @ 2016-03-31 20:04 UTC (permalink / raw)
  To: dm-devel
  Cc: jmoyer, linux-block, linux-scsi, j-nomura, hare, sagig, Mike Snitzer

Allows the 'work_mutex' member to no longer cross a cacheline.
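
A quick way to inspect the resulting 'struct multipath' layout (assuming
pahole is installed and the kernel was built with debug info):

    pahole -C multipath drivers/md/dm-mpath.o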

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 drivers/md/dm-mpath.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 780e5d0..54daf96 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -89,8 +89,6 @@ struct multipath {
 	atomic_t pg_init_in_progress;	/* Only one pg_init allowed at once */
 	atomic_t pg_init_count;		/* Number of times pg_init called */
 
-	struct work_struct trigger_event;
-
 	/*
 	 * We must use a mempool of dm_mpath_io structs so that we
 	 * can resubmit bios on error.
@@ -98,6 +96,7 @@ struct multipath {
 	mempool_t *mpio_pool;
 
 	struct mutex work_mutex;
+	struct work_struct trigger_event;
 };
 
 /*
-- 
2.6.4 (Apple Git-63)


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC PATCH 4/4] dm mpath: eliminate use of spinlock in IO fast-paths
  2016-03-31 20:04 [RFC PATCH 0/4] dm mpath: vastly improve blk-mq IO performance Mike Snitzer
                   ` (2 preceding siblings ...)
  2016-03-31 20:04 ` [RFC PATCH 3/4] dm mpath: move trigger_event member to the end " Mike Snitzer
@ 2016-03-31 20:04 ` Mike Snitzer
  2016-04-01  9:02   ` [dm-devel] " Johannes Thumshirn
  2016-04-07 15:03   ` Hannes Reinecke
  2016-04-01  8:12 ` [dm-devel] [RFC PATCH 0/4] dm mpath: vastly improve blk-mq IO performance Johannes Thumshirn
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 26+ messages in thread
From: Mike Snitzer @ 2016-03-31 20:04 UTC (permalink / raw)
  To: dm-devel
  Cc: jmoyer, linux-block, linux-scsi, j-nomura, hare, sagig, Mike Snitzer

The primary motivation of this commit is to improve the scalability of
DM multipath on large NUMA systems where m->lock spinlock contention has
been proven to be a serious bottleneck on really fast storage.

The ability to atomically read a pointer, using lockless_dereference(),
is leveraged in this commit.  But all pointer writes are still protected
by the m->lock spinlock (which is fine since these all now occur in the
slow-path).

The following functions no longer require the m->lock spinlock in their
fast-path: multipath_busy(), __multipath_map(), and do_end_io().

And choose_pgpath() is modified to _not_ update m->current_pgpath unless
it also switches the path-group.  This is done to avoid needing to take
the m->lock every time __multipath_map() calls choose_pgpath().
But m->current_pgpath will be reset if it is failed via fail_path().
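
The resulting concurrency contract, condensed from the diff below (not
new code): fast-path readers load the shared pointers with
lockless_dereference(), while all writers still serialize on m->lock:

    /* reader (fast-path) */
    pg = lockless_dereference(m->current_pg);

    /* writer (slow-path) */
    spin_lock_irqsave(&m->lock, flags);
    m->current_pg = pg;
    spin_unlock_irqrestore(&m->lock, flags);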

Suggested-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 drivers/md/dm-mpath.c | 170 +++++++++++++++++++++++++++-----------------------
 1 file changed, 93 insertions(+), 77 deletions(-)

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 54daf96..52baf8a 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -305,9 +305,21 @@ static int __pg_init_all_paths(struct multipath *m)
 	return atomic_read(&m->pg_init_in_progress);
 }
 
-static void __switch_pg(struct multipath *m, struct pgpath *pgpath)
+static int pg_init_all_paths(struct multipath *m)
 {
-	m->current_pg = pgpath->pg;
+	int r;
+	unsigned long flags;
+
+	spin_lock_irqsave(&m->lock, flags);
+	r = __pg_init_all_paths(m);
+	spin_unlock_irqrestore(&m->lock, flags);
+
+	return r;
+}
+
+static void __switch_pg(struct multipath *m, struct priority_group *pg)
+{
+	m->current_pg = pg;
 
 	/* Must we initialise the PG first, and queue I/O till it's ready? */
 	if (m->hw_handler_name) {
@@ -321,26 +333,36 @@ static void __switch_pg(struct multipath *m, struct pgpath *pgpath)
 	atomic_set(&m->pg_init_count, 0);
 }
 
-static int __choose_path_in_pg(struct multipath *m, struct priority_group *pg,
-			       size_t nr_bytes)
+static struct pgpath *choose_path_in_pg(struct multipath *m,
+					struct priority_group *pg,
+					size_t nr_bytes)
 {
+	unsigned long flags;
 	struct dm_path *path;
+	struct pgpath *pgpath;
 
 	path = pg->ps.type->select_path(&pg->ps, nr_bytes);
 	if (!path)
-		return -ENXIO;
+		return ERR_PTR(-ENXIO);
 
-	m->current_pgpath = path_to_pgpath(path);
+	pgpath = path_to_pgpath(path);
 
-	if (m->current_pg != pg)
-		__switch_pg(m, m->current_pgpath);
+	if (unlikely(lockless_dereference(m->current_pg) != pg)) {
+		/* Only update current_pgpath if pg changed */
+		spin_lock_irqsave(&m->lock, flags);
+		m->current_pgpath = pgpath;
+		__switch_pg(m, pg);
+		spin_unlock_irqrestore(&m->lock, flags);
+	}
 
-	return 0;
+	return pgpath;
 }
 
-static void __choose_pgpath(struct multipath *m, size_t nr_bytes)
+static struct pgpath *choose_pgpath(struct multipath *m, size_t nr_bytes)
 {
+	unsigned long flags;
 	struct priority_group *pg;
+	struct pgpath *pgpath;
 	bool bypassed = true;
 
 	if (!atomic_read(&m->nr_valid_paths)) {
@@ -349,16 +371,28 @@ static void __choose_pgpath(struct multipath *m, size_t nr_bytes)
 	}
 
 	/* Were we instructed to switch PG? */
-	if (m->next_pg) {
+	if (lockless_dereference(m->next_pg)) {
+		spin_lock_irqsave(&m->lock, flags);
 		pg = m->next_pg;
+		if (!pg) {
+			spin_unlock_irqrestore(&m->lock, flags);
+			goto check_current_pg;
+		}
 		m->next_pg = NULL;
-		if (!__choose_path_in_pg(m, pg, nr_bytes))
-			return;
+		spin_unlock_irqrestore(&m->lock, flags);
+		pgpath = choose_path_in_pg(m, pg, nr_bytes);
+		if (!IS_ERR_OR_NULL(pgpath))
+			return pgpath;
 	}
 
 	/* Don't change PG until it has no remaining paths */
-	if (m->current_pg && !__choose_path_in_pg(m, m->current_pg, nr_bytes))
-		return;
+check_current_pg:
+	pg = lockless_dereference(m->current_pg);
+	if (pg) {
+		pgpath = choose_path_in_pg(m, pg, nr_bytes);
+		if (!IS_ERR_OR_NULL(pgpath))
+			return pgpath;
+	}
 
 	/*
 	 * Loop through priority groups until we find a valid path.
@@ -370,31 +404,34 @@ static void __choose_pgpath(struct multipath *m, size_t nr_bytes)
 		list_for_each_entry(pg, &m->priority_groups, list) {
 			if (pg->bypassed == bypassed)
 				continue;
-			if (!__choose_path_in_pg(m, pg, nr_bytes)) {
+			pgpath = choose_path_in_pg(m, pg, nr_bytes);
+			if (!IS_ERR_OR_NULL(pgpath)) {
 				if (!bypassed)
 					set_bit(MPATHF_PG_INIT_DELAY_RETRY, &m->flags);
-				return;
+				return pgpath;
 			}
 		}
 	} while (bypassed--);
 
 failed:
+	spin_lock_irqsave(&m->lock, flags);
 	m->current_pgpath = NULL;
 	m->current_pg = NULL;
+	spin_unlock_irqrestore(&m->lock, flags);
+
+	return NULL;
 }
 
 /*
  * Check whether bios must be queued in the device-mapper core rather
  * than here in the target.
  *
- * m->lock must be held on entry.
- *
  * If m->queue_if_no_path and m->saved_queue_if_no_path hold the
  * same value then we are not between multipath_presuspend()
  * and multipath_resume() calls and we have no need to check
  * for the DMF_NOFLUSH_SUSPENDING flag.
  */
-static int __must_push_back(struct multipath *m)
+static int must_push_back(struct multipath *m)
 {
 	return (test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags) ||
 		((test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags) !=
@@ -416,36 +453,31 @@ static int __multipath_map(struct dm_target *ti, struct request *clone,
 	struct block_device *bdev;
 	struct dm_mpath_io *mpio;
 
-	spin_lock_irq(&m->lock);
-
 	/* Do we need to select a new pgpath? */
-	if (!m->current_pgpath || !test_bit(MPATHF_QUEUE_IO, &m->flags))
-		__choose_pgpath(m, nr_bytes);
-
-	pgpath = m->current_pgpath;
+	pgpath = lockless_dereference(m->current_pgpath);
+	if (!pgpath || !test_bit(MPATHF_QUEUE_IO, &m->flags))
+		pgpath = choose_pgpath(m, nr_bytes);
 
 	if (!pgpath) {
-		if (!__must_push_back(m))
+		if (!must_push_back(m))
 			r = -EIO;	/* Failed */
-		goto out_unlock;
+		return r;
 	} else if (test_bit(MPATHF_QUEUE_IO, &m->flags) ||
 		   test_bit(MPATHF_PG_INIT_REQUIRED, &m->flags)) {
-		__pg_init_all_paths(m);
-		goto out_unlock;
+		pg_init_all_paths(m);
+		return r;
 	}
 
 	mpio = set_mpio(m, map_context);
 	if (!mpio)
 		/* ENOMEM, requeue */
-		goto out_unlock;
+		return r;
 
 	mpio->pgpath = pgpath;
 	mpio->nr_bytes = nr_bytes;
 
 	bdev = pgpath->path.dev->bdev;
 
-	spin_unlock_irq(&m->lock);
-
 	if (clone) {
 		/*
 		 * Old request-based interface: allocated clone is passed in.
@@ -477,11 +509,6 @@ static int __multipath_map(struct dm_target *ti, struct request *clone,
 					      &pgpath->path,
 					      nr_bytes);
 	return DM_MAPIO_REMAPPED;
-
-out_unlock:
-	spin_unlock_irq(&m->lock);
-
-	return r;
 }
 
 static int multipath_map(struct dm_target *ti, struct request *clone,
@@ -1308,7 +1335,6 @@ static int do_end_io(struct multipath *m, struct request *clone,
 	 * clone bios for it and resubmit it later.
 	 */
 	int r = DM_ENDIO_REQUEUE;
-	unsigned long flags;
 
 	if (!error && !clone->errors)
 		return 0;	/* I/O complete */
@@ -1319,17 +1345,15 @@ static int do_end_io(struct multipath *m, struct request *clone,
 	if (mpio->pgpath)
 		fail_path(mpio->pgpath);
 
-	spin_lock_irqsave(&m->lock, flags);
 	if (!atomic_read(&m->nr_valid_paths)) {
 		if (!test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags)) {
-			if (!__must_push_back(m))
+			if (!must_push_back(m))
 				r = -EIO;
 		} else {
 			if (error == -EBADE)
 				r = error;
 		}
 	}
-	spin_unlock_irqrestore(&m->lock, flags);
 
 	return r;
 }
@@ -1586,18 +1610,17 @@ static int multipath_prepare_ioctl(struct dm_target *ti,
 		struct block_device **bdev, fmode_t *mode)
 {
 	struct multipath *m = ti->private;
-	unsigned long flags;
+	struct pgpath *current_pgpath;
 	int r;
 
-	spin_lock_irqsave(&m->lock, flags);
-
-	if (!m->current_pgpath)
-		__choose_pgpath(m, 0);
+	current_pgpath = lockless_dereference(m->current_pgpath);
+	if (!current_pgpath)
+		current_pgpath = choose_pgpath(m, 0);
 
-	if (m->current_pgpath) {
+	if (current_pgpath) {
 		if (!test_bit(MPATHF_QUEUE_IO, &m->flags)) {
-			*bdev = m->current_pgpath->path.dev->bdev;
-			*mode = m->current_pgpath->path.dev->mode;
+			*bdev = current_pgpath->path.dev->bdev;
+			*mode = current_pgpath->path.dev->mode;
 			r = 0;
 		} else {
 			/* pg_init has not started or completed */
@@ -1611,17 +1634,13 @@ static int multipath_prepare_ioctl(struct dm_target *ti,
 			r = -EIO;
 	}
 
-	spin_unlock_irqrestore(&m->lock, flags);
-
 	if (r == -ENOTCONN) {
-		spin_lock_irqsave(&m->lock, flags);
-		if (!m->current_pg) {
+		if (!lockless_dereference(m->current_pg)) {
 			/* Path status changed, redo selection */
-			__choose_pgpath(m, 0);
+			(void) choose_pgpath(m, 0);
 		}
 		if (test_bit(MPATHF_PG_INIT_REQUIRED, &m->flags))
-			__pg_init_all_paths(m);
-		spin_unlock_irqrestore(&m->lock, flags);
+			pg_init_all_paths(m);
 		dm_table_run_md_queue_async(m->ti->table);
 	}
 
@@ -1672,39 +1691,37 @@ static int multipath_busy(struct dm_target *ti)
 {
 	bool busy = false, has_active = false;
 	struct multipath *m = ti->private;
-	struct priority_group *pg;
+	struct priority_group *pg, *next_pg;
 	struct pgpath *pgpath;
-	unsigned long flags;
-
-	spin_lock_irqsave(&m->lock, flags);
 
 	/* pg_init in progress or no paths available */
 	if (atomic_read(&m->pg_init_in_progress) ||
-	    (!atomic_read(&m->nr_valid_paths) && test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags))) {
-		busy = true;
-		goto out;
-	}
+	    (!atomic_read(&m->nr_valid_paths) && test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags)))
+		return true;
+
 	/* Guess which priority_group will be used at next mapping time */
-	if (unlikely(!m->current_pgpath && m->next_pg))
-		pg = m->next_pg;
-	else if (likely(m->current_pg))
-		pg = m->current_pg;
-	else
+	pg = lockless_dereference(m->current_pg);
+	next_pg = lockless_dereference(m->next_pg);
+	if (unlikely(!lockless_dereference(m->current_pgpath) && next_pg))
+		pg = next_pg;
+
+	if (!pg) {
 		/*
 		 * We don't know which pg will be used at next mapping time.
-		 * We don't call __choose_pgpath() here to avoid to trigger
+		 * We don't call choose_pgpath() here to avoid to trigger
 		 * pg_init just by busy checking.
 		 * So we don't know whether underlying devices we will be using
 		 * at next mapping time are busy or not. Just try mapping.
 		 */
-		goto out;
+		return busy;
+	}
 
 	/*
 	 * If there is one non-busy active path at least, the path selector
 	 * will be able to select it. So we consider such a pg as not busy.
 	 */
 	busy = true;
-	list_for_each_entry(pgpath, &pg->pgpaths, list)
+	list_for_each_entry(pgpath, &pg->pgpaths, list) {
 		if (pgpath->is_active) {
 			has_active = true;
 			if (!pgpath_busy(pgpath)) {
@@ -1712,17 +1729,16 @@ static int multipath_busy(struct dm_target *ti)
 				break;
 			}
 		}
+	}
 
-	if (!has_active)
+	if (!has_active) {
 		/*
 		 * No active path in this pg, so this pg won't be used and
 		 * the current_pg will be changed at next mapping time.
 		 * We need to try mapping to determine it.
 		 */
 		busy = false;
-
-out:
-	spin_unlock_irqrestore(&m->lock, flags);
+	}
 
 	return busy;
 }
-- 
2.6.4 (Apple Git-63)


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [dm-devel] [RFC PATCH 0/4] dm mpath: vastly improve blk-mq IO performance
  2016-03-31 20:04 [RFC PATCH 0/4] dm mpath: vastly improve blk-mq IO performance Mike Snitzer
                   ` (3 preceding siblings ...)
  2016-03-31 20:04 ` [RFC PATCH 4/4] dm mpath: eliminate use of spinlock in IO fast-paths Mike Snitzer
@ 2016-04-01  8:12 ` Johannes Thumshirn
  2016-04-01 13:22   ` Mike Snitzer
  2016-04-07 14:58   ` Hannes Reinecke
  2016-05-09 15:48 ` Bart Van Assche
  6 siblings, 1 reply; 26+ messages in thread
From: Johannes Thumshirn @ 2016-04-01  8:12 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: dm-devel, linux-scsi, sagig, linux-block, jmoyer, j-nomura

On 2016-03-31 22:04, Mike Snitzer wrote:
> I developed these changes some weeks ago but have since focused on
> regression and performance testing on larger NUMA systems.
> 
> For regression testing I've been using mptest:
> https://github.com/snitm/mptest
> 
> For performance testing I've been using a null_blk device (with
> various configuration permutations, e.g. pinning memory to a
> particular NUMA node, and varied number of submit_queues).
> 
> By eliminating multipath's heavy use of the m->lock spinlock in the
> fast IO paths serious performance improvements are realized.

Hi Mike,

Are these the patches you pointed Hannes to?

If yes, please add my Tested-by: Johannes Thumshirn <jthumshirn@suse.de>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [dm-devel] [RFC PATCH 1/4] dm mpath: switch to using bitops for state flags
  2016-03-31 20:04 ` [RFC PATCH 1/4] dm mpath: switch to using bitops for state flags Mike Snitzer
@ 2016-04-01  8:46   ` Johannes Thumshirn
  2016-04-07 14:59   ` Hannes Reinecke
  1 sibling, 0 replies; 26+ messages in thread
From: Johannes Thumshirn @ 2016-04-01  8:46 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: dm-devel, linux-scsi, sagig, linux-block, jmoyer, j-nomura

On 2016-03-31 22:04, Mike Snitzer wrote:
> Mechanical change that doesn't make any real effort to reduce the use 
> of
> m->lock; that will come later (once atomics are used for counters, 
> etc).
> 
> Suggested-by: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Mike Snitzer <snitzer@redhat.com>

Looks good,
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [dm-devel] [RFC PATCH 2/4] dm mpath: use atomic_t for counting members of 'struct multipath'
  2016-03-31 20:04 ` [RFC PATCH 2/4] dm mpath: use atomic_t for counting members of 'struct multipath' Mike Snitzer
@ 2016-04-01  8:48   ` Johannes Thumshirn
  2016-04-07 15:02   ` Hannes Reinecke
  1 sibling, 0 replies; 26+ messages in thread
From: Johannes Thumshirn @ 2016-04-01  8:48 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: dm-devel, linux-scsi, sagig, linux-block, jmoyer, j-nomura

On 2016-03-31 22:04, Mike Snitzer wrote:
> The use of atomic_t for nr_valid_paths, pg_init_in_progress and
> pg_init_count will allow relaxing the use of the m->lock spinlock.
> 
> Suggested-by: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Mike Snitzer <snitzer@redhat.com>

Looks good,
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [dm-devel] [RFC PATCH 3/4] dm mpath: move trigger_event member to the end of 'struct multipath'
  2016-03-31 20:04 ` [RFC PATCH 3/4] dm mpath: move trigger_event member to the end " Mike Snitzer
@ 2016-04-01  8:50   ` Johannes Thumshirn
  2016-04-07 15:03   ` Hannes Reinecke
  1 sibling, 0 replies; 26+ messages in thread
From: Johannes Thumshirn @ 2016-04-01  8:50 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: dm-devel, linux-scsi, sagig, linux-block, jmoyer, j-nomura

On 2016-03-31 22:04, Mike Snitzer wrote:
> Allows the 'work_mutex' member to no longer cross a cacheline.
> 
> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> ---

Looks good,
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [dm-devel] [RFC PATCH 4/4] dm mpath: eliminate use of spinlock in IO fast-paths
  2016-03-31 20:04 ` [RFC PATCH 4/4] dm mpath: eliminate use of spinlock in IO fast-paths Mike Snitzer
@ 2016-04-01  9:02   ` Johannes Thumshirn
  2016-04-07 15:03   ` Hannes Reinecke
  1 sibling, 0 replies; 26+ messages in thread
From: Johannes Thumshirn @ 2016-04-01  9:02 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: dm-devel, linux-scsi, sagig, linux-block, jmoyer, j-nomura

On 2016-03-31 22:04, Mike Snitzer wrote:
> The primary motivation of this commit is to improve the scalability of
> DM multipath on large NUMA systems where m->lock spinlock contention 
> has
> been proven to be a serious bottleneck on really fast storage.
> 
> The ability to atomically read a pointer, using lockless_dereference(),
> is leveraged in this commit.  But all pointer writes are still 
> protected
> by the m->lock spinlock (which is fine since these all now occur in the
> slow-path).
> 
> The following functions no longer require the m->lock spinlock in their
> fast-path: multipath_busy(), __multipath_map(), and do_end_io()
> 
> And choose_pgpath() is modified to _not_ update m->current_pgpath 
> unless
> it also switches the path-group.  This is done to avoid needing to take
> the m->lock everytime __multipath_map() calls choose_pgpath().
> But m->current_pgpath will be reset if it is failed via fail_path().
> 
> Suggested-by: Jeff Moyer <jmoyer@redhat.com>
> Signed-off-by: Mike Snitzer <snitzer@redhat.com>

Looks good,
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 0/4] dm mpath: vastly improve blk-mq IO performance
  2016-04-01  8:12 ` [dm-devel] [RFC PATCH 0/4] dm mpath: vastly improve blk-mq IO performance Johannes Thumshirn
@ 2016-04-01 13:22   ` Mike Snitzer
  2016-04-01 13:37     ` Johannes Thumshirn
  0 siblings, 1 reply; 26+ messages in thread
From: Mike Snitzer @ 2016-04-01 13:22 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: dm-devel, linux-scsi, sagig, linux-block, jmoyer, j-nomura

On Fri, Apr 01 2016 at  4:12am -0400,
Johannes Thumshirn <jthumshirn@suse.de> wrote:

> On 2016-03-31 22:04, Mike Snitzer wrote:
> >I developed these changes some weeks ago but have since focused on
> >regression and performance testing on larger NUMA systems.
> >
> >For regression testing I've been using mptest:
> >https://github.com/snitm/mptest
> >
> >For performance testing I've been using a null_blk device (with
> >various configuration permutations, e.g. pinning memory to a
> >particular NUMA node, and varied number of submit_queues).
> >
> >By eliminating multipath's heavy use of the m->lock spinlock in the
> >fast IO paths serious performance improvements are realized.
> 
> Hi Mike,
> 
> Are these the patches you pointed Hannes to?
> 
> If yes, please add my Tested-by: Johannes Thumshirn <jthumshirn@suse.de>

No they are not.

Hannes seems to have last pulled in my DM mpath changes that (ab)used RCU.
I ended up dropping those changes and this patchset is the replacement.

So please retest with this patchset (I know you guys have a large setup
that these changes are very relevant for).  If you could actually share
_how_ you've tested, that'd help me understand how these changes are
holding up.  So far all looks good for me...

Mike

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 0/4] dm mpath: vastly improve blk-mq IO performance
  2016-04-01 13:22   ` Mike Snitzer
@ 2016-04-01 13:37     ` Johannes Thumshirn
  2016-04-01 14:14       ` Mike Snitzer
  0 siblings, 1 reply; 26+ messages in thread
From: Johannes Thumshirn @ 2016-04-01 13:37 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: dm-devel, linux-scsi, sagig, linux-block, jmoyer, j-nomura, hare

[ +Cc Hannes ]

On 2016-04-01 15:22, Mike Snitzer wrote:
> On Fri, Apr 01 2016 at  4:12am -0400,
> Johannes Thumshirn <jthumshirn@suse.de> wrote:
> 
>> On 2016-03-31 22:04, Mike Snitzer wrote:
>> >I developed these changes some weeks ago but have since focused on
>> >regression and performance testing on larger NUMA systems.
>> >
>> >For regression testing I've been using mptest:
>> >https://github.com/snitm/mptest
>> >
>> >For performance testing I've been using a null_blk device (with
>> >various configuration permutations, e.g. pinning memory to a
>> >particular NUMA node, and varied number of submit_queues).
>> >
>> >By eliminating multipath's heavy use of the m->lock spinlock in the
>> >fast IO paths serious performance improvements are realized.
>> 
>> Hi Mike,
>> 
>> Are these the patches you pointed Hannes to?
>> 
>> If yes, please add my Tested-by: Johannes Thumshirn 
>> <jthumshirn@suse.de>
> 
> No they are not.
> 
> Hannes seems to have last pulled in my DM mpath changes that (ab)used 
> RCU.
> I ended up dropping those changes and this patchset is the replacement.

Now that you mention it, I can remember some inspired RCU usage in the
patches.

> So please retest with this patchset (I know you guys have a large setup
> that these changes are very relevant for).  If you could actually share
> _how_ you've tested, that'd help me understand how these changes are
> holding up.  So far all looks good for me...

The test itself is actually quite simple: we're testing with fio against
a Fibre Channel array (all SSDs, but I was very careful to only write
into the cache).

Here's my fio job file:
[mq-test]
iodepth=128
numjobs=40
group_reporting
direct=1
ioengine=libaio
size=3G
filename=/dev/dm-0
filename=/dev/dm-1
filename=/dev/dm-2
filename=/dev/dm-3
filename=/dev/dm-4
filename=/dev/dm-5
filename=/dev/dm-6
filename=/dev/dm-7
name="MQ Test"

and the test runner:
#!/bin/sh

for rw in 'randread' 'randwrite' 'read' 'write'; do
         for bs in '4k' '8k' '16k' '32k' '64k'; do
                 fio mq-test.fio --bs="${bs}" --rw="${rw}" 
--output="fio-${bs}-${rw}.txt"
         done
done

The initiator has 40 CPUs on 4 NUMA nodes (no HT) and 64GB RAM. I'm not
sure how much in terms of numbers I can share from the old patchset
(will ask Hannes on Monday), but I'm aware I'll have to once I've
retested with your new patches and we want to compare the results.

Byte,
    Johannes

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 0/4] dm mpath: vastly improve blk-mq IO performance
  2016-04-01 13:37     ` Johannes Thumshirn
@ 2016-04-01 14:14       ` Mike Snitzer
  0 siblings, 0 replies; 26+ messages in thread
From: Mike Snitzer @ 2016-04-01 14:14 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: dm-devel, linux-scsi, sagig, linux-block, jmoyer, j-nomura, hare

On Fri, Apr 01 2016 at  9:37am -0400,
Johannes Thumshirn <jthumshirn@suse.de> wrote:

> [ +Cc Hannes ]
> 
> On 2016-04-01 15:22, Mike Snitzer wrote:
> >On Fri, Apr 01 2016 at  4:12am -0400,
> >Johannes Thumshirn <jthumshirn@suse.de> wrote:
> >
> >>On 2016-03-31 22:04, Mike Snitzer wrote:
> >>>I developed these changes some weeks ago but have since focused on
> >>>regression and performance testing on larger NUMA systems.
> >>>
> >>>For regression testing I've been using mptest:
> >>>https://github.com/snitm/mptest
> >>>
> >>>For performance testing I've been using a null_blk device (with
> >>>various configuration permutations, e.g. pinning memory to a
> >>>particular NUMA node, and varied number of submit_queues).
> >>>
> >>>By eliminating multipath's heavy use of the m->lock spinlock in the
> >>>fast IO paths serious performance improvements are realized.
> >>
> >>Hi Mike,
> >>
> >>Are these the patches you pointed Hannes to?
> >>
> >>If yes, please add my Tested-by: Johannes Thumshirn
> >><jthumshirn@suse.de>
> >
> >No they are not.
> >
> >Hannes seems to have last pulled in my DM mpath changes that
> >(ab)used RCU.
> >I ended up dropping those changes and this patchset is the replacement.
> 
> Now that you mention it, I can remember some inspired RCU usage in
> the patches.
> 
> >So please retest with this patchset (I know you guys have a large setup
> >that these changes are very relevant for).  If you could actually share
> >_how_ you've tested, that'd help me understand how these changes are
> >holding up.  So far all looks good for me...
> 
> The test itself is actually quite simple: we're testing with fio
> against a Fibre Channel array (all SSDs, but I was very careful to
> only write into the cache).
> 
> Here's my fio job file:
> [mq-test]
> iodepth=128
> numjobs=40
> group_reporting
> direct=1
> ioengine=libaio
> size=3G
> filename=/dev/dm-0
> filename=/dev/dm-1
> filename=/dev/dm-2
> filename=/dev/dm-3
> filename=/dev/dm-4
> filename=/dev/dm-5
> filename=/dev/dm-6
> filename=/dev/dm-7
> name="MQ Test"
> 
> and the test runner:
> #!/bin/sh
> 
> for rw in 'randread' 'randwrite' 'read' 'write'; do
>         for bs in '4k' '8k' '16k' '32k' '64k'; do
>                 fio mq-test.fio --bs="${bs}" --rw="${rw}"
> --output="fio-${bs}-${rw}.txt"
>         done
> done
> 
> The initiator has 40 CPUs on 4 NUMA nodes (no HT) and 64GB RAM. I'm
> not sure how much in terms of numbers I can share from the old
> patchset (will ask Hannes on Monday), but I'm aware I'll have to
> once I've retested with your new patches and we want to compare the
> results.

OK, yes please share as much as you can.  Having shared code that seems
to perform well, it'd be nice to see some results from your testbed.
It'll help build my confidence in the code actually landing upstream.

Thanks,
Mike

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 0/4] dm mpath: vastly improve blk-mq IO performance
  2016-03-31 20:04 [RFC PATCH 0/4] dm mpath: vastly improve blk-mq IO performance Mike Snitzer
@ 2016-04-07 14:58   ` Hannes Reinecke
  2016-03-31 20:04 ` [RFC PATCH 2/4] dm mpath: use atomic_t for counting members of 'struct multipath' Mike Snitzer
                     ` (5 subsequent siblings)
  6 siblings, 0 replies; 26+ messages in thread
From: Hannes Reinecke @ 2016-04-07 14:58 UTC (permalink / raw)
  To: Mike Snitzer, dm-devel; +Cc: jmoyer, linux-block, linux-scsi, j-nomura, sagig

On 03/31/2016 10:04 PM, Mike Snitzer wrote:
> I developed these changes some weeks ago but have since focused on
> regression and performance testing on larger NUMA systems.
> 
> For regression testing I've been using mptest:
> https://github.com/snitm/mptest
> 
> For performance testing I've been using a null_blk device (with
> various configuration permutations, e.g. pinning memory to a
> particular NUMA node, and varied number of submit_queues).
> 
> By eliminating multipath's heavy use of the m->lock spinlock in the
> fast IO paths serious performance improvements are realized.
> 
[ .. ]
> Jeff Moyer has been helping review these changes (and has graciously
> labored over _really_ understanding all the concurrency at play in DM
> mpath) -- his review isn't yet complete but I wanted to get this
> patchset out now to raise awareness about how I think DM multipath
> will be changing (for inclusion during the Linux 4.7 merge window).
> 
> Mike Snitzer (4):
>   dm mpath: switch to using bitops for state flags
>   dm mpath: use atomic_t for counting members of 'struct multipath'
>   dm mpath: move trigger_event member to the end of 'struct multipath'
>   dm mpath: eliminate use of spinlock in IO fast-paths
> 
>  drivers/md/dm-mpath.c | 351 ++++++++++++++++++++++++++++----------------------
>  1 file changed, 195 insertions(+), 156 deletions(-)
> 
Finally got around to testing this.
The performance is comparable to the previous (RCU-ified) patchset;
however, this one is the far superior approach.
In fact, the first two patches are pretty much identical to what I
already had, but I had shied away from modifying the path selectors.
So well done here.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 1/4] dm mpath: switch to using bitops for state flags
  2016-03-31 20:04 ` [RFC PATCH 1/4] dm mpath: switch to using bitops for state flags Mike Snitzer
  2016-04-01  8:46   ` [dm-devel] " Johannes Thumshirn
@ 2016-04-07 14:59   ` Hannes Reinecke
  1 sibling, 0 replies; 26+ messages in thread
From: Hannes Reinecke @ 2016-04-07 14:59 UTC (permalink / raw)
  To: Mike Snitzer, dm-devel; +Cc: jmoyer, linux-block, linux-scsi, j-nomura, sagig

On 03/31/2016 10:04 PM, Mike Snitzer wrote:
> Mechanical change that doesn't make any real effort to reduce the use of
> m->lock; that will come later (once atomics are used for counters, etc).
> 
> Suggested-by: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> ---
>  drivers/md/dm-mpath.c | 131 +++++++++++++++++++++++++++++---------------------
>  1 file changed, 75 insertions(+), 56 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 26+ messages in thread
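
The conversion described in the patch header follows the common kernel
pattern of replacing spinlock-guarded flag fields with atomic bitops on
a single word.  A minimal sketch of that pattern (hypothetical flag and
struct names, not the actual dm-mpath code):

	#include <linux/bitops.h>
	#include <linux/types.h>

	/* One word of flag bits; atomic bitops make each flag readable
	 * and writable without taking m->lock. */
	enum {
		MPATHF_EXAMPLE_QUEUE_IO,	/* hypothetical flag name */
	};

	struct mp_flags {
		unsigned long flags;
	};

	static inline void set_queue_io(struct mp_flags *m)
	{
		set_bit(MPATHF_EXAMPLE_QUEUE_IO, &m->flags);	/* atomic RMW */
	}

	static inline bool must_queue_io(struct mp_flags *m)
	{
		return test_bit(MPATHF_EXAMPLE_QUEUE_IO, &m->flags);
	}

The trade-off is that each flag is now read and updated individually;
code that previously relied on m->lock to observe several flags change
together has to be audited for that.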

* Re: [RFC PATCH 2/4] dm mpath: use atomic_t for counting members of 'struct multipath'
  2016-03-31 20:04 ` [RFC PATCH 2/4] dm mpath: use atomic_t for counting members of 'struct multipath' Mike Snitzer
  2016-04-01  8:48   ` [dm-devel] " Johannes Thumshirn
@ 2016-04-07 15:02   ` Hannes Reinecke
  1 sibling, 0 replies; 26+ messages in thread
From: Hannes Reinecke @ 2016-04-07 15:02 UTC (permalink / raw)
  To: Mike Snitzer, dm-devel; +Cc: jmoyer, linux-block, linux-scsi, j-nomura, sagig

On 03/31/2016 10:04 PM, Mike Snitzer wrote:
> The use of atomic_t for nr_valid_paths, pg_init_in_progress and
> pg_init_count will allow relaxing the use of the m->lock spinlock.
> 
> Suggested-by: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> ---
>  drivers/md/dm-mpath.c | 61 ++++++++++++++++++++++++++++-----------------------
>  1 file changed, 33 insertions(+), 28 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 26+ messages in thread
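
The counter conversion is the same idea applied to integers: fields
like nr_valid_paths become atomic_t so the fast path can read and
update them without m->lock.  A short sketch of the idiom (simplified,
with hypothetical names):

	#include <linux/atomic.h>
	#include <linux/types.h>

	struct mp_counters {
		atomic_t nr_valid_paths;	/* was: unsigned, under m->lock */
	};

	static void path_reinstated(struct mp_counters *m)
	{
		atomic_inc(&m->nr_valid_paths);
	}

	static void path_failed(struct mp_counters *m)
	{
		atomic_dec(&m->nr_valid_paths);
	}

	static bool have_valid_path(struct mp_counters *m)
	{
		/* lock-free read; callers must tolerate a slightly
		 * stale value, which is fine for path selection */
		return atomic_read(&m->nr_valid_paths) > 0;
	}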

* Re: [RFC PATCH 3/4] dm mpath: move trigger_event member to the end of 'struct multipath'
  2016-03-31 20:04 ` [RFC PATCH 3/4] dm mpath: move trigger_event member to the end " Mike Snitzer
  2016-04-01  8:50   ` [dm-devel] " Johannes Thumshirn
@ 2016-04-07 15:03   ` Hannes Reinecke
  1 sibling, 0 replies; 26+ messages in thread
From: Hannes Reinecke @ 2016-04-07 15:03 UTC (permalink / raw)
  To: Mike Snitzer, dm-devel; +Cc: jmoyer, linux-block, linux-scsi, j-nomura, sagig

On 03/31/2016 10:04 PM, Mike Snitzer wrote:
> Allows the 'work_mutex' member to no longer cross a cacheline.
> 
> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> ---
>  drivers/md/dm-mpath.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
> index 780e5d0..54daf96 100644
> --- a/drivers/md/dm-mpath.c
> +++ b/drivers/md/dm-mpath.c
> @@ -89,8 +89,6 @@ struct multipath {
>  	atomic_t pg_init_in_progress;	/* Only one pg_init allowed at once */
>  	atomic_t pg_init_count;		/* Number of times pg_init called */
>  
> -	struct work_struct trigger_event;
> -
>  	/*
>  	 * We must use a mempool of dm_mpath_io structs so that we
>  	 * can resubmit bios on error.
> @@ -98,6 +96,7 @@ struct multipath {
>  	mempool_t *mpio_pool;
>  
>  	struct mutex work_mutex;
> +	struct work_struct trigger_event;
>  };
>  
>  /*
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 26+ messages in thread
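
Layout changes like this can regress silently when the struct grows
later.  One way to pin the assumption down is a build-time check on the
offsets; the actual patch adds no such check, so this is only a sketch
of the idea (it assumes the struct multipath definition from
dm-mpath.c is in scope):

	#include <linux/bug.h>
	#include <linux/cache.h>
	#include <linux/mutex.h>
	#include <linux/stddef.h>

	/* Assert that work_mutex does not straddle a cacheline boundary
	 * (L1_CACHE_BYTES is 64 on x86_64). */
	static inline void check_multipath_layout(void)
	{
		BUILD_BUG_ON(offsetof(struct multipath, work_mutex) / L1_CACHE_BYTES !=
			     (offsetof(struct multipath, work_mutex) +
			      sizeof(struct mutex) - 1) / L1_CACHE_BYTES);
	}

In practice, running pahole (from the dwarves package) against
dm-mpath.o is the quicker way to eyeball the before/after offsets.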

* Re: [RFC PATCH 4/4] dm mpath: eliminate use of spinlock in IO fast-paths
  2016-03-31 20:04 ` [RFC PATCH 4/4] dm mpath: eliminate use of spinlock in IO fast-paths Mike Snitzer
  2016-04-01  9:02   ` [dm-devel] " Johannes Thumshirn
@ 2016-04-07 15:03   ` Hannes Reinecke
  1 sibling, 0 replies; 26+ messages in thread
From: Hannes Reinecke @ 2016-04-07 15:03 UTC (permalink / raw)
  To: Mike Snitzer, dm-devel; +Cc: jmoyer, linux-block, linux-scsi, j-nomura, sagig

On 03/31/2016 10:04 PM, Mike Snitzer wrote:
> The primary motivation of this commit is to improve the scalability of
> DM multipath on large NUMA systems where m->lock spinlock contention has
> been proven to be a serious bottleneck on really fast storage.
> 
> The ability to atomically read a pointer, using lockless_dereference(),
> is leveraged in this commit.  But all pointer writes are still protected
> by the m->lock spinlock (which is fine since these all now occur in the
> slow-path).
> 
> The following functions no longer require the m->lock spinlock in their
> fast-path: multipath_busy(), __multipath_map(), and do_end_io()
> 
> And choose_pgpath() is modified to _not_ update m->current_pgpath unless
> it also switches the path-group.  This is done to avoid needing to take
> the m->lock everytime __multipath_map() calls choose_pgpath().
> But m->current_pgpath will be reset if it is failed via fail_path().
> 
> Suggested-by: Jeff Moyer <jmoyer@redhat.com>
> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> ---
>  drivers/md/dm-mpath.c | 170 +++++++++++++++++++++++++++-----------------------
>  1 file changed, 93 insertions(+), 77 deletions(-)
> 
I really like this :-)

Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 26+ messages in thread
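
The fast-path/slow-path split described in the header boils down to a
dependency-ordered read on the reader side paired with a locked,
release-ordered publish on the writer side.  Roughly (a simplified
sketch with hypothetical helper names, not the actual patch):

	#include <linux/compiler.h>
	#include <linux/rcupdate.h>
	#include <linux/spinlock.h>

	static struct priority_group *mp_current_pg(struct multipath *m)
	{
		/* fast path: no m->lock; the dependency-ordered load
		 * pairs with the rcu_assign_pointer() below */
		return lockless_dereference(m->current_pg);
	}

	static void mp_switch_pg(struct multipath *m, struct priority_group *pg)
	{
		unsigned long flags;

		/* slow path: pointer writes still serialize on m->lock */
		spin_lock_irqsave(&m->lock, flags);
		rcu_assign_pointer(m->current_pg, pg);	/* release-ordered */
		spin_unlock_irqrestore(&m->lock, flags);
	}

Readers may see a slightly stale path group, which is acceptable here
because, as the header notes, fail_path() resets m->current_pgpath
under the lock.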

* Re: [RFC PATCH 0/4] dm mpath: vastly improve blk-mq IO performance
  2016-04-07 14:58   ` Hannes Reinecke
  (?)
@ 2016-04-07 15:34   ` Mike Snitzer
  2016-04-08 11:42       ` Johannes Thumshirn
  -1 siblings, 1 reply; 26+ messages in thread
From: Mike Snitzer @ 2016-04-07 15:34 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: dm-devel, jmoyer, linux-block, linux-scsi, j-nomura, sagig

On Thu, Apr 07 2016 at 10:58am -0400,
Hannes Reinecke <hare@suse.de> wrote:

> On 03/31/2016 10:04 PM, Mike Snitzer wrote:
> > I developed these changes some weeks ago but have since focused on
> > regression and performance testing on larger NUMA systems.
> > 
> > For regression testing I've been using mptest:
> > https://github.com/snitm/mptest
> > 
> > For performance testing I've been using a null_blk device (with
> > various configuration permutations, e.g. pinning memory to a
> > particular NUMA node, and varied number of submit_queues).
> > 
> > By eliminating multipath's heavy use of the m->lock spinlock in the
> > fast IO paths serious performance improvements are realized.
> > 
> [ .. ]
> > Jeff Moyer has been helping review these changes (and has graciously
> > labored over _really_ understanding all the concurrency at play in DM
> > mpath) -- his review isn't yet complete but I wanted to get this
> > patchset out now to raise awareness about how I think DM multipath
> > will be changing (for inclusion during the Linux 4.7 merge window).
> > 
> > Mike Snitzer (4):
> >   dm mpath: switch to using bitops for state flags
> >   dm mpath: use atomic_t for counting members of 'struct multipath'
> >   dm mpath: move trigger_event member to the end of 'struct multipath'
> >   dm mpath: eliminate use of spinlock in IO fast-paths
> > 
> >  drivers/md/dm-mpath.c | 351 ++++++++++++++++++++++++++++----------------------
> >  1 file changed, 195 insertions(+), 156 deletions(-)
> > 
> Finally got around to testing this.
> The performance is comparable to the previous (RCU-ified) patchset;
> however, this one is the far superior approach.
> In fact, the first two patches are pretty much identical to what I
> already had, but I had shied away from modifying the path selectors.
> So well done here.

Awesome, thanks for reviewing and testing, very much appreciated.

I'll get this set staged in linux-next for 4.7 shortly.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [dm-devel] [RFC PATCH 0/4] dm mpath: vastly improve blk-mq IO performance
  2016-04-07 15:34   ` Mike Snitzer
@ 2016-04-08 11:42       ` Johannes Thumshirn
  0 siblings, 0 replies; 26+ messages in thread
From: Johannes Thumshirn @ 2016-04-08 11:42 UTC (permalink / raw)
  To: dm-devel
  Cc: Mike Snitzer, Hannes Reinecke, linux-scsi, sagig, jmoyer,
	linux-block, j-nomura

Ladies and Gentlemen,
To show off some numbers from our testing:

All tests are performed against the cache of the array, not the disks, as we
wanted to test the Linux stack, not the disk array.

All single-queue tests have been performed with the deadline I/O scheduler.

Comments welcome, have fun reading :-)

QLogic 32GBit FC HBA NetApp EF560 Array w/ DM MQ this patchset:
===============================================================
Random Read:
BS  | IOPS   | BW         |
----+--------+------------+
4k  | 785136 | 3066.1MB/s |
8k  | 734983 | 5742.6MB/s |
16k | 398516 | 6226.9MB/s |
32k | 200589 | 6268.5MB/s |
64k | 100417 | 6276.2MB/s |

Sequential Read:
BS  | IOPS   | BW         |
----+--------+------------+
4k  | 788620 | 3080.6MB/s |
8k  | 736359 | 5752.9MB/s |
16k | 398597 | 6228.8MB/s |
32k | 200487 | 6265.3MB/s |
64k | 100402 | 6275.2MB/s |

Random Write:
BS  | IOPS   | BW         |
----+--------+------------+
4k  | 242105 | 968423KB/s |
8k  | 222290 | 1736.7MB/s |
16k | 178191 | 2784.3MB/s |
32k | 133619 | 4175.7MB/s |
64k |  97693 | 6105.9MB/s |

Sequential Write:
BS  | IOPS   | BW         |
----+--------+------------+
4k  | 134788 | 539155KB/s |
8k  | 132361 | 1034.8MB/s |
16k | 129941 | 2030.4MB/s |
32k | 128288 | 4009.4MB/s |
64k |  97776 | 6111.0MB/s |

QLogic 32GBit FC HBA NetApp EF560 Array w/ DM SQ this patchset:
===============================================================
Random Read:
BS  | IOPS   | BW         |
----+--------+------------+
4k  | 112402 | 449608KB/s |
8k  | 112818 | 902551KB/s |
16k | 111885 | 1748.3MB/s |
32k | 188015 | 5875.6MB/s |
64k |  99021 | 6188.9MB/s |

Sequential Read:
BS  | IOPS   | BW         |
----+--------+------------+
4k  | 115046 | 460186KB/s |
8k  | 113974 | 911799KB/s |
16k | 113374 | 1771.5MB/s |
32k | 192932 | 6029.2MB/s |
64k | 100474 | 6279.7MB/s |

Random Write:
BS  | IOPS   | BW         |
----+--------+------------+
4k  | 114284 | 457138KB/s |
8k  | 113992 | 911944KB/s |
16k | 113715 | 1776.9MB/s |
32k | 130402 | 4075.9MB/s |
64k |  92243 | 5765.3MB/s |

Sequential Write:
BS  | IOPS   | BW         |
----+--------+------------+
4k  | 115540 | 462162KB/s |
8k  | 114243 | 913951KB/s |
16k | 300153 | 4689.1MB/s |
32k | 141069 | 4408.5MB/s |
64k |  97620 | 6101.3MB/s |


QLogic 32GBit FC HBA NetApp EF560 Array w/ DM MQ previous patchset:
===================================================================
Random Read:
BS  | IOPS   | BW         |
----+--------+------------+
4k  | 782733 | 3057.6MB/s |
8k  | 732143 | 5719.9MB/s |
16k | 398314 | 6223.7MB/s |
32k | 200538 | 6266.9MB/s |
64k | 100422 | 6276.5MB/s |

Sequential Read:
BS  | IOPS   | BW         |
----+--------+------------+
4k  | 786707 | 3073.8MB/s |
8k  | 730579 | 5707.7MB/s |
16k | 398799 | 6231.3MB/s |
32k | 200518 | 6266.2MB/s |
64k | 100397 | 6274.9MB/s |

Random Write:
BS  | IOPS   | BW         |
----+--------+------------+
4k  | 242426 | 969707KB/s |
8k  | 223079 | 1742.9MB/s |
16k | 177889 | 2779.6MB/s |
32k | 133637 | 4176.2MB/s |
64k |  97727 | 6107.1MB/s |

Sequential Write:
BS  | IOPS   | BW         |
----+--------+------------+
4k  | 134360 | 537442KB/s |
8k  | 129738 | 1013.6MB/s |
16k | 129746 | 2027.3MB/s |
32k | 127875 | 3996.1MB/s |
64k |  97683 | 6105.3MB/s |

Emulex 16GBit FC HBA NetApp EF560 Array w/ DM MQ this patchset:
===============================================================
[Beware, this is with Hannes' lockless lpfc patches, which are not upstream as
they're quite experimental, but are good at showing the capability of the new
dm-mpath]

Random Read:
BS  | IOPS   | BW         |
----+--------+------------+
4k  | 939752 | 3670.1MB/s |
8k  | 741462 | 5792.7MB/s |
16k | 399285 | 6238.9MB/s |
32k | 196490 | 6140.4MB/s |
64k | 100325 | 6270.4MB/s |

Sequential Read:
BS  | IOPS   | BW         |
----+--------+------------+
4k  | 926222 | 3618.6MB/s |
8k  | 750125 | 5860.4MB/s |
16k | 397770 | 6215.2MB/s |
32k | 200130 | 6254.8MB/s |
64k | 100397 | 6274.9MB/s |

Random Write:
BS  | IOPS   | BW         |
----+--------+------------+
4k  | 251938 | 984.14MB/s |
8k  | 226712 | 1771.2MB/s |
16k | 180739 | 2824.5MB/s |
32k | 133316 | 4166.2MB/s |
64k |  98738 | 6171.2MB/s |

Sequential Write:
BS  | IOPS   | BW         |
----+--------+------------+
4k  | 134660 | 538643KB/s |
8k  | 131585 | 1028.9MB/s |
16k | 131030 | 2047.4MB/s |
32k | 126987 | 3968.4MB/s |
64k |  98882 | 6180.2MB/s |

Emulex 16GBit FC HBA NetApp EF560 Array w/ DM SQ this patchset:
===============================================================
Random Read:
BS  | IOPS   | BW         |
----+--------+------------+
4k  | 101860 | 407443KB/s |
8k  | 100242 | 801940KB/s |
16k |  99774 | 1558.1MB/s |
32k | 134304 | 4197.8MB/s |
64k | 100126 | 6257.1MB/s |

Sequential Read:
BS  | IOPS   | BW         |
----+--------+------------+
4k  | 285585 | 1115.6MB/s |
8k  | 313619 | 2450.2MB/s |
16k | 190696 | 2979.7MB/s |
32k | 200569 | 6267.9MB/s |
64k | 100366 | 6272.1MB/s |

Random Write:
BS  | IOPS   | BW         |
----+--------+------------+
4k  |  98434 | 393738KB/s |
8k  |  97286 | 778294KB/s |
16k |  97623 | 1525.4MB/s |
32k | 126999 | 3968.8MB/s |
64k |  94309 | 5894.4MB/s |

Sequential Write:
BS  | IOPS   | BW         |
----+--------+------------+
4k  | 330066 | 1289.4MB/s |
8k  | 452792 | 3537.5MB/s |
16k | 334481 | 5226.3MB/s |
32k | 186916 | 5841.2MB/s |
64k |  88698 | 5543.7MB/s |

-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 0/4] dm mpath: vastly improve blk-mq IO performance
  2016-04-08 11:42       ` Johannes Thumshirn
  (?)
@ 2016-04-08 19:29       ` Mike Snitzer
  2016-04-13  7:03           ` Johannes Thumshirn
  -1 siblings, 1 reply; 26+ messages in thread
From: Mike Snitzer @ 2016-04-08 19:29 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: dm-devel, linux-scsi, sagig, linux-block, jmoyer, j-nomura

On Fri, Apr 08 2016 at  7:42am -0400,
Johannes Thumshirn <jthumshirn@suse.de> wrote:

> Ladies and Gentlemen,
> To show off some numbers from our testing:
> 
> All tests are performed against the cache of the array, not the disks, as we
> wanted to test the Linux stack, not the disk array.
> 
> All single-queue tests have been performed with the deadline I/O scheduler.
> 
> Comments welcome, have fun reading :-)

Any chance you collected performance results from DM MQ on this same
testbed without any variant of my lockless patches?  The DM SQ results
aren't too interesting a reference point.  Seeing how much better
lockless DM MQ (multipath) is than the old m->lock-heavy code (still in
4.6) would be more interesting.

Not a big deal if you don't have it, but I figured I'd check.

And thanks for the numbers you've provided.
Mike

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [dm-devel] [RFC PATCH 0/4] dm mpath: vastly improve blk-mq IO performance
  2016-04-08 19:29       ` Mike Snitzer
@ 2016-04-13  7:03           ` Johannes Thumshirn
  0 siblings, 0 replies; 26+ messages in thread
From: Johannes Thumshirn @ 2016-04-13  7:03 UTC (permalink / raw)
  To: dm-devel; +Cc: Mike Snitzer, linux-scsi, sagig, jmoyer, linux-block, j-nomura

On Friday, 8 April 2016 15:29:39 CEST Mike Snitzer wrote:
> On Fri, Apr 08 2016 at  7:42am -0400,
> Johannes Thumshirn <jthumshirn@suse.de> wrote:
> 
> > Ladies and Gentlemen,
> > To show off some numbers from our testing:
> > 
> > All tests are performed against the cache of the array, not the disks, as we
> > wanted to test the Linux stack, not the disk array.
> > 
> > All single-queue tests have been performed with the deadline I/O scheduler.
> > 
> > Comments welcome, have fun reading :-)
> 
> Any chance you collected performance results from DM MQ on this same
> testbed without any variant of my lockless patches?  The DM SQ results
> aren't too interesting a reference point.  Seeing how much better
> lockless DM MQ (multipath) is than the old m->lock-heavy code (still in
> 4.6) would be more interesting.
> 
> Not a big deal if you don't have it, but I figured I'd check.

I'll have to check whether some of the old logfiles are still available, but we
can't re-test, as the array was only temporarily allocated to us and we've
since lost access to it. IIRC it was somewhere around 300K IOPS, but
don't quote me on that.

Byte,
	Johannes

 
> And thanks for the numbers you've provided.
> Mike
> 
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel
> 


-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 0/4] dm mpath: vastly improve blk-mq IO performance
  2016-03-31 20:04 [RFC PATCH 0/4] dm mpath: vastly improve blk-mq IO performance Mike Snitzer
                   ` (5 preceding siblings ...)
  2016-04-07 14:58   ` Hannes Reinecke
@ 2016-05-09 15:48 ` Bart Van Assche
  6 siblings, 0 replies; 26+ messages in thread
From: Bart Van Assche @ 2016-05-09 15:48 UTC (permalink / raw)
  To: linux-scsi

Mike Snitzer <snitzer@redhat.com> writes:
> I developed these changes some weeks ago but have since focused on
> regression and performance testing on larger NUMA systems.

Hello Mike,

Sorry that I do not yet have any performance numbers available for the SRP
protocol for this patch series. But I want to let you know that since I
started testing this patch series three weeks ago I haven't seen any
regressions caused by it.

Bart.


^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2016-05-09 16:30 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-03-31 20:04 [RFC PATCH 0/4] dm mpath: vastly improve blk-mq IO performance Mike Snitzer
2016-03-31 20:04 ` [RFC PATCH 1/4] dm mpath: switch to using bitops for state flags Mike Snitzer
2016-04-01  8:46   ` [dm-devel] " Johannes Thumshirn
2016-04-07 14:59   ` Hannes Reinecke
2016-03-31 20:04 ` [RFC PATCH 2/4] dm mpath: use atomic_t for counting members of 'struct multipath' Mike Snitzer
2016-04-01  8:48   ` [dm-devel] " Johannes Thumshirn
2016-04-07 15:02   ` Hannes Reinecke
2016-03-31 20:04 ` [RFC PATCH 3/4] dm mpath: move trigger_event member to the end " Mike Snitzer
2016-04-01  8:50   ` [dm-devel] " Johannes Thumshirn
2016-04-07 15:03   ` Hannes Reinecke
2016-03-31 20:04 ` [RFC PATCH 4/4] dm mpath: eliminate use of spinlock in IO fast-paths Mike Snitzer
2016-04-01  9:02   ` [dm-devel] " Johannes Thumshirn
2016-04-07 15:03   ` Hannes Reinecke
2016-04-01  8:12 ` [dm-devel] [RFC PATCH 0/4] dm mpath: vastly improve blk-mq IO performance Johannes Thumshirn
2016-04-01 13:22   ` Mike Snitzer
2016-04-01 13:37     ` Johannes Thumshirn
2016-04-01 14:14       ` Mike Snitzer
2016-04-07 14:58 ` Hannes Reinecke
2016-04-07 15:34   ` Mike Snitzer
2016-04-08 11:42     ` [dm-devel] " Johannes Thumshirn
2016-04-08 19:29       ` Mike Snitzer
2016-04-13  7:03         ` [dm-devel] " Johannes Thumshirn
2016-05-09 15:48 ` Bart Van Assche
