* dm-multipath low performance with blk-mq
@ 2016-01-18 12:04 Sagi Grimberg
  2016-01-19 10:37 ` Sagi Grimberg
  0 siblings, 1 reply; 127+ messages in thread
From: Sagi Grimberg @ 2016-01-18 12:04 UTC (permalink / raw)


Hi All,

I've recently tried out dm-multipath over a "super-fast" nvme device
and noticed serious lock contention in dm-multipath that requires some
extra attention. The nvme device is a simple loopback device emulation
backed by a null_blk device.

With this I've seen dm-multipath pushing around 470K IOPS, while
the native (loopback) nvme device can easily reach 1500K+ IOPS.

perf output [1] reveals huge lock contention on the multipath lock,
a per-dm_target contention point, which seems to defeat the purpose
of the blk-mq I/O path.

The two current bottlenecks seem to come from multipath_busy and
__multipath_map. Would it make better sense to move to a percpu_ref
model with freeze/unfreeze logic for updates similar to what blk-mq
is doing?
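
To illustrate what I have in mind, here is a rough, hypothetical sketch
(names made up, not compile tested) in the spirit of blk-mq's
q_usage_counter freeze/unfreeze:

/*
 * Hypothetical sketch only: replace the per-target spinlock in the I/O
 * path with a percpu_ref that is "frozen" for path-table updates.
 * percpu_ref_init(&m->io_ref, mpath_io_ref_release, 0, GFP_KERNEL) and
 * init_waitqueue_head(&m->drain_wait) would be done at ctr time.
 */
#include <linux/percpu-refcount.h>
#include <linux/wait.h>

struct multipath_sketch {
	struct percpu_ref	io_ref;		/* held across each map */
	wait_queue_head_t	drain_wait;
	/* ... priority groups, current_pgpath (RCU), etc ... */
};

static void mpath_io_ref_release(struct percpu_ref *ref)
{
	struct multipath_sketch *m =
		container_of(ref, struct multipath_sketch, io_ref);

	wake_up_all(&m->drain_wait);
}

/* fast path: no shared lock, just a percpu get/put */
static int mpath_map_sketch(struct multipath_sketch *m)
{
	if (!percpu_ref_tryget_live(&m->io_ref))
		return -EAGAIN;		/* frozen for an update: requeue */

	/* ... pick current path (RCU protected) and remap ... */

	percpu_ref_put(&m->io_ref);	/* or hold until the clone completes */
	return 0;
}

/* slow path: quiesce all mappers before touching path state */
static void mpath_freeze(struct multipath_sketch *m)
{
	percpu_ref_kill(&m->io_ref);
	wait_event(m->drain_wait, percpu_ref_is_zero(&m->io_ref));
}

static void mpath_unfreeze(struct multipath_sketch *m)
{
	percpu_ref_reinit(&m->io_ref);
}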

Thoughts?


[1]:
-  23.67%              fio  [kernel.kallsyms]    [k] queued_spin_lock_slowpath
    - queued_spin_lock_slowpath
       - 51.40% _raw_spin_lock_irqsave
          - 99.98% multipath_busy
               dm_mq_queue_rq
               __blk_mq_run_hw_queue
               blk_mq_run_hw_queue
               blk_mq_insert_requests
               blk_mq_flush_plug_list
               blk_flush_plug_list
               blk_finish_plug
               do_io_submit
               SyS_io_submit
               entry_SYSCALL_64_fastpath
             + io_submit
       - 48.05% _raw_spin_lock_irq
          - 100.00% __multipath_map
               multipath_clone_and_map
               target_message
               dispatch_io
               __blk_mq_run_hw_queue
               blk_mq_run_hw_queue
               blk_mq_insert_requests
               blk_mq_flush_plug_list
               blk_flush_plug_list
               blk_finish_plug
               do_io_submit
               SyS_io_submit
               entry_SYSCALL_64_fastpath
             + io_submit
+   1.70%              fio  [kernel.kallsyms]    [k] __blk_mq_run_hw_queue
+   1.56%              fio  fio                  [.] get_io_u
+   1.06%              fio  [kernel.kallsyms]    [k] blk_account_io_start
+   0.92%              fio  fio                  [.] do_io
+   0.82%              fio  [kernel.kallsyms]    [k] do_blockdev_direct_IO
+   0.81%              fio  [kernel.kallsyms]    [k] blk_mq_hctx_mark_pending
+   0.75%              fio  [kernel.kallsyms]    [k] __blk_mq_alloc_request
+   0.75%              fio  [kernel.kallsyms]    [k] __bt_get
+   0.69%              fio  [kernel.kallsyms]    [k] do_direct_IO

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: dm-multipath low performance with blk-mq
  2016-01-18 12:04 dm-multipath low performance with blk-mq Sagi Grimberg
@ 2016-01-19 10:37 ` Sagi Grimberg
  2016-01-19 22:45     ` Mike Snitzer
  0 siblings, 1 reply; 127+ messages in thread
From: Sagi Grimberg @ 2016-01-19 10:37 UTC (permalink / raw)
  To: device-mapper development
  Cc: Christoph Hellwig, keith.busch, Mike Snitzer, Bart Van Assche

This time with the correct dm-devel...

> Hi All,
>
> I've recently tried out dm-multipath over a "super-fast" nvme device
> and noticed a serious lock contention in dm-multipath that requires some
> extra attention. The nvme device is a simple loopback device emulation
> backed by null_blk device.
>
> With this I've seen dm-multipath pushing around ~470K IOPs while
> the native (loopback) nvme performance can easily push up to 1500K+ IOPs.
>
> perf output [1] reveals a huge lock contention on the multipath lock
> which is a per-dm_target contention point which seem to defeat the
> purpose of blk-mq i/O path.
>
> The two current bottlenecks seem to come from multipath_busy and
> __multipath_map. Would it make better sense to move to a percpu_ref
> model with freeze/unfreeze logic for updates similar to what blk-mq
> is doing?
>
> Thoughts?
>
>
> [1]:
> -  23.67%              fio  [kernel.kallsyms]    [k] queued_spin_lock_slowpath
>     - queued_spin_lock_slowpath
>        - 51.40% _raw_spin_lock_irqsave
>           - 99.98% multipath_busy
>                dm_mq_queue_rq
>                __blk_mq_run_hw_queue
>                blk_mq_run_hw_queue
>                blk_mq_insert_requests
>                blk_mq_flush_plug_list
>                blk_flush_plug_list
>                blk_finish_plug
>                do_io_submit
>                SyS_io_submit
>                entry_SYSCALL_64_fastpath
>              + io_submit
>        - 48.05% _raw_spin_lock_irq
>           - 100.00% __multipath_map
>                multipath_clone_and_map
>                target_message
>                dispatch_io
>                __blk_mq_run_hw_queue
>                blk_mq_run_hw_queue
>                blk_mq_insert_requests
>                blk_mq_flush_plug_list
>                blk_flush_plug_list
>                blk_finish_plug
>                do_io_submit
>                SyS_io_submit
>                entry_SYSCALL_64_fastpath
>              + io_submit
> +   1.70%              fio  [kernel.kallsyms]    [k] __blk_mq_run_hw_queue
> +   1.56%              fio  fio                  [.] get_io_u
> +   1.06%              fio  [kernel.kallsyms]    [k] blk_account_io_start
> +   0.92%              fio  fio                  [.] do_io
> +   0.82%              fio  [kernel.kallsyms]    [k] do_blockdev_direct_IO
> +   0.81%              fio  [kernel.kallsyms]    [k] blk_mq_hctx_mark_pending
> +   0.75%              fio  [kernel.kallsyms]    [k] __blk_mq_alloc_request
> +   0.75%              fio  [kernel.kallsyms]    [k] __bt_get
> +   0.69%              fio  [kernel.kallsyms]    [k] do_direct_IO

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: dm-multipath low performance with blk-mq
@ 2016-01-19 22:45     ` Mike Snitzer
  0 siblings, 0 replies; 127+ messages in thread
From: Mike Snitzer @ 2016-01-19 22:45 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, keith.busch, dm-devel, linux-nvme, Bart Van Assche

On Mon, Jan 18 2016 at  7:04am -0500,
Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:

> Hi All,
> 
> I've recently tried out dm-multipath over a "super-fast" nvme device
> and noticed a serious lock contention in dm-multipath that requires some
> extra attention. The nvme device is a simple loopback device emulation
> backed by null_blk device.
> 
> With this I've seen dm-multipath pushing around ~470K IOPs while
> the native (loopback) nvme performance can easily push up to 1500K+ IOPs.
> 
> perf output [1] reveals a huge lock contention on the multipath lock
> which is a per-dm_target contention point which seem to defeat the
> purpose of blk-mq i/O path.
> 
> The two current bottlenecks seem to come from multipath_busy and
> __multipath_map. Would it make better sense to move to a percpu_ref
> model with freeze/unfreeze logic for updates similar to what blk-mq
> is doing?
>
> Thoughts?

Your perf output clearly does identify the 'struct multipath' spinlock
as a bottleneck.

Is it fair to assume that, as part of your test, you increased
md->tag_set.nr_hw_queues to > 1 in dm_init_request_based_blk_mq_queue()?
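
The change I'm asking about would look something like this (illustrative
only), in dm_init_request_based_blk_mq_queue():

	md->tag_set.nr_hw_queues = num_online_cpus();	/* rather than 1 */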

I'd like to start by replicating your testbed, so I'll see about
setting up the nvme loop driver you referenced in an earlier mail.
Can you share your fio job file and fio command line for your test?
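
Until then I'll guess at something along these lines (placeholder job,
not yours; the device node is made up):

	[global]
	ioengine=libaio
	direct=1
	rw=randread
	bs=4k
	iodepth=32
	numjobs=12
	runtime=60
	time_based
	group_reporting

	[mpath]
	filename=/dev/mapper/mpatha	; placeholder multipath device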

Unrolling the dm-mpath.c implementation of .request_fn vs blk-mq and
identifying a locking strategy for the 'struct multipath' member
accesses will take time to investigate.  If others can spare their
expertise to help speed up the discovery of the proper way forward I'd
very much appreciate it.

I'll consult with people like Mikulas (who did work to improve DM core's
scalability with changes like commit 83d5e5b0af9 "dm: optimize use SRCU
and RCU").

But I'll need to do further research on what fix is appropriate for
increasing the parallelism of the locking across blk-mq queues.  Part of
the challenge is that while blk-mq knows there are multiple queues, the
DM multipath target is currently oblivious to them.  Pushing that
understanding down to the multipath target is likely needed so that
resources can be initialized and managed accordingly.  This is certainly
made more complex by the fact that we still support the old .request_fn
code path (via dm-mpath.c:multipath_map).  But it could easily be that
this new locking strategy will work whether the number of queues is 1
or >1.
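
Thinking out loud, one possible (completely untested) shape would be
per-hw-queue path-selection state, sized when DM core tells the target
how many queues it has, e.g.:

/* illustrative sketch only, not a patch */
struct mpath_hwq {
	struct pgpath *current_pgpath;
	unsigned repeat_count;
	spinlock_t lock;	/* contention becomes per hw queue */
} ____cacheline_aligned_in_smp;

/* in struct multipath: */
unsigned nr_hw_queues;		/* passed down from DM core */
struct mpath_hwq *hwq;		/* kcalloc(nr_hw_queues, sizeof(*hwq), GFP_KERNEL) */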

This discovery will take time but I'll make it a priority and do my
best.

Mike

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: dm-multipath low performance with blk-mq
@ 2016-01-25 21:40       ` Mike Snitzer
  0 siblings, 0 replies; 127+ messages in thread
From: Mike Snitzer @ 2016-01-25 21:40 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, keith.busch, dm-devel, linux-nvme, Bart Van Assche

On Tue, Jan 19 2016 at  5:45pm -0500,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Mon, Jan 18 2016 at  7:04am -0500,
> Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:
> 
> > Hi All,
> > 
> > I've recently tried out dm-multipath over a "super-fast" nvme device
> > and noticed a serious lock contention in dm-multipath that requires some
> > extra attention. The nvme device is a simple loopback device emulation
> > backed by null_blk device.
> > 
> > With this I've seen dm-multipath pushing around ~470K IOPs while
> > the native (loopback) nvme performance can easily push up to 1500K+ IOPs.
> > 
> > perf output [1] reveals a huge lock contention on the multipath lock
> > which is a per-dm_target contention point which seem to defeat the
> > purpose of blk-mq i/O path.
> > 
> > The two current bottlenecks seem to come from multipath_busy and
> > __multipath_map. Would it make better sense to move to a percpu_ref
> > model with freeze/unfreeze logic for updates similar to what blk-mq
> > is doing?
> >
> > Thoughts?
> 
> Your perf output clearly does identify the 'struct multipath' spinlock
> as a bottleneck.
> 
> Is it fair to assume that implied in your test is that you increased
> md->tag_set.nr_hw_queues to > 1 in dm_init_request_based_blk_mq_queue()?
> 
> I'd like to start by replicating your testbed.  So I'll see about
> setting up the nvme loop driver you referenced in earlier mail.
> Can you share your fio job file and fio commandline for your test?

Would still appreciate answers to my 2 questions above (did you modify
md->tag_set.nr_hw_queues and can you share your fio job?)

I've yet to reproduce your config (using hch's nvme loop driver) or
test to verify your findings but I did develop a patch that switches
from spinlock_t to rwlock_t.  I've only compile tested this but I'll try
to reproduce your setup and then test this patch to see if it helps.

Your worst offenders (multipath_busy and __multipath_map) are now using
a read lock in the fast path.

 drivers/md/dm-mpath.c | 127 +++++++++++++++++++++++++++++++++-----------------
 1 file changed, 83 insertions(+), 44 deletions(-)

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index cfa29f5..34aadb1 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -11,6 +11,7 @@
 #include "dm-path-selector.h"
 #include "dm-uevent.h"
 
+#include <linux/rwlock_types.h>
 #include <linux/blkdev.h>
 #include <linux/ctype.h>
 #include <linux/init.h>
@@ -67,7 +68,7 @@ struct multipath {
 	const char *hw_handler_name;
 	char *hw_handler_params;
 
-	spinlock_t lock;
+	rwlock_t lock;
 
 	unsigned nr_priority_groups;
 	struct list_head priority_groups;
@@ -189,7 +190,7 @@ static struct multipath *alloc_multipath(struct dm_target *ti)
 	m = kzalloc(sizeof(*m), GFP_KERNEL);
 	if (m) {
 		INIT_LIST_HEAD(&m->priority_groups);
-		spin_lock_init(&m->lock);
+		rwlock_init(&m->lock);
 		m->queue_io = 1;
 		m->pg_init_delay_msecs = DM_PG_INIT_DELAY_DEFAULT;
 		INIT_WORK(&m->trigger_event, trigger_event);
@@ -386,12 +387,24 @@ static int __multipath_map(struct dm_target *ti, struct request *clone,
 	struct pgpath *pgpath;
 	struct block_device *bdev;
 	struct dm_mpath_io *mpio;
+	bool use_write_lock = false;
 
-	spin_lock_irq(&m->lock);
+retry:
+	if (!use_write_lock)
+		read_lock_irq(&m->lock);
+	else
+		write_lock_irq(&m->lock);
 
 	/* Do we need to select a new pgpath? */
-	if (!m->current_pgpath ||
-	    (!m->queue_io && (m->repeat_count && --m->repeat_count == 0)))
+	if (!use_write_lock) {
+		if (!m->current_pgpath ||
+		    (!m->queue_io && (m->repeat_count && m->repeat_count == 1))) {
+			use_write_lock = true;
+			read_unlock_irq(&m->lock);
+			goto retry;
+		}
+	} else if (!m->current_pgpath ||
+		   (!m->queue_io && (m->repeat_count && --m->repeat_count == 0)))
 		__choose_pgpath(m, nr_bytes);
 
 	pgpath = m->current_pgpath;
@@ -401,13 +414,23 @@ static int __multipath_map(struct dm_target *ti, struct request *clone,
 			r = -EIO;	/* Failed */
 		goto out_unlock;
 	} else if (m->queue_io || m->pg_init_required) {
+		if (!use_write_lock) {
+			use_write_lock = true;
+			read_unlock_irq(&m->lock);
+			goto retry;
+		}
 		__pg_init_all_paths(m);
 		goto out_unlock;
 	}
 
+	if (!use_write_lock)
+		read_unlock_irq(&m->lock);
+	else
+		write_unlock_irq(&m->lock);
+
 	if (set_mapinfo(m, map_context) < 0)
 		/* ENOMEM, requeue */
-		goto out_unlock;
+		return r;
 
 	mpio = map_context->ptr;
 	mpio->pgpath = pgpath;
@@ -415,8 +438,6 @@ static int __multipath_map(struct dm_target *ti, struct request *clone,
 
 	bdev = pgpath->path.dev->bdev;
 
-	spin_unlock_irq(&m->lock);
-
 	if (clone) {
 		/* Old request-based interface: allocated clone is passed in */
 		clone->q = bdev_get_queue(bdev);
@@ -443,7 +464,10 @@ static int __multipath_map(struct dm_target *ti, struct request *clone,
 	return DM_MAPIO_REMAPPED;
 
 out_unlock:
-	spin_unlock_irq(&m->lock);
+	if (!use_write_lock)
+		read_unlock_irq(&m->lock);
+	else
+		write_unlock_irq(&m->lock);
 
 	return r;
 }
@@ -474,14 +498,15 @@ static int queue_if_no_path(struct multipath *m, unsigned queue_if_no_path,
 {
 	unsigned long flags;
 
-	spin_lock_irqsave(&m->lock, flags);
+	write_lock_irqsave(&m->lock, flags);
 
 	if (save_old_value)
 		m->saved_queue_if_no_path = m->queue_if_no_path;
 	else
 		m->saved_queue_if_no_path = queue_if_no_path;
 	m->queue_if_no_path = queue_if_no_path;
-	spin_unlock_irqrestore(&m->lock, flags);
+
+	write_unlock_irqrestore(&m->lock, flags);
 
 	if (!queue_if_no_path)
 		dm_table_run_md_queue_async(m->ti->table);
@@ -898,12 +923,12 @@ static void multipath_wait_for_pg_init_completion(struct multipath *m)
 	while (1) {
 		set_current_state(TASK_UNINTERRUPTIBLE);
 
-		spin_lock_irqsave(&m->lock, flags);
+		read_lock_irqsave(&m->lock, flags);
 		if (!m->pg_init_in_progress) {
-			spin_unlock_irqrestore(&m->lock, flags);
+			read_unlock_irqrestore(&m->lock, flags);
 			break;
 		}
-		spin_unlock_irqrestore(&m->lock, flags);
+		read_unlock_irqrestore(&m->lock, flags);
 
 		io_schedule();
 	}
@@ -916,18 +941,18 @@ static void flush_multipath_work(struct multipath *m)
 {
 	unsigned long flags;
 
-	spin_lock_irqsave(&m->lock, flags);
+	write_lock_irqsave(&m->lock, flags);
 	m->pg_init_disabled = 1;
-	spin_unlock_irqrestore(&m->lock, flags);
+	write_unlock_irqrestore(&m->lock, flags);
 
 	flush_workqueue(kmpath_handlerd);
 	multipath_wait_for_pg_init_completion(m);
 	flush_workqueue(kmultipathd);
 	flush_work(&m->trigger_event);
 
-	spin_lock_irqsave(&m->lock, flags);
+	write_lock_irqsave(&m->lock, flags);
 	m->pg_init_disabled = 0;
-	spin_unlock_irqrestore(&m->lock, flags);
+	write_unlock_irqrestore(&m->lock, flags);
 }
 
 static void multipath_dtr(struct dm_target *ti)
@@ -946,7 +971,7 @@ static int fail_path(struct pgpath *pgpath)
 	unsigned long flags;
 	struct multipath *m = pgpath->pg->m;
 
-	spin_lock_irqsave(&m->lock, flags);
+	write_lock_irqsave(&m->lock, flags);
 
 	if (!pgpath->is_active)
 		goto out;
@@ -968,7 +993,7 @@ static int fail_path(struct pgpath *pgpath)
 	schedule_work(&m->trigger_event);
 
 out:
-	spin_unlock_irqrestore(&m->lock, flags);
+	write_unlock_irqrestore(&m->lock, flags);
 
 	return 0;
 }
@@ -982,7 +1007,7 @@ static int reinstate_path(struct pgpath *pgpath)
 	unsigned long flags;
 	struct multipath *m = pgpath->pg->m;
 
-	spin_lock_irqsave(&m->lock, flags);
+	write_lock_irqsave(&m->lock, flags);
 
 	if (pgpath->is_active)
 		goto out;
@@ -1014,7 +1039,7 @@ static int reinstate_path(struct pgpath *pgpath)
 	schedule_work(&m->trigger_event);
 
 out:
-	spin_unlock_irqrestore(&m->lock, flags);
+	write_unlock_irqrestore(&m->lock, flags);
 	if (run_queue)
 		dm_table_run_md_queue_async(m->ti->table);
 
@@ -1049,13 +1074,13 @@ static void bypass_pg(struct multipath *m, struct priority_group *pg,
 {
 	unsigned long flags;
 
-	spin_lock_irqsave(&m->lock, flags);
+	write_lock_irqsave(&m->lock, flags);
 
 	pg->bypassed = bypassed;
 	m->current_pgpath = NULL;
 	m->current_pg = NULL;
 
-	spin_unlock_irqrestore(&m->lock, flags);
+	write_unlock_irqrestore(&m->lock, flags);
 
 	schedule_work(&m->trigger_event);
 }
@@ -1076,7 +1101,7 @@ static int switch_pg_num(struct multipath *m, const char *pgstr)
 		return -EINVAL;
 	}
 
-	spin_lock_irqsave(&m->lock, flags);
+	write_lock_irqsave(&m->lock, flags);
 	list_for_each_entry(pg, &m->priority_groups, list) {
 		pg->bypassed = 0;
 		if (--pgnum)
@@ -1086,7 +1111,7 @@ static int switch_pg_num(struct multipath *m, const char *pgstr)
 		m->current_pg = NULL;
 		m->next_pg = pg;
 	}
-	spin_unlock_irqrestore(&m->lock, flags);
+	write_unlock_irqrestore(&m->lock, flags);
 
 	schedule_work(&m->trigger_event);
 	return 0;
@@ -1125,14 +1150,14 @@ static int pg_init_limit_reached(struct multipath *m, struct pgpath *pgpath)
 	unsigned long flags;
 	int limit_reached = 0;
 
-	spin_lock_irqsave(&m->lock, flags);
+	write_lock_irqsave(&m->lock, flags);
 
 	if (m->pg_init_count <= m->pg_init_retries && !m->pg_init_disabled)
 		m->pg_init_required = 1;
 	else
 		limit_reached = 1;
 
-	spin_unlock_irqrestore(&m->lock, flags);
+	write_unlock_irqrestore(&m->lock, flags);
 
 	return limit_reached;
 }
@@ -1186,7 +1211,7 @@ static void pg_init_done(void *data, int errors)
 		fail_path(pgpath);
 	}
 
-	spin_lock_irqsave(&m->lock, flags);
+	write_lock_irqsave(&m->lock, flags);
 	if (errors) {
 		if (pgpath == m->current_pgpath) {
 			DMERR("Could not failover device. Error %d.", errors);
@@ -1213,7 +1238,7 @@ static void pg_init_done(void *data, int errors)
 	wake_up(&m->pg_init_wait);
 
 out:
-	spin_unlock_irqrestore(&m->lock, flags);
+	write_unlock_irqrestore(&m->lock, flags);
 }
 
 static void activate_path(struct work_struct *work)
@@ -1272,7 +1297,7 @@ static int do_end_io(struct multipath *m, struct request *clone,
 	if (mpio->pgpath)
 		fail_path(mpio->pgpath);
 
-	spin_lock_irqsave(&m->lock, flags);
+	read_lock_irqsave(&m->lock, flags);
 	if (!m->nr_valid_paths) {
 		if (!m->queue_if_no_path) {
 			if (!__must_push_back(m))
@@ -1282,7 +1307,7 @@ static int do_end_io(struct multipath *m, struct request *clone,
 				r = error;
 		}
 	}
-	spin_unlock_irqrestore(&m->lock, flags);
+	read_unlock_irqrestore(&m->lock, flags);
 
 	return r;
 }
@@ -1340,9 +1365,9 @@ static void multipath_resume(struct dm_target *ti)
 	struct multipath *m = (struct multipath *) ti->private;
 	unsigned long flags;
 
-	spin_lock_irqsave(&m->lock, flags);
+	write_lock_irqsave(&m->lock, flags);
 	m->queue_if_no_path = m->saved_queue_if_no_path;
-	spin_unlock_irqrestore(&m->lock, flags);
+	write_unlock_irqrestore(&m->lock, flags);
 }
 
 /*
@@ -1372,7 +1397,7 @@ static void multipath_status(struct dm_target *ti, status_type_t type,
 	unsigned pg_num;
 	char state;
 
-	spin_lock_irqsave(&m->lock, flags);
+	read_lock_irqsave(&m->lock, flags);
 
 	/* Features */
 	if (type == STATUSTYPE_INFO)
@@ -1467,7 +1492,7 @@ static void multipath_status(struct dm_target *ti, status_type_t type,
 		break;
 	}
 
-	spin_unlock_irqrestore(&m->lock, flags);
+	read_unlock_irqrestore(&m->lock, flags);
 }
 
 static int multipath_message(struct dm_target *ti, unsigned argc, char **argv)
@@ -1534,16 +1559,27 @@ out:
 }
 
 static int multipath_prepare_ioctl(struct dm_target *ti,
-		struct block_device **bdev, fmode_t *mode)
+				   struct block_device **bdev, fmode_t *mode)
 {
 	struct multipath *m = ti->private;
 	unsigned long flags;
 	int r;
+	bool use_write_lock = false;
 
-	spin_lock_irqsave(&m->lock, flags);
+retry:
+	if (!use_write_lock)
+		read_lock_irqsave(&m->lock, flags);
+	else
+		write_lock_irqsave(&m->lock, flags);
 
-	if (!m->current_pgpath)
+	if (!m->current_pgpath) {
+		if (!use_write_lock) {
+			use_write_lock = true;
+			read_unlock_irqrestore(&m->lock, flags);
+			goto retry;
+		}
 		__choose_pgpath(m, 0);
+	}
 
 	if (m->current_pgpath) {
 		if (!m->queue_io) {
@@ -1562,17 +1598,20 @@ static int multipath_prepare_ioctl(struct dm_target *ti,
 			r = -EIO;
 	}
 
-	spin_unlock_irqrestore(&m->lock, flags);
+	if (!use_write_lock)
+		read_unlock_irqrestore(&m->lock, flags);
+	else
+		write_unlock_irqrestore(&m->lock, flags);
 
 	if (r == -ENOTCONN) {
-		spin_lock_irqsave(&m->lock, flags);
+		write_lock_irqsave(&m->lock, flags);
 		if (!m->current_pg) {
 			/* Path status changed, redo selection */
 			__choose_pgpath(m, 0);
 		}
 		if (m->pg_init_required)
 			__pg_init_all_paths(m);
-		spin_unlock_irqrestore(&m->lock, flags);
+		write_unlock_irqrestore(&m->lock, flags);
 		dm_table_run_md_queue_async(m->ti->table);
 	}
 
@@ -1627,7 +1666,7 @@ static int multipath_busy(struct dm_target *ti)
 	struct pgpath *pgpath;
 	unsigned long flags;
 
-	spin_lock_irqsave(&m->lock, flags);
+	read_lock_irqsave(&m->lock, flags);
 
 	/* pg_init in progress or no paths available */
 	if (m->pg_init_in_progress ||
@@ -1674,7 +1713,7 @@ static int multipath_busy(struct dm_target *ti)
 		busy = 0;
 
 out:
-	spin_unlock_irqrestore(&m->lock, flags);
+	read_unlock_irqrestore(&m->lock, flags);
 
 	return busy;
 }

^ permalink raw reply related	[flat|nested] 127+ messages in thread

* Re: dm-multipath low performance with blk-mq
@ 2016-01-25 23:37         ` Benjamin Marzinski
  0 siblings, 0 replies; 127+ messages in thread
From: Benjamin Marzinski @ 2016-01-25 23:37 UTC (permalink / raw)
  To: device-mapper development
  Cc: Christoph Hellwig, keith.busch, Bart Van Assche, linux-nvme,
	Sagi Grimberg

On Mon, Jan 25, 2016 at 04:40:16PM -0500, Mike Snitzer wrote:
> On Tue, Jan 19 2016 at  5:45P -0500,
> Mike Snitzer <snitzer@redhat.com> wrote:

I don't think this is going to help __multipath_map() without some
configuration changes.  Now that we're running on already merged
requests instead of bios, the m->repeat_count is almost always set to 1,
so we call the path_selector every time, which means that we'll always
need the write lock. Bumping up the number of IOs we send before calling
the path selector again will give this patch a chance to do some good
here.

To do that you need to set:

	rr_min_io_rq <something_bigger_than_one>

in the defaults section of /etc/multipath.conf and then reload the
multipathd service.
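
For example (the value here is just an illustration):

	defaults {
		rr_min_io_rq	100
	}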

The patch should hopefully help in multipath_busy() regardless of the
rr_min_io_rq setting.

-Ben

> 
> > On Mon, Jan 18 2016 at  7:04am -0500,
> > Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:
> > 
> > > Hi All,
> > > 
> > > I've recently tried out dm-multipath over a "super-fast" nvme device
> > > and noticed a serious lock contention in dm-multipath that requires some
> > > extra attention. The nvme device is a simple loopback device emulation
> > > backed by null_blk device.
> > > 
> > > With this I've seen dm-multipath pushing around ~470K IOPs while
> > > the native (loopback) nvme performance can easily push up to 1500K+ IOPs.
> > > 
> > > perf output [1] reveals a huge lock contention on the multipath lock
> > > which is a per-dm_target contention point which seem to defeat the
> > > purpose of blk-mq i/O path.
> > > 
> > > The two current bottlenecks seem to come from multipath_busy and
> > > __multipath_map. Would it make better sense to move to a percpu_ref
> > > model with freeze/unfreeze logic for updates similar to what blk-mq
> > > is doing?
> > >
> > > Thoughts?
> > 
> > Your perf output clearly does identify the 'struct multipath' spinlock
> > as a bottleneck.
> > 
> > Is it fair to assume that implied in your test is that you increased
> > md->tag_set.nr_hw_queues to > 1 in dm_init_request_based_blk_mq_queue()?
> > 
> > I'd like to start by replicating your testbed.  So I'll see about
> > setting up the nvme loop driver you referenced in earlier mail.
> > Can you share your fio job file and fio commandline for your test?
> 
> Would still appreciate answers to my 2 questions above (did you modify
> md->tag_set.nr_hw_queues and can you share your fio job?)
> 
> I've yet to reproduce your config (using hch's nvme loop driver) or
> test to verify your findings but I did develop a patch that switches
> from spinlock_t to rwlock_t.  I've only compile tested this but I'll try
> to reproduce your setup and then test this patch to see if it helps.
> 
> Your worst offenders (multipath_busy and __multipath_map) are now using
> a read lock in the fast path.
> 
>  drivers/md/dm-mpath.c | 127 +++++++++++++++++++++++++++++++++-----------------
>  1 file changed, 83 insertions(+), 44 deletions(-)
> 
> diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
> index cfa29f5..34aadb1 100644
> --- a/drivers/md/dm-mpath.c
> +++ b/drivers/md/dm-mpath.c
> @@ -11,6 +11,7 @@
>  #include "dm-path-selector.h"
>  #include "dm-uevent.h"
>  
> +#include <linux/rwlock_types.h>
>  #include <linux/blkdev.h>
>  #include <linux/ctype.h>
>  #include <linux/init.h>
> @@ -67,7 +68,7 @@ struct multipath {
>  	const char *hw_handler_name;
>  	char *hw_handler_params;
>  
> -	spinlock_t lock;
> +	rwlock_t lock;
>  
>  	unsigned nr_priority_groups;
>  	struct list_head priority_groups;
> @@ -189,7 +190,7 @@ static struct multipath *alloc_multipath(struct dm_target *ti)
>  	m = kzalloc(sizeof(*m), GFP_KERNEL);
>  	if (m) {
>  		INIT_LIST_HEAD(&m->priority_groups);
> -		spin_lock_init(&m->lock);
> +		rwlock_init(&m->lock);
>  		m->queue_io = 1;
>  		m->pg_init_delay_msecs = DM_PG_INIT_DELAY_DEFAULT;
>  		INIT_WORK(&m->trigger_event, trigger_event);
> @@ -386,12 +387,24 @@ static int __multipath_map(struct dm_target *ti, struct request *clone,
>  	struct pgpath *pgpath;
>  	struct block_device *bdev;
>  	struct dm_mpath_io *mpio;
> +	bool use_write_lock = false;
>  
> -	spin_lock_irq(&m->lock);
> +retry:
> +	if (!use_write_lock)
> +		read_lock_irq(&m->lock);
> +	else
> +		write_lock_irq(&m->lock);
>  
>  	/* Do we need to select a new pgpath? */
> -	if (!m->current_pgpath ||
> -	    (!m->queue_io && (m->repeat_count && --m->repeat_count == 0)))
> +	if (!use_write_lock) {
> +		if (!m->current_pgpath ||
> +		    (!m->queue_io && (m->repeat_count && m->repeat_count == 1))) {
> +			use_write_lock = true;
> +			read_unlock_irq(&m->lock);
> +			goto retry;
> +		}
> +	} else if (!m->current_pgpath ||
> +		   (!m->queue_io && (m->repeat_count && --m->repeat_count == 0)))
>  		__choose_pgpath(m, nr_bytes);
>  
>  	pgpath = m->current_pgpath;
> @@ -401,13 +414,23 @@ static int __multipath_map(struct dm_target *ti, struct request *clone,
>  			r = -EIO;	/* Failed */
>  		goto out_unlock;
>  	} else if (m->queue_io || m->pg_init_required) {
> +		if (!use_write_lock) {
> +			use_write_lock = true;
> +			read_unlock_irq(&m->lock);
> +			goto retry;
> +		}
>  		__pg_init_all_paths(m);
>  		goto out_unlock;
>  	}
>  
> +	if (!use_write_lock)
> +		read_unlock_irq(&m->lock);
> +	else
> +		write_unlock_irq(&m->lock);
> +
>  	if (set_mapinfo(m, map_context) < 0)
>  		/* ENOMEM, requeue */
> -		goto out_unlock;
> +		return r;
>  
>  	mpio = map_context->ptr;
>  	mpio->pgpath = pgpath;
> @@ -415,8 +438,6 @@ static int __multipath_map(struct dm_target *ti, struct request *clone,
>  
>  	bdev = pgpath->path.dev->bdev;
>  
> -	spin_unlock_irq(&m->lock);
> -
>  	if (clone) {
>  		/* Old request-based interface: allocated clone is passed in */
>  		clone->q = bdev_get_queue(bdev);
> @@ -443,7 +464,10 @@ static int __multipath_map(struct dm_target *ti, struct request *clone,
>  	return DM_MAPIO_REMAPPED;
>  
>  out_unlock:
> -	spin_unlock_irq(&m->lock);
> +	if (!use_write_lock)
> +		read_unlock_irq(&m->lock);
> +	else
> +		write_unlock_irq(&m->lock);
>  
>  	return r;
>  }
> @@ -474,14 +498,15 @@ static int queue_if_no_path(struct multipath *m, unsigned queue_if_no_path,
>  {
>  	unsigned long flags;
>  
> -	spin_lock_irqsave(&m->lock, flags);
> +	write_lock_irqsave(&m->lock, flags);
>  
>  	if (save_old_value)
>  		m->saved_queue_if_no_path = m->queue_if_no_path;
>  	else
>  		m->saved_queue_if_no_path = queue_if_no_path;
>  	m->queue_if_no_path = queue_if_no_path;
> -	spin_unlock_irqrestore(&m->lock, flags);
> +
> +	write_unlock_irqrestore(&m->lock, flags);
>  
>  	if (!queue_if_no_path)
>  		dm_table_run_md_queue_async(m->ti->table);
> @@ -898,12 +923,12 @@ static void multipath_wait_for_pg_init_completion(struct multipath *m)
>  	while (1) {
>  		set_current_state(TASK_UNINTERRUPTIBLE);
>  
> -		spin_lock_irqsave(&m->lock, flags);
> +		read_lock_irqsave(&m->lock, flags);
>  		if (!m->pg_init_in_progress) {
> -			spin_unlock_irqrestore(&m->lock, flags);
> +			read_unlock_irqrestore(&m->lock, flags);
>  			break;
>  		}
> -		spin_unlock_irqrestore(&m->lock, flags);
> +		read_unlock_irqrestore(&m->lock, flags);
>  
>  		io_schedule();
>  	}
> @@ -916,18 +941,18 @@ static void flush_multipath_work(struct multipath *m)
>  {
>  	unsigned long flags;
>  
> -	spin_lock_irqsave(&m->lock, flags);
> +	write_lock_irqsave(&m->lock, flags);
>  	m->pg_init_disabled = 1;
> -	spin_unlock_irqrestore(&m->lock, flags);
> +	write_unlock_irqrestore(&m->lock, flags);
>  
>  	flush_workqueue(kmpath_handlerd);
>  	multipath_wait_for_pg_init_completion(m);
>  	flush_workqueue(kmultipathd);
>  	flush_work(&m->trigger_event);
>  
> -	spin_lock_irqsave(&m->lock, flags);
> +	write_lock_irqsave(&m->lock, flags);
>  	m->pg_init_disabled = 0;
> -	spin_unlock_irqrestore(&m->lock, flags);
> +	write_unlock_irqrestore(&m->lock, flags);
>  }
>  
>  static void multipath_dtr(struct dm_target *ti)
> @@ -946,7 +971,7 @@ static int fail_path(struct pgpath *pgpath)
>  	unsigned long flags;
>  	struct multipath *m = pgpath->pg->m;
>  
> -	spin_lock_irqsave(&m->lock, flags);
> +	write_lock_irqsave(&m->lock, flags);
>  
>  	if (!pgpath->is_active)
>  		goto out;
> @@ -968,7 +993,7 @@ static int fail_path(struct pgpath *pgpath)
>  	schedule_work(&m->trigger_event);
>  
>  out:
> -	spin_unlock_irqrestore(&m->lock, flags);
> +	write_unlock_irqrestore(&m->lock, flags);
>  
>  	return 0;
>  }
> @@ -982,7 +1007,7 @@ static int reinstate_path(struct pgpath *pgpath)
>  	unsigned long flags;
>  	struct multipath *m = pgpath->pg->m;
>  
> -	spin_lock_irqsave(&m->lock, flags);
> +	write_lock_irqsave(&m->lock, flags);
>  
>  	if (pgpath->is_active)
>  		goto out;
> @@ -1014,7 +1039,7 @@ static int reinstate_path(struct pgpath *pgpath)
>  	schedule_work(&m->trigger_event);
>  
>  out:
> -	spin_unlock_irqrestore(&m->lock, flags);
> +	write_unlock_irqrestore(&m->lock, flags);
>  	if (run_queue)
>  		dm_table_run_md_queue_async(m->ti->table);
>  
> @@ -1049,13 +1074,13 @@ static void bypass_pg(struct multipath *m, struct priority_group *pg,
>  {
>  	unsigned long flags;
>  
> -	spin_lock_irqsave(&m->lock, flags);
> +	write_lock_irqsave(&m->lock, flags);
>  
>  	pg->bypassed = bypassed;
>  	m->current_pgpath = NULL;
>  	m->current_pg = NULL;
>  
> -	spin_unlock_irqrestore(&m->lock, flags);
> +	write_unlock_irqrestore(&m->lock, flags);
>  
>  	schedule_work(&m->trigger_event);
>  }
> @@ -1076,7 +1101,7 @@ static int switch_pg_num(struct multipath *m, const char *pgstr)
>  		return -EINVAL;
>  	}
>  
> -	spin_lock_irqsave(&m->lock, flags);
> +	write_lock_irqsave(&m->lock, flags);
>  	list_for_each_entry(pg, &m->priority_groups, list) {
>  		pg->bypassed = 0;
>  		if (--pgnum)
> @@ -1086,7 +1111,7 @@ static int switch_pg_num(struct multipath *m, const char *pgstr)
>  		m->current_pg = NULL;
>  		m->next_pg = pg;
>  	}
> -	spin_unlock_irqrestore(&m->lock, flags);
> +	write_unlock_irqrestore(&m->lock, flags);
>  
>  	schedule_work(&m->trigger_event);
>  	return 0;
> @@ -1125,14 +1150,14 @@ static int pg_init_limit_reached(struct multipath *m, struct pgpath *pgpath)
>  	unsigned long flags;
>  	int limit_reached = 0;
>  
> -	spin_lock_irqsave(&m->lock, flags);
> +	write_lock_irqsave(&m->lock, flags);
>  
>  	if (m->pg_init_count <= m->pg_init_retries && !m->pg_init_disabled)
>  		m->pg_init_required = 1;
>  	else
>  		limit_reached = 1;
>  
> -	spin_unlock_irqrestore(&m->lock, flags);
> +	write_unlock_irqrestore(&m->lock, flags);
>  
>  	return limit_reached;
>  }
> @@ -1186,7 +1211,7 @@ static void pg_init_done(void *data, int errors)
>  		fail_path(pgpath);
>  	}
>  
> -	spin_lock_irqsave(&m->lock, flags);
> +	write_lock_irqsave(&m->lock, flags);
>  	if (errors) {
>  		if (pgpath == m->current_pgpath) {
>  			DMERR("Could not failover device. Error %d.", errors);
> @@ -1213,7 +1238,7 @@ static void pg_init_done(void *data, int errors)
>  	wake_up(&m->pg_init_wait);
>  
>  out:
> -	spin_unlock_irqrestore(&m->lock, flags);
> +	write_unlock_irqrestore(&m->lock, flags);
>  }
>  
>  static void activate_path(struct work_struct *work)
> @@ -1272,7 +1297,7 @@ static int do_end_io(struct multipath *m, struct request *clone,
>  	if (mpio->pgpath)
>  		fail_path(mpio->pgpath);
>  
> -	spin_lock_irqsave(&m->lock, flags);
> +	read_lock_irqsave(&m->lock, flags);
>  	if (!m->nr_valid_paths) {
>  		if (!m->queue_if_no_path) {
>  			if (!__must_push_back(m))
> @@ -1282,7 +1307,7 @@ static int do_end_io(struct multipath *m, struct request *clone,
>  				r = error;
>  		}
>  	}
> -	spin_unlock_irqrestore(&m->lock, flags);
> +	read_unlock_irqrestore(&m->lock, flags);
>  
>  	return r;
>  }
> @@ -1340,9 +1365,9 @@ static void multipath_resume(struct dm_target *ti)
>  	struct multipath *m = (struct multipath *) ti->private;
>  	unsigned long flags;
>  
> -	spin_lock_irqsave(&m->lock, flags);
> +	write_lock_irqsave(&m->lock, flags);
>  	m->queue_if_no_path = m->saved_queue_if_no_path;
> -	spin_unlock_irqrestore(&m->lock, flags);
> +	write_unlock_irqrestore(&m->lock, flags);
>  }
>  
>  /*
> @@ -1372,7 +1397,7 @@ static void multipath_status(struct dm_target *ti, status_type_t type,
>  	unsigned pg_num;
>  	char state;
>  
> -	spin_lock_irqsave(&m->lock, flags);
> +	read_lock_irqsave(&m->lock, flags);
>  
>  	/* Features */
>  	if (type == STATUSTYPE_INFO)
> @@ -1467,7 +1492,7 @@ static void multipath_status(struct dm_target *ti, status_type_t type,
>  		break;
>  	}
>  
> -	spin_unlock_irqrestore(&m->lock, flags);
> +	read_unlock_irqrestore(&m->lock, flags);
>  }
>  
>  static int multipath_message(struct dm_target *ti, unsigned argc, char **argv)
> @@ -1534,16 +1559,27 @@ out:
>  }
>  
>  static int multipath_prepare_ioctl(struct dm_target *ti,
> -		struct block_device **bdev, fmode_t *mode)
> +				   struct block_device **bdev, fmode_t *mode)
>  {
>  	struct multipath *m = ti->private;
>  	unsigned long flags;
>  	int r;
> +	bool use_write_lock = false;
>  
> -	spin_lock_irqsave(&m->lock, flags);
> +retry:
> +	if (!use_write_lock)
> +		read_lock_irqsave(&m->lock, flags);
> +	else
> +		write_lock_irqsave(&m->lock, flags);
>  
> -	if (!m->current_pgpath)
> +	if (!m->current_pgpath) {
> +		if (!use_write_lock) {
> +			use_write_lock = true;
> +			read_unlock_irqrestore(&m->lock, flags);
> +			goto retry;
> +		}
>  		__choose_pgpath(m, 0);
> +	}
>  
>  	if (m->current_pgpath) {
>  		if (!m->queue_io) {
> @@ -1562,17 +1598,20 @@ static int multipath_prepare_ioctl(struct dm_target *ti,
>  			r = -EIO;
>  	}
>  
> -	spin_unlock_irqrestore(&m->lock, flags);
> +	if (!use_write_lock)
> +		read_unlock_irqrestore(&m->lock, flags);
> +	else
> +		write_unlock_irqrestore(&m->lock, flags);
>  
>  	if (r == -ENOTCONN) {
> -		spin_lock_irqsave(&m->lock, flags);
> +		write_lock_irqsave(&m->lock, flags);
>  		if (!m->current_pg) {
>  			/* Path status changed, redo selection */
>  			__choose_pgpath(m, 0);
>  		}
>  		if (m->pg_init_required)
>  			__pg_init_all_paths(m);
> -		spin_unlock_irqrestore(&m->lock, flags);
> +		write_unlock_irqrestore(&m->lock, flags);
>  		dm_table_run_md_queue_async(m->ti->table);
>  	}
>  
> @@ -1627,7 +1666,7 @@ static int multipath_busy(struct dm_target *ti)
>  	struct pgpath *pgpath;
>  	unsigned long flags;
>  
> -	spin_lock_irqsave(&m->lock, flags);
> +	read_lock_irqsave(&m->lock, flags);
>  
>  	/* pg_init in progress or no paths available */
>  	if (m->pg_init_in_progress ||
> @@ -1674,7 +1713,7 @@ static int multipath_busy(struct dm_target *ti)
>  		busy = 0;
>  
>  out:
> -	spin_unlock_irqrestore(&m->lock, flags);
> +	read_unlock_irqrestore(&m->lock, flags);
>  
>  	return busy;
>  }
> 
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel

^ permalink raw reply	[flat|nested] 127+ messages in thread

* [dm-devel] dm-multipath low performance with blk-mq
  2016-01-25 21:40       ` Mike Snitzer
@ 2016-01-26  1:49         ` Benjamin Marzinski
  -1 siblings, 0 replies; 127+ messages in thread
From: Benjamin Marzinski @ 2016-01-26  1:49 UTC (permalink / raw)


On Mon, Jan 25, 2016@04:40:16PM -0500, Mike Snitzer wrote:
> On Tue, Jan 19 2016 at  5:45P -0500,
> Mike Snitzer <snitzer@redhat.com> wrote:
> 
> > On Mon, Jan 18 2016 at  7:04am -0500,
> > Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:
> 
> Would still appreciate answers to my 2 questions above (did you modify
> md->tag_set.nr_hw_queues and can you share your fio job?)

The number of CPUs (if it's not obvious from the fio job) and any
relevant information about the memory architecture (is this a NUMA
system? How many nodes?) might also help in reproducing this.

-Ben

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: dm-multipath low performance with blk-mq
@ 2016-01-26  1:49         ` Benjamin Marzinski
  0 siblings, 0 replies; 127+ messages in thread
From: Benjamin Marzinski @ 2016-01-26  1:49 UTC (permalink / raw)
  To: device-mapper development
  Cc: Christoph Hellwig, keith.busch, Bart Van Assche, linux-nvme,
	Sagi Grimberg

On Mon, Jan 25, 2016 at 04:40:16PM -0500, Mike Snitzer wrote:
> On Tue, Jan 19 2016 at  5:45P -0500,
> Mike Snitzer <snitzer@redhat.com> wrote:
> 
> > On Mon, Jan 18 2016 at  7:04am -0500,
> > Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:
> 
> Would still appreciate answers to my 2 questions above (did you modify
> md->tag_set.nr_hw_queues and can you share your fio job?)

The number of CPUs (if it's not obvious from the fio job) and any
relevant information about the memory architecture (is this a NUMA
system? How many nodes?) might also help in reproducing this.

-Ben

^ permalink raw reply	[flat|nested] 127+ messages in thread

* dm-multipath low performance with blk-mq
  2016-01-25 23:37         ` Benjamin Marzinski
@ 2016-01-26 13:29           ` Mike Snitzer
  -1 siblings, 0 replies; 127+ messages in thread
From: Mike Snitzer @ 2016-01-26 13:29 UTC (permalink / raw)


On Mon, Jan 25 2016 at  6:37pm -0500,
Benjamin Marzinski <bmarzins@redhat.com> wrote:

> On Mon, Jan 25, 2016@04:40:16PM -0500, Mike Snitzer wrote:
> > On Tue, Jan 19 2016 at  5:45P -0500,
> > Mike Snitzer <snitzer@redhat.com> wrote:
> 
> I don't think this is going to help __multipath_map() without some
> configuration changes.  Now that we're running on already merged
> requests instead of bios, the m->repeat_count is almost always set to 1,
> so we call the path_selector every time, which means that we'll always
> need the write lock. Bumping up the number of IOs we send before calling
> the path selector again will give this patch a change to do some good
> here.
> 
> To do that you need to set:
> 
> 	rr_min_io_rq <something_bigger_than_one>
> 
> in the defaults section of /etc/multipath.conf and then reload the
> multipathd service.
> 
> The patch should hopefully help in multipath_busy() regardless of the
> the rr_min_io_rq setting.

This patch, while generic, is meant to help the blk-mq case.  A blk-mq
request_queue doesn't have an elevator so the requests will not have
seen merging.

But yes, implied in the patch is the requirement to increase
m->repeat_count via multipathd's rr_min_io_rq (I'll backfill a proper
header once it is tested).

Thanks,
Mike

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: dm-multipath low performance with blk-mq
@ 2016-01-26 13:29           ` Mike Snitzer
  0 siblings, 0 replies; 127+ messages in thread
From: Mike Snitzer @ 2016-01-26 13:29 UTC (permalink / raw)
  To: Benjamin Marzinski
  Cc: keith.busch, Sagi Grimberg, linux-nvme, Christoph Hellwig,
	device-mapper development, Bart Van Assche

On Mon, Jan 25 2016 at  6:37pm -0500,
Benjamin Marzinski <bmarzins@redhat.com> wrote:

> On Mon, Jan 25, 2016 at 04:40:16PM -0500, Mike Snitzer wrote:
> > On Tue, Jan 19 2016 at  5:45P -0500,
> > Mike Snitzer <snitzer@redhat.com> wrote:
> 
> I don't think this is going to help __multipath_map() without some
> configuration changes.  Now that we're running on already merged
> requests instead of bios, the m->repeat_count is almost always set to 1,
> so we call the path_selector every time, which means that we'll always
> need the write lock. Bumping up the number of IOs we send before calling
> the path selector again will give this patch a change to do some good
> here.
> 
> To do that you need to set:
> 
> 	rr_min_io_rq <something_bigger_than_one>
> 
> in the defaults section of /etc/multipath.conf and then reload the
> multipathd service.
> 
> The patch should hopefully help in multipath_busy() regardless of the
> the rr_min_io_rq setting.

This patch, while generic, is meant to help the blk-mq case.  A blk-mq
request_queue doesn't have an elevator so the requests will not have
seen merging.

But yes, implied in the patch is the requirement to increase
m->repeat_count via multipathd's rr_min_io_rq (I'll backfill a proper
header once it is tested).

Thanks,
Mike

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: dm-multipath low performance with blk-mq
  2016-01-26 13:29           ` Mike Snitzer
  (?)
@ 2016-01-26 14:01           ` Hannes Reinecke
  2016-01-26 14:47               ` Mike Snitzer
  2016-01-26 15:57             ` Benjamin Marzinski
  -1 siblings, 2 replies; 127+ messages in thread
From: Hannes Reinecke @ 2016-01-26 14:01 UTC (permalink / raw)
  To: dm-devel

On 01/26/2016 02:29 PM, Mike Snitzer wrote:
> On Mon, Jan 25 2016 at  6:37pm -0500,
> Benjamin Marzinski <bmarzins@redhat.com> wrote:
> 
>> On Mon, Jan 25, 2016 at 04:40:16PM -0500, Mike Snitzer wrote:
>>> On Tue, Jan 19 2016 at  5:45P -0500,
>>> Mike Snitzer <snitzer@redhat.com> wrote:
>>
>> I don't think this is going to help __multipath_map() without some
>> configuration changes.  Now that we're running on already merged
>> requests instead of bios, the m->repeat_count is almost always set to 1,
>> so we call the path_selector every time, which means that we'll always
>> need the write lock. Bumping up the number of IOs we send before calling
>> the path selector again will give this patch a change to do some good
>> here.
>>
>> To do that you need to set:
>>
>> 	rr_min_io_rq <something_bigger_than_one>
>>
>> in the defaults section of /etc/multipath.conf and then reload the
>> multipathd service.
>>
>> The patch should hopefully help in multipath_busy() regardless of the
>> the rr_min_io_rq setting.
> 
> This patch, while generic, is meant to help the blk-mq case.  A blk-mq
> request_queue doesn't have an elevator so the requests will not have
> seen merging.
> 
> But yes, implied in the patch is the requirement to increase
> m->repeat_count via multipathd's rr_min_io_rq (I'll backfill a proper
> header once it is tested).
> 
But that would defeat load balancing, would it not?
IE when you want to do load balancing you would constantly change
paths, thereby always taking the write lock.
Which would render the patch pointless.

I was rather wondering if we could expose all active paths as
hardware contexts and let blk-mq do the I/O mapping.
That way we would only have to take the write lock if we have to
choose a new pgpath/priority group ie in the case the active
priority group goes down.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 127+ messages in thread

* dm-multipath low performance with blk-mq
  2016-01-26 14:01           ` Hannes Reinecke
@ 2016-01-26 14:47               ` Mike Snitzer
  2016-01-26 15:57             ` Benjamin Marzinski
  1 sibling, 0 replies; 127+ messages in thread
From: Mike Snitzer @ 2016-01-26 14:47 UTC (permalink / raw)


[you're killing me.. you nuked all CCs again]

On Tue, Jan 26 2016 at  9:01am -0500,
Hannes Reinecke <hare@suse.de> wrote:

> On 01/26/2016 02:29 PM, Mike Snitzer wrote:
> > On Mon, Jan 25 2016 at  6:37pm -0500,
> > Benjamin Marzinski <bmarzins@redhat.com> wrote:
> > 
> >> On Mon, Jan 25, 2016@04:40:16PM -0500, Mike Snitzer wrote:
> >>> On Tue, Jan 19 2016 at  5:45P -0500,
> >>> Mike Snitzer <snitzer@redhat.com> wrote:
> >>
> >> I don't think this is going to help __multipath_map() without some
> >> configuration changes.  Now that we're running on already merged
> >> requests instead of bios, the m->repeat_count is almost always set to 1,
> >> so we call the path_selector every time, which means that we'll always
> >> need the write lock. Bumping up the number of IOs we send before calling
> >> the path selector again will give this patch a change to do some good
> >> here.
> >>
> >> To do that you need to set:
> >>
> >> 	rr_min_io_rq <something_bigger_than_one>
> >>
> >> in the defaults section of /etc/multipath.conf and then reload the
> >> multipathd service.
> >>
> >> The patch should hopefully help in multipath_busy() regardless of the
> >> the rr_min_io_rq setting.
> > 
> > This patch, while generic, is meant to help the blk-mq case.  A blk-mq
> > request_queue doesn't have an elevator so the requests will not have
> > seen merging.
> > 
> > But yes, implied in the patch is the requirement to increase
> > m->repeat_count via multipathd's rr_min_io_rq (I'll backfill a proper
> > header once it is tested).
> > 
> But that would defeat load balancing, would it not?
> IE when you want to do load balancing you would constantly change
> paths, thereby always taking the write lock.
> Which would render the patch pointless.

Increasing m->repeat_count slightly for blk-mq could be beneficial
considering there isn't an elevator.  I do concede the need for finding
the sweet-spot (not too small, not too large so as to starve load
balancing) is less than ideal.  But it needs testing.

This initial m->lock conversion from spinlock_t to rwlock_t is just the
first step in addressing the locking bottlenecks we've not had a need to
look at until now.  Could be the rwlock_t also gets replaced with a more
complex locking model.

More work is possible to make path switching lockless.  Not yet clear
(to me) on how to approach it.  And yes, the work gets incrementally
more challenging (percpu, rcu, whatever... that code is "harder",
especially when refactoring existing code with legacy requirements).
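
As a very rough sketch of one possible direction (illustrative only, not
working code -- reader-side lifetime and the queue_io/pg_init checks are
hand-waved), the fast path could read an RCU-published pgpath and only
fall back to m->lock when a new selection is actually needed:

	/* fast path: no m->lock */
	rcu_read_lock();
	pgpath = rcu_dereference(m->current_pgpath);
	if (likely(pgpath && !m->queue_io && !m->pg_init_required))
		bdev = pgpath->path.dev->bdev;	/* map the clone here */
	else
		pgpath = NULL;			/* force the slow path */
	rcu_read_unlock();

	/* slow path: unchanged, under the write lock */
	if (!pgpath) {
		write_lock_irq(&m->lock);
		__choose_pgpath(m, nr_bytes);	/* would rcu_assign_pointer() its result */
		pgpath = m->current_pgpath;
		write_unlock_irq(&m->lock);
	}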

> I was rather wondering if we could expose all active paths as
> hardware contexts and let blk-mq do the I/O mapping.
> That way we would only have to take the write lock if we have to
> choose a new pgpath/priority group ie in the case the active
> priority group goes down.

Training blk-mq to be multipath aware (priority groups, etc) is an
entirely new tangent that is one rabbit hole after another.

Yeah I know you want to throw away everything.  I'm not holding you back
from doing anything but I've told you I want incremental dm-multipath
improvements until it is clear there is no more room for improvement.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: dm-multipath low performance with blk-mq
@ 2016-01-26 14:47               ` Mike Snitzer
  0 siblings, 0 replies; 127+ messages in thread
From: Mike Snitzer @ 2016-01-26 14:47 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Christoph Hellwig, Sagi Grimberg, linux-nvme, keith.busch,
	dm-devel, Bart Van Assche

[you're killing me.. you nuked all CCs again]

On Tue, Jan 26 2016 at  9:01am -0500,
Hannes Reinecke <hare@suse.de> wrote:

> On 01/26/2016 02:29 PM, Mike Snitzer wrote:
> > On Mon, Jan 25 2016 at  6:37pm -0500,
> > Benjamin Marzinski <bmarzins@redhat.com> wrote:
> > 
> >> On Mon, Jan 25, 2016 at 04:40:16PM -0500, Mike Snitzer wrote:
> >>> On Tue, Jan 19 2016 at  5:45P -0500,
> >>> Mike Snitzer <snitzer@redhat.com> wrote:
> >>
> >> I don't think this is going to help __multipath_map() without some
> >> configuration changes.  Now that we're running on already merged
> >> requests instead of bios, the m->repeat_count is almost always set to 1,
> >> so we call the path_selector every time, which means that we'll always
> >> need the write lock. Bumping up the number of IOs we send before calling
> >> the path selector again will give this patch a change to do some good
> >> here.
> >>
> >> To do that you need to set:
> >>
> >> 	rr_min_io_rq <something_bigger_than_one>
> >>
> >> in the defaults section of /etc/multipath.conf and then reload the
> >> multipathd service.
> >>
> >> The patch should hopefully help in multipath_busy() regardless of the
> >> the rr_min_io_rq setting.
> > 
> > This patch, while generic, is meant to help the blk-mq case.  A blk-mq
> > request_queue doesn't have an elevator so the requests will not have
> > seen merging.
> > 
> > But yes, implied in the patch is the requirement to increase
> > m->repeat_count via multipathd's rr_min_io_rq (I'll backfill a proper
> > header once it is tested).
> > 
> But that would defeat load balancing, would it not?
> IE when you want to do load balancing you would constantly change
> paths, thereby always taking the write lock.
> Which would render the patch pointless.

Increasing m->repeat_count slightly for blk-mq could be beneficial
considering there isn't an elevator.  I do concede the need for finding
the sweet-spot (not too small, not too large so as to starve load
balancing) is less than ideal.  But it needs testing.

This initial m->lock conversion from spinlock_t to rwlock_t is just the
first step in addressing the locking bottlenecks we've not had a need to
look at until now.  Could be the rwlock_t also gets replaced with a more
complex locking model.

More work is possible to make path switching lockless.  Not yet clear
(to me) on how to approach it.  And yes, the work gets incrementally
more challenging (percpu, rcu, whatever... that code is "harder",
especially when refactoring existing code with legacy requirements).

> I was rather wondering if we could expose all active paths as
> hardware contexts and let blk-mq do the I/O mapping.
> That way we would only have to take the write lock if we have to
> choose a new pgpath/priority group ie in the case the active
> priority group goes down.

Training blk-mq to be multipath aware (priority groups, etc) is an
entirely new tangent that is one rabbit hole after another.

Yeah I know you want to throw away everything.  I'm not holding you back
from doing anything but I've told you I want incremental dm-multipath
improvements until it is clear there is no more room for improvement.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* dm-multipath low performance with blk-mq
  2016-01-26 14:47               ` Mike Snitzer
@ 2016-01-26 14:56                 ` Christoph Hellwig
  -1 siblings, 0 replies; 127+ messages in thread
From: Christoph Hellwig @ 2016-01-26 14:56 UTC (permalink / raw)


On Tue, Jan 26, 2016@09:47:13AM -0500, Mike Snitzer wrote:
> [you're killing me.. you nuked all CCs again]

I think that's the reply-to defaults for dm-devel.  They try to do that
to me all the time, but fortunately I configure mutt to not blindly
follow that bad advice but ask me instead.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: dm-multipath low performance with blk-mq
@ 2016-01-26 14:56                 ` Christoph Hellwig
  0 siblings, 0 replies; 127+ messages in thread
From: Christoph Hellwig @ 2016-01-26 14:56 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Christoph Hellwig, Sagi Grimberg, linux-nvme, keith.busch,
	dm-devel, Bart Van Assche

On Tue, Jan 26, 2016 at 09:47:13AM -0500, Mike Snitzer wrote:
> [you're killing me.. you nuked all CCs again]

I think that's the reply-to defaults for dm-devel.  They try to do that
to me all the time, but fortunately I configure mutt to not blindly
follow that bad advice but ask me instead.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* dm-multipath low performance with blk-mq
  2016-01-26 14:56                 ` Christoph Hellwig
@ 2016-01-26 15:27                   ` Mike Snitzer
  -1 siblings, 0 replies; 127+ messages in thread
From: Mike Snitzer @ 2016-01-26 15:27 UTC (permalink / raw)


On Tue, Jan 26 2016 at  9:56am -0500,
Christoph Hellwig <hch@infradead.org> wrote:

> On Tue, Jan 26, 2016@09:47:13AM -0500, Mike Snitzer wrote:
> > [you're killing me.. you nuked all CCs again]
> 
> I think that's the reply-to defaults for dm-devel.  They try to do that
> to me all that time, but fortunately I configure mutt to not blindly
> follow that bad advice but ask me instead.

Right, mutt saves me too.

I'll see about fixing the dm-devel list config.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: dm-multipath low performance with blk-mq
@ 2016-01-26 15:27                   ` Mike Snitzer
  0 siblings, 0 replies; 127+ messages in thread
From: Mike Snitzer @ 2016-01-26 15:27 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Sagi Grimberg, linux-nvme, keith.busch, dm-devel, Bart Van Assche

On Tue, Jan 26 2016 at  9:56am -0500,
Christoph Hellwig <hch@infradead.org> wrote:

> On Tue, Jan 26, 2016 at 09:47:13AM -0500, Mike Snitzer wrote:
> > [you're killing me.. you nuked all CCs again]
> 
> I think that's the reply-to defaults for dm-devel.  They try to do that
> to me all that time, but fortunately I configure mutt to not blindly
> follow that bad advice but ask me instead.

Right, mutt saves me too.

I'll see about fixing the dm-devel list config.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: dm-multipath low performance with blk-mq
  2016-01-26 14:01           ` Hannes Reinecke
  2016-01-26 14:47               ` Mike Snitzer
@ 2016-01-26 15:57             ` Benjamin Marzinski
  1 sibling, 0 replies; 127+ messages in thread
From: Benjamin Marzinski @ 2016-01-26 15:57 UTC (permalink / raw)
  To: device-mapper development

On Tue, Jan 26, 2016 at 03:01:05PM +0100, Hannes Reinecke wrote:
> On 01/26/2016 02:29 PM, Mike Snitzer wrote:
> > On Mon, Jan 25 2016 at  6:37pm -0500,
> > Benjamin Marzinski <bmarzins@redhat.com> wrote:
> > 
> >> On Mon, Jan 25, 2016 at 04:40:16PM -0500, Mike Snitzer wrote:
> >>> On Tue, Jan 19 2016 at  5:45P -0500,
> >>> Mike Snitzer <snitzer@redhat.com> wrote:
> >>
> >> I don't think this is going to help __multipath_map() without some
> >> configuration changes.  Now that we're running on already merged
> >> requests instead of bios, the m->repeat_count is almost always set to 1,
> >> so we call the path_selector every time, which means that we'll always
> >> need the write lock. Bumping up the number of IOs we send before calling
> >> the path selector again will give this patch a change to do some good
> >> here.
> >>
> >> To do that you need to set:
> >>
> >> 	rr_min_io_rq <something_bigger_than_one>
> >>
> >> in the defaults section of /etc/multipath.conf and then reload the
> >> multipathd service.
> >>
> >> The patch should hopefully help in multipath_busy() regardless of the
> >> the rr_min_io_rq setting.
> > 
> > This patch, while generic, is meant to help the blk-mq case.  A blk-mq
> > request_queue doesn't have an elevator so the requests will not have
> > seen merging.
> > 
> > But yes, implied in the patch is the requirement to increase
> > m->repeat_count via multipathd's rr_min_io_rq (I'll backfill a proper
> > header once it is tested).
> > 
> But that would defeat load balancing, would it not?
> IE when you want to do load balancing you would constantly change
> paths, thereby always taking the write lock.
> Which would render the patch pointless.

But putting in a large rr_min_io_rq value will allow us to validate that
the patch does help things, and that there's not another bottleneck hidden
right behind the spinlock.

> I was rather wondering if we could expose all active paths as
> hardware contexts and let blk-mq do the I/O mapping.
> That way we would only have to take the write lock if we have to
> choose a new pgpath/priority group ie in the case the active
> priority group goes down.
> 
> Cheers,
> 
> Hannes
> -- 
> Dr. Hannes Reinecke		   Teamlead Storage & Networking
> hare@suse.de			               +49 911 74053 688
> SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
> GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
> HRB 21284 (AG Nürnberg)
> 
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel

^ permalink raw reply	[flat|nested] 127+ messages in thread

* dm-multipath low performance with blk-mq
  2016-01-25 21:40       ` Mike Snitzer
@ 2016-01-26 16:03         ` Mike Snitzer
  -1 siblings, 0 replies; 127+ messages in thread
From: Mike Snitzer @ 2016-01-26 16:03 UTC (permalink / raw)


On Mon, Jan 25 2016 at  4:40pm -0500,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Tue, Jan 19 2016 at  5:45P -0500,
> Mike Snitzer <snitzer@redhat.com> wrote:
> 
> > On Mon, Jan 18 2016 at  7:04am -0500,
> > Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:
> > 
> > > Hi All,
> > > 
> > > I've recently tried out dm-multipath over a "super-fast" nvme device
> > > and noticed a serious lock contention in dm-multipath that requires some
> > > extra attention. The nvme device is a simple loopback device emulation
> > > backed by null_blk device.
> > > 
> > > With this I've seen dm-multipath pushing around ~470K IOPs while
> > > the native (loopback) nvme performance can easily push up to 1500K+ IOPs.
> > > 
> > > perf output [1] reveals a huge lock contention on the multipath lock
> > > which is a per-dm_target contention point which seem to defeat the
> > > purpose of blk-mq i/O path.
> > > 
> > > The two current bottlenecks seem to come from multipath_busy and
> > > __multipath_map. Would it make better sense to move to a percpu_ref
> > > model with freeze/unfreeze logic for updates similar to what blk-mq
> > > is doing?
> > >
> > > Thoughts?
> > 
> > Your perf output clearly does identify the 'struct multipath' spinlock
> > as a bottleneck.
> > 
> > Is it fair to assume that implied in your test is that you increased
> > md->tag_set.nr_hw_queues to > 1 in dm_init_request_based_blk_mq_queue()?
> > 
> > I'd like to start by replicating your testbed.  So I'll see about
> > setting up the nvme loop driver you referenced in earlier mail.
> > Can you share your fio job file and fio commandline for your test?
> 
> Would still appreciate answers to my 2 questions above (did you modify
> md->tag_set.nr_hw_queues and can you share your fio job?)
> 
> I've yet to reproduce your config (using hch's nvme loop driver) or

Christoph, any chance you could rebase your 'nvme-loop.2' on v4.5-rc1?

Or point me to a branch that is more current...

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: dm-multipath low performance with blk-mq
@ 2016-01-26 16:03         ` Mike Snitzer
  0 siblings, 0 replies; 127+ messages in thread
From: Mike Snitzer @ 2016-01-26 16:03 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: keith.busch, Bart Van Assche, dm-devel, linux-nvme, Sagi Grimberg

On Mon, Jan 25 2016 at  4:40pm -0500,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Tue, Jan 19 2016 at  5:45P -0500,
> Mike Snitzer <snitzer@redhat.com> wrote:
> 
> > On Mon, Jan 18 2016 at  7:04am -0500,
> > Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:
> > 
> > > Hi All,
> > > 
> > > I've recently tried out dm-multipath over a "super-fast" nvme device
> > > and noticed a serious lock contention in dm-multipath that requires some
> > > extra attention. The nvme device is a simple loopback device emulation
> > > backed by null_blk device.
> > > 
> > > With this I've seen dm-multipath pushing around ~470K IOPs while
> > > the native (loopback) nvme performance can easily push up to 1500K+ IOPs.
> > > 
> > > perf output [1] reveals a huge lock contention on the multipath lock
> > > which is a per-dm_target contention point which seem to defeat the
> > > purpose of blk-mq i/O path.
> > > 
> > > The two current bottlenecks seem to come from multipath_busy and
> > > __multipath_map. Would it make better sense to move to a percpu_ref
> > > model with freeze/unfreeze logic for updates similar to what blk-mq
> > > is doing?
> > >
> > > Thoughts?
> > 
> > Your perf output clearly does identify the 'struct multipath' spinlock
> > as a bottleneck.
> > 
> > Is it fair to assume that implied in your test is that you increased
> > md->tag_set.nr_hw_queues to > 1 in dm_init_request_based_blk_mq_queue()?
> > 
> > I'd like to start by replicating your testbed.  So I'll see about
> > setting up the nvme loop driver you referenced in earlier mail.
> > Can you share your fio job file and fio commandline for your test?
> 
> Would still appreciate answers to my 2 questions above (did you modify
> md->tag_set.nr_hw_queues and can you share your fio job?)
> 
> I've yet to reproduce your config (using hch's nvme loop driver) or

Christoph, any chance you could rebase your 'nvme-loop.2' on v4.5-rc1?

Or point me to a branch that is more current...

^ permalink raw reply	[flat|nested] 127+ messages in thread

* dm-multipath low performance with blk-mq
  2016-01-26 16:03         ` Mike Snitzer
@ 2016-01-26 16:44           ` Christoph Hellwig
  -1 siblings, 0 replies; 127+ messages in thread
From: Christoph Hellwig @ 2016-01-26 16:44 UTC (permalink / raw)


On Tue, Jan 26, 2016@11:03:24AM -0500, Mike Snitzer wrote:
> Christoph, any chance you could rebase your 'nvme-loop.2' on v4.5-rc1?

Don't think I have time at the moment, sorry..

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: dm-multipath low performance with blk-mq
@ 2016-01-26 16:44           ` Christoph Hellwig
  0 siblings, 0 replies; 127+ messages in thread
From: Christoph Hellwig @ 2016-01-26 16:44 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: keith.busch, Sagi Grimberg, linux-nvme, Christoph Hellwig,
	dm-devel, Bart Van Assche

On Tue, Jan 26, 2016 at 11:03:24AM -0500, Mike Snitzer wrote:
> Christoph, any chance you could rebase your 'nvme-loop.2' on v4.5-rc1?

Don't think I have time at the moment, sorry..

^ permalink raw reply	[flat|nested] 127+ messages in thread

* [dm-devel] dm-multipath low performance with blk-mq
  2016-01-26 16:03         ` Mike Snitzer
@ 2016-01-26 21:40           ` Benjamin Marzinski
  -1 siblings, 0 replies; 127+ messages in thread
From: Benjamin Marzinski @ 2016-01-26 21:40 UTC (permalink / raw)


On Tue, Jan 26, 2016@11:03:24AM -0500, Mike Snitzer wrote:
> 
> Christoph, any chance you could rebase your 'nvme-loop.2' on v4.5-rc1?
> 
> Or point me to a branch that is more current...

Failing that, you could try using the null_blk device directly. It
doesn't provide enough information for multipathd to set up a device on
it, but you could manually create one with dmsetup. Without multipathd
you won't get failback working, but it should be fine for
performance work.
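
Something along these lines should work (illustrative only -- single path,
device name and the trailing round-robin repeat count are just examples):

	SZ=$(blockdev --getsz /dev/nullb0)
	echo "0 $SZ multipath 0 0 1 1 round-robin 0 1 1 /dev/nullb0 1000" | \
		dmsetup create mpath-test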

-Ben

> 
> --
> dm-devel mailing list
> dm-devel at redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: dm-multipath low performance with blk-mq
@ 2016-01-26 21:40           ` Benjamin Marzinski
  0 siblings, 0 replies; 127+ messages in thread
From: Benjamin Marzinski @ 2016-01-26 21:40 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: keith.busch, Sagi Grimberg, linux-nvme, Christoph Hellwig,
	dm-devel, Bart Van Assche

On Tue, Jan 26, 2016 at 11:03:24AM -0500, Mike Snitzer wrote:
> 
> Christoph, any chance you could rebase your 'nvme-loop.2' on v4.5-rc1?
> 
> Or point me to a branch that is more current...

Failing that, you could try using the null_blk device directly. It
doesn't provide enough information for multipathd to set up a device on
it, but you could manually create one with dmsetup. Without multipathd
you won't get failback working, but it should be fine for
performance work.

-Ben

> 
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel

^ permalink raw reply	[flat|nested] 127+ messages in thread

* dm-multipath low performance with blk-mq
  2016-01-26 16:44           ` Christoph Hellwig
@ 2016-01-27  2:09             ` Mike Snitzer
  -1 siblings, 0 replies; 127+ messages in thread
From: Mike Snitzer @ 2016-01-27  2:09 UTC (permalink / raw)


On Tue, Jan 26 2016 at 11:44am -0500,
Christoph Hellwig <hch@infradead.org> wrote:

> On Tue, Jan 26, 2016@11:03:24AM -0500, Mike Snitzer wrote:
> > Christoph, any chance you could rebase your 'nvme-loop.2' on v4.5-rc1?
> 
> Don't think I have time at the moment, sorry..

No problem, I pulled the latest DM changes on top of 'nvme-loop.2', see:
http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel

I was able to use nvme-loop as documented here:
http://git.infradead.org/users/hch/block.git/commit/47d8d5b1db270463b5bd7b6a68cd89bd8762840d

But I'm not too sure how Sagi used the resulting nvme device with
multipath.. maybe he just manually pushed down a DM multipath table?
And only used the single /dev/nvme0n1 device?  Sagi?

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: dm-multipath low performance with blk-mq
@ 2016-01-27  2:09             ` Mike Snitzer
  0 siblings, 0 replies; 127+ messages in thread
From: Mike Snitzer @ 2016-01-27  2:09 UTC (permalink / raw)
  To: Christoph Hellwig, Sagi Grimberg
  Cc: keith.busch, Bart Van Assche, dm-devel, linux-nvme

On Tue, Jan 26 2016 at 11:44am -0500,
Christoph Hellwig <hch@infradead.org> wrote:

> On Tue, Jan 26, 2016 at 11:03:24AM -0500, Mike Snitzer wrote:
> > Christoph, any chance you could rebase your 'nvme-loop.2' on v4.5-rc1?
> 
> Don't think I have time at the moment, sorry..

No problem, I pulled the latest DM changes on top of 'nvme-loop.2', see:
http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel

I was able to use nvme-loop as documented here:
http://git.infradead.org/users/hch/block.git/commit/47d8d5b1db270463b5bd7b6a68cd89bd8762840d

But I'm not too sure how Sagi used the resulting nvme device with
multipath.. maybe he just manually pushed down a DM multipath table?
And only used the single /dev/nvme0n1 device?  Sagi?

^ permalink raw reply	[flat|nested] 127+ messages in thread

* dm-multipath low performance with blk-mq
  2016-01-27  2:09             ` Mike Snitzer
@ 2016-01-27 11:10               ` Sagi Grimberg
  -1 siblings, 0 replies; 127+ messages in thread
From: Sagi Grimberg @ 2016-01-27 11:10 UTC (permalink / raw)



>> Don't think I have time at the moment, sorry..
>
> No problem, I pulled the latest DM changes ontop of 'nvme-loop.2', see:
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel
>
> I was able to use nvme-loop like was documented here:
> http://git.infradead.org/users/hch/block.git/commit/47d8d5b1db270463b5bd7b6a68cd89bd8762840d
>
> But I'm not too sure how Sagi used the resulting nvme device with
> multipath.. maybe he just manually pushed down a DM multipath table?
> And only used the single /dev/nvme0n1 device?  Sagi?

Hi Mike, sorry for the late response, I'm kinda caught up with other stuff.

So I think you are missing a patch to support multipath that didn't
exist back when Christoph submitted the patchset.

It's a temp hack, and I don't have time to even check that it applies
correctly on nvme-loop.2, but here it is:

--
diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c
index 0931e91..532234b 100644
--- a/drivers/nvme/target/admin-cmd.c
+++ b/drivers/nvme/target/admin-cmd.c
@@ -180,7 +180,8 @@ static void nvmet_execute_identify_ns(struct nvmet_req *req)
          */
         id->nmic = (1 << 0);

-       /* XXX: provide a nguid value! */
+       /* XXX: provide a real nguid value! */
+       memcpy(&id->nguid, &ns->nguid, sizeof(uuid_le));

         id->lbaf[0].ds = ns->blksize_shift;

diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index e2a8893..a99343a 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -145,6 +145,9 @@ struct nvmet_ns *nvmet_ns_alloc(struct nvmet_subsys *subsys, u32 nsid)
         ns->nsid = nsid;
         ns->subsys = subsys;

+       /* XXX: Hacking nguids with uuid  */
+       uuid_le_gen(&ns->nguid);
+
         return ns;

  free_ns:
@@ -436,6 +439,7 @@ struct nvmet_subsys *nvmet_subsys_alloc(const char *subsys_name)
         if (!subsys)
                 return NULL;

+       subsys->ver = NVME_VS(1, 2);
         subsys->subsys_name = kstrndup(subsys_name, NVMF_NQN_SIZE,
                         GFP_KERNEL);
         if (IS_ERR(subsys->subsys_name)) {
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 111aa5e..8f68c5f 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -19,6 +19,7 @@ struct nvmet_ns {
         u32                     nsid;
         u32                     blksize_shift;
         loff_t                  size;
+       uuid_le                 nguid;

         struct nvmet_subsys     *subsys;
         const char              *device_path;
--

^ permalink raw reply related	[flat|nested] 127+ messages in thread

* Re: dm-multipath low performance with blk-mq
@ 2016-01-27 11:10               ` Sagi Grimberg
  0 siblings, 0 replies; 127+ messages in thread
From: Sagi Grimberg @ 2016-01-27 11:10 UTC (permalink / raw)
  To: Mike Snitzer, Christoph Hellwig
  Cc: keith.busch, Bart Van Assche, dm-devel, linux-nvme


>> Don't think I have time at the moment, sorry..
>
> No problem, I pulled the latest DM changes ontop of 'nvme-loop.2', see:
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel
>
> I was able to use nvme-loop like was documented here:
> http://git.infradead.org/users/hch/block.git/commit/47d8d5b1db270463b5bd7b6a68cd89bd8762840d
>
> But I'm not too sure how Sagi used the resulting nvme device with
> multipath.. maybe he just manually pushed down a DM multipath table?
> And only used the single /dev/nvme0n1 device?  Sagi?

Hi Mike, sorry for the late response, I'm kinda caught up with other stuff.

So I think you are missing a patch to support multipath that didn't
exist back when Christoph submitted the patchset.

It's a temp hack, and I don't have time to even check that it applies
correctly on nvme-loop.2, but here it is:

--
diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c
index 0931e91..532234b 100644
--- a/drivers/nvme/target/admin-cmd.c
+++ b/drivers/nvme/target/admin-cmd.c
@@ -180,7 +180,8 @@ static void nvmet_execute_identify_ns(struct nvmet_req *req)
          */
         id->nmic = (1 << 0);

-       /* XXX: provide a nguid value! */
+       /* XXX: provide a real nguid value! */
+       memcpy(&id->nguid, &ns->nguid, sizeof(uuid_le));

         id->lbaf[0].ds = ns->blksize_shift;

diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index e2a8893..a99343a 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -145,6 +145,9 @@ struct nvmet_ns *nvmet_ns_alloc(struct nvmet_subsys *subsys, u32 nsid)
         ns->nsid = nsid;
         ns->subsys = subsys;

+       /* XXX: Hacking nguids with uuid  */
+       uuid_le_gen(&ns->nguid);
+
         return ns;

  free_ns:
@@ -436,6 +439,7 @@ struct nvmet_subsys *nvmet_subsys_alloc(const char *subsys_name)
         if (!subsys)
                 return NULL;

+       subsys->ver = NVME_VS(1, 2);
         subsys->subsys_name = kstrndup(subsys_name, NVMF_NQN_SIZE,
                         GFP_KERNEL);
         if (IS_ERR(subsys->subsys_name)) {
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 111aa5e..8f68c5f 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -19,6 +19,7 @@ struct nvmet_ns {
         u32                     nsid;
         u32                     blksize_shift;
         loff_t                  size;
+       uuid_le                 nguid;

         struct nvmet_subsys     *subsys;
         const char              *device_path;
--

^ permalink raw reply related	[flat|nested] 127+ messages in thread

* dm-multipath low performance with blk-mq
  2016-01-26 13:29           ` Mike Snitzer
@ 2016-01-27 11:14             ` Sagi Grimberg
  -1 siblings, 0 replies; 127+ messages in thread
From: Sagi Grimberg @ 2016-01-27 11:14 UTC (permalink / raw)



>> I don't think this is going to help __multipath_map() without some
>> configuration changes.  Now that we're running on already merged
>> requests instead of bios, the m->repeat_count is almost always set to 1,
>> so we call the path_selector every time, which means that we'll always
>> need the write lock. Bumping up the number of IOs we send before calling
>> the path selector again will give this patch a change to do some good
>> here.
>>
>> To do that you need to set:
>>
>> 	rr_min_io_rq <something_bigger_than_one>
>>
>> in the defaults section of /etc/multipath.conf and then reload the
>> multipathd service.
>>
>> The patch should hopefully help in multipath_busy() regardless of the
>> the rr_min_io_rq setting.
>
> This patch, while generic, is meant to help the blk-mq case.  A blk-mq
> request_queue doesn't have an elevator so the requests will not have
> seen merging.
>
> But yes, implied in the patch is the requirement to increase
> m->repeat_count via multipathd's rr_min_io_rq (I'll backfill a proper
> header once it is tested).

I'll test it once I get some spare time (hopefully soon...)

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: dm-multipath low performance with blk-mq
@ 2016-01-27 11:14             ` Sagi Grimberg
  0 siblings, 0 replies; 127+ messages in thread
From: Sagi Grimberg @ 2016-01-27 11:14 UTC (permalink / raw)
  To: Mike Snitzer, Benjamin Marzinski
  Cc: Christoph Hellwig, keith.busch, device-mapper development,
	linux-nvme, Bart Van Assche


>> I don't think this is going to help __multipath_map() without some
>> configuration changes.  Now that we're running on already merged
>> requests instead of bios, the m->repeat_count is almost always set to 1,
>> so we call the path_selector every time, which means that we'll always
>> need the write lock. Bumping up the number of IOs we send before calling
>> the path selector again will give this patch a change to do some good
>> here.
>>
>> To do that you need to set:
>>
>> 	rr_min_io_rq <something_bigger_than_one>
>>
>> in the defaults section of /etc/multipath.conf and then reload the
>> multipathd service.
>>
>> The patch should hopefully help in multipath_busy() regardless of the
>> the rr_min_io_rq setting.
>
> This patch, while generic, is meant to help the blk-mq case.  A blk-mq
> request_queue doesn't have an elevator so the requests will not have
> seen merging.
>
> But yes, implied in the patch is the requirement to increase
> m->repeat_count via multipathd's rr_min_io_rq (I'll backfill a proper
> header once it is tested).

I'll test it once I get some spare time (hopefully soon...)

^ permalink raw reply	[flat|nested] 127+ messages in thread

* dm-multipath low performance with blk-mq
  2016-01-27 11:14             ` Sagi Grimberg
@ 2016-01-27 17:48               ` Mike Snitzer
  -1 siblings, 0 replies; 127+ messages in thread
From: Mike Snitzer @ 2016-01-27 17:48 UTC (permalink / raw)


On Wed, Jan 27 2016 at  6:14am -0500,
Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:

> 
> >>I don't think this is going to help __multipath_map() without some
> >>configuration changes.  Now that we're running on already merged
> >>requests instead of bios, the m->repeat_count is almost always set to 1,
> >>so we call the path_selector every time, which means that we'll always
> >>need the write lock. Bumping up the number of IOs we send before calling
> >>the path selector again will give this patch a change to do some good
> >>here.
> >>
> >>To do that you need to set:
> >>
> >>	rr_min_io_rq <something_bigger_than_one>
> >>
> >>in the defaults section of /etc/multipath.conf and then reload the
> >>multipathd service.
> >>
> >>The patch should hopefully help in multipath_busy() regardless of the
> >>the rr_min_io_rq setting.
> >
> >This patch, while generic, is meant to help the blk-mq case.  A blk-mq
> >request_queue doesn't have an elevator so the requests will not have
> >seen merging.
> >
> >But yes, implied in the patch is the requirement to increase
> >m->repeat_count via multipathd's rr_min_io_rq (I'll backfill a proper
> >header once it is tested).
> 
> I'll test it once I get some spare time (hopefully soon...)

OK thanks.

BTW, I _cannot_ get null_blk to come even close to your reported 1500K+
IOPs on 2 "fast" systems I have access to.  Which arguments are you
loading the null_blk module with?

I've been using:
modprobe null_blk gb=4 bs=4096 nr_devices=1 queue_mode=2 submit_queues=12

One system I have is a 12 core single socket, single NUMA node with 12G of
memory; on it I can only get ~500K read IOPs and ~85K write IOPs.

On another much larger system with 72 cores and 4 NUMA nodes with 128G
of memory, I can only get ~310K read IOPs and ~175K write IOPs.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: dm-multipath low performance with blk-mq
@ 2016-01-27 17:48               ` Mike Snitzer
  0 siblings, 0 replies; 127+ messages in thread
From: Mike Snitzer @ 2016-01-27 17:48 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: axboe, keith.busch, linux-nvme, Christoph Hellwig,
	device-mapper development, Bart Van Assche

On Wed, Jan 27 2016 at  6:14am -0500,
Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:

> 
> >>I don't think this is going to help __multipath_map() without some
> >>configuration changes.  Now that we're running on already merged
> >>requests instead of bios, the m->repeat_count is almost always set to 1,
> >>so we call the path_selector every time, which means that we'll always
> >>need the write lock. Bumping up the number of IOs we send before calling
> >>the path selector again will give this patch a change to do some good
> >>here.
> >>
> >>To do that you need to set:
> >>
> >>	rr_min_io_rq <something_bigger_than_one>
> >>
> >>in the defaults section of /etc/multipath.conf and then reload the
> >>multipathd service.
> >>
> >>The patch should hopefully help in multipath_busy() regardless of the
> >>the rr_min_io_rq setting.
> >
> >This patch, while generic, is meant to help the blk-mq case.  A blk-mq
> >request_queue doesn't have an elevator so the requests will not have
> >seen merging.
> >
> >But yes, implied in the patch is the requirement to increase
> >m->repeat_count via multipathd's rr_min_io_rq (I'll backfill a proper
> >header once it is tested).
> 
> I'll test it once I get some spare time (hopefully soon...)

OK thanks.

BTW, I _cannot_ get null_blk to come even close to your reported 1500K+
IOPs on 2 "fast" systems I have access to.  Which arguments are you
loading the null_blk module with?

I've been using:
modprobe null_blk gb=4 bs=4096 nr_devices=1 queue_mode=2 submit_queues=12

One system I have is a 12 core single socket, single NUMA node with 12G of
memory; on it I can only get ~500K read IOPs and ~85K write IOPs.

On another much larger system with 72 cores and 4 NUMA nodes with 128G
of memory, I can only get ~310K read IOPs and ~175K write IOPs.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* dm-multipath low performance with blk-mq
  2016-01-27 17:48               ` Mike Snitzer
@ 2016-01-27 17:51                 ` Jens Axboe
  -1 siblings, 0 replies; 127+ messages in thread
From: Jens Axboe @ 2016-01-27 17:51 UTC (permalink / raw)


On 01/27/2016 10:48 AM, Mike Snitzer wrote:
> On Wed, Jan 27 2016 at  6:14am -0500,
> Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:
>
>>
>>>> I don't think this is going to help __multipath_map() without some
>>>> configuration changes.  Now that we're running on already merged
>>>> requests instead of bios, the m->repeat_count is almost always set to 1,
>>>> so we call the path_selector every time, which means that we'll always
>>>> need the write lock. Bumping up the number of IOs we send before calling
>>>> the path selector again will give this patch a change to do some good
>>>> here.
>>>>
>>>> To do that you need to set:
>>>>
>>>> 	rr_min_io_rq <something_bigger_than_one>
>>>>
>>>> in the defaults section of /etc/multipath.conf and then reload the
>>>> multipathd service.
>>>>
>>>> The patch should hopefully help in multipath_busy() regardless of the
>>>> the rr_min_io_rq setting.
>>>
>>> This patch, while generic, is meant to help the blk-mq case.  A blk-mq
>>> request_queue doesn't have an elevator so the requests will not have
>>> seen merging.
>>>
>>> But yes, implied in the patch is the requirement to increase
>>> m->repeat_count via multipathd's rr_min_io_rq (I'll backfill a proper
>>> header once it is tested).
>>
>> I'll test it once I get some spare time (hopefully soon...)
>
> OK thanks.
>
> BTW, I _cannot_ get null_blk to come even close to your reported 1500K+
> IOPs on 2 "fast" systems I have access to.  Which arguments are you
> loading the null_blk module with?
>
> I've been using:
> modprobe null_blk gb=4 bs=4096 nr_devices=1 queue_mode=2 submit_queues=12
>
> On my 1 system is a 12 core single socket, single NUMA node with 12G of
> memory, I can only get ~500K read IOPs and ~85K write IOPs.
>
> On another much larger system with 72 cores and 4 NUMA nodes with 128G
> of memory, I can only get ~310K read IOPs and ~175K write IOPs.

Look at the completion method (irqmode) and completion time 
(completion_nsec).
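
For example (values illustrative), something like

	modprobe null_blk gb=4 bs=4096 nr_devices=1 queue_mode=2 \
		submit_queues=12 irqmode=0 completion_nsec=0

completes requests inline in the submission context rather than via
softirq or a timer (completion_nsec only matters for irqmode=2).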

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: dm-multipath low performance with blk-mq
@ 2016-01-27 17:51                 ` Jens Axboe
  0 siblings, 0 replies; 127+ messages in thread
From: Jens Axboe @ 2016-01-27 17:51 UTC (permalink / raw)
  To: Mike Snitzer, Sagi Grimberg
  Cc: keith.busch, linux-nvme, Christoph Hellwig,
	device-mapper development, Bart Van Assche

On 01/27/2016 10:48 AM, Mike Snitzer wrote:
> On Wed, Jan 27 2016 at  6:14am -0500,
> Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:
>
>>
>>>> I don't think this is going to help __multipath_map() without some
>>>> configuration changes.  Now that we're running on already merged
>>>> requests instead of bios, the m->repeat_count is almost always set to 1,
>>>> so we call the path_selector every time, which means that we'll always
>>>> need the write lock. Bumping up the number of IOs we send before calling
>>>> the path selector again will give this patch a chance to do some good
>>>> here.
>>>>
>>>> To do that you need to set:
>>>>
>>>> 	rr_min_io_rq <something_bigger_than_one>
>>>>
>>>> in the defaults section of /etc/multipath.conf and then reload the
>>>> multipathd service.
>>>>
>>>> The patch should hopefully help in multipath_busy() regardless of
>>>> the rr_min_io_rq setting.
>>>
>>> This patch, while generic, is meant to help the blk-mq case.  A blk-mq
>>> request_queue doesn't have an elevator so the requests will not have
>>> seen merging.
>>>
>>> But yes, implied in the patch is the requirement to increase
>>> m->repeat_count via multipathd's rr_min_io_rq (I'll backfill a proper
>>> header once it is tested).
>>
>> I'll test it once I get some spare time (hopefully soon...)
>
> OK thanks.
>
> BTW, I _cannot_ get null_blk to come even close to your reported 1500K+
> IOPs on 2 "fast" systems I have access to.  Which arguments are you
> loading the null_blk module with?
>
> I've been using:
> modprobe null_blk gb=4 bs=4096 nr_devices=1 queue_mode=2 submit_queues=12
>
> One of my systems is a 12 core single socket, single NUMA node box with
> 12G of memory; on it I can only get ~500K read IOPs and ~85K write IOPs.
>
> On another much larger system with 72 cores and 4 NUMA nodes with 128G
> of memory, I can only get ~310K read IOPs and ~175K write IOPs.

Look at the completion method (irqmode) and completion time 
(completion_nsec).

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: dm-multipath low performance with blk-mq
@ 2016-01-27 17:56                 ` Sagi Grimberg
  0 siblings, 0 replies; 127+ messages in thread
From: Sagi Grimberg @ 2016-01-27 17:56 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: axboe, keith.busch, linux-nvme, Christoph Hellwig,
	device-mapper development, Bart Van Assche



On 27/01/2016 19:48, Mike Snitzer wrote:
> On Wed, Jan 27 2016 at  6:14am -0500,
> Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:
>
>>
>>>> I don't think this is going to help __multipath_map() without some
>>>> configuration changes.  Now that we're running on already merged
>>>> requests instead of bios, the m->repeat_count is almost always set to 1,
>>>> so we call the path_selector every time, which means that we'll always
>>>> need the write lock. Bumping up the number of IOs we send before calling
>>>> the path selector again will give this patch a chance to do some good
>>>> here.
>>>>
>>>> To do that you need to set:
>>>>
>>>> 	rr_min_io_rq <something_bigger_than_one>
>>>>
>>>> in the defaults section of /etc/multipath.conf and then reload the
>>>> multipathd service.
>>>>
>>>> The patch should hopefully help in multipath_busy() regardless of
>>>> the rr_min_io_rq setting.
>>>
>>> This patch, while generic, is meant to help the blk-mq case.  A blk-mq
>>> request_queue doesn't have an elevator so the requests will not have
>>> seen merging.
>>>
>>> But yes, implied in the patch is the requirement to increase
>>> m->repeat_count via multipathd's rr_min_io_rq (I'll backfill a proper
>>> header once it is tested).
>>
>> I'll test it once I get some spare time (hopefully soon...)
>
> OK thanks.
>
> BTW, I _cannot_ get null_blk to come even close to your reported 1500K+
> IOPs on 2 "fast" systems I have access to.  Which arguments are you
> loading the null_blk module with?
>
> I've been using:
> modprobe null_blk gb=4 bs=4096 nr_devices=1 queue_mode=2 submit_queues=12

$ for f in /sys/module/null_blk/parameters/*; do echo $f; cat $f; done
/sys/module/null_blk/parameters/bs
512
/sys/module/null_blk/parameters/completion_nsec
10000
/sys/module/null_blk/parameters/gb
250
/sys/module/null_blk/parameters/home_node
-1
/sys/module/null_blk/parameters/hw_queue_depth
64
/sys/module/null_blk/parameters/irqmode
1
/sys/module/null_blk/parameters/nr_devices
2
/sys/module/null_blk/parameters/queue_mode
2
/sys/module/null_blk/parameters/submit_queues
24
/sys/module/null_blk/parameters/use_lightnvm
N
/sys/module/null_blk/parameters/use_per_node_hctx
N

$ fio --group_reporting --rw=randread --bs=4k --numjobs=24 --iodepth=32 
--runtime=99999999 --time_based --loops=1 --ioengine=libaio --direct=1 
--invalidate=1 --randrepeat=1 --norandommap --exitall --name task_nullb0 
--filename=/dev/nullb0
task_nullb0: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, 
iodepth=32
...
fio-2.1.10
Starting 24 processes
Jobs: 24 (f=24): [rrrrrrrrrrrrrrrrrrrrrrrr] [0.0% done] [7234MB/0KB/0KB 
/s] [1852K/0/0 iops] [eta 1157d:09h:46m:22s]

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: dm-multipath low performance with blk-mq
@ 2016-01-27 18:16                   ` Mike Snitzer
  0 siblings, 0 replies; 127+ messages in thread
From: Mike Snitzer @ 2016-01-27 18:16 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Sagi Grimberg, linux-nvme, keith.busch,
	device-mapper development, Bart Van Assche

On Wed, Jan 27 2016 at 12:51pm -0500,
Jens Axboe <axboe@kernel.dk> wrote:

> On 01/27/2016 10:48 AM, Mike Snitzer wrote:
> >
> >BTW, I _cannot_ get null_blk to come even close to your reported 1500K+
> >IOPs on 2 "fast" systems I have access to.  Which arguments are you
> >loading the null_blk module with?
> >
> >I've been using:
> >modprobe null_blk gb=4 bs=4096 nr_devices=1 queue_mode=2 submit_queues=12
> >
> >One of my systems is a 12 core single socket, single NUMA node box with
> >12G of memory; on it I can only get ~500K read IOPs and ~85K write IOPs.
> >
> >On another much larger system with 72 cores and 4 NUMA nodes with 128G
> >of memory, I can only get ~310K read IOPs and ~175K write IOPs.
> 
> Look at the completion method (irqmode) and completion time
> (completion_nsec).

OK, I found that queue_mode=0 (bio-based) is _much_ faster than blk-mq
(2, the default).  Improving to ~950K read IOPs and ~675K write IOPs (on
the single numa node system).

Default for irqmode is 1 (softirq).  2 (timer) yields poor results.  0
(none) seems slightly slower than 1.

And if I use completion_nsec=1 I can bump up to ~990K read IOPs.

Seems the best, for IOPs, so far on this system is with:
modprobe null_blk gb=4 bs=4096 nr_devices=1 queue_mode=0 irqmode=1 completion_nsec=1 submit_queues=4

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: dm-multipath low performance with blk-mq
@ 2016-01-27 18:26                     ` Jens Axboe
  0 siblings, 0 replies; 127+ messages in thread
From: Jens Axboe @ 2016-01-27 18:26 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Christoph Hellwig, Sagi Grimberg, linux-nvme, keith.busch,
	device-mapper development, Bart Van Assche

On 01/27/2016 11:16 AM, Mike Snitzer wrote:
> On Wed, Jan 27 2016 at 12:51pm -0500,
> Jens Axboe <axboe@kernel.dk> wrote:
>
>> On 01/27/2016 10:48 AM, Mike Snitzer wrote:
>>>
>>> BTW, I _cannot_ get null_blk to come even close to your reported 1500K+
>>> IOPs on 2 "fast" systems I have access to.  Which arguments are you
>>> loading the null_blk module with?
>>>
>>> I've been using:
>>> modprobe null_blk gb=4 bs=4096 nr_devices=1 queue_mode=2 submit_queues=12
>>>
>>> One of my systems is a 12 core single socket, single NUMA node box with
>>> 12G of memory; on it I can only get ~500K read IOPs and ~85K write IOPs.
>>>
>>> On another much larger system with 72 cores and 4 NUMA nodes with 128G
>>> of memory, I can only get ~310K read IOPs and ~175K write IOPs.
>>
>> Look at the completion method (irqmode) and completion time
>> (completion_nsec).
>
> OK, I found that queue_mode=0 (bio-based) is _much_ faster than blk-mq
> (2, the default).  Improving to ~950K read IOPs and ~675K write IOPs (on
> the single numa node system).
>
> Default for irqmode is 1 (softirq).  2 (timer) yields poor results.  0
> (none) seems slightly slower than 1.
>
> And if I use completion_nsec=1 I can bump up to ~990K read IOPs.
>
> Seems the best, for IOPs, so far on this system is with:
> modprobe null_blk gb=4 bs=4096 nr_devices=1 queue_mode=0 irqmode=1 completion_nsec=1 submit_queues=4

That sounds a bit odd. queue_mode=0 will always be a bit faster, 
depending on how many threads, etc. But from 310K to 950K, that sounds 
very suspicious.

And 500K/85K read/write is very low. Just did a quickie on a 2 node box 
I have here, single thread performance with queue_mode=2 is around 
500K/500K read/write.
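
A single-job variant of the fio command shared earlier is enough to
reproduce that kind of quickie (a sketch; the iodepth and runtime here
are arbitrary, not what Jens used):

fio --name=single --filename=/dev/nullb0 --rw=randread --bs=4k \
    --numjobs=1 --iodepth=32 --ioengine=libaio --direct=1 \
    --time_based --runtime=30 --group_reporting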

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: dm-multipath low performance with blk-mq
@ 2016-01-27 18:42                   ` Mike Snitzer
  0 siblings, 0 replies; 127+ messages in thread
From: Mike Snitzer @ 2016-01-27 18:42 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: axboe, Christoph Hellwig, linux-nvme, keith.busch,
	device-mapper development, Bart Van Assche

On Wed, Jan 27 2016 at 12:56pm -0500,
Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:

> 
> 
> On 27/01/2016 19:48, Mike Snitzer wrote:
> >
> >BTW, I _cannot_ get null_blk to come even close to your reported 1500K+
> >IOPs on 2 "fast" systems I have access to.  Which arguments are you
> >loading the null_blk module with?
> >
> >I've been using:
> >modprobe null_blk gb=4 bs=4096 nr_devices=1 queue_mode=2 submit_queues=12
> 
> $ for f in /sys/module/null_blk/parameters/*; do echo $f; cat $f; done
> /sys/module/null_blk/parameters/bs
> 512
> /sys/module/null_blk/parameters/completion_nsec
> 10000
> /sys/module/null_blk/parameters/gb
> 250
> /sys/module/null_blk/parameters/home_node
> -1
> /sys/module/null_blk/parameters/hw_queue_depth
> 64
> /sys/module/null_blk/parameters/irqmode
> 1
> /sys/module/null_blk/parameters/nr_devices
> 2
> /sys/module/null_blk/parameters/queue_mode
> 2
> /sys/module/null_blk/parameters/submit_queues
> 24
> /sys/module/null_blk/parameters/use_lightnvm
> N
> /sys/module/null_blk/parameters/use_per_node_hctx
> N
> 
> $ fio --group_reporting --rw=randread --bs=4k --numjobs=24
> --iodepth=32 --runtime=99999999 --time_based --loops=1
> --ioengine=libaio --direct=1 --invalidate=1 --randrepeat=1
> --norandommap --exitall --name task_nullb0 --filename=/dev/nullb0
> task_nullb0: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
> ioengine=libaio, iodepth=32
> ...
> fio-2.1.10
> Starting 24 processes
> Jobs: 24 (f=24): [rrrrrrrrrrrrrrrrrrrrrrrr] [0.0% done]
> [7234MB/0KB/0KB /s] [1852K/0/0 iops] [eta 1157d:09h:46m:22s]

Thanks, the number of fio threads was pretty important.  I'm still
seeing better IOPs with queue_mode=0 (bio-based).

Jobs: 24 (f=24): [r(24)] [11.7% done] [11073MB/0KB/0KB /s] [2835K/0/0 iops] [eta 14m:42s]

(with queue_mode=2 I get ~1930K IOPs.. which I need to use to stack
request-based DM multipath on top)

Now I can focus on why dm-multipath is slow...
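
(For anyone wanting to reproduce the stacked configuration: a
single-path multipath target can be layered directly on nullb0 with
dmsetup, roughly as below.  This is a sketch based on the documented
multipath table format; the trailing 100 is the per-path repeat_count,
i.e. the same knob that rr_min_io_rq tunes through multipathd.)

SIZE=$(blockdev --getsz /dev/nullb0)
# 0 features, 0 hw-handler args, 1 path group (load pg 1 first),
# round-robin with 0 selector args, 1 path with 1 path arg (repeat_count)
echo "0 $SIZE multipath 0 0 1 1 round-robin 0 1 1 /dev/nullb0 100" | \
        dmsetup create mpath_nullb
dmsetup table mpath_nullb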

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: dm-multipath low performance with blk-mq
@ 2016-01-27 19:14                       ` Mike Snitzer
  0 siblings, 0 replies; 127+ messages in thread
From: Mike Snitzer @ 2016-01-27 19:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: keith.busch, Sagi Grimberg, linux-nvme, Christoph Hellwig,
	device-mapper development, Bart Van Assche

On Wed, Jan 27 2016 at  1:26pm -0500,
Jens Axboe <axboe@kernel.dk> wrote:

> On 01/27/2016 11:16 AM, Mike Snitzer wrote:
> >On Wed, Jan 27 2016 at 12:51pm -0500,
> >Jens Axboe <axboe@kernel.dk> wrote:
> >
> >>On 01/27/2016 10:48 AM, Mike Snitzer wrote:
> >>>
> >>>BTW, I _cannot_ get null_blk to come even close to your reported 1500K+
> >>>IOPs on 2 "fast" systems I have access to.  Which arguments are you
> >>>loading the null_blk module with?
> >>>
> >>>I've been using:
> >>>modprobe null_blk gb=4 bs=4096 nr_devices=1 queue_mode=2 submit_queues=12
> >>>
> >>>One of my systems is a 12 core single socket, single NUMA node box with
> >>>12G of memory; on it I can only get ~500K read IOPs and ~85K write IOPs.
> >>>
> >>>On another much larger system with 72 cores and 4 NUMA nodes with 128G
> >>>of memory, I can only get ~310K read IOPs and ~175K write IOPs.
> >>
> >>Look at the completion method (irqmode) and completion time
> >>(completion_nsec).
> >
> >OK, I found that queue_mode=0 (bio-based) is _much_ faster than blk-mq
> >(2, the default).  Improving to ~950K read IOPs and ~675K write IOPs (on
> >the single numa node system).
> >
> >Default for irqmode is 1 (softirq).  2 (timer) yields poor results.  0
> >(none) seems slightly slower than 1.
> >
> >And if I use completion_nsec=1 I can bump up to ~990K read IOPs.
> >
> >Seems the best, for IOPs, so far on this system is with:
> >modprobe null_blk gb=4 bs=4096 nr_devices=1 queue_mode=0 irqmode=1 completion_nsec=1 submit_queues=4
> 
> That sounds a bit odd. queue_mode=0 will always be a bit faster,
> depending on how many threads, etc. But from 310K to 950K, that
> sounds very suspicious.

I definitely see much better results with bio-based over blk-mq.
For the multithreaded fio job Sagi shared I'm seeing bio-based ~2835K vs
blk-mq ~1950K (read IOPs).

> And 500K/85K read/write is very low. Just did a quickie on a 2 node
> box I have here, single thread performance with queue_mode=2 is
> around 500K/500K read/write.

Yeah, I was using some crap ioping test to get those 500K/85K results.
I've now switched to the fio test Sagi shared and am seeing:

~1950K IOPs with blk-mq nullb0, only ~310K with .request_fn dm-mpath
on top, and ~955K with blk-mq dm-mpath on top -- all read IOPs.

So at least now I can get my eye back on the prize of improving blk-mq
dm-multipath!
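
For the record, switching request-based DM between the .request_fn and
blk-mq I/O paths for this comparison has to happen before the multipath
table is loaded; on a tree of this vintage the knob is dm_mod's
use_blk_mq module parameter (a sketch, assuming the parameter exists
and is writable in the kernel under test):

# request-based DM on blk-mq
modprobe dm_mod use_blk_mq=Y
# or, if dm_mod is already loaded and the parameter was built writable:
echo Y > /sys/module/dm_mod/parameters/use_blk_mq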

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: dm-multipath low performance with blk-mq
@ 2016-01-27 19:49                     ` Jens Axboe
  0 siblings, 0 replies; 127+ messages in thread
From: Jens Axboe @ 2016-01-27 19:49 UTC (permalink / raw)
  To: Mike Snitzer, Sagi Grimberg
  Cc: keith.busch, Christoph Hellwig, device-mapper development,
	linux-nvme, Bart Van Assche

On 01/27/2016 11:42 AM, Mike Snitzer wrote:
> On Wed, Jan 27 2016 at 12:56pm -0500,
> Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:
>
>>
>>
>> On 27/01/2016 19:48, Mike Snitzer wrote:
>>>
>>> BTW, I _cannot_ get null_blk to come even close to your reported 1500K+
>>> IOPs on 2 "fast" systems I have access to.  Which arguments are you
>>> loading the null_blk module with?
>>>
>>> I've been using:
>>> modprobe null_blk gb=4 bs=4096 nr_devices=1 queue_mode=2 submit_queues=12
>>
>> $ for f in /sys/module/null_blk/parameters/*; do echo $f; cat $f; done
>> /sys/module/null_blk/parameters/bs
>> 512
>> /sys/module/null_blk/parameters/completion_nsec
>> 10000
>> /sys/module/null_blk/parameters/gb
>> 250
>> /sys/module/null_blk/parameters/home_node
>> -1
>> /sys/module/null_blk/parameters/hw_queue_depth
>> 64
>> /sys/module/null_blk/parameters/irqmode
>> 1
>> /sys/module/null_blk/parameters/nr_devices
>> 2
>> /sys/module/null_blk/parameters/queue_mode
>> 2
>> /sys/module/null_blk/parameters/submit_queues
>> 24
>> /sys/module/null_blk/parameters/use_lightnvm
>> N
>> /sys/module/null_blk/parameters/use_per_node_hctx
>> N
>>
>> $ fio --group_reporting --rw=randread --bs=4k --numjobs=24
>> --iodepth=32 --runtime=99999999 --time_based --loops=1
>> --ioengine=libaio --direct=1 --invalidate=1 --randrepeat=1
>> --norandommap --exitall --name task_nullb0 --filename=/dev/nullb0
>> task_nullb0: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
>> ioengine=libaio, iodepth=32
>> ...
>> fio-2.1.10
>> Starting 24 processes
>> Jobs: 24 (f=24): [rrrrrrrrrrrrrrrrrrrrrrrr] [0.0% done]
>> [7234MB/0KB/0KB /s] [1852K/0/0 iops] [eta 1157d:09h:46m:22s]
>
> Thanks, the number of fio threads was pretty important.  I'm still
> seeing better IOPs with queue_mode=0 (bio-based).
>
> Jobs: 24 (f=24): [r(24)] [11.7% done] [11073MB/0KB/0KB /s] [2835K/0/0 iops] [eta 14m:42s]
>
> (with queue_mode=2 I get ~1930K IOPs.. which I need to use to stack
> request-based DM multipath on top)
>
> Now I can focus on why dm-multipath is slow...

queue_mode=0 doesn't do a whole lot, once you add the required bits for 
a normal driver, that will eat up a bit of overhead. The point was to 
retain scaling, and to avoid drivers having to build support for the 
required functionality from scratch. If we do that once and 
fast/correct, then all mq drivers get it.

That said, your jump is a big one. Some of that is support 
functionality, and some of it (I bet) is just doing io stats. If you 
disable io stats with queue_mode=2, the performance will most likely 
increase. Other things we just can't disable or don't want to disable, 
if we are going to keep this as an indication of what a real driver 
could do through the mq stack.

Now, if queue_mode=1 is faster, then there's certainly an issue!
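
The io stats accounting mentioned above can be toggled from sysfs to
measure its cost (a sketch; nullb0 is the device name used elsewhere in
this thread):

cat /sys/block/nullb0/queue/iostats      # 1 = accounting enabled
echo 0 > /sys/block/nullb0/queue/iostats
# re-run the fio job and compare, then re-enable:
echo 1 > /sys/block/nullb0/queue/iostats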

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: dm-multipath low performance with blk-mq
@ 2016-01-27 19:50                         ` Jens Axboe
  0 siblings, 0 replies; 127+ messages in thread
From: Jens Axboe @ 2016-01-27 19:50 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: keith.busch, Sagi Grimberg, linux-nvme, Christoph Hellwig,
	device-mapper development, Bart Van Assche

On 01/27/2016 12:14 PM, Mike Snitzer wrote:
> Yeah, I was using some crap ioping test to get those 500K/85K results.
> I've now switched to the fio test Sagi shared [...]

Tsk tsk

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: dm-multipath low performance with blk-mq
@ 2016-01-27 20:45                       ` Mike Snitzer
  0 siblings, 0 replies; 127+ messages in thread
From: Mike Snitzer @ 2016-01-27 20:45 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Sagi Grimberg, linux-nvme, keith.busch,
	device-mapper development, Bart Van Assche

On Wed, Jan 27 2016 at  2:49pm -0500,
Jens Axboe <axboe@kernel.dk> wrote:

> On 01/27/2016 11:42 AM, Mike Snitzer wrote:
> >On Wed, Jan 27 2016 at 12:56pm -0500,
> >Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:
> >
> >>
> >>
> >>On 27/01/2016 19:48, Mike Snitzer wrote:
> >>>
> >>>BTW, I _cannot_ get null_blk to come even close to your reported 1500K+
> >>>IOPs on 2 "fast" systems I have access to.  Which arguments are you
> >>>loading the null_blk module with?
> >>>
> >>>I've been using:
> >>>modprobe null_blk gb=4 bs=4096 nr_devices=1 queue_mode=2 submit_queues=12
> >>
> >>$ for f in /sys/module/null_blk/parameters/*; do echo $f; cat $f; done
> >>/sys/module/null_blk/parameters/bs
> >>512
> >>/sys/module/null_blk/parameters/completion_nsec
> >>10000
> >>/sys/module/null_blk/parameters/gb
> >>250
> >>/sys/module/null_blk/parameters/home_node
> >>-1
> >>/sys/module/null_blk/parameters/hw_queue_depth
> >>64
> >>/sys/module/null_blk/parameters/irqmode
> >>1
> >>/sys/module/null_blk/parameters/nr_devices
> >>2
> >>/sys/module/null_blk/parameters/queue_mode
> >>2
> >>/sys/module/null_blk/parameters/submit_queues
> >>24
> >>/sys/module/null_blk/parameters/use_lightnvm
> >>N
> >>/sys/module/null_blk/parameters/use_per_node_hctx
> >>N
> >>
> >>$ fio --group_reporting --rw=randread --bs=4k --numjobs=24
> >>--iodepth=32 --runtime=99999999 --time_based --loops=1
> >>--ioengine=libaio --direct=1 --invalidate=1 --randrepeat=1
> >>--norandommap --exitall --name task_nullb0 --filename=/dev/nullb0
> >>task_nullb0: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
> >>ioengine=libaio, iodepth=32
> >>...
> >>fio-2.1.10
> >>Starting 24 processes
> >>Jobs: 24 (f=24): [rrrrrrrrrrrrrrrrrrrrrrrr] [0.0% done]
> >>[7234MB/0KB/0KB /s] [1852K/0/0 iops] [eta 1157d:09h:46m:22s]
> >
> >Thanks, the number of fio threads was pretty important.  I'm still
> >seeing better IOPs with queue_mode=0 (bio-based).
> >
> >Jobs: 24 (f=24): [r(24)] [11.7% done] [11073MB/0KB/0KB /s] [2835K/0/0 iops] [eta 14m:42s]
> >
> >(with queue_mode=2 I get ~1930K IOPs.. which I need to use to stack
> >request-based DM multipath ontop)
> >
> >Now I can focus on why dm-multipath is slow...
> 
> queue_mode=0 doesn't do a whole lot, once you add the required bits
> for a normal driver, that will eat up a bit of overhead. The point
> was to retain scaling, and to avoid drivers having to build support
> for the required functionality from scratch. If we do that once and
> fast/correct, then all mq drivers get it.
> 
> That said, your jump is a big one. Some of that is support
> functionality, and some of it (I bet) is just doing io stats. If you
> disable io stats with queue_mode=2, the performance will most likely
> increase. Other things we just can't disable or don't want to
> disable, if we are going to keep this as an indication of what a
> real driver could do through the mq stack.
> 
> Now, if queue_mode=1 is faster, then there's certainly an issue!

queue_mode=1 is awful, so we're safe there ;)

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: dm-multipath low performance with blk-mq
@ 2016-01-29 23:35                   ` Mike Snitzer
  0 siblings, 0 replies; 127+ messages in thread
From: Mike Snitzer @ 2016-01-29 23:35 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: axboe, keith.busch, linux-nvme, Christoph Hellwig,
	device-mapper development, linux-block, Bart Van Assche

On Wed, Jan 27 2016 at 12:56pm -0500,
Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:

> 
> 
> On 27/01/2016 19:48, Mike Snitzer wrote:
> >On Wed, Jan 27 2016 at  6:14am -0500,
> >Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:
> >
> >>
> >>>>I don't think this is going to help __multipath_map() without some
> >>>>configuration changes.  Now that we're running on already merged
> >>>>requests instead of bios, the m->repeat_count is almost always set to 1,
> >>>>so we call the path_selector every time, which means that we'll always
> >>>>need the write lock. Bumping up the number of IOs we send before calling
> >>>>the path selector again will give this patch a chance to do some good
> >>>>here.
> >>>>
> >>>>To do that you need to set:
> >>>>
> >>>>	rr_min_io_rq <something_bigger_than_one>
> >>>>
> >>>>in the defaults section of /etc/multipath.conf and then reload the
> >>>>multipathd service.
> >>>>
> >>>>The patch should hopefully help in multipath_busy() regardless of
> >>>>the rr_min_io_rq setting.
> >>>
> >>>This patch, while generic, is meant to help the blk-mq case.  A blk-mq
> >>>request_queue doesn't have an elevator so the requests will not have
> >>>seen merging.
> >>>
> >>>But yes, implied in the patch is the requirement to increase
> >>>m->repeat_count via multipathd's rr_min_io_rq (I'll backfill a proper
> >>>header once it is tested).
> >>
> >>I'll test it once I get some spare time (hopefully soon...)
> >
> >OK thanks.
> >
> >BTW, I _cannot_ get null_blk to come even close to your reported 1500K+
> >IOPs on 2 "fast" systems I have access to.  Which arguments are you
> >loading the null_blk module with?
> >
> >I've been using:
> >modprobe null_blk gb=4 bs=4096 nr_devices=1 queue_mode=2 submit_queues=12
> 
> $ for f in /sys/module/null_blk/parameters/*; do echo $f; cat $f; done
> /sys/module/null_blk/parameters/bs
> 512
> /sys/module/null_blk/parameters/completion_nsec
> 10000
> /sys/module/null_blk/parameters/gb
> 250
> /sys/module/null_blk/parameters/home_node
> -1
> /sys/module/null_blk/parameters/hw_queue_depth
> 64
> /sys/module/null_blk/parameters/irqmode
> 1
> /sys/module/null_blk/parameters/nr_devices
> 2
> /sys/module/null_blk/parameters/queue_mode
> 2
> /sys/module/null_blk/parameters/submit_queues
> 24
> /sys/module/null_blk/parameters/use_lightnvm
> N
> /sys/module/null_blk/parameters/use_per_node_hctx
> N
> 
> $ fio --group_reporting --rw=randread --bs=4k --numjobs=24
> --iodepth=32 --runtime=99999999 --time_based --loops=1
> --ioengine=libaio --direct=1 --invalidate=1 --randrepeat=1
> --norandommap --exitall --name task_nullb0 --filename=/dev/nullb0
> task_nullb0: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
> ioengine=libaio, iodepth=32
> ...
> fio-2.1.10
> Starting 24 processes
> Jobs: 24 (f=24): [rrrrrrrrrrrrrrrrrrrrrrrr] [0.0% done]
> [7234MB/0KB/0KB /s] [1852K/0/0 iops] [eta 1157d:09h:46m:22s]

Your test above is prone to exhaust the dm-mpath blk-mq tags (128)
because 24 threads * 32 easily exceeds 128 (by a factor of 6).

I found that we were context switching (via bt_get's io_schedule)
waiting for tags to become available.

This is embarrassing but, until Jens told me today, I was oblivious to
the fact that the number of blk-mq's tags per hw_queue was defined by
tag_set.queue_depth.

Previously request-based DM's blk-mq support had:
md->tag_set.queue_depth = BLKDEV_MAX_RQ; (again: 128)

Now I have a patch that allows tuning queue_depth via dm_mod module
parameter.  And I'll likely bump the default to 4096 or something (doing
so eliminated blocking in bt_get).
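
Back-of-the-envelope, the depth has to cover at least numjobs * iodepth
of the fio workload (24 * 32 = 768 here), so something along these lines
once the knob lands (a sketch; dm_mq_queue_depth is purely a placeholder
name, the parameter hadn't been merged at the time of writing):

# 24 jobs * 32 iodepth = 768 requests in flight; 128 tags starve
modprobe dm_mod dm_mq_queue_depth=4096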

But eliminating the tags bottleneck only raised my read IOPs from ~600K
to ~800K (using 1 hw_queue for both null_blk and dm-mpath).

When I raise nr_hw_queues to 4 for null_blk (keeping dm-mq at 1) I see a
whole lot more context switching due to request-based DM's use of
ksoftirqd (and kworkers) for request completion.

So I'm moving on to optimizing the completion path.  But at least some
progress was made, more to come...

Mike

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: dm-multipath low performance with blk-mq
@ 2016-01-30  8:52                     ` Hannes Reinecke
  0 siblings, 0 replies; 127+ messages in thread
From: Hannes Reinecke @ 2016-01-30  8:52 UTC (permalink / raw)
  To: Mike Snitzer, Sagi Grimberg
  Cc: axboe, linux-block, Christoph Hellwig, linux-nvme, keith.busch,
	device-mapper development, Bart Van Assche

On 01/30/2016 12:35 AM, Mike Snitzer wrote:
> On Wed, Jan 27 2016 at 12:56pm -0500,
> Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:
>
>>
>>
>> On 27/01/2016 19:48, Mike Snitzer wrote:
>>> On Wed, Jan 27 2016 at  6:14am -0500,
>>> Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:
>>>
>>>>
>>>>>> I don't think this is going to help __multipath_map() without some
>>>>>> configuration changes.  Now that we're running on already merged
>>>>>> requests instead of bios, the m->repeat_count is almost always set to 1,
>>>>>> so we call the path_selector every time, which means that we'll always
>>>>>> need the write lock. Bumping up the number of IOs we send before calling
>>>>>> the path selector again will give this patch a chance to do some good
>>>>>> here.
>>>>>>
>>>>>> To do that you need to set:
>>>>>>
>>>>>> 	rr_min_io_rq <something_bigger_than_one>
>>>>>>
>>>>>> in the defaults section of /etc/multipath.conf and then reload the
>>>>>> multipathd service.
>>>>>>
>>>>>> The patch should hopefully help in multipath_busy() regardless of
>>>>>> the rr_min_io_rq setting.
>>>>>
>>>>> This patch, while generic, is meant to help the blk-mq case.  A blk-mq
>>>>> request_queue doesn't have an elevator so the requests will not have
>>>>> seen merging.
>>>>>
>>>>> But yes, implied in the patch is the requirement to increase
>>>>> m->repeat_count via multipathd's rr_min_io_rq (I'll backfill a proper
>>>>> header once it is tested).
>>>>
>>>> I'll test it once I get some spare time (hopefully soon...)
>>>
>>> OK thanks.
>>>
>>> BTW, I _cannot_ get null_blk to come even close to your reported 1500K+
>>> IOPs on 2 "fast" systems I have access to.  Which arguments are you
>>> loading the null_blk module with?
>>>
>>> I've been using:
>>> modprobe null_blk gb=4 bs=4096 nr_devices=1 queue_mode=2 submit_queues=12
>>
>> $ for f in /sys/module/null_blk/parameters/*; do echo $f; cat $f; done
>> /sys/module/null_blk/parameters/bs
>> 512
>> /sys/module/null_blk/parameters/completion_nsec
>> 10000
>> /sys/module/null_blk/parameters/gb
>> 250
>> /sys/module/null_blk/parameters/home_node
>> -1
>> /sys/module/null_blk/parameters/hw_queue_depth
>> 64
>> /sys/module/null_blk/parameters/irqmode
>> 1
>> /sys/module/null_blk/parameters/nr_devices
>> 2
>> /sys/module/null_blk/parameters/queue_mode
>> 2
>> /sys/module/null_blk/parameters/submit_queues
>> 24
>> /sys/module/null_blk/parameters/use_lightnvm
>> N
>> /sys/module/null_blk/parameters/use_per_node_hctx
>> N
>>
>> $ fio --group_reporting --rw=randread --bs=4k --numjobs=24
>> --iodepth=32 --runtime=99999999 --time_based --loops=1
>> --ioengine=libaio --direct=1 --invalidate=1 --randrepeat=1
>> --norandommap --exitall --name task_nullb0 --filename=/dev/nullb0
>> task_nullb0: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
>> ioengine=libaio, iodepth=32
>> ...
>> fio-2.1.10
>> Starting 24 processes
>> Jobs: 24 (f=24): [rrrrrrrrrrrrrrrrrrrrrrrr] [0.0% done]
>> [7234MB/0KB/0KB /s] [1852K/0/0 iops] [eta 1157d:09h:46m:22s]
>
> Your test above is prone to exhaust the dm-mpath blk-mq tags (128)
> because 24 threads * 32 easily exceeds 128 (by a factor of 6).
>
> I found that we were context switching (via bt_get's io_schedule)
> waiting for tags to become available.
>
> >This is embarrassing but, until Jens told me today, I was oblivious to
> the fact that the number of blk-mq's tags per hw_queue was defined by
> tag_set.queue_depth.
>
> Previously request-based DM's blk-mq support had:
> md->tag_set.queue_depth = BLKDEV_MAX_RQ; (again: 128)
>
> Now I have a patch that allows tuning queue_depth via dm_mod module
> parameter.  And I'll likely bump the default to 4096 or something (doing
> so eliminated blocking in bt_get).
>
> But eliminating the tags bottleneck only raised my read IOPs from ~600K
> to ~800K (using 1 hw_queue for both null_blk and dm-mpath).
>
> When I raise nr_hw_queues to 4 for null_blk (keeping dm-mq at 1) I see a
> whole lot more context switching due to request-based DM's use of
> ksoftirqd (and kworkers) for request completion.
>
> So I'm moving on to optimizing the completion path.  But at least some
> progress was made, more to come...
>
Would you mind sharing your patches?
We're currently doing tests with a high-performance FC setup
(16G FC with all-flash storage), and are still 20% short of the 
announced backend performance.

Just as a side note: we're currently getting 550k IOPs.
With unpatched dm-mpath.
So nearly on par with your null-blk setup. but with real hardware.
(Which in itself is pretty cool. You should get faster RAM :-)

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: dm-multipath low performance with blk-mq
@ 2016-01-30 19:12                       ` Mike Snitzer
  0 siblings, 0 replies; 127+ messages in thread
From: Mike Snitzer @ 2016-01-30 19:12 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: axboe, Christoph Hellwig, Sagi Grimberg, linux-nvme, keith.busch,
	device-mapper development, linux-block, Bart Van Assche

On Sat, Jan 30 2016 at  3:52am -0500,
Hannes Reinecke <hare@suse.de> wrote:

> On 01/30/2016 12:35 AM, Mike Snitzer wrote:
> >
> >Your test above is prone to exhaust the dm-mpath blk-mq tags (128)
> >because 24 threads * 32 easily exceeds 128 (by a factor of 6).
> >
> >I found that we were context switching (via bt_get's io_schedule)
> >waiting for tags to become available.
> >
> >This is embarrassing but, until Jens told me today, I was oblivious to
> >the fact that the number of blk-mq's tags per hw_queue was defined by
> >tag_set.queue_depth.
> >
> >Previously request-based DM's blk-mq support had:
> >md->tag_set.queue_depth = BLKDEV_MAX_RQ; (again: 128)
> >
> >Now I have a patch that allows tuning queue_depth via dm_mod module
> >parameter.  And I'll likely bump the default to 4096 or something (doing
> >so eliminated blocking in bt_get).
> >
> >But eliminating the tags bottleneck only raised my read IOPs from ~600K
> >to ~800K (using 1 hw_queue for both null_blk and dm-mpath).
> >
> >When I raise nr_hw_queues to 4 for null_blk (keeping dm-mq at 1) I see a
> >whole lot more context switching due to request-based DM's use of
> >ksoftirqd (and kworkers) for request completion.
> >
> >So I'm moving on to optimizing the completion path.  But at least some
> >progress was made, more to come...
> >
>
> Would you mind sharing your patches?

I'm still working through this.  I'll hopefully have a handful of
RFC-level changes by end of day Monday.  But could take longer.

One change that I already shared in a previous mail is:
http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=devel2&id=99ebcaf36d9d1fa3acec98492c36664d57ba8fbd

> We're currently doing tests with a high-performance FC setup
> (16G FC with all-flash storage), and are still 20% short of the
> announced backend performance.
>
> Just as a side note: we're currently getting 550k IOPs.
> With unpatched dm-mpath.

What is your test workload?  If you can share I'll be sure to factor it
into my testing.

> So nearly on par with your null-blk setup. but with real hardware.
> (Which in itself is pretty cool. You should get faster RAM :-)

You've misunderstood what I said my null_blk (RAM) performance is.

My null_blk test gets ~1900K read IOPs.  But dm-mpath ontop only gets
between 600K and 1000K IOPs depending on $FIO_QUEUE_DEPTH and if I
use multiple $NULL_BLK_HW_QUEUES.

Here is the script I've been using to test:

#!/bin/sh

set -xv

NULL_BLK_HW_QUEUES=1
NULL_BLK_QUEUE_DEPTH=4096

DM_MQ_HW_QUEUES=1
DM_MQ_QUEUE_DEPTH=4096

FIO=/root/snitm/git/fio/fio
FIO_QUEUE_DEPTH=32
FIO_RUNTIME=10
FIO_NUMJOBS=12

PERF=perf
#PERF=/root/snitm/git/linux/tools/perf/perf

run_fio() {
    DEVICE=$1
    TASK_NAME=$(basename ${DEVICE})
    PERF_RECORD=$2
    RUN_CMD="${FIO} --cpus_allowed_policy=split --group_reporting --rw=randread --bs=4k --numjobs=${FIO_NUMJOBS} \
              --iodepth=${FIO_QUEUE_DEPTH} --runtime=${FIO_RUNTIME} --time_based --loops=1 --ioengine=libaio \
              --direct=1 --invalidate=1 --randrepeat=1 --norandommap --exitall --name task_${TASK_NAME} --filename=${DEVICE}"

    if [ ! -z "${PERF_RECORD}" ]; then
	${PERF_RECORD} ${RUN_CMD}
	mv perf.data perf.data.${TASK_NAME}
    else
	${RUN_CMD}
    fi
}

dmsetup remove dm_mq
modprobe -r null_blk

modprobe null_blk gb=4 bs=512 hw_queue_depth=${NULL_BLK_QUEUE_DEPTH} nr_devices=1 queue_mode=2 irqmode=1 completion_nsec=1 submit_queues=${NULL_BLK_HW_QUEUES}

run_fio /dev/nullb0
run_fio /dev/nullb0 "${PERF} record -ag -e cs"

echo Y > /sys/module/dm_mod/parameters/use_blk_mq
echo ${DM_MQ_QUEUE_DEPTH} > /sys/module/dm_mod/parameters/blk_mq_queue_depth
echo ${DM_MQ_HW_QUEUES} > /sys/module/dm_mod/parameters/blk_mq_hw_queues
echo "0 8388608 multipath 0 0 1 1 service-time 0 1 2 /dev/nullb0 1000 1" | dmsetup create dm_mq

run_fio /dev/mapper/dm_mq
run_fio /dev/mapper/dm_mq "${PERF} record -ag -e cs"

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: dm-multipath low performance with blk-mq
@ 2016-02-01  6:46                         ` Hannes Reinecke
  0 siblings, 0 replies; 127+ messages in thread
From: Hannes Reinecke @ 2016-02-01  6:46 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: axboe, keith.busch, Sagi Grimberg, linux-nvme, Christoph Hellwig,
	device-mapper development, linux-block, Bart Van Assche

On 01/30/2016 08:12 PM, Mike Snitzer wrote:
> On Sat, Jan 30 2016 at  3:52am -0500,
> Hannes Reinecke <hare@suse.de> wrote:
> 
>> On 01/30/2016 12:35 AM, Mike Snitzer wrote:
>>>
>>> Your test above is prone to exhaust the dm-mpath blk-mq tags (128)
>>> because 24 threads * 32 easily exceeds 128 (by a factor of 6).
>>>
>>> I found that we were context switching (via bt_get's io_schedule)
>>> waiting for tags to become available.
>>>
>>> This is embarrassing but, until Jens told me today, I was oblivious to
>>> the fact that the number of blk-mq's tags per hw_queue was defined by
>>> tag_set.queue_depth.
>>>
>>> Previously request-based DM's blk-mq support had:
>>> md->tag_set.queue_depth = BLKDEV_MAX_RQ; (again: 128)
>>>
>>> Now I have a patch that allows tuning queue_depth via dm_mod module
>>> parameter.  And I'll likely bump the default to 4096 or something (doing
>>> so eliminated blocking in bt_get).
>>>
>>> But eliminating the tags bottleneck only raised my read IOPs from ~600K
>>> to ~800K (using 1 hw_queue for both null_blk and dm-mpath).
>>>
>>> When I raise nr_hw_queues to 4 for null_blk (keeping dm-mq at 1) I see a
>>> whole lot more context switching due to request-based DM's use of
>>> ksoftirqd (and kworkers) for request completion.
>>>
>>> So I'm moving on to optimizing the completion path.  But at least some
>>> progress was made, more to come...
>>>
>>
>> Would you mind sharing your patches?
> 
> I'm still working through this.  I'll hopefully have a handful of
> RFC-level changes by end of day Monday.  But could take longer.
> 
> One change that I already shared in a previous mail is:
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=devel2&id=99ebcaf36d9d1fa3acec98492c36664d57ba8fbd
> 
>> We're currently doing tests with a high-performance FC setup
>> (16G FC with all-flash storage), and are still 20% short of the
>> announced backend performance.
>>
>> Just as a side note: we're currently getting 550k IOPs.
>> With unpatched dm-mpath.
> 
> What is your test workload?  If you can share I'll be sure to factor it
> into my testing.
> 
That's a plain random read via fio, using 8 LUNs on the target.

>> So nearly on par with your null-blk setup. but with real hardware.
>> (Which in itself is pretty cool. You should get faster RAM :-)
> 
> You've misunderstood what I said my null_blk (RAM) performance is.
> 
> My null_blk test gets ~1900K read IOPs.  But dm-mpath ontop only gets
> between 600K and 1000K IOPs depending on $FIO_QUEUE_DEPTH and if I
> use multiple $NULL_BLK_HW_QUEUES.
> 
Right.
We're using two 16G FC links, each talking to 4 LUNs.
With dm-mpath on top. The FC HBAs have a hardware queue depth
of roughly 2000, so we might need to tweak the queue depth of the
multipath devices, too.


Will be having a look at your patches.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: dm-multipath low performance with blk-mq
@ 2016-02-03 18:04                           ` Mike Snitzer
  0 siblings, 0 replies; 127+ messages in thread
From: Mike Snitzer @ 2016-02-03 18:04 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: axboe, Christoph Hellwig, Sagi Grimberg, linux-nvme, keith.busch,
	device-mapper development, linux-block, Bart Van Assche

On Mon, Feb 01 2016 at  1:46am -0500,
Hannes Reinecke <hare@suse.de> wrote:

> On 01/30/2016 08:12 PM, Mike Snitzer wrote:
> > On Sat, Jan 30 2016 at  3:52am -0500,
> > Hannes Reinecke <hare@suse.de> wrote:
> > > 
> >> So nearly on par with your null-blk setup. but with real hardware.
> >> (Which in itself is pretty cool. You should get faster RAM :-)
> > 
> > You've misunderstood what I said my null_blk (RAM) performance is.
> > 
> > My null_blk test gets ~1900K read IOPs.  But dm-mpath ontop only gets
> > between 600K and 1000K IOPs depending on $FIO_QUEUE_DEPTH and if I
> > use multiple $NULL_BLK_HW_QUEUES.
> > 
> Right.
> We're using two 16G FC links, each talking to 4 LUNs.
> With dm-mpath on top. The FC HBAs have a hardware queue depth
> of roughly 2000, so we might need to tweak the queue depth of the
> multipath devices, too.
> 
> 
> Will be having a look at your patches.

I have staged quite a few patches in linux-next for the 4.6 merge window:
https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.6

I'm open to posting them to dm-devel if it would ease review.  Let me
know.

These changes range from:
- defaulting to a queue_depth of 2048 (rather than 64) requests per blk-mq
  hw queue -- fixed stalls waiting for the finite number of tags (in bt_get)
- making additional use of the DM-multipath blk-mq device's pdu for
  mpath per-io data structures
- using blk-mq interfaces rather than generic wrappers (mainly just
  helps document the nature of the requests in blk-mq specific code
  paths)
- avoiding running the blk-mq hw queues on request completion (doesn't
  seem to help like it does for .request_fn multipath; only serves to
  generate extra kblockd work for no observed gain)
- optimize both .request_fn (dm_request_fn) and blk-mq (dm_mq_queue_rq)
  so they don't bother with the bio-based DM pattern of finding which
  target is used to map IO at the particular offset -- request-based DM
  only ever has a single immutable target associated with it
- removal of dead code and code comment improvements

I've seen blk-mq DM-multipath performance improvement but _not_ enough
to consider this line of work "done".  I'd be very interested to see
what kind of improvements you (Hannes) and Sagi can realize with your
respective testbeds.

I'm still not clear on where the considerable performance loss is coming
from (on null_blk devices I see ~1900K read IOPs but I'm still only
seeing ~1000K read IOPs when blk-mq DM-multipath is layered ontop).
What is very much apparent is layering dm-mq multipath ontop of null_blk
results in a HUGE amount of additional context switches.  I can only
infer that the request completion for this stacked device (blk-mq queue
ontop of blk-mq queue, with 2 completions: 1 for clone completing on
underlying device and 1 for original request completing) is the reason
for all the extra context switches.

Here are pictures of 'perf report' for perf data collected using
'perf record -ag -e cs'.

Against null_blk:
http://people.redhat.com/msnitzer/perf-report-cs-null_blk.png
Against dm-mpath ontop of the same null_blk:
http://people.redhat.com/msnitzer/perf-report-cs-dm_mq.png
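
(For completeness: the collection step is just the 'perf record -ag -e cs'
wrapper from the test script I posted earlier; viewing the data is then
something along these lines -- exact report options approximate:)

perf record -ag -e cs -- ${RUN_CMD}              # ${RUN_CMD} = the fio command line from that script
mv perf.data perf.data.dm_mq
perf report -i perf.data.dm_mq --sort=comm,dso   # break the context switches down per task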

Looks like there may be some low-hanging fruit associated with steering
completion to reduce all the excessive ksoftirq and kworker context
switching.  Pin-pointing the reason these tasks are context switching is
my next focus.

I've yet to actually test on DM-multipath device with more than one
path.  Hannes, Sagi, and/or others: on such a setup it would be
interesting to see if increasing the 'blk_mq_nr_hw_queues' helps at all.
Any 'perf report' traces that shed light on bottlenecks you might be
experiencing would obviously be appreciated.  I'm skeptical there is
enough parallelism in the dm-mpath.c code to allow for proper scaling --
switching to RCU could help this.
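
(If anyone wants a head start on the multiple-path case: an untested sketch
based on my null_blk script -- assuming I have the multipath table syntax
right -- would be to load null_blk with two devices and hand both to one
table, e.g.:)

modprobe null_blk gb=4 bs=512 hw_queue_depth=4096 nr_devices=2 queue_mode=2 irqmode=1 completion_nsec=1 submit_queues=1
echo "0 8388608 multipath 0 0 1 1 service-time 0 2 2 /dev/nullb0 1000 1 /dev/nullb1 1000 1" | dmsetup create dm_mq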

Mike

p.s.
I experimented with using the top-level DM multipath blk-mq queue's
pdu for the underlying clone 'struct request' that is implicitly needed
when issuing the request to the underlying path -- by (ab)using
blk_mq_tag_set_rq that is used by blk-flush.c.  blk-mq hated me for
trying this.  I kept getting list corruption on unplug with this (and
many variants on work along these lines):
http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=devel2&id=7b7203c93cec7ad3a0ae2a2da567d45f46fe8098

I stopped that line of work due to inability to make it function.. but
it was a skunk-works experiment that needed to die anyway (as I'm sure
Jens will agree).

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: dm-multipath low performance with blk-mq
@ 2016-02-03 18:24                             ` Mike Snitzer
  0 siblings, 0 replies; 127+ messages in thread
From: Mike Snitzer @ 2016-02-03 18:24 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: axboe, Christoph Hellwig, Sagi Grimberg, linux-nvme, keith.busch,
	device-mapper development, linux-block, Bart Van Assche

On Wed, Feb 03 2016 at  1:04pm -0500,
Mike Snitzer <snitzer@redhat.com> wrote:
 
> I'm still not clear on where the considerable performance loss is coming
> from (on null_blk device I see ~1900K read IOPs but I'm still only
> seeing ~1000K read IOPs when blk-mq DM-multipath is layered ontop).
> What is very much apparent is: layering dm-mq multipath ontop of null_blk
> results in a HUGE amount of additional context switches.  I can only
> infer that the request completion for this stacked device (blk-mq queue
> ontop of blk-mq queue, with 2 completions: 1 for clone completing on
> underlying device and 1 for original request completing) is the reason
> for all the extra context switches.

Starts to explain, certainly not the "reason"; that is still very much
TBD...

> > Here are pictures of 'perf report' for perf data collected using
> 'perf record -ag -e cs'.
> 
> Against null_blk:
> http://people.redhat.com/msnitzer/perf-report-cs-null_blk.png

if dm-mq nr_hw_queues=1 and null_blk nr_hw_queues=1
  cpu          : usr=25.53%, sys=74.40%, ctx=1970, majf=0, minf=474
if dm-mq nr_hw_queues=1 and null_blk nr_hw_queues=4
  cpu          : usr=26.79%, sys=73.15%, ctx=2067, majf=0, minf=479

> Against dm-mpath ontop of the same null_blk:
> http://people.redhat.com/msnitzer/perf-report-cs-dm_mq.png

if dm-mq nr_hw_queues=1 and null_blk nr_hw_queues=1
  cpu          : usr=11.07%, sys=33.90%, ctx=667784, majf=0, minf=466
if dm-mq nr_hw_queues=1 and null_blk nr_hw_queues=4
  cpu          : usr=15.22%, sys=48.44%, ctx=2314901, majf=0, minf=466

So yeah, the percentages reflected in these respective images didn't do
the huge increase in context switches justice... we _must_ figure out
why we're seeing so many context switches with dm-mq.

The same fio job is run to measure these context switches, e.g.:

fio --cpus_allowed_policy=split --group_reporting --rw=randread --bs=4k
--numjobs=12 --iodepth=32 --runtime=10 --time_based --loops=1
--ioengine=libaio --direct=1 --invalidate=1 --randrepeat=1 --norandommap
--exitall --name task_nullb0 --filename=/dev/nullb0

fio --cpus_allowed_policy=split --group_reporting --rw=randread --bs=4k
--numjobs=12 --iodepth=32 --runtime=10 --time_based --loops=1
--ioengine=libaio --direct=1 --invalidate=1 --randrepeat=1 --norandommap
--exitall --name task_dm_mq --filename=/dev/mapper/dm_mq

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: dm-multipath low performance with blk-mq
@ 2016-02-03 19:22                               ` Mike Snitzer
  0 siblings, 0 replies; 127+ messages in thread
From: Mike Snitzer @ 2016-02-03 19:22 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: axboe, keith.busch, Sagi Grimberg, linux-nvme, Christoph Hellwig,
	device-mapper development, linux-block, Bart Van Assche


On Wed, Feb 03 2016 at  1:24pm -0500,
Mike Snitzer <snitzer@redhat.com> wrote:
 
> > Here are pictures of 'perf report' for perf data collected using
> > 'perf record -ag -e cs'.
> > 
> > Against null_blk:
> > http://people.redhat.com/msnitzer/perf-report-cs-null_blk.png
> 
> if dm-mq nr_hw_queues=1 and null_blk nr_hw_queues=1
>   cpu          : usr=25.53%, sys=74.40%, ctx=1970, majf=0, minf=474
> if dm-mq nr_hw_queues=1 and null_blk nr_hw_queues=4
>   cpu          : usr=26.79%, sys=73.15%, ctx=2067, majf=0, minf=479
> 
> > Against dm-mpath ontop of the same null_blk:
> > http://people.redhat.com/msnitzer/perf-report-cs-dm_mq.png
> 
> if dm-mq nr_hw_queues=1 and null_blk nr_hw_queues=1
>   cpu          : usr=11.07%, sys=33.90%, ctx=667784, majf=0, minf=466
> if dm-mq nr_hw_queues=1 and null_blk nr_hw_queues=4
>   cpu          : usr=15.22%, sys=48.44%, ctx=2314901, majf=0, minf=466

I promise, this is my last reply to myself ;)

The above dm-mq results were _without_ using this commit:
"dm: don't blk_mq_run_hw_queues in blk-mq request completion"
https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.6&id=cc6ca783e8f0669112c5f4154f51a7cb17b76006

But with that commit I'm still seeing the high context switches:

if dm-mq nr_hw_queues=1 and null_blk nr_hw_queues=1
   cpu          : usr=11.78%, sys=36.11%, ctx=690262, majf=0, minf=470
if dm-mq nr_hw_queues=1 and null_blk nr_hw_queues=4
  cpu          : usr=15.62%, sys=49.95%, ctx=2425084, majf=0, minf=466

So running blk_mq_run_hw_queues (async to punt to kblockd) on dm-mq
request completion isn't the source of any of the accounted context
switches...

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: dm-multipath low performance with blk-mq
@ 2016-02-04  6:54                               ` Hannes Reinecke
  0 siblings, 0 replies; 127+ messages in thread
From: Hannes Reinecke @ 2016-02-04  6:54 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: axboe, Christoph Hellwig, Sagi Grimberg, linux-nvme, keith.busch,
	device-mapper development, linux-block, Bart Van Assche

On 02/03/2016 07:24 PM, Mike Snitzer wrote:
> On Wed, Feb 03 2016 at  1:04pm -0500,
> Mike Snitzer <snitzer@redhat.com> wrote:
>  
>> I'm still not clear on where the considerable performance loss is coming
>> from (on null_blk device I see ~1900K read IOPs but I'm still only
>> seeing ~1000K read IOPs when blk-mq DM-multipath is layered ontop).
>> What is very much apparent is: layering dm-mq multipath ontop of null_blk
>> results in a HUGE amount of additional context switches.  I can only
>> infer that the request completion for this stacked device (blk-mq queue
>> ontop of blk-mq queue, with 2 completions: 1 for clone completing on
>> underlying device and 1 for original request completing) is the reason
>> for all the extra context switches.
> 
> Starts to explain, certainly not the "reason"; that is still very much
> TBD...
> 
>> Here are pictures of 'perf report' for perf data collected using
>> 'perf record -ag -e cs'.
>>
>> Against null_blk:
>> http://people.redhat.com/msnitzer/perf-report-cs-null_blk.png
> 
> if dm-mq nr_hw_queues=1 and null_blk nr_hw_queues=1
>   cpu          : usr=25.53%, sys=74.40%, ctx=1970, majf=0, minf=474
> if dm-mq nr_hw_queues=1 and null_blk nr_hw_queues=4
>   cpu          : usr=26.79%, sys=73.15%, ctx=2067, majf=0, minf=479
> 
>> Against dm-mpath ontop of the same null_blk:
>> http://people.redhat.com/msnitzer/perf-report-cs-dm_mq.png
> 
> if dm-mq nr_hw_queues=1 and null_blk nr_hw_queues=1
>   cpu          : usr=11.07%, sys=33.90%, ctx=667784, majf=0, minf=466
> if dm-mq nr_hw_queues=1 and null_blk nr_hw_queues=4
>   cpu          : usr=15.22%, sys=48.44%, ctx=2314901, majf=0, minf=466
> 
> So yeah, the percentages reflected in these respective images didn't do
> the huge increase in context switches justice... we _must_ figure out
> why we're seeing so many context switches with dm-mq.
> 
Well, the most obvious one being that you're using 1 dm-mq queue vs
4 null_blk queues.
So you will have to do an additional context switch for 75% of
the total I/Os submitted.

Have you tested with 4 dm-mq hw queues?

To avoid context switches we would have to align the dm-mq queues to
the underlying blk-mq layout for the paths.
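
(Untested, but with your script that should only be a matter of bumping the
two hw queue counts in lockstep, e.g.:)

NULL_BLK_HW_QUEUES=4
DM_MQ_HW_QUEUES=4
modprobe null_blk gb=4 bs=512 hw_queue_depth=4096 nr_devices=1 queue_mode=2 irqmode=1 completion_nsec=1 submit_queues=${NULL_BLK_HW_QUEUES}
echo ${DM_MQ_HW_QUEUES} > /sys/module/dm_mod/parameters/blk_mq_hw_queues
# then create the dm_mq device and re-run fio exactly as before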

And we need to look at making the main submission path lockless;
I was wondering if we really need to take the lock if we don't
switch priority groups; maybe we can establish a similar algorithm
blk-mq does; if we were to have a queue per valid path in any given
priority group we should be able to run lockless and only take the
lock if we need to switch priority groups.

But anyway, I'll be looking at your patches.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: dm-multipath low performance with blk-mq
@ 2016-02-04 13:54                                 ` Mike Snitzer
  0 siblings, 0 replies; 127+ messages in thread
From: Mike Snitzer @ 2016-02-04 13:54 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: axboe, Christoph Hellwig, Sagi Grimberg, linux-nvme, keith.busch,
	device-mapper development, linux-block, Bart Van Assche

On Thu, Feb 04 2016 at  1:54am -0500,
Hannes Reinecke <hare@suse.de> wrote:

> On 02/03/2016 07:24 PM, Mike Snitzer wrote:
> > On Wed, Feb 03 2016 at  1:04pm -0500,
> > Mike Snitzer <snitzer@redhat.com> wrote:
> >  
> >> I'm still not clear on where the considerable performance loss is coming
> >> from (on null_blk device I see ~1900K read IOPs but I'm still only
> >> seeing ~1000K read IOPs when blk-mq DM-multipath is layered ontop).
> >> What is very much apparent is: layering dm-mq multipath ontop of null_blk
> >> results in a HUGE amount of additional context switches.  I can only
> >> infer that the request completion for this stacked device (blk-mq queue
> >> ontop of blk-mq queue, with 2 completions: 1 for clone completing on
> >> underlying device and 1 for original request completing) is the reason
> >> for all the extra context switches.
> > 
> > Starts to explain, certainly not the "reason"; that is still very much
> > TBD...
> > 
> >> Here are pictures of 'perf report' for perf data collected using
> >> 'perf record -ag -e cs'.
> >>
> >> Against null_blk:
> >> http://people.redhat.com/msnitzer/perf-report-cs-null_blk.png
> > 
> > if dm-mq nr_hw_queues=1 and null_blk nr_hw_queues=1
> >   cpu          : usr=25.53%, sys=74.40%, ctx=1970, majf=0, minf=474
> > if dm-mq nr_hw_queues=1 and null_blk nr_hw_queues=4
> >   cpu          : usr=26.79%, sys=73.15%, ctx=2067, majf=0, minf=479
> > 
> >> Against dm-mpath ontop of the same null_blk:
> >> http://people.redhat.com/msnitzer/perf-report-cs-dm_mq.png
> > 
> > if dm-mq nr_hw_queues=1 and null_blk nr_hw_queues=1
> >   cpu          : usr=11.07%, sys=33.90%, ctx=667784, majf=0, minf=466
> > if dm-mq nr_hw_queues=1 and null_blk nr_hw_queues=4
> >   cpu          : usr=15.22%, sys=48.44%, ctx=2314901, majf=0, minf=466
> > 
> > So yeah, the percentages reflected in these respective images didn't do
> > the huge increase in context switches justice... we _must_ figure out
> > why we're seeing so many context switches with dm-mq.
> > 
> Well, the most obvious one being that you're using 1 dm-mq queue vs
> 4 null_blk queues.
> So you will have have to do an additional context switch for 75% of
> the total I/Os submitted.

Right, that case is certainly prone to more context switches.  But I'm
initially most concerned about the case where both only have 1 queue.

> Have you tested with 4 dm-mq hw queues?

Yes, it makes performance worse.  This is likely rooted in dm-mpath IO
path not being lockless.  But I also have concern about whether the
clone, sent to the underlying path, is completing on a different cpu
than dm-mq's original request.

I'll be using ftrace to try to dig into the various aspects of this
(perf, as I know how to use it, isn't giving me enough precision in its
reporting).
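
(Rough plan, assuming tracefs is mounted in the usual place: capture
sched_switch/sched_wakeup around one dm_mq fio run and see which tasks
dominate:)

cd /sys/kernel/debug/tracing
echo 0 > tracing_on
echo > trace
echo 'sched:sched_switch sched:sched_wakeup' > set_event
echo 1 > tracing_on
# run the dm_mq fio job from the earlier script here
echo 0 > tracing_on
grep -c 'ksoftirqd\|kworker' trace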

> To avoid context switches we would have to align the dm-mq queues to
> the underlying blk-mq layout for the paths.

Right, we need to take more care (how remains TBD).  But for now I'm
just going to focus on the case where both dm-mq and null_blk have 1 for
nr_hw_queues.  As you can see even in that config the number of context
switches goes from 1970 to 667784 (and there is a huge loss of system
cpu utilization) once dm-mq w/ 1 hw_queue is stacked on top of the
null_blk device.

Once we understand the source of all the additional context switching
for this more simplistic stacked configuration we can look closer at
scaling as we add more underlying paths.

> And we need to look at making the main submission path lockless;
> I was wondering if we really need to take the lock if we don't
> switch priority groups; maybe we can establish a similar algorithm
> blk-mq does; if we were to have a queue per valid path in any given
> priority group we should be able to run lockless and only take the
> lock if we need to switch priority groups.

I'd like to explore this further with you once I come back up from this
frustrating deep dive on "what is causing all these context switches!?"
 
> But anyway, I'll be looking at your patches.

Thanks, sadly none of the patches are going to fix the performance
problems but I do think they are a step forward.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: dm-multipath low performance with blk-mq
@ 2016-02-04 13:58                                   ` Hannes Reinecke
  0 siblings, 0 replies; 127+ messages in thread
From: Hannes Reinecke @ 2016-02-04 13:58 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: axboe, Christoph Hellwig, Sagi Grimberg, linux-nvme, keith.busch,
	device-mapper development, linux-block, Bart Van Assche

On 02/04/2016 02:54 PM, Mike Snitzer wrote:
> On Thu, Feb 04 2016 at  1:54am -0500,
> Hannes Reinecke <hare@suse.de> wrote:
> 
[ .. ]
>> But anyway, I'll be looking at your patches.
> 
> Thanks, sadly none of the patches are going to fix the performance
> problems but I do think they are a step forward.
> 
Hmm. I've got a slew of patches converting dm-mpath to use atomic_t
and bitops; with that we should be able to move to rcu for path
lookup and do away with most of the locking.
Quite raw, though; drop me a mail if you're interested.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: dm-multipath low performance with blk-mq
@ 2016-02-04 14:09                                     ` Mike Snitzer
  0 siblings, 0 replies; 127+ messages in thread
From: Mike Snitzer @ 2016-02-04 14:09 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: axboe, Christoph Hellwig, Sagi Grimberg, linux-nvme, keith.busch,
	device-mapper development, linux-block, Bart Van Assche

On Thu, Feb 04 2016 at  8:58am -0500,
Hannes Reinecke <hare@suse.de> wrote:

> On 02/04/2016 02:54 PM, Mike Snitzer wrote:
> > On Thu, Feb 04 2016 at  1:54am -0500,
> > Hannes Reinecke <hare@suse.de> wrote:
> > 
> [ .. ]
> >> But anyway, I'll be looking at your patches.
> > 
> > Thanks, sadly none of the patches are going to fix the performance
> > problems but I do think they are a step forward.
> > 
> Hmm. I've got a slew of patches converting dm-mpath to use atomic_t
> and bitops; with that we should be able to move to rcu for path
> lookup and do away with most of the locking.
> Quite raw, though; drop me a mail if you're interested.

Hmm, ok I just switched m->lock from spinlock_t to rwlock_t, see:
https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.6&id=a5226e23a6958ac9b7ade13a983604c43d232c7d

So any patch you'd have in this area would need rebasing.  I'll gladly
look at what you have (even if it isn't rebased).  So yes please share.

(it could be that there isn't a big enough win associated with switching
to rwlock_t -- that we could get away without doing that particular
churn.. open to that if you think rwlock_t pointless given we'll take
the write lock after repeat_count drops to 0)

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: dm-multipath low performance with blk-mq
@ 2016-02-04 14:32                                       ` Hannes Reinecke
  0 siblings, 0 replies; 127+ messages in thread
From: Hannes Reinecke @ 2016-02-04 14:32 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: axboe, Christoph Hellwig, Sagi Grimberg, linux-nvme, keith.busch,
	device-mapper development, linux-block, Bart Van Assche

[-- Attachment #1: Type: text/plain, Size: 2299 bytes --]

On 02/04/2016 03:09 PM, Mike Snitzer wrote:
> On Thu, Feb 04 2016 at  8:58am -0500,
> Hannes Reinecke <hare@suse.de> wrote:
> 
>> On 02/04/2016 02:54 PM, Mike Snitzer wrote:
>>> On Thu, Feb 04 2016 at  1:54am -0500,
>>> Hannes Reinecke <hare@suse.de> wrote:
>>>
>> [ .. ]
>>>> But anyway, I'll be looking at your patches.
>>>
>>> Thanks, sadly none of the patches are going to fix the performance
>>> problems but I do think they are a step forward.
>>>
>> Hmm. I've got a slew of patches converting dm-mpath to use atomic_t
>> and bitops; with that we should be able to move to rcu for path
>> lookup and do away with most of the locking.
>> Quite raw, though; drop me a mail if you're interested.
> 
> Hmm, ok I just switched m->lock from spinlock_t to rwlock_t, see:
> https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.6&id=a5226e23a6958ac9b7ade13a983604c43d232c7d
> 
> So any patch you'd have in this area would need rebasing.  I'll gladly
> look at what you have (even if it isn't rebased).  So yes please share.
> 
> (it could be that there isn't a big enough win associated with switching
> to rwlock_t -- that we could get away without doing that particular
> churn.. open to that if you think rwlock_t pointless given we'll take
> the write lock after repeat_count drops to 0)
> 
Personally, I don't think switching to a rwlock_t will buy us
anything; for decent performance you have to set rr_min_io to 1
anyway, thereby defeating the purpose of the rwlock.

My thinking was rather a different direction:
Move the crucial bits of the multipath structure to atomics, and
split off the path selection code into one bit for selecting the
path within a path group, and another which switches the path groups.
When we do that we could use rcus for the paths themselves, and
would only need to take the spinlock if we need to switch path
groups. Which should be okay as switching path groups is
(potentially) a rather slow operation anyway.
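
(A minimal sketch of that direction -- an RCU-protected group pointer,
atomics for the hot counters, and the spinlock kept only for the slow
path-group switch. All identifiers below are illustrative; this is not
Hannes' actual code from the attached tarball:)

	struct example_multipath {
		spinlock_t lock;		/* slow path only: PG switch, reconfig */
		struct priority_group __rcu *current_pg;
		atomic_t nr_valid_paths;
	};

	/* fast path: no per-I/O spinlock, just RCU + atomics */
	static struct pgpath *example_choose_pgpath(struct example_multipath *m)
	{
		struct priority_group *pg;
		struct pgpath *pgpath = NULL;

		rcu_read_lock();
		pg = rcu_dereference(m->current_pg);
		if (pg && atomic_read(&m->nr_valid_paths) > 0)
			pgpath = example_select_from_pg(pg);	/* hypothetical helper */
		rcu_read_unlock();

		/* paths/groups live as long as the dm table, so the pointer
		 * remains usable outside the read-side critical section */
		return pgpath;
	}

	/* slow path: take m->lock, pick the new group, publish it with
	 * rcu_assign_pointer(m->current_pg, new_pg), synchronize_rcu(),
	 * then retire the old group. */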

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

[-- Attachment #2: dm-atomic.tar.gz --]
[-- Type: application/x-gzip, Size: 10584 bytes --]

[-- Attachment #3: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: dm-multipath low performance with blk-mq
@ 2016-02-04 14:44                                         ` Mike Snitzer
  0 siblings, 0 replies; 127+ messages in thread
From: Mike Snitzer @ 2016-02-04 14:44 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: axboe, Christoph Hellwig, Sagi Grimberg, linux-nvme, keith.busch,
	device-mapper development, linux-block, Bart Van Assche

On Thu, Feb 04 2016 at  9:32am -0500,
Hannes Reinecke <hare@suse.de> wrote:

> On 02/04/2016 03:09 PM, Mike Snitzer wrote:
> > On Thu, Feb 04 2016 at  8:58am -0500,
> > Hannes Reinecke <hare@suse.de> wrote:
> > 
> >> On 02/04/2016 02:54 PM, Mike Snitzer wrote:
> >>> On Thu, Feb 04 2016 at  1:54am -0500,
> >>> Hannes Reinecke <hare@suse.de> wrote:
> >>>
> >> [ .. ]
> >>>> But anyway, I'll be looking at your patches.
> >>>
> >>> Thanks, sadly none of the patches are going to fix the performance
> >>> problems but I do think they are a step forward.
> >>>
> >> Hmm. I've got a slew of patches converting dm-mpath to use atomic_t
> >> and bitops; with that we should be able to move to rcu for path
> >> lookup and do away with most of the locking.
> >> Quite raw, though; drop me a mail if you're interested.
> > 
> > Hmm, ok I just switched m->lock from spinlock_t to rwlock_t, see:
> > https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.6&id=a5226e23a6958ac9b7ade13a983604c43d232c7d
> > 
> > So any patch you'd have in this area would need rebasing.  I'll gladly
> > look at what you have (even if it isn't rebased).  So yes please share.
> > 
> > (it could be that there isn't a big enough win associated with switching
> > to rwlock_t -- that we could get away without doing that particular
> > churn.. open to that if you think rwlock_t pointless given we'll take
> > the write lock after repeat_count drops to 0)
> > 
> Personally, I don't think switching to a rwlock_t will buy us
> anything; for decent performance you have to set rr_min_io to 1
> anyway, thereby defeating the purpose of the rwlock.

OK, I'll drop the rwlock_t commit.
 
> My thinking was rather a different direction:
> Move the crucial bits of the multipath structure to atomics, and
> split off the path selection code into one bit for selecting the
> path within a path group, and another which switches the path groups.
> When we do that we could use rcus for the paths themselves, and
> would only need to take the spinlock if we need to switch path
> groups. Which should be okay as switching path groups is
> (potentially) a rather slow operation anyway.

If you could put some focus into dusting your work off and helping me
get switched over to RCU, I'd be very thankful.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* [RFC PATCH] dm: fix excessive dm-mq context switching
@ 2016-02-05 15:13                                   ` Mike Snitzer
  0 siblings, 0 replies; 127+ messages in thread
From: Mike Snitzer @ 2016-02-05 15:13 UTC (permalink / raw)
  To: axboe, Hannes Reinecke, Sagi Grimberg, Christoph Hellwig
  Cc: keith.busch, linux-block, device-mapper development, linux-nvme,
	Bart Van Assche

On Thu, Feb 04 2016 at  8:54P -0500,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Thu, Feb 04 2016 at  1:54am -0500,
> Hannes Reinecke <hare@suse.de> wrote:
> 
> > On 02/03/2016 07:24 PM, Mike Snitzer wrote:
> > > On Wed, Feb 03 2016 at  1:04pm -0500,
> > > Mike Snitzer <snitzer@redhat.com> wrote:
> > >  
> > >> I'm still not clear on where the considerable performance loss is coming
> > >> from (on null_blk device I see ~1900K read IOPs but I'm still only
> > >> seeing ~1000K read IOPs when blk-mq DM-multipath is layered ontop).
> > >> What is very much apparent is: layering dm-mq multipath ontop of null_blk
> > >> results in a HUGE amount of additional context switches.  I can only
> > >> infer that the request completion for this stacked device (blk-mq queue
> > >> ontop of blk-mq queue, with 2 completions: 1 for clone completing on
> > >> underlying device and 1 for original request completing) is the reason
> > >> for all the extra context switches.
> > > 
> > > Starts to explain, certainly not the "reason"; that is still very much
> > > TBD...
> > > 
> > >> Here are pictures of 'perf report' for perf data collected using
> > >> 'perf record -ag -e cs'.
> > >>
> > >> Against null_blk:
> > >> http://people.redhat.com/msnitzer/perf-report-cs-null_blk.png
> > > 
> > > if dm-mq nr_hw_queues=1 and null_blk nr_hw_queues=1
> > >   cpu          : usr=25.53%, sys=74.40%, ctx=1970, majf=0, minf=474
> > > if dm-mq nr_hw_queues=1 and null_blk nr_hw_queues=4
> > >   cpu          : usr=26.79%, sys=73.15%, ctx=2067, majf=0, minf=479
> > > 
> > >> Against dm-mpath ontop of the same null_blk:
> > >> http://people.redhat.com/msnitzer/perf-report-cs-dm_mq.png
> > > 
> > > if dm-mq nr_hw_queues=1 and null_blk nr_hw_queues=1
> > >   cpu          : usr=11.07%, sys=33.90%, ctx=667784, majf=0, minf=466
> > > if dm-mq nr_hw_queues=1 and null_blk nr_hw_queues=4
> > >   cpu          : usr=15.22%, sys=48.44%, ctx=2314901, majf=0, minf=466
> > > 
> > > So yeah, the percentages reflected in these respective images didn't do
> > > the huge increase in context switches justice... we _must_ figure out
> > > why we're seeing so many context switches with dm-mq.
> > > 
> > Well, the most obvious one being that you're using 1 dm-mq queue vs
> > 4 null_blk queues.
> > So you will have to do an additional context switch for 75% of
> > the total I/Os submitted.
> 
> Right, that case is certainly prone to more context switches.  But I'm
> initially most concerned about the case where both only have 1 queue.
> 
> > Have you tested with 4 dm-mq hw queues?
> 
> Yes, it makes performance worse.  This is likely rooted in dm-mpath IO
> path not being lockless.  But I also have concern about whether the
> clone, sent to the underlying path, is completing on a different cpu
> than dm-mq's original request.
> 
> I'll be using ftrace to try to dig into the various aspects of this
> (perf, as I know how to use it, isn't giving me enough precision in its
> reporting).
> 
> > To avoid context switches we would have to align the dm-mq queues to
> > the underlying blk-mq layout for the paths.
> 
> Right, we need to take more care (how remains TBD).  But for now I'm
> just going to focus on the case where both dm-mq and null_blk have 1 for
> nr_hw_queues.  As you can see even in that config the number of context
> switches goes from 1970 to 667784 (and there is a huge loss of system
> cpu utilization) once dm-mq w/ 1 hw_queue is stacked ontop on the
> null_blk device.
> 
> Once we understand the source of all the additional context switching
> for this more simplistic stacked configuration we can look closer at
> scaling as we add more underlying paths.

Following is RFC because it really speaks to dm-mq _needing_ a variant
of blk_mq_complete_request() that supports partial completions.  Not
supporting partial completions really isn't an option for DM multipath.

From: Mike Snitzer <snitzer@redhat.com>
Date: Fri, 5 Feb 2016 08:49:01 -0500
Subject: [RFC PATCH] dm: fix excessive dm-mq context switching

Request-based DM's blk-mq support (dm-mq) was reported to be 50% slower
than if an underlying null_blk device were used directly.  The biggest
reason for this drop in performance is that blk_insert_clone_request()
was calling blk_mq_insert_request() with @async=true.  This forced the
use of kblockd_schedule_delayed_work_on() to run the queues which
ushered in ping-ponging between process context (fio in this case) and
kblockd's kworker to submit the cloned request.  The ftrace
function_graph tracer showed:

  kworker-2013  =>   fio-12190
  fio-12190    =>  kworker-2013
  ...
  kworker-2013  =>   fio-12190
  fio-12190    =>  kworker-2013
  ...

Fixing blk_mq_insert_request() to _not_ use kblockd to submit the cloned
requests isn't enough to eliminate the observed context switches.

In addition to this dm-mq specific blk-core fix, there were 2 DM core
fixes to dm-mq that (when paired with the blk-core fix) completely
eliminate the observed context switching:

1)  don't blk_mq_run_hw_queues in blk-mq request completion

    Motivated by desire to reduce overhead of dm-mq, punting to kblockd
    just increases context switches.

    In my testing against a really fast null_blk device there was no benefit
    to running blk_mq_run_hw_queues() on completion (and no other blk-mq
    driver does this).  So hopefully this change doesn't induce the need for
    yet another revert like commit 621739b00e16ca2d !

2)  use blk_mq_complete_request() in dm_complete_request()

    blk_complete_request() doesn't offer the traditional q->mq_ops vs
    .request_fn branching pattern that other historic block interfaces
    do (e.g. blk_get_request).  Using blk_mq_complete_request() for
    blk-mq requests is important for performance but it doesn't handle
    partial completions -- which is a pretty big problem given the
    potential for partial completions with DM multipath due to path
    failure(s).  As such this makes this entire patch only RFC-worthy.

dm-mq "fix" #2 is _much_ more important than #1 for eliminating the
excessive context switches.
Before: cpu          : usr=15.10%, sys=59.39%, ctx=7905181, majf=0, minf=475
After:  cpu          : usr=20.60%, sys=79.35%, ctx=2008, majf=0, minf=472

With these changes the multithreaded async read IOPs improved from ~950K
to ~1350K for this dm-mq stacked on null_blk test-case.  The raw read
IOPs of the underlying null_blk device for the same workload is ~1950K.

Reported-by: Sagi Grimberg <sagig@dev.mellanox.co.il>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 block/blk-core.c |  2 +-
 drivers/md/dm.c  | 13 ++++++-------
 2 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index ab51685..c60e233 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2198,7 +2198,7 @@ int blk_insert_cloned_request(struct request_queue *q, struct request *rq)
 	if (q->mq_ops) {
 		if (blk_queue_io_stat(q))
 			blk_account_io_start(rq, true);
-		blk_mq_insert_request(rq, false, true, true);
+		blk_mq_insert_request(rq, false, true, false);
 		return 0;
 	}
 
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index c683f6d..a618477 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1119,12 +1119,8 @@ static void rq_completed(struct mapped_device *md, int rw, bool run_queue)
 	 * back into ->request_fn() could deadlock attempting to grab the
 	 * queue lock again.
 	 */
-	if (run_queue) {
-		if (md->queue->mq_ops)
-			blk_mq_run_hw_queues(md->queue, true);
-		else
-			blk_run_queue_async(md->queue);
-	}
+	if (!md->queue->mq_ops && run_queue)
+		blk_mq_run_hw_queues(md->queue, true);
 
 	/*
 	 * dm_put() must be at the end of this function. See the comment above
@@ -1344,7 +1340,10 @@ static void dm_complete_request(struct request *rq, int error)
 	struct dm_rq_target_io *tio = tio_from_request(rq);
 
 	tio->error = error;
-	blk_complete_request(rq);
+	if (!rq->q->mq_ops)
+		blk_complete_request(rq);
+	else
+		blk_mq_complete_request(rq, rq->errors);
 }
 
 /*
-- 
2.5.4 (Apple Git-61)

^ permalink raw reply related	[flat|nested] 127+ messages in thread

* Re: [RFC PATCH] dm: fix excessive dm-mq context switching
@ 2016-02-05 18:05                                     ` Mike Snitzer
  0 siblings, 0 replies; 127+ messages in thread
From: Mike Snitzer @ 2016-02-05 18:05 UTC (permalink / raw)
  To: axboe, Hannes Reinecke, Sagi Grimberg, Christoph Hellwig
  Cc: keith.busch, linux-block, device-mapper development, linux-nvme,
	Bart Van Assche

On Fri, Feb 05 2016 at 10:13am -0500,
Mike Snitzer <snitzer@redhat.com> wrote:
 
> Following is RFC because it really speaks to dm-mq _needing_ a variant
> of blk_mq_complete_request() that supports partial completions.  Not
> supporting partial completions really isn't an option for DM multipath.
> 
> From: Mike Snitzer <snitzer@redhat.com>
> Date: Fri, 5 Feb 2016 08:49:01 -0500
> Subject: [RFC PATCH] dm: fix excessive dm-mq context switching
> 
> Request-based DM's blk-mq support (dm-mq) was reported to be 50% slower
> than if an underlying null_blk device were used directly.  The biggest
> reason for this drop in performance is that blk_insert_clone_request()
> was calling blk_mq_insert_request() with @async=true.  This forced the
> use of kblockd_schedule_delayed_work_on() to run the queues which
> ushered in ping-ponging between process context (fio in this case) and
> kblockd's kworker to submit the cloned request.  The ftrace
> function_graph tracer showed:
> 
>   kworker-2013  =>   fio-12190
>   fio-12190    =>  kworker-2013
>   ...
>   kworker-2013  =>   fio-12190
>   fio-12190    =>  kworker-2013
>   ...
> 
> Fixing blk_mq_insert_request() to _not_ use kblockd to submit the cloned
> requests isn't enough to eliminate the observed context switches.
> 
> In addition to this dm-mq specific blk-core fix, there were 2 DM core
> fixes to dm-mq that (when paired with the blk-core fix) completely
> eliminate the observed context switching:
> 
> 1)  don't blk_mq_run_hw_queues in blk-mq request completion
> 
>     Motivated by desire to reduce overhead of dm-mq, punting to kblockd
>     just increases context switches.
> 
>     In my testing against a really fast null_blk device there was no benefit
>     to running blk_mq_run_hw_queues() on completion (and no other blk-mq
>     driver does this).  So hopefully this change doesn't induce the need for
>     yet another revert like commit 621739b00e16ca2d !
> 
> 2)  use blk_mq_complete_request() in dm_complete_request()
> 
>     blk_complete_request() doesn't offer the traditional q->mq_ops vs
>     .request_fn branching pattern that other historic block interfaces
>     do (e.g. blk_get_request).  Using blk_mq_complete_request() for
>     blk-mq requests is important for performance but it doesn't handle
>     partial completions -- which is a pretty big problem given the
>     potential for partial completions with DM multipath due to path
>     failure(s).  As such this makes this entire patch only RFC-worthy.

> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index c683f6d..a618477 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1344,7 +1340,10 @@ static void dm_complete_request(struct request *rq, int error)
>  	struct dm_rq_target_io *tio = tio_from_request(rq);
>  
>  	tio->error = error;
> -	blk_complete_request(rq);
> +	if (!rq->q->mq_ops)
> +		blk_complete_request(rq);
> +	else
> +		blk_mq_complete_request(rq, rq->errors);
>  }
>  
>  /*

Looking closer, DM is very likely OK just using blk_mq_complete_request.

blk_complete_request() also doesn't provide native partial completion
support (it relies on the driver to do it, which DM core does):

/**
 * blk_complete_request - end I/O on a request
 * @req:      the request being processed
 *
 * Description:
 *     Ends all I/O on a request. It does not handle partial completions,
 *     unless the driver actually implements this in its completion callback
 *     through requeueing. The actual completion happens out-of-order,
 *     through a softirq handler. The user must have registered a completion
 *     callback through blk_queue_softirq_done().
 **/

blk_mq_complete_request() is effectively implemented in a comparable
fashion to blk_complete_request().  And DM core is providing the partial
completion support itself: dm.c:end_clone_bio() triggers requeueing of
the request via dm-mpath.c:multipath_end_io()'s return of
DM_ENDIO_REQUEUE.
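
(For illustration only -- roughly the shape of that requeue decision in a
request-based target's end_io hook; example_is_path_error() is a made-up
helper, and this is not the actual dm-mpath code:)

	static int example_end_io(struct dm_target *ti, struct request *clone,
				  int error, union map_info *map_context)
	{
		if (error && example_is_path_error(error))
			return DM_ENDIO_REQUEUE;	/* retry the I/O on another path */

		return error;	/* complete the original request with this result */
	}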

So I'm thinking I can drop the "RFC" for this patch and run with
it.. once I get Jens' feedback (hopefully) confirming my understanding.

Jens, please advise.  If you're comfortable providing your Acked-by I
can get this fix in for 4.5-rc4 or so...

Thanks!

Mike

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [RFC PATCH] dm: fix excessive dm-mq context switching
@ 2016-02-05 19:19                                       ` Mike Snitzer
  0 siblings, 0 replies; 127+ messages in thread
From: Mike Snitzer @ 2016-02-05 19:19 UTC (permalink / raw)
  To: axboe, Hannes Reinecke, Sagi Grimberg, Christoph Hellwig
  Cc: keith.busch, linux-block, device-mapper development, linux-nvme,
	Bart Van Assche

On Fri, Feb 05 2016 at  1:05pm -0500,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Fri, Feb 05 2016 at 10:13am -0500,
> Mike Snitzer <snitzer@redhat.com> wrote:
>  
> > Following is RFC because it really speaks to dm-mq _needing_ a variant
> > of blk_mq_complete_request() that supports partial completions.  Not
> > supporting partial completions really isn't an option for DM multipath.
> > 
> > From: Mike Snitzer <snitzer@redhat.com>
> > Date: Fri, 5 Feb 2016 08:49:01 -0500
> > Subject: [RFC PATCH] dm: fix excessive dm-mq context switching
> > 
> > Request-based DM's blk-mq support (dm-mq) was reported to be 50% slower
> > than if an underlying null_blk device were used directly.  The biggest
> > reason for this drop in performance is that blk_insert_clone_request()
> > was calling blk_mq_insert_request() with @async=true.  This forced the
> > use of kblockd_schedule_delayed_work_on() to run the queues which
> > ushered in ping-ponging between process context (fio in this case) and
> > kblockd's kworker to submit the cloned request.  The ftrace
> > function_graph tracer showed:
> > 
> >   kworker-2013  =>   fio-12190
> >   fio-12190    =>  kworker-2013
> >   ...
> >   kworker-2013  =>   fio-12190
> >   fio-12190    =>  kworker-2013
> >   ...
> > 
> > Fixing blk_mq_insert_request() to _not_ use kblockd to submit the cloned
> > requests isn't enough to eliminate the observed context switches.
> > 
> > In addition to this dm-mq specific blk-core fix, there were 2 DM core
> > fixes to dm-mq that (when paired with the blk-core fix) completely
> > eliminate the observed context switching:
> > 
> > 1)  don't blk_mq_run_hw_queues in blk-mq request completion
> > 
> >     Motivated by desire to reduce overhead of dm-mq, punting to kblockd
> >     just increases context switches.
> > 
> >     In my testing against a really fast null_blk device there was no benefit
> >     to running blk_mq_run_hw_queues() on completion (and no other blk-mq
> >     driver does this).  So hopefully this change doesn't induce the need for
> >     yet another revert like commit 621739b00e16ca2d !
> > 
> > 2)  use blk_mq_complete_request() in dm_complete_request()
> > 
> >     blk_complete_request() doesn't offer the traditional q->mq_ops vs
> >     .request_fn branching pattern that other historic block interfaces
> >     do (e.g. blk_get_request).  Using blk_mq_complete_request() for
> >     blk-mq requests is important for performance but it doesn't handle
> >     partial completions -- which is a pretty big problem given the
> >     potential for partial completions with DM multipath due to path
> >     failure(s).  As such this makes this entire patch only RFC-worthy.
> 
> > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > index c683f6d..a618477 100644
> > --- a/drivers/md/dm.c
> > +++ b/drivers/md/dm.c
> > @@ -1344,7 +1340,10 @@ static void dm_complete_request(struct request *rq, int error)
> >  	struct dm_rq_target_io *tio = tio_from_request(rq);
> >  
> >  	tio->error = error;
> > -	blk_complete_request(rq);
> > +	if (!rq->q->mq_ops)
> > +		blk_complete_request(rq);
> > +	else
> > +		blk_mq_complete_request(rq, rq->errors);
> >  }
> >  
> >  /*
> 
> Looking closer, DM is very likely OK just using blk_mq_complete_request.
> 
> blk_complete_request() also doesn't provide native partial completion
> support (it relies on the driver to do it, which DM core does):
> 
> /**
>  * blk_complete_request - end I/O on a request
>  * @req:      the request being processed
>  *
>  * Description:
>  *     Ends all I/O on a request. It does not handle partial completions,
>  *     unless the driver actually implements this in its completion callback
>  *     through requeueing. The actual completion happens out-of-order,
>  *     through a softirq handler. The user must have registered a completion
>  *     callback through blk_queue_softirq_done().
>  **/
> 
> blk_mq_complete_request() is effectively implemented in a comparable
> fashion to blk_complete_request().  Given that DM core is providing
> partial completion support by dm.c:end_clone_bio() triggering requeueing
> of the request via dm-mpath.c:multipath_end_io()'s return of
> DM_ENDIO_REQUEUE.
> 
> So I'm thinking I can drop the "RFC" for this patch and run with
> it.. once I get Jens' feedback (hopefully) confirming my understanding.
> 
> Jens, please advise.  If you're comfortable providing your Acked-by I
> can get this fix in for 4.5-rc4 or so...

FYI, here is the latest revised patch:
https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.6&id=a5b835282422ec41991c1dbdb88daa4af7d166d2

(revised patch header and fixed a thinko in the dm.c:rq_completed()
change from the RFC patch I posted earlier)

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [RFC PATCH] dm: fix excessive dm-mq context switching
@ 2016-02-07 15:41                                         ` Sagi Grimberg
  0 siblings, 0 replies; 127+ messages in thread
From: Sagi Grimberg @ 2016-02-07 15:41 UTC (permalink / raw)
  To: Mike Snitzer, axboe, Hannes Reinecke, Christoph Hellwig
  Cc: keith.busch, linux-block, device-mapper development, linux-nvme,
	Bart Van Assche


> FYI, here is the latest revised patch:
> https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.6&id=a5b835282422ec41991c1dbdb88daa4af7d166d2
>
> (revised patch header and fixed a thinko in the dm.c:rq_completed()
> change from the RFC patch I posted earlier)

Hi Mike,

So I gave your patches a go (dm-4.6) but I still don't see the
improvement you reported (while I do see a minor improvement).

null_blk queue_mode=2 submit_queues=24
dm_mod blk_mq_nr_hw_queues=24 blk_mq_queue_depth=4096 use_blk_mq=Y

I see 620K IOPs on dm_mq vs. 1750K IOPs on raw nullb0.

Is there something I'm missing?
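
(A sketch of how such a dm-mq-over-null_blk stack is typically put together
for this kind of comparison; the table layout and fio options below are
illustrative, not taken from the mail:)

	modprobe null_blk queue_mode=2 submit_queues=24
	modprobe dm_mod use_blk_mq=Y blk_mq_nr_hw_queues=24 blk_mq_queue_depth=4096

	# single-path "multipath" mapping on top of /dev/nullb0
	size=$(blockdev --getsz /dev/nullb0)
	echo "0 $size multipath 0 0 1 1 round-robin 0 1 1 /dev/nullb0 1" | \
		dmsetup create dm_mq_test

	# same fio job against the raw device and the stacked device
	fio --name=raw   --filename=/dev/nullb0 --direct=1 --rw=randread --bs=4k \
	    --ioengine=libaio --iodepth=32 --numjobs=24 --time_based --runtime=30 \
	    --group_reporting
	fio --name=dm_mq --filename=/dev/mapper/dm_mq_test --direct=1 --rw=randread \
	    --bs=4k --ioengine=libaio --iodepth=32 --numjobs=24 --time_based \
	    --runtime=30 --group_reporting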

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [RFC PATCH] dm: fix excessive dm-mq context switching
@ 2016-02-07 16:07                                           ` Mike Snitzer
  0 siblings, 0 replies; 127+ messages in thread
From: Mike Snitzer @ 2016-02-07 16:07 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: axboe, keith.busch, linux-nvme, Christoph Hellwig,
	device-mapper development, linux-block, Bart Van Assche

On Sun, Feb 07 2016 at 10:41am -0500,
Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:

> 
> >FYI, here is the latest revised patch:
> >https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.6&id=a5b835282422ec41991c1dbdb88daa4af7d166d2
> >
> >(revised patch header and fixed a thinko in the dm.c:rq_completed()
> >change from the RFC patch I posted earlier)
> 
> Hi Mike,
> 
> So I gave your patches a go (dm-4.6) but I still don't see the
> improvement you reported (while I do see a minor improvement).
> 
> null_blk queue_mode=2 submit_queues=24
> dm_mod blk_mq_nr_hw_queues=24 blk_mq_queue_depth=4096 use_blk_mq=Y
> 
> I see 620K IOPs on dm_mq vs. 1750K IOPs on raw nullb0.

blk_mq_nr_hw_queues=24 isn't likely to help you (but with these patches,
the first being the most important, it shouldn't hurt either provided
you have 24 cpus).

Could be you have multiple NUMA nodes and are seeing problems from that?

I have 12 cpus (in the same physical cpu) and only a single NUMA node.
I get the same results as blk_mq_nr_hw_queues=12 with
blk_mq_nr_hw_queues=4 (same goes for null_blk submit_queues).
I've seen my IOPs go from ~950K to ~1400K.  The peak null_blk can get on
my setup is ~1950K.  So I'm still seeing a ~25% drop with dm-mq (but
that is much better than the over 50% drop I was seeing).

> Is there something I'm missing?

Not sure, I just emailed out all my patches (and cc'd you).  Please
verify you're using the latest here (same as 'dm-4.6' branch):
https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=for-next

I rebased a couple times... so please diff what you have tested against
this latest 'dm-4.6' branch.
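
(One way to do that diff, assuming the standard kernel.org clone URL for the
linux-dm tree:)

	git remote add dm https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git
	git fetch dm
	git diff dm/dm-4.6	# empty output means you are testing the same code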

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [RFC PATCH] dm: fix excessive dm-mq context switching
@ 2016-02-07 16:37                                           ` Bart Van Assche
  0 siblings, 0 replies; 127+ messages in thread
From: Bart Van Assche @ 2016-02-07 16:37 UTC (permalink / raw)
  To: Sagi Grimberg, Mike Snitzer, axboe, Hannes Reinecke, Christoph Hellwig
  Cc: keith.busch, linux-block, device-mapper development, linux-nvme

On 02/07/16 07:41, Sagi Grimberg wrote:
>
>> FYI, here is the latest revised patch:
>> https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.6&id=a5b835282422ec41991c1dbdb88daa4af7d166d2
>>
>> (revised patch header and fixed a thinko in the dm.c:rq_completed()
>> change from the RFC patch I posted earlier)
>
> Hi Mike,
>
> So I gave your patches a go (dm-4.6) but I still don't see the
> improvement you reported (while I do see a minor improvement).
>
> null_blk queue_mode=2 submit_queues=24
> dm_mod blk_mq_nr_hw_queues=24 blk_mq_queue_depth=4096 use_blk_mq=Y
>
> I see 620K IOPs on dm_mq vs. 1750K IOPs on raw nullb0.
>
> Is there something I'm missing?

Hello Sagi,

Did you run your test on a NUMA system ? If so, can you check with e.g. 
perf record -ags -e LLC-load-misses sleep 10 && perf report whether this 
workload triggers perhaps lock contention ? What you need to look for in 
the perf output is whether any functions occupy more than 10% CPU time.

Bart.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* [RFC PATCH] dm: fix excessive dm-mq context switching
  2016-02-07 16:07                                           ` Mike Snitzer
@ 2016-02-07 16:42                                             ` Sagi Grimberg
  -1 siblings, 0 replies; 127+ messages in thread
From: Sagi Grimberg @ 2016-02-07 16:42 UTC (permalink / raw)



>> Hi Mike,
>>
>> So I gave your patches a go (dm-4.6) but I still don't see the
>> improvement you reported (while I do see a minor improvement).
>>
>> null_blk queue_mode=2 submit_queues=24
>> dm_mod blk_mq_nr_hw_queues=24 blk_mq_queue_depth=4096 use_blk_mq=Y
>>
>> I see 620K IOPs on dm_mq vs. 1750K IOPs on raw nullb0.
>
> blk_mq_nr_hw_queues=24 isn't likely to help you (but with these patches,
> the first being the most important, it shouldn't hurt either provided
> you have 24 cpus).

I tried with fewer, but as you said, it didn't have an impact...

> Could be you have multiple NUMA nodes and are seeing problems from that?

I am running on a dual socket server, so this is most likely the
culprit...

> I have 12 cpus (in the same physical cpu) and only a single NUMA node.
> I get the same results as blk_mq_nr_hw_queues=12 with
> blk_mq_nr_hw_queues=4 (same goes for null_blk submit_queues).
> I've seen my IOPs go from ~950K to ~1400K.  The peak null_blk can get on
> my setup is ~1950K.  So I'm still seeing a ~25% drop with dm-mq (but
> that is much better than the over 50% drop I was seeing).

That's what I was planning on :(

>> Is there something I'm missing?
>
> Not sure, I just emailed out all my patches (and cc'd you).  Please
> verify you're using the latest here (same as 'dm-4.6' branch):
> https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=for-next
>
> I rebased a couple times... so please diff what you have tested against
> this latest 'dm-4.6' branch.

I am. I'll try to instrument what's going on...

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [RFC PATCH] dm: fix excessive dm-mq context switching
@ 2016-02-07 16:42                                             ` Sagi Grimberg
  0 siblings, 0 replies; 127+ messages in thread
From: Sagi Grimberg @ 2016-02-07 16:42 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: axboe, keith.busch, linux-nvme, Christoph Hellwig,
	device-mapper development, linux-block, Bart Van Assche


>> Hi Mike,
>>
>> So I gave your patches a go (dm-4.6) but I still don't see the
>> improvement you reported (while I do see a minor improvement).
>>
>> null_blk queue_mode=2 submit_queues=24
>> dm_mod blk_mq_nr_hw_queues=24 blk_mq_queue_depth=4096 use_blk_mq=Y
>>
>> I see 620K IOPs on dm_mq vs. 1750K IOPs on raw nullb0.
>
> blk_mq_nr_hw_queues=24 isn't likely to help you (but with these patches,
> the first being the most important, it shouldn't hurt either provided
> you have 24 cpus).

I tried with fewer, but as you said, it didn't have an impact...

> Could be you have multiple NUMA nodes and are seeing problems from that?

I am running on a dual socket server, so this is most likely the
culprit...

> I have 12 cpus (in the same physical cpu) and only a single NUMA node.
> I get the same results as blk_mq_nr_hw_queues=12 with
> blk_mq_nr_hw_queues=4 (same goes for null_blk submit_queues).
> I've seen my IOPs go from ~950K to ~1400K.  The peak null_blk can get on
> my setup is ~1950K.  So I'm still seeing a ~25% drop with dm-mq (but
> that is much better than the over 50% drop I was seeing).

That's what I was planning on :(

>> Is there something I'm missing?
>
> Not sure, I just emailed out all my patches (and cc'd you).  Please
> verify you're using the latest here (same as 'dm-4.6' branch):
> https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=for-next
>
> I rebased a couple times... so please diff what you have tested against
> this latest 'dm-4.6' branch.

I am. I'll try to instrument what's going on...

^ permalink raw reply	[flat|nested] 127+ messages in thread

* [RFC PATCH] dm: fix excessive dm-mq context switching
  2016-02-07 16:37                                           ` Bart Van Assche
@ 2016-02-07 16:43                                             ` Sagi Grimberg
  -1 siblings, 0 replies; 127+ messages in thread
From: Sagi Grimberg @ 2016-02-07 16:43 UTC (permalink / raw)



> Hello Sagi,

Hey Bart,

> Did you run your test on a NUMA system ?

I did.

> If so, can you check with e.g.
> perf record -ags -e LLC-load-misses sleep 10 && perf report whether this
> workload triggers perhaps lock contention ? What you need to look for in
> the perf output is whether any functions occupy more than 10% CPU time.

I will, thanks for the tip!

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [RFC PATCH] dm: fix excessive dm-mq context switching
@ 2016-02-07 16:43                                             ` Sagi Grimberg
  0 siblings, 0 replies; 127+ messages in thread
From: Sagi Grimberg @ 2016-02-07 16:43 UTC (permalink / raw)
  To: Bart Van Assche, Mike Snitzer, axboe, Hannes Reinecke, Christoph Hellwig
  Cc: keith.busch, linux-block, device-mapper development, linux-nvme


> Hello Sagi,

Hey Bart,

> Did you run your test on a NUMA system ?

I did.

> If so, can you check with e.g.
> perf record -ags -e LLC-load-misses sleep 10 && perf report whether this
> workload triggers perhaps lock contention ? What you need to look for in
> the perf output is whether any functions occupy more than 10% CPU time.

I will, thanks for the tip!

^ permalink raw reply	[flat|nested] 127+ messages in thread

* [RFC PATCH] dm: fix excessive dm-mq context switching
  2016-02-07 16:43                                             ` Sagi Grimberg
@ 2016-02-07 16:53                                               ` Mike Snitzer
  -1 siblings, 0 replies; 127+ messages in thread
From: Mike Snitzer @ 2016-02-07 16:53 UTC (permalink / raw)


On Sun, Feb 07 2016 at 11:43am -0500,
Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:

> 
> >Hello Sagi,
> 
> Hey Bart,
> 
> >Did you run your test on a NUMA system ?
> 
> I did.
> 
> >If so, can you check with e.g.
> >perf record -ags -e LLC-load-misses sleep 10 && perf report whether this
> >workload triggers perhaps lock contention ? What you need to look for in
> >the perf output is whether any functions occupy more than 10% CPU time.
> 
> I will, thanks for the tip!

Also, I found ftrace's function_graph tracer very helpful (it is how I
found the various issues fixed by the first context switch patch).  Here
is my latest script:

#!/bin/sh

set -xv

NULL_BLK_HW_QUEUES=4
NULL_BLK_QUEUE_DEPTH=4096

DM_MQ_HW_QUEUES=4
DM_MQ_QUEUE_DEPTH=2048

FIO=/root/snitm/git/fio/fio
FIO_QUEUE_DEPTH=32
FIO_RUNTIME=10
FIO_NUMJOBS=12

PERF=perf
#PERF=/root/snitm/git/linux/tools/perf/perf

run_fio() {
    DEVICE=$1
    TASK_NAME=$(basename ${DEVICE})
    PERF_RECORD=$2
    RUN_CMD="${FIO} --cpus_allowed_policy=split --group_reporting --rw=randread --bs=4k --numjobs=${FIO_NUMJOBS} \
              --iodepth=${FIO_QUEUE_DEPTH} --runtime=${FIO_RUNTIME} --time_based --loops=1 --ioengine=libaio \
              --direct=1 --invalidate=1 --randrepeat=1 --norandommap --exitall --name task_${TASK_NAME} --filename=${DEVICE}"

    if [ ! -z "${PERF_RECORD}" ]; then
    ${PERF_RECORD} ${RUN_CMD}
    mv perf.data perf.data.${TASK_NAME}
    else
    ${RUN_CMD}
    fi
}

run_fio_with_ftrace() {
    DEVICE=$1
    TASK_NAME=$(basename ${DEVICE})

    echo > /sys/kernel/debug/tracing/trace
    echo 0 > /sys/kernel/debug/tracing/tracing_on
    echo function_graph > /sys/kernel/debug/tracing/current_tracer
    echo 1 > /sys/kernel/debug/tracing/tracing_on
    run_fio $DEVICE
    echo 0 > /sys/kernel/debug/tracing/tracing_on
    cat /sys/kernel/debug/tracing/trace > trace.${TASK_NAME}
    echo nop > /sys/kernel/debug/tracing/current_tracer
}

dmsetup remove dm_mq
modprobe -r null_blk

modprobe null_blk gb=4 bs=512 hw_queue_depth=${NULL_BLK_QUEUE_DEPTH} nr_devices=1 queue_mode=2 irqmode=1 completion_nsec=1 submit_queues=${NULL_BLK_HW_QUEUES}
#run_fio /dev/nullb0 "${PERF} record -ag -e cs"
#run_fio /dev/nullb0 "${PERF} stat"

echo Y > /sys/module/dm_mod/parameters/use_blk_mq
echo ${DM_MQ_QUEUE_DEPTH} > /sys/module/dm_mod/parameters/blk_mq_queue_depth
echo ${DM_MQ_HW_QUEUES} > /sys/module/dm_mod/parameters/blk_mq_nr_hw_queues
echo "0 8388608 multipath 0 0 1 1 service-time 0 1 2 /dev/nullb0 1000 1" | dmsetup create dm_mq
#echo "0 8388608 linear /dev/nullb0 0" | dmsetup create dm_mq

run_fio_with_ftrace /dev/mapper/dm_mq

#run_fio /dev/mapper/dm_mq
#run_fio /dev/mapper/dm_mq "${PERF} record -ag -e cs"
#run_fio /dev/mapper/dm_mq "${PERF} record -ag"
#run_fio /dev/mapper/dm_mq "${PERF} stat"

#run_fio /dev/mapper/dm_mq "trace-cmd record -e all"
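
# A rough sketch for post-processing the resulting trace.dm_mq (assumes the
# default function_graph layout, with the duration column before the '|'
# separator); prints the slowest entries first:
#
# awk -F'|' '/ us / { gsub(/us.*/, "", $1); n = split($1, a, " "); print a[n], $2 }' trace.dm_mq | sort -rn | head -20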

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [RFC PATCH] dm: fix excessive dm-mq context switching
@ 2016-02-07 16:53                                               ` Mike Snitzer
  0 siblings, 0 replies; 127+ messages in thread
From: Mike Snitzer @ 2016-02-07 16:53 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: axboe, Christoph Hellwig, linux-nvme, keith.busch,
	device-mapper development, linux-block, Bart Van Assche

On Sun, Feb 07 2016 at 11:43am -0500,
Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:

> 
> >Hello Sagi,
> 
> Hey Bart,
> 
> >Did you run your test on a NUMA system ?
> 
> I did.
> 
> >If so, can you check with e.g.
> >perf record -ags -e LLC-load-misses sleep 10 && perf report whether this
> >workload triggers perhaps lock contention ? What you need to look for in
> >the perf output is whether any functions occupy more than 10% CPU time.
> 
> I will, thanks for the tip!

Also, I found ftrace's function_graph tracer very helpful (it is how I
found the various issues fixed by the first context switch patch).  Here
is my latest script:

#!/bin/sh

set -xv

NULL_BLK_HW_QUEUES=4
NULL_BLK_QUEUE_DEPTH=4096

DM_MQ_HW_QUEUES=4
DM_MQ_QUEUE_DEPTH=2048

FIO=/root/snitm/git/fio/fio
FIO_QUEUE_DEPTH=32
FIO_RUNTIME=10
FIO_NUMJOBS=12

PERF=perf
#PERF=/root/snitm/git/linux/tools/perf/perf

run_fio() {
    DEVICE=$1
    TASK_NAME=$(basename ${DEVICE})
    PERF_RECORD=$2
    RUN_CMD="${FIO} --cpus_allowed_policy=split --group_reporting --rw=randread --bs=4k --numjobs=${FIO_NUMJOBS} \
              --iodepth=${FIO_QUEUE_DEPTH} --runtime=${FIO_RUNTIME} --time_based --loops=1 --ioengine=libaio \
              --direct=1 --invalidate=1 --randrepeat=1 --norandommap --exitall --name task_${TASK_NAME} --filename=${DEVICE}"

    if [ ! -z "${PERF_RECORD}" ]; then
    ${PERF_RECORD} ${RUN_CMD}
    mv perf.data perf.data.${TASK_NAME}
    else
    ${RUN_CMD}
    fi
}

run_fio_with_ftrace() {
    DEVICE=$1
    TASK_NAME=$(basename ${DEVICE})

    echo > /sys/kernel/debug/tracing/trace
    echo 0 > /sys/kernel/debug/tracing/tracing_on
    echo function_graph > /sys/kernel/debug/tracing/current_tracer
    echo 1 > /sys/kernel/debug/tracing/tracing_on
    run_fio $DEVICE
    echo 0 > /sys/kernel/debug/tracing/tracing_on
    cat /sys/kernel/debug/tracing/trace > trace.${TASK_NAME}
    echo nop > /sys/kernel/debug/tracing/current_tracer
}

dmsetup remove dm_mq
modprobe -r null_blk

modprobe null_blk gb=4 bs=512 hw_queue_depth=${NULL_BLK_QUEUE_DEPTH} nr_devices=1 queue_mode=2 irqmode=1 completion_nsec=1 submit_queues=${NULL_BLK_HW_QUEUES}
#run_fio /dev/nullb0 "${PERF} record -ag -e cs"
#run_fio /dev/nullb0 "${PERF} stat"

echo Y > /sys/module/dm_mod/parameters/use_blk_mq
echo ${DM_MQ_QUEUE_DEPTH} > /sys/module/dm_mod/parameters/blk_mq_queue_depth
echo ${DM_MQ_HW_QUEUES} > /sys/module/dm_mod/parameters/blk_mq_nr_hw_queues
echo "0 8388608 multipath 0 0 1 1 service-time 0 1 2 /dev/nullb0 1000 1" | dmsetup create dm_mq
#echo "0 8388608 linear /dev/nullb0 0" | dmsetup create dm_mq

run_fio_with_ftrace /dev/mapper/dm_mq

#run_fio /dev/mapper/dm_mq
#run_fio /dev/mapper/dm_mq "${PERF} record -ag -e cs"
#run_fio /dev/mapper/dm_mq "${PERF} record -ag"
#run_fio /dev/mapper/dm_mq "${PERF} stat"

#run_fio /dev/mapper/dm_mq "trace-cmd record -e all"

^ permalink raw reply	[flat|nested] 127+ messages in thread

* [RFC PATCH] dm: fix excessive dm-mq context switching
  2016-02-07 16:43                                             ` Sagi Grimberg
@ 2016-02-07 16:54                                               ` Sagi Grimberg
  -1 siblings, 0 replies; 127+ messages in thread
From: Sagi Grimberg @ 2016-02-07 16:54 UTC (permalink / raw)



>> If so, can you check with e.g.
>> perf record -ags -e LLC-load-misses sleep 10 && perf report whether this
>> workload triggers perhaps lock contention ? What you need to look for in
>> the perf output is whether any functions occupy more than 10% CPU time.
>
> I will, thanks for the tip!

The perf report is very similar to the one that started this effort..

I'm afraid we'll need to resolve the per-target m->lock in order
to scale with NUMA...

-  17.33%              fio  [kernel.kallsyms]        [k] queued_spin_lock_slowpath
    - queued_spin_lock_slowpath
       - 52.09% _raw_spin_lock_irq
            __multipath_map
            multipath_clone_and_map
            map_request
            dm_mq_queue_rq
            __blk_mq_run_hw_queue
            blk_mq_run_hw_queue
            blk_mq_insert_requests
            blk_mq_flush_plug_list
            blk_flush_plug_list
            blk_finish_plug
            do_io_submit
            SyS_io_submit
            entry_SYSCALL_64_fastpath
          + io_submit
       - 46.87% _raw_spin_lock_irqsave
          - 99.97% multipath_busy
               dm_mq_queue_rq
               __blk_mq_run_hw_queue
               blk_mq_run_hw_queue
               blk_mq_insert_requests
               blk_mq_flush_plug_list
               blk_flush_plug_list
               blk_finish_plug
               do_io_submit
               SyS_io_submit
               entry_SYSCALL_64_fastpath
             + io_submit
+   4.99%              fio  [kernel.kallsyms]        [k] blk_account_io_start
+   3.93%              fio  [dm_multipath]           [k] __multipath_map
+   2.64%              fio  [dm_multipath]           [k] multipath_busy
+   2.38%              fio  [kernel.kallsyms]        [k] _raw_spin_lock_irqsave
+   2.31%              fio  [dm_mod]                 [k] dm_mq_queue_rq
+   2.25%              fio  [kernel.kallsyms]        [k] blk_mq_hctx_mark_pending
+   1.81%              fio  [kernel.kallsyms]        [k] blk_queue_enter
+   1.61%             perf  [kernel.kallsyms]        [k] copy_user_generic_string
+   1.40%              fio  [kernel.kallsyms]        [k] __blk_mq_run_hw_queue
+   1.26%              fio  [kernel.kallsyms]        [k] part_round_stats
+   1.14%              fio  [kernel.kallsyms]        [k] _raw_spin_lock_irq
+   0.96%              fio  [kernel.kallsyms]        [k] __bt_get
+   0.73%              fio  [kernel.kallsyms]        [k] enqueue_task_fair
+   0.71%              fio  [kernel.kallsyms]        [k] enqueue_entity
+   0.69%              fio  [dm_mod]                 [k] dm_start_request
+   0.60%      ksoftirqd/6  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   0.59%     ksoftirqd/10  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   0.59%              fio  [kernel.kallsyms]        [k] _raw_spin_unlock_irqrestore
+   0.58%     ksoftirqd/19  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   0.58%     ksoftirqd/18  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   0.58%     ksoftirqd/23  [kernel.kallsyms]        [k] blk_mq_run_hw_queues

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [RFC PATCH] dm: fix excessive dm-mq context switching
@ 2016-02-07 16:54                                               ` Sagi Grimberg
  0 siblings, 0 replies; 127+ messages in thread
From: Sagi Grimberg @ 2016-02-07 16:54 UTC (permalink / raw)
  To: Bart Van Assche, Mike Snitzer, axboe, Hannes Reinecke, Christoph Hellwig
  Cc: keith.busch, linux-block, device-mapper development, linux-nvme


>> If so, can you check with e.g.
>> perf record -ags -e LLC-load-misses sleep 10 && perf report whether this
>> workload triggers perhaps lock contention ? What you need to look for in
>> the perf output is whether any functions occupy more than 10% CPU time.
>
> I will, thanks for the tip!

The perf report is very similar to the one that started this effort..

I'm afraid we'll need to resolve the per-target m->lock in order
to scale with NUMA...

-  17.33%              fio  [kernel.kallsyms]        [k] queued_spin_lock_slowpath
    - queued_spin_lock_slowpath
       - 52.09% _raw_spin_lock_irq
            __multipath_map
            multipath_clone_and_map
            map_request
            dm_mq_queue_rq
            __blk_mq_run_hw_queue
            blk_mq_run_hw_queue
            blk_mq_insert_requests
            blk_mq_flush_plug_list
            blk_flush_plug_list
            blk_finish_plug
            do_io_submit
            SyS_io_submit
            entry_SYSCALL_64_fastpath
          + io_submit
       - 46.87% _raw_spin_lock_irqsave
          - 99.97% multipath_busy
               dm_mq_queue_rq
               __blk_mq_run_hw_queue
               blk_mq_run_hw_queue
               blk_mq_insert_requests
               blk_mq_flush_plug_list
               blk_flush_plug_list
               blk_finish_plug
               do_io_submit
               SyS_io_submit
               entry_SYSCALL_64_fastpath
             + io_submit
+   4.99%              fio  [kernel.kallsyms]        [k] blk_account_io_start
+   3.93%              fio  [dm_multipath]           [k] __multipath_map
+   2.64%              fio  [dm_multipath]           [k] multipath_busy
+   2.38%              fio  [kernel.kallsyms]        [k] _raw_spin_lock_irqsave
+   2.31%              fio  [dm_mod]                 [k] dm_mq_queue_rq
+   2.25%              fio  [kernel.kallsyms]        [k] blk_mq_hctx_mark_pending
+   1.81%              fio  [kernel.kallsyms]        [k] blk_queue_enter
+   1.61%             perf  [kernel.kallsyms]        [k] copy_user_generic_string
+   1.40%              fio  [kernel.kallsyms]        [k] __blk_mq_run_hw_queue
+   1.26%              fio  [kernel.kallsyms]        [k] part_round_stats
+   1.14%              fio  [kernel.kallsyms]        [k] _raw_spin_lock_irq
+   0.96%              fio  [kernel.kallsyms]        [k] __bt_get
+   0.73%              fio  [kernel.kallsyms]        [k] enqueue_task_fair
+   0.71%              fio  [kernel.kallsyms]        [k] enqueue_entity
+   0.69%              fio  [dm_mod]                 [k] dm_start_request
+   0.60%      ksoftirqd/6  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   0.59%     ksoftirqd/10  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   0.59%              fio  [kernel.kallsyms]        [k] _raw_spin_unlock_irqrestore
+   0.58%     ksoftirqd/19  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   0.58%     ksoftirqd/18  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   0.58%     ksoftirqd/23  [kernel.kallsyms]        [k] blk_mq_run_hw_queues

^ permalink raw reply	[flat|nested] 127+ messages in thread

* [RFC PATCH] dm: fix excessive dm-mq context switching
  2016-02-07 16:54                                               ` Sagi Grimberg
@ 2016-02-07 17:20                                                 ` Mike Snitzer
  -1 siblings, 0 replies; 127+ messages in thread
From: Mike Snitzer @ 2016-02-07 17:20 UTC (permalink / raw)


On Sun, Feb 07 2016 at 11:54am -0500,
Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:

> 
> >>If so, can you check with e.g.
> >>perf record -ags -e LLC-load-misses sleep 10 && perf report whether this
> >>workload triggers perhaps lock contention ? What you need to look for in
> >>the perf output is whether any functions occupy more than 10% CPU time.
> >
> >I will, thanks for the tip!
> 
> The perf report is very similar to the one that started this effort..
> 
> I'm afraid we'll need to resolve the per-target m->lock in order
> to scale with NUMA...

Could be.  Just for testing, you can try the 2 topmost commits I've put
here (once applied both __multipath_map and multipath_busy won't have
_any_ locking.. again, very much test-only):

http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel2

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [RFC PATCH] dm: fix excessive dm-mq context switching
@ 2016-02-07 17:20                                                 ` Mike Snitzer
  0 siblings, 0 replies; 127+ messages in thread
From: Mike Snitzer @ 2016-02-07 17:20 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: axboe, Christoph Hellwig, linux-nvme, keith.busch,
	device-mapper development, linux-block, Bart Van Assche

On Sun, Feb 07 2016 at 11:54am -0500,
Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:

> 
> >>If so, can you check with e.g.
> >>perf record -ags -e LLC-load-misses sleep 10 && perf report whether this
> >>workload triggers perhaps lock contention ? What you need to look for in
> >>the perf output is whether any functions occupy more than 10% CPU time.
> >
> >I will, thanks for the tip!
> 
> The perf report is very similar to the one that started this effort..
> 
> I'm afraid we'll need to resolve the per-target m->lock in order
> to scale with NUMA...

Could be.  Just for testing, you can try the 2 topmost commits I've put
here (once applied both __multipath_map and multipath_busy won't have
_any_ locking.. again, very much test-only):

http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel2

^ permalink raw reply	[flat|nested] 127+ messages in thread

* [RFC PATCH] dm: fix excessive dm-mq context switching
  2016-02-07 17:20                                                 ` Mike Snitzer
@ 2016-02-08 12:21                                                   ` Sagi Grimberg
  -1 siblings, 0 replies; 127+ messages in thread
From: Sagi Grimberg @ 2016-02-08 12:21 UTC (permalink / raw)



>> The perf report is very similar to the one that started this effort..
>>
>> I'm afraid we'll need to resolve the per-target m->lock in order
>> to scale with NUMA...
>
> Could be.  Just for testing, you can try the 2 topmost commits I've put
> here (once applied both __multipath_map and multipath_busy won't have
> _any_ locking.. again, very much test-only):
>
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel2

Hi Mike,

So I still don't see the IOPs scale like I expected. With these two
patches applied I see ~670K IOPs while the perf output is different
and does not indicate a clear lock contention.

--
-   4.67%              fio  [kernel.kallsyms]        [k] blk_account_io_start
    - blk_account_io_start
       - 56.05% blk_insert_cloned_request
            map_request
            dm_mq_queue_rq
            __blk_mq_run_hw_queue
            blk_mq_run_hw_queue
            blk_mq_insert_requests
            blk_mq_flush_plug_list
            blk_flush_plug_list
            blk_finish_plug
            do_io_submit
            SyS_io_submit
            entry_SYSCALL_64_fastpath
          + io_submit
       - 43.94% blk_mq_bio_to_request
            blk_mq_make_request
            generic_make_request
            submit_bio
            do_blockdev_direct_IO
            __blockdev_direct_IO
            blkdev_direct_IO
            generic_file_read_iter
            blkdev_read_iter
            aio_run_iocb
            io_submit_one
            do_io_submit
            SyS_io_submit
            entry_SYSCALL_64_fastpath
          + io_submit
-   2.52%              fio  [dm_mod]                 [k] dm_mq_queue_rq
    - dm_mq_queue_rq
       - 99.16% __blk_mq_run_hw_queue
            blk_mq_run_hw_queue
            blk_mq_insert_requests
            blk_mq_flush_plug_list
            blk_flush_plug_list
            blk_finish_plug
            do_io_submit
            SyS_io_submit
            entry_SYSCALL_64_fastpath
          + io_submit
-   2.52%              fio  [dm_mod]                 [k] dm_mq_queue_rq
    - dm_mq_queue_rq
       - 99.16% __blk_mq_run_hw_queue
            blk_mq_run_hw_queue
            blk_mq_insert_requests
            blk_mq_flush_plug_list
            blk_flush_plug_list
            blk_finish_plug
            do_io_submit
            SyS_io_submit
            entry_SYSCALL_64_fastpath
          + io_submit
       + 0.84% blk_mq_run_hw_queue
-   2.46%              fio  [kernel.kallsyms]        [k] blk_mq_hctx_mark_pending
    - blk_mq_hctx_mark_pending
       - 99.79% blk_mq_insert_requests
            blk_mq_flush_plug_list
            blk_flush_plug_list
            blk_finish_plug
            do_io_submit
            SyS_io_submit
            entry_SYSCALL_64_fastpath
          + io_submit
-   2.07%      ksoftirqd/6  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
    - blk_mq_run_hw_queues
       - 99.70% rq_completed
            dm_done
            dm_softirq_done
            blk_done_softirq
          + __do_softirq
+   2.06%      ksoftirqd/0  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   2.02%      ksoftirqd/9  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   2.00%     ksoftirqd/20  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   2.00%     ksoftirqd/12  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.99%     ksoftirqd/11  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.97%     ksoftirqd/18  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.96%      ksoftirqd/1  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.95%     ksoftirqd/14  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.95%     ksoftirqd/13  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.94%      ksoftirqd/5  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.94%      ksoftirqd/8  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.93%      ksoftirqd/2  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.92%     ksoftirqd/21  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.92%     ksoftirqd/17  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.92%      ksoftirqd/7  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.91%     ksoftirqd/23  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.84%      ksoftirqd/4  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.81%     ksoftirqd/19  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.76%      ksoftirqd/3  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.76%     ksoftirqd/16  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.75%     ksoftirqd/15  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.74%     ksoftirqd/22  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.72%     ksoftirqd/10  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.38%             perf  [kernel.kallsyms]        [k] copy_user_generic_string
+   1.20%              fio  [kernel.kallsyms]        [k] enqueue_task_fair
+   1.18%              fio  [kernel.kallsyms]        [k] part_round_stats
+   1.08%              fio  [kernel.kallsyms]        [k] enqueue_entity
+   1.07%              fio  [kernel.kallsyms]        [k] _raw_spin_lock
+   1.02%              fio  [kernel.kallsyms]        [k] __blk_mq_run_hw_queue
+   0.79%              fio  [dm_multipath]           [k] multipath_busy
+   0.57%              fio  [kernel.kallsyms]        [k] insert_work
+   0.54%              fio  [kernel.kallsyms]        [k] blk_flush_plug_list
--

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [RFC PATCH] dm: fix excessive dm-mq context switching
@ 2016-02-08 12:21                                                   ` Sagi Grimberg
  0 siblings, 0 replies; 127+ messages in thread
From: Sagi Grimberg @ 2016-02-08 12:21 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: axboe, Christoph Hellwig, linux-nvme, keith.busch,
	device-mapper development, linux-block, Bart Van Assche


>> The perf report is very similar to the one that started this effort..
>>
>> I'm afraid we'll need to resolve the per-target m->lock in order
>> to scale with NUMA...
>
> Could be.  Just for testing, you can try the 2 topmost commits I've put
> here (once applied both __multipath_map and multipath_busy won't have
> _any_ locking.. again, very much test-only):
>
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel2

Hi Mike,

So I still don't see the IOPs scale like I expected. With these two
patches applied I see ~670K IOPs while the perf output is different
and does not indicate a clear lock contention.

--
-   4.67%              fio  [kernel.kallsyms]        [k] blk_account_io_start
    - blk_account_io_start
       - 56.05% blk_insert_cloned_request
            map_request
            dm_mq_queue_rq
            __blk_mq_run_hw_queue
            blk_mq_run_hw_queue
            blk_mq_insert_requests
            blk_mq_flush_plug_list
            blk_flush_plug_list
            blk_finish_plug
            do_io_submit
            SyS_io_submit
            entry_SYSCALL_64_fastpath
          + io_submit
       - 43.94% blk_mq_bio_to_request
            blk_mq_make_request
            generic_make_request
            submit_bio
            do_blockdev_direct_IO
            __blockdev_direct_IO
            blkdev_direct_IO
            generic_file_read_iter
            blkdev_read_iter
            aio_run_iocb
            io_submit_one
            do_io_submit
            SyS_io_submit
            entry_SYSCALL_64_fastpath
          + io_submit
-   2.52%              fio  [dm_mod]                 [k] dm_mq_queue_rq
    - dm_mq_queue_rq
       - 99.16% __blk_mq_run_hw_queue
            blk_mq_run_hw_queue
            blk_mq_insert_requests
            blk_mq_flush_plug_list
            blk_flush_plug_list
            blk_finish_plug
            do_io_submit
            SyS_io_submit
            entry_SYSCALL_64_fastpath
          + io_submit
-   2.52%              fio  [dm_mod]                 [k] dm_mq_queue_rq
    - dm_mq_queue_rq
       - 99.16% __blk_mq_run_hw_queue
            blk_mq_run_hw_queue
            blk_mq_insert_requests
            blk_mq_flush_plug_list
            blk_flush_plug_list
            blk_finish_plug
            do_io_submit
            SyS_io_submit
            entry_SYSCALL_64_fastpath
          + io_submit
       + 0.84% blk_mq_run_hw_queue
-   2.46%              fio  [kernel.kallsyms]        [k] blk_mq_hctx_mark_pending
    - blk_mq_hctx_mark_pending
       - 99.79% blk_mq_insert_requests
            blk_mq_flush_plug_list
            blk_flush_plug_list
            blk_finish_plug
            do_io_submit
            SyS_io_submit
            entry_SYSCALL_64_fastpath
          + io_submit
-   2.07%      ksoftirqd/6  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
    - blk_mq_run_hw_queues
       - 99.70% rq_completed
            dm_done
            dm_softirq_done
            blk_done_softirq
          + __do_softirq
+   2.06%      ksoftirqd/0  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   2.02%      ksoftirqd/9  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   2.00%     ksoftirqd/20  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   2.00%     ksoftirqd/12  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.99%     ksoftirqd/11  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.97%     ksoftirqd/18  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.96%      ksoftirqd/1  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.95%     ksoftirqd/14  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.95%     ksoftirqd/13  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.94%      ksoftirqd/5  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.94%      ksoftirqd/8  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.93%      ksoftirqd/2  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.92%     ksoftirqd/21  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.92%     ksoftirqd/17  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.92%      ksoftirqd/7  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.91%     ksoftirqd/23  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.84%      ksoftirqd/4  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.81%     ksoftirqd/19  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.76%      ksoftirqd/3  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.76%     ksoftirqd/16  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.75%     ksoftirqd/15  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.74%     ksoftirqd/22  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.72%     ksoftirqd/10  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
+   1.38%             perf  [kernel.kallsyms]        [k] copy_user_generic_string
+   1.20%              fio  [kernel.kallsyms]        [k] enqueue_task_fair
+   1.18%              fio  [kernel.kallsyms]        [k] part_round_stats
+   1.08%              fio  [kernel.kallsyms]        [k] enqueue_entity
+   1.07%              fio  [kernel.kallsyms]        [k] _raw_spin_lock
+   1.02%              fio  [kernel.kallsyms]        [k] __blk_mq_run_hw_queue
+   0.79%              fio  [dm_multipath]           [k] multipath_busy
+   0.57%              fio  [kernel.kallsyms]        [k] insert_work
+   0.54%              fio  [kernel.kallsyms]        [k] blk_flush_plug_list
--

^ permalink raw reply	[flat|nested] 127+ messages in thread

* [RFC PATCH] dm: fix excessive dm-mq context switching
  2016-02-08 12:21                                                   ` Sagi Grimberg
@ 2016-02-08 14:34                                                     ` Mike Snitzer
  -1 siblings, 0 replies; 127+ messages in thread
From: Mike Snitzer @ 2016-02-08 14:34 UTC (permalink / raw)


On Mon, Feb 08 2016 at  7:21am -0500,
Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:

> 
> >>The perf report is very similar to the one that started this effort..
> >>
> >>I'm afraid we'll need to resolve the per-target m->lock in order
> >>to scale with NUMA...
> >
> >Could be.  Just for testing, you can try the 2 topmost commits I've put
> >here (once applied both __multipath_map and multipath_busy won't have
> >_any_ locking.. again, very much test-only):
> >
> >http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel2
> 
> Hi Mike,
> 
> So I still don't see the IOPs scale like I expected. With these two
> patches applied I see ~670K IOPs while the perf output is different
> and does not indicate a clear lock contention.

Right, perf (with default events) isn't the right tool to track this down.

But I'm seeing something that speaks to you not running with the first
context switching fix (which seems odd):

> -   2.07%      ksoftirqd/6  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
>    - blk_mq_run_hw_queues
>       - 99.70% rq_completed
>            dm_done
>            dm_softirq_done
>            blk_done_softirq
>          + __do_softirq

As you can see here:
https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=for-next&id=a5b835282422ec41991c1dbdb88daa4af7d166d2

rq_completed() shouldn't be calling blk_mq_run_hw_queues() with the
latest code.

Please triple check you have the latest code, e.g.:
git diff snitzer/devel2
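
For example (assuming the tree isn't already set up as a remote; the clone
URL below is inferred from the cgit link above):

git remote add snitzer https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git
git fetch snitzer
git diff snitzer/devel2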

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [RFC PATCH] dm: fix excessive dm-mq context switching
@ 2016-02-08 14:34                                                     ` Mike Snitzer
  0 siblings, 0 replies; 127+ messages in thread
From: Mike Snitzer @ 2016-02-08 14:34 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: axboe, Christoph Hellwig, linux-nvme, keith.busch,
	device-mapper development, linux-block, Bart Van Assche

On Mon, Feb 08 2016 at  7:21am -0500,
Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:

> 
> >>The perf report is very similar to the one that started this effort..
> >>
> >>I'm afraid we'll need to resolve the per-target m->lock in order
> >>to scale with NUMA...
> >
> >Could be.  Just for testing, you can try the 2 topmost commits I've put
> >here (once applied both __multipath_map and multipath_busy won't have
> >_any_ locking.. again, very much test-only):
> >
> >http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel2
> 
> Hi Mike,
> 
> So I still don't see the IOPs scale like I expected. With these two
> patches applied I see ~670K IOPs while the perf output is different
> and does not indicate a clear lock contention.

Right, perf (with default events) isn't the right tool to track this down.

But I'm seeing something that speaks to you not running with the first
context switching fix (which seems odd):

> -   2.07%      ksoftirqd/6  [kernel.kallsyms]        [k] blk_mq_run_hw_queues
>    - blk_mq_run_hw_queues
>       - 99.70% rq_completed
>            dm_done
>            dm_softirq_done
>            blk_done_softirq
>          + __do_softirq

As you can see here:
https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=for-next&id=a5b835282422ec41991c1dbdb88daa4af7d166d2

rq_completed() shouldn't be calling blk_mq_run_hw_queues() with the
latest code.

Please triple check you have the latest code, e.g.:
git diff snitzer/devel2

^ permalink raw reply	[flat|nested] 127+ messages in thread

* [RFC PATCH] dm: fix excessive dm-mq context switching
  2016-02-07 17:20                                                 ` Mike Snitzer
@ 2016-02-09  7:50                                                   ` Hannes Reinecke
  -1 siblings, 0 replies; 127+ messages in thread
From: Hannes Reinecke @ 2016-02-09  7:50 UTC (permalink / raw)


On 02/07/2016 06:20 PM, Mike Snitzer wrote:
> On Sun, Feb 07 2016 at 11:54am -0500,
> Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:
> 
>>
>>>> If so, can you check with e.g.
>>>> perf record -ags -e LLC-load-misses sleep 10 && perf report whether this
>>>> workload triggers perhaps lock contention ? What you need to look for in
>>>> the perf output is whether any functions occupy more than 10% CPU time.
>>>
>>> I will, thanks for the tip!
>>
>> The perf report is very similar to the one that started this effort..
>>
>> I'm afraid we'll need to resolve the per-target m->lock in order
>> to scale with NUMA...
> 
> Could be.  Just for testing, you can try the 2 topmost commits I've put
> here (once applied both __multipath_map and multipath_busy won't have
> _any_ locking.. again, very much test-only):
> 
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel2
> 
So, I gave those patches a spin.
Sad to say, they do _not_ resolve the issue fully.

My testbed (2 paths per LUN, 40 CPUs, 4 cores) yields 505k IOPs with
those patches.
Using a single path (without those patches, but still running
multipath on top of that path) the same testbed yields 550k IOPs.
Which very much smells like a lock contention ...
We do get a slight improvement, though; without those patches I
could only get about 350k IOPs. But still, I would somehow expect 2
paths to be faster than just one ..

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [RFC PATCH] dm: fix excessive dm-mq context switching
@ 2016-02-09  7:50                                                   ` Hannes Reinecke
  0 siblings, 0 replies; 127+ messages in thread
From: Hannes Reinecke @ 2016-02-09  7:50 UTC (permalink / raw)
  To: Mike Snitzer, Sagi Grimberg
  Cc: axboe, keith.busch, linux-nvme, Christoph Hellwig,
	device-mapper development, linux-block, Bart Van Assche

On 02/07/2016 06:20 PM, Mike Snitzer wrote:
> On Sun, Feb 07 2016 at 11:54am -0500,
> Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:
> 
>>
>>>> If so, can you check with e.g.
>>>> perf record -ags -e LLC-load-misses sleep 10 && perf report whether this
>>>> workload triggers perhaps lock contention ? What you need to look for in
>>>> the perf output is whether any functions occupy more than 10% CPU time.
>>>
>>> I will, thanks for the tip!
>>
>> The perf report is very similar to the one that started this effort..
>>
>> I'm afraid we'll need to resolve the per-target m->lock in order
>> to scale with NUMA...
> 
> Could be.  Just for testing, you can try the 2 topmost commits I've put
> here (once applied both __multipath_map and multipath_busy won't have
> _any_ locking.. again, very much test-only):
> 
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel2
> 
So, I gave those patches a spin.
Sad to say, they do _not_ resolve the issue fully.

My testbed (2 paths per LUN, 40 CPUs, 4 cores) yields 505k IOPs with
those patches.
Using a single path (without those patches, but still running
multipath on top of that path) the same testbed yields 550k IOPs.
Which very much smells like a lock contention ...
We do get a slight improvement, though; without those patches I
could only get about 350k IOPs. But still, I would somehow expect 2
paths to be faster than just one ..

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 127+ messages in thread

* [RFC PATCH] dm: fix excessive dm-mq context switching
  2016-02-09  7:50                                                   ` Hannes Reinecke
@ 2016-02-09 14:55                                                     ` Mike Snitzer
  -1 siblings, 0 replies; 127+ messages in thread
From: Mike Snitzer @ 2016-02-09 14:55 UTC (permalink / raw)


On Tue, Feb 09 2016 at  2:50am -0500,
Hannes Reinecke <hare@suse.de> wrote:

> On 02/07/2016 06:20 PM, Mike Snitzer wrote:
> > On Sun, Feb 07 2016 at 11:54am -0500,
> > Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:
> > 
> >>
> >>>> If so, can you check with e.g.
> >>>> perf record -ags -e LLC-load-misses sleep 10 && perf report whether this
> >>>> workload triggers perhaps lock contention ? What you need to look for in
> >>>> the perf output is whether any functions occupy more than 10% CPU time.
> >>>
> >>> I will, thanks for the tip!
> >>
> >> The perf report is very similar to the one that started this effort..
> >>
> >> I'm afraid we'll need to resolve the per-target m->lock in order
> >> to scale with NUMA...
> > 
> > Could be.  Just for testing, you can try the 2 topmost commits I've put
> > here (once applied both __multipath_map and multipath_busy won't have
> > _any_ locking.. again, very much test-only):
> > 
> > http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel2
> > 
> So, I gave those patches a spin.
> Sad to say, they do _not_ resolve the issue fully.
>
> My testbed (2 paths per LUN, 40 CPUs, 4 cores) yields 505k IOPs with
> those patches.

That isn't a surprise.  We knew the m->lock spinlock contention to be a
problem.  And NUMA makes it even worse.

> Using a single path (without those patches, but still running
> multipath on top of that path) the same testbed yields 550k IOPs.
> Which very much smells like a lock contention ...
> We do get a slight improvement, though; without those patches I
> could only get about 350k IOPs. But still, I would somehow expect 2
> paths to be faster than just one ..

https://www.redhat.com/archives/dm-devel/2016-February/msg00036.html

hint hint...

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [RFC PATCH] dm: fix excessive dm-mq context switching
@ 2016-02-09 14:55                                                     ` Mike Snitzer
  0 siblings, 0 replies; 127+ messages in thread
From: Mike Snitzer @ 2016-02-09 14:55 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: axboe, keith.busch, Sagi Grimberg, linux-nvme, Christoph Hellwig,
	device-mapper development, linux-block, Bart Van Assche

On Tue, Feb 09 2016 at  2:50am -0500,
Hannes Reinecke <hare@suse.de> wrote:

> On 02/07/2016 06:20 PM, Mike Snitzer wrote:
> > On Sun, Feb 07 2016 at 11:54am -0500,
> > Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:
> > 
> >>
> >>>> If so, can you check with e.g.
> >>>> perf record -ags -e LLC-load-misses sleep 10 && perf report whether this
> >>>> workload triggers perhaps lock contention ? What you need to look for in
> >>>> the perf output is whether any functions occupy more than 10% CPU time.
> >>>
> >>> I will, thanks for the tip!
> >>
> >> The perf report is very similar to the one that started this effort..
> >>
> >> I'm afraid we'll need to resolve the per-target m->lock in order
> >> to scale with NUMA...
> > 
> > Could be.  Just for testing, you can try the 2 topmost commits I've put
> > here (once applied both __multipath_map and multipath_busy won't have
> > _any_ locking.. again, very much test-only):
> > 
> > http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel2
> > 
> So, I gave those patches a spin.
> Sad to say, they do _not_ resolve the issue fully.
>
> My testbed (2 paths per LUN, 40 CPUs, 4 cores) yields 505k IOPs with
> those patches.

That isn't a surprise.  We knew the m->lock spinlock contention to be a
problem.  And NUMA makes it even worse.

> Using a single path (without those patches, but still running
> multipath on top of that path) the same testbed yields 550k IOPs.
> Which very much smells like a lock contention ...
> We do get a slight improvement, though; without those patches I
> could only get about 350k IOPs. But still, I would somehow expect 2
> paths to be faster than just one ..

https://www.redhat.com/archives/dm-devel/2016-February/msg00036.html

hint hint...

^ permalink raw reply	[flat|nested] 127+ messages in thread

* [RFC PATCH] dm: fix excessive dm-mq context switching
  2016-02-09 14:55                                                     ` Mike Snitzer
@ 2016-02-09 15:32                                                       ` Hannes Reinecke
  -1 siblings, 0 replies; 127+ messages in thread
From: Hannes Reinecke @ 2016-02-09 15:32 UTC (permalink / raw)


On 02/09/2016 03:55 PM, Mike Snitzer wrote:
> On Tue, Feb 09 2016 at  2:50am -0500,
> Hannes Reinecke <hare@suse.de> wrote:
> 
>> On 02/07/2016 06:20 PM, Mike Snitzer wrote:
>>> On Sun, Feb 07 2016 at 11:54am -0500,
>>> Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:
>>>
>>>>
>>>>>> If so, can you check with e.g.
>>>>>> perf record -ags -e LLC-load-misses sleep 10 && perf report whether this
>>>>>> workload triggers perhaps lock contention ? What you need to look for in
>>>>>> the perf output is whether any functions occupy more than 10% CPU time.
>>>>>
>>>>> I will, thanks for the tip!
>>>>
>>>> The perf report is very similar to the one that started this effort..
>>>>
>>>> I'm afraid we'll need to resolve the per-target m->lock in order
>>>> to scale with NUMA...
>>>
>>> Could be.  Just for testing, you can try the 2 topmost commits I've put
>>> here (once applied both __multipath_map and multipath_busy won't have
>>> _any_ locking.. again, very much test-only):
>>>
>>> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel2
>>>
>> So, I gave those patches a spin.
>> Sad to say, they do _not_ resolve the issue fully.
>>
>> My testbed (2 paths per LUN, 40 CPUs, 4 cores) yields 505k IOPs with
>> those patches.
> 
> That isn't a surprise.  We knew the m->lock spinlock contention to be a
> problem.  And NUMA makes it even worse.
> 
>> Using a single path (without those patches, but still running
>> multipath on top of that path) the same testbed yields 550k IOPs.
>> Which very much smells like a lock contention ...
>> We do get a slight improvement, though; without those patches I
>> could only get about 350k IOPs. But still, I would somehow expect 2
>> paths to be faster than just one ..
> 
> https://www.redhat.com/archives/dm-devel/2016-February/msg00036.html
> 
> hint hint...
> 
I hoped they wouldn't be needed with your patches.
Plus perf revealed that I first need to address a spinlock
contention in the lpfc driver before that even would make sense.

So more debugging to follow.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [RFC PATCH] dm: fix excessive dm-mq context switching
@ 2016-02-09 15:32                                                       ` Hannes Reinecke
  0 siblings, 0 replies; 127+ messages in thread
From: Hannes Reinecke @ 2016-02-09 15:32 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: axboe, keith.busch, Sagi Grimberg, linux-nvme, Christoph Hellwig,
	device-mapper development, linux-block, Bart Van Assche

On 02/09/2016 03:55 PM, Mike Snitzer wrote:
> On Tue, Feb 09 2016 at  2:50am -0500,
> Hannes Reinecke <hare@suse.de> wrote:
> 
>> On 02/07/2016 06:20 PM, Mike Snitzer wrote:
>>> On Sun, Feb 07 2016 at 11:54am -0500,
>>> Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:
>>>
>>>>
>>>>>> If so, can you check with e.g.
>>>>>> perf record -ags -e LLC-load-misses sleep 10 && perf report whether this
>>>>>> workload triggers perhaps lock contention ? What you need to look for in
>>>>>> the perf output is whether any functions occupy more than 10% CPU time.
>>>>>
>>>>> I will, thanks for the tip!
>>>>
>>>> The perf report is very similar to the one that started this effort..
>>>>
>>>> I'm afraid we'll need to resolve the per-target m->lock in order
>>>> to scale with NUMA...
>>>
>>> Could be.  Just for testing, you can try the 2 topmost commits I've put
>>> here (once applied both __multipath_map and multipath_busy won't have
>>> _any_ locking.. again, very much test-only):
>>>
>>> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel2
>>>
>> So, I gave those patches a spin.
>> Sad to say, they do _not_ resolve the issue fully.
>>
>> My testbed (2 paths per LUN, 40 CPUs, 4 cores) yields 505k IOPs with
>> those patches.
> 
> That isn't a surprise.  We knew the m->lock spinlock contention to be a
> problem.  And NUMA makes it even worse.
> 
>> Using a single path (without those patches, but still running
>> multipath on top of that path) the same testbed yields 550k IOPs.
>> Which very much smells like a lock contention ...
>> We do get a slight improvement, though; without those patches I
>> could only get about 350k IOPs. But still, I would somehow expect 2
>> paths to be faster than just one ..
> 
> https://www.redhat.com/archives/dm-devel/2016-February/msg00036.html
> 
> hint hint...
> 
I hoped they wouldn't be needed with your patches.
Plus perf revealed that I first need to address a spinlock
contention in the lpfc driver before that even would make sense.

So more debugging to follow.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 127+ messages in thread

* [RFC PATCH] dm: fix excessive dm-mq context switching
  2016-02-09 15:32                                                       ` Hannes Reinecke
@ 2016-02-10  0:45                                                         ` Mike Snitzer
  -1 siblings, 0 replies; 127+ messages in thread
From: Mike Snitzer @ 2016-02-10  0:45 UTC (permalink / raw)


On Tue, Feb 09 2016 at 10:32am -0500,
Hannes Reinecke <hare@suse.de> wrote:

> On 02/09/2016 03:55 PM, Mike Snitzer wrote:
> > On Tue, Feb 09 2016 at  2:50am -0500,
> > Hannes Reinecke <hare@suse.de> wrote:
> > 
> >> On 02/07/2016 06:20 PM, Mike Snitzer wrote:
> >>> On Sun, Feb 07 2016 at 11:54am -0500,
> >>> Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:
> >>>
> >>>>
> >>>>>> If so, can you check with e.g.
> >>>>>> perf record -ags -e LLC-load-misses sleep 10 && perf report whether this
> >>>>>> workload triggers perhaps lock contention ? What you need to look for in
> >>>>>> the perf output is whether any functions occupy more than 10% CPU time.
> >>>>>
> >>>>> I will, thanks for the tip!
> >>>>
> >>>> The perf report is very similar to the one that started this effort..
> >>>>
> >>>> I'm afraid we'll need to resolve the per-target m->lock in order
> >>>> to scale with NUMA...
> >>>
> >>> Could be.  Just for testing, you can try the 2 topmost commits I've put
> >>> here (once applied both __multipath_map and multipath_busy won't have
> >>> _any_ locking.. again, very much test-only):
> >>>
> >>> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel2
> >>>
> >> So, I gave those patches a spin.
> >> Sad to say, they do _not_ resolve the issue fully.
> >>
> >> My testbed (2 paths per LUN, 40 CPUs, 4 cores) yields 505k IOPs with
> >> those patches.
> > 
> > That isn't a surprise.  We knew the m->lock spinlock contention to be a
> > problem.  And NUMA makes it even worse.
> > 
> >> Using a single path (without those patches, but still running
> >> multipath on top of that path) the same testbed yields 550k IOPs.
> >> Which very much smells like a lock contention ...
> >> We do get a slight improvement, though; without those patches I
> >> could only get about 350k IOPs. But still, I would somehow expect 2
> >> paths to be faster than just one ..
> > 
> > https://www.redhat.com/archives/dm-devel/2016-February/msg00036.html
> > 
> > hint hint...
> > 
> I hoped they wouldn't be needed with your patches.
> Plus perf revealed that I first need to address a spinlock
> contention in the lpfc driver before that even would make sense.
> 
> So more debugging to follow.

OK, I took a crack at embracing RCU.  Only slightly better performance
on my single NUMA node testbed.  (But I'll have to track down a system
with multiple NUMA nodes to do any justice to the next wave of this
optimization effort)

This RCU work is very heavy-handed and way too fiddly (there could
easily be bugs).  Anyway, please see:
http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=devel2&id=d80a7e4f8b5be9c81e4d452137623b003fa64745

But this might give you something to build on to arrive at something
more scalable?
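
A minimal sketch of the RCU pattern being discussed here (illustrative only,
not the code in the commit above; the struct and function names below are
invented for the example):

#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

/* stand-in for the per-multipath state the fast path needs to read */
struct pgpath_state {
	struct rcu_head rcu;
	void *path;			/* e.g. the currently selected pgpath */
};

struct multipath_example {
	struct pgpath_state __rcu *current_path;
	spinlock_t lock;		/* now only serializes writers */
};

/* fast path: called per request, takes no spinlock */
static void *example_map_path(struct multipath_example *m)
{
	struct pgpath_state *p;
	void *path = NULL;

	rcu_read_lock();
	p = rcu_dereference(m->current_path);
	if (p)
		path = p->path;
	rcu_read_unlock();

	return path;
}

/* slow path: path switch/reinstatement, still serialized by m->lock */
static int example_switch_path(struct multipath_example *m, void *new_path)
{
	struct pgpath_state *new_p, *old_p;

	new_p = kmalloc(sizeof(*new_p), GFP_KERNEL);
	if (!new_p)
		return -ENOMEM;
	new_p->path = new_path;

	spin_lock(&m->lock);
	old_p = rcu_dereference_protected(m->current_path,
					  lockdep_is_held(&m->lock));
	rcu_assign_pointer(m->current_path, new_p);
	spin_unlock(&m->lock);

	if (old_p)
		kfree_rcu(old_p, rcu);	/* free after a grace period */

	return 0;
}

The point of that shape is that per-request readers never touch m->lock;
only reconfiguration takes it, and freeing the old state is deferred past an
RCU grace period.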

Mike

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [RFC PATCH] dm: fix excessive dm-mq context switching
@ 2016-02-10  0:45                                                         ` Mike Snitzer
  0 siblings, 0 replies; 127+ messages in thread
From: Mike Snitzer @ 2016-02-10  0:45 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: axboe, keith.busch, Sagi Grimberg, linux-nvme, Christoph Hellwig,
	device-mapper development, linux-block, Bart Van Assche

On Tue, Feb 09 2016 at 10:32am -0500,
Hannes Reinecke <hare@suse.de> wrote:

> On 02/09/2016 03:55 PM, Mike Snitzer wrote:
> > On Tue, Feb 09 2016 at  2:50am -0500,
> > Hannes Reinecke <hare@suse.de> wrote:
> > 
> >> On 02/07/2016 06:20 PM, Mike Snitzer wrote:
> >>> On Sun, Feb 07 2016 at 11:54am -0500,
> >>> Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:
> >>>
> >>>>
> >>>>>> If so, can you check with e.g.
> >>>>>> perf record -ags -e LLC-load-misses sleep 10 && perf report whether this
> >>>>>> workload triggers perhaps lock contention ? What you need to look for in
> >>>>>> the perf output is whether any functions occupy more than 10% CPU time.
> >>>>>
> >>>>> I will, thanks for the tip!
> >>>>
> >>>> The perf report is very similar to the one that started this effort..
> >>>>
> >>>> I'm afraid we'll need to resolve the per-target m->lock in order
> >>>> to scale with NUMA...
> >>>
> >>> Could be.  Just for testing, you can try the 2 topmost commits I've put
> >>> here (once applied both __multipath_map and multipath_busy won't have
> >>> _any_ locking.. again, very much test-only):
> >>>
> >>> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel2
> >>>
> >> So, I gave those patches a spin.
> >> Sad to say, they do _not_ resolve the issue fully.
> >>
> >> My testbed (2 paths per LUN, 40 CPUs, 4 cores) yields 505k IOPs with
> >> those patches.
> > 
> > That isn't a surprise.  We knew the m->lock spinlock contention to be a
> > problem.  And NUMA makes it even worse.
> > 
> >> Using a single path (without those patches, but still running
> >> multipath on top of that path) the same testbed yields 550k IOPs.
> >> Which very much smells like lock contention ...
> >> We do get a slight improvement, though; without those patches I
> >> could only get about 350k IOPs. But still, I would somehow expect 2
> >> paths to be faster than just one ..
> > 
> > https://www.redhat.com/archives/dm-devel/2016-February/msg00036.html
> > 
> > hint hint...
> > 
> I hoped they wouldn't be needed with your patches.
> Plus perf revealed that I first need to address spinlock
> contention in the lpfc driver before that would even make sense.
> 
> So more debugging to follow.

OK, I took a crack at embracing RCU.  Only slightly better performance
on my single NUMA node testbed.  (But I'll have to track down a system
with multiple NUMA nodes to do any justice to the next wave of this
optimization effort)

This RCU work is very heavy-handed and way too fiddly (there could
easily be bugs).  Anyway, please see:
http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=devel2&id=d80a7e4f8b5be9c81e4d452137623b003fa64745

But this might give you something to build on to arrive at something
more scalable?

Mike

^ permalink raw reply	[flat|nested] 127+ messages in thread

* RCU-ified dm-mpath for testing/review
  2016-02-10  0:45                                                         ` Mike Snitzer
  (?)
@ 2016-02-11  1:50                                                         ` Mike Snitzer
  2016-02-11  3:35                                                             ` Mike Snitzer
  2016-02-11 15:34                                                             ` Mike Snitzer
  -1 siblings, 2 replies; 127+ messages in thread
From: Mike Snitzer @ 2016-02-11  1:50 UTC (permalink / raw)
  To: Hannes Reinecke, Sagi Grimberg
  Cc: axboe, linux-block, Christoph Hellwig, linux-nvme, keith.busch,
	device-mapper development, Bart Van Assche

On Tue, Feb 09 2016 at  7:45pm -0500,
Mike Snitzer <snitzer@redhat.com> wrote:

> 
> OK, I took a crack at embracing RCU.  Only slightly better performance
> on my single NUMA node testbed.  (But I'll have to track down a system
> with multiple NUMA nodes to do any justice to the next wave of this
> optimization effort)
> 
> This RCU work is very heavy-handed and way too fiddly (there could
> easily be bugs).  Anyway, please see:
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=devel2&id=d80a7e4f8b5be9c81e4d452137623b003fa64745
> 
> But this might give you something to build on to arrive at something
> more scalable?

I've a bit more polished version of this work (broken up into multiple
commits, with some fixes, etc) here:
http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel3

Hannes and/or Sagi, if you get a chance to try this on your NUMA system
please let me know how it goes.

Mike

^ permalink raw reply	[flat|nested] 127+ messages in thread

* RCU-ified dm-mpath for testing/review
  2016-02-11  1:50                                                         ` RCU-ified dm-mpath for testing/review Mike Snitzer
@ 2016-02-11  3:35                                                             ` Mike Snitzer
  2016-02-11 15:34                                                             ` Mike Snitzer
  1 sibling, 0 replies; 127+ messages in thread
From: Mike Snitzer @ 2016-02-11  3:35 UTC (permalink / raw)


On Wed, Feb 10 2016 at  8:50pm -0500,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Tue, Feb 09 2016 at  7:45pm -0500,
> Mike Snitzer <snitzer@redhat.com> wrote:
> 
> > 
> > OK, I took a crack at embracing RCU.  Only slightly better performance
> > on my single NUMA node testbed.  (But I'll have to track down a system
> > with multiple NUMA nodes to do any justice to the next wave of this
> > optimization effort)
> > 
> > This RCU work is very heavy-handed and way too fiddly (there could
> > easily be bugs).  Anyway, please see:
> > http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=devel2&id=d80a7e4f8b5be9c81e4d452137623b003fa64745
> > 
> > But this might give you something to build on to arrive at something
> > more scalable?
> 
> I've a bit more polished version of this work (broken up into multiple
> commits, with some fixes, etc) here:
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel3
> 
> Hannes and/or Sagi, if you get a chance to try this on your NUMA system
> please let me know how it goes.

FYI, with these changes my single NUMA node testbed's read IOPs went
from:

 ~1310K to ~1410K w/ nr_hw_queues dm-mq=4 and null_blk=4
 ~1330K to ~1415K w/ nr_hw_queues dm-mq=4 and null_blk=12
 ~1365K to ~1425K w/ nr_hw_queues dm-mq=12 and null_blk=12

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: RCU-ified dm-mpath for testing/review
@ 2016-02-11  3:35                                                             ` Mike Snitzer
  0 siblings, 0 replies; 127+ messages in thread
From: Mike Snitzer @ 2016-02-11  3:35 UTC (permalink / raw)
  To: Hannes Reinecke, Sagi Grimberg
  Cc: axboe, linux-block, Christoph Hellwig, linux-nvme, keith.busch,
	device-mapper development, Bart Van Assche

On Wed, Feb 10 2016 at  8:50pm -0500,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Tue, Feb 09 2016 at  7:45pm -0500,
> Mike Snitzer <snitzer@redhat.com> wrote:
> 
> > 
> > OK, I took a crack at embracing RCU.  Only slightly better performance
> > on my single NUMA node testbed.  (But I'll have to track down a system
> > with multiple NUMA nodes to do any justice to the next wave of this
> > optimization effort)
> > 
> > This RCU work is very heavy-handed and way too fiddly (there could
> > easily be bugs).  Anyway, please see:
> > http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=devel2&id=d80a7e4f8b5be9c81e4d452137623b003fa64745
> > 
> > But this might give you something to build on to arrive at something
> > more scalable?
> 
> I've a bit more polished version of this work (broken up into multiple
> commits, with some fixes, etc) here:
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel3
> 
> Hannes and/or Sagi, if you get a chance to try this on your NUMA system
> please let me know how it goes.

FYI, with these changes my single NUMA node testbed's read IOPs went
from:

 ~1310K to ~1410K w/ nr_hw_queues dm-mq=4 and null_blk=4
 ~1330K to ~1415K w/ nr_hw_queues dm-mq=4 and null_blk=12
 ~1365K to ~1425K w/ nr_hw_queues dm-mq=12 and null_blk=12

^ permalink raw reply	[flat|nested] 127+ messages in thread

* RCU-ified dm-mpath for testing/review
  2016-02-11  1:50                                                         ` RCU-ified dm-mpath for testing/review Mike Snitzer
@ 2016-02-11 15:34                                                             ` Mike Snitzer
  2016-02-11 15:34                                                             ` Mike Snitzer
  1 sibling, 0 replies; 127+ messages in thread
From: Mike Snitzer @ 2016-02-11 15:34 UTC (permalink / raw)


On Wed, Feb 10 2016 at  8:50pm -0500,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Tue, Feb 09 2016 at  7:45pm -0500,
> Mike Snitzer <snitzer@redhat.com> wrote:
> 
> > 
> > OK, I took a crack at embracing RCU.  Only slightly better performance
> > on my single NUMA node testbed.  (But I'll have to track down a system
> > with multiple NUMA nodes to do any justice to the next wave of this
> > optimization effort)
> > 
> > This RCU work is very heavy-handed and way too fiddly (there could
> > easily be bugs).  Anyway, please see:
> > http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=devel2&id=d80a7e4f8b5be9c81e4d452137623b003fa64745
> > 
> > But this might give you something to build on to arrive at something
> > more scalable?
> 
> I've a bit more polished version of this work (broken up into multiple
> commits, with some fixes, etc) here:
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel3
> 
> Hannes and/or Sagi, if you get a chance to try this on your NUMA system
> please let me know how it goes.

Initial review has uncovered some locking problems with the current code
(nothing that caused crashes or hangs in my testing but...) so please
hold off on testing until you hear from me (hopefully tomorrow).

Thanks,
Mike

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: RCU-ified dm-mpath for testing/review
@ 2016-02-11 15:34                                                             ` Mike Snitzer
  0 siblings, 0 replies; 127+ messages in thread
From: Mike Snitzer @ 2016-02-11 15:34 UTC (permalink / raw)
  To: Hannes Reinecke, Sagi Grimberg
  Cc: axboe, linux-block, Christoph Hellwig, linux-nvme, keith.busch,
	device-mapper development, Bart Van Assche

On Wed, Feb 10 2016 at  8:50pm -0500,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Tue, Feb 09 2016 at  7:45pm -0500,
> Mike Snitzer <snitzer@redhat.com> wrote:
> 
> > 
> > OK, I took a crack at embracing RCU.  Only slightly better performance
> > on my single NUMA node testbed.  (But I'll have to track down a system
> > with multiple NUMA nodes to do any justice to the next wave of this
> > optimization effort)
> > 
> > This RCU work is very heavy-handed and way too fiddly (there could
> > easily be bugs).  Anyway, please see:
> > http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=devel2&id=d80a7e4f8b5be9c81e4d452137623b003fa64745
> > 
> > But this might give you something to build on to arrive at something
> > more scalable?
> 
> I've a bit more polished version of this work (broken up into multiple
> commits, with some fixes, etc) here:
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel3
> 
> Hannes and/or Sagi, if you get a chance to try this on your NUMA system
> please let me know how it goes.

Initial review has uncovered some locking problems with the current code
(nothing that caused crashes or hangs in my testing but...) so please
hold off on testing until you hear from me (hopefully tomorrow).

Thanks,
Mike

^ permalink raw reply	[flat|nested] 127+ messages in thread

* RCU-ified dm-mpath for testing/review
  2016-02-11 15:34                                                             ` Mike Snitzer
@ 2016-02-12 15:18                                                               ` Hannes Reinecke
  -1 siblings, 0 replies; 127+ messages in thread
From: Hannes Reinecke @ 2016-02-12 15:18 UTC (permalink / raw)


On 02/11/2016 04:34 PM, Mike Snitzer wrote:
> On Wed, Feb 10 2016 at  8:50pm -0500,
> Mike Snitzer <snitzer@redhat.com> wrote:
> 
>> On Tue, Feb 09 2016 at  7:45pm -0500,
>> Mike Snitzer <snitzer@redhat.com> wrote:
>>
>>>
>>> OK, I took a crack at embracing RCU.  Only slightly better performance
>>> on my single NUMA node testbed.  (But I'll have to track down a system
>>> with multiple NUMA nodes to do any justice to the next wave of this
>>> optimization effort)
>>>
>>> This RCU work is very heavy-handed and way too fiddly (there could
>>> easily be bugs).  Anyway, please see:
>>> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=devel2&id=d80a7e4f8b5be9c81e4d452137623b003fa64745
>>>
>>> But this might give you something to build on to arrive at something
>>> more scalable?
>>
>> I've a bit more polished version of this work (broken up into multiple
>> commits, with some fixes, etc) here:
>> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel3
>>
>> Hannes and/or Sagi, if you get a chance to try this on your NUMA system
>> please let me know how it goes.
> 
> Initial review has uncovered some locking problems with the current code
> (nothing that caused crashes or hangs in my testing but...) so please
> hold off on testing until you hear from me (hopefully tomorrow).
> 
Good news is that I've managed to hit the roof for my array with the
devel2 version of those patches. (And a _heavily_ patched-up lpfc
driver :-)
So from that perspective everything's fine now; we've reached the
hardware limit for my setup.
Which in itself is quite impressive; beating Intel P3700 with 16FC
is not bad methinks :-)

So thanks for all your work here.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: RCU-ified dm-mpath for testing/review
@ 2016-02-12 15:18                                                               ` Hannes Reinecke
  0 siblings, 0 replies; 127+ messages in thread
From: Hannes Reinecke @ 2016-02-12 15:18 UTC (permalink / raw)
  To: Mike Snitzer, Sagi Grimberg
  Cc: axboe, keith.busch, Christoph Hellwig, linux-nvme, linux-block,
	device-mapper development, Bart Van Assche

On 02/11/2016 04:34 PM, Mike Snitzer wrote:
> On Wed, Feb 10 2016 at  8:50pm -0500,
> Mike Snitzer <snitzer@redhat.com> wrote:
> 
>> On Tue, Feb 09 2016 at  7:45pm -0500,
>> Mike Snitzer <snitzer@redhat.com> wrote:
>>
>>>
>>> OK, I took a crack at embracing RCU.  Only slightly better performance
>>> on my single NUMA node testbed.  (But I'll have to track down a system
>>> with multiple NUMA nodes to do any justice to the next wave of this
>>> optimization effort)
>>>
>>> This RCU work is very heavy-handed and way too fiddly (there could
>>> easily be bugs).  Anyway, please see:
>>> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=devel2&id=d80a7e4f8b5be9c81e4d452137623b003fa64745
>>>
>>> But this might give you something to build on to arrive at something
>>> more scalable?
>>
>> I've a bit more polished version of this work (broken up into multiple
>> commits, with some fixes, etc) here:
>> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel3
>>
>> Hannes and/or Sagi, if you get a chance to try this on your NUMA system
>> please let me know how it goes.
> 
> Initial review has uncovered some locking problems with the current code
> (nothing that caused crashes or hangs in my testing but...) so please
> hold off on testing until you hear from me (hopefully tomorrow).
> 
Good news is that I've managed to hit the roof for my array with the
devel2 version of those patches. (And a _heavily_ patched-up lpfc
driver :-)
So from that perspective everything's fine now; we've reached the
hardware limit for my setup.
Which in itself is quite impressive; beating Intel P3700 with 16FC
is not bad methinks :-)

So thanks for all your work here.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 127+ messages in thread

* RCU-ified dm-mpath for testing/review
  2016-02-12 15:18                                                               ` Hannes Reinecke
@ 2016-02-12 15:26                                                                 ` Mike Snitzer
  -1 siblings, 0 replies; 127+ messages in thread
From: Mike Snitzer @ 2016-02-12 15:26 UTC (permalink / raw)


On Fri, Feb 12 2016 at 10:18am -0500,
Hannes Reinecke <hare@suse.de> wrote:

> On 02/11/2016 04:34 PM, Mike Snitzer wrote:
> > On Wed, Feb 10 2016 at  8:50pm -0500,
> > Mike Snitzer <snitzer@redhat.com> wrote:
> > 
> >> On Tue, Feb 09 2016 at  7:45pm -0500,
> >> Mike Snitzer <snitzer@redhat.com> wrote:
> >>
> >>>
> >>> OK, I took a crack at embracing RCU.  Only slightly better performance
> >>> on my single NUMA node testbed.  (But I'll have to track down a system
> >>> with multiple NUMA nodes to do any justice to the next wave of this
> >>> optimization effort)
> >>>
> >>> This RCU work is very heavy-handed and way too fiddly (there could
> >>> easily be bugs).  Anyway, please see:
> >>> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=devel2&id=d80a7e4f8b5be9c81e4d452137623b003fa64745
> >>>
> >>> But this might give you something to build on to arrive at something
> >>> more scalable?
> >>
> >> I've a bit more polished version of this work (broken up into multiple
> >> commits, with some fixes, etc) here:
> >> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel3
> >>
> >> Hannes and/or Sagi, if you get a chance to try this on your NUMA system
> >> please let me know how it goes.
> > 
> > Initial review has uncovered some locking problems with the current code
> > (nothing that caused crashes or hangs in my testing but...) so please
> > hold off on testing until you hear from me (hopefully tomorrow).
> > 
> Good news is that I've managed to hit the roof for my array with the
> devel2 version of those patches. (And a _heavily_ patched-up lpfc
> driver :-)
> So from that perspective everything's fine now; we've reached the
> hardware limit for my setup.
> Which in itself is quite impressive; beating Intel P3700 with 16FC
> is not bad methinks :-)
> 
> So thanks for all your work here.

Ah, that's really good news!  But devel2 is definitely _not_ destined
for upstream.  'devel3' is much closer to "ready".  But your testing and
review would really push it forward.

Please see/test:
http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel3

Also, please read this header:
http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=devel3&id=65a01b76502dd68e8ca298ee6614c0151b677f4a

Even with devel2 I hacked it such that repeat_count > 1 is effectively
broken.  I'm now _seriously_ considering deprecating repeat_count
completely (adding a DMWARN that will inform the user. e.g.:
"repeat_count > 1 is no longer supported").  I see no point going to
great lengths to maintain a dm-mpath feature that was only a hack for
when dm-mpath was bio-based.  What do you think?

Thanks,
Mike
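
A deprecation along those lines could be as small as the helper sketched
below.  To be clear, ps_parse_repeat_count() is an illustrative, made-up
name (not something in devel3); the real change would sit in dm-mpath.c and
the path selectors' argument parsing:

#include <linux/kernel.h>
#include <linux/device-mapper.h>	/* DMWARN() */

/* Illustrative only: accept the argument, warn, and clamp to 1. */
static unsigned int ps_parse_repeat_count(const char *arg)
{
	unsigned int rc;

	if (kstrtouint(arg, 10, &rc) || !rc)
		rc = 1;

	if (rc > 1) {
		DMWARN("repeat_count > 1 is no longer supported");
		rc = 1;
	}

	return rc;
}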

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: RCU-ified dm-mpath for testing/review
@ 2016-02-12 15:26                                                                 ` Mike Snitzer
  0 siblings, 0 replies; 127+ messages in thread
From: Mike Snitzer @ 2016-02-12 15:26 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: axboe, keith.busch, Christoph Hellwig, Sagi Grimberg, linux-nvme,
	linux-block, device-mapper development, Bart Van Assche

On Fri, Feb 12 2016 at 10:18am -0500,
Hannes Reinecke <hare@suse.de> wrote:

> On 02/11/2016 04:34 PM, Mike Snitzer wrote:
> > On Wed, Feb 10 2016 at  8:50pm -0500,
> > Mike Snitzer <snitzer@redhat.com> wrote:
> > 
> >> On Tue, Feb 09 2016 at  7:45pm -0500,
> >> Mike Snitzer <snitzer@redhat.com> wrote:
> >>
> >>>
> >>> OK, I took a crack at embracing RCU.  Only slightly better performance
> >>> on my single NUMA node testbed.  (But I'll have to track down a system
> >>> with multiple NUMA nodes to do any justice to the next wave of this
> >>> optimization effort)
> >>>
> >>> This RCU work is very heavy-handed and way too fiddly (there could
> >>> easily be bugs).  Anyway, please see:
> >>> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=devel2&id=d80a7e4f8b5be9c81e4d452137623b003fa64745
> >>>
> >>> But this might give you something to build on to arrive at something
> >>> more scalable?
> >>
> >> I've a bit more polished version of this work (broken up into multiple
> >> commits, with some fixes, etc) here:
> >> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel3
> >>
> >> Hannes and/or Sagi, if you get a chance to try this on your NUMA system
> >> please let me know how it goes.
> > 
> > Initial review has uncovered some locking problems with the current code
> > (nothing that caused crashes or hangs in my testing but...) so please
> > hold off on testing until you hear from me (hopefully tomorrow).
> > 
> Good news is that I've managed to hit the roof for my array with the
> devel2 version of those patches. (And a _heavily_ patched-up lpfc
> driver :-)
> So from that perspective everything's fine now; we've reached the
> hardware limit for my setup.
> Which in itself is quite impressive; beating Intel P3700 with 16FC
> is not bad methinks :-)
> 
> So thanks for all your work here.

Ah, that's really good news!  But devel2 is definitely _not_ destined
for upstream.  'devel3' is much closer to "ready".  But your testing and
review would really push it forward.

Please see/test:
http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel3

Also, please read this header:
http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=devel3&id=65a01b76502dd68e8ca298ee6614c0151b677f4a

Even with devel2 I hacked it such that repeat_count > 1 is effectively
broken.  I'm now _seriously_ considering deprecating repeat_count
completely (adding a DMWARN that will inform the user. e.g.:
"repeat_count > 1 is no longer supported").  I see no point going to
great lengths to maintain a dm-mpath feature that was only a hack for
when dm-mpath was bio-based.  What do you think?

Thanks,
Mike

^ permalink raw reply	[flat|nested] 127+ messages in thread

* RCU-ified dm-mpath for testing/review
  2016-02-12 15:26                                                                 ` Mike Snitzer
@ 2016-02-12 16:04                                                                   ` Hannes Reinecke
  -1 siblings, 0 replies; 127+ messages in thread
From: Hannes Reinecke @ 2016-02-12 16:04 UTC (permalink / raw)


On 02/12/2016 04:26 PM, Mike Snitzer wrote:
> On Fri, Feb 12 2016 at 10:18am -0500,
> Hannes Reinecke <hare@suse.de> wrote:
>
>> On 02/11/2016 04:34 PM, Mike Snitzer wrote:
>>> On Wed, Feb 10 2016 at  8:50pm -0500,
>>> Mike Snitzer <snitzer@redhat.com> wrote:
>>>
>>>> On Tue, Feb 09 2016 at  7:45pm -0500,
>>>> Mike Snitzer <snitzer@redhat.com> wrote:
>>>>
>>>>>
>>>>> OK, I took a crack at embracing RCU.  Only slightly better performance
>>>>> on my single NUMA node testbed.  (But I'll have to track down a system
>>>>> with multiple NUMA nodes to do any justice to the next wave of this
>>>>> optimization effort)
>>>>>
>>>>> This RCU work is very heavy-handed and way too fiddly (there could
>>>>> easily be bugs).  Anyway, please see:
>>>>> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=devel2&id=d80a7e4f8b5be9c81e4d452137623b003fa64745
>>>>>
>>>>> But this might give you something to build on to arrive at something
>>>>> more scalable?
>>>>
>>>> I've a bit more polished version of this work (broken up into multiple
>>>> commits, with some fixes, etc) here:
>>>> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel3
>>>>
>>>> Hannes and/or Sagi, if you get a chance to try this on your NUMA system
>>>> please let me know how it goes.
>>>
>>> Initial review has uncovered some locking problems with the current code
>>> (nothing that caused crashes or hangs in my testing but...) so please
>>> hold off on testing until you hear from me (hopefully tomorrow).
>>>
>> Good news is that I've managed to hit the roof for my array with the
>> devel2 version of those patches. (And a _heavily_ patched-up lpfc
>> driver :-)
>> So from that perspective everything's fine now; we've reached the
>> hardware limit for my setup.
>> Which in itself is quite impressive; beating Intel P3700 with 16FC
>> is not bad methinks :-)
>>
>> So thanks for all your work here.
>
> Ah, that's really good news!  But devel2 is definitely _not_ destined
> for upstream.  'devel3' is much closer to "ready".  But your testing and
> review would really push it forward.
>
> Please see/test:
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel3
>
> Also, please read this header:
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=devel3&id=65a01b76502dd68e8ca298ee6614c0151b677f4a
>
> Even with devel2 I hacked it such that repeat_count > 1 is effectively
> broken.  I'm now _seriously_ considering deprecating repeat_count
> completely (adding a DMWARN that will inform the user. e.g.:
> "repeat_count > 1 is no longer supported").  I see no point going to
> great lengths to maintain a dm-mpath feature that was only a hack for
> when dm-mpath was bio-based.  What do you think?
>
Drop it, and make setting it a no-op.
Never liked it anyway, and these decisions should really be delegated to 
the path selector.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: RCU-ified dm-mpath for testing/review
@ 2016-02-12 16:04                                                                   ` Hannes Reinecke
  0 siblings, 0 replies; 127+ messages in thread
From: Hannes Reinecke @ 2016-02-12 16:04 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: axboe, keith.busch, Christoph Hellwig, Sagi Grimberg, linux-nvme,
	linux-block, device-mapper development, Bart Van Assche

On 02/12/2016 04:26 PM, Mike Snitzer wrote:
> On Fri, Feb 12 2016 at 10:18am -0500,
> Hannes Reinecke <hare@suse.de> wrote:
>
>> On 02/11/2016 04:34 PM, Mike Snitzer wrote:
>>> On Wed, Feb 10 2016 at  8:50pm -0500,
>>> Mike Snitzer <snitzer@redhat.com> wrote:
>>>
>>>> On Tue, Feb 09 2016 at  7:45pm -0500,
>>>> Mike Snitzer <snitzer@redhat.com> wrote:
>>>>
>>>>>
>>>>> OK, I took a crack at embracing RCU.  Only slightly better performance
>>>>> on my single NUMA node testbed.  (But I'll have to track down a system
>>>>> with multiple NUMA nodes to do any justice to the next wave of this
>>>>> optimization effort)
>>>>>
>>>>> This RCU work is very heavy-handed and way too fiddly (there could
>>>>> easily be bugs).  Anyway, please see:
>>>>> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=devel2&id=d80a7e4f8b5be9c81e4d452137623b003fa64745
>>>>>
>>>>> But this might give you something to build on to arrive at something
>>>>> more scalable?
>>>>
>>>> I've a bit more polished version of this work (broken up into multiple
>>>> commits, with some fixes, etc) here:
>>>> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel3
>>>>
>>>> Hannes and/or Sagi, if you get a chance to try this on your NUMA system
>>>> please let me know how it goes.
>>>
>>> Initial review has uncovered some locking problems with the current code
>>> (nothing that caused crashes or hangs in my testing but...) so please
>>> hold off on testing until you hear from me (hopefully tomorrow).
>>>
>> Good news is that I've managed to hit the roof for my array with the
>> devel2 version of those patches. (And a _heavily_ patched-up lpfc
>> driver :-)
>> So from that perspective everything's fine now; we've reached the
>> hardware limit for my setup.
>> Which in itself is quite impressive; beating Intel P3700 with 16FC
>> is not bad methinks :-)
>>
>> So thanks for all your work here.
>
> Ah, that's really good news!  But devel2 is definitely _not_ destined
> for upstream.  'devel3' is much closer to "ready".  But your testing and
> review would really push it forward.
>
> Please see/test:
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel3
>
> Also, please read this header:
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=devel3&id=65a01b76502dd68e8ca298ee6614c0151b677f4a
>
> Even with devel2 I hacked it such that repeat_count > 1 is effectively
> broken.  I'm now _seriously_ considering deprecating repeat_count
> completely (adding a DMWARN that will inform the user. e.g.:
> "repeat_count > 1 is no longer supported").  I see no point going to
> great lengths to maintain a dm-mpath feature that was only a hack for
> when dm-mpath was bio-based.  What do you think?
>
Drop it, and make setting it a no-op.
Never liked it anyway, and these decisions should really be delegated to 
the path selector.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 127+ messages in thread

* RCU-ified dm-mpath for testing/review
  2016-02-12 16:04                                                                   ` Hannes Reinecke
@ 2016-02-12 18:00                                                                     ` Mike Snitzer
  -1 siblings, 0 replies; 127+ messages in thread
From: Mike Snitzer @ 2016-02-12 18:00 UTC (permalink / raw)


On Fri, Feb 12 2016 at 11:04am -0500,
Hannes Reinecke <hare@suse.de> wrote:

> On 02/12/2016 04:26 PM, Mike Snitzer wrote:
> >On Fri, Feb 12 2016 at 10:18am -0500,
> >Hannes Reinecke <hare@suse.de> wrote:
> >>>
> >>Good news is that I've managed to hit the roof for my array with the
> >>devel2 version of those patches. (And a _heavily_ patched-up lpfc
> >>driver :-)
> >>So from that perspective everything's fine now; we've reached the
> >>hardware limit for my setup.
> >>Which in itself is quite impressive; beating Intel P3700 with 16FC
> >>is not bad methinks :-)
> >>
> >>So thanks for all your work here.
> >
> >Ah, that's really good news!  But devel2 is definitely _not_ destined
> >for upstream.  'devel3' is much closer to "ready".  But your testing and
> >review would really push it forward.
> >
> >Please see/test:
> >http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel3
> >
> >Also, please read this header:
> >http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=devel3&id=65a01b76502dd68e8ca298ee6614c0151b677f4a
> >
> >Even with devel2 I hacked it such that repeat_count > 1 is effectively
> >broken.  I'm now _seriously_ considering deprecating repeat_count
> >completely (adding a DMWARN that will inform the user. e.g.:
> >"repeat_count > 1 is no longer supported").  I see no point going to
> >great lengths to maintain a dm-mpath feature that was only a hack for
> >when dm-mpath was bio-based.  What do you think?
>
> Drop it, and make setting it a no-op.
> Never liked it anyway, and these decisions should really be
> delegated to the path selector.

Sure, but my point is DM-mpath will no longer be able to provide the
ability to properly handle repeat_count > 1 (because updating the
->current_pgpath crushes the write-side of the RCU).

As such I've rebased 'devel3' to impose repeat_count = 1 (both in
dm-mpath.c and the defaults in each path selector).
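
For illustration only (simplified, hypothetical code -- select_new_path()
stands in for the path-selector call; this is not the pre-RCU dm-mpath
source): honouring repeat_count meant writing per-target state on every map,
and doing that against an RCU-published ->current_pgpath would put every I/O
on the expensive write side, which is the point above:

#include <linux/spinlock.h>

struct pgpath;					/* opaque here */
struct pgpath *select_new_path(void *ps);	/* hypothetical selector call */

struct multipath_legacy_sketch {
	spinlock_t lock;
	void *path_selector;
	struct pgpath *current_pgpath;
	unsigned int repeat_count;	/* configured, >= 1 */
	unsigned int current_count;	/* uses left on current_pgpath */
};

/* Hypothetical pre-RCU fast path: per-target state is written on EVERY map. */
static struct pgpath *legacy_choose_pgpath(struct multipath_legacy_sketch *m)
{
	struct pgpath *pgpath;
	unsigned long flags;

	spin_lock_irqsave(&m->lock, flags);
	if (!m->current_pgpath || --m->current_count == 0) {
		m->current_pgpath = select_new_path(m->path_selector);
		m->current_count = m->repeat_count;
	}
	pgpath = m->current_pgpath;
	spin_unlock_irqrestore(&m->lock, flags);

	return pgpath;
}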

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: RCU-ified dm-mpath for testing/review
@ 2016-02-12 18:00                                                                     ` Mike Snitzer
  0 siblings, 0 replies; 127+ messages in thread
From: Mike Snitzer @ 2016-02-12 18:00 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: axboe, keith.busch, Christoph Hellwig, Sagi Grimberg, linux-nvme,
	linux-block, device-mapper development, Bart Van Assche

On Fri, Feb 12 2016 at 11:04am -0500,
Hannes Reinecke <hare@suse.de> wrote:

> On 02/12/2016 04:26 PM, Mike Snitzer wrote:
> >On Fri, Feb 12 2016 at 10:18am -0500,
> >Hannes Reinecke <hare@suse.de> wrote:
> >>>
> >>Good news is that I've managed to hit the roof for my array with the
> >>devel2 version of those patches. (And a _heavily_ patched-up lpfc
> >>driver :-)
> >>So from that perspective everything's fine now; we've reached the
> >>hardware limit for my setup.
> >>Which in itself is quite impressive; beating Intel P3700 with 16FC
> >>is not bad methinks :-)
> >>
> >>So thanks for all your work here.
> >
> >Ah, that's really good news!  But devel2 is definitely _not_ destined
> >for upstream.  'devel3' is much closer to "ready".  But your testing and
> >review would really push it forward.
> >
> >Please see/test:
> >http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel3
> >
> >Also, please read this header:
> >http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=devel3&id=65a01b76502dd68e8ca298ee6614c0151b677f4a
> >
> >Even with devel2 I hacked it such that repeat_count > 1 is effectively
> >broken.  I'm now _seriously_ considering deprecating repeat_count
> >completely (adding a DMWARN that will inform the user. e.g.:
> >"repeat_count > 1 is no longer supported").  I see no point going to
> >great lengths to maintain a dm-mpath feature that was only a hack for
> >when dm-mpath was bio-based.  What do you think?
>
> Drop it, and make setting it a no-op.
> Never liked it anyway, and these decisions should really be
> delegated to the path selector.

Sure, but my point is DM-mpath will no longer be able to provide the
ability to properly handle repeat_count > 1 (because updating the
->current_pgpath crushes the write-side of the RCU).

As such I've rebased 'devel3' to impose repeat_count = 1 (both in
dm-mpath.c and the defaults in each path selector).

^ permalink raw reply	[flat|nested] 127+ messages in thread

* RCU-ified dm-mpath for testing/review
  2016-02-12 18:00                                                                     ` Mike Snitzer
@ 2016-02-15  6:47                                                                       ` Hannes Reinecke
  -1 siblings, 0 replies; 127+ messages in thread
From: Hannes Reinecke @ 2016-02-15  6:47 UTC (permalink / raw)


On 02/12/2016 07:00 PM, Mike Snitzer wrote:
> On Fri, Feb 12 2016 at 11:04am -0500,
> Hannes Reinecke <hare@suse.de> wrote:
> 
>> On 02/12/2016 04:26 PM, Mike Snitzer wrote:
>>> On Fri, Feb 12 2016 at 10:18am -0500,
>>> Hannes Reinecke <hare@suse.de> wrote:
>>>>>
>>>> Good news is that I've managed to hit the roof for my array with the
>>>> devel2 version of those patches. (And a _heavily_ patched-up lpfc
>>>> driver :-)
>>>> So from that perspective everything's fine now; we've reached the
>>>> hardware limit for my setup.
>>>> Which in itself is quite impressive; beating Intel P3700 with 16FC
>>>> is not bad methinks :-)
>>>>
>>>> So thanks for all your work here.
>>>
>>> Ah, that's really good news!  But devel2 is definitely _not_ destined
>>> for upstream.  'devel3' is much closer to "ready".  But your testing and
>>> review would really push it forward.
>>>
>>> Please see/test:
>>> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel3
>>>
>>> Also, please read this header:
>>> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=devel3&id=65a01b76502dd68e8ca298ee6614c0151b677f4a
>>>
>>> Even with devel2 I hacked it such that repeat_count > 1 is effectively
>>> broken.  I'm now _seriously_ considering deprecating repeat_count
>>> completely (adding a DMWARN that will inform the user. e.g.:
>>> "repeat_count > 1 is no longer supported").  I see no point going to
>>> great lengths to maintain a dm-mpath feature that was only a hack for
>>> when dm-mpath was bio-based.  What do you think?
>>
>> Drop it, and make setting it a no-op.
>> Never liked it anyway, and these decisions should really be
>> delegated to the path selector.
> 
> Sure, but my point is DM-mpath will no longer be able to provide the
> ability to properly handle repeat_count > 1 (because updating the
> ->current_pgpath crushes the write-side of the RCU).
> 
Fully understood. But as I said, the _need_ for repeat_count has
essentially vanished with the move to request-based multipath.

> As such I've rebased 'devel3' to impose repeat_count = 1 (both in
> dm-mpath.c and the defaults in each path selector).

Kewl.
I'll give it a go.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: RCU-ified dm-mpath for testing/review
@ 2016-02-15  6:47                                                                       ` Hannes Reinecke
  0 siblings, 0 replies; 127+ messages in thread
From: Hannes Reinecke @ 2016-02-15  6:47 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: axboe, keith.busch, Christoph Hellwig, Sagi Grimberg, linux-nvme,
	linux-block, device-mapper development, Bart Van Assche

On 02/12/2016 07:00 PM, Mike Snitzer wrote:
> On Fri, Feb 12 2016 at 11:04am -0500,
> Hannes Reinecke <hare@suse.de> wrote:
> 
>> On 02/12/2016 04:26 PM, Mike Snitzer wrote:
>>> On Fri, Feb 12 2016 at 10:18am -0500,
>>> Hannes Reinecke <hare@suse.de> wrote:
>>>>>
>>>> Good news is that I've managed to hit the roof for my array with the
>>>> devel2 version of those patches. (And a _heavily_ patched-up lpfc
>>>> driver :-)
>>>> So from that perspective everything's fine now; we've reached the
>>>> hardware limit for my setup.
>>>> Which in itself is quite impressive; beating Intel P3700 with 16FC
>>>> is not bad methinks :-)
>>>>
>>>> So thanks for all your work here.
>>>
>>> Ah, that's really good news!  But devel2 is definitely _not_ destined
>>> for upstream.  'devel3' is much closer to "ready".  But your testing and
>>> review would really push it forward.
>>>
>>> Please see/test:
>>> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel3
>>>
>>> Also, please read this header:
>>> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=devel3&id=65a01b76502dd68e8ca298ee6614c0151b677f4a
>>>
>>> Even with devel2 I hacked it such that repeat_count > 1 is effectively
>>> broken.  I'm now _seriously_ considering deprecating repeat_count
>>> completely (adding a DMWARN that will inform the user. e.g.:
>>> "repeat_count > 1 is no longer supported").  I see no point going to
>>> great lengths to maintain a dm-mpath feature that was only a hack for
>>> when dm-mpath was bio-based.  What do you think?
>>
>> Drop it, and make setting it a no-op.
>> Never liked it anyway, and these decisions should really be
>> delegated to the path selector.
> 
> Sure, but my point is DM-mpath will no longer be able to provide the
> ability to properly handle repeat_count > 1 (because updating the
> ->current_pgpath crushes the write-side of the RCU).
> 
Fully understood. But as I said, the _need_ for repeat_count has
essentially vanished with the move to request-based multipath.

> As such I've rebased 'devel3' to impose repeat_count = 1 (both in
> dm-mpath.c and the defaults in each path selector).

Kewl.
I'll give it a go.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 127+ messages in thread

end of thread, other threads:[~2016-02-15  6:47 UTC | newest]

Thread overview: 127+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-01-18 12:04 dm-multipath low performance with blk-mq Sagi Grimberg
2016-01-19 10:37 ` Sagi Grimberg
2016-01-19 22:45   ` Mike Snitzer
2016-01-19 22:45     ` Mike Snitzer
2016-01-25 21:40     ` Mike Snitzer
2016-01-25 21:40       ` Mike Snitzer
2016-01-25 23:37       ` [dm-devel] " Benjamin Marzinski
2016-01-25 23:37         ` Benjamin Marzinski
2016-01-26 13:29         ` Mike Snitzer
2016-01-26 13:29           ` Mike Snitzer
2016-01-26 14:01           ` Hannes Reinecke
2016-01-26 14:47             ` Mike Snitzer
2016-01-26 14:47               ` Mike Snitzer
2016-01-26 14:56               ` Christoph Hellwig
2016-01-26 14:56                 ` Christoph Hellwig
2016-01-26 15:27                 ` Mike Snitzer
2016-01-26 15:27                   ` Mike Snitzer
2016-01-26 15:57             ` Benjamin Marzinski
2016-01-27 11:14           ` Sagi Grimberg
2016-01-27 11:14             ` Sagi Grimberg
2016-01-27 17:48             ` Mike Snitzer
2016-01-27 17:48               ` Mike Snitzer
2016-01-27 17:51               ` Jens Axboe
2016-01-27 17:51                 ` Jens Axboe
2016-01-27 18:16                 ` Mike Snitzer
2016-01-27 18:16                   ` Mike Snitzer
2016-01-27 18:26                   ` Jens Axboe
2016-01-27 18:26                     ` Jens Axboe
2016-01-27 19:14                     ` Mike Snitzer
2016-01-27 19:14                       ` Mike Snitzer
2016-01-27 19:50                       ` Jens Axboe
2016-01-27 19:50                         ` Jens Axboe
2016-01-27 17:56               ` Sagi Grimberg
2016-01-27 17:56                 ` Sagi Grimberg
2016-01-27 18:42                 ` Mike Snitzer
2016-01-27 18:42                   ` Mike Snitzer
2016-01-27 19:49                   ` Jens Axboe
2016-01-27 19:49                     ` Jens Axboe
2016-01-27 20:45                     ` Mike Snitzer
2016-01-27 20:45                       ` Mike Snitzer
2016-01-29 23:35                 ` Mike Snitzer
2016-01-29 23:35                   ` Mike Snitzer
2016-01-30  8:52                   ` Hannes Reinecke
2016-01-30  8:52                     ` Hannes Reinecke
2016-01-30 19:12                     ` Mike Snitzer
2016-01-30 19:12                       ` Mike Snitzer
2016-02-01  6:46                       ` Hannes Reinecke
2016-02-01  6:46                         ` Hannes Reinecke
2016-02-03 18:04                         ` Mike Snitzer
2016-02-03 18:04                           ` Mike Snitzer
2016-02-03 18:24                           ` Mike Snitzer
2016-02-03 18:24                             ` Mike Snitzer
2016-02-03 19:22                             ` Mike Snitzer
2016-02-03 19:22                               ` Mike Snitzer
2016-02-04  6:54                             ` Hannes Reinecke
2016-02-04  6:54                               ` Hannes Reinecke
2016-02-04 13:54                               ` Mike Snitzer
2016-02-04 13:54                                 ` Mike Snitzer
2016-02-04 13:58                                 ` Hannes Reinecke
2016-02-04 13:58                                   ` Hannes Reinecke
2016-02-04 14:09                                   ` Mike Snitzer
2016-02-04 14:09                                     ` Mike Snitzer
2016-02-04 14:32                                     ` Hannes Reinecke
2016-02-04 14:32                                       ` Hannes Reinecke
2016-02-04 14:44                                       ` Mike Snitzer
2016-02-04 14:44                                         ` Mike Snitzer
2016-02-05 15:13                                 ` [RFC PATCH] dm: fix excessive dm-mq context switching Mike Snitzer
2016-02-05 15:13                                   ` Mike Snitzer
2016-02-05 18:05                                   ` Mike Snitzer
2016-02-05 18:05                                     ` Mike Snitzer
2016-02-05 19:19                                     ` Mike Snitzer
2016-02-05 19:19                                       ` Mike Snitzer
2016-02-07 15:41                                       ` Sagi Grimberg
2016-02-07 15:41                                         ` Sagi Grimberg
2016-02-07 16:07                                         ` Mike Snitzer
2016-02-07 16:07                                           ` Mike Snitzer
2016-02-07 16:42                                           ` Sagi Grimberg
2016-02-07 16:42                                             ` Sagi Grimberg
2016-02-07 16:37                                         ` Bart Van Assche
2016-02-07 16:37                                           ` Bart Van Assche
2016-02-07 16:43                                           ` Sagi Grimberg
2016-02-07 16:43                                             ` Sagi Grimberg
2016-02-07 16:53                                             ` Mike Snitzer
2016-02-07 16:53                                               ` Mike Snitzer
2016-02-07 16:54                                             ` Sagi Grimberg
2016-02-07 16:54                                               ` Sagi Grimberg
2016-02-07 17:20                                               ` Mike Snitzer
2016-02-07 17:20                                                 ` Mike Snitzer
2016-02-08 12:21                                                 ` Sagi Grimberg
2016-02-08 12:21                                                   ` Sagi Grimberg
2016-02-08 14:34                                                   ` Mike Snitzer
2016-02-08 14:34                                                     ` Mike Snitzer
2016-02-09  7:50                                                 ` Hannes Reinecke
2016-02-09  7:50                                                   ` Hannes Reinecke
2016-02-09 14:55                                                   ` Mike Snitzer
2016-02-09 14:55                                                     ` Mike Snitzer
2016-02-09 15:32                                                     ` Hannes Reinecke
2016-02-09 15:32                                                       ` Hannes Reinecke
2016-02-10  0:45                                                       ` Mike Snitzer
2016-02-10  0:45                                                         ` Mike Snitzer
2016-02-11  1:50                                                         ` RCU-ified dm-mpath for testing/review Mike Snitzer
2016-02-11  3:35                                                           ` Mike Snitzer
2016-02-11  3:35                                                             ` Mike Snitzer
2016-02-11 15:34                                                           ` Mike Snitzer
2016-02-11 15:34                                                             ` Mike Snitzer
2016-02-12 15:18                                                             ` Hannes Reinecke
2016-02-12 15:18                                                               ` Hannes Reinecke
2016-02-12 15:26                                                               ` Mike Snitzer
2016-02-12 15:26                                                                 ` Mike Snitzer
2016-02-12 16:04                                                                 ` Hannes Reinecke
2016-02-12 16:04                                                                   ` Hannes Reinecke
2016-02-12 18:00                                                                   ` Mike Snitzer
2016-02-12 18:00                                                                     ` Mike Snitzer
2016-02-15  6:47                                                                     ` Hannes Reinecke
2016-02-15  6:47                                                                       ` Hannes Reinecke
2016-01-26  1:49       ` [dm-devel] dm-multipath low performance with blk-mq Benjamin Marzinski
2016-01-26  1:49         ` Benjamin Marzinski
2016-01-26 16:03       ` Mike Snitzer
2016-01-26 16:03         ` Mike Snitzer
2016-01-26 16:44         ` Christoph Hellwig
2016-01-26 16:44           ` Christoph Hellwig
2016-01-27  2:09           ` Mike Snitzer
2016-01-27  2:09             ` Mike Snitzer
2016-01-27 11:10             ` Sagi Grimberg
2016-01-27 11:10               ` Sagi Grimberg
2016-01-26 21:40         ` [dm-devel] " Benjamin Marzinski
2016-01-26 21:40           ` Benjamin Marzinski
