* [PATCH RESEND v2 1/2] blk-mq: Fix failed allocation path when mapping queues
From: Gabriel Krisman Bertazi @ 2016-12-06 15:31 UTC
  To: axboe
  Cc: wenxiong, gpiccoli, hch, Gabriel Krisman Bertazi, Brian King,
	Douglas Miller, linux-block, linux-scsi

In blk_mq_map_swqueue, there is a memory optimization that frees the
tags of a queue that has gone unmapped.  Later, if that hctx is remapped
after another topology change, the tags need to be reallocated.

If this allocation fails, a simple WARN_ON triggers, but the block layer
ends up with an active hctx without any corresponding set of tags.
Then, any incoming IO to that hctx can trigger an Oops.

I can reproduce it consistently by running IO, flipping CPUs on and off,
and eventually injecting a memory allocation failure in that path.

In the fix below, if the system experiences a failed allocation of any
hctx's tags, we remap all the ctxs of that queue to hctx 0, which
always keeps its tags.  There is a minor performance hit, since the
mapping just got worse in the error path, but this is the simplest way
to handle the failure, and the hit disappears after the next successful
remap.
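
To make the fallback concrete, here is a minimal userspace model of the
remapping loop.  It is illustrative only: the names (alloc_tags,
NR_HW_QUEUES) and the simulated allocation failure are made up, and it
is not the kernel code from the diff below.

#include <stdio.h>
#include <stdbool.h>

#define NR_CPUS		8
#define NR_HW_QUEUES	4

/* Pretend that allocating tags for hctx 2 always fails. */
static bool alloc_tags(int hctx_idx)
{
	return hctx_idx != 2;
}

int main(void)
{
	int mq_map[NR_CPUS];
	bool tags[NR_HW_QUEUES] = { true };	/* hctx 0 always keeps its tags */
	int cpu;

	/* Initial mapping: spread CPUs across the hardware queues. */
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		mq_map[cpu] = cpu % NR_HW_QUEUES;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		int hctx_idx = mq_map[cpu];

		if (!tags[hctx_idx]) {
			tags[hctx_idx] = alloc_tags(hctx_idx);
			if (!tags[hctx_idx])
				mq_map[cpu] = 0;	/* fall back to hctx 0 */
		}
		printf("cpu %d -> hctx %d\n", cpu, mq_map[cpu]);
	}
	return 0;
}

Every CPU that would land on the failed hctx ends up on hctx 0 instead,
which is why the patch also never frees hctx 0's tags.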

I considered dropping the memory optimization all together, but it
seemed a bad trade-off to handle this very specific error case.

This should apply cleanly on top of Jen's for-next branch.

The Oops is the one below.  Note the faulting address (dar: 28),
consistent with reading a member at a small offset from a NULL tags
pointer:

SP (3fff935ce4d0) is in userspace
1:mon> e
cpu 0x1: Vector: 300 (Data Access) at [c000000fe99eb110]
    pc: c0000000005e868c: __sbitmap_queue_get+0x2c/0x180
    lr: c000000000575328: __bt_get+0x48/0xd0
    sp: c000000fe99eb390
   msr: 900000010280b033
   dar: 28
 dsisr: 40000000
  current = 0xc000000fe9966800
  paca    = 0xc000000007e80300   softe: 0        irq_happened: 0x01
    pid   = 11035, comm = aio-stress
Linux version 4.8.0-rc6+ (root@bean) (gcc version 5.4.0 20160609
(Ubuntu/IBM 5.4.0-6ubuntu1~16.04.2) ) #3 SMP Mon Oct 10 20:16:53 CDT 2016
1:mon> s
[c000000fe99eb3d0] c000000000575328 __bt_get+0x48/0xd0
[c000000fe99eb400] c000000000575838 bt_get.isra.1+0x78/0x2d0
[c000000fe99eb480] c000000000575cb4 blk_mq_get_tag+0x44/0x100
[c000000fe99eb4b0] c00000000056f6f4 __blk_mq_alloc_request+0x44/0x220
[c000000fe99eb500] c000000000570050 blk_mq_map_request+0x100/0x1f0
[c000000fe99eb580] c000000000574650 blk_mq_make_request+0xf0/0x540
[c000000fe99eb640] c000000000561c44 generic_make_request+0x144/0x230
[c000000fe99eb690] c000000000561e00 submit_bio+0xd0/0x200
[c000000fe99eb740] c0000000003ef740 ext4_io_submit+0x90/0xb0
[c000000fe99eb770] c0000000003e95d8 ext4_writepages+0x588/0xdd0
[c000000fe99eb910] c00000000025a9f0 do_writepages+0x60/0xc0
[c000000fe99eb940] c000000000246c88 __filemap_fdatawrite_range+0xf8/0x180
[c000000fe99eb9e0] c000000000246f90 filemap_write_and_wait_range+0x70/0xf0
[c000000fe99eba20] c0000000003dd844 ext4_sync_file+0x214/0x540
[c000000fe99eba80] c000000000364718 vfs_fsync_range+0x78/0x130
[c000000fe99ebad0] c0000000003dd46c ext4_file_write_iter+0x35c/0x430
[c000000fe99ebb90] c00000000038c280 aio_run_iocb+0x3b0/0x450
[c000000fe99ebce0] c00000000038dc28 do_io_submit+0x368/0x730
[c000000fe99ebe30] c000000000009404 system_call+0x38/0xec

Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Cc: Brian King <brking@linux.vnet.ibm.com>
Cc: Douglas Miller <dougmill@linux.vnet.ibm.com>
Cc: linux-block@vger.kernel.org
Cc: linux-scsi@vger.kernel.org
---
 block/blk-mq.c | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6fb94bd69375..6718f894fbe1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1870,7 +1870,7 @@ static void blk_mq_init_cpu_queues(struct request_queue *q,
 static void blk_mq_map_swqueue(struct request_queue *q,
 			       const struct cpumask *online_mask)
 {
-	unsigned int i;
+	unsigned int i, hctx_idx;
 	struct blk_mq_hw_ctx *hctx;
 	struct blk_mq_ctx *ctx;
 	struct blk_mq_tag_set *set = q->tag_set;
@@ -1893,6 +1893,15 @@ static void blk_mq_map_swqueue(struct request_queue *q,
 		if (!cpumask_test_cpu(i, online_mask))
 			continue;
 
+		hctx_idx = q->mq_map[i];
+		/* unmapped hw queue can be remapped after CPU topo changed */
+		if (!set->tags[hctx_idx]) {
+			set->tags[hctx_idx] = blk_mq_init_rq_map(set, hctx_idx);
+
+			if (!set->tags[hctx_idx])
+				q->mq_map[i] = 0;
+		}
+
 		ctx = per_cpu_ptr(q->queue_ctx, i);
 		hctx = blk_mq_map_queue(q, i);
 
@@ -1909,7 +1918,10 @@ static void blk_mq_map_swqueue(struct request_queue *q,
 		 * disable it and free the request entries.
 		 */
 		if (!hctx->nr_ctx) {
-			if (set->tags[i]) {
+			/* Never unmap queue 0.  We need it as a
+			 * fallback in case a new remap fails to
+			 * allocate tags. */
+			if (i && set->tags[i]) {
 				blk_mq_free_rq_map(set, set->tags[i], i);
 				set->tags[i] = NULL;
 			}
@@ -1917,11 +1929,8 @@ static void blk_mq_map_swqueue(struct request_queue *q,
 			continue;
 		}
 
-		/* unmapped hw queue can be remapped after CPU topo changed */
-		if (!set->tags[i])
-			set->tags[i] = blk_mq_init_rq_map(set, i);
 		hctx->tags = set->tags[i];
-		WARN_ON(!hctx->tags);
+		BUG_ON(!hctx->tags);
 
 		/*
 		 * Set the map size to the number of mapped software queues.
-- 
2.7.4


* [PATCH RESEND v2 2/2] blk-mq: Avoid memory reclaim when remapping queues
From: Gabriel Krisman Bertazi @ 2016-12-06 15:31 UTC
  To: axboe
  Cc: wenxiong, gpiccoli, hch, Gabriel Krisman Bertazi, Brian King,
	Douglas Miller, linux-block, linux-scsi

While changing SMT settings under simultaneous memory and IO stress, we
were able to consistently trigger deadlocks in the mm subsystem, which
froze the entire machine.

I think that under memory stress conditions, the large allocations
performed by blk_mq_init_rq_map may trigger direct reclaim, which then
stalls waiting for the block layer remapping to complete, deadlocking
the system.  The trace below was collected after the machine stalled,
waiting for the hotplug event completion.

The simplest fix is to prevent allocations in this path from entering
the IO path during reclaim, by using GFP_NOIO.  With this patch, we
could no longer hit the issue.
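
As an aside, an alternative would be the scope-based NOIO API, which
makes every allocation in the region behave as GFP_NOIO.  The sketch
below is untested and is not what this patch does; the helpers are
declared in <linux/sched.h> at the time of this thread:

	unsigned int noio_flag;

	noio_flag = memalloc_noio_save();
	/* allocations in here implicitly drop __GFP_IO and __GFP_FS */
	set->tags[hctx_idx] = blk_mq_init_rq_map(set, hctx_idx);
	memalloc_noio_restore(noio_flag);

I kept the explicit GFP_NOIO flags instead, since only a handful of
allocation sites in blk_mq_init_rq_map are involved.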

This should apply cleanly on top of Jen's for-next branch.

Changes since v1:
  - Use GFP_NOIO instead of GFP_NOWAIT.

 Call Trace:
[c000000f0160aaf0] [c000000f0160ab50] 0xc000000f0160ab50 (unreliable)
[c000000f0160acc0] [c000000000016624] __switch_to+0x2e4/0x430
[c000000f0160ad20] [c000000000b1a880] __schedule+0x310/0x9b0
[c000000f0160ae00] [c000000000b1af68] schedule+0x48/0xc0
[c000000f0160ae30] [c000000000b1b4b0] schedule_preempt_disabled+0x20/0x30
[c000000f0160ae50] [c000000000b1d4fc] __mutex_lock_slowpath+0xec/0x1f0
[c000000f0160aed0] [c000000000b1d678] mutex_lock+0x78/0xa0
[c000000f0160af00] [d000000019413cac] xfs_reclaim_inodes_ag+0x33c/0x380 [xfs]
[c000000f0160b0b0] [d000000019415164] xfs_reclaim_inodes_nr+0x54/0x70 [xfs]
[c000000f0160b0f0] [d0000000194297f8] xfs_fs_free_cached_objects+0x38/0x60 [xfs]
[c000000f0160b120] [c0000000003172c8] super_cache_scan+0x1f8/0x210
[c000000f0160b190] [c00000000026301c] shrink_slab.part.13+0x21c/0x4c0
[c000000f0160b2d0] [c000000000268088] shrink_zone+0x2d8/0x3c0
[c000000f0160b380] [c00000000026834c] do_try_to_free_pages+0x1dc/0x520
[c000000f0160b450] [c00000000026876c] try_to_free_pages+0xdc/0x250
[c000000f0160b4e0] [c000000000251978] __alloc_pages_nodemask+0x868/0x10d0
[c000000f0160b6f0] [c000000000567030] blk_mq_init_rq_map+0x160/0x380
[c000000f0160b7a0] [c00000000056758c] blk_mq_map_swqueue+0x33c/0x360
[c000000f0160b820] [c000000000567904] blk_mq_queue_reinit+0x64/0xb0
[c000000f0160b850] [c00000000056a16c] blk_mq_queue_reinit_notify+0x19c/0x250
[c000000f0160b8a0] [c0000000000f5d38] notifier_call_chain+0x98/0x100
[c000000f0160b8f0] [c0000000000c5fb0] __cpu_notify+0x70/0xe0
[c000000f0160b930] [c0000000000c63c4] notify_prepare+0x44/0xb0
[c000000f0160b9b0] [c0000000000c52f4] cpuhp_invoke_callback+0x84/0x250
[c000000f0160ba10] [c0000000000c570c] cpuhp_up_callbacks+0x5c/0x120
[c000000f0160ba60] [c0000000000c7cb8] _cpu_up+0xf8/0x1d0
[c000000f0160bac0] [c0000000000c7eb0] do_cpu_up+0x120/0x150
[c000000f0160bb40] [c0000000006fe024] cpu_subsys_online+0x64/0xe0
[c000000f0160bb90] [c0000000006f5124] device_online+0xb4/0x120
[c000000f0160bbd0] [c0000000006f5244] online_store+0xb4/0xc0
[c000000f0160bc20] [c0000000006f0a68] dev_attr_store+0x68/0xa0
[c000000f0160bc60] [c0000000003ccc30] sysfs_kf_write+0x80/0xb0
[c000000f0160bca0] [c0000000003cbabc] kernfs_fop_write+0x17c/0x250
[c000000f0160bcf0] [c00000000030fe6c] __vfs_write+0x6c/0x1e0
[c000000f0160bd90] [c000000000311490] vfs_write+0xd0/0x270
[c000000f0160bde0] [c0000000003131fc] SyS_write+0x6c/0x110
[c000000f0160be30] [c000000000009204] system_call+0x38/0xec

Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Cc: Brian King <brking@linux.vnet.ibm.com>
Cc: Douglas Miller <dougmill@linux.vnet.ibm.com>
Cc: linux-block@vger.kernel.org
Cc: linux-scsi@vger.kernel.org
---
 block/blk-mq.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6718f894fbe1..5f4e452eef72 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1605,7 +1605,7 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
 	INIT_LIST_HEAD(&tags->page_list);
 
 	tags->rqs = kzalloc_node(set->queue_depth * sizeof(struct request *),
-				 GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY,
+				 GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
 				 set->numa_node);
 	if (!tags->rqs) {
 		blk_mq_free_tags(tags);
@@ -1631,7 +1631,7 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
 
 		do {
 			page = alloc_pages_node(set->numa_node,
-				GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO,
+				GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO,
 				this_order);
 			if (page)
 				break;
@@ -1652,7 +1652,7 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
 		 * Allow kmemleak to scan these pages as they contain pointers
 		 * to additional allocations like via ops->init_request().
 		 */
-		kmemleak_alloc(p, order_to_size(this_order), 1, GFP_KERNEL);
+		kmemleak_alloc(p, order_to_size(this_order), 1, GFP_NOIO);
 		entries_per_page = order_to_size(this_order) / rq_size;
 		to_do = min(entries_per_page, set->queue_depth - i);
 		left -= to_do * rq_size;
-- 
2.7.4


* Re: [PATCH RESEND v2 1/2] blk-mq: Fix failed allocation path when mapping queues
From: Douglas Miller @ 2016-12-07 20:06 UTC
  To: Gabriel Krisman Bertazi, axboe
  Cc: wenxiong, gpiccoli, hch, Brian King, linux-block, linux-scsi

On 12/06/2016 09:31 AM, Gabriel Krisman Bertazi wrote:
> In blk_mq_map_swqueue, there is a memory optimization that frees the
> tags of a queue that has gone unmapped.  Later, if that hctx is remapped
> after another topology change, the tags need to be reallocated.
>
> [...]

Signed-off-by: Douglas Miller <dougmill@linux.vnet.ibm.com>


* Re: [PATCH RESEND v2 2/2] blk-mq: Avoid memory reclaim when remapping queues
From: Douglas Miller @ 2016-12-07 20:10 UTC
  To: Gabriel Krisman Bertazi, axboe
  Cc: wenxiong, gpiccoli, hch, Brian King, linux-block, linux-scsi

On 12/06/2016 09:31 AM, Gabriel Krisman Bertazi wrote:
> While changing SMT settings under simultaneous memory and IO stress, we
> were able to consistently trigger deadlocks in the mm subsystem, which
> froze the entire machine.
>
> [...]

Reviewed-by: Douglas Miller <dougmill@linux.vnet.ibm.com>


* Re: [PATCH RESEND v2 1/2] blk-mq: Fix failed allocation path when mapping queues
From: Douglas Miller @ 2016-12-07 20:12 UTC
  To: Gabriel Krisman Bertazi, axboe
  Cc: wenxiong, gpiccoli, hch, Brian King, linux-block, linux-scsi

On 12/07/2016 02:06 PM, Douglas Miller wrote:
> On 12/06/2016 09:31 AM, Gabriel Krisman Bertazi wrote:
>> [...]
> Signed-off-by: Douglas Miller <dougmill@linux.vnet.ibm.com>
Correction, should be:

Reviewed-by: Douglas Miller <dougmill@linux.vnet.ibm.com>


* Re: [PATCH RESEND v2 1/2] blk-mq: Fix failed allocation path when mapping queues
From: Gabriel Krisman Bertazi @ 2016-12-14  0:39 UTC
  To: axboe; +Cc: linux-block, dougmill

On 12/06/2016 09:31 AM, Gabriel Krisman Bertazi wrote:
> In blk_mq_map_swqueue, there is a memory optimization that frees the
> tags of a queue that has gone unmapped.  Later, if that hctx is remapped
> after another topology change, the tags need to be reallocated.
>
> [...]

Hi,

I saw this patchset missed the first PR for 4.10, though I'd really like
to see it merged in this cycle, if possible.  If you see anything
particularly concerning about this that I missed, please let me know,
but I fear this has been around for a while without much feedback.

Thanks,

-- 
Gabriel Krisman Bertazi


* Re: [PATCH RESEND v2 1/2] blk-mq: Fix failed allocation path when mapping queues
From: Douglas Miller @ 2016-12-14 14:11 UTC
  To: Gabriel Krisman Bertazi, axboe; +Cc: linux-block

On 12/13/2016 06:39 PM, Gabriel Krisman Bertazi wrote:
> On 12/06/2016 09:31 AM, Gabriel Krisman Bertazi wrote:
>> [...]
> Hi,
>
> I saw this patchset missed the first PR for 4.10, though I'd really like
> to see it merged in this cycle, if possible.  If you see anything
> particularly concerning about this that I missed, please let me know,
> but I fear this has been around for a while without much feedback.
>
> Thanks,
>
I want to echo what Gabriel said. We are now seeing evidence of this
condition often. This really needs to get fixed soon.

Thanks,



* Re: [PATCH RESEND v2 1/2] blk-mq: Fix failed allocation path when mapping queues
From: Jens Axboe @ 2016-12-14 15:13 UTC
  To: Gabriel Krisman Bertazi
  Cc: wenxiong, gpiccoli, hch, Brian King, Douglas Miller, linux-block,
	linux-scsi

On 12/06/2016 08:31 AM, Gabriel Krisman Bertazi wrote:
> This should apply cleanly on top of Jen's for-next branch.

Jens, not Jen.

> @@ -1893,6 +1893,15 @@ static void blk_mq_map_swqueue(struct request_queue *q,
>  		if (!cpumask_test_cpu(i, online_mask))
>  			continue;
>  
> +		hctx_idx = q->mq_map[i];
> +		/* unmapped hw queue can be remapped after CPU topo changed */
> +		if (!set->tags[hctx_idx]) {
> +			set->tags[hctx_idx] = blk_mq_init_rq_map(set, hctx_idx);
> +
> +			if (!set->tags[hctx_idx])
> +				q->mq_map[i] = 0;
> +		}

This needs a comment on why you're assigning mq_map[i] - because we keep
queue 0, in case the rest fail allocating.
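
Something along these lines would do (the wording is just an example,
not from this thread):

		if (!set->tags[hctx_idx]) {
			set->tags[hctx_idx] = blk_mq_init_rq_map(set, hctx_idx);
			/*
			 * Allocation failed: fall back to hctx 0, whose
			 * tags are never freed, so this CPU still gets a
			 * usable queue mapping.
			 */
			if (!set->tags[hctx_idx])
				q->mq_map[i] = 0;
		}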

> @@ -1917,11 +1929,8 @@ static void blk_mq_map_swqueue(struct request_queue *q,
>  			continue;
>  		}
>  
> -		/* unmapped hw queue can be remapped after CPU topo changed */
> -		if (!set->tags[i])
> -			set->tags[i] = blk_mq_init_rq_map(set, i);
>  		hctx->tags = set->tags[i];
> -		WARN_ON(!hctx->tags);
> +		BUG_ON(!hctx->tags);

No BUG_ON(), please; it's not necessary to take down the machine if this fails.
It might be game over for that machine if the driver is hosting the data
or root fs, but it might not be.
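
With the hctx 0 fallback in place, the old form should be enough, e.g.
(a sketch, assuming hctx 0's tags really can never be freed):

		hctx->tags = set->tags[i];
		/*
		 * Should not fire: hctxs whose tag allocation failed were
		 * remapped to hctx 0 above and skipped via !hctx->nr_ctx.
		 */
		WARN_ON(!hctx->tags);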

-- 
Jens Axboe


* Re: [PATCH RESEND v2 2/2] blk-mq: Avoid memory reclaim when remapping queues
From: Jens Axboe @ 2016-12-14 15:14 UTC
  To: Gabriel Krisman Bertazi
  Cc: wenxiong, gpiccoli, hch, Brian King, Douglas Miller, linux-block,
	linux-scsi

On 12/06/2016 08:31 AM, Gabriel Krisman Bertazi wrote:
> While changing SMT settings under simultaneous memory and IO stress, we
> were able to consistently trigger deadlocks in the mm subsystem, which
> froze the entire machine.
> 
> I think that under memory stress conditions, the large allocations
> performed by blk_mq_init_rq_map may trigger direct reclaim, which then
> stalls waiting for the block layer remapping to complete, deadlocking
> the system.  The trace below was collected after the machine stalled,
> waiting for the hotplug event completion.
> 
> The simplest fix is to prevent allocations in this path from entering
> the IO path during reclaim, by using GFP_NOIO.  With this patch, we
> could no longer hit the issue.

This looks fine.

-- 
Jens Axboe

