* [PATCH] blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait
@ 2018-07-31 21:34 Anchal Agarwal
  2018-07-31 22:02 ` Anchal Agarwal
  2018-08-01 15:14 ` Jens Axboe
  0 siblings, 2 replies; 38+ messages in thread
From: Anchal Agarwal @ 2018-07-31 21:34 UTC (permalink / raw)
  To: axboe, linux-block, linux-kernel; +Cc: fllinden, msw, anchalag, sblbir

Hi folks,

This patch modifies commit e34cbd307477a
("blk-wbt: add general throttling mechanism").

I am currently running a large bare metal instance (i3.metal)
on EC2 with 72 cores, 512GB of RAM and NVMe drives, with a
4.18 kernel. I have a workload that simulates a database
workload, and I am running into lockup issues when writeback
throttling is enabled, with the hung task detector also
kicking in.

Crash dumps show that most CPUs (up to 50 of them) are
contending for the wbt wait queue lock, either from the request
completion path in wbt_done or while adding themselves to the
wait queue in __wbt_wait (see stack traces below).

[    0.948118] CPU: 45 PID: 0 Comm: swapper/45 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
[    0.948119] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
[    0.948120] task: ffff883f7878c000 task.stack: ffffc9000c69c000
[    0.948124] RIP: 0010:native_queued_spin_lock_slowpath+0xf8/0x1a0
[    0.948125] RSP: 0018:ffff883f7fcc3dc8 EFLAGS: 00000046
[    0.948126] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7fce2a00
[    0.948128] RDX: 000000000000001c RSI: 0000000000740001 RDI: ffff887f7709ca68
[    0.948129] RBP: 0000000000000002 R08: 0000000000b80000 R09: 0000000000000000
[    0.948130] R10: ffff883f7fcc3d78 R11: 000000000de27121 R12: 0000000000000002
[    0.948131] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
[    0.948132] FS:  0000000000000000(0000) GS:ffff883f7fcc0000(0000) knlGS:0000000000000000
[    0.948134] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.948135] CR2: 000000c424c77000 CR3: 0000000002010005 CR4: 00000000003606e0
[    0.948136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    0.948137] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    0.948138] Call Trace:
[    0.948139]  <IRQ>
[    0.948142]  do_raw_spin_lock+0xad/0xc0
[    0.948145]  _raw_spin_lock_irqsave+0x44/0x4b
[    0.948149]  ? __wake_up_common_lock+0x53/0x90
[    0.948150]  __wake_up_common_lock+0x53/0x90
[    0.948155]  wbt_done+0x7b/0xa0
[    0.948158]  blk_mq_free_request+0xb7/0x110
[    0.948161]  __blk_mq_complete_request+0xcb/0x140
[    0.948166]  nvme_process_cq+0xce/0x1a0 [nvme]
[    0.948169]  nvme_irq+0x23/0x50 [nvme]
[    0.948173]  __handle_irq_event_percpu+0x46/0x300
[    0.948176]  handle_irq_event_percpu+0x20/0x50
[    0.948179]  handle_irq_event+0x34/0x60
[    0.948181]  handle_edge_irq+0x77/0x190
[    0.948185]  handle_irq+0xaf/0x120
[    0.948188]  do_IRQ+0x53/0x110
[    0.948191]  common_interrupt+0x87/0x87
[    0.948192]  </IRQ>
....
[    0.311136] CPU: 4 PID: 9737 Comm: run_linux_amd64 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
[    0.311137] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
[    0.311138] task: ffff883f6e6a8000 task.stack: ffffc9000f1ec000
[    0.311141] RIP: 0010:native_queued_spin_lock_slowpath+0xf5/0x1a0
[    0.311142] RSP: 0018:ffffc9000f1efa28 EFLAGS: 00000046
[    0.311144] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7f722a00
[    0.311145] RDX: 0000000000000035 RSI: 0000000000d80001 RDI: ffff887f7709ca68
[    0.311146] RBP: 0000000000000202 R08: 0000000000140000 R09: 0000000000000000
[    0.311147] R10: ffffc9000f1ef9d8 R11: 000000001a249fa0 R12: ffff887f7709ca68
[    0.311148] R13: ffffc9000f1efad0 R14: 0000000000000000 R15: ffff887f7709ca00
[    0.311149] FS:  000000c423f30090(0000) GS:ffff883f7f700000(0000) knlGS:0000000000000000
[    0.311150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.311151] CR2: 00007feefcea4000 CR3: 0000007f7016e001 CR4: 00000000003606e0
[    0.311152] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    0.311153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    0.311154] Call Trace:
[    0.311157]  do_raw_spin_lock+0xad/0xc0
[    0.311160]  _raw_spin_lock_irqsave+0x44/0x4b
[    0.311162]  ? prepare_to_wait_exclusive+0x28/0xb0
[    0.311164]  prepare_to_wait_exclusive+0x28/0xb0
[    0.311167]  wbt_wait+0x127/0x330
[    0.311169]  ? finish_wait+0x80/0x80
[    0.311172]  ? generic_make_request+0xda/0x3b0
[    0.311174]  blk_mq_make_request+0xd6/0x7b0
[    0.311176]  ? blk_queue_enter+0x24/0x260
[    0.311178]  ? generic_make_request+0xda/0x3b0
[    0.311181]  generic_make_request+0x10c/0x3b0
[    0.311183]  ? submit_bio+0x5c/0x110
[    0.311185]  submit_bio+0x5c/0x110
[    0.311197]  ? __ext4_journal_stop+0x36/0xa0 [ext4]
[    0.311210]  ext4_io_submit+0x48/0x60 [ext4]
[    0.311222]  ext4_writepages+0x810/0x11f0 [ext4]
[    0.311229]  ? do_writepages+0x3c/0xd0
[    0.311239]  ? ext4_mark_inode_dirty+0x260/0x260 [ext4]
[    0.311240]  do_writepages+0x3c/0xd0
[    0.311243]  ? _raw_spin_unlock+0x24/0x30
[    0.311245]  ? wbc_attach_and_unlock_inode+0x165/0x280
[    0.311248]  ? __filemap_fdatawrite_range+0xa3/0xe0
[    0.311250]  __filemap_fdatawrite_range+0xa3/0xe0
[    0.311253]  file_write_and_wait_range+0x34/0x90
[    0.311264]  ext4_sync_file+0x151/0x500 [ext4]
[    0.311267]  do_fsync+0x38/0x60
[    0.311270]  SyS_fsync+0xc/0x10
[    0.311272]  do_syscall_64+0x6f/0x170
[    0.311274]  entry_SYSCALL_64_after_hwframe+0x42/0xb7

In the original patch, wbt_done wakes up all of the exclusive
processes in the wait queue, which can cause a thundering herd
if there is a large number of writer threads in the queue. The
original intention of the code seems to be to wake up only one
thread; however, it uses wake_up_all() in __wbt_done(), and then
relies on the following check in __wbt_wait to let only one
thread actually get out of the wait loop:

	if (waitqueue_active(&rqw->wait) &&
	    rqw->wait.head.next != &wait->entry)
		return false;

The problem with this is that the wait entry in wbt_wait is
defined with DEFINE_WAIT, which uses the autoremove wakeup
function. That means the above check is invalid - the wait entry
will already have been removed from the queue by the time we hit
the check in the loop.
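
For reference, this is roughly what DEFINE_WAIT expands to and how the
autoremove wakeup function behaves (paraphrased from the 4.14/4.18
include/linux/wait.h and kernel/sched/wait.c, shown here only to
illustrate the auto-removal):

	#define DEFINE_WAIT(name)					\
		struct wait_queue_entry name = {			\
			.private = current,				\
			.func	 = autoremove_wake_function,		\
			.entry	 = LIST_HEAD_INIT((name).entry),	\
		}

	int autoremove_wake_function(struct wait_queue_entry *wq_entry,
				     unsigned mode, int sync, void *key)
	{
		int ret = default_wake_function(wq_entry, mode, sync, key);

		/* a successful wakeup takes the entry off the wait queue */
		if (ret)
			list_del_init(&wq_entry->entry);

		return ret;
	}

DECLARE_WAITQUEUE(), used by the patch below, installs
default_wake_function instead, so the entry stays on the queue until it
is explicitly removed with remove_wait_queue().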

Secondly, auto-removing the wait entries also means that the wait
queue essentially gets reordered "randomly" (i.e. threads re-add
themselves in whatever order they happen to run after being woken
up). Additionally, new requests entering wbt_wait can overtake
requests that were queued earlier, because the wait queue will be
(temporarily) empty after the wake_up_all, so the waitqueue_active
check will not stop them. This can cause certain threads to starve
under high load.

The fix is to leave the woken-up requests in the queue and only
remove them (with remove_wait_queue()) once the current thread
breaks out of the wait loop in __wbt_wait. This ensures new
requests always end up at the back of the queue and cannot
overtake requests that are already waiting. With that change, the
loop in wbt_wait is also in line with many other wait loops in
the kernel. Waking up just one thread drastically reduces lock
contention, as does moving the wait queue add/remove out of the
loop.
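
For comparison, the canonical open-coded exclusive wait loop used in
many places in the kernel looks roughly like the sketch below; the
patch follows the same shape, with the "condition" being a successful
atomic_inc_below() on the inflight counter:

	add_wait_queue_exclusive(&wq, &wait);
	for (;;) {
		set_current_state(TASK_UNINTERRUPTIBLE);
		if (condition)
			break;
		schedule();
	}
	__set_current_state(TASK_RUNNING);
	remove_wait_queue(&wq, &wait);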

A significant drop in the lock contention numbers reported by
lockdep/lock_stat is seen when running the test application on the
patched kernel.
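
(Assuming a kernel built with CONFIG_LOCK_STAT, the contention numbers
can be collected along these lines:)

	echo 1 > /proc/sys/kernel/lock_stat	# enable collection
	echo 0 > /proc/lock_stat		# clear the statistics
	<run the test application>
	less /proc/lock_stat			# inspect the wait queue lock's entry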

Signed-off-by: Anchal Agarwal <anchalag@amazon.com>
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
---
 block/blk-wbt.c | 55 ++++++++++++++++++++++++-------------------------------
 1 file changed, 24 insertions(+), 31 deletions(-)

diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index 4f89b28fa652..5733d3ab8ed5 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -186,7 +186,7 @@ void __wbt_done(struct rq_wb *rwb, enum wbt_flags wb_acct)
 		int diff = limit - inflight;
 
 		if (!inflight || diff >= rwb->wb_background / 2)
-			wake_up_all(&rqw->wait);
+			wake_up(&rqw->wait);
 	}
 }
 
@@ -533,30 +533,6 @@ static inline unsigned int get_limit(struct rq_wb *rwb, unsigned long rw)
 	return limit;
 }
 
-static inline bool may_queue(struct rq_wb *rwb, struct rq_wait *rqw,
-			     wait_queue_entry_t *wait, unsigned long rw)
-{
-	/*
-	 * inc it here even if disabled, since we'll dec it at completion.
-	 * this only happens if the task was sleeping in __wbt_wait(),
-	 * and someone turned it off at the same time.
-	 */
-	if (!rwb_enabled(rwb)) {
-		atomic_inc(&rqw->inflight);
-		return true;
-	}
-
-	/*
-	 * If the waitqueue is already active and we are not the next
-	 * in line to be woken up, wait for our turn.
-	 */
-	if (waitqueue_active(&rqw->wait) &&
-	    rqw->wait.head.next != &wait->entry)
-		return false;
-
-	return atomic_inc_below(&rqw->inflight, get_limit(rwb, rw));
-}
-
 /*
  * Block if we will exceed our limit, or if we are currently waiting for
  * the timer to kick off queuing again.
@@ -567,16 +543,32 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
 	__acquires(lock)
 {
 	struct rq_wait *rqw = get_rq_wait(rwb, wb_acct);
-	DEFINE_WAIT(wait);
+	DECLARE_WAITQUEUE(wait, current);
+
+	/*
+	* inc it here even if disabled, since we'll dec it at completion.
+	* this only happens if the task was sleeping in __wbt_wait(),
+	* and someone turned it off at the same time.
+	*/
+	if (!rwb_enabled(rwb)) {
+		atomic_inc(&rqw->inflight);
+		return;
+	}
 
-	if (may_queue(rwb, rqw, &wait, rw))
+	if (!waitqueue_active(&rqw->wait)
+		&& atomic_inc_below(&rqw->inflight, get_limit(rwb, rw)))
 		return;
 
+	add_wait_queue_exclusive(&rqw->wait, &wait);
 	do {
-		prepare_to_wait_exclusive(&rqw->wait, &wait,
-						TASK_UNINTERRUPTIBLE);
+		set_current_state(TASK_UNINTERRUPTIBLE);
+
+		if (!rwb_enabled(rwb)) {
+			atomic_inc(&rqw->inflight);
+			break;
+		}
 
-		if (may_queue(rwb, rqw, &wait, rw))
+		if (atomic_inc_below(&rqw->inflight, get_limit(rwb, rw)))
 			break;
 
 		if (lock) {
@@ -587,7 +579,8 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
 			io_schedule();
 	} while (1);
 
-	finish_wait(&rqw->wait, &wait);
+	__set_current_state(TASK_RUNNING);
+	remove_wait_queue(&rqw->wait, &wait);
 }
 
 static inline bool wbt_should_throttle(struct rq_wb *rwb, struct bio *bio)
-- 
2.13.6


* Re: [PATCH] blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait
  2018-07-31 21:34 [PATCH] blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait Anchal Agarwal
@ 2018-07-31 22:02 ` Anchal Agarwal
  2018-08-01 15:14 ` Jens Axboe
  1 sibling, 0 replies; 38+ messages in thread
From: Anchal Agarwal @ 2018-07-31 22:02 UTC (permalink / raw)
  To: axboe, linux-block, linux-kernel; +Cc: fllinden, msw, anchalag, sblbir

Apologies, I just noticed that the stack traces I posted are from a
4.14 kernel. However, I tested on both 4.14 and 4.18 and saw the same
issue on 4.18 too. The patch resolves the issue on both 4.14 and 4.18.

Thanks,
Anchal Agarwal

On Tue, Jul 31, 2018 at 09:34:10PM +0000, Anchal Agarwal wrote:
> Hi folks,
> 
> This patch modifies commit e34cbd307477a
> (blk-wbt: add general throttling mechanism)
> 
> I am currently running a large bare metal instance (i3.metal)
> on EC2 with 72 cores, 512GB of RAM and NVME drives, with a
> 4.18 kernel. I have a workload that simulates a database
> workload and I am running into lockup issues when writeback
> throttling is enabled,with the hung task detector also
> kicking in.
> 
> Crash dumps show that most CPUs (up to 50 of them) are
> all trying to get the wbt wait queue lock while trying to add
> themselves to it in __wbt_wait (see stack traces below).
> 
> [    0.948118] CPU: 45 PID: 0 Comm: swapper/45 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
> [    0.948119] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
> [    0.948120] task: ffff883f7878c000 task.stack: ffffc9000c69c000
> [    0.948124] RIP: 0010:native_queued_spin_lock_slowpath+0xf8/0x1a0
> [    0.948125] RSP: 0018:ffff883f7fcc3dc8 EFLAGS: 00000046
> [    0.948126] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7fce2a00
> [    0.948128] RDX: 000000000000001c RSI: 0000000000740001 RDI: ffff887f7709ca68
> [    0.948129] RBP: 0000000000000002 R08: 0000000000b80000 R09: 0000000000000000
> [    0.948130] R10: ffff883f7fcc3d78 R11: 000000000de27121 R12: 0000000000000002
> [    0.948131] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
> [    0.948132] FS:  0000000000000000(0000) GS:ffff883f7fcc0000(0000) knlGS:0000000000000000
> [    0.948134] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    0.948135] CR2: 000000c424c77000 CR3: 0000000002010005 CR4: 00000000003606e0
> [    0.948136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [    0.948137] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [    0.948138] Call Trace:
> [    0.948139]  <IRQ>
> [    0.948142]  do_raw_spin_lock+0xad/0xc0
> [    0.948145]  _raw_spin_lock_irqsave+0x44/0x4b
> [    0.948149]  ? __wake_up_common_lock+0x53/0x90
> [    0.948150]  __wake_up_common_lock+0x53/0x90
> [    0.948155]  wbt_done+0x7b/0xa0
> [    0.948158]  blk_mq_free_request+0xb7/0x110
> [    0.948161]  __blk_mq_complete_request+0xcb/0x140
> [    0.948166]  nvme_process_cq+0xce/0x1a0 [nvme]
> [    0.948169]  nvme_irq+0x23/0x50 [nvme]
> [    0.948173]  __handle_irq_event_percpu+0x46/0x300
> [    0.948176]  handle_irq_event_percpu+0x20/0x50
> [    0.948179]  handle_irq_event+0x34/0x60
> [    0.948181]  handle_edge_irq+0x77/0x190
> [    0.948185]  handle_irq+0xaf/0x120
> [    0.948188]  do_IRQ+0x53/0x110
> [    0.948191]  common_interrupt+0x87/0x87
> [    0.948192]  </IRQ>
> ....
> [    0.311136] CPU: 4 PID: 9737 Comm: run_linux_amd64 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
> [    0.311137] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
> [    0.311138] task: ffff883f6e6a8000 task.stack: ffffc9000f1ec000
> [    0.311141] RIP: 0010:native_queued_spin_lock_slowpath+0xf5/0x1a0
> [    0.311142] RSP: 0018:ffffc9000f1efa28 EFLAGS: 00000046
> [    0.311144] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7f722a00
> [    0.311145] RDX: 0000000000000035 RSI: 0000000000d80001 RDI: ffff887f7709ca68
> [    0.311146] RBP: 0000000000000202 R08: 0000000000140000 R09: 0000000000000000
> [    0.311147] R10: ffffc9000f1ef9d8 R11: 000000001a249fa0 R12: ffff887f7709ca68
> [    0.311148] R13: ffffc9000f1efad0 R14: 0000000000000000 R15: ffff887f7709ca00
> [    0.311149] FS:  000000c423f30090(0000) GS:ffff883f7f700000(0000) knlGS:0000000000000000
> [    0.311150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    0.311151] CR2: 00007feefcea4000 CR3: 0000007f7016e001 CR4: 00000000003606e0
> [    0.311152] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [    0.311153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [    0.311154] Call Trace:
> [    0.311157]  do_raw_spin_lock+0xad/0xc0
> [    0.311160]  _raw_spin_lock_irqsave+0x44/0x4b
> [    0.311162]  ? prepare_to_wait_exclusive+0x28/0xb0
> [    0.311164]  prepare_to_wait_exclusive+0x28/0xb0
> [    0.311167]  wbt_wait+0x127/0x330
> [    0.311169]  ? finish_wait+0x80/0x80
> [    0.311172]  ? generic_make_request+0xda/0x3b0
> [    0.311174]  blk_mq_make_request+0xd6/0x7b0
> [    0.311176]  ? blk_queue_enter+0x24/0x260
> [    0.311178]  ? generic_make_request+0xda/0x3b0
> [    0.311181]  generic_make_request+0x10c/0x3b0
> [    0.311183]  ? submit_bio+0x5c/0x110
> [    0.311185]  submit_bio+0x5c/0x110
> [    0.311197]  ? __ext4_journal_stop+0x36/0xa0 [ext4]
> [    0.311210]  ext4_io_submit+0x48/0x60 [ext4]
> [    0.311222]  ext4_writepages+0x810/0x11f0 [ext4]
> [    0.311229]  ? do_writepages+0x3c/0xd0
> [    0.311239]  ? ext4_mark_inode_dirty+0x260/0x260 [ext4]
> [    0.311240]  do_writepages+0x3c/0xd0
> [    0.311243]  ? _raw_spin_unlock+0x24/0x30
> [    0.311245]  ? wbc_attach_and_unlock_inode+0x165/0x280
> [    0.311248]  ? __filemap_fdatawrite_range+0xa3/0xe0
> [    0.311250]  __filemap_fdatawrite_range+0xa3/0xe0
> [    0.311253]  file_write_and_wait_range+0x34/0x90
> [    0.311264]  ext4_sync_file+0x151/0x500 [ext4]
> [    0.311267]  do_fsync+0x38/0x60
> [    0.311270]  SyS_fsync+0xc/0x10
> [    0.311272]  do_syscall_64+0x6f/0x170
> [    0.311274]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
> 
> In the original patch, wbt_done is waking up all the exclusive
> processes in the wait queue, which can cause a thundering herd
> if there is a large number of writer threads in the queue. The
> original intention of the code seems to be to wake up one thread
> only however, it uses wake_up_all() in __wbt_done(), and then
> uses the following check in __wbt_wait to have only one thread
> actually get out of the wait loop:
> 
> if (waitqueue_active(&rqw->wait) &&
>             rqw->wait.head.next != &wait->entry)
>                 return false;
> 
> The problem with this is that the wait entry in wbt_wait is
> define with DEFINE_WAIT, which uses the autoremove wakeup function.
> That means that the above check is invalid - the wait entry will
> have been removed from the queue already by the time we hit the
> check in the loop.
> 
> Secondly, auto-removing the wait entries also means that the wait
> queue essentially gets reordered "randomly" (e.g. threads re-add
> themselves in the order they got to run after being woken up).
> Additionally, new requests entering wbt_wait might overtake requests
> that were queued earlier, because the wait queue will be
> (temporarily) empty after the wake_up_all, so the waitqueue_active
> check will not stop them. This can cause certain threads to starve
> under high load.
> 
> The fix is to leave the woken up requests in the queue and remove
> them in finish_wait() once the current thread breaks out of the
> wait loop in __wbt_wait. This will ensure new requests always
> end up at the back of the queue, and they won't overtake requests
> that are already in the wait queue. With that change, the loop
> in wbt_wait is also in line with many other wait loops in the kernel.
> Waking up just one thread drastically reduces lock contention, as
> does moving the wait queue add/remove out of the loop.
> 
> A significant drop in lockdep's lock contention numbers is seen when
> running the test application on the patched kernel.
> 
> Signed-off-by: Anchal Agarwal <anchalag@amazon.com>
> Signed-off-by: Frank van der Linden <fllinden@amazon.com>
> ---
>  block/blk-wbt.c | 55 ++++++++++++++++++++++++-------------------------------
>  1 file changed, 24 insertions(+), 31 deletions(-)
> 
> diff --git a/block/blk-wbt.c b/block/blk-wbt.c
> index 4f89b28fa652..5733d3ab8ed5 100644
> --- a/block/blk-wbt.c
> +++ b/block/blk-wbt.c
> @@ -186,7 +186,7 @@ void __wbt_done(struct rq_wb *rwb, enum wbt_flags wb_acct)
>  		int diff = limit - inflight;
>  
>  		if (!inflight || diff >= rwb->wb_background / 2)
> -			wake_up_all(&rqw->wait);
> +			wake_up(&rqw->wait);
>  	}
>  }
>  
> @@ -533,30 +533,6 @@ static inline unsigned int get_limit(struct rq_wb *rwb, unsigned long rw)
>  	return limit;
>  }
>  
> -static inline bool may_queue(struct rq_wb *rwb, struct rq_wait *rqw,
> -			     wait_queue_entry_t *wait, unsigned long rw)
> -{
> -	/*
> -	 * inc it here even if disabled, since we'll dec it at completion.
> -	 * this only happens if the task was sleeping in __wbt_wait(),
> -	 * and someone turned it off at the same time.
> -	 */
> -	if (!rwb_enabled(rwb)) {
> -		atomic_inc(&rqw->inflight);
> -		return true;
> -	}
> -
> -	/*
> -	 * If the waitqueue is already active and we are not the next
> -	 * in line to be woken up, wait for our turn.
> -	 */
> -	if (waitqueue_active(&rqw->wait) &&
> -	    rqw->wait.head.next != &wait->entry)
> -		return false;
> -
> -	return atomic_inc_below(&rqw->inflight, get_limit(rwb, rw));
> -}
> -
>  /*
>   * Block if we will exceed our limit, or if we are currently waiting for
>   * the timer to kick off queuing again.
> @@ -567,16 +543,32 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
>  	__acquires(lock)
>  {
>  	struct rq_wait *rqw = get_rq_wait(rwb, wb_acct);
> -	DEFINE_WAIT(wait);
> +	DECLARE_WAITQUEUE(wait, current);
> +
> +	/*
> +	* inc it here even if disabled, since we'll dec it at completion.
> +	* this only happens if the task was sleeping in __wbt_wait(),
> +	* and someone turned it off at the same time.
> +	*/
> +	if (!rwb_enabled(rwb)) {
> +		atomic_inc(&rqw->inflight);
> +		return;
> +	}
>  
> -	if (may_queue(rwb, rqw, &wait, rw))
> +	if (!waitqueue_active(&rqw->wait)
> +		&& atomic_inc_below(&rqw->inflight, get_limit(rwb, rw)))
>  		return;
>  
> +	add_wait_queue_exclusive(&rqw->wait, &wait);
>  	do {
> -		prepare_to_wait_exclusive(&rqw->wait, &wait,
> -						TASK_UNINTERRUPTIBLE);
> +		set_current_state(TASK_UNINTERRUPTIBLE);
> +
> +		if (!rwb_enabled(rwb)) {
> +			atomic_inc(&rqw->inflight);
> +			break;
> +		}
>  
> -		if (may_queue(rwb, rqw, &wait, rw))
> +		if (atomic_inc_below(&rqw->inflight, get_limit(rwb, rw)))
>  			break;
>  
>  		if (lock) {
> @@ -587,7 +579,8 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
>  			io_schedule();
>  	} while (1);
>  
> -	finish_wait(&rqw->wait, &wait);
> +	__set_current_state(TASK_RUNNING);
> +	remove_wait_queue(&rqw->wait, &wait);
>  }
>  
>  static inline bool wbt_should_throttle(struct rq_wb *rwb, struct bio *bio)
> -- 
> 2.13.6


* Re: [PATCH] blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait
  2018-07-31 21:34 [PATCH] blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait Anchal Agarwal
  2018-07-31 22:02 ` Anchal Agarwal
@ 2018-08-01 15:14 ` Jens Axboe
  2018-08-01 17:06   ` Anchal Agarwal
  1 sibling, 1 reply; 38+ messages in thread
From: Jens Axboe @ 2018-08-01 15:14 UTC (permalink / raw)
  To: Anchal Agarwal, linux-block, linux-kernel; +Cc: fllinden, msw, sblbir

On 7/31/18 3:34 PM, Anchal Agarwal wrote:
> Hi folks,
> 
> This patch modifies commit e34cbd307477a
> (blk-wbt: add general throttling mechanism)
> 
> I am currently running a large bare metal instance (i3.metal)
> on EC2 with 72 cores, 512GB of RAM and NVME drives, with a
> 4.18 kernel. I have a workload that simulates a database
> workload and I am running into lockup issues when writeback
> throttling is enabled,with the hung task detector also
> kicking in.
> 
> Crash dumps show that most CPUs (up to 50 of them) are
> all trying to get the wbt wait queue lock while trying to add
> themselves to it in __wbt_wait (see stack traces below).
> 
> [    0.948118] CPU: 45 PID: 0 Comm: swapper/45 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
> [    0.948119] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
> [    0.948120] task: ffff883f7878c000 task.stack: ffffc9000c69c000
> [    0.948124] RIP: 0010:native_queued_spin_lock_slowpath+0xf8/0x1a0
> [    0.948125] RSP: 0018:ffff883f7fcc3dc8 EFLAGS: 00000046
> [    0.948126] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7fce2a00
> [    0.948128] RDX: 000000000000001c RSI: 0000000000740001 RDI: ffff887f7709ca68
> [    0.948129] RBP: 0000000000000002 R08: 0000000000b80000 R09: 0000000000000000
> [    0.948130] R10: ffff883f7fcc3d78 R11: 000000000de27121 R12: 0000000000000002
> [    0.948131] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
> [    0.948132] FS:  0000000000000000(0000) GS:ffff883f7fcc0000(0000) knlGS:0000000000000000
> [    0.948134] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    0.948135] CR2: 000000c424c77000 CR3: 0000000002010005 CR4: 00000000003606e0
> [    0.948136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [    0.948137] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [    0.948138] Call Trace:
> [    0.948139]  <IRQ>
> [    0.948142]  do_raw_spin_lock+0xad/0xc0
> [    0.948145]  _raw_spin_lock_irqsave+0x44/0x4b
> [    0.948149]  ? __wake_up_common_lock+0x53/0x90
> [    0.948150]  __wake_up_common_lock+0x53/0x90
> [    0.948155]  wbt_done+0x7b/0xa0
> [    0.948158]  blk_mq_free_request+0xb7/0x110
> [    0.948161]  __blk_mq_complete_request+0xcb/0x140
> [    0.948166]  nvme_process_cq+0xce/0x1a0 [nvme]
> [    0.948169]  nvme_irq+0x23/0x50 [nvme]
> [    0.948173]  __handle_irq_event_percpu+0x46/0x300
> [    0.948176]  handle_irq_event_percpu+0x20/0x50
> [    0.948179]  handle_irq_event+0x34/0x60
> [    0.948181]  handle_edge_irq+0x77/0x190
> [    0.948185]  handle_irq+0xaf/0x120
> [    0.948188]  do_IRQ+0x53/0x110
> [    0.948191]  common_interrupt+0x87/0x87
> [    0.948192]  </IRQ>
> ....
> [    0.311136] CPU: 4 PID: 9737 Comm: run_linux_amd64 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
> [    0.311137] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
> [    0.311138] task: ffff883f6e6a8000 task.stack: ffffc9000f1ec000
> [    0.311141] RIP: 0010:native_queued_spin_lock_slowpath+0xf5/0x1a0
> [    0.311142] RSP: 0018:ffffc9000f1efa28 EFLAGS: 00000046
> [    0.311144] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7f722a00
> [    0.311145] RDX: 0000000000000035 RSI: 0000000000d80001 RDI: ffff887f7709ca68
> [    0.311146] RBP: 0000000000000202 R08: 0000000000140000 R09: 0000000000000000
> [    0.311147] R10: ffffc9000f1ef9d8 R11: 000000001a249fa0 R12: ffff887f7709ca68
> [    0.311148] R13: ffffc9000f1efad0 R14: 0000000000000000 R15: ffff887f7709ca00
> [    0.311149] FS:  000000c423f30090(0000) GS:ffff883f7f700000(0000) knlGS:0000000000000000
> [    0.311150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    0.311151] CR2: 00007feefcea4000 CR3: 0000007f7016e001 CR4: 00000000003606e0
> [    0.311152] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [    0.311153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [    0.311154] Call Trace:
> [    0.311157]  do_raw_spin_lock+0xad/0xc0
> [    0.311160]  _raw_spin_lock_irqsave+0x44/0x4b
> [    0.311162]  ? prepare_to_wait_exclusive+0x28/0xb0
> [    0.311164]  prepare_to_wait_exclusive+0x28/0xb0
> [    0.311167]  wbt_wait+0x127/0x330
> [    0.311169]  ? finish_wait+0x80/0x80
> [    0.311172]  ? generic_make_request+0xda/0x3b0
> [    0.311174]  blk_mq_make_request+0xd6/0x7b0
> [    0.311176]  ? blk_queue_enter+0x24/0x260
> [    0.311178]  ? generic_make_request+0xda/0x3b0
> [    0.311181]  generic_make_request+0x10c/0x3b0
> [    0.311183]  ? submit_bio+0x5c/0x110
> [    0.311185]  submit_bio+0x5c/0x110
> [    0.311197]  ? __ext4_journal_stop+0x36/0xa0 [ext4]
> [    0.311210]  ext4_io_submit+0x48/0x60 [ext4]
> [    0.311222]  ext4_writepages+0x810/0x11f0 [ext4]
> [    0.311229]  ? do_writepages+0x3c/0xd0
> [    0.311239]  ? ext4_mark_inode_dirty+0x260/0x260 [ext4]
> [    0.311240]  do_writepages+0x3c/0xd0
> [    0.311243]  ? _raw_spin_unlock+0x24/0x30
> [    0.311245]  ? wbc_attach_and_unlock_inode+0x165/0x280
> [    0.311248]  ? __filemap_fdatawrite_range+0xa3/0xe0
> [    0.311250]  __filemap_fdatawrite_range+0xa3/0xe0
> [    0.311253]  file_write_and_wait_range+0x34/0x90
> [    0.311264]  ext4_sync_file+0x151/0x500 [ext4]
> [    0.311267]  do_fsync+0x38/0x60
> [    0.311270]  SyS_fsync+0xc/0x10
> [    0.311272]  do_syscall_64+0x6f/0x170
> [    0.311274]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
> 
> In the original patch, wbt_done is waking up all the exclusive
> processes in the wait queue, which can cause a thundering herd
> if there is a large number of writer threads in the queue. The
> original intention of the code seems to be to wake up one thread
> only however, it uses wake_up_all() in __wbt_done(), and then
> uses the following check in __wbt_wait to have only one thread
> actually get out of the wait loop:
> 
> if (waitqueue_active(&rqw->wait) &&
>             rqw->wait.head.next != &wait->entry)
>                 return false;
> 
> The problem with this is that the wait entry in wbt_wait is
> define with DEFINE_WAIT, which uses the autoremove wakeup function.
> That means that the above check is invalid - the wait entry will
> have been removed from the queue already by the time we hit the
> check in the loop.
> 
> Secondly, auto-removing the wait entries also means that the wait
> queue essentially gets reordered "randomly" (e.g. threads re-add
> themselves in the order they got to run after being woken up).
> Additionally, new requests entering wbt_wait might overtake requests
> that were queued earlier, because the wait queue will be
> (temporarily) empty after the wake_up_all, so the waitqueue_active
> check will not stop them. This can cause certain threads to starve
> under high load.
> 
> The fix is to leave the woken up requests in the queue and remove
> them in finish_wait() once the current thread breaks out of the
> wait loop in __wbt_wait. This will ensure new requests always
> end up at the back of the queue, and they won't overtake requests
> that are already in the wait queue. With that change, the loop
> in wbt_wait is also in line with many other wait loops in the kernel.
> Waking up just one thread drastically reduces lock contention, as
> does moving the wait queue add/remove out of the loop.
> 
> A significant drop in lockdep's lock contention numbers is seen when
> running the test application on the patched kernel.

I like the patch, and a few weeks ago we independently discovered that
the waitqueue list checking was bogus as well. My only worry is that
changes like this can be delicate, meaning that it's easy to introduce
stall conditions. What kind of testing did you push this through?

-- 
Jens Axboe


* Re: [PATCH] blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait
  2018-08-01 15:14 ` Jens Axboe
@ 2018-08-01 17:06   ` Anchal Agarwal
  2018-08-01 22:09     ` Jens Axboe
  0 siblings, 1 reply; 38+ messages in thread
From: Anchal Agarwal @ 2018-08-01 17:06 UTC (permalink / raw)
  To: axboe, linux-block, linux-kernel; +Cc: fllinden, msw, anchalag, sblbir

On Wed, Aug 01, 2018 at 09:14:50AM -0600, Jens Axboe wrote:
> On 7/31/18 3:34 PM, Anchal Agarwal wrote:
> > Hi folks,
> > 
> > This patch modifies commit e34cbd307477a
> > (blk-wbt: add general throttling mechanism)
> > 
> > I am currently running a large bare metal instance (i3.metal)
> > on EC2 with 72 cores, 512GB of RAM and NVME drives, with a
> > 4.18 kernel. I have a workload that simulates a database
> > workload and I am running into lockup issues when writeback
> > throttling is enabled,with the hung task detector also
> > kicking in.
> > 
> > Crash dumps show that most CPUs (up to 50 of them) are
> > all trying to get the wbt wait queue lock while trying to add
> > themselves to it in __wbt_wait (see stack traces below).
> > 
> > [    0.948118] CPU: 45 PID: 0 Comm: swapper/45 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
> > [    0.948119] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
> > [    0.948120] task: ffff883f7878c000 task.stack: ffffc9000c69c000
> > [    0.948124] RIP: 0010:native_queued_spin_lock_slowpath+0xf8/0x1a0
> > [    0.948125] RSP: 0018:ffff883f7fcc3dc8 EFLAGS: 00000046
> > [    0.948126] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7fce2a00
> > [    0.948128] RDX: 000000000000001c RSI: 0000000000740001 RDI: ffff887f7709ca68
> > [    0.948129] RBP: 0000000000000002 R08: 0000000000b80000 R09: 0000000000000000
> > [    0.948130] R10: ffff883f7fcc3d78 R11: 000000000de27121 R12: 0000000000000002
> > [    0.948131] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
> > [    0.948132] FS:  0000000000000000(0000) GS:ffff883f7fcc0000(0000) knlGS:0000000000000000
> > [    0.948134] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [    0.948135] CR2: 000000c424c77000 CR3: 0000000002010005 CR4: 00000000003606e0
> > [    0.948136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [    0.948137] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > [    0.948138] Call Trace:
> > [    0.948139]  <IRQ>
> > [    0.948142]  do_raw_spin_lock+0xad/0xc0
> > [    0.948145]  _raw_spin_lock_irqsave+0x44/0x4b
> > [    0.948149]  ? __wake_up_common_lock+0x53/0x90
> > [    0.948150]  __wake_up_common_lock+0x53/0x90
> > [    0.948155]  wbt_done+0x7b/0xa0
> > [    0.948158]  blk_mq_free_request+0xb7/0x110
> > [    0.948161]  __blk_mq_complete_request+0xcb/0x140
> > [    0.948166]  nvme_process_cq+0xce/0x1a0 [nvme]
> > [    0.948169]  nvme_irq+0x23/0x50 [nvme]
> > [    0.948173]  __handle_irq_event_percpu+0x46/0x300
> > [    0.948176]  handle_irq_event_percpu+0x20/0x50
> > [    0.948179]  handle_irq_event+0x34/0x60
> > [    0.948181]  handle_edge_irq+0x77/0x190
> > [    0.948185]  handle_irq+0xaf/0x120
> > [    0.948188]  do_IRQ+0x53/0x110
> > [    0.948191]  common_interrupt+0x87/0x87
> > [    0.948192]  </IRQ>
> > ....
> > [    0.311136] CPU: 4 PID: 9737 Comm: run_linux_amd64 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
> > [    0.311137] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
> > [    0.311138] task: ffff883f6e6a8000 task.stack: ffffc9000f1ec000
> > [    0.311141] RIP: 0010:native_queued_spin_lock_slowpath+0xf5/0x1a0
> > [    0.311142] RSP: 0018:ffffc9000f1efa28 EFLAGS: 00000046
> > [    0.311144] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7f722a00
> > [    0.311145] RDX: 0000000000000035 RSI: 0000000000d80001 RDI: ffff887f7709ca68
> > [    0.311146] RBP: 0000000000000202 R08: 0000000000140000 R09: 0000000000000000
> > [    0.311147] R10: ffffc9000f1ef9d8 R11: 000000001a249fa0 R12: ffff887f7709ca68
> > [    0.311148] R13: ffffc9000f1efad0 R14: 0000000000000000 R15: ffff887f7709ca00
> > [    0.311149] FS:  000000c423f30090(0000) GS:ffff883f7f700000(0000) knlGS:0000000000000000
> > [    0.311150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [    0.311151] CR2: 00007feefcea4000 CR3: 0000007f7016e001 CR4: 00000000003606e0
> > [    0.311152] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [    0.311153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > [    0.311154] Call Trace:
> > [    0.311157]  do_raw_spin_lock+0xad/0xc0
> > [    0.311160]  _raw_spin_lock_irqsave+0x44/0x4b
> > [    0.311162]  ? prepare_to_wait_exclusive+0x28/0xb0
> > [    0.311164]  prepare_to_wait_exclusive+0x28/0xb0
> > [    0.311167]  wbt_wait+0x127/0x330
> > [    0.311169]  ? finish_wait+0x80/0x80
> > [    0.311172]  ? generic_make_request+0xda/0x3b0
> > [    0.311174]  blk_mq_make_request+0xd6/0x7b0
> > [    0.311176]  ? blk_queue_enter+0x24/0x260
> > [    0.311178]  ? generic_make_request+0xda/0x3b0
> > [    0.311181]  generic_make_request+0x10c/0x3b0
> > [    0.311183]  ? submit_bio+0x5c/0x110
> > [    0.311185]  submit_bio+0x5c/0x110
> > [    0.311197]  ? __ext4_journal_stop+0x36/0xa0 [ext4]
> > [    0.311210]  ext4_io_submit+0x48/0x60 [ext4]
> > [    0.311222]  ext4_writepages+0x810/0x11f0 [ext4]
> > [    0.311229]  ? do_writepages+0x3c/0xd0
> > [    0.311239]  ? ext4_mark_inode_dirty+0x260/0x260 [ext4]
> > [    0.311240]  do_writepages+0x3c/0xd0
> > [    0.311243]  ? _raw_spin_unlock+0x24/0x30
> > [    0.311245]  ? wbc_attach_and_unlock_inode+0x165/0x280
> > [    0.311248]  ? __filemap_fdatawrite_range+0xa3/0xe0
> > [    0.311250]  __filemap_fdatawrite_range+0xa3/0xe0
> > [    0.311253]  file_write_and_wait_range+0x34/0x90
> > [    0.311264]  ext4_sync_file+0x151/0x500 [ext4]
> > [    0.311267]  do_fsync+0x38/0x60
> > [    0.311270]  SyS_fsync+0xc/0x10
> > [    0.311272]  do_syscall_64+0x6f/0x170
> > [    0.311274]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
> > 
> > In the original patch, wbt_done is waking up all the exclusive
> > processes in the wait queue, which can cause a thundering herd
> > if there is a large number of writer threads in the queue. The
> > original intention of the code seems to be to wake up one thread
> > only however, it uses wake_up_all() in __wbt_done(), and then
> > uses the following check in __wbt_wait to have only one thread
> > actually get out of the wait loop:
> > 
> > if (waitqueue_active(&rqw->wait) &&
> >             rqw->wait.head.next != &wait->entry)
> >                 return false;
> > 
> > The problem with this is that the wait entry in wbt_wait is
> > define with DEFINE_WAIT, which uses the autoremove wakeup function.
> > That means that the above check is invalid - the wait entry will
> > have been removed from the queue already by the time we hit the
> > check in the loop.
> > 
> > Secondly, auto-removing the wait entries also means that the wait
> > queue essentially gets reordered "randomly" (e.g. threads re-add
> > themselves in the order they got to run after being woken up).
> > Additionally, new requests entering wbt_wait might overtake requests
> > that were queued earlier, because the wait queue will be
> > (temporarily) empty after the wake_up_all, so the waitqueue_active
> > check will not stop them. This can cause certain threads to starve
> > under high load.
> > 
> > The fix is to leave the woken up requests in the queue and remove
> > them in finish_wait() once the current thread breaks out of the
> > wait loop in __wbt_wait. This will ensure new requests always
> > end up at the back of the queue, and they won't overtake requests
> > that are already in the wait queue. With that change, the loop
> > in wbt_wait is also in line with many other wait loops in the kernel.
> > Waking up just one thread drastically reduces lock contention, as
> > does moving the wait queue add/remove out of the loop.
> > 
> > A significant drop in lockdep's lock contention numbers is seen when
> > running the test application on the patched kernel.
> 
> I like the patch, and a few weeks ago we independently discovered that
> the waitqueue list checking was bogus as well. My only worry is that
> changes like this can be delicate, meaning that it's easy to introduce
> stall conditions. What kind of testing did you push this through?
> 
> -- 
> Jens Axboe
> 
I ran the following tests on both real HW with NVME devices attached
and emulated NVME too:

1. The test case I used to reproduce the issue. It spawns a bunch of
   threads that concurrently read and write files with random size and
   content, and files are randomly fsync'd. The implementation is a FIFO
   queue of files; when the queue fills up, the test starts to verify
   and remove the files. The test fails on any read, write, or hash
   check failure, so it catches file corruption when lots of small files
   are being read and written with high concurrency.

2. Fio random writes against a 200GB root NVMe device:
  
  fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite --bs=4k 
  --direct=0 --size=10G --numjobs=2 --runtime=60 --group_reporting
  
  fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite --bs=4k
  --direct=0 --size=5G --numjobs=2 --runtime=30 --fsync=64 --group_reporting
  
  I did see an improvement in the bandwidth numbers reported on the patched
  kernel. 

Do you have any test case/suite in mind that you would suggest I run
to make sure the patch does not introduce any stall conditions?

Thanks,
Anchal Agarwal


* Re: [PATCH] blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait
  2018-08-01 17:06   ` Anchal Agarwal
@ 2018-08-01 22:09     ` Jens Axboe
  2018-08-07 14:29       ` Jens Axboe
  0 siblings, 1 reply; 38+ messages in thread
From: Jens Axboe @ 2018-08-01 22:09 UTC (permalink / raw)
  To: Anchal Agarwal, linux-block, linux-kernel; +Cc: fllinden, msw, sblbir

On 8/1/18 11:06 AM, Anchal Agarwal wrote:
> On Wed, Aug 01, 2018 at 09:14:50AM -0600, Jens Axboe wrote:
>> On 7/31/18 3:34 PM, Anchal Agarwal wrote:
>>> Hi folks,
>>>
>>> This patch modifies commit e34cbd307477a
>>> (blk-wbt: add general throttling mechanism)
>>>
>>> I am currently running a large bare metal instance (i3.metal)
>>> on EC2 with 72 cores, 512GB of RAM and NVME drives, with a
>>> 4.18 kernel. I have a workload that simulates a database
>>> workload and I am running into lockup issues when writeback
>>> throttling is enabled,with the hung task detector also
>>> kicking in.
>>>
>>> Crash dumps show that most CPUs (up to 50 of them) are
>>> all trying to get the wbt wait queue lock while trying to add
>>> themselves to it in __wbt_wait (see stack traces below).
>>>
>>> [    0.948118] CPU: 45 PID: 0 Comm: swapper/45 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
>>> [    0.948119] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
>>> [    0.948120] task: ffff883f7878c000 task.stack: ffffc9000c69c000
>>> [    0.948124] RIP: 0010:native_queued_spin_lock_slowpath+0xf8/0x1a0
>>> [    0.948125] RSP: 0018:ffff883f7fcc3dc8 EFLAGS: 00000046
>>> [    0.948126] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7fce2a00
>>> [    0.948128] RDX: 000000000000001c RSI: 0000000000740001 RDI: ffff887f7709ca68
>>> [    0.948129] RBP: 0000000000000002 R08: 0000000000b80000 R09: 0000000000000000
>>> [    0.948130] R10: ffff883f7fcc3d78 R11: 000000000de27121 R12: 0000000000000002
>>> [    0.948131] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
>>> [    0.948132] FS:  0000000000000000(0000) GS:ffff883f7fcc0000(0000) knlGS:0000000000000000
>>> [    0.948134] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [    0.948135] CR2: 000000c424c77000 CR3: 0000000002010005 CR4: 00000000003606e0
>>> [    0.948136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> [    0.948137] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> [    0.948138] Call Trace:
>>> [    0.948139]  <IRQ>
>>> [    0.948142]  do_raw_spin_lock+0xad/0xc0
>>> [    0.948145]  _raw_spin_lock_irqsave+0x44/0x4b
>>> [    0.948149]  ? __wake_up_common_lock+0x53/0x90
>>> [    0.948150]  __wake_up_common_lock+0x53/0x90
>>> [    0.948155]  wbt_done+0x7b/0xa0
>>> [    0.948158]  blk_mq_free_request+0xb7/0x110
>>> [    0.948161]  __blk_mq_complete_request+0xcb/0x140
>>> [    0.948166]  nvme_process_cq+0xce/0x1a0 [nvme]
>>> [    0.948169]  nvme_irq+0x23/0x50 [nvme]
>>> [    0.948173]  __handle_irq_event_percpu+0x46/0x300
>>> [    0.948176]  handle_irq_event_percpu+0x20/0x50
>>> [    0.948179]  handle_irq_event+0x34/0x60
>>> [    0.948181]  handle_edge_irq+0x77/0x190
>>> [    0.948185]  handle_irq+0xaf/0x120
>>> [    0.948188]  do_IRQ+0x53/0x110
>>> [    0.948191]  common_interrupt+0x87/0x87
>>> [    0.948192]  </IRQ>
>>> ....
>>> [    0.311136] CPU: 4 PID: 9737 Comm: run_linux_amd64 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
>>> [    0.311137] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
>>> [    0.311138] task: ffff883f6e6a8000 task.stack: ffffc9000f1ec000
>>> [    0.311141] RIP: 0010:native_queued_spin_lock_slowpath+0xf5/0x1a0
>>> [    0.311142] RSP: 0018:ffffc9000f1efa28 EFLAGS: 00000046
>>> [    0.311144] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7f722a00
>>> [    0.311145] RDX: 0000000000000035 RSI: 0000000000d80001 RDI: ffff887f7709ca68
>>> [    0.311146] RBP: 0000000000000202 R08: 0000000000140000 R09: 0000000000000000
>>> [    0.311147] R10: ffffc9000f1ef9d8 R11: 000000001a249fa0 R12: ffff887f7709ca68
>>> [    0.311148] R13: ffffc9000f1efad0 R14: 0000000000000000 R15: ffff887f7709ca00
>>> [    0.311149] FS:  000000c423f30090(0000) GS:ffff883f7f700000(0000) knlGS:0000000000000000
>>> [    0.311150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [    0.311151] CR2: 00007feefcea4000 CR3: 0000007f7016e001 CR4: 00000000003606e0
>>> [    0.311152] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> [    0.311153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> [    0.311154] Call Trace:
>>> [    0.311157]  do_raw_spin_lock+0xad/0xc0
>>> [    0.311160]  _raw_spin_lock_irqsave+0x44/0x4b
>>> [    0.311162]  ? prepare_to_wait_exclusive+0x28/0xb0
>>> [    0.311164]  prepare_to_wait_exclusive+0x28/0xb0
>>> [    0.311167]  wbt_wait+0x127/0x330
>>> [    0.311169]  ? finish_wait+0x80/0x80
>>> [    0.311172]  ? generic_make_request+0xda/0x3b0
>>> [    0.311174]  blk_mq_make_request+0xd6/0x7b0
>>> [    0.311176]  ? blk_queue_enter+0x24/0x260
>>> [    0.311178]  ? generic_make_request+0xda/0x3b0
>>> [    0.311181]  generic_make_request+0x10c/0x3b0
>>> [    0.311183]  ? submit_bio+0x5c/0x110
>>> [    0.311185]  submit_bio+0x5c/0x110
>>> [    0.311197]  ? __ext4_journal_stop+0x36/0xa0 [ext4]
>>> [    0.311210]  ext4_io_submit+0x48/0x60 [ext4]
>>> [    0.311222]  ext4_writepages+0x810/0x11f0 [ext4]
>>> [    0.311229]  ? do_writepages+0x3c/0xd0
>>> [    0.311239]  ? ext4_mark_inode_dirty+0x260/0x260 [ext4]
>>> [    0.311240]  do_writepages+0x3c/0xd0
>>> [    0.311243]  ? _raw_spin_unlock+0x24/0x30
>>> [    0.311245]  ? wbc_attach_and_unlock_inode+0x165/0x280
>>> [    0.311248]  ? __filemap_fdatawrite_range+0xa3/0xe0
>>> [    0.311250]  __filemap_fdatawrite_range+0xa3/0xe0
>>> [    0.311253]  file_write_and_wait_range+0x34/0x90
>>> [    0.311264]  ext4_sync_file+0x151/0x500 [ext4]
>>> [    0.311267]  do_fsync+0x38/0x60
>>> [    0.311270]  SyS_fsync+0xc/0x10
>>> [    0.311272]  do_syscall_64+0x6f/0x170
>>> [    0.311274]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
>>>
>>> In the original patch, wbt_done is waking up all the exclusive
>>> processes in the wait queue, which can cause a thundering herd
>>> if there is a large number of writer threads in the queue. The
>>> original intention of the code seems to be to wake up one thread
>>> only however, it uses wake_up_all() in __wbt_done(), and then
>>> uses the following check in __wbt_wait to have only one thread
>>> actually get out of the wait loop:
>>>
>>> if (waitqueue_active(&rqw->wait) &&
>>>             rqw->wait.head.next != &wait->entry)
>>>                 return false;
>>>
>>> The problem with this is that the wait entry in wbt_wait is
>>> define with DEFINE_WAIT, which uses the autoremove wakeup function.
>>> That means that the above check is invalid - the wait entry will
>>> have been removed from the queue already by the time we hit the
>>> check in the loop.
>>>
>>> Secondly, auto-removing the wait entries also means that the wait
>>> queue essentially gets reordered "randomly" (e.g. threads re-add
>>> themselves in the order they got to run after being woken up).
>>> Additionally, new requests entering wbt_wait might overtake requests
>>> that were queued earlier, because the wait queue will be
>>> (temporarily) empty after the wake_up_all, so the waitqueue_active
>>> check will not stop them. This can cause certain threads to starve
>>> under high load.
>>>
>>> The fix is to leave the woken up requests in the queue and remove
>>> them in finish_wait() once the current thread breaks out of the
>>> wait loop in __wbt_wait. This will ensure new requests always
>>> end up at the back of the queue, and they won't overtake requests
>>> that are already in the wait queue. With that change, the loop
>>> in wbt_wait is also in line with many other wait loops in the kernel.
>>> Waking up just one thread drastically reduces lock contention, as
>>> does moving the wait queue add/remove out of the loop.
>>>
>>> A significant drop in lockdep's lock contention numbers is seen when
>>> running the test application on the patched kernel.
>>
>> I like the patch, and a few weeks ago we independently discovered that
>> the waitqueue list checking was bogus as well. My only worry is that
>> changes like this can be delicate, meaning that it's easy to introduce
>> stall conditions. What kind of testing did you push this through?
>>
>> -- 
>> Jens Axboe
>>
> I ran the following tests on both real HW with NVME devices attached
> and emulated NVME too:
> 
> 1. The test case I used to reproduce the issue, spawns a bunch of threads 
>    to concurrently read and write files with random size and content. 
>    Files are randomly fsync'd. The implementation is a FIFO queue of files. 
>    When the queue fills the test starts to verify and remove the files. This 
>    test will fail if there's a read, write, or hash check failure. It tests
>    for file corruption when lots of small files are being read and written 
>    with high concurrency.
> 
> 2. Fio for random writes with a root NVME device of 200GB
>   
>   fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite --bs=4k 
>   --direct=0 --size=10G --numjobs=2 --runtime=60 --group_reporting
>   
>   fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite --bs=4k
>   --direct=0 --size=5G --numjobs=2 --runtime=30 --fsync=64 --group_reporting
>   
>   I did see an improvement in the bandwidth numbers reported on the patched
>   kernel. 
> 
> Do you have any test case/suite in mind that you would suggest me to 
> run to be sure that patch does not introduce any stall conditions?

One thing that is always useful is to run xfstests - do a full run on
the device. If that works, then do another full run, this time limiting
the queue depth of the SCSI device to 1. If both of those pass, then
I'd feel pretty good getting this applied for 4.19.
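
(For illustration only - assuming the test device shows up as a SCSI
disk such as /dev/sdb, the queue-depth-limited run could look roughly
like this; the device name and xfstests setup are assumptions:)

	# clamp the device's queue depth to 1 via sysfs
	echo 1 > /sys/block/sdb/device/queue_depth

	# full xfstests run (TEST_DEV/SCRATCH_DEV etc. set in local.config)
	cd xfstests-dev && ./check -g auto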

-- 
Jens Axboe


* Re: [PATCH] blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait
  2018-08-01 22:09     ` Jens Axboe
@ 2018-08-07 14:29       ` Jens Axboe
  2018-08-07 20:12         ` Anchal Agarwal
  0 siblings, 1 reply; 38+ messages in thread
From: Jens Axboe @ 2018-08-07 14:29 UTC (permalink / raw)
  To: Anchal Agarwal, linux-block, linux-kernel; +Cc: fllinden, msw, sblbir

On 8/1/18 4:09 PM, Jens Axboe wrote:
> On 8/1/18 11:06 AM, Anchal Agarwal wrote:
>> On Wed, Aug 01, 2018 at 09:14:50AM -0600, Jens Axboe wrote:
>>> On 7/31/18 3:34 PM, Anchal Agarwal wrote:
>>>> Hi folks,
>>>>
>>>> This patch modifies commit e34cbd307477a
>>>> (blk-wbt: add general throttling mechanism)
>>>>
>>>> I am currently running a large bare metal instance (i3.metal)
>>>> on EC2 with 72 cores, 512GB of RAM and NVME drives, with a
>>>> 4.18 kernel. I have a workload that simulates a database
>>>> workload and I am running into lockup issues when writeback
>>>> throttling is enabled,with the hung task detector also
>>>> kicking in.
>>>>
>>>> Crash dumps show that most CPUs (up to 50 of them) are
>>>> all trying to get the wbt wait queue lock while trying to add
>>>> themselves to it in __wbt_wait (see stack traces below).
>>>>
>>>> [    0.948118] CPU: 45 PID: 0 Comm: swapper/45 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
>>>> [    0.948119] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
>>>> [    0.948120] task: ffff883f7878c000 task.stack: ffffc9000c69c000
>>>> [    0.948124] RIP: 0010:native_queued_spin_lock_slowpath+0xf8/0x1a0
>>>> [    0.948125] RSP: 0018:ffff883f7fcc3dc8 EFLAGS: 00000046
>>>> [    0.948126] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7fce2a00
>>>> [    0.948128] RDX: 000000000000001c RSI: 0000000000740001 RDI: ffff887f7709ca68
>>>> [    0.948129] RBP: 0000000000000002 R08: 0000000000b80000 R09: 0000000000000000
>>>> [    0.948130] R10: ffff883f7fcc3d78 R11: 000000000de27121 R12: 0000000000000002
>>>> [    0.948131] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
>>>> [    0.948132] FS:  0000000000000000(0000) GS:ffff883f7fcc0000(0000) knlGS:0000000000000000
>>>> [    0.948134] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [    0.948135] CR2: 000000c424c77000 CR3: 0000000002010005 CR4: 00000000003606e0
>>>> [    0.948136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>> [    0.948137] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>> [    0.948138] Call Trace:
>>>> [    0.948139]  <IRQ>
>>>> [    0.948142]  do_raw_spin_lock+0xad/0xc0
>>>> [    0.948145]  _raw_spin_lock_irqsave+0x44/0x4b
>>>> [    0.948149]  ? __wake_up_common_lock+0x53/0x90
>>>> [    0.948150]  __wake_up_common_lock+0x53/0x90
>>>> [    0.948155]  wbt_done+0x7b/0xa0
>>>> [    0.948158]  blk_mq_free_request+0xb7/0x110
>>>> [    0.948161]  __blk_mq_complete_request+0xcb/0x140
>>>> [    0.948166]  nvme_process_cq+0xce/0x1a0 [nvme]
>>>> [    0.948169]  nvme_irq+0x23/0x50 [nvme]
>>>> [    0.948173]  __handle_irq_event_percpu+0x46/0x300
>>>> [    0.948176]  handle_irq_event_percpu+0x20/0x50
>>>> [    0.948179]  handle_irq_event+0x34/0x60
>>>> [    0.948181]  handle_edge_irq+0x77/0x190
>>>> [    0.948185]  handle_irq+0xaf/0x120
>>>> [    0.948188]  do_IRQ+0x53/0x110
>>>> [    0.948191]  common_interrupt+0x87/0x87
>>>> [    0.948192]  </IRQ>
>>>> ....
>>>> [    0.311136] CPU: 4 PID: 9737 Comm: run_linux_amd64 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
>>>> [    0.311137] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
>>>> [    0.311138] task: ffff883f6e6a8000 task.stack: ffffc9000f1ec000
>>>> [    0.311141] RIP: 0010:native_queued_spin_lock_slowpath+0xf5/0x1a0
>>>> [    0.311142] RSP: 0018:ffffc9000f1efa28 EFLAGS: 00000046
>>>> [    0.311144] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7f722a00
>>>> [    0.311145] RDX: 0000000000000035 RSI: 0000000000d80001 RDI: ffff887f7709ca68
>>>> [    0.311146] RBP: 0000000000000202 R08: 0000000000140000 R09: 0000000000000000
>>>> [    0.311147] R10: ffffc9000f1ef9d8 R11: 000000001a249fa0 R12: ffff887f7709ca68
>>>> [    0.311148] R13: ffffc9000f1efad0 R14: 0000000000000000 R15: ffff887f7709ca00
>>>> [    0.311149] FS:  000000c423f30090(0000) GS:ffff883f7f700000(0000) knlGS:0000000000000000
>>>> [    0.311150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [    0.311151] CR2: 00007feefcea4000 CR3: 0000007f7016e001 CR4: 00000000003606e0
>>>> [    0.311152] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>> [    0.311153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>> [    0.311154] Call Trace:
>>>> [    0.311157]  do_raw_spin_lock+0xad/0xc0
>>>> [    0.311160]  _raw_spin_lock_irqsave+0x44/0x4b
>>>> [    0.311162]  ? prepare_to_wait_exclusive+0x28/0xb0
>>>> [    0.311164]  prepare_to_wait_exclusive+0x28/0xb0
>>>> [    0.311167]  wbt_wait+0x127/0x330
>>>> [    0.311169]  ? finish_wait+0x80/0x80
>>>> [    0.311172]  ? generic_make_request+0xda/0x3b0
>>>> [    0.311174]  blk_mq_make_request+0xd6/0x7b0
>>>> [    0.311176]  ? blk_queue_enter+0x24/0x260
>>>> [    0.311178]  ? generic_make_request+0xda/0x3b0
>>>> [    0.311181]  generic_make_request+0x10c/0x3b0
>>>> [    0.311183]  ? submit_bio+0x5c/0x110
>>>> [    0.311185]  submit_bio+0x5c/0x110
>>>> [    0.311197]  ? __ext4_journal_stop+0x36/0xa0 [ext4]
>>>> [    0.311210]  ext4_io_submit+0x48/0x60 [ext4]
>>>> [    0.311222]  ext4_writepages+0x810/0x11f0 [ext4]
>>>> [    0.311229]  ? do_writepages+0x3c/0xd0
>>>> [    0.311239]  ? ext4_mark_inode_dirty+0x260/0x260 [ext4]
>>>> [    0.311240]  do_writepages+0x3c/0xd0
>>>> [    0.311243]  ? _raw_spin_unlock+0x24/0x30
>>>> [    0.311245]  ? wbc_attach_and_unlock_inode+0x165/0x280
>>>> [    0.311248]  ? __filemap_fdatawrite_range+0xa3/0xe0
>>>> [    0.311250]  __filemap_fdatawrite_range+0xa3/0xe0
>>>> [    0.311253]  file_write_and_wait_range+0x34/0x90
>>>> [    0.311264]  ext4_sync_file+0x151/0x500 [ext4]
>>>> [    0.311267]  do_fsync+0x38/0x60
>>>> [    0.311270]  SyS_fsync+0xc/0x10
>>>> [    0.311272]  do_syscall_64+0x6f/0x170
>>>> [    0.311274]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
>>>>
>>>> In the original patch, wbt_done is waking up all the exclusive
>>>> processes in the wait queue, which can cause a thundering herd
>>>> if there is a large number of writer threads in the queue. The
>>>> original intention of the code seems to be to wake up one thread
>>>> only however, it uses wake_up_all() in __wbt_done(), and then
>>>> uses the following check in __wbt_wait to have only one thread
>>>> actually get out of the wait loop:
>>>>
>>>> if (waitqueue_active(&rqw->wait) &&
>>>>             rqw->wait.head.next != &wait->entry)
>>>>                 return false;
>>>>
>>>> The problem with this is that the wait entry in wbt_wait is
>>>> define with DEFINE_WAIT, which uses the autoremove wakeup function.
>>>> That means that the above check is invalid - the wait entry will
>>>> have been removed from the queue already by the time we hit the
>>>> check in the loop.
>>>>
>>>> Secondly, auto-removing the wait entries also means that the wait
>>>> queue essentially gets reordered "randomly" (e.g. threads re-add
>>>> themselves in the order they got to run after being woken up).
>>>> Additionally, new requests entering wbt_wait might overtake requests
>>>> that were queued earlier, because the wait queue will be
>>>> (temporarily) empty after the wake_up_all, so the waitqueue_active
>>>> check will not stop them. This can cause certain threads to starve
>>>> under high load.
>>>>
>>>> The fix is to leave the woken up requests in the queue and remove
>>>> them in finish_wait() once the current thread breaks out of the
>>>> wait loop in __wbt_wait. This will ensure new requests always
>>>> end up at the back of the queue, and they won't overtake requests
>>>> that are already in the wait queue. With that change, the loop
>>>> in wbt_wait is also in line with many other wait loops in the kernel.
>>>> Waking up just one thread drastically reduces lock contention, as
>>>> does moving the wait queue add/remove out of the loop.
>>>>
>>>> A significant drop in lockdep's lock contention numbers is seen when
>>>> running the test application on the patched kernel.
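Schematically, the resulting wait loop follows a common kernel pattern along
these lines (a simplified sketch of the behaviour described above, not the
literal blk-wbt code; rqw and may_proceed() are illustrative names):

    DECLARE_WAITQUEUE(wait, current);   /* default wake function: not auto-removed */

    if (!waitqueue_active(&rqw->wait) && may_proceed(rqw))
            return;                     /* fast path only when nobody is queued */

    add_wait_queue_exclusive(&rqw->wait, &wait);
    for (;;) {
            set_current_state(TASK_UNINTERRUPTIBLE);
            if (may_proceed(rqw))       /* e.g. inflight count below the limit */
                    break;
            io_schedule();
    }
    finish_wait(&rqw->wait, &wait);     /* sets TASK_RUNNING, unlinks the entry */

    /* completion side: wake a single exclusive waiter instead of wake_up_all() */
    wake_up(&rqw->wait);

New arrivals take the fast path only while the queue is empty, woken waiters
stay in place until they finish waiting, and a completion wakes exactly one
exclusive waiter, which is the ordering and contention behaviour described in
the commit message above.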
>>>
>>> I like the patch, and a few weeks ago we independently discovered that
>>> the waitqueue list checking was bogus as well. My only worry is that
>>> changes like this can be delicate, meaning that it's easy to introduce
>>> stall conditions. What kind of testing did you push this through?
>>>
>>> -- 
>>> Jens Axboe
>>>
>> I ran the following tests on both real HW with NVME devices attached
>> and emulated NVME too:
>>
>> 1. The test case I used to reproduce the issue, spawns a bunch of threads 
>>    to concurrently read and write files with random size and content. 
>>    Files are randomly fsync'd. The implementation is a FIFO queue of files. 
>>    When the queue fills the test starts to verify and remove the files. This 
>>    test will fail if there's a read, write, or hash check failure. It tests
>>    for file corruption when lots of small files are being read and written 
>>    with high concurrency.
>>
>> 2. Fio for random writes with a root NVME device of 200GB
>>   
>>   fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite --bs=4k 
>>   --direct=0 --size=10G --numjobs=2 --runtime=60 --group_reporting
>>   
>>   fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite --bs=4k
>>   --direct=0 --size=5G --numjobs=2 --runtime=30 --fsync=64 --group_reporting
>>   
>>   I did see an improvement in the bandwidth numbers reported on the patched
>>   kernel. 
>>
>> Do you have any test case/suite in mind that you would suggest I run
>> to be sure the patch does not introduce any stall conditions?
> 
> One thing that is always useful is to run xfstest, do a full run on
> the device. If that works, then do another full run, this time limiting
> the queue depth of the SCSI device to 1. If both of those pass, then
> I'd feel pretty good getting this applied for 4.19.
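For a concrete idea, such a run could look roughly like this (sdX and the
xfstests checkout path are placeholders, not a prescribed setup):

    # full xfstests run against the device under test
    cd xfstests-dev && ./check

    # second pass with the SCSI device's queue depth limited to 1
    echo 1 > /sys/block/sdX/device/queue_depth
    ./check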

Did you get a chance to run this full test?

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait
  2018-08-07 14:29       ` Jens Axboe
@ 2018-08-07 20:12         ` Anchal Agarwal
  2018-08-07 20:39           ` Jens Axboe
  0 siblings, 1 reply; 38+ messages in thread
From: Anchal Agarwal @ 2018-08-07 20:12 UTC (permalink / raw)
  To: axboe, linux-block, linux-kernel
  Cc: fllinden, sblbir, msw, linux-kernel, linux-block, anchalag

On Tue, Aug 07, 2018 at 08:29:44AM -0600, Jens Axboe wrote:
> On 8/1/18 4:09 PM, Jens Axboe wrote:
> > On 8/1/18 11:06 AM, Anchal Agarwal wrote:
> >> On Wed, Aug 01, 2018 at 09:14:50AM -0600, Jens Axboe wrote:
> >>> On 7/31/18 3:34 PM, Anchal Agarwal wrote:
> >>>> Hi folks,
> >>>>
> >>>> This patch modifies commit e34cbd307477a
> >>>> (blk-wbt: add general throttling mechanism)
> >>>>
> >>>> I am currently running a large bare metal instance (i3.metal)
> >>>> on EC2 with 72 cores, 512GB of RAM and NVME drives, with a
> >>>> 4.18 kernel. I have a workload that simulates a database
> >>>> workload and I am running into lockup issues when writeback
> >>>> throttling is enabled,with the hung task detector also
> >>>> kicking in.
> >>>>
> >>>> Crash dumps show that most CPUs (up to 50 of them) are
> >>>> all trying to get the wbt wait queue lock while trying to add
> >>>> themselves to it in __wbt_wait (see stack traces below).
> >>>>
> >>>> [    0.948118] CPU: 45 PID: 0 Comm: swapper/45 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
> >>>> [    0.948119] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
> >>>> [    0.948120] task: ffff883f7878c000 task.stack: ffffc9000c69c000
> >>>> [    0.948124] RIP: 0010:native_queued_spin_lock_slowpath+0xf8/0x1a0
> >>>> [    0.948125] RSP: 0018:ffff883f7fcc3dc8 EFLAGS: 00000046
> >>>> [    0.948126] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7fce2a00
> >>>> [    0.948128] RDX: 000000000000001c RSI: 0000000000740001 RDI: ffff887f7709ca68
> >>>> [    0.948129] RBP: 0000000000000002 R08: 0000000000b80000 R09: 0000000000000000
> >>>> [    0.948130] R10: ffff883f7fcc3d78 R11: 000000000de27121 R12: 0000000000000002
> >>>> [    0.948131] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
> >>>> [    0.948132] FS:  0000000000000000(0000) GS:ffff883f7fcc0000(0000) knlGS:0000000000000000
> >>>> [    0.948134] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>>> [    0.948135] CR2: 000000c424c77000 CR3: 0000000002010005 CR4: 00000000003606e0
> >>>> [    0.948136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >>>> [    0.948137] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> >>>> [    0.948138] Call Trace:
> >>>> [    0.948139]  <IRQ>
> >>>> [    0.948142]  do_raw_spin_lock+0xad/0xc0
> >>>> [    0.948145]  _raw_spin_lock_irqsave+0x44/0x4b
> >>>> [    0.948149]  ? __wake_up_common_lock+0x53/0x90
> >>>> [    0.948150]  __wake_up_common_lock+0x53/0x90
> >>>> [    0.948155]  wbt_done+0x7b/0xa0
> >>>> [    0.948158]  blk_mq_free_request+0xb7/0x110
> >>>> [    0.948161]  __blk_mq_complete_request+0xcb/0x140
> >>>> [    0.948166]  nvme_process_cq+0xce/0x1a0 [nvme]
> >>>> [    0.948169]  nvme_irq+0x23/0x50 [nvme]
> >>>> [    0.948173]  __handle_irq_event_percpu+0x46/0x300
> >>>> [    0.948176]  handle_irq_event_percpu+0x20/0x50
> >>>> [    0.948179]  handle_irq_event+0x34/0x60
> >>>> [    0.948181]  handle_edge_irq+0x77/0x190
> >>>> [    0.948185]  handle_irq+0xaf/0x120
> >>>> [    0.948188]  do_IRQ+0x53/0x110
> >>>> [    0.948191]  common_interrupt+0x87/0x87
> >>>> [    0.948192]  </IRQ>
> >>>> ....
> >>>> [    0.311136] CPU: 4 PID: 9737 Comm: run_linux_amd64 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
> >>>> [    0.311137] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
> >>>> [    0.311138] task: ffff883f6e6a8000 task.stack: ffffc9000f1ec000
> >>>> [    0.311141] RIP: 0010:native_queued_spin_lock_slowpath+0xf5/0x1a0
> >>>> [    0.311142] RSP: 0018:ffffc9000f1efa28 EFLAGS: 00000046
> >>>> [    0.311144] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7f722a00
> >>>> [    0.311145] RDX: 0000000000000035 RSI: 0000000000d80001 RDI: ffff887f7709ca68
> >>>> [    0.311146] RBP: 0000000000000202 R08: 0000000000140000 R09: 0000000000000000
> >>>> [    0.311147] R10: ffffc9000f1ef9d8 R11: 000000001a249fa0 R12: ffff887f7709ca68
> >>>> [    0.311148] R13: ffffc9000f1efad0 R14: 0000000000000000 R15: ffff887f7709ca00
> >>>> [    0.311149] FS:  000000c423f30090(0000) GS:ffff883f7f700000(0000) knlGS:0000000000000000
> >>>> [    0.311150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>>> [    0.311151] CR2: 00007feefcea4000 CR3: 0000007f7016e001 CR4: 00000000003606e0
> >>>> [    0.311152] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >>>> [    0.311153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> >>>> [    0.311154] Call Trace:
> >>>> [    0.311157]  do_raw_spin_lock+0xad/0xc0
> >>>> [    0.311160]  _raw_spin_lock_irqsave+0x44/0x4b
> >>>> [    0.311162]  ? prepare_to_wait_exclusive+0x28/0xb0
> >>>> [    0.311164]  prepare_to_wait_exclusive+0x28/0xb0
> >>>> [    0.311167]  wbt_wait+0x127/0x330
> >>>> [    0.311169]  ? finish_wait+0x80/0x80
> >>>> [    0.311172]  ? generic_make_request+0xda/0x3b0
> >>>> [    0.311174]  blk_mq_make_request+0xd6/0x7b0
> >>>> [    0.311176]  ? blk_queue_enter+0x24/0x260
> >>>> [    0.311178]  ? generic_make_request+0xda/0x3b0
> >>>> [    0.311181]  generic_make_request+0x10c/0x3b0
> >>>> [    0.311183]  ? submit_bio+0x5c/0x110
> >>>> [    0.311185]  submit_bio+0x5c/0x110
> >>>> [    0.311197]  ? __ext4_journal_stop+0x36/0xa0 [ext4]
> >>>> [    0.311210]  ext4_io_submit+0x48/0x60 [ext4]
> >>>> [    0.311222]  ext4_writepages+0x810/0x11f0 [ext4]
> >>>> [    0.311229]  ? do_writepages+0x3c/0xd0
> >>>> [    0.311239]  ? ext4_mark_inode_dirty+0x260/0x260 [ext4]
> >>>> [    0.311240]  do_writepages+0x3c/0xd0
> >>>> [    0.311243]  ? _raw_spin_unlock+0x24/0x30
> >>>> [    0.311245]  ? wbc_attach_and_unlock_inode+0x165/0x280
> >>>> [    0.311248]  ? __filemap_fdatawrite_range+0xa3/0xe0
> >>>> [    0.311250]  __filemap_fdatawrite_range+0xa3/0xe0
> >>>> [    0.311253]  file_write_and_wait_range+0x34/0x90
> >>>> [    0.311264]  ext4_sync_file+0x151/0x500 [ext4]
> >>>> [    0.311267]  do_fsync+0x38/0x60
> >>>> [    0.311270]  SyS_fsync+0xc/0x10
> >>>> [    0.311272]  do_syscall_64+0x6f/0x170
> >>>> [    0.311274]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
> >>>>
> >>>> In the original patch, wbt_done is waking up all the exclusive
> >>>> processes in the wait queue, which can cause a thundering herd
> >>>> if there is a large number of writer threads in the queue. The
> >>>> original intention of the code seems to be to wake up one thread
> >>>> only however, it uses wake_up_all() in __wbt_done(), and then
> >>>> uses the following check in __wbt_wait to have only one thread
> >>>> actually get out of the wait loop:
> >>>>
> >>>> if (waitqueue_active(&rqw->wait) &&
> >>>>             rqw->wait.head.next != &wait->entry)
> >>>>                 return false;
> >>>>
> >>>> The problem with this is that the wait entry in wbt_wait is
> >>>> define with DEFINE_WAIT, which uses the autoremove wakeup function.
> >>>> That means that the above check is invalid - the wait entry will
> >>>> have been removed from the queue already by the time we hit the
> >>>> check in the loop.
> >>>>
> >>>> Secondly, auto-removing the wait entries also means that the wait
> >>>> queue essentially gets reordered "randomly" (e.g. threads re-add
> >>>> themselves in the order they got to run after being woken up).
> >>>> Additionally, new requests entering wbt_wait might overtake requests
> >>>> that were queued earlier, because the wait queue will be
> >>>> (temporarily) empty after the wake_up_all, so the waitqueue_active
> >>>> check will not stop them. This can cause certain threads to starve
> >>>> under high load.
> >>>>
> >>>> The fix is to leave the woken up requests in the queue and remove
> >>>> them in finish_wait() once the current thread breaks out of the
> >>>> wait loop in __wbt_wait. This will ensure new requests always
> >>>> end up at the back of the queue, and they won't overtake requests
> >>>> that are already in the wait queue. With that change, the loop
> >>>> in wbt_wait is also in line with many other wait loops in the kernel.
> >>>> Waking up just one thread drastically reduces lock contention, as
> >>>> does moving the wait queue add/remove out of the loop.
> >>>>
> >>>> A significant drop in lockdep's lock contention numbers is seen when
> >>>> running the test application on the patched kernel.
> >>>
> >>> I like the patch, and a few weeks ago we independently discovered that
> >>> the waitqueue list checking was bogus as well. My only worry is that
> >>> changes like this can be delicate, meaning that it's easy to introduce
> >>> stall conditions. What kind of testing did you push this through?
> >>>
> >>> -- 
> >>> Jens Axboe
> >>>
> >> I ran the following tests on both real HW with NVME devices attached
> >> and emulated NVME too:
> >>
> >> 1. The test case I used to reproduce the issue, spawns a bunch of threads 
> >>    to concurrently read and write files with random size and content. 
> >>    Files are randomly fsync'd. The implementation is a FIFO queue of files. 
> >>    When the queue fills the test starts to verify and remove the files. This 
> >>    test will fail if there's a read, write, or hash check failure. It tests
> >>    for file corruption when lots of small files are being read and written 
> >>    with high concurrency.
> >>
> >> 2. Fio for random writes with a root NVME device of 200GB
> >>   
> >>   fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite --bs=4k 
> >>   --direct=0 --size=10G --numjobs=2 --runtime=60 --group_reporting
> >>   
> >>   fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite --bs=4k
> >>   --direct=0 --size=5G --numjobs=2 --runtime=30 --fsync=64 --group_reporting
> >>   
> >>   I did see an improvement in the bandwidth numbers reported on the patched
> >>   kernel. 
> >>
> >> Do you have any test case/suite in mind that you would suggest me to 
> >> run to be sure that patch does not introduce any stall conditions?
> > 
> > One thing that is always useful is to run xfstest, do a full run on
> > the device. If that works, then do another full run, this time limiting
> > the queue depth of the SCSI device to 1. If both of those pass, then
> > I'd feel pretty good getting this applied for 4.19.
> 
> Did you get a chance to run this full test?
> 
> -- 
> Jens Axboe
> 
>

Hi Jens,
Yes, I did run the tests and was in the process of compiling concrete results.
I tested the following environments against the xfs/auto group:
1. Vanilla 4.18-rc kernel
2. 4.18 kernel with the blk-wbt patch
3. 4.18 kernel with the blk-wbt patch + io_queue_depth=2. I understand you
asked for a SCSI device queue depth of 1; however, I have NVMe devices in my
environment, and 2 is the minimum value of io_queue_depth allowed by the NVMe
driver code.
The results look pretty much the same, with no stalls or exceptional failures.
xfs/auto ran 296-odd tests with 3 failures and 130 or so "no runs"; the
remaining tests passed. The "skipped tests" were mostly due to missing
features (e.g. reflink support on the scratch filesystem).
The failures were consistent across runs on the 3 different environments.
I am also running the full test suite, but it is taking a long time because I
am hitting a kernel BUG in the xfs code in some generic tests. This BUG is
not related to the patch; I see it in the vanilla kernel too. I am in the
process of excluding these kinds of tests as they come up and re-running the
suite; however, this process is time consuming.
Do you have any specific tests in mind that you would like me to run apart
from what I have already tested above?


Thanks,
Anchal Agarwal 
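As a rough sketch, environment 3 above could be set up with something like
the following (device names and file paths are placeholders; io_queue_depth
is the NVMe driver module parameter referred to in the message):

    # clamp the NVMe queue depth; 2 is the minimum the driver accepts
    echo "options nvme io_queue_depth=2" > /etc/modprobe.d/nvme-wbt-test.conf
    # (or boot with nvme.io_queue_depth=2 on the kernel command line)

    # run the auto group from an xfstests checkout, with FSTYP=xfs and
    # TEST_DEV/SCRATCH_DEV pointing at NVMe partitions in local.config
    cd xfstests-dev
    ./check -g auto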

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait
  2018-08-07 20:12         ` Anchal Agarwal
@ 2018-08-07 20:39           ` Jens Axboe
  2018-08-07 21:12             ` Anchal Agarwal
  0 siblings, 1 reply; 38+ messages in thread
From: Jens Axboe @ 2018-08-07 20:39 UTC (permalink / raw)
  To: Anchal Agarwal, linux-block, linux-kernel; +Cc: fllinden, sblbir, msw

On 8/7/18 2:12 PM, Anchal Agarwal wrote:
> On Tue, Aug 07, 2018 at 08:29:44AM -0600, Jens Axboe wrote:
>> On 8/1/18 4:09 PM, Jens Axboe wrote:
>>> On 8/1/18 11:06 AM, Anchal Agarwal wrote:
>>>> On Wed, Aug 01, 2018 at 09:14:50AM -0600, Jens Axboe wrote:
>>>>> On 7/31/18 3:34 PM, Anchal Agarwal wrote:
>>>>>> Hi folks,
>>>>>>
>>>>>> This patch modifies commit e34cbd307477a
>>>>>> (blk-wbt: add general throttling mechanism)
>>>>>>
>>>>>> I am currently running a large bare metal instance (i3.metal)
>>>>>> on EC2 with 72 cores, 512GB of RAM and NVME drives, with a
>>>>>> 4.18 kernel. I have a workload that simulates a database
>>>>>> workload and I am running into lockup issues when writeback
>>>>>> throttling is enabled,with the hung task detector also
>>>>>> kicking in.
>>>>>>
>>>>>> Crash dumps show that most CPUs (up to 50 of them) are
>>>>>> all trying to get the wbt wait queue lock while trying to add
>>>>>> themselves to it in __wbt_wait (see stack traces below).
>>>>>>
>>>>>> [    0.948118] CPU: 45 PID: 0 Comm: swapper/45 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
>>>>>> [    0.948119] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
>>>>>> [    0.948120] task: ffff883f7878c000 task.stack: ffffc9000c69c000
>>>>>> [    0.948124] RIP: 0010:native_queued_spin_lock_slowpath+0xf8/0x1a0
>>>>>> [    0.948125] RSP: 0018:ffff883f7fcc3dc8 EFLAGS: 00000046
>>>>>> [    0.948126] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7fce2a00
>>>>>> [    0.948128] RDX: 000000000000001c RSI: 0000000000740001 RDI: ffff887f7709ca68
>>>>>> [    0.948129] RBP: 0000000000000002 R08: 0000000000b80000 R09: 0000000000000000
>>>>>> [    0.948130] R10: ffff883f7fcc3d78 R11: 000000000de27121 R12: 0000000000000002
>>>>>> [    0.948131] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
>>>>>> [    0.948132] FS:  0000000000000000(0000) GS:ffff883f7fcc0000(0000) knlGS:0000000000000000
>>>>>> [    0.948134] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>> [    0.948135] CR2: 000000c424c77000 CR3: 0000000002010005 CR4: 00000000003606e0
>>>>>> [    0.948136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>>> [    0.948137] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>>>> [    0.948138] Call Trace:
>>>>>> [    0.948139]  <IRQ>
>>>>>> [    0.948142]  do_raw_spin_lock+0xad/0xc0
>>>>>> [    0.948145]  _raw_spin_lock_irqsave+0x44/0x4b
>>>>>> [    0.948149]  ? __wake_up_common_lock+0x53/0x90
>>>>>> [    0.948150]  __wake_up_common_lock+0x53/0x90
>>>>>> [    0.948155]  wbt_done+0x7b/0xa0
>>>>>> [    0.948158]  blk_mq_free_request+0xb7/0x110
>>>>>> [    0.948161]  __blk_mq_complete_request+0xcb/0x140
>>>>>> [    0.948166]  nvme_process_cq+0xce/0x1a0 [nvme]
>>>>>> [    0.948169]  nvme_irq+0x23/0x50 [nvme]
>>>>>> [    0.948173]  __handle_irq_event_percpu+0x46/0x300
>>>>>> [    0.948176]  handle_irq_event_percpu+0x20/0x50
>>>>>> [    0.948179]  handle_irq_event+0x34/0x60
>>>>>> [    0.948181]  handle_edge_irq+0x77/0x190
>>>>>> [    0.948185]  handle_irq+0xaf/0x120
>>>>>> [    0.948188]  do_IRQ+0x53/0x110
>>>>>> [    0.948191]  common_interrupt+0x87/0x87
>>>>>> [    0.948192]  </IRQ>
>>>>>> ....
>>>>>> [    0.311136] CPU: 4 PID: 9737 Comm: run_linux_amd64 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
>>>>>> [    0.311137] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
>>>>>> [    0.311138] task: ffff883f6e6a8000 task.stack: ffffc9000f1ec000
>>>>>> [    0.311141] RIP: 0010:native_queued_spin_lock_slowpath+0xf5/0x1a0
>>>>>> [    0.311142] RSP: 0018:ffffc9000f1efa28 EFLAGS: 00000046
>>>>>> [    0.311144] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7f722a00
>>>>>> [    0.311145] RDX: 0000000000000035 RSI: 0000000000d80001 RDI: ffff887f7709ca68
>>>>>> [    0.311146] RBP: 0000000000000202 R08: 0000000000140000 R09: 0000000000000000
>>>>>> [    0.311147] R10: ffffc9000f1ef9d8 R11: 000000001a249fa0 R12: ffff887f7709ca68
>>>>>> [    0.311148] R13: ffffc9000f1efad0 R14: 0000000000000000 R15: ffff887f7709ca00
>>>>>> [    0.311149] FS:  000000c423f30090(0000) GS:ffff883f7f700000(0000) knlGS:0000000000000000
>>>>>> [    0.311150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>> [    0.311151] CR2: 00007feefcea4000 CR3: 0000007f7016e001 CR4: 00000000003606e0
>>>>>> [    0.311152] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>>> [    0.311153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>>>> [    0.311154] Call Trace:
>>>>>> [    0.311157]  do_raw_spin_lock+0xad/0xc0
>>>>>> [    0.311160]  _raw_spin_lock_irqsave+0x44/0x4b
>>>>>> [    0.311162]  ? prepare_to_wait_exclusive+0x28/0xb0
>>>>>> [    0.311164]  prepare_to_wait_exclusive+0x28/0xb0
>>>>>> [    0.311167]  wbt_wait+0x127/0x330
>>>>>> [    0.311169]  ? finish_wait+0x80/0x80
>>>>>> [    0.311172]  ? generic_make_request+0xda/0x3b0
>>>>>> [    0.311174]  blk_mq_make_request+0xd6/0x7b0
>>>>>> [    0.311176]  ? blk_queue_enter+0x24/0x260
>>>>>> [    0.311178]  ? generic_make_request+0xda/0x3b0
>>>>>> [    0.311181]  generic_make_request+0x10c/0x3b0
>>>>>> [    0.311183]  ? submit_bio+0x5c/0x110
>>>>>> [    0.311185]  submit_bio+0x5c/0x110
>>>>>> [    0.311197]  ? __ext4_journal_stop+0x36/0xa0 [ext4]
>>>>>> [    0.311210]  ext4_io_submit+0x48/0x60 [ext4]
>>>>>> [    0.311222]  ext4_writepages+0x810/0x11f0 [ext4]
>>>>>> [    0.311229]  ? do_writepages+0x3c/0xd0
>>>>>> [    0.311239]  ? ext4_mark_inode_dirty+0x260/0x260 [ext4]
>>>>>> [    0.311240]  do_writepages+0x3c/0xd0
>>>>>> [    0.311243]  ? _raw_spin_unlock+0x24/0x30
>>>>>> [    0.311245]  ? wbc_attach_and_unlock_inode+0x165/0x280
>>>>>> [    0.311248]  ? __filemap_fdatawrite_range+0xa3/0xe0
>>>>>> [    0.311250]  __filemap_fdatawrite_range+0xa3/0xe0
>>>>>> [    0.311253]  file_write_and_wait_range+0x34/0x90
>>>>>> [    0.311264]  ext4_sync_file+0x151/0x500 [ext4]
>>>>>> [    0.311267]  do_fsync+0x38/0x60
>>>>>> [    0.311270]  SyS_fsync+0xc/0x10
>>>>>> [    0.311272]  do_syscall_64+0x6f/0x170
>>>>>> [    0.311274]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
>>>>>>
>>>>>> In the original patch, wbt_done is waking up all the exclusive
>>>>>> processes in the wait queue, which can cause a thundering herd
>>>>>> if there is a large number of writer threads in the queue. The
>>>>>> original intention of the code seems to be to wake up one thread
>>>>>> only however, it uses wake_up_all() in __wbt_done(), and then
>>>>>> uses the following check in __wbt_wait to have only one thread
>>>>>> actually get out of the wait loop:
>>>>>>
>>>>>> if (waitqueue_active(&rqw->wait) &&
>>>>>>             rqw->wait.head.next != &wait->entry)
>>>>>>                 return false;
>>>>>>
>>>>>> The problem with this is that the wait entry in wbt_wait is
>>>>>> define with DEFINE_WAIT, which uses the autoremove wakeup function.
>>>>>> That means that the above check is invalid - the wait entry will
>>>>>> have been removed from the queue already by the time we hit the
>>>>>> check in the loop.
>>>>>>
>>>>>> Secondly, auto-removing the wait entries also means that the wait
>>>>>> queue essentially gets reordered "randomly" (e.g. threads re-add
>>>>>> themselves in the order they got to run after being woken up).
>>>>>> Additionally, new requests entering wbt_wait might overtake requests
>>>>>> that were queued earlier, because the wait queue will be
>>>>>> (temporarily) empty after the wake_up_all, so the waitqueue_active
>>>>>> check will not stop them. This can cause certain threads to starve
>>>>>> under high load.
>>>>>>
>>>>>> The fix is to leave the woken up requests in the queue and remove
>>>>>> them in finish_wait() once the current thread breaks out of the
>>>>>> wait loop in __wbt_wait. This will ensure new requests always
>>>>>> end up at the back of the queue, and they won't overtake requests
>>>>>> that are already in the wait queue. With that change, the loop
>>>>>> in wbt_wait is also in line with many other wait loops in the kernel.
>>>>>> Waking up just one thread drastically reduces lock contention, as
>>>>>> does moving the wait queue add/remove out of the loop.
>>>>>>
>>>>>> A significant drop in lockdep's lock contention numbers is seen when
>>>>>> running the test application on the patched kernel.
>>>>>
>>>>> I like the patch, and a few weeks ago we independently discovered that
>>>>> the waitqueue list checking was bogus as well. My only worry is that
>>>>> changes like this can be delicate, meaning that it's easy to introduce
>>>>> stall conditions. What kind of testing did you push this through?
>>>>>
>>>>> -- 
>>>>> Jens Axboe
>>>>>
>>>> I ran the following tests on both real HW with NVME devices attached
>>>> and emulated NVME too:
>>>>
>>>> 1. The test case I used to reproduce the issue, spawns a bunch of threads 
>>>>    to concurrently read and write files with random size and content. 
>>>>    Files are randomly fsync'd. The implementation is a FIFO queue of files. 
>>>>    When the queue fills the test starts to verify and remove the files. This 
>>>>    test will fail if there's a read, write, or hash check failure. It tests
>>>>    for file corruption when lots of small files are being read and written 
>>>>    with high concurrency.
>>>>
>>>> 2. Fio for random writes with a root NVME device of 200GB
>>>>   
>>>>   fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite --bs=4k 
>>>>   --direct=0 --size=10G --numjobs=2 --runtime=60 --group_reporting
>>>>   
>>>>   fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite --bs=4k
>>>>   --direct=0 --size=5G --numjobs=2 --runtime=30 --fsync=64 --group_reporting
>>>>   
>>>>   I did see an improvement in the bandwidth numbers reported on the patched
>>>>   kernel. 
>>>>
>>>> Do you have any test case/suite in mind that you would suggest me to 
>>>> run to be sure that patch does not introduce any stall conditions?
>>>
>>> One thing that is always useful is to run xfstest, do a full run on
>>> the device. If that works, then do another full run, this time limiting
>>> the queue depth of the SCSI device to 1. If both of those pass, then
>>> I'd feel pretty good getting this applied for 4.19.
>>
>> Did you get a chance to run this full test?
>>
>> -- 
>> Jens Axboe
>>
>>
> 
> Hi Jens,
> Yes I did run the tests and was in the process of compiling concrete results
> I tested following environments against xfs/auto group
> 1. Vanilla 4.18.rc kernel
> 2. 4.18 kernel with the blk-wbt patch
> 3. 4.18 kernel with the blk-wbt patch + io_queue_depth=2. I 
> understand you asked for queue depth for SCSI device=1 however, I have NVME 
> devices in my environment and 2 is the minimum value for io_queue_depth allowed 
> according to the NVME driver code. The results pretty much look same with no 
> stalls or exceptional failures. 
> xfs/auto ran 296 odd tests with 3 failures and 130 something "no runs". 
> Remaining tests passed. "Skipped tests"  were mostly due to missing features
> (eg: reflink support on scratch filesystem)
> The failures were consistent across runs on 3 different environments. 
> I am also running full test suite but it is taking long time as I am 
> hitting kernel BUG in xfs code in some generic tests. This BUG is not 
> related to the patch and  I see them in vanilla kernel too. I am in 
> the process of excluding these kind of tests as they come and 
> re-run the suite however, this proces is time taking. 
> Do you have any specific tests in mind that you would like me 
> to run apart from what I have already tested above?

Thanks, I think that looks good. I'll get your patch applied for
4.19.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait
  2018-08-07 20:39           ` Jens Axboe
@ 2018-08-07 21:12             ` Anchal Agarwal
  2018-08-07 21:19               ` Jens Axboe
  2018-08-07 21:28               ` Matt Wilson
  0 siblings, 2 replies; 38+ messages in thread
From: Anchal Agarwal @ 2018-08-07 21:12 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, linux-kernel, fllinden, sblbir, msw, anchalag

On Tue, Aug 07, 2018 at 02:39:48PM -0600, Jens Axboe wrote:
> On 8/7/18 2:12 PM, Anchal Agarwal wrote:
> > On Tue, Aug 07, 2018 at 08:29:44AM -0600, Jens Axboe wrote:
> >> On 8/1/18 4:09 PM, Jens Axboe wrote:
> >>> On 8/1/18 11:06 AM, Anchal Agarwal wrote:
> >>>> On Wed, Aug 01, 2018 at 09:14:50AM -0600, Jens Axboe wrote:
> >>>>> On 7/31/18 3:34 PM, Anchal Agarwal wrote:
> >>>>>> Hi folks,
> >>>>>>
> >>>>>> This patch modifies commit e34cbd307477a
> >>>>>> (blk-wbt: add general throttling mechanism)
> >>>>>>
> >>>>>> I am currently running a large bare metal instance (i3.metal)
> >>>>>> on EC2 with 72 cores, 512GB of RAM and NVME drives, with a
> >>>>>> 4.18 kernel. I have a workload that simulates a database
> >>>>>> workload and I am running into lockup issues when writeback
> >>>>>> throttling is enabled,with the hung task detector also
> >>>>>> kicking in.
> >>>>>>
> >>>>>> Crash dumps show that most CPUs (up to 50 of them) are
> >>>>>> all trying to get the wbt wait queue lock while trying to add
> >>>>>> themselves to it in __wbt_wait (see stack traces below).
> >>>>>>
> >>>>>> [    0.948118] CPU: 45 PID: 0 Comm: swapper/45 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
> >>>>>> [    0.948119] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
> >>>>>> [    0.948120] task: ffff883f7878c000 task.stack: ffffc9000c69c000
> >>>>>> [    0.948124] RIP: 0010:native_queued_spin_lock_slowpath+0xf8/0x1a0
> >>>>>> [    0.948125] RSP: 0018:ffff883f7fcc3dc8 EFLAGS: 00000046
> >>>>>> [    0.948126] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7fce2a00
> >>>>>> [    0.948128] RDX: 000000000000001c RSI: 0000000000740001 RDI: ffff887f7709ca68
> >>>>>> [    0.948129] RBP: 0000000000000002 R08: 0000000000b80000 R09: 0000000000000000
> >>>>>> [    0.948130] R10: ffff883f7fcc3d78 R11: 000000000de27121 R12: 0000000000000002
> >>>>>> [    0.948131] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
> >>>>>> [    0.948132] FS:  0000000000000000(0000) GS:ffff883f7fcc0000(0000) knlGS:0000000000000000
> >>>>>> [    0.948134] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>>>>> [    0.948135] CR2: 000000c424c77000 CR3: 0000000002010005 CR4: 00000000003606e0
> >>>>>> [    0.948136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >>>>>> [    0.948137] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> >>>>>> [    0.948138] Call Trace:
> >>>>>> [    0.948139]  <IRQ>
> >>>>>> [    0.948142]  do_raw_spin_lock+0xad/0xc0
> >>>>>> [    0.948145]  _raw_spin_lock_irqsave+0x44/0x4b
> >>>>>> [    0.948149]  ? __wake_up_common_lock+0x53/0x90
> >>>>>> [    0.948150]  __wake_up_common_lock+0x53/0x90
> >>>>>> [    0.948155]  wbt_done+0x7b/0xa0
> >>>>>> [    0.948158]  blk_mq_free_request+0xb7/0x110
> >>>>>> [    0.948161]  __blk_mq_complete_request+0xcb/0x140
> >>>>>> [    0.948166]  nvme_process_cq+0xce/0x1a0 [nvme]
> >>>>>> [    0.948169]  nvme_irq+0x23/0x50 [nvme]
> >>>>>> [    0.948173]  __handle_irq_event_percpu+0x46/0x300
> >>>>>> [    0.948176]  handle_irq_event_percpu+0x20/0x50
> >>>>>> [    0.948179]  handle_irq_event+0x34/0x60
> >>>>>> [    0.948181]  handle_edge_irq+0x77/0x190
> >>>>>> [    0.948185]  handle_irq+0xaf/0x120
> >>>>>> [    0.948188]  do_IRQ+0x53/0x110
> >>>>>> [    0.948191]  common_interrupt+0x87/0x87
> >>>>>> [    0.948192]  </IRQ>
> >>>>>> ....
> >>>>>> [    0.311136] CPU: 4 PID: 9737 Comm: run_linux_amd64 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
> >>>>>> [    0.311137] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
> >>>>>> [    0.311138] task: ffff883f6e6a8000 task.stack: ffffc9000f1ec000
> >>>>>> [    0.311141] RIP: 0010:native_queued_spin_lock_slowpath+0xf5/0x1a0
> >>>>>> [    0.311142] RSP: 0018:ffffc9000f1efa28 EFLAGS: 00000046
> >>>>>> [    0.311144] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7f722a00
> >>>>>> [    0.311145] RDX: 0000000000000035 RSI: 0000000000d80001 RDI: ffff887f7709ca68
> >>>>>> [    0.311146] RBP: 0000000000000202 R08: 0000000000140000 R09: 0000000000000000
> >>>>>> [    0.311147] R10: ffffc9000f1ef9d8 R11: 000000001a249fa0 R12: ffff887f7709ca68
> >>>>>> [    0.311148] R13: ffffc9000f1efad0 R14: 0000000000000000 R15: ffff887f7709ca00
> >>>>>> [    0.311149] FS:  000000c423f30090(0000) GS:ffff883f7f700000(0000) knlGS:0000000000000000
> >>>>>> [    0.311150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>>>>> [    0.311151] CR2: 00007feefcea4000 CR3: 0000007f7016e001 CR4: 00000000003606e0
> >>>>>> [    0.311152] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >>>>>> [    0.311153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> >>>>>> [    0.311154] Call Trace:
> >>>>>> [    0.311157]  do_raw_spin_lock+0xad/0xc0
> >>>>>> [    0.311160]  _raw_spin_lock_irqsave+0x44/0x4b
> >>>>>> [    0.311162]  ? prepare_to_wait_exclusive+0x28/0xb0
> >>>>>> [    0.311164]  prepare_to_wait_exclusive+0x28/0xb0
> >>>>>> [    0.311167]  wbt_wait+0x127/0x330
> >>>>>> [    0.311169]  ? finish_wait+0x80/0x80
> >>>>>> [    0.311172]  ? generic_make_request+0xda/0x3b0
> >>>>>> [    0.311174]  blk_mq_make_request+0xd6/0x7b0
> >>>>>> [    0.311176]  ? blk_queue_enter+0x24/0x260
> >>>>>> [    0.311178]  ? generic_make_request+0xda/0x3b0
> >>>>>> [    0.311181]  generic_make_request+0x10c/0x3b0
> >>>>>> [    0.311183]  ? submit_bio+0x5c/0x110
> >>>>>> [    0.311185]  submit_bio+0x5c/0x110
> >>>>>> [    0.311197]  ? __ext4_journal_stop+0x36/0xa0 [ext4]
> >>>>>> [    0.311210]  ext4_io_submit+0x48/0x60 [ext4]
> >>>>>> [    0.311222]  ext4_writepages+0x810/0x11f0 [ext4]
> >>>>>> [    0.311229]  ? do_writepages+0x3c/0xd0
> >>>>>> [    0.311239]  ? ext4_mark_inode_dirty+0x260/0x260 [ext4]
> >>>>>> [    0.311240]  do_writepages+0x3c/0xd0
> >>>>>> [    0.311243]  ? _raw_spin_unlock+0x24/0x30
> >>>>>> [    0.311245]  ? wbc_attach_and_unlock_inode+0x165/0x280
> >>>>>> [    0.311248]  ? __filemap_fdatawrite_range+0xa3/0xe0
> >>>>>> [    0.311250]  __filemap_fdatawrite_range+0xa3/0xe0
> >>>>>> [    0.311253]  file_write_and_wait_range+0x34/0x90
> >>>>>> [    0.311264]  ext4_sync_file+0x151/0x500 [ext4]
> >>>>>> [    0.311267]  do_fsync+0x38/0x60
> >>>>>> [    0.311270]  SyS_fsync+0xc/0x10
> >>>>>> [    0.311272]  do_syscall_64+0x6f/0x170
> >>>>>> [    0.311274]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
> >>>>>>
> >>>>>> In the original patch, wbt_done is waking up all the exclusive
> >>>>>> processes in the wait queue, which can cause a thundering herd
> >>>>>> if there is a large number of writer threads in the queue. The
> >>>>>> original intention of the code seems to be to wake up one thread
> >>>>>> only however, it uses wake_up_all() in __wbt_done(), and then
> >>>>>> uses the following check in __wbt_wait to have only one thread
> >>>>>> actually get out of the wait loop:
> >>>>>>
> >>>>>> if (waitqueue_active(&rqw->wait) &&
> >>>>>>             rqw->wait.head.next != &wait->entry)
> >>>>>>                 return false;
> >>>>>>
> >>>>>> The problem with this is that the wait entry in wbt_wait is
> >>>>>> define with DEFINE_WAIT, which uses the autoremove wakeup function.
> >>>>>> That means that the above check is invalid - the wait entry will
> >>>>>> have been removed from the queue already by the time we hit the
> >>>>>> check in the loop.
> >>>>>>
> >>>>>> Secondly, auto-removing the wait entries also means that the wait
> >>>>>> queue essentially gets reordered "randomly" (e.g. threads re-add
> >>>>>> themselves in the order they got to run after being woken up).
> >>>>>> Additionally, new requests entering wbt_wait might overtake requests
> >>>>>> that were queued earlier, because the wait queue will be
> >>>>>> (temporarily) empty after the wake_up_all, so the waitqueue_active
> >>>>>> check will not stop them. This can cause certain threads to starve
> >>>>>> under high load.
> >>>>>>
> >>>>>> The fix is to leave the woken up requests in the queue and remove
> >>>>>> them in finish_wait() once the current thread breaks out of the
> >>>>>> wait loop in __wbt_wait. This will ensure new requests always
> >>>>>> end up at the back of the queue, and they won't overtake requests
> >>>>>> that are already in the wait queue. With that change, the loop
> >>>>>> in wbt_wait is also in line with many other wait loops in the kernel.
> >>>>>> Waking up just one thread drastically reduces lock contention, as
> >>>>>> does moving the wait queue add/remove out of the loop.
> >>>>>>
> >>>>>> A significant drop in lockdep's lock contention numbers is seen when
> >>>>>> running the test application on the patched kernel.
> >>>>>
> >>>>> I like the patch, and a few weeks ago we independently discovered that
> >>>>> the waitqueue list checking was bogus as well. My only worry is that
> >>>>> changes like this can be delicate, meaning that it's easy to introduce
> >>>>> stall conditions. What kind of testing did you push this through?
> >>>>>
> >>>>> -- 
> >>>>> Jens Axboe
> >>>>>
> >>>> I ran the following tests on both real HW with NVME devices attached
> >>>> and emulated NVME too:
> >>>>
> >>>> 1. The test case I used to reproduce the issue, spawns a bunch of threads 
> >>>>    to concurrently read and write files with random size and content. 
> >>>>    Files are randomly fsync'd. The implementation is a FIFO queue of files. 
> >>>>    When the queue fills the test starts to verify and remove the files. This 
> >>>>    test will fail if there's a read, write, or hash check failure. It tests
> >>>>    for file corruption when lots of small files are being read and written 
> >>>>    with high concurrency.
> >>>>
> >>>> 2. Fio for random writes with a root NVME device of 200GB
> >>>>   
> >>>>   fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite --bs=4k 
> >>>>   --direct=0 --size=10G --numjobs=2 --runtime=60 --group_reporting
> >>>>   
> >>>>   fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite --bs=4k
> >>>>   --direct=0 --size=5G --numjobs=2 --runtime=30 --fsync=64 --group_reporting
> >>>>   
> >>>>   I did see an improvement in the bandwidth numbers reported on the patched
> >>>>   kernel. 
> >>>>
> >>>> Do you have any test case/suite in mind that you would suggest me to 
> >>>> run to be sure that patch does not introduce any stall conditions?
> >>>
> >>> One thing that is always useful is to run xfstest, do a full run on
> >>> the device. If that works, then do another full run, this time limiting
> >>> the queue depth of the SCSI device to 1. If both of those pass, then
> >>> I'd feel pretty good getting this applied for 4.19.
> >>
> >> Did you get a chance to run this full test?
> >>
> >> -- 
> >> Jens Axboe
> >>
> >>
> > 
> > Hi Jens,
> > Yes I did run the tests and was in the process of compiling concrete results
> > I tested following environments against xfs/auto group
> > 1. Vanilla 4.18.rc kernel
> > 2. 4.18 kernel with the blk-wbt patch
> > 3. 4.18 kernel with the blk-wbt patch + io_queue_depth=2. I 
> > understand you asked for queue depth for SCSI device=1 however, I have NVME 
> > devices in my environment and 2 is the minimum value for io_queue_depth allowed 
> > according to the NVME driver code. The results pretty much look same with no 
> > stalls or exceptional failures. 
> > xfs/auto ran 296 odd tests with 3 failures and 130 something "no runs". 
> > Remaining tests passed. "Skipped tests"  were mostly due to missing features
> > (eg: reflink support on scratch filesystem)
> > The failures were consistent across runs on 3 different environments. 
> > I am also running full test suite but it is taking long time as I am 
> > hitting kernel BUG in xfs code in some generic tests. This BUG is not 
> > related to the patch and  I see them in vanilla kernel too. I am in 
> > the process of excluding these kind of tests as they come and 
> > re-run the suite however, this proces is time taking. 
> > Do you have any specific tests in mind that you would like me 
> > to run apart from what I have already tested above?
> 
> Thanks, I think that looks good. I'll get your patch applied for
> 4.19.
> 
> -- 
> Jens Axboe
> 
>

Hi Jens,
Thanks for accepting this. There is one small issue: I don't see any emails
sent by me on the lkml mailing list. I am not sure why they didn't land there;
all I can see is your responses. Do you want one of us to resend the patch,
or will you be able to do it?

Thanks,
Anchal Agarwal 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait
  2018-08-07 21:12             ` Anchal Agarwal
@ 2018-08-07 21:19               ` Jens Axboe
  2018-08-07 22:06                 ` Anchal Agarwal
  2018-08-20 16:36                 ` Jens Axboe
  2018-08-07 21:28               ` Matt Wilson
  1 sibling, 2 replies; 38+ messages in thread
From: Jens Axboe @ 2018-08-07 21:19 UTC (permalink / raw)
  To: Anchal Agarwal; +Cc: linux-block, linux-kernel, fllinden, sblbir, msw

On 8/7/18 3:12 PM, Anchal Agarwal wrote:
> On Tue, Aug 07, 2018 at 02:39:48PM -0600, Jens Axboe wrote:
>> On 8/7/18 2:12 PM, Anchal Agarwal wrote:
>>> On Tue, Aug 07, 2018 at 08:29:44AM -0600, Jens Axboe wrote:
>>>> On 8/1/18 4:09 PM, Jens Axboe wrote:
>>>>> On 8/1/18 11:06 AM, Anchal Agarwal wrote:
>>>>>> On Wed, Aug 01, 2018 at 09:14:50AM -0600, Jens Axboe wrote:
>>>>>>> On 7/31/18 3:34 PM, Anchal Agarwal wrote:
>>>>>>>> Hi folks,
>>>>>>>>
>>>>>>>> This patch modifies commit e34cbd307477a
>>>>>>>> (blk-wbt: add general throttling mechanism)
>>>>>>>>
>>>>>>>> I am currently running a large bare metal instance (i3.metal)
>>>>>>>> on EC2 with 72 cores, 512GB of RAM and NVME drives, with a
>>>>>>>> 4.18 kernel. I have a workload that simulates a database
>>>>>>>> workload and I am running into lockup issues when writeback
>>>>>>>> throttling is enabled,with the hung task detector also
>>>>>>>> kicking in.
>>>>>>>>
>>>>>>>> Crash dumps show that most CPUs (up to 50 of them) are
>>>>>>>> all trying to get the wbt wait queue lock while trying to add
>>>>>>>> themselves to it in __wbt_wait (see stack traces below).
>>>>>>>>
>>>>>>>> [    0.948118] CPU: 45 PID: 0 Comm: swapper/45 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
>>>>>>>> [    0.948119] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
>>>>>>>> [    0.948120] task: ffff883f7878c000 task.stack: ffffc9000c69c000
>>>>>>>> [    0.948124] RIP: 0010:native_queued_spin_lock_slowpath+0xf8/0x1a0
>>>>>>>> [    0.948125] RSP: 0018:ffff883f7fcc3dc8 EFLAGS: 00000046
>>>>>>>> [    0.948126] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7fce2a00
>>>>>>>> [    0.948128] RDX: 000000000000001c RSI: 0000000000740001 RDI: ffff887f7709ca68
>>>>>>>> [    0.948129] RBP: 0000000000000002 R08: 0000000000b80000 R09: 0000000000000000
>>>>>>>> [    0.948130] R10: ffff883f7fcc3d78 R11: 000000000de27121 R12: 0000000000000002
>>>>>>>> [    0.948131] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
>>>>>>>> [    0.948132] FS:  0000000000000000(0000) GS:ffff883f7fcc0000(0000) knlGS:0000000000000000
>>>>>>>> [    0.948134] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>>>> [    0.948135] CR2: 000000c424c77000 CR3: 0000000002010005 CR4: 00000000003606e0
>>>>>>>> [    0.948136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>>>>> [    0.948137] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>>>>>> [    0.948138] Call Trace:
>>>>>>>> [    0.948139]  <IRQ>
>>>>>>>> [    0.948142]  do_raw_spin_lock+0xad/0xc0
>>>>>>>> [    0.948145]  _raw_spin_lock_irqsave+0x44/0x4b
>>>>>>>> [    0.948149]  ? __wake_up_common_lock+0x53/0x90
>>>>>>>> [    0.948150]  __wake_up_common_lock+0x53/0x90
>>>>>>>> [    0.948155]  wbt_done+0x7b/0xa0
>>>>>>>> [    0.948158]  blk_mq_free_request+0xb7/0x110
>>>>>>>> [    0.948161]  __blk_mq_complete_request+0xcb/0x140
>>>>>>>> [    0.948166]  nvme_process_cq+0xce/0x1a0 [nvme]
>>>>>>>> [    0.948169]  nvme_irq+0x23/0x50 [nvme]
>>>>>>>> [    0.948173]  __handle_irq_event_percpu+0x46/0x300
>>>>>>>> [    0.948176]  handle_irq_event_percpu+0x20/0x50
>>>>>>>> [    0.948179]  handle_irq_event+0x34/0x60
>>>>>>>> [    0.948181]  handle_edge_irq+0x77/0x190
>>>>>>>> [    0.948185]  handle_irq+0xaf/0x120
>>>>>>>> [    0.948188]  do_IRQ+0x53/0x110
>>>>>>>> [    0.948191]  common_interrupt+0x87/0x87
>>>>>>>> [    0.948192]  </IRQ>
>>>>>>>> ....
>>>>>>>> [    0.311136] CPU: 4 PID: 9737 Comm: run_linux_amd64 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
>>>>>>>> [    0.311137] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
>>>>>>>> [    0.311138] task: ffff883f6e6a8000 task.stack: ffffc9000f1ec000
>>>>>>>> [    0.311141] RIP: 0010:native_queued_spin_lock_slowpath+0xf5/0x1a0
>>>>>>>> [    0.311142] RSP: 0018:ffffc9000f1efa28 EFLAGS: 00000046
>>>>>>>> [    0.311144] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7f722a00
>>>>>>>> [    0.311145] RDX: 0000000000000035 RSI: 0000000000d80001 RDI: ffff887f7709ca68
>>>>>>>> [    0.311146] RBP: 0000000000000202 R08: 0000000000140000 R09: 0000000000000000
>>>>>>>> [    0.311147] R10: ffffc9000f1ef9d8 R11: 000000001a249fa0 R12: ffff887f7709ca68
>>>>>>>> [    0.311148] R13: ffffc9000f1efad0 R14: 0000000000000000 R15: ffff887f7709ca00
>>>>>>>> [    0.311149] FS:  000000c423f30090(0000) GS:ffff883f7f700000(0000) knlGS:0000000000000000
>>>>>>>> [    0.311150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>>>> [    0.311151] CR2: 00007feefcea4000 CR3: 0000007f7016e001 CR4: 00000000003606e0
>>>>>>>> [    0.311152] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>>>>> [    0.311153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>>>>>> [    0.311154] Call Trace:
>>>>>>>> [    0.311157]  do_raw_spin_lock+0xad/0xc0
>>>>>>>> [    0.311160]  _raw_spin_lock_irqsave+0x44/0x4b
>>>>>>>> [    0.311162]  ? prepare_to_wait_exclusive+0x28/0xb0
>>>>>>>> [    0.311164]  prepare_to_wait_exclusive+0x28/0xb0
>>>>>>>> [    0.311167]  wbt_wait+0x127/0x330
>>>>>>>> [    0.311169]  ? finish_wait+0x80/0x80
>>>>>>>> [    0.311172]  ? generic_make_request+0xda/0x3b0
>>>>>>>> [    0.311174]  blk_mq_make_request+0xd6/0x7b0
>>>>>>>> [    0.311176]  ? blk_queue_enter+0x24/0x260
>>>>>>>> [    0.311178]  ? generic_make_request+0xda/0x3b0
>>>>>>>> [    0.311181]  generic_make_request+0x10c/0x3b0
>>>>>>>> [    0.311183]  ? submit_bio+0x5c/0x110
>>>>>>>> [    0.311185]  submit_bio+0x5c/0x110
>>>>>>>> [    0.311197]  ? __ext4_journal_stop+0x36/0xa0 [ext4]
>>>>>>>> [    0.311210]  ext4_io_submit+0x48/0x60 [ext4]
>>>>>>>> [    0.311222]  ext4_writepages+0x810/0x11f0 [ext4]
>>>>>>>> [    0.311229]  ? do_writepages+0x3c/0xd0
>>>>>>>> [    0.311239]  ? ext4_mark_inode_dirty+0x260/0x260 [ext4]
>>>>>>>> [    0.311240]  do_writepages+0x3c/0xd0
>>>>>>>> [    0.311243]  ? _raw_spin_unlock+0x24/0x30
>>>>>>>> [    0.311245]  ? wbc_attach_and_unlock_inode+0x165/0x280
>>>>>>>> [    0.311248]  ? __filemap_fdatawrite_range+0xa3/0xe0
>>>>>>>> [    0.311250]  __filemap_fdatawrite_range+0xa3/0xe0
>>>>>>>> [    0.311253]  file_write_and_wait_range+0x34/0x90
>>>>>>>> [    0.311264]  ext4_sync_file+0x151/0x500 [ext4]
>>>>>>>> [    0.311267]  do_fsync+0x38/0x60
>>>>>>>> [    0.311270]  SyS_fsync+0xc/0x10
>>>>>>>> [    0.311272]  do_syscall_64+0x6f/0x170
>>>>>>>> [    0.311274]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
>>>>>>>>
>>>>>>>> In the original patch, wbt_done is waking up all the exclusive
>>>>>>>> processes in the wait queue, which can cause a thundering herd
>>>>>>>> if there is a large number of writer threads in the queue. The
>>>>>>>> original intention of the code seems to be to wake up one thread
>>>>>>>> only however, it uses wake_up_all() in __wbt_done(), and then
>>>>>>>> uses the following check in __wbt_wait to have only one thread
>>>>>>>> actually get out of the wait loop:
>>>>>>>>
>>>>>>>> if (waitqueue_active(&rqw->wait) &&
>>>>>>>>             rqw->wait.head.next != &wait->entry)
>>>>>>>>                 return false;
>>>>>>>>
>>>>>>>> The problem with this is that the wait entry in wbt_wait is
>>>>>>>> define with DEFINE_WAIT, which uses the autoremove wakeup function.
>>>>>>>> That means that the above check is invalid - the wait entry will
>>>>>>>> have been removed from the queue already by the time we hit the
>>>>>>>> check in the loop.
>>>>>>>>
>>>>>>>> Secondly, auto-removing the wait entries also means that the wait
>>>>>>>> queue essentially gets reordered "randomly" (e.g. threads re-add
>>>>>>>> themselves in the order they got to run after being woken up).
>>>>>>>> Additionally, new requests entering wbt_wait might overtake requests
>>>>>>>> that were queued earlier, because the wait queue will be
>>>>>>>> (temporarily) empty after the wake_up_all, so the waitqueue_active
>>>>>>>> check will not stop them. This can cause certain threads to starve
>>>>>>>> under high load.
>>>>>>>>
>>>>>>>> The fix is to leave the woken up requests in the queue and remove
>>>>>>>> them in finish_wait() once the current thread breaks out of the
>>>>>>>> wait loop in __wbt_wait. This will ensure new requests always
>>>>>>>> end up at the back of the queue, and they won't overtake requests
>>>>>>>> that are already in the wait queue. With that change, the loop
>>>>>>>> in wbt_wait is also in line with many other wait loops in the kernel.
>>>>>>>> Waking up just one thread drastically reduces lock contention, as
>>>>>>>> does moving the wait queue add/remove out of the loop.
>>>>>>>>
>>>>>>>> A significant drop in lockdep's lock contention numbers is seen when
>>>>>>>> running the test application on the patched kernel.
>>>>>>>
>>>>>>> I like the patch, and a few weeks ago we independently discovered that
>>>>>>> the waitqueue list checking was bogus as well. My only worry is that
>>>>>>> changes like this can be delicate, meaning that it's easy to introduce
>>>>>>> stall conditions. What kind of testing did you push this through?
>>>>>>>
>>>>>>> -- 
>>>>>>> Jens Axboe
>>>>>>>
>>>>>> I ran the following tests on both real HW with NVME devices attached
>>>>>> and emulated NVME too:
>>>>>>
>>>>>> 1. The test case I used to reproduce the issue, spawns a bunch of threads 
>>>>>>    to concurrently read and write files with random size and content. 
>>>>>>    Files are randomly fsync'd. The implementation is a FIFO queue of files. 
>>>>>>    When the queue fills the test starts to verify and remove the files. This 
>>>>>>    test will fail if there's a read, write, or hash check failure. It tests
>>>>>>    for file corruption when lots of small files are being read and written 
>>>>>>    with high concurrency.
>>>>>>
>>>>>> 2. Fio for random writes with a root NVME device of 200GB
>>>>>>   
>>>>>>   fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite --bs=4k 
>>>>>>   --direct=0 --size=10G --numjobs=2 --runtime=60 --group_reporting
>>>>>>   
>>>>>>   fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite --bs=4k
>>>>>>   --direct=0 --size=5G --numjobs=2 --runtime=30 --fsync=64 --group_reporting
>>>>>>   
>>>>>>   I did see an improvement in the bandwidth numbers reported on the patched
>>>>>>   kernel. 
>>>>>>
>>>>>> Do you have any test case/suite in mind that you would suggest me to 
>>>>>> run to be sure that patch does not introduce any stall conditions?
>>>>>
>>>>> One thing that is always useful is to run xfstest, do a full run on
>>>>> the device. If that works, then do another full run, this time limiting
>>>>> the queue depth of the SCSI device to 1. If both of those pass, then
>>>>> I'd feel pretty good getting this applied for 4.19.
>>>>
>>>> Did you get a chance to run this full test?
>>>>
>>>> -- 
>>>> Jens Axboe
>>>>
>>>>
>>>
>>> Hi Jens,
>>> Yes I did run the tests and was in the process of compiling concrete results
>>> I tested following environments against xfs/auto group
>>> 1. Vanilla 4.18.rc kernel
>>> 2. 4.18 kernel with the blk-wbt patch
>>> 3. 4.18 kernel with the blk-wbt patch + io_queue_depth=2. I 
>>> understand you asked for queue depth for SCSI device=1 however, I have NVME 
>>> devices in my environment and 2 is the minimum value for io_queue_depth allowed 
>>> according to the NVME driver code. The results pretty much look same with no 
>>> stalls or exceptional failures. 
>>> xfs/auto ran 296 odd tests with 3 failures and 130 something "no runs". 
>>> Remaining tests passed. "Skipped tests"  were mostly due to missing features
>>> (eg: reflink support on scratch filesystem)
>>> The failures were consistent across runs on 3 different environments. 
>>> I am also running full test suite but it is taking long time as I am 
>>> hitting kernel BUG in xfs code in some generic tests. This BUG is not 
>>> related to the patch and  I see them in vanilla kernel too. I am in 
>>> the process of excluding these kind of tests as they come and 
>>> re-run the suite however, this proces is time taking. 
>>> Do you have any specific tests in mind that you would like me 
>>> to run apart from what I have already tested above?
>>
>> Thanks, I think that looks good. I'll get your patch applied for
>> 4.19.
>>
>> -- 
>> Jens Axboe
>>
>>
> 
> Hi Jens,
> Thanks for accepting this. There is one small issue, I don't find any emails
> send by me on the lkml mailing list. I am not sure why it didn't land there,
> all I can see is your responses. Do you want one of us to resend the patch
> or will you be able to do it?

That's odd, are you getting rejections on your emails? For reference, the
patch is here:

http://git.kernel.dk/cgit/linux-block/commit/?h=for-4.19/block&id=2887e41b910bb14fd847cf01ab7a5993db989d88

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait
  2018-08-07 21:12             ` Anchal Agarwal
  2018-08-07 21:19               ` Jens Axboe
@ 2018-08-07 21:28               ` Matt Wilson
  1 sibling, 0 replies; 38+ messages in thread
From: Matt Wilson @ 2018-08-07 21:28 UTC (permalink / raw)
  To: Anchal Agarwal
  Cc: Jens Axboe, linux-block, linux-kernel, fllinden, sblbir, msw

On Tue, Aug 07, 2018 at 09:12:16PM +0000, Anchal Agarwal wrote:
> Hi Jens,
> Thanks for accepting this. There is one small issue, I don't find any emails
> send by me on the lkml mailing list. I am not sure why it didn't land there,
> all I can see is your responses. Do you want one of us to resend the patch
> or will you be able to do it?

Hi Anchal,

Usually this is due to how DMARC is set up for amazon.com. When mailing
lists relay your messages without taking ownership of the envelope, the
DKIM signatures are invalidated, and the DMARC policy can cause the email
to be flagged as forged / spam.

If you configure your mailer and git to use amzn.com instead of
amazon.com, it may help, as the DMARC policy for amzn.com is "none"
[1] rather than "quarantine" for amazon.com [2].

--msw

[1] https://mxtoolbox.com/SuperTool.aspx?action=dmarc%3aamzn.com&run=toolpage
[2] https://mxtoolbox.com/SuperTool.aspx?action=dmarc%3aamazon.com&run=toolpage
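Acting on that suggestion might look roughly like this (the address and
patch file name are placeholders):

    git config user.email user@amzn.com
    git config sendemail.from "User Name <user@amzn.com>"
    git send-email --to=linux-block@vger.kernel.org 0001-blk-wbt-fix.patch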

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait
  2018-08-07 21:19               ` Jens Axboe
@ 2018-08-07 22:06                 ` Anchal Agarwal
  2018-08-20 16:36                 ` Jens Axboe
  1 sibling, 0 replies; 38+ messages in thread
From: Anchal Agarwal @ 2018-08-07 22:06 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, linux-kernel, fllinden, sblbir, msw

On Tue, Aug 07, 2018 at 03:19:57PM -0600, Jens Axboe wrote:
> On 8/7/18 3:12 PM, Anchal Agarwal wrote:
> > On Tue, Aug 07, 2018 at 02:39:48PM -0600, Jens Axboe wrote:
> >> On 8/7/18 2:12 PM, Anchal Agarwal wrote:
> >>> On Tue, Aug 07, 2018 at 08:29:44AM -0600, Jens Axboe wrote:
> >>>> On 8/1/18 4:09 PM, Jens Axboe wrote:
> >>>>> On 8/1/18 11:06 AM, Anchal Agarwal wrote:
> >>>>>> On Wed, Aug 01, 2018 at 09:14:50AM -0600, Jens Axboe wrote:
> >>>>>>> On 7/31/18 3:34 PM, Anchal Agarwal wrote:
> >>>>>>>> Hi folks,
> >>>>>>>>
> >>>>>>>> This patch modifies commit e34cbd307477a
> >>>>>>>> (blk-wbt: add general throttling mechanism)
> >>>>>>>>
> >>>>>>>> I am currently running a large bare metal instance (i3.metal)
> >>>>>>>> on EC2 with 72 cores, 512GB of RAM and NVME drives, with a
> >>>>>>>> 4.18 kernel. I have a workload that simulates a database
> >>>>>>>> workload and I am running into lockup issues when writeback
> >>>>>>>> throttling is enabled,with the hung task detector also
> >>>>>>>> kicking in.
> >>>>>>>>
> >>>>>>>> Crash dumps show that most CPUs (up to 50 of them) are
> >>>>>>>> all trying to get the wbt wait queue lock while trying to add
> >>>>>>>> themselves to it in __wbt_wait (see stack traces below).
> >>>>>>>>
> >>>>>>>> [    0.948118] CPU: 45 PID: 0 Comm: swapper/45 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
> >>>>>>>> [    0.948119] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
> >>>>>>>> [    0.948120] task: ffff883f7878c000 task.stack: ffffc9000c69c000
> >>>>>>>> [    0.948124] RIP: 0010:native_queued_spin_lock_slowpath+0xf8/0x1a0
> >>>>>>>> [    0.948125] RSP: 0018:ffff883f7fcc3dc8 EFLAGS: 00000046
> >>>>>>>> [    0.948126] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7fce2a00
> >>>>>>>> [    0.948128] RDX: 000000000000001c RSI: 0000000000740001 RDI: ffff887f7709ca68
> >>>>>>>> [    0.948129] RBP: 0000000000000002 R08: 0000000000b80000 R09: 0000000000000000
> >>>>>>>> [    0.948130] R10: ffff883f7fcc3d78 R11: 000000000de27121 R12: 0000000000000002
> >>>>>>>> [    0.948131] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
> >>>>>>>> [    0.948132] FS:  0000000000000000(0000) GS:ffff883f7fcc0000(0000) knlGS:0000000000000000
> >>>>>>>> [    0.948134] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>>>>>>> [    0.948135] CR2: 000000c424c77000 CR3: 0000000002010005 CR4: 00000000003606e0
> >>>>>>>> [    0.948136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >>>>>>>> [    0.948137] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> >>>>>>>> [    0.948138] Call Trace:
> >>>>>>>> [    0.948139]  <IRQ>
> >>>>>>>> [    0.948142]  do_raw_spin_lock+0xad/0xc0
> >>>>>>>> [    0.948145]  _raw_spin_lock_irqsave+0x44/0x4b
> >>>>>>>> [    0.948149]  ? __wake_up_common_lock+0x53/0x90
> >>>>>>>> [    0.948150]  __wake_up_common_lock+0x53/0x90
> >>>>>>>> [    0.948155]  wbt_done+0x7b/0xa0
> >>>>>>>> [    0.948158]  blk_mq_free_request+0xb7/0x110
> >>>>>>>> [    0.948161]  __blk_mq_complete_request+0xcb/0x140
> >>>>>>>> [    0.948166]  nvme_process_cq+0xce/0x1a0 [nvme]
> >>>>>>>> [    0.948169]  nvme_irq+0x23/0x50 [nvme]
> >>>>>>>> [    0.948173]  __handle_irq_event_percpu+0x46/0x300
> >>>>>>>> [    0.948176]  handle_irq_event_percpu+0x20/0x50
> >>>>>>>> [    0.948179]  handle_irq_event+0x34/0x60
> >>>>>>>> [    0.948181]  handle_edge_irq+0x77/0x190
> >>>>>>>> [    0.948185]  handle_irq+0xaf/0x120
> >>>>>>>> [    0.948188]  do_IRQ+0x53/0x110
> >>>>>>>> [    0.948191]  common_interrupt+0x87/0x87
> >>>>>>>> [    0.948192]  </IRQ>
> >>>>>>>> ....
> >>>>>>>> [    0.311136] CPU: 4 PID: 9737 Comm: run_linux_amd64 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
> >>>>>>>> [    0.311137] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
> >>>>>>>> [    0.311138] task: ffff883f6e6a8000 task.stack: ffffc9000f1ec000
> >>>>>>>> [    0.311141] RIP: 0010:native_queued_spin_lock_slowpath+0xf5/0x1a0
> >>>>>>>> [    0.311142] RSP: 0018:ffffc9000f1efa28 EFLAGS: 00000046
> >>>>>>>> [    0.311144] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7f722a00
> >>>>>>>> [    0.311145] RDX: 0000000000000035 RSI: 0000000000d80001 RDI: ffff887f7709ca68
> >>>>>>>> [    0.311146] RBP: 0000000000000202 R08: 0000000000140000 R09: 0000000000000000
> >>>>>>>> [    0.311147] R10: ffffc9000f1ef9d8 R11: 000000001a249fa0 R12: ffff887f7709ca68
> >>>>>>>> [    0.311148] R13: ffffc9000f1efad0 R14: 0000000000000000 R15: ffff887f7709ca00
> >>>>>>>> [    0.311149] FS:  000000c423f30090(0000) GS:ffff883f7f700000(0000) knlGS:0000000000000000
> >>>>>>>> [    0.311150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>>>>>>> [    0.311151] CR2: 00007feefcea4000 CR3: 0000007f7016e001 CR4: 00000000003606e0
> >>>>>>>> [    0.311152] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >>>>>>>> [    0.311153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> >>>>>>>> [    0.311154] Call Trace:
> >>>>>>>> [    0.311157]  do_raw_spin_lock+0xad/0xc0
> >>>>>>>> [    0.311160]  _raw_spin_lock_irqsave+0x44/0x4b
> >>>>>>>> [    0.311162]  ? prepare_to_wait_exclusive+0x28/0xb0
> >>>>>>>> [    0.311164]  prepare_to_wait_exclusive+0x28/0xb0
> >>>>>>>> [    0.311167]  wbt_wait+0x127/0x330
> >>>>>>>> [    0.311169]  ? finish_wait+0x80/0x80
> >>>>>>>> [    0.311172]  ? generic_make_request+0xda/0x3b0
> >>>>>>>> [    0.311174]  blk_mq_make_request+0xd6/0x7b0
> >>>>>>>> [    0.311176]  ? blk_queue_enter+0x24/0x260
> >>>>>>>> [    0.311178]  ? generic_make_request+0xda/0x3b0
> >>>>>>>> [    0.311181]  generic_make_request+0x10c/0x3b0
> >>>>>>>> [    0.311183]  ? submit_bio+0x5c/0x110
> >>>>>>>> [    0.311185]  submit_bio+0x5c/0x110
> >>>>>>>> [    0.311197]  ? __ext4_journal_stop+0x36/0xa0 [ext4]
> >>>>>>>> [    0.311210]  ext4_io_submit+0x48/0x60 [ext4]
> >>>>>>>> [    0.311222]  ext4_writepages+0x810/0x11f0 [ext4]
> >>>>>>>> [    0.311229]  ? do_writepages+0x3c/0xd0
> >>>>>>>> [    0.311239]  ? ext4_mark_inode_dirty+0x260/0x260 [ext4]
> >>>>>>>> [    0.311240]  do_writepages+0x3c/0xd0
> >>>>>>>> [    0.311243]  ? _raw_spin_unlock+0x24/0x30
> >>>>>>>> [    0.311245]  ? wbc_attach_and_unlock_inode+0x165/0x280
> >>>>>>>> [    0.311248]  ? __filemap_fdatawrite_range+0xa3/0xe0
> >>>>>>>> [    0.311250]  __filemap_fdatawrite_range+0xa3/0xe0
> >>>>>>>> [    0.311253]  file_write_and_wait_range+0x34/0x90
> >>>>>>>> [    0.311264]  ext4_sync_file+0x151/0x500 [ext4]
> >>>>>>>> [    0.311267]  do_fsync+0x38/0x60
> >>>>>>>> [    0.311270]  SyS_fsync+0xc/0x10
> >>>>>>>> [    0.311272]  do_syscall_64+0x6f/0x170
> >>>>>>>> [    0.311274]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
> >>>>>>>>
> >>>>>>>> In the original patch, wbt_done is waking up all the exclusive
> >>>>>>>> processes in the wait queue, which can cause a thundering herd
> >>>>>>>> if there is a large number of writer threads in the queue. The
> >>>>>>>> original intention of the code seems to be to wake up one thread
> >>>>>>>> only however, it uses wake_up_all() in __wbt_done(), and then
> >>>>>>>> uses the following check in __wbt_wait to have only one thread
> >>>>>>>> actually get out of the wait loop:
> >>>>>>>>
> >>>>>>>> if (waitqueue_active(&rqw->wait) &&
> >>>>>>>>             rqw->wait.head.next != &wait->entry)
> >>>>>>>>                 return false;
> >>>>>>>>
> >>>>>>>> The problem with this is that the wait entry in wbt_wait is
> >>>>>>>> define with DEFINE_WAIT, which uses the autoremove wakeup function.
> >>>>>>>> That means that the above check is invalid - the wait entry will
> >>>>>>>> have been removed from the queue already by the time we hit the
> >>>>>>>> check in the loop.
> >>>>>>>>
> >>>>>>>> Secondly, auto-removing the wait entries also means that the wait
> >>>>>>>> queue essentially gets reordered "randomly" (e.g. threads re-add
> >>>>>>>> themselves in the order they got to run after being woken up).
> >>>>>>>> Additionally, new requests entering wbt_wait might overtake requests
> >>>>>>>> that were queued earlier, because the wait queue will be
> >>>>>>>> (temporarily) empty after the wake_up_all, so the waitqueue_active
> >>>>>>>> check will not stop them. This can cause certain threads to starve
> >>>>>>>> under high load.
> >>>>>>>>
> >>>>>>>> The fix is to leave the woken up requests in the queue and remove
> >>>>>>>> them in finish_wait() once the current thread breaks out of the
> >>>>>>>> wait loop in __wbt_wait. This will ensure new requests always
> >>>>>>>> end up at the back of the queue, and they won't overtake requests
> >>>>>>>> that are already in the wait queue. With that change, the loop
> >>>>>>>> in wbt_wait is also in line with many other wait loops in the kernel.
> >>>>>>>> Waking up just one thread drastically reduces lock contention, as
> >>>>>>>> does moving the wait queue add/remove out of the loop.
> >>>>>>>>
> >>>>>>>> A significant drop in lockdep's lock contention numbers is seen when
> >>>>>>>> running the test application on the patched kernel.
> >>>>>>>
> >>>>>>> I like the patch, and a few weeks ago we independently discovered that
> >>>>>>> the waitqueue list checking was bogus as well. My only worry is that
> >>>>>>> changes like this can be delicate, meaning that it's easy to introduce
> >>>>>>> stall conditions. What kind of testing did you push this through?
> >>>>>>>
> >>>>>>> -- 
> >>>>>>> Jens Axboe
> >>>>>>>
> >>>>>> I ran the following tests on both real HW with NVME devices attached
> >>>>>> and emulated NVME too:
> >>>>>>
> >>>>>> 1. The test case I used to reproduce the issue, spawns a bunch of threads 
> >>>>>>    to concurrently read and write files with random size and content. 
> >>>>>>    Files are randomly fsync'd. The implementation is a FIFO queue of files. 
> >>>>>>    When the queue fills the test starts to verify and remove the files. This 
> >>>>>>    test will fail if there's a read, write, or hash check failure. It tests
> >>>>>>    for file corruption when lots of small files are being read and written 
> >>>>>>    with high concurrency.
> >>>>>>
> >>>>>> 2. Fio for random writes with a root NVME device of 200GB
> >>>>>>   
> >>>>>>   fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite --bs=4k 
> >>>>>>   --direct=0 --size=10G --numjobs=2 --runtime=60 --group_reporting
> >>>>>>   
> >>>>>>   fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite --bs=4k
> >>>>>>   --direct=0 --size=5G --numjobs=2 --runtime=30 --fsync=64 --group_reporting
> >>>>>>   
> >>>>>>   I did see an improvement in the bandwidth numbers reported on the patched
> >>>>>>   kernel. 
> >>>>>>
> >>>>>> Do you have any test case/suite in mind that you would suggest me to 
> >>>>>> run to be sure that patch does not introduce any stall conditions?
> >>>>>
> >>>>> One thing that is always useful is to run xfstest, do a full run on
> >>>>> the device. If that works, then do another full run, this time limiting
> >>>>> the queue depth of the SCSI device to 1. If both of those pass, then
> >>>>> I'd feel pretty good getting this applied for 4.19.
> >>>>
> >>>> Did you get a chance to run this full test?
> >>>>
> >>>> -- 
> >>>> Jens Axboe
> >>>>
> >>>>
> >>>
> >>> Hi Jens,
> >>> Yes I did run the tests and was in the process of compiling concrete results
> >>> I tested following environments against xfs/auto group
> >>> 1. Vanilla 4.18.rc kernel
> >>> 2. 4.18 kernel with the blk-wbt patch
> >>> 3. 4.18 kernel with the blk-wbt patch + io_queue_depth=2. I 
> >>> understand you asked for queue depth for SCSI device=1 however, I have NVME 
> >>> devices in my environment and 2 is the minimum value for io_queue_depth allowed 
> >>> according to the NVME driver code. The results pretty much look same with no 
> >>> stalls or exceptional failures. 
> >>> xfs/auto ran 296 odd tests with 3 failures and 130 something "no runs". 
> >>> Remaining tests passed. "Skipped tests"  were mostly due to missing features
> >>> (eg: reflink support on scratch filesystem)
> >>> The failures were consistent across runs on 3 different environments. 
> >>> I am also running full test suite but it is taking long time as I am 
> >>> hitting kernel BUG in xfs code in some generic tests. This BUG is not 
> >>> related to the patch and  I see them in vanilla kernel too. I am in 
> >>> the process of excluding these kind of tests as they come and 
> >>> re-run the suite however, this proces is time taking. 
> >>> Do you have any specific tests in mind that you would like me 
> >>> to run apart from what I have already tested above?
> >>
> >> Thanks, I think that looks good. I'll get your patch applied for
> >> 4.19.
> >>
> >> -- 
> >> Jens Axboe
> >>
> >>
> > 
> > Hi Jens,
> > Thanks for accepting this. There is one small issue, I don't find any emails
> > send by me on the lkml mailing list. I am not sure why it didn't land there,
> > all I can see is your responses. Do you want one of us to resend the patch
> > or will you be able to do it?
> 
> That's odd, are you getting rejections on your emails? For reference, the
> patch is here:
> 
> http://git.kernel.dk/cgit/linux-block/commit/?h=for-4.19/block&id=2887e41b910bb14fd847cf01ab7a5993db989d88
> 
> -- 
> Jens Axboe
>

Hi Jens,
No, I didn't get any rejections on my emails. However, it's OK since you have it in your tree anyway;
there's no need to resend it.

Thanks,
Anchal Agarwal 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait
  2018-08-07 21:19               ` Jens Axboe
  2018-08-07 22:06                 ` Anchal Agarwal
@ 2018-08-20 16:36                 ` Jens Axboe
  2018-08-20 17:34                     ` van der Linden, Frank
  1 sibling, 1 reply; 38+ messages in thread
From: Jens Axboe @ 2018-08-20 16:36 UTC (permalink / raw)
  To: Anchal Agarwal; +Cc: linux-block, linux-kernel, fllinden, sblbir, msw

On 8/7/18 3:19 PM, Jens Axboe wrote:
> On 8/7/18 3:12 PM, Anchal Agarwal wrote:
>> On Tue, Aug 07, 2018 at 02:39:48PM -0600, Jens Axboe wrote:
>>> On 8/7/18 2:12 PM, Anchal Agarwal wrote:
>>>> On Tue, Aug 07, 2018 at 08:29:44AM -0600, Jens Axboe wrote:
>>>>> On 8/1/18 4:09 PM, Jens Axboe wrote:
>>>>>> On 8/1/18 11:06 AM, Anchal Agarwal wrote:
>>>>>>> On Wed, Aug 01, 2018 at 09:14:50AM -0600, Jens Axboe wrote:
>>>>>>>> On 7/31/18 3:34 PM, Anchal Agarwal wrote:
>>>>>>>>> Hi folks,
>>>>>>>>>
>>>>>>>>> This patch modifies commit e34cbd307477a
>>>>>>>>> (blk-wbt: add general throttling mechanism)
>>>>>>>>>
>>>>>>>>> I am currently running a large bare metal instance (i3.metal)
>>>>>>>>> on EC2 with 72 cores, 512GB of RAM and NVME drives, with a
>>>>>>>>> 4.18 kernel. I have a workload that simulates a database
>>>>>>>>> workload and I am running into lockup issues when writeback
>>>>>>>>> throttling is enabled,with the hung task detector also
>>>>>>>>> kicking in.
>>>>>>>>>
>>>>>>>>> Crash dumps show that most CPUs (up to 50 of them) are
>>>>>>>>> all trying to get the wbt wait queue lock while trying to add
>>>>>>>>> themselves to it in __wbt_wait (see stack traces below).
>>>>>>>>>
>>>>>>>>> [    0.948118] CPU: 45 PID: 0 Comm: swapper/45 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
>>>>>>>>> [    0.948119] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
>>>>>>>>> [    0.948120] task: ffff883f7878c000 task.stack: ffffc9000c69c000
>>>>>>>>> [    0.948124] RIP: 0010:native_queued_spin_lock_slowpath+0xf8/0x1a0
>>>>>>>>> [    0.948125] RSP: 0018:ffff883f7fcc3dc8 EFLAGS: 00000046
>>>>>>>>> [    0.948126] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7fce2a00
>>>>>>>>> [    0.948128] RDX: 000000000000001c RSI: 0000000000740001 RDI: ffff887f7709ca68
>>>>>>>>> [    0.948129] RBP: 0000000000000002 R08: 0000000000b80000 R09: 0000000000000000
>>>>>>>>> [    0.948130] R10: ffff883f7fcc3d78 R11: 000000000de27121 R12: 0000000000000002
>>>>>>>>> [    0.948131] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
>>>>>>>>> [    0.948132] FS:  0000000000000000(0000) GS:ffff883f7fcc0000(0000) knlGS:0000000000000000
>>>>>>>>> [    0.948134] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>>>>> [    0.948135] CR2: 000000c424c77000 CR3: 0000000002010005 CR4: 00000000003606e0
>>>>>>>>> [    0.948136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>>>>>> [    0.948137] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>>>>>>> [    0.948138] Call Trace:
>>>>>>>>> [    0.948139]  <IRQ>
>>>>>>>>> [    0.948142]  do_raw_spin_lock+0xad/0xc0
>>>>>>>>> [    0.948145]  _raw_spin_lock_irqsave+0x44/0x4b
>>>>>>>>> [    0.948149]  ? __wake_up_common_lock+0x53/0x90
>>>>>>>>> [    0.948150]  __wake_up_common_lock+0x53/0x90
>>>>>>>>> [    0.948155]  wbt_done+0x7b/0xa0
>>>>>>>>> [    0.948158]  blk_mq_free_request+0xb7/0x110
>>>>>>>>> [    0.948161]  __blk_mq_complete_request+0xcb/0x140
>>>>>>>>> [    0.948166]  nvme_process_cq+0xce/0x1a0 [nvme]
>>>>>>>>> [    0.948169]  nvme_irq+0x23/0x50 [nvme]
>>>>>>>>> [    0.948173]  __handle_irq_event_percpu+0x46/0x300
>>>>>>>>> [    0.948176]  handle_irq_event_percpu+0x20/0x50
>>>>>>>>> [    0.948179]  handle_irq_event+0x34/0x60
>>>>>>>>> [    0.948181]  handle_edge_irq+0x77/0x190
>>>>>>>>> [    0.948185]  handle_irq+0xaf/0x120
>>>>>>>>> [    0.948188]  do_IRQ+0x53/0x110
>>>>>>>>> [    0.948191]  common_interrupt+0x87/0x87
>>>>>>>>> [    0.948192]  </IRQ>
>>>>>>>>> ....
>>>>>>>>> [    0.311136] CPU: 4 PID: 9737 Comm: run_linux_amd64 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
>>>>>>>>> [    0.311137] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
>>>>>>>>> [    0.311138] task: ffff883f6e6a8000 task.stack: ffffc9000f1ec000
>>>>>>>>> [    0.311141] RIP: 0010:native_queued_spin_lock_slowpath+0xf5/0x1a0
>>>>>>>>> [    0.311142] RSP: 0018:ffffc9000f1efa28 EFLAGS: 00000046
>>>>>>>>> [    0.311144] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7f722a00
>>>>>>>>> [    0.311145] RDX: 0000000000000035 RSI: 0000000000d80001 RDI: ffff887f7709ca68
>>>>>>>>> [    0.311146] RBP: 0000000000000202 R08: 0000000000140000 R09: 0000000000000000
>>>>>>>>> [    0.311147] R10: ffffc9000f1ef9d8 R11: 000000001a249fa0 R12: ffff887f7709ca68
>>>>>>>>> [    0.311148] R13: ffffc9000f1efad0 R14: 0000000000000000 R15: ffff887f7709ca00
>>>>>>>>> [    0.311149] FS:  000000c423f30090(0000) GS:ffff883f7f700000(0000) knlGS:0000000000000000
>>>>>>>>> [    0.311150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>>>>> [    0.311151] CR2: 00007feefcea4000 CR3: 0000007f7016e001 CR4: 00000000003606e0
>>>>>>>>> [    0.311152] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>>>>>> [    0.311153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>>>>>>> [    0.311154] Call Trace:
>>>>>>>>> [    0.311157]  do_raw_spin_lock+0xad/0xc0
>>>>>>>>> [    0.311160]  _raw_spin_lock_irqsave+0x44/0x4b
>>>>>>>>> [    0.311162]  ? prepare_to_wait_exclusive+0x28/0xb0
>>>>>>>>> [    0.311164]  prepare_to_wait_exclusive+0x28/0xb0
>>>>>>>>> [    0.311167]  wbt_wait+0x127/0x330
>>>>>>>>> [    0.311169]  ? finish_wait+0x80/0x80
>>>>>>>>> [    0.311172]  ? generic_make_request+0xda/0x3b0
>>>>>>>>> [    0.311174]  blk_mq_make_request+0xd6/0x7b0
>>>>>>>>> [    0.311176]  ? blk_queue_enter+0x24/0x260
>>>>>>>>> [    0.311178]  ? generic_make_request+0xda/0x3b0
>>>>>>>>> [    0.311181]  generic_make_request+0x10c/0x3b0
>>>>>>>>> [    0.311183]  ? submit_bio+0x5c/0x110
>>>>>>>>> [    0.311185]  submit_bio+0x5c/0x110
>>>>>>>>> [    0.311197]  ? __ext4_journal_stop+0x36/0xa0 [ext4]
>>>>>>>>> [    0.311210]  ext4_io_submit+0x48/0x60 [ext4]
>>>>>>>>> [    0.311222]  ext4_writepages+0x810/0x11f0 [ext4]
>>>>>>>>> [    0.311229]  ? do_writepages+0x3c/0xd0
>>>>>>>>> [    0.311239]  ? ext4_mark_inode_dirty+0x260/0x260 [ext4]
>>>>>>>>> [    0.311240]  do_writepages+0x3c/0xd0
>>>>>>>>> [    0.311243]  ? _raw_spin_unlock+0x24/0x30
>>>>>>>>> [    0.311245]  ? wbc_attach_and_unlock_inode+0x165/0x280
>>>>>>>>> [    0.311248]  ? __filemap_fdatawrite_range+0xa3/0xe0
>>>>>>>>> [    0.311250]  __filemap_fdatawrite_range+0xa3/0xe0
>>>>>>>>> [    0.311253]  file_write_and_wait_range+0x34/0x90
>>>>>>>>> [    0.311264]  ext4_sync_file+0x151/0x500 [ext4]
>>>>>>>>> [    0.311267]  do_fsync+0x38/0x60
>>>>>>>>> [    0.311270]  SyS_fsync+0xc/0x10
>>>>>>>>> [    0.311272]  do_syscall_64+0x6f/0x170
>>>>>>>>> [    0.311274]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
>>>>>>>>>
>>>>>>>>> In the original patch, wbt_done is waking up all the exclusive
>>>>>>>>> processes in the wait queue, which can cause a thundering herd
>>>>>>>>> if there is a large number of writer threads in the queue. The
>>>>>>>>> original intention of the code seems to be to wake up one thread
>>>>>>>>> only however, it uses wake_up_all() in __wbt_done(), and then
>>>>>>>>> uses the following check in __wbt_wait to have only one thread
>>>>>>>>> actually get out of the wait loop:
>>>>>>>>>
>>>>>>>>> if (waitqueue_active(&rqw->wait) &&
>>>>>>>>>             rqw->wait.head.next != &wait->entry)
>>>>>>>>>                 return false;
>>>>>>>>>
>>>>>>>>> The problem with this is that the wait entry in wbt_wait is
>>>>>>>>> define with DEFINE_WAIT, which uses the autoremove wakeup function.
>>>>>>>>> That means that the above check is invalid - the wait entry will
>>>>>>>>> have been removed from the queue already by the time we hit the
>>>>>>>>> check in the loop.
>>>>>>>>>
>>>>>>>>> Secondly, auto-removing the wait entries also means that the wait
>>>>>>>>> queue essentially gets reordered "randomly" (e.g. threads re-add
>>>>>>>>> themselves in the order they got to run after being woken up).
>>>>>>>>> Additionally, new requests entering wbt_wait might overtake requests
>>>>>>>>> that were queued earlier, because the wait queue will be
>>>>>>>>> (temporarily) empty after the wake_up_all, so the waitqueue_active
>>>>>>>>> check will not stop them. This can cause certain threads to starve
>>>>>>>>> under high load.
>>>>>>>>>
>>>>>>>>> The fix is to leave the woken up requests in the queue and remove
>>>>>>>>> them in finish_wait() once the current thread breaks out of the
>>>>>>>>> wait loop in __wbt_wait. This will ensure new requests always
>>>>>>>>> end up at the back of the queue, and they won't overtake requests
>>>>>>>>> that are already in the wait queue. With that change, the loop
>>>>>>>>> in wbt_wait is also in line with many other wait loops in the kernel.
>>>>>>>>> Waking up just one thread drastically reduces lock contention, as
>>>>>>>>> does moving the wait queue add/remove out of the loop.
>>>>>>>>>
>>>>>>>>> A significant drop in lockdep's lock contention numbers is seen when
>>>>>>>>> running the test application on the patched kernel.
>>>>>>>>
>>>>>>>> I like the patch, and a few weeks ago we independently discovered that
>>>>>>>> the waitqueue list checking was bogus as well. My only worry is that
>>>>>>>> changes like this can be delicate, meaning that it's easy to introduce
>>>>>>>> stall conditions. What kind of testing did you push this through?
>>>>>>>>
>>>>>>>> -- 
>>>>>>>> Jens Axboe
>>>>>>>>
>>>>>>> I ran the following tests on both real HW with NVME devices attached
>>>>>>> and emulated NVME too:
>>>>>>>
>>>>>>> 1. The test case I used to reproduce the issue, spawns a bunch of threads 
>>>>>>>    to concurrently read and write files with random size and content. 
>>>>>>>    Files are randomly fsync'd. The implementation is a FIFO queue of files. 
>>>>>>>    When the queue fills the test starts to verify and remove the files. This 
>>>>>>>    test will fail if there's a read, write, or hash check failure. It tests
>>>>>>>    for file corruption when lots of small files are being read and written 
>>>>>>>    with high concurrency.
>>>>>>>
>>>>>>> 2. Fio for random writes with a root NVME device of 200GB
>>>>>>>   
>>>>>>>   fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite --bs=4k 
>>>>>>>   --direct=0 --size=10G --numjobs=2 --runtime=60 --group_reporting
>>>>>>>   
>>>>>>>   fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite --bs=4k
>>>>>>>   --direct=0 --size=5G --numjobs=2 --runtime=30 --fsync=64 --group_reporting
>>>>>>>   
>>>>>>>   I did see an improvement in the bandwidth numbers reported on the patched
>>>>>>>   kernel. 
>>>>>>>
>>>>>>> Do you have any test case/suite in mind that you would suggest me to 
>>>>>>> run to be sure that patch does not introduce any stall conditions?
>>>>>>
>>>>>> One thing that is always useful is to run xfstest, do a full run on
>>>>>> the device. If that works, then do another full run, this time limiting
>>>>>> the queue depth of the SCSI device to 1. If both of those pass, then
>>>>>> I'd feel pretty good getting this applied for 4.19.
>>>>>
>>>>> Did you get a chance to run this full test?
>>>>>
>>>>> -- 
>>>>> Jens Axboe
>>>>>
>>>>>
>>>>
>>>> Hi Jens,
>>>> Yes I did run the tests and was in the process of compiling concrete results
>>>> I tested following environments against xfs/auto group
>>>> 1. Vanilla 4.18.rc kernel
>>>> 2. 4.18 kernel with the blk-wbt patch
>>>> 3. 4.18 kernel with the blk-wbt patch + io_queue_depth=2. I 
>>>> understand you asked for queue depth for SCSI device=1 however, I have NVME 
>>>> devices in my environment and 2 is the minimum value for io_queue_depth allowed 
>>>> according to the NVME driver code. The results pretty much look same with no 
>>>> stalls or exceptional failures. 
>>>> xfs/auto ran 296 odd tests with 3 failures and 130 something "no runs". 
>>>> Remaining tests passed. "Skipped tests"  were mostly due to missing features
>>>> (eg: reflink support on scratch filesystem)
>>>> The failures were consistent across runs on 3 different environments. 
>>>> I am also running full test suite but it is taking long time as I am 
>>>> hitting kernel BUG in xfs code in some generic tests. This BUG is not 
>>>> related to the patch and  I see them in vanilla kernel too. I am in 
>>>> the process of excluding these kind of tests as they come and 
>>>> re-run the suite however, this proces is time taking. 
>>>> Do you have any specific tests in mind that you would like me 
>>>> to run apart from what I have already tested above?
>>>
>>> Thanks, I think that looks good. I'll get your patch applied for
>>> 4.19.
>>>
>>> -- 
>>> Jens Axboe
>>>
>>>
>>
>> Hi Jens,
>> Thanks for accepting this. There is one small issue, I don't find any emails
>> send by me on the lkml mailing list. I am not sure why it didn't land there,
>> all I can see is your responses. Do you want one of us to resend the patch
>> or will you be able to do it?
> 
> That's odd, are you getting rejections on your emails? For reference, the
> patch is here:
> 
> http://git.kernel.dk/cgit/linux-block/commit/?h=for-4.19/block&id=2887e41b910bb14fd847cf01ab7a5993db989d88

One issue with this, as far as I can tell. Right now we've switched to
waking one task at a time, which is obviously more efficient. But if
we do that with exclusive waits, then we have to ensure that this task
makes progress. If the woken task then fails to get a queueing token,
it will go back to sleep, and we need to ensure that someone still makes
forward progress at that point. There are two ways I can see that
happening:

1) The task woken _always_ gets to queue an IO
2) If the task woken is NOT allowed to queue an IO, then it must select
   a new task to wake up. That new task is then subjected to rule 1 or 2
   as well.

For #1, it could be as simple as:

if (slept || !rwb_enabled(rwb)) {
	atomic_inc(&rqw->inflight);
	break;
}

but this obviously won't always be fair. It might be good enough, however,
compared to having to e.g. replace the generic wait queues with a priority
list/queue.

Note that this isn't an entirely new issue; it's just much easier to
hit with the single wakeups.
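
As a rough illustration of #2 (a sketch under assumptions, not code that
was applied anywhere): a woken task that loses the race for an inflight
slot could explicitly pass the wakeup along before sleeping again. This
assumes the woken task has already been removed from the wait queue by
the time it runs (autoremove-style), so wake_up() targets the next waiter
rather than the caller itself; struct rq_wait and atomic_inc_below() are
assumed to be the existing blk-wbt internals.

/*
 * Illustrative sketch of rule #2 only, not committed code: try to
 * claim an inflight slot; on failure, hand the wakeup to the next
 * exclusive waiter so the queue keeps draining while we go back
 * to sleep.
 */
static bool rqw_claim_or_pass_on(struct rq_wait *rqw, unsigned int limit)
{
	if (atomic_inc_below(&rqw->inflight, limit))
		return true;		/* got a queueing token */

	/* wake_up() wakes at most one exclusive waiter */
	if (waitqueue_active(&rqw->wait))
		wake_up(&rqw->wait);

	return false;			/* caller goes back to sleep */
}

The loop in __wbt_wait() would then retry on a false return, with the
guarantee that a wakeup it could not use was handed on rather than lost.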

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait
@ 2018-08-20 17:34                     ` van der Linden, Frank
  0 siblings, 0 replies; 38+ messages in thread
From: van der Linden, Frank @ 2018-08-20 17:34 UTC (permalink / raw)
  To: Jens Axboe, Agarwal, Anchal
  Cc: linux-block, linux-kernel, Singh, Balbir, Wilson, Matt

On 8/20/18 9:37 AM, Jens Axboe wrote:
> On 8/7/18 3:19 PM, Jens Axboe wrote:
>> On 8/7/18 3:12 PM, Anchal Agarwal wrote:
>>> On Tue, Aug 07, 2018 at 02:39:48PM -0600, Jens Axboe wrote:
>>>> On 8/7/18 2:12 PM, Anchal Agarwal wrote:
>>>>> On Tue, Aug 07, 2018 at 08:29:44AM -0600, Jens Axboe wrote:
>>>>>> On 8/1/18 4:09 PM, Jens Axboe wrote:
>>>>>>> On 8/1/18 11:06 AM, Anchal Agarwal wrote:
>>>>>>>> On Wed, Aug 01, 2018 at 09:14:50AM -0600, Jens Axboe wrote:
>>>>>>>>> On 7/31/18 3:34 PM, Anchal Agarwal wrote:
>>>>>>>>>> Hi folks,
>>>>>>>>>>
>>>>>>>>>> This patch modifies commit e34cbd307477a
>>>>>>>>>> (blk-wbt: add general throttling mechanism)
>>>>>>>>>>
>>>>>>>>>> I am currently running a large bare metal instance (i3.metal)
>>>>>>>>>> on EC2 with 72 cores, 512GB of RAM and NVME drives, with a
>>>>>>>>>> 4.18 kernel. I have a workload that simulates a database
>>>>>>>>>> workload and I am running into lockup issues when writeback
>>>>>>>>>> throttling is enabled,with the hung task detector also
>>>>>>>>>> kicking in.
>>>>>>>>>>
>>>>>>>>>> Crash dumps show that most CPUs (up to 50 of them) are
>>>>>>>>>> all trying to get the wbt wait queue lock while trying to add
>>>>>>>>>> themselves to it in __wbt_wait (see stack traces below).
>>>>>>>>>>
>>>>>>>>>> [    0.948118] CPU: 45 PID: 0 Comm: swapper/45 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
>>>>>>>>>> [    0.948119] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
>>>>>>>>>> [    0.948120] task: ffff883f7878c000 task.stack: ffffc9000c69c000
>>>>>>>>>> [    0.948124] RIP: 0010:native_queued_spin_lock_slowpath+0xf8/0x1a0
>>>>>>>>>> [    0.948125] RSP: 0018:ffff883f7fcc3dc8 EFLAGS: 00000046
>>>>>>>>>> [    0.948126] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7fce2a00
>>>>>>>>>> [    0.948128] RDX: 000000000000001c RSI: 0000000000740001 RDI: ffff887f7709ca68
>>>>>>>>>> [    0.948129] RBP: 0000000000000002 R08: 0000000000b80000 R09: 0000000000000000
>>>>>>>>>> [    0.948130] R10: ffff883f7fcc3d78 R11: 000000000de27121 R12: 0000000000000002
>>>>>>>>>> [    0.948131] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
>>>>>>>>>> [    0.948132] FS:  0000000000000000(0000) GS:ffff883f7fcc0000(0000) knlGS:0000000000000000
>>>>>>>>>> [    0.948134] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>>>>>> [    0.948135] CR2: 000000c424c77000 CR3: 0000000002010005 CR4: 00000000003606e0
>>>>>>>>>> [    0.948136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>>>>>>> [    0.948137] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>>>>>>>> [    0.948138] Call Trace:
>>>>>>>>>> [    0.948139]  <IRQ>
>>>>>>>>>> [    0.948142]  do_raw_spin_lock+0xad/0xc0
>>>>>>>>>> [    0.948145]  _raw_spin_lock_irqsave+0x44/0x4b
>>>>>>>>>> [    0.948149]  ? __wake_up_common_lock+0x53/0x90
>>>>>>>>>> [    0.948150]  __wake_up_common_lock+0x53/0x90
>>>>>>>>>> [    0.948155]  wbt_done+0x7b/0xa0
>>>>>>>>>> [    0.948158]  blk_mq_free_request+0xb7/0x110
>>>>>>>>>> [    0.948161]  __blk_mq_complete_request+0xcb/0x140
>>>>>>>>>> [    0.948166]  nvme_process_cq+0xce/0x1a0 [nvme]
>>>>>>>>>> [    0.948169]  nvme_irq+0x23/0x50 [nvme]
>>>>>>>>>> [    0.948173]  __handle_irq_event_percpu+0x46/0x300
>>>>>>>>>> [    0.948176]  handle_irq_event_percpu+0x20/0x50
>>>>>>>>>> [    0.948179]  handle_irq_event+0x34/0x60
>>>>>>>>>> [    0.948181]  handle_edge_irq+0x77/0x190
>>>>>>>>>> [    0.948185]  handle_irq+0xaf/0x120
>>>>>>>>>> [    0.948188]  do_IRQ+0x53/0x110
>>>>>>>>>> [    0.948191]  common_interrupt+0x87/0x87
>>>>>>>>>> [    0.948192]  </IRQ>
>>>>>>>>>> ....
>>>>>>>>>> [    0.311136] CPU: 4 PID: 9737 Comm: run_linux_amd64 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
>>>>>>>>>> [    0.311137] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
>>>>>>>>>> [    0.311138] task: ffff883f6e6a8000 task.stack: ffffc9000f1ec000
>>>>>>>>>> [    0.311141] RIP: 0010:native_queued_spin_lock_slowpath+0xf5/0x1a0
>>>>>>>>>> [    0.311142] RSP: 0018:ffffc9000f1efa28 EFLAGS: 00000046
>>>>>>>>>> [    0.311144] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7f722a00
>>>>>>>>>> [    0.311145] RDX: 0000000000000035 RSI: 0000000000d80001 RDI: ffff887f7709ca68
>>>>>>>>>> [    0.311146] RBP: 0000000000000202 R08: 0000000000140000 R09: 0000000000000000
>>>>>>>>>> [    0.311147] R10: ffffc9000f1ef9d8 R11: 000000001a249fa0 R12: ffff887f7709ca68
>>>>>>>>>> [    0.311148] R13: ffffc9000f1efad0 R14: 0000000000000000 R15: ffff887f7709ca00
>>>>>>>>>> [    0.311149] FS:  000000c423f30090(0000) GS:ffff883f7f700000(0000) knlGS:0000000000000000
>>>>>>>>>> [    0.311150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>>>>>> [    0.311151] CR2: 00007feefcea4000 CR3: 0000007f7016e001 CR4: 00000000003606e0
>>>>>>>>>> [    0.311152] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>>>>>>> [    0.311153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>>>>>>>> [    0.311154] Call Trace:
>>>>>>>>>> [    0.311157]  do_raw_spin_lock+0xad/0xc0
>>>>>>>>>> [    0.311160]  _raw_spin_lock_irqsave+0x44/0x4b
>>>>>>>>>> [    0.311162]  ? prepare_to_wait_exclusive+0x28/0xb0
>>>>>>>>>> [    0.311164]  prepare_to_wait_exclusive+0x28/0xb0
>>>>>>>>>> [    0.311167]  wbt_wait+0x127/0x330
>>>>>>>>>> [    0.311169]  ? finish_wait+0x80/0x80
>>>>>>>>>> [    0.311172]  ? generic_make_request+0xda/0x3b0
>>>>>>>>>> [    0.311174]  blk_mq_make_request+0xd6/0x7b0
>>>>>>>>>> [    0.311176]  ? blk_queue_enter+0x24/0x260
>>>>>>>>>> [    0.311178]  ? generic_make_request+0xda/0x3b0
>>>>>>>>>> [    0.311181]  generic_make_request+0x10c/0x3b0
>>>>>>>>>> [    0.311183]  ? submit_bio+0x5c/0x110
>>>>>>>>>> [    0.311185]  submit_bio+0x5c/0x110
>>>>>>>>>> [    0.311197]  ? __ext4_journal_stop+0x36/0xa0 [ext4]
>>>>>>>>>> [    0.311210]  ext4_io_submit+0x48/0x60 [ext4]
>>>>>>>>>> [    0.311222]  ext4_writepages+0x810/0x11f0 [ext4]
>>>>>>>>>> [    0.311229]  ? do_writepages+0x3c/0xd0
>>>>>>>>>> [    0.311239]  ? ext4_mark_inode_dirty+0x260/0x260 [ext4]
>>>>>>>>>> [    0.311240]  do_writepages+0x3c/0xd0
>>>>>>>>>> [    0.311243]  ? _raw_spin_unlock+0x24/0x30
>>>>>>>>>> [    0.311245]  ? wbc_attach_and_unlock_inode+0x165/0x280
>>>>>>>>>> [    0.311248]  ? __filemap_fdatawrite_range+0xa3/0xe0
>>>>>>>>>> [    0.311250]  __filemap_fdatawrite_range+0xa3/0xe0
>>>>>>>>>> [    0.311253]  file_write_and_wait_range+0x34/0x90
>>>>>>>>>> [    0.311264]  ext4_sync_file+0x151/0x500 [ext4]
>>>>>>>>>> [    0.311267]  do_fsync+0x38/0x60
>>>>>>>>>> [    0.311270]  SyS_fsync+0xc/0x10
>>>>>>>>>> [    0.311272]  do_syscall_64+0x6f/0x170
>>>>>>>>>> [    0.311274]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
>>>>>>>>>>
>>>>>>>>>> In the original patch, wbt_done is waking up all the exclusive
>>>>>>>>>> processes in the wait queue, which can cause a thundering herd
>>>>>>>>>> if there is a large number of writer threads in the queue. The
>>>>>>>>>> original intention of the code seems to be to wake up one thread
>>>>>>>>>> only however, it uses wake_up_all() in __wbt_done(), and then
>>>>>>>>>> uses the following check in __wbt_wait to have only one thread
>>>>>>>>>> actually get out of the wait loop:
>>>>>>>>>>
>>>>>>>>>> if (waitqueue_active(&rqw->wait) &&
>>>>>>>>>>             rqw->wait.head.next != &wait->entry)
>>>>>>>>>>                 return false;
>>>>>>>>>>
>>>>>>>>>> The problem with this is that the wait entry in wbt_wait is
>>>>>>>>>> define with DEFINE_WAIT, which uses the autoremove wakeup function.
>>>>>>>>>> That means that the above check is invalid - the wait entry will
>>>>>>>>>> have been removed from the queue already by the time we hit the
>>>>>>>>>> check in the loop.
>>>>>>>>>>
>>>>>>>>>> Secondly, auto-removing the wait entries also means that the wait
>>>>>>>>>> queue essentially gets reordered "randomly" (e.g. threads re-add
>>>>>>>>>> themselves in the order they got to run after being woken up).
>>>>>>>>>> Additionally, new requests entering wbt_wait might overtake requests
>>>>>>>>>> that were queued earlier, because the wait queue will be
>>>>>>>>>> (temporarily) empty after the wake_up_all, so the waitqueue_active
>>>>>>>>>> check will not stop them. This can cause certain threads to starve
>>>>>>>>>> under high load.
>>>>>>>>>>
>>>>>>>>>> The fix is to leave the woken up requests in the queue and remove
>>>>>>>>>> them in finish_wait() once the current thread breaks out of the
>>>>>>>>>> wait loop in __wbt_wait. This will ensure new requests always
>>>>>>>>>> end up at the back of the queue, and they won't overtake requests
>>>>>>>>>> that are already in the wait queue. With that change, the loop
>>>>>>>>>> in wbt_wait is also in line with many other wait loops in the kernel.
>>>>>>>>>> Waking up just one thread drastically reduces lock contention, as
>>>>>>>>>> does moving the wait queue add/remove out of the loop.
>>>>>>>>>>
>>>>>>>>>> A significant drop in lockdep's lock contention numbers is seen when
>>>>>>>>>> running the test application on the patched kernel.
>>>>>>>>> I like the patch, and a few weeks ago we independently discovered that
>>>>>>>>> the waitqueue list checking was bogus as well. My only worry is that
>>>>>>>>> changes like this can be delicate, meaning that it's easy to introduce
>>>>>>>>> stall conditions. What kind of testing did you push this through?
>>>>>>>>>
>>>>>>>>> -- 
>>>>>>>>> Jens Axboe
>>>>>>>>>
>>>>>>>> I ran the following tests on both real HW with NVME devices attached
>>>>>>>> and emulated NVME too:
>>>>>>>>
>>>>>>>> 1. The test case I used to reproduce the issue, spawns a bunch of threads 
>>>>>>>>    to concurrently read and write files with random size and content. 
>>>>>>>>    Files are randomly fsync'd. The implementation is a FIFO queue of files. 
>>>>>>>>    When the queue fills the test starts to verify and remove the files. This 
>>>>>>>>    test will fail if there's a read, write, or hash check failure. It tests
>>>>>>>>    for file corruption when lots of small files are being read and written 
>>>>>>>>    with high concurrency.
>>>>>>>>
>>>>>>>> 2. Fio for random writes with a root NVME device of 200GB
>>>>>>>>   
>>>>>>>>   fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite --bs=4k 
>>>>>>>>   --direct=0 --size=10G --numjobs=2 --runtime=60 --group_reporting
>>>>>>>>   
>>>>>>>>   fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite --bs=4k
>>>>>>>>   --direct=0 --size=5G --numjobs=2 --runtime=30 --fsync=64 --group_reporting
>>>>>>>>   
>>>>>>>>   I did see an improvement in the bandwidth numbers reported on the patched
>>>>>>>>   kernel. 
>>>>>>>>
>>>>>>>> Do you have any test case/suite in mind that you would suggest me to 
>>>>>>>> run to be sure that patch does not introduce any stall conditions?
>>>>>>> One thing that is always useful is to run xfstest, do a full run on
>>>>>>> the device. If that works, then do another full run, this time limiting
>>>>>>> the queue depth of the SCSI device to 1. If both of those pass, then
>>>>>>> I'd feel pretty good getting this applied for 4.19.
>>>>>> Did you get a chance to run this full test?
>>>>>>
>>>>>> -- 
>>>>>> Jens Axboe
>>>>>>
>>>>>>
>>>>> Hi Jens,
>>>>> Yes I did run the tests and was in the process of compiling concrete results
>>>>> I tested following environments against xfs/auto group
>>>>> 1. Vanilla 4.18.rc kernel
>>>>> 2. 4.18 kernel with the blk-wbt patch
>>>>> 3. 4.18 kernel with the blk-wbt patch + io_queue_depth=2. I 
>>>>> understand you asked for queue depth for SCSI device=1 however, I have NVME 
>>>>> devices in my environment and 2 is the minimum value for io_queue_depth allowed 
>>>>> according to the NVME driver code. The results pretty much look same with no 
>>>>> stalls or exceptional failures. 
>>>>> xfs/auto ran 296 odd tests with 3 failures and 130 something "no runs". 
>>>>> Remaining tests passed. "Skipped tests"  were mostly due to missing features
>>>>> (eg: reflink support on scratch filesystem)
>>>>> The failures were consistent across runs on 3 different environments. 
>>>>> I am also running full test suite but it is taking long time as I am 
>>>>> hitting kernel BUG in xfs code in some generic tests. This BUG is not 
>>>>> related to the patch and  I see them in vanilla kernel too. I am in 
>>>>> the process of excluding these kind of tests as they come and 
>>>>> re-run the suite however, this proces is time taking. 
>>>>> Do you have any specific tests in mind that you would like me 
>>>>> to run apart from what I have already tested above?
>>>> Thanks, I think that looks good. I'll get your patch applied for
>>>> 4.19.
>>>>
>>>> -- 
>>>> Jens Axboe
>>>>
>>>>
>>> Hi Jens,
>>> Thanks for accepting this. There is one small issue, I don't find any emails
>>> send by me on the lkml mailing list. I am not sure why it didn't land there,
>>> all I can see is your responses. Do you want one of us to resend the patch
>>> or will you be able to do it?
>> That's odd, are you getting rejections on your emails? For reference, the
>> patch is here:
>>
>> http://git.kernel.dk/cgit/linux-block/commit/?h=for-4.19/block&id=2887e41b910bb14fd847cf01ab7a5993db989d88
> One issue with this, as far as I can tell. Right now we've switched to
> waking one task at the time, which is obviously more efficient. But if
> we do that with exclusive waits, then we have to ensure that this task
> makes progress. If we wake up a task, and then fail to get a queueing
> token, then we'll go back to sleep. We need to ensure that someone makes
> forward progress at this point. There are two ways I can see that
> happening:
>
> 1) The task woken _always_ gets to queue an IO
> 2) If the task woken is NOT allowed to queue an IO, then it must select
>    a new task to wake up. That new task is then subjected to rule 1 or 2
>    as well.
>
> For #1, it could be as simple as:
>
> if (slept || !rwb_enabled(rwb)) {
> 	atomic_inc(&rqw->inflight);
> 	break;
> }
>
> but this obviously won't always be fair. Might be good enough however,
> instead of having to eg replace the generic wait queues with a priority
> list/queue.
>
> Note that this isn't an entirely new issue, it's just so much easier to
> hit with the single wakeups.
>
Hi Jens,

What is the scenario that you see under which the woken up task does not
get to run?

The theory behind leaving the task on the wait queue is that the
waitqueue_active check in wbt_wait prevents new tasks from taking up a
slot in the queue (e.g. incrementing inflight). So, there should not be
a way for inflight to be incremented between the time the wake_up is
done and the task at the head of the wait queue runs. That's the idea
anyway :-) If we missed something, let us know.
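
To make that concrete, the patched wait path is roughly shaped like this
(paraphrased from memory and simplified - not the exact code in the tree):

/*
 * Simplified sketch of the patched __wbt_wait(): new arrivals only try
 * to take a slot when nobody is queued, otherwise they line up behind
 * the existing sleepers and stay on the queue until they win a slot.
 */
static void __wbt_wait_sketch(struct rq_wb *rwb, struct rq_wait *rqw,
                              unsigned long rw)
{
        DECLARE_WAITQUEUE(wait, current);

        /* fast path: queue empty, try to grab an inflight slot directly */
        if (!waitqueue_active(&rqw->wait) &&
            atomic_inc_below(&rqw->inflight, get_limit(rwb, rw)))
                return;

        /* otherwise join the tail, FIFO, and stay queued across wakeups */
        add_wait_queue_exclusive(&rqw->wait, &wait);
        do {
                set_current_state(TASK_UNINTERRUPTIBLE);

                if (atomic_inc_below(&rqw->inflight, get_limit(rwb, rw)))
                        break;

                io_schedule();
        } while (1);

        __set_current_state(TASK_RUNNING);
        remove_wait_queue(&rqw->wait, &wait);
}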

- Frank

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait
  2018-08-20 17:34                     ` van der Linden, Frank
  (?)
@ 2018-08-20 19:08                     ` Jens Axboe
  2018-08-20 19:29                       ` Jens Axboe
  -1 siblings, 1 reply; 38+ messages in thread
From: Jens Axboe @ 2018-08-20 19:08 UTC (permalink / raw)
  To: van der Linden, Frank, Agarwal, Anchal
  Cc: linux-block, linux-kernel, Singh, Balbir, Wilson, Matt

On 8/20/18 11:34 AM, van der Linden, Frank wrote:
> On 8/20/18 9:37 AM, Jens Axboe wrote:
>> On 8/7/18 3:19 PM, Jens Axboe wrote:
>>> On 8/7/18 3:12 PM, Anchal Agarwal wrote:
>>>> On Tue, Aug 07, 2018 at 02:39:48PM -0600, Jens Axboe wrote:
>>>>> On 8/7/18 2:12 PM, Anchal Agarwal wrote:
>>>>>> On Tue, Aug 07, 2018 at 08:29:44AM -0600, Jens Axboe wrote:
>>>>>>> On 8/1/18 4:09 PM, Jens Axboe wrote:
>>>>>>>> On 8/1/18 11:06 AM, Anchal Agarwal wrote:
>>>>>>>>> On Wed, Aug 01, 2018 at 09:14:50AM -0600, Jens Axboe wrote:
>>>>>>>>>> On 7/31/18 3:34 PM, Anchal Agarwal wrote:
>>>>>>>>>>> Hi folks,
>>>>>>>>>>>
>>>>>>>>>>> This patch modifies commit e34cbd307477a
>>>>>>>>>>> (blk-wbt: add general throttling mechanism)
>>>>>>>>>>>
>>>>>>>>>>> I am currently running a large bare metal instance (i3.metal)
>>>>>>>>>>> on EC2 with 72 cores, 512GB of RAM and NVME drives, with a
>>>>>>>>>>> 4.18 kernel. I have a workload that simulates a database
>>>>>>>>>>> workload and I am running into lockup issues when writeback
>>>>>>>>>>> throttling is enabled,with the hung task detector also
>>>>>>>>>>> kicking in.
>>>>>>>>>>>
>>>>>>>>>>> Crash dumps show that most CPUs (up to 50 of them) are
>>>>>>>>>>> all trying to get the wbt wait queue lock while trying to add
>>>>>>>>>>> themselves to it in __wbt_wait (see stack traces below).
>>>>>>>>>>>
>>>>>>>>>>> [    0.948118] CPU: 45 PID: 0 Comm: swapper/45 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
>>>>>>>>>>> [    0.948119] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
>>>>>>>>>>> [    0.948120] task: ffff883f7878c000 task.stack: ffffc9000c69c000
>>>>>>>>>>> [    0.948124] RIP: 0010:native_queued_spin_lock_slowpath+0xf8/0x1a0
>>>>>>>>>>> [    0.948125] RSP: 0018:ffff883f7fcc3dc8 EFLAGS: 00000046
>>>>>>>>>>> [    0.948126] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7fce2a00
>>>>>>>>>>> [    0.948128] RDX: 000000000000001c RSI: 0000000000740001 RDI: ffff887f7709ca68
>>>>>>>>>>> [    0.948129] RBP: 0000000000000002 R08: 0000000000b80000 R09: 0000000000000000
>>>>>>>>>>> [    0.948130] R10: ffff883f7fcc3d78 R11: 000000000de27121 R12: 0000000000000002
>>>>>>>>>>> [    0.948131] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
>>>>>>>>>>> [    0.948132] FS:  0000000000000000(0000) GS:ffff883f7fcc0000(0000) knlGS:0000000000000000
>>>>>>>>>>> [    0.948134] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>>>>>>> [    0.948135] CR2: 000000c424c77000 CR3: 0000000002010005 CR4: 00000000003606e0
>>>>>>>>>>> [    0.948136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>>>>>>>> [    0.948137] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>>>>>>>>> [    0.948138] Call Trace:
>>>>>>>>>>> [    0.948139]  <IRQ>
>>>>>>>>>>> [    0.948142]  do_raw_spin_lock+0xad/0xc0
>>>>>>>>>>> [    0.948145]  _raw_spin_lock_irqsave+0x44/0x4b
>>>>>>>>>>> [    0.948149]  ? __wake_up_common_lock+0x53/0x90
>>>>>>>>>>> [    0.948150]  __wake_up_common_lock+0x53/0x90
>>>>>>>>>>> [    0.948155]  wbt_done+0x7b/0xa0
>>>>>>>>>>> [    0.948158]  blk_mq_free_request+0xb7/0x110
>>>>>>>>>>> [    0.948161]  __blk_mq_complete_request+0xcb/0x140
>>>>>>>>>>> [    0.948166]  nvme_process_cq+0xce/0x1a0 [nvme]
>>>>>>>>>>> [    0.948169]  nvme_irq+0x23/0x50 [nvme]
>>>>>>>>>>> [    0.948173]  __handle_irq_event_percpu+0x46/0x300
>>>>>>>>>>> [    0.948176]  handle_irq_event_percpu+0x20/0x50
>>>>>>>>>>> [    0.948179]  handle_irq_event+0x34/0x60
>>>>>>>>>>> [    0.948181]  handle_edge_irq+0x77/0x190
>>>>>>>>>>> [    0.948185]  handle_irq+0xaf/0x120
>>>>>>>>>>> [    0.948188]  do_IRQ+0x53/0x110
>>>>>>>>>>> [    0.948191]  common_interrupt+0x87/0x87
>>>>>>>>>>> [    0.948192]  </IRQ>
>>>>>>>>>>> ....
>>>>>>>>>>> [    0.311136] CPU: 4 PID: 9737 Comm: run_linux_amd64 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
>>>>>>>>>>> [    0.311137] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
>>>>>>>>>>> [    0.311138] task: ffff883f6e6a8000 task.stack: ffffc9000f1ec000
>>>>>>>>>>> [    0.311141] RIP: 0010:native_queued_spin_lock_slowpath+0xf5/0x1a0
>>>>>>>>>>> [    0.311142] RSP: 0018:ffffc9000f1efa28 EFLAGS: 00000046
>>>>>>>>>>> [    0.311144] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7f722a00
>>>>>>>>>>> [    0.311145] RDX: 0000000000000035 RSI: 0000000000d80001 RDI: ffff887f7709ca68
>>>>>>>>>>> [    0.311146] RBP: 0000000000000202 R08: 0000000000140000 R09: 0000000000000000
>>>>>>>>>>> [    0.311147] R10: ffffc9000f1ef9d8 R11: 000000001a249fa0 R12: ffff887f7709ca68
>>>>>>>>>>> [    0.311148] R13: ffffc9000f1efad0 R14: 0000000000000000 R15: ffff887f7709ca00
>>>>>>>>>>> [    0.311149] FS:  000000c423f30090(0000) GS:ffff883f7f700000(0000) knlGS:0000000000000000
>>>>>>>>>>> [    0.311150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>>>>>>> [    0.311151] CR2: 00007feefcea4000 CR3: 0000007f7016e001 CR4: 00000000003606e0
>>>>>>>>>>> [    0.311152] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>>>>>>>> [    0.311153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>>>>>>>>> [    0.311154] Call Trace:
>>>>>>>>>>> [    0.311157]  do_raw_spin_lock+0xad/0xc0
>>>>>>>>>>> [    0.311160]  _raw_spin_lock_irqsave+0x44/0x4b
>>>>>>>>>>> [    0.311162]  ? prepare_to_wait_exclusive+0x28/0xb0
>>>>>>>>>>> [    0.311164]  prepare_to_wait_exclusive+0x28/0xb0
>>>>>>>>>>> [    0.311167]  wbt_wait+0x127/0x330
>>>>>>>>>>> [    0.311169]  ? finish_wait+0x80/0x80
>>>>>>>>>>> [    0.311172]  ? generic_make_request+0xda/0x3b0
>>>>>>>>>>> [    0.311174]  blk_mq_make_request+0xd6/0x7b0
>>>>>>>>>>> [    0.311176]  ? blk_queue_enter+0x24/0x260
>>>>>>>>>>> [    0.311178]  ? generic_make_request+0xda/0x3b0
>>>>>>>>>>> [    0.311181]  generic_make_request+0x10c/0x3b0
>>>>>>>>>>> [    0.311183]  ? submit_bio+0x5c/0x110
>>>>>>>>>>> [    0.311185]  submit_bio+0x5c/0x110
>>>>>>>>>>> [    0.311197]  ? __ext4_journal_stop+0x36/0xa0 [ext4]
>>>>>>>>>>> [    0.311210]  ext4_io_submit+0x48/0x60 [ext4]
>>>>>>>>>>> [    0.311222]  ext4_writepages+0x810/0x11f0 [ext4]
>>>>>>>>>>> [    0.311229]  ? do_writepages+0x3c/0xd0
>>>>>>>>>>> [    0.311239]  ? ext4_mark_inode_dirty+0x260/0x260 [ext4]
>>>>>>>>>>> [    0.311240]  do_writepages+0x3c/0xd0
>>>>>>>>>>> [    0.311243]  ? _raw_spin_unlock+0x24/0x30
>>>>>>>>>>> [    0.311245]  ? wbc_attach_and_unlock_inode+0x165/0x280
>>>>>>>>>>> [    0.311248]  ? __filemap_fdatawrite_range+0xa3/0xe0
>>>>>>>>>>> [    0.311250]  __filemap_fdatawrite_range+0xa3/0xe0
>>>>>>>>>>> [    0.311253]  file_write_and_wait_range+0x34/0x90
>>>>>>>>>>> [    0.311264]  ext4_sync_file+0x151/0x500 [ext4]
>>>>>>>>>>> [    0.311267]  do_fsync+0x38/0x60
>>>>>>>>>>> [    0.311270]  SyS_fsync+0xc/0x10
>>>>>>>>>>> [    0.311272]  do_syscall_64+0x6f/0x170
>>>>>>>>>>> [    0.311274]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
>>>>>>>>>>>
>>>>>>>>>>> In the original patch, wbt_done is waking up all the exclusive
>>>>>>>>>>> processes in the wait queue, which can cause a thundering herd
>>>>>>>>>>> if there is a large number of writer threads in the queue. The
>>>>>>>>>>> original intention of the code seems to be to wake up one thread
>>>>>>>>>>> only however, it uses wake_up_all() in __wbt_done(), and then
>>>>>>>>>>> uses the following check in __wbt_wait to have only one thread
>>>>>>>>>>> actually get out of the wait loop:
>>>>>>>>>>>
>>>>>>>>>>> if (waitqueue_active(&rqw->wait) &&
>>>>>>>>>>>             rqw->wait.head.next != &wait->entry)
>>>>>>>>>>>                 return false;
>>>>>>>>>>>
>>>>>>>>>>> The problem with this is that the wait entry in wbt_wait is
>>>>>>>>>>> define with DEFINE_WAIT, which uses the autoremove wakeup function.
>>>>>>>>>>> That means that the above check is invalid - the wait entry will
>>>>>>>>>>> have been removed from the queue already by the time we hit the
>>>>>>>>>>> check in the loop.
>>>>>>>>>>>
>>>>>>>>>>> Secondly, auto-removing the wait entries also means that the wait
>>>>>>>>>>> queue essentially gets reordered "randomly" (e.g. threads re-add
>>>>>>>>>>> themselves in the order they got to run after being woken up).
>>>>>>>>>>> Additionally, new requests entering wbt_wait might overtake requests
>>>>>>>>>>> that were queued earlier, because the wait queue will be
>>>>>>>>>>> (temporarily) empty after the wake_up_all, so the waitqueue_active
>>>>>>>>>>> check will not stop them. This can cause certain threads to starve
>>>>>>>>>>> under high load.
>>>>>>>>>>>
>>>>>>>>>>> The fix is to leave the woken up requests in the queue and remove
>>>>>>>>>>> them in finish_wait() once the current thread breaks out of the
>>>>>>>>>>> wait loop in __wbt_wait. This will ensure new requests always
>>>>>>>>>>> end up at the back of the queue, and they won't overtake requests
>>>>>>>>>>> that are already in the wait queue. With that change, the loop
>>>>>>>>>>> in wbt_wait is also in line with many other wait loops in the kernel.
>>>>>>>>>>> Waking up just one thread drastically reduces lock contention, as
>>>>>>>>>>> does moving the wait queue add/remove out of the loop.
>>>>>>>>>>>
>>>>>>>>>>> A significant drop in lockdep's lock contention numbers is seen when
>>>>>>>>>>> running the test application on the patched kernel.
>>>>>>>>>> I like the patch, and a few weeks ago we independently discovered that
>>>>>>>>>> the waitqueue list checking was bogus as well. My only worry is that
>>>>>>>>>> changes like this can be delicate, meaning that it's easy to introduce
>>>>>>>>>> stall conditions. What kind of testing did you push this through?
>>>>>>>>>>
>>>>>>>>>> -- 
>>>>>>>>>> Jens Axboe
>>>>>>>>>>
>>>>>>>>> I ran the following tests on both real HW with NVME devices attached
>>>>>>>>> and emulated NVME too:
>>>>>>>>>
>>>>>>>>> 1. The test case I used to reproduce the issue, spawns a bunch of threads 
>>>>>>>>>    to concurrently read and write files with random size and content. 
>>>>>>>>>    Files are randomly fsync'd. The implementation is a FIFO queue of files. 
>>>>>>>>>    When the queue fills the test starts to verify and remove the files. This 
>>>>>>>>>    test will fail if there's a read, write, or hash check failure. It tests
>>>>>>>>>    for file corruption when lots of small files are being read and written 
>>>>>>>>>    with high concurrency.
>>>>>>>>>
>>>>>>>>> 2. Fio for random writes with a root NVME device of 200GB
>>>>>>>>>   
>>>>>>>>>   fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite --bs=4k 
>>>>>>>>>   --direct=0 --size=10G --numjobs=2 --runtime=60 --group_reporting
>>>>>>>>>   
>>>>>>>>>   fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite --bs=4k
>>>>>>>>>   --direct=0 --size=5G --numjobs=2 --runtime=30 --fsync=64 --group_reporting
>>>>>>>>>   
>>>>>>>>>   I did see an improvement in the bandwidth numbers reported on the patched
>>>>>>>>>   kernel. 
>>>>>>>>>
>>>>>>>>> Do you have any test case/suite in mind that you would suggest me to 
>>>>>>>>> run to be sure that patch does not introduce any stall conditions?
>>>>>>>> One thing that is always useful is to run xfstest, do a full run on
>>>>>>>> the device. If that works, then do another full run, this time limiting
>>>>>>>> the queue depth of the SCSI device to 1. If both of those pass, then
>>>>>>>> I'd feel pretty good getting this applied for 4.19.
>>>>>>> Did you get a chance to run this full test?
>>>>>>>
>>>>>>> -- 
>>>>>>> Jens Axboe
>>>>>>>
>>>>>>>
>>>>>> Hi Jens,
>>>>>> Yes I did run the tests and was in the process of compiling concrete results
>>>>>> I tested following environments against xfs/auto group
>>>>>> 1. Vanilla 4.18.rc kernel
>>>>>> 2. 4.18 kernel with the blk-wbt patch
>>>>>> 3. 4.18 kernel with the blk-wbt patch + io_queue_depth=2. I 
>>>>>> understand you asked for queue depth for SCSI device=1 however, I have NVME 
>>>>>> devices in my environment and 2 is the minimum value for io_queue_depth allowed 
>>>>>> according to the NVME driver code. The results pretty much look same with no 
>>>>>> stalls or exceptional failures. 
>>>>>> xfs/auto ran 296 odd tests with 3 failures and 130 something "no runs". 
>>>>>> Remaining tests passed. "Skipped tests"  were mostly due to missing features
>>>>>> (eg: reflink support on scratch filesystem)
>>>>>> The failures were consistent across runs on 3 different environments. 
>>>>>> I am also running full test suite but it is taking long time as I am 
>>>>>> hitting kernel BUG in xfs code in some generic tests. This BUG is not 
>>>>>> related to the patch and  I see them in vanilla kernel too. I am in 
>>>>>> the process of excluding these kind of tests as they come and 
>>>>>> re-run the suite however, this proces is time taking. 
>>>>>> Do you have any specific tests in mind that you would like me 
>>>>>> to run apart from what I have already tested above?
>>>>> Thanks, I think that looks good. I'll get your patch applied for
>>>>> 4.19.
>>>>>
>>>>> -- 
>>>>> Jens Axboe
>>>>>
>>>>>
>>>> Hi Jens,
>>>> Thanks for accepting this. There is one small issue, I don't find any emails
>>>> send by me on the lkml mailing list. I am not sure why it didn't land there,
>>>> all I can see is your responses. Do you want one of us to resend the patch
>>>> or will you be able to do it?
>>> That's odd, are you getting rejections on your emails? For reference, the
>>> patch is here:
>>>
>>> http://git.kernel.dk/cgit/linux-block/commit/?h=for-4.19/block&id=2887e41b910bb14fd847cf01ab7a5993db989d88
>> One issue with this, as far as I can tell. Right now we've switched to
>> waking one task at the time, which is obviously more efficient. But if
>> we do that with exclusive waits, then we have to ensure that this task
>> makes progress. If we wake up a task, and then fail to get a queueing
>> token, then we'll go back to sleep. We need to ensure that someone makes
>> forward progress at this point. There are two ways I can see that
>> happening:
>>
>> 1) The task woken _always_ gets to queue an IO
>> 2) If the task woken is NOT allowed to queue an IO, then it must select
>>    a new task to wake up. That new task is then subjected to rule 1 or 2
>>    as well.
>>
>> For #1, it could be as simple as:
>>
>> if (slept || !rwb_enabled(rwb)) {
>> 	atomic_inc(&rqw->inflight);
>> 	break;
>> }
>>
>> but this obviously won't always be fair. Might be good enough however,
>> instead of having to eg replace the generic wait queues with a priority
>> list/queue.
>>
>> Note that this isn't an entirely new issue, it's just so much easier to
>> hit with the single wakeups.
>>
> Hi Jens,
> 
> What is the scenario that you see under which the woken up task does not
> get to run?

That scenario is pretty easy to hit - let's say the next in line task
has a queue limit of 1, and we currently have 4 pending. Task gets
woken, goes back to sleep. Which should be totally fine. At some point
we'll get below the limit, and allow the task to proceed. This will
ensure forward progress.
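
To be clear, the retry in the loop is just the inc-below gate, roughly this
shape (going from memory, not the exact helper in blk-wbt.c):

/*
 * The slot grab only succeeds once inflight is below the current
 * limit, so a task that is woken "too early" simply fails the check
 * and sleeps again until completions bring inflight down.
 */
static bool atomic_inc_below(atomic_t *v, int below)
{
        int cur = atomic_read(v);

        for (;;) {
                int old;

                if (cur >= below)
                        return false;
                old = atomic_cmpxchg(v, cur, cur + 1);
                if (old == cur)
                        return true;
                cur = old;
        }
}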

> The theory behind leaving the task on the wait queue is that the
> waitqueue_active check in wbt_wait prevents new tasks from taking up a
> slot in the queue (e.g. incrementing inflight). So, there should not be
> a way for inflight to be incremented between the time the wake_up is
> done and the task at the head of the wait queue runs. That's the idea
> anyway :-) If we missed something, let us know.

And that's a fine theory, I think it's a good improvement (and how it
should have worked). I'm struggling to see where the issue is. Perhaps
it's related to the wq active check. With fewer wakeups, we're more
likely to hit a race there.
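
For reference, waitqueue_active() on its own is unordered against the
sleeper adding itself, which is exactly the kind of window fewer wakeups
make wider; the barrier variant in <linux/wait.h> is roughly:

static inline bool wq_has_sleeper(struct wait_queue_head *wq_head)
{
        /*
         * Full barrier so the waker's read of the wait list is ordered
         * against the waiter's add_wait_queue()/set_current_state();
         * plain waitqueue_active() has no such ordering and can see a
         * stale "empty" list and skip the wakeup.
         */
        smp_mb();
        return waitqueue_active(wq_head);
}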

I'll poke at it...

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait
  2018-08-20 19:08                     ` Jens Axboe
@ 2018-08-20 19:29                       ` Jens Axboe
  2018-08-20 20:19                           ` van der Linden, Frank
  0 siblings, 1 reply; 38+ messages in thread
From: Jens Axboe @ 2018-08-20 19:29 UTC (permalink / raw)
  To: van der Linden, Frank, Agarwal, Anchal
  Cc: linux-block, linux-kernel, Singh, Balbir, Wilson, Matt

On 8/20/18 1:08 PM, Jens Axboe wrote:
> On 8/20/18 11:34 AM, van der Linden, Frank wrote:
>> On 8/20/18 9:37 AM, Jens Axboe wrote:
>>> On 8/7/18 3:19 PM, Jens Axboe wrote:
>>>> On 8/7/18 3:12 PM, Anchal Agarwal wrote:
>>>>> On Tue, Aug 07, 2018 at 02:39:48PM -0600, Jens Axboe wrote:
>>>>>> On 8/7/18 2:12 PM, Anchal Agarwal wrote:
>>>>>>> On Tue, Aug 07, 2018 at 08:29:44AM -0600, Jens Axboe wrote:
>>>>>>>> On 8/1/18 4:09 PM, Jens Axboe wrote:
>>>>>>>>> On 8/1/18 11:06 AM, Anchal Agarwal wrote:
>>>>>>>>>> On Wed, Aug 01, 2018 at 09:14:50AM -0600, Jens Axboe wrote:
>>>>>>>>>>> On 7/31/18 3:34 PM, Anchal Agarwal wrote:
>>>>>>>>>>>> Hi folks,
>>>>>>>>>>>>
>>>>>>>>>>>> This patch modifies commit e34cbd307477a
>>>>>>>>>>>> (blk-wbt: add general throttling mechanism)
>>>>>>>>>>>>
>>>>>>>>>>>> I am currently running a large bare metal instance (i3.metal)
>>>>>>>>>>>> on EC2 with 72 cores, 512GB of RAM and NVME drives, with a
>>>>>>>>>>>> 4.18 kernel. I have a workload that simulates a database
>>>>>>>>>>>> workload and I am running into lockup issues when writeback
>>>>>>>>>>>> throttling is enabled,with the hung task detector also
>>>>>>>>>>>> kicking in.
>>>>>>>>>>>>
>>>>>>>>>>>> Crash dumps show that most CPUs (up to 50 of them) are
>>>>>>>>>>>> all trying to get the wbt wait queue lock while trying to add
>>>>>>>>>>>> themselves to it in __wbt_wait (see stack traces below).
>>>>>>>>>>>>
>>>>>>>>>>>> [    0.948118] CPU: 45 PID: 0 Comm: swapper/45 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
>>>>>>>>>>>> [    0.948119] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
>>>>>>>>>>>> [    0.948120] task: ffff883f7878c000 task.stack: ffffc9000c69c000
>>>>>>>>>>>> [    0.948124] RIP: 0010:native_queued_spin_lock_slowpath+0xf8/0x1a0
>>>>>>>>>>>> [    0.948125] RSP: 0018:ffff883f7fcc3dc8 EFLAGS: 00000046
>>>>>>>>>>>> [    0.948126] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7fce2a00
>>>>>>>>>>>> [    0.948128] RDX: 000000000000001c RSI: 0000000000740001 RDI: ffff887f7709ca68
>>>>>>>>>>>> [    0.948129] RBP: 0000000000000002 R08: 0000000000b80000 R09: 0000000000000000
>>>>>>>>>>>> [    0.948130] R10: ffff883f7fcc3d78 R11: 000000000de27121 R12: 0000000000000002
>>>>>>>>>>>> [    0.948131] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
>>>>>>>>>>>> [    0.948132] FS:  0000000000000000(0000) GS:ffff883f7fcc0000(0000) knlGS:0000000000000000
>>>>>>>>>>>> [    0.948134] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>>>>>>>> [    0.948135] CR2: 000000c424c77000 CR3: 0000000002010005 CR4: 00000000003606e0
>>>>>>>>>>>> [    0.948136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>>>>>>>>> [    0.948137] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>>>>>>>>>> [    0.948138] Call Trace:
>>>>>>>>>>>> [    0.948139]  <IRQ>
>>>>>>>>>>>> [    0.948142]  do_raw_spin_lock+0xad/0xc0
>>>>>>>>>>>> [    0.948145]  _raw_spin_lock_irqsave+0x44/0x4b
>>>>>>>>>>>> [    0.948149]  ? __wake_up_common_lock+0x53/0x90
>>>>>>>>>>>> [    0.948150]  __wake_up_common_lock+0x53/0x90
>>>>>>>>>>>> [    0.948155]  wbt_done+0x7b/0xa0
>>>>>>>>>>>> [    0.948158]  blk_mq_free_request+0xb7/0x110
>>>>>>>>>>>> [    0.948161]  __blk_mq_complete_request+0xcb/0x140
>>>>>>>>>>>> [    0.948166]  nvme_process_cq+0xce/0x1a0 [nvme]
>>>>>>>>>>>> [    0.948169]  nvme_irq+0x23/0x50 [nvme]
>>>>>>>>>>>> [    0.948173]  __handle_irq_event_percpu+0x46/0x300
>>>>>>>>>>>> [    0.948176]  handle_irq_event_percpu+0x20/0x50
>>>>>>>>>>>> [    0.948179]  handle_irq_event+0x34/0x60
>>>>>>>>>>>> [    0.948181]  handle_edge_irq+0x77/0x190
>>>>>>>>>>>> [    0.948185]  handle_irq+0xaf/0x120
>>>>>>>>>>>> [    0.948188]  do_IRQ+0x53/0x110
>>>>>>>>>>>> [    0.948191]  common_interrupt+0x87/0x87
>>>>>>>>>>>> [    0.948192]  </IRQ>
>>>>>>>>>>>> ....
>>>>>>>>>>>> [    0.311136] CPU: 4 PID: 9737 Comm: run_linux_amd64 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
>>>>>>>>>>>> [    0.311137] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
>>>>>>>>>>>> [    0.311138] task: ffff883f6e6a8000 task.stack: ffffc9000f1ec000
>>>>>>>>>>>> [    0.311141] RIP: 0010:native_queued_spin_lock_slowpath+0xf5/0x1a0
>>>>>>>>>>>> [    0.311142] RSP: 0018:ffffc9000f1efa28 EFLAGS: 00000046
>>>>>>>>>>>> [    0.311144] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7f722a00
>>>>>>>>>>>> [    0.311145] RDX: 0000000000000035 RSI: 0000000000d80001 RDI: ffff887f7709ca68
>>>>>>>>>>>> [    0.311146] RBP: 0000000000000202 R08: 0000000000140000 R09: 0000000000000000
>>>>>>>>>>>> [    0.311147] R10: ffffc9000f1ef9d8 R11: 000000001a249fa0 R12: ffff887f7709ca68
>>>>>>>>>>>> [    0.311148] R13: ffffc9000f1efad0 R14: 0000000000000000 R15: ffff887f7709ca00
>>>>>>>>>>>> [    0.311149] FS:  000000c423f30090(0000) GS:ffff883f7f700000(0000) knlGS:0000000000000000
>>>>>>>>>>>> [    0.311150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>>>>>>>> [    0.311151] CR2: 00007feefcea4000 CR3: 0000007f7016e001 CR4: 00000000003606e0
>>>>>>>>>>>> [    0.311152] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>>>>>>>>> [    0.311153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>>>>>>>>>> [    0.311154] Call Trace:
>>>>>>>>>>>> [    0.311157]  do_raw_spin_lock+0xad/0xc0
>>>>>>>>>>>> [    0.311160]  _raw_spin_lock_irqsave+0x44/0x4b
>>>>>>>>>>>> [    0.311162]  ? prepare_to_wait_exclusive+0x28/0xb0
>>>>>>>>>>>> [    0.311164]  prepare_to_wait_exclusive+0x28/0xb0
>>>>>>>>>>>> [    0.311167]  wbt_wait+0x127/0x330
>>>>>>>>>>>> [    0.311169]  ? finish_wait+0x80/0x80
>>>>>>>>>>>> [    0.311172]  ? generic_make_request+0xda/0x3b0
>>>>>>>>>>>> [    0.311174]  blk_mq_make_request+0xd6/0x7b0
>>>>>>>>>>>> [    0.311176]  ? blk_queue_enter+0x24/0x260
>>>>>>>>>>>> [    0.311178]  ? generic_make_request+0xda/0x3b0
>>>>>>>>>>>> [    0.311181]  generic_make_request+0x10c/0x3b0
>>>>>>>>>>>> [    0.311183]  ? submit_bio+0x5c/0x110
>>>>>>>>>>>> [    0.311185]  submit_bio+0x5c/0x110
>>>>>>>>>>>> [    0.311197]  ? __ext4_journal_stop+0x36/0xa0 [ext4]
>>>>>>>>>>>> [    0.311210]  ext4_io_submit+0x48/0x60 [ext4]
>>>>>>>>>>>> [    0.311222]  ext4_writepages+0x810/0x11f0 [ext4]
>>>>>>>>>>>> [    0.311229]  ? do_writepages+0x3c/0xd0
>>>>>>>>>>>> [    0.311239]  ? ext4_mark_inode_dirty+0x260/0x260 [ext4]
>>>>>>>>>>>> [    0.311240]  do_writepages+0x3c/0xd0
>>>>>>>>>>>> [    0.311243]  ? _raw_spin_unlock+0x24/0x30
>>>>>>>>>>>> [    0.311245]  ? wbc_attach_and_unlock_inode+0x165/0x280
>>>>>>>>>>>> [    0.311248]  ? __filemap_fdatawrite_range+0xa3/0xe0
>>>>>>>>>>>> [    0.311250]  __filemap_fdatawrite_range+0xa3/0xe0
>>>>>>>>>>>> [    0.311253]  file_write_and_wait_range+0x34/0x90
>>>>>>>>>>>> [    0.311264]  ext4_sync_file+0x151/0x500 [ext4]
>>>>>>>>>>>> [    0.311267]  do_fsync+0x38/0x60
>>>>>>>>>>>> [    0.311270]  SyS_fsync+0xc/0x10
>>>>>>>>>>>> [    0.311272]  do_syscall_64+0x6f/0x170
>>>>>>>>>>>> [    0.311274]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
>>>>>>>>>>>>
>>>>>>>>>>>> In the original patch, wbt_done is waking up all the exclusive
>>>>>>>>>>>> processes in the wait queue, which can cause a thundering herd
>>>>>>>>>>>> if there is a large number of writer threads in the queue. The
>>>>>>>>>>>> original intention of the code seems to be to wake up one thread
>>>>>>>>>>>> only however, it uses wake_up_all() in __wbt_done(), and then
>>>>>>>>>>>> uses the following check in __wbt_wait to have only one thread
>>>>>>>>>>>> actually get out of the wait loop:
>>>>>>>>>>>>
>>>>>>>>>>>> if (waitqueue_active(&rqw->wait) &&
>>>>>>>>>>>>             rqw->wait.head.next != &wait->entry)
>>>>>>>>>>>>                 return false;
>>>>>>>>>>>>
>>>>>>>>>>>> The problem with this is that the wait entry in wbt_wait is
>>>>>>>>>>>> define with DEFINE_WAIT, which uses the autoremove wakeup function.
>>>>>>>>>>>> That means that the above check is invalid - the wait entry will
>>>>>>>>>>>> have been removed from the queue already by the time we hit the
>>>>>>>>>>>> check in the loop.
>>>>>>>>>>>>
>>>>>>>>>>>> Secondly, auto-removing the wait entries also means that the wait
>>>>>>>>>>>> queue essentially gets reordered "randomly" (e.g. threads re-add
>>>>>>>>>>>> themselves in the order they got to run after being woken up).
>>>>>>>>>>>> Additionally, new requests entering wbt_wait might overtake requests
>>>>>>>>>>>> that were queued earlier, because the wait queue will be
>>>>>>>>>>>> (temporarily) empty after the wake_up_all, so the waitqueue_active
>>>>>>>>>>>> check will not stop them. This can cause certain threads to starve
>>>>>>>>>>>> under high load.
>>>>>>>>>>>>
>>>>>>>>>>>> The fix is to leave the woken up requests in the queue and remove
>>>>>>>>>>>> them in finish_wait() once the current thread breaks out of the
>>>>>>>>>>>> wait loop in __wbt_wait. This will ensure new requests always
>>>>>>>>>>>> end up at the back of the queue, and they won't overtake requests
>>>>>>>>>>>> that are already in the wait queue. With that change, the loop
>>>>>>>>>>>> in wbt_wait is also in line with many other wait loops in the kernel.
>>>>>>>>>>>> Waking up just one thread drastically reduces lock contention, as
>>>>>>>>>>>> does moving the wait queue add/remove out of the loop.
>>>>>>>>>>>>
>>>>>>>>>>>> A significant drop in lockdep's lock contention numbers is seen when
>>>>>>>>>>>> running the test application on the patched kernel.
>>>>>>>>>>> I like the patch, and a few weeks ago we independently discovered that
>>>>>>>>>>> the waitqueue list checking was bogus as well. My only worry is that
>>>>>>>>>>> changes like this can be delicate, meaning that it's easy to introduce
>>>>>>>>>>> stall conditions. What kind of testing did you push this through?
>>>>>>>>>>>
>>>>>>>>>>> -- 
>>>>>>>>>>> Jens Axboe
>>>>>>>>>>>
>>>>>>>>>> I ran the following tests on both real HW with NVME devices attached
>>>>>>>>>> and emulated NVME too:
>>>>>>>>>>
>>>>>>>>>> 1. The test case I used to reproduce the issue, spawns a bunch of threads 
>>>>>>>>>>    to concurrently read and write files with random size and content. 
>>>>>>>>>>    Files are randomly fsync'd. The implementation is a FIFO queue of files. 
>>>>>>>>>>    When the queue fills the test starts to verify and remove the files. This 
>>>>>>>>>>    test will fail if there's a read, write, or hash check failure. It tests
>>>>>>>>>>    for file corruption when lots of small files are being read and written 
>>>>>>>>>>    with high concurrency.
>>>>>>>>>>
>>>>>>>>>> 2. Fio for random writes with a root NVME device of 200GB
>>>>>>>>>>   
>>>>>>>>>>   fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite --bs=4k 
>>>>>>>>>>   --direct=0 --size=10G --numjobs=2 --runtime=60 --group_reporting
>>>>>>>>>>   
>>>>>>>>>>   fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite --bs=4k
>>>>>>>>>>   --direct=0 --size=5G --numjobs=2 --runtime=30 --fsync=64 --group_reporting
>>>>>>>>>>   
>>>>>>>>>>   I did see an improvement in the bandwidth numbers reported on the patched
>>>>>>>>>>   kernel. 
>>>>>>>>>>
>>>>>>>>>> Do you have any test case/suite in mind that you would suggest me to 
>>>>>>>>>> run to be sure that patch does not introduce any stall conditions?
>>>>>>>>> One thing that is always useful is to run xfstest, do a full run on
>>>>>>>>> the device. If that works, then do another full run, this time limiting
>>>>>>>>> the queue depth of the SCSI device to 1. If both of those pass, then
>>>>>>>>> I'd feel pretty good getting this applied for 4.19.
>>>>>>>> Did you get a chance to run this full test?
>>>>>>>>
>>>>>>>> -- 
>>>>>>>> Jens Axboe
>>>>>>>>
>>>>>>>>
>>>>>>> Hi Jens,
>>>>>>> Yes I did run the tests and was in the process of compiling concrete results
>>>>>>> I tested following environments against xfs/auto group
>>>>>>> 1. Vanilla 4.18.rc kernel
>>>>>>> 2. 4.18 kernel with the blk-wbt patch
>>>>>>> 3. 4.18 kernel with the blk-wbt patch + io_queue_depth=2. I 
>>>>>>> understand you asked for queue depth for SCSI device=1 however, I have NVME 
>>>>>>> devices in my environment and 2 is the minimum value for io_queue_depth allowed 
>>>>>>> according to the NVME driver code. The results pretty much look same with no 
>>>>>>> stalls or exceptional failures. 
>>>>>>> xfs/auto ran 296 odd tests with 3 failures and 130 something "no runs". 
>>>>>>> Remaining tests passed. "Skipped tests"  were mostly due to missing features
>>>>>>> (eg: reflink support on scratch filesystem)
>>>>>>> The failures were consistent across runs on 3 different environments. 
>>>>>>> I am also running full test suite but it is taking long time as I am 
>>>>>>> hitting kernel BUG in xfs code in some generic tests. This BUG is not 
>>>>>>> related to the patch and  I see them in vanilla kernel too. I am in 
>>>>>>> the process of excluding these kind of tests as they come and 
>>>>>>> re-run the suite however, this proces is time taking. 
>>>>>>> Do you have any specific tests in mind that you would like me 
>>>>>>> to run apart from what I have already tested above?
>>>>>> Thanks, I think that looks good. I'll get your patch applied for
>>>>>> 4.19.
>>>>>>
>>>>>> -- 
>>>>>> Jens Axboe
>>>>>>
>>>>>>
>>>>> Hi Jens,
>>>>> Thanks for accepting this. There is one small issue, I don't find any emails
>>>>> send by me on the lkml mailing list. I am not sure why it didn't land there,
>>>>> all I can see is your responses. Do you want one of us to resend the patch
>>>>> or will you be able to do it?
>>>> That's odd, are you getting rejections on your emails? For reference, the
>>>> patch is here:
>>>>
>>>> http://git.kernel.dk/cgit/linux-block/commit/?h=for-4.19/block&id=2887e41b910bb14fd847cf01ab7a5993db989d88
>>> One issue with this, as far as I can tell. Right now we've switched to
>>> waking one task at the time, which is obviously more efficient. But if
>>> we do that with exclusive waits, then we have to ensure that this task
>>> makes progress. If we wake up a task, and then fail to get a queueing
>>> token, then we'll go back to sleep. We need to ensure that someone makes
>>> forward progress at this point. There are two ways I can see that
>>> happening:
>>>
>>> 1) The task woken _always_ gets to queue an IO
>>> 2) If the task woken is NOT allowed to queue an IO, then it must select
>>>    a new task to wake up. That new task is then subjected to rule 1 or 2
>>>    as well.
>>>
>>> For #1, it could be as simple as:
>>>
>>> if (slept || !rwb_enabled(rwb)) {
>>> 	atomic_inc(&rqw->inflight);
>>> 	break;
>>> }
>>>
>>> but this obviously won't always be fair. Might be good enough however,
>>> instead of having to eg replace the generic wait queues with a priority
>>> list/queue.
>>>
>>> Note that this isn't an entirely new issue, it's just so much easier to
>>> hit with the single wakeups.
>>>
>> Hi Jens,
>>
>> What is the scenario that you see under which the woken up task does not
>> get to run?
> 
> That scenario is pretty easy to hit - let's say the next in line task
> has a queue limit of 1, and we currently have 4 pending. Task gets
> woken, goes back to sleep. Which should be totally fine. At some point
> we'll get below the limit, and allow the task to proceed. This will
> ensure forward progress.
> 
>> The theory behind leaving the task on the wait queue is that the
>> waitqueue_active check in wbt_wait prevents new tasks from taking up a
>> slot in the queue (e.g. incrementing inflight). So, there should not be
>> a way for inflight to be incremented between the time the wake_up is
>> done and the task at the head of the wait queue runs. That's the idea
>> anyway :-) If we missed something, let us know.
> 
> And that's a fine theory, I think it's a good improvement (and how it
> should have worked). I'm struggling to see where the issue is. Perhaps
> it's related to the wq active check. With fewer wakeups, we're more
> likely to hit a race there.
> 
> I'll poke at it...

Trying something like this:

http://git.kernel.dk/cgit/linux-block/log/?h=for-4.19/wbt

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait
  2018-08-20 19:29                       ` Jens Axboe
@ 2018-08-20 20:19                           ` van der Linden, Frank
  0 siblings, 0 replies; 38+ messages in thread
From: van der Linden, Frank @ 2018-08-20 20:19 UTC (permalink / raw)
  To: Jens Axboe, Agarwal, Anchal
  Cc: linux-block, linux-kernel, Singh, Balbir, Wilson, Matt

On 8/20/18 12:29 PM, Jens Axboe wrote:=0A=
> On 8/20/18 1:08 PM, Jens Axboe wrote:=0A=
>> On 8/20/18 11:34 AM, van der Linden, Frank wrote:=0A=
>>> On 8/20/18 9:37 AM, Jens Axboe wrote:=0A=
>>>> On 8/7/18 3:19 PM, Jens Axboe wrote:=0A=
>>>>> On 8/7/18 3:12 PM, Anchal Agarwal wrote:=0A=
>>>>>> On Tue, Aug 07, 2018 at 02:39:48PM -0600, Jens Axboe wrote:=0A=
>>>>>>> On 8/7/18 2:12 PM, Anchal Agarwal wrote:=0A=
>>>>>>>> On Tue, Aug 07, 2018 at 08:29:44AM -0600, Jens Axboe wrote:=0A=
>>>>>>>>> On 8/1/18 4:09 PM, Jens Axboe wrote:=0A=
>>>>>>>>>> On 8/1/18 11:06 AM, Anchal Agarwal wrote:=0A=
>>>>>>>>>>> On Wed, Aug 01, 2018 at 09:14:50AM -0600, Jens Axboe wrote:=0A=
>>>>>>>>>>>> On 7/31/18 3:34 PM, Anchal Agarwal wrote:=0A=
>>>>>>>>>>>>> Hi folks,=0A=
>>>>>>>>>>>>>=0A=
>>>>>>>>>>>>> This patch modifies commit e34cbd307477a=0A=
>>>>>>>>>>>>> (blk-wbt: add general throttling mechanism)=0A=
>>>>>>>>>>>>>=0A=
>>>>>>>>>>>>> I am currently running a large bare metal instance (i3.metal)=
=0A=
>>>>>>>>>>>>> on EC2 with 72 cores, 512GB of RAM and NVME drives, with a=0A=
>>>>>>>>>>>>> 4.18 kernel. I have a workload that simulates a database=0A=
>>>>>>>>>>>>> workload and I am running into lockup issues when writeback=
=0A=
>>>>>>>>>>>>> throttling is enabled,with the hung task detector also=0A=
>>>>>>>>>>>>> kicking in.=0A=
>>>>>>>>>>>>>=0A=
>>>>>>>>>>>>> Crash dumps show that most CPUs (up to 50 of them) are=0A=
>>>>>>>>>>>>> all trying to get the wbt wait queue lock while trying to add=
=0A=
>>>>>>>>>>>>> themselves to it in __wbt_wait (see stack traces below).=0A=
>>>>>>>>>>>>>=0A=
>>>>>>>>>>>>> [    0.948118] CPU: 45 PID: 0 Comm: swapper/45 Not tainted 4.=
14.51-62.38.amzn1.x86_64 #1=0A=
>>>>>>>>>>>>> [    0.948119] Hardware name: Amazon EC2 i3.metal/Not Specifi=
ed, BIOS 1.0 10/16/2017=0A=
>>>>>>>>>>>>> [    0.948120] task: ffff883f7878c000 task.stack: ffffc9000c6=
9c000=0A=
>>>>>>>>>>>>> [    0.948124] RIP: 0010:native_queued_spin_lock_slowpath+0xf=
8/0x1a0=0A=
>>>>>>>>>>>>> [    0.948125] RSP: 0018:ffff883f7fcc3dc8 EFLAGS: 00000046=0A=
>>>>>>>>>>>>> [    0.948126] RAX: 0000000000000000 RBX: ffff887f7709ca68 RC=
X: ffff883f7fce2a00=0A=
>>>>>>>>>>>>> [    0.948128] RDX: 000000000000001c RSI: 0000000000740001 RD=
I: ffff887f7709ca68=0A=
>>>>>>>>>>>>> [    0.948129] RBP: 0000000000000002 R08: 0000000000b80000 R0=
9: 0000000000000000=0A=
>>>>>>>>>>>>> [    0.948130] R10: ffff883f7fcc3d78 R11: 000000000de27121 R1=
2: 0000000000000002=0A=
>>>>>>>>>>>>> [    0.948131] R13: 0000000000000003 R14: 0000000000000000 R1=
5: 0000000000000000=0A=
>>>>>>>>>>>>> [    0.948132] FS:  0000000000000000(0000) GS:ffff883f7fcc000=
0(0000) knlGS:0000000000000000=0A=
>>>>>>>>>>>>> [    0.948134] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050=
033=0A=
>>>>>>>>>>>>> [    0.948135] CR2: 000000c424c77000 CR3: 0000000002010005 CR=
4: 00000000003606e0=0A=
>>>>>>>>>>>>> [    0.948136] DR0: 0000000000000000 DR1: 0000000000000000 DR=
2: 0000000000000000=0A=
>>>>>>>>>>>>> [    0.948137] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR=
7: 0000000000000400=0A=
>>>>>>>>>>>>> [    0.948138] Call Trace:=0A=
>>>>>>>>>>>>> [    0.948139]  <IRQ>=0A=
>>>>>>>>>>>>> [    0.948142]  do_raw_spin_lock+0xad/0xc0=0A=
>>>>>>>>>>>>> [    0.948145]  _raw_spin_lock_irqsave+0x44/0x4b=0A=
>>>>>>>>>>>>> [    0.948149]  ? __wake_up_common_lock+0x53/0x90=0A=
>>>>>>>>>>>>> [    0.948150]  __wake_up_common_lock+0x53/0x90=0A=
>>>>>>>>>>>>> [    0.948155]  wbt_done+0x7b/0xa0=0A=
>>>>>>>>>>>>> [    0.948158]  blk_mq_free_request+0xb7/0x110=0A=
>>>>>>>>>>>>> [    0.948161]  __blk_mq_complete_request+0xcb/0x140=0A=
>>>>>>>>>>>>> [    0.948166]  nvme_process_cq+0xce/0x1a0 [nvme]=0A=
>>>>>>>>>>>>> [    0.948169]  nvme_irq+0x23/0x50 [nvme]=0A=
>>>>>>>>>>>>> [    0.948173]  __handle_irq_event_percpu+0x46/0x300=0A=
>>>>>>>>>>>>> [    0.948176]  handle_irq_event_percpu+0x20/0x50=0A=
>>>>>>>>>>>>> [    0.948179]  handle_irq_event+0x34/0x60=0A=
>>>>>>>>>>>>> [    0.948181]  handle_edge_irq+0x77/0x190=0A=
>>>>>>>>>>>>> [    0.948185]  handle_irq+0xaf/0x120=0A=
>>>>>>>>>>>>> [    0.948188]  do_IRQ+0x53/0x110=0A=
>>>>>>>>>>>>> [    0.948191]  common_interrupt+0x87/0x87=0A=
>>>>>>>>>>>>> [    0.948192]  </IRQ>=0A=
>>>>>>>>>>>>> ....=0A=
>>>>>>>>>>>>> [    0.311136] CPU: 4 PID: 9737 Comm: run_linux_amd64 Not tai=
nted 4.14.51-62.38.amzn1.x86_64 #1=0A=
>>>>>>>>>>>>> [    0.311137] Hardware name: Amazon EC2 i3.metal/Not Specifi=
ed, BIOS 1.0 10/16/2017=0A=
>>>>>>>>>>>>> [    0.311138] task: ffff883f6e6a8000 task.stack: ffffc9000f1=
ec000=0A=
>>>>>>>>>>>>> [    0.311141] RIP: 0010:native_queued_spin_lock_slowpath+0xf=
5/0x1a0=0A=
>>>>>>>>>>>>> [    0.311142] RSP: 0018:ffffc9000f1efa28 EFLAGS: 00000046=0A=
>>>>>>>>>>>>> [    0.311144] RAX: 0000000000000000 RBX: ffff887f7709ca68 RC=
X: ffff883f7f722a00=0A=
>>>>>>>>>>>>> [    0.311145] RDX: 0000000000000035 RSI: 0000000000d80001 RD=
I: ffff887f7709ca68=0A=
>>>>>>>>>>>>> [    0.311146] RBP: 0000000000000202 R08: 0000000000140000 R0=
9: 0000000000000000=0A=
>>>>>>>>>>>>> [    0.311147] R10: ffffc9000f1ef9d8 R11: 000000001a249fa0 R1=
2: ffff887f7709ca68=0A=
>>>>>>>>>>>>> [    0.311148] R13: ffffc9000f1efad0 R14: 0000000000000000 R1=
5: ffff887f7709ca00=0A=
>>>>>>>>>>>>> [    0.311149] FS:  000000c423f30090(0000) GS:ffff883f7f70000=
0(0000) knlGS:0000000000000000=0A=
>>>>>>>>>>>>> [    0.311150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050=
033=0A=
>>>>>>>>>>>>> [    0.311151] CR2: 00007feefcea4000 CR3: 0000007f7016e001 CR=
4: 00000000003606e0=0A=
>>>>>>>>>>>>> [    0.311152] DR0: 0000000000000000 DR1: 0000000000000000 DR=
2: 0000000000000000=0A=
>>>>>>>>>>>>> [    0.311153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR=
7: 0000000000000400=0A=
>>>>>>>>>>>>> [    0.311154] Call Trace:=0A=
>>>>>>>>>>>>> [    0.311157]  do_raw_spin_lock+0xad/0xc0=0A=
>>>>>>>>>>>>> [    0.311160]  _raw_spin_lock_irqsave+0x44/0x4b=0A=
>>>>>>>>>>>>> [    0.311162]  ? prepare_to_wait_exclusive+0x28/0xb0=0A=
>>>>>>>>>>>>> [    0.311164]  prepare_to_wait_exclusive+0x28/0xb0=0A=
>>>>>>>>>>>>> [    0.311167]  wbt_wait+0x127/0x330=0A=
>>>>>>>>>>>>> [    0.311169]  ? finish_wait+0x80/0x80=0A=
>>>>>>>>>>>>> [    0.311172]  ? generic_make_request+0xda/0x3b0=0A=
>>>>>>>>>>>>> [    0.311174]  blk_mq_make_request+0xd6/0x7b0=0A=
>>>>>>>>>>>>> [    0.311176]  ? blk_queue_enter+0x24/0x260=0A=
>>>>>>>>>>>>> [    0.311178]  ? generic_make_request+0xda/0x3b0=0A=
>>>>>>>>>>>>> [    0.311181]  generic_make_request+0x10c/0x3b0=0A=
>>>>>>>>>>>>> [    0.311183]  ? submit_bio+0x5c/0x110=0A=
>>>>>>>>>>>>> [    0.311185]  submit_bio+0x5c/0x110=0A=
>>>>>>>>>>>>> [    0.311197]  ? __ext4_journal_stop+0x36/0xa0 [ext4]=0A=
>>>>>>>>>>>>> [    0.311210]  ext4_io_submit+0x48/0x60 [ext4]=0A=
>>>>>>>>>>>>> [    0.311222]  ext4_writepages+0x810/0x11f0 [ext4]=0A=
>>>>>>>>>>>>> [    0.311229]  ? do_writepages+0x3c/0xd0=0A=
>>>>>>>>>>>>> [    0.311239]  ? ext4_mark_inode_dirty+0x260/0x260 [ext4]=0A=
>>>>>>>>>>>>> [    0.311240]  do_writepages+0x3c/0xd0=0A=
>>>>>>>>>>>>> [    0.311243]  ? _raw_spin_unlock+0x24/0x30=0A=
>>>>>>>>>>>>> [    0.311245]  ? wbc_attach_and_unlock_inode+0x165/0x280=0A=
>>>>>>>>>>>>> [    0.311248]  ? __filemap_fdatawrite_range+0xa3/0xe0=0A=
>>>>>>>>>>>>> [    0.311250]  __filemap_fdatawrite_range+0xa3/0xe0=0A=
>>>>>>>>>>>>> [    0.311253]  file_write_and_wait_range+0x34/0x90=0A=
>>>>>>>>>>>>> [    0.311264]  ext4_sync_file+0x151/0x500 [ext4]=0A=
>>>>>>>>>>>>> [    0.311267]  do_fsync+0x38/0x60=0A=
>>>>>>>>>>>>> [    0.311270]  SyS_fsync+0xc/0x10=0A=
>>>>>>>>>>>>> [    0.311272]  do_syscall_64+0x6f/0x170=0A=
>>>>>>>>>>>>> [    0.311274]  entry_SYSCALL_64_after_hwframe+0x42/0xb7=0A=
>>>>>>>>>>>>>=0A=
>>>>>>>>>>>>> In the original patch, wbt_done is waking up all the exclusiv=
e=0A=
>>>>>>>>>>>>> processes in the wait queue, which can cause a thundering her=
d=0A=
>>>>>>>>>>>>> if there is a large number of writer threads in the queue. Th=
e=0A=
>>>>>>>>>>>>> original intention of the code seems to be to wake up one thr=
ead=0A=
>>>>>>>>>>>>> only however, it uses wake_up_all() in __wbt_done(), and then=
=0A=
>>>>>>>>>>>>> uses the following check in __wbt_wait to have only one threa=
d=0A=
>>>>>>>>>>>>> actually get out of the wait loop:=0A=
>>>>>>>>>>>>>=0A=
>>>>>>>>>>>>> if (waitqueue_active(&rqw->wait) &&=0A=
>>>>>>>>>>>>>             rqw->wait.head.next !=3D &wait->entry)=0A=
>>>>>>>>>>>>>                 return false;=0A=
>>>>>>>>>>>>>=0A=
>>>>>>>>>>>>> The problem with this is that the wait entry in wbt_wait is=
=0A=
>>>>>>>>>>>>> define with DEFINE_WAIT, which uses the autoremove wakeup fun=
ction.=0A=
>>>>>>>>>>>>> That means that the above check is invalid - the wait entry w=
ill=0A=
>>>>>>>>>>>>> have been removed from the queue already by the time we hit t=
he=0A=
>>>>>>>>>>>>> check in the loop.=0A=
>>>>>>>>>>>>>=0A=
>>>>>>>>>>>>> Secondly, auto-removing the wait entries also means that the =
wait=0A=
>>>>>>>>>>>>> queue essentially gets reordered "randomly" (e.g. threads re-=
add=0A=
>>>>>>>>>>>>> themselves in the order they got to run after being woken up)=
.=0A=
>>>>>>>>>>>>> Additionally, new requests entering wbt_wait might overtake r=
equests=0A=
>>>>>>>>>>>>> that were queued earlier, because the wait queue will be=0A=
>>>>>>>>>>>>> (temporarily) empty after the wake_up_all, so the waitqueue_a=
ctive=0A=
>>>>>>>>>>>>> check will not stop them. This can cause certain threads to s=
tarve=0A=
>>>>>>>>>>>>> under high load.=0A=
>>>>>>>>>>>>>=0A=
>>>>>>>>>>>>> The fix is to leave the woken up requests in the queue and re=
move=0A=
>>>>>>>>>>>>> them in finish_wait() once the current thread breaks out of t=
he=0A=
>>>>>>>>>>>>> wait loop in __wbt_wait. This will ensure new requests always=
=0A=
>>>>>>>>>>>>> end up at the back of the queue, and they won't overtake requ=
ests=0A=
>>>>>>>>>>>>> that are already in the wait queue. With that change, the loo=
p=0A=
>>>>>>>>>>>>> in wbt_wait is also in line with many other wait loops in the=
 kernel.=0A=
>>>>>>>>>>>>> Waking up just one thread drastically reduces lock contention=
, as=0A=
>>>>>>>>>>>>> does moving the wait queue add/remove out of the loop.=0A=
>>>>>>>>>>>>>=0A=
>>>>>>>>>>>>> A significant drop in lockdep's lock contention numbers is se=
en when=0A=
>>>>>>>>>>>>> running the test application on the patched kernel.=0A=
>>>>>>>>>>>> I like the patch, and a few weeks ago we independently discove=
red that=0A=
>>>>>>>>>>>> the waitqueue list checking was bogus as well. My only worry i=
s that=0A=
>>>>>>>>>>>> changes like this can be delicate, meaning that it's easy to i=
ntroduce=0A=
>>>>>>>>>>>> stall conditions. What kind of testing did you push this throu=
gh?=0A=
>>>>>>>>>>>>=0A=
>>>>>>>>>>>> -- =0A=
>>>>>>>>>>>> Jens Axboe=0A=
>>>>>>>>>>>>=0A=
>>>>>>>>>>> I ran the following tests on both real HW with NVME devices att=
ached=0A=
>>>>>>>>>>> and emulated NVME too:=0A=
>>>>>>>>>>>=0A=
>>>>>>>>>>> 1. The test case I used to reproduce the issue, spawns a bunch =
of threads =0A=
>>>>>>>>>>>    to concurrently read and write files with random size and co=
ntent. =0A=
>>>>>>>>>>>    Files are randomly fsync'd. The implementation is a FIFO que=
ue of files. =0A=
>>>>>>>>>>>    When the queue fills the test starts to verify and remove th=
e files. This =0A=
>>>>>>>>>>>    test will fail if there's a read, write, or hash check failu=
re. It tests=0A=
>>>>>>>>>>>    for file corruption when lots of small files are being read =
and written =0A=
>>>>>>>>>>>    with high concurrency.=0A=
>>>>>>>>>>>=0A=
>>>>>>>>>>> 2. Fio for random writes with a root NVME device of 200GB=0A=
>>>>>>>>>>>   =0A=
>>>>>>>>>>>   fio --name=3Drandwrite --ioengine=3Dlibaio --iodepth=3D1 --rw=
=3Drandwrite --bs=3D4k =0A=
>>>>>>>>>>>   --direct=3D0 --size=3D10G --numjobs=3D2 --runtime=3D60 --grou=
p_reporting=0A=
>>>>>>>>>>>   =0A=
>>>>>>>>>>>   fio --name=3Drandwrite --ioengine=3Dlibaio --iodepth=3D1 --rw=
=3Drandwrite --bs=3D4k=0A=
>>>>>>>>>>>   --direct=3D0 --size=3D5G --numjobs=3D2 --runtime=3D30 --fsync=
=3D64 --group_reporting=0A=
>>>>>>>>>>>   =0A=
>>>>>>>>>>>   I did see an improvement in the bandwidth numbers reported on=
 the patched=0A=
>>>>>>>>>>>   kernel. =0A=
>>>>>>>>>>>=0A=
>>>>>>>>>>> Do you have any test case/suite in mind that you would suggest =
me to =0A=
>>>>>>>>>>> run to be sure that patch does not introduce any stall conditio=
ns?=0A=
>>>>>>>>>> One thing that is always useful is to run xfstest, do a full run=
 on=0A=
>>>>>>>>>> the device. If that works, then do another full run, this time l=
imiting=0A=
>>>>>>>>>> the queue depth of the SCSI device to 1. If both of those pass, =
then=0A=
>>>>>>>>>> I'd feel pretty good getting this applied for 4.19.=0A=
>>>>>>>>> Did you get a chance to run this full test?=0A=
>>>>>>>>>=0A=
>>>>>>>>> -- =0A=
>>>>>>>>> Jens Axboe=0A=
>>>>>>>>>=0A=
>>>>>>>>>=0A=
>>>>>>>> Hi Jens,=0A=
>>>>>>>> Yes I did run the tests and was in the process of compiling concre=
te results=0A=
>>>>>>>> I tested following environments against xfs/auto group=0A=
>>>>>>>> 1. Vanilla 4.18.rc kernel=0A=
>>>>>>>> 2. 4.18 kernel with the blk-wbt patch=0A=
>>>>>>>> 3. 4.18 kernel with the blk-wbt patch + io_queue_depth=3D2. I =0A=
>>>>>>>> understand you asked for queue depth for SCSI device=3D1 however, =
I have NVME =0A=
>>>>>>>> devices in my environment and 2 is the minimum value for io_queue_=
depth allowed =0A=
>>>>>>>> according to the NVME driver code. The results pretty much look sa=
me with no =0A=
>>>>>>>> stalls or exceptional failures. =0A=
>>>>>>>> xfs/auto ran 296 odd tests with 3 failures and 130 something "no r=
uns". =0A=
>>>>>>>> Remaining tests passed. "Skipped tests"  were mostly due to missin=
g features=0A=
>>>>>>>> (eg: reflink support on scratch filesystem)=0A=
>>>>>>>> The failures were consistent across runs on 3 different environmen=
ts. =0A=
>>>>>>>> I am also running full test suite but it is taking long time as I =
am =0A=
>>>>>>>> hitting kernel BUG in xfs code in some generic tests. This BUG is =
not =0A=
>>>>>>>> related to the patch and  I see them in vanilla kernel too. I am i=
n =0A=
>>>>>>>> the process of excluding these kind of tests as they come and =0A=
>>>>>>>> re-run the suite however, this proces is time taking. =0A=
>>>>>>>> Do you have any specific tests in mind that you would like me =0A=
>>>>>>>> to run apart from what I have already tested above?=0A=
>>>>>>> Thanks, I think that looks good. I'll get your patch applied for=0A=
>>>>>>> 4.19.=0A=
>>>>>>>=0A=
>>>>>>> -- =0A=
>>>>>>> Jens Axboe=0A=
>>>>>>>=0A=
>>>>>>>=0A=
>>>>>> Hi Jens,=0A=
>>>>>> Thanks for accepting this. There is one small issue, I don't find an=
y emails=0A=
>>>>>> send by me on the lkml mailing list. I am not sure why it didn't lan=
d there,=0A=
>>>>>> all I can see is your responses. Do you want one of us to resend the=
 patch=0A=
>>>>>> or will you be able to do it?=0A=
>>>>> That's odd, are you getting rejections on your emails? For reference,=
 the=0A=
>>>>> patch is here:=0A=
>>>>>=0A=
>>>>> http://git.kernel.dk/cgit/linux-block/commit/?h=3Dfor-4.19/block&id=
=3D2887e41b910bb14fd847cf01ab7a5993db989d88=0A=
>>>> One issue with this, as far as I can tell. Right now we've switched to=
=0A=
>>>> waking one task at the time, which is obviously more efficient. But if=
=0A=
>>>> we do that with exclusive waits, then we have to ensure that this task=
=0A=
>>>> makes progress. If we wake up a task, and then fail to get a queueing=
=0A=
>>>> token, then we'll go back to sleep. We need to ensure that someone mak=
es=0A=
>>>> forward progress at this point. There are two ways I can see that=0A=
>>>> happening:=0A=
>>>>=0A=
>>>> 1) The task woken _always_ gets to queue an IO=0A=
>>>> 2) If the task woken is NOT allowed to queue an IO, then it must selec=
t=0A=
>>>>    a new task to wake up. That new task is then subjected to rule 1 or=
 2=0A=
>>>>    as well.=0A=
>>>>=0A=
>>>> For #1, it could be as simple as:=0A=
>>>>=0A=
>>>> if (slept || !rwb_enabled(rwb)) {=0A=
>>>> 	atomic_inc(&rqw->inflight);=0A=
>>>> 	break;=0A=
>>>> }=0A=
>>>>=0A=
>>>> but this obviously won't always be fair. Might be good enough however,=
=0A=
>>>> instead of having to eg replace the generic wait queues with a priorit=
y=0A=
>>>> list/queue.=0A=
>>>>=0A=
>>>> Note that this isn't an entirely new issue, it's just so much easier t=
o=0A=
>>>> hit with the single wakeups.=0A=
>>>>=0A=
>>> Hi Jens,
>>>
>>> What is the scenario that you see under which the woken up task does not
>>> get to run?
>> That scenario is pretty easy to hit - let's say the next in line task
>> has a queue limit of 1, and we currently have 4 pending. Task gets
>> woken, goes back to sleep. Which should be totally fine. At some point
>> we'll get below the limit, and allow the task to proceed. This will
>> ensure forward progress.
>>
>>> The theory behind leaving the task on the wait queue is that the
>>> waitqueue_active check in wbt_wait prevents new tasks from taking up a
>>> slot in the queue (e.g. incrementing inflight). So, there should not be
>>> a way for inflight to be incremented between the time the wake_up is
>>> done and the task at the head of the wait queue runs. That's the idea
>>> anyway :-) If we missed something, let us know.
>> And that's a fine theory, I think it's a good improvement (and how it
>> should have worked). I'm struggling to see where the issue is. Perhaps
>> it's related to the wq active check. With fewer wakeups, we're more
>> likely to hit a race there.
>>
>> I'll poke at it...
> Trying something like this:
>
> http://git.kernel.dk/cgit/linux-block/log/?h=for-4.19/wbt
>
Ah, now I see what you mean.

This is the case where a task goes to sleep, not because the inflight
limit has been reached, but simply because it needs to go to the back of
the wait queue.

In that case, it should, for its first time inside the loop, not try to
increment inflight - since that means it could still race to overtake a
task that got there earlier and is in the wait queue.

So what you are doing is keeping track of whether it got into the loop
only because of queueing, and then you don't try to increment inflight
the first time around the loop.

I think that should work to fix that corner case.
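
If I'm reading it right, the core of the loop then becomes something like
this (my paraphrase of the idea, not the actual commits; has_sleeper is
just an illustrative name):

	bool has_sleeper = wq_has_sleeper(&rqw->wait);

	/* only take a slot up front if nobody was queued ahead of us */
	if (!has_sleeper &&
	    atomic_inc_below(&rqw->inflight, get_limit(rwb, rw)))
		return;

	add_wait_queue_exclusive(&rqw->wait, &wait);
	do {
		set_current_state(TASK_UNINTERRUPTIBLE);

		if (!has_sleeper &&
		    atomic_inc_below(&rqw->inflight, get_limit(rwb, rw)))
			break;

		io_schedule();

		/* we have slept once, so we are at the head and may now compete */
		has_sleeper = false;
	} while (1);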

Frank


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait
@ 2018-08-20 20:19                           ` van der Linden, Frank
  0 siblings, 0 replies; 38+ messages in thread
From: van der Linden, Frank @ 2018-08-20 20:19 UTC (permalink / raw)
  To: Jens Axboe, Agarwal, Anchal
  Cc: linux-block, linux-kernel, Singh, Balbir, Wilson, Matt

On 8/20/18 12:29 PM, Jens Axboe wrote:
> On 8/20/18 1:08 PM, Jens Axboe wrote:
>> On 8/20/18 11:34 AM, van der Linden, Frank wrote:
>>> On 8/20/18 9:37 AM, Jens Axboe wrote:
>>>> On 8/7/18 3:19 PM, Jens Axboe wrote:
>>>>> On 8/7/18 3:12 PM, Anchal Agarwal wrote:
>>>>>> On Tue, Aug 07, 2018 at 02:39:48PM -0600, Jens Axboe wrote:
>>>>>>> On 8/7/18 2:12 PM, Anchal Agarwal wrote:
>>>>>>>> On Tue, Aug 07, 2018 at 08:29:44AM -0600, Jens Axboe wrote:
>>>>>>>>> On 8/1/18 4:09 PM, Jens Axboe wrote:
>>>>>>>>>> On 8/1/18 11:06 AM, Anchal Agarwal wrote:
>>>>>>>>>>> On Wed, Aug 01, 2018 at 09:14:50AM -0600, Jens Axboe wrote:
>>>>>>>>>>>> On 7/31/18 3:34 PM, Anchal Agarwal wrote:
>>>>>>>>>>>>> Hi folks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> This patch modifies commit e34cbd307477a
>>>>>>>>>>>>> (blk-wbt: add general throttling mechanism)
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am currently running a large bare metal instance (i3.metal)
>>>>>>>>>>>>> on EC2 with 72 cores, 512GB of RAM and NVME drives, with a
>>>>>>>>>>>>> 4.18 kernel. I have a workload that simulates a database
>>>>>>>>>>>>> workload and I am running into lockup issues when writeback
>>>>>>>>>>>>> throttling is enabled,with the hung task detector also
>>>>>>>>>>>>> kicking in.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Crash dumps show that most CPUs (up to 50 of them) are
>>>>>>>>>>>>> all trying to get the wbt wait queue lock while trying to add
>>>>>>>>>>>>> themselves to it in __wbt_wait (see stack traces below).
>>>>>>>>>>>>>
>>>>>>>>>>>>> [    0.948118] CPU: 45 PID: 0 Comm: swapper/45 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
>>>>>>>>>>>>> [    0.948119] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
>>>>>>>>>>>>> [    0.948120] task: ffff883f7878c000 task.stack: ffffc9000c69c000
>>>>>>>>>>>>> [    0.948124] RIP: 0010:native_queued_spin_lock_slowpath+0xf8/0x1a0
>>>>>>>>>>>>> [    0.948125] RSP: 0018:ffff883f7fcc3dc8 EFLAGS: 00000046
>>>>>>>>>>>>> [    0.948126] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7fce2a00
>>>>>>>>>>>>> [    0.948128] RDX: 000000000000001c RSI: 0000000000740001 RDI: ffff887f7709ca68
>>>>>>>>>>>>> [    0.948129] RBP: 0000000000000002 R08: 0000000000b80000 R09: 0000000000000000
>>>>>>>>>>>>> [    0.948130] R10: ffff883f7fcc3d78 R11: 000000000de27121 R12: 0000000000000002
>>>>>>>>>>>>> [    0.948131] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
>>>>>>>>>>>>> [    0.948132] FS:  0000000000000000(0000) GS:ffff883f7fcc0000(0000) knlGS:0000000000000000
>>>>>>>>>>>>> [    0.948134] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>>>>>>>>> [    0.948135] CR2: 000000c424c77000 CR3: 0000000002010005 CR4: 00000000003606e0
>>>>>>>>>>>>> [    0.948136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>>>>>>>>>> [    0.948137] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>>>>>>>>>>> [    0.948138] Call Trace:
>>>>>>>>>>>>> [    0.948139]  <IRQ>
>>>>>>>>>>>>> [    0.948142]  do_raw_spin_lock+0xad/0xc0
>>>>>>>>>>>>> [    0.948145]  _raw_spin_lock_irqsave+0x44/0x4b
>>>>>>>>>>>>> [    0.948149]  ? __wake_up_common_lock+0x53/0x90
>>>>>>>>>>>>> [    0.948150]  __wake_up_common_lock+0x53/0x90
>>>>>>>>>>>>> [    0.948155]  wbt_done+0x7b/0xa0
>>>>>>>>>>>>> [    0.948158]  blk_mq_free_request+0xb7/0x110
>>>>>>>>>>>>> [    0.948161]  __blk_mq_complete_request+0xcb/0x140
>>>>>>>>>>>>> [    0.948166]  nvme_process_cq+0xce/0x1a0 [nvme]
>>>>>>>>>>>>> [    0.948169]  nvme_irq+0x23/0x50 [nvme]
>>>>>>>>>>>>> [    0.948173]  __handle_irq_event_percpu+0x46/0x300
>>>>>>>>>>>>> [    0.948176]  handle_irq_event_percpu+0x20/0x50
>>>>>>>>>>>>> [    0.948179]  handle_irq_event+0x34/0x60
>>>>>>>>>>>>> [    0.948181]  handle_edge_irq+0x77/0x190
>>>>>>>>>>>>> [    0.948185]  handle_irq+0xaf/0x120
>>>>>>>>>>>>> [    0.948188]  do_IRQ+0x53/0x110
>>>>>>>>>>>>> [    0.948191]  common_interrupt+0x87/0x87
>>>>>>>>>>>>> [    0.948192]  </IRQ>
>>>>>>>>>>>>> ....
>>>>>>>>>>>>> [    0.311136] CPU: 4 PID: 9737 Comm: run_linux_amd64 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
>>>>>>>>>>>>> [    0.311137] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
>>>>>>>>>>>>> [    0.311138] task: ffff883f6e6a8000 task.stack: ffffc9000f1ec000
>>>>>>>>>>>>> [    0.311141] RIP: 0010:native_queued_spin_lock_slowpath+0xf5/0x1a0
>>>>>>>>>>>>> [    0.311142] RSP: 0018:ffffc9000f1efa28 EFLAGS: 00000046
>>>>>>>>>>>>> [    0.311144] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7f722a00
>>>>>>>>>>>>> [    0.311145] RDX: 0000000000000035 RSI: 0000000000d80001 RDI: ffff887f7709ca68
>>>>>>>>>>>>> [    0.311146] RBP: 0000000000000202 R08: 0000000000140000 R09: 0000000000000000
>>>>>>>>>>>>> [    0.311147] R10: ffffc9000f1ef9d8 R11: 000000001a249fa0 R12: ffff887f7709ca68
>>>>>>>>>>>>> [    0.311148] R13: ffffc9000f1efad0 R14: 0000000000000000 R15: ffff887f7709ca00
>>>>>>>>>>>>> [    0.311149] FS:  000000c423f30090(0000) GS:ffff883f7f700000(0000) knlGS:0000000000000000
>>>>>>>>>>>>> [    0.311150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>>>>>>>>> [    0.311151] CR2: 00007feefcea4000 CR3: 0000007f7016e001 CR4: 00000000003606e0
>>>>>>>>>>>>> [    0.311152] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>>>>>>>>>> [    0.311153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>>>>>>>>>>> [    0.311154] Call Trace:
>>>>>>>>>>>>> [    0.311157]  do_raw_spin_lock+0xad/0xc0
>>>>>>>>>>>>> [    0.311160]  _raw_spin_lock_irqsave+0x44/0x4b
>>>>>>>>>>>>> [    0.311162]  ? prepare_to_wait_exclusive+0x28/0xb0
>>>>>>>>>>>>> [    0.311164]  prepare_to_wait_exclusive+0x28/0xb0
>>>>>>>>>>>>> [    0.311167]  wbt_wait+0x127/0x330
>>>>>>>>>>>>> [    0.311169]  ? finish_wait+0x80/0x80
>>>>>>>>>>>>> [    0.311172]  ? generic_make_request+0xda/0x3b0
>>>>>>>>>>>>> [    0.311174]  blk_mq_make_request+0xd6/0x7b0
>>>>>>>>>>>>> [    0.311176]  ? blk_queue_enter+0x24/0x260
>>>>>>>>>>>>> [    0.311178]  ? generic_make_request+0xda/0x3b0
>>>>>>>>>>>>> [    0.311181]  generic_make_request+0x10c/0x3b0
>>>>>>>>>>>>> [    0.311183]  ? submit_bio+0x5c/0x110
>>>>>>>>>>>>> [    0.311185]  submit_bio+0x5c/0x110
>>>>>>>>>>>>> [    0.311197]  ? __ext4_journal_stop+0x36/0xa0 [ext4]
>>>>>>>>>>>>> [    0.311210]  ext4_io_submit+0x48/0x60 [ext4]
>>>>>>>>>>>>> [    0.311222]  ext4_writepages+0x810/0x11f0 [ext4]
>>>>>>>>>>>>> [    0.311229]  ? do_writepages+0x3c/0xd0
>>>>>>>>>>>>> [    0.311239]  ? ext4_mark_inode_dirty+0x260/0x260 [ext4]
>>>>>>>>>>>>> [    0.311240]  do_writepages+0x3c/0xd0
>>>>>>>>>>>>> [    0.311243]  ? _raw_spin_unlock+0x24/0x30
>>>>>>>>>>>>> [    0.311245]  ? wbc_attach_and_unlock_inode+0x165/0x280
>>>>>>>>>>>>> [    0.311248]  ? __filemap_fdatawrite_range+0xa3/0xe0
>>>>>>>>>>>>> [    0.311250]  __filemap_fdatawrite_range+0xa3/0xe0
>>>>>>>>>>>>> [    0.311253]  file_write_and_wait_range+0x34/0x90
>>>>>>>>>>>>> [    0.311264]  ext4_sync_file+0x151/0x500 [ext4]
>>>>>>>>>>>>> [    0.311267]  do_fsync+0x38/0x60
>>>>>>>>>>>>> [    0.311270]  SyS_fsync+0xc/0x10
>>>>>>>>>>>>> [    0.311272]  do_syscall_64+0x6f/0x170
>>>>>>>>>>>>> [    0.311274]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
>>>>>>>>>>>>>
>>>>>>>>>>>>> In the original patch, wbt_done is waking up all the exclusive
>>>>>>>>>>>>> processes in the wait queue, which can cause a thundering herd
>>>>>>>>>>>>> if there is a large number of writer threads in the queue. The
>>>>>>>>>>>>> original intention of the code seems to be to wake up one thread
>>>>>>>>>>>>> only; however, it uses wake_up_all() in __wbt_done(), and then
>>>>>>>>>>>>> uses the following check in __wbt_wait to have only one thread
>>>>>>>>>>>>> actually get out of the wait loop:
>>>>>>>>>>>>>
>>>>>>>>>>>>> if (waitqueue_active(&rqw->wait) &&
>>>>>>>>>>>>>             rqw->wait.head.next != &wait->entry)
>>>>>>>>>>>>>                 return false;
>>>>>>>>>>>>>
>>>>>>>>>>>>> The problem with this is that the wait entry in wbt_wait is
>>>>>>>>>>>>> defined with DEFINE_WAIT, which uses the autoremove wakeup function.
>>>>>>>>>>>>> That means that the above check is invalid - the wait entry will
>>>>>>>>>>>>> have been removed from the queue already by the time we hit the
>>>>>>>>>>>>> check in the loop.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Secondly, auto-removing the wait entries also means that the wait
>>>>>>>>>>>>> queue essentially gets reordered "randomly" (e.g. threads re-add
>>>>>>>>>>>>> themselves in the order they got to run after being woken up).
>>>>>>>>>>>>> Additionally, new requests entering wbt_wait might overtake requests
>>>>>>>>>>>>> that were queued earlier, because the wait queue will be
>>>>>>>>>>>>> (temporarily) empty after the wake_up_all, so the waitqueue_active
>>>>>>>>>>>>> check will not stop them. This can cause certain threads to starve
>>>>>>>>>>>>> under high load.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The fix is to leave the woken up requests in the queue and remove
>>>>>>>>>>>>> them in finish_wait() once the current thread breaks out of the
>>>>>>>>>>>>> wait loop in __wbt_wait. This will ensure new requests always
>>>>>>>>>>>>> end up at the back of the queue, and they won't overtake requests
>>>>>>>>>>>>> that are already in the wait queue. With that change, the loop
>>>>>>>>>>>>> in wbt_wait is also in line with many other wait loops in the kernel.
>>>>>>>>>>>>> Waking up just one thread drastically reduces lock contention, as
>>>>>>>>>>>>> does moving the wait queue add/remove out of the loop.
>>>>>>>>>>>>>
>>>>>>>>>>>>> A significant drop in lockdep's lock contention numbers is seen when
>>>>>>>>>>>>> running the test application on the patched kernel.
>>>>>>>>>>>> I like the patch, and a few weeks ago we independently discovered that
>>>>>>>>>>>> the waitqueue list checking was bogus as well. My only worry is that
>>>>>>>>>>>> changes like this can be delicate, meaning that it's easy to introduce
>>>>>>>>>>>> stall conditions. What kind of testing did you push this through?
>>>>>>>>>>>>
>>>>>>>>>>>> -- 
>>>>>>>>>>>> Jens Axboe
>>>>>>>>>>>>
>>>>>>>>>>> I ran the following tests on both real HW with NVME devices attached
>>>>>>>>>>> and emulated NVME too:
>>>>>>>>>>>
>>>>>>>>>>> 1. The test case I used to reproduce the issue, spawns a bunch of threads 
>>>>>>>>>>>    to concurrently read and write files with random size and content. 
>>>>>>>>>>>    Files are randomly fsync'd. The implementation is a FIFO queue of files. 
>>>>>>>>>>>    When the queue fills the test starts to verify and remove the files. This 
>>>>>>>>>>>    test will fail if there's a read, write, or hash check failure. It tests
>>>>>>>>>>>    for file corruption when lots of small files are being read and written 
>>>>>>>>>>>    with high concurrency.
>>>>>>>>>>>
>>>>>>>>>>> 2. Fio for random writes with a root NVME device of 200GB
>>>>>>>>>>>   
>>>>>>>>>>>   fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite --bs=4k 
>>>>>>>>>>>   --direct=0 --size=10G --numjobs=2 --runtime=60 --group_reporting
>>>>>>>>>>>   
>>>>>>>>>>>   fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite --bs=4k
>>>>>>>>>>>   --direct=0 --size=5G --numjobs=2 --runtime=30 --fsync=64 --group_reporting
>>>>>>>>>>>   
>>>>>>>>>>>   I did see an improvement in the bandwidth numbers reported on the patched
>>>>>>>>>>>   kernel. 
>>>>>>>>>>>
>>>>>>>>>>> Do you have any test case/suite in mind that you would suggest I
>>>>>>>>>>> run to be sure that the patch does not introduce any stall conditions?
>>>>>>>>>> One thing that is always useful is to run xfstest, do a full run on
>>>>>>>>>> the device. If that works, then do another full run, this time limiting
>>>>>>>>>> the queue depth of the SCSI device to 1. If both of those pass, then
>>>>>>>>>> I'd feel pretty good getting this applied for 4.19.
>>>>>>>>> Did you get a chance to run this full test?
>>>>>>>>>
>>>>>>>>> -- 
>>>>>>>>> Jens Axboe
>>>>>>>>>
>>>>>>>>>
>>>>>>>> Hi Jens,
>>>>>>>> Yes, I did run the tests and was in the process of compiling concrete results.
>>>>>>>> I tested the following environments against the xfs/auto group:
>>>>>>>> 1. Vanilla 4.18-rc kernel
>>>>>>>> 2. 4.18 kernel with the blk-wbt patch
>>>>>>>> 3. 4.18 kernel with the blk-wbt patch + io_queue_depth=2. I
>>>>>>>> understand you asked for a queue depth of 1 for the SCSI device; however, I have NVMe
>>>>>>>> devices in my environment and 2 is the minimum value for io_queue_depth allowed
>>>>>>>> according to the NVMe driver code. The results look pretty much the same, with no
>>>>>>>> stalls or exceptional failures.
>>>>>>>> xfs/auto ran 296-odd tests with 3 failures and 130-something "no runs".
>>>>>>>> The remaining tests passed. "Skipped tests" were mostly due to missing features
>>>>>>>> (e.g. reflink support on the scratch filesystem).
>>>>>>>> The failures were consistent across runs on 3 different environments.
>>>>>>>> I am also running the full test suite, but it is taking a long time as I am
>>>>>>>> hitting a kernel BUG in xfs code in some generic tests. This BUG is not
>>>>>>>> related to the patch and I see it in the vanilla kernel too. I am
>>>>>>>> excluding these kinds of tests as they come up and re-running the
>>>>>>>> suite; however, this process is time consuming.
>>>>>>>> Do you have any specific tests in mind that you would like me 
>>>>>>>> to run apart from what I have already tested above?
>>>>>>> Thanks, I think that looks good. I'll get your patch applied for
>>>>>>> 4.19.
>>>>>>>
>>>>>>> -- 
>>>>>>> Jens Axboe
>>>>>>>
>>>>>>>
>>>>>> Hi Jens,
>>>>>> Thanks for accepting this. There is one small issue: I don't find any emails
>>>>>> sent by me on the lkml mailing list. I am not sure why they didn't land there;
>>>>>> all I can see are your responses. Do you want one of us to resend the patch
>>>>>> or will you be able to do it?
>>>>> That's odd, are you getting rejections on your emails? For reference, the
>>>>> patch is here:
>>>>>
>>>>> http://git.kernel.dk/cgit/linux-block/commit/?h=for-4.19/block&id=2887e41b910bb14fd847cf01ab7a5993db989d88
>>>> One issue with this, as far as I can tell. Right now we've switched to
>>>> waking one task at a time, which is obviously more efficient. But if
>>>> we do that with exclusive waits, then we have to ensure that this task
>>>> makes progress. If we wake up a task, and then fail to get a queueing
>>>> token, then we'll go back to sleep. We need to ensure that someone makes
>>>> forward progress at this point. There are two ways I can see that
>>>> happening:
>>>>
>>>> 1) The task woken _always_ gets to queue an IO
>>>> 2) If the task woken is NOT allowed to queue an IO, then it must select
>>>>    a new task to wake up. That new task is then subjected to rule 1 or 2
>>>>    as well.
>>>>
>>>> For #1, it could be as simple as:
>>>>
>>>> if (slept || !rwb_enabled(rwb)) {
>>>> 	atomic_inc(&rqw->inflight);
>>>> 	break;
>>>> }
>>>>
>>>> but this obviously won't always be fair. Might be good enough however,
>>>> instead of having to eg replace the generic wait queues with a priority
>>>> list/queue.
>>>>
>>>> Note that this isn't an entirely new issue, it's just so much easier to
>>>> hit with the single wakeups.
>>>>
>>> Hi Jens,
>>>
>>> What is the scenario that you see under which the woken up task does not
>>> get to run?
>> That scenario is pretty easy to hit - let's say the next in line task
>> has a queue limit of 1, and we currently have 4 pending. Task gets
>> woken, goes back to sleep. Which should be totally fine. At some point
>> we'll get below the limit, and allow the task to proceed. This will
>> ensure forward progress.
>>
>>> The theory behind leaving the task on the wait queue is that the
>>> waitqueue_active check in wbt_wait prevents new tasks from taking up a
>>> slot in the queue (e.g. incrementing inflight). So, there should not be
>>> a way for inflight to be incremented between the time the wake_up is
>>> done and the task at the head of the wait queue runs. That's the idea
>>> anyway :-) If we missed something, let us know.
>> And that's a fine theory, I think it's a good improvement (and how it
>> should have worked). I'm struggling to see where the issue is. Perhaps
>> it's related to the wq active check. With fewer wakeups, we're more
>> likely to hit a race there.
>>
>> I'll poke at it...
> Trying something like this:
>
> http://git.kernel.dk/cgit/linux-block/log/?h=for-4.19/wbt
>
Ah, now I see what you mean.

This is the case where a task goes to sleep, not because the inflight
limit has been reached, but simply because it needs to go to the back of
the wait queue.

In that case, it should, for its first time inside the loop, not try to
grab an inflight slot - since doing so means it could still race to
overtake a task that got there earlier and is in the wait queue.

So what you are doing is keeping track of whether it got into the loop
only because of queueing, and then not trying to grab a slot the first
time around the loop.

I think that should work to fix that corner case.
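
Something along these lines, I assume (just a sketch to check my reading
of the branch - helper names like atomic_inc_below()/get_limit() are
approximate, not copied from your tree):

	DECLARE_WAITQUEUE(wait, current);
	bool queued;

	/* Are we sleeping only because others are queued ahead of us? */
	queued = waitqueue_active(&rqw->wait);

	add_wait_queue_exclusive(&rqw->wait, &wait);
	do {
		set_current_state(TASK_UNINTERRUPTIBLE);
		/*
		 * On the first pass, if we slept only because of queueing,
		 * don't try to take an inflight slot yet - that could let
		 * us overtake a task already waiting in the queue.
		 */
		if (!queued &&
		    atomic_inc_below(&rqw->inflight, get_limit(rwb, rw)))
			break;

		io_schedule();
		queued = false;
	} while (1);

	__set_current_state(TASK_RUNNING);
	remove_wait_queue(&rqw->wait, &wait);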

Frank


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait
  2018-08-20 20:19                           ` van der Linden, Frank
  (?)
@ 2018-08-20 20:20                           ` Jens Axboe
  2018-08-20 22:42                             ` Balbir Singh
  -1 siblings, 1 reply; 38+ messages in thread
From: Jens Axboe @ 2018-08-20 20:20 UTC (permalink / raw)
  To: van der Linden, Frank, Agarwal, Anchal
  Cc: linux-block, linux-kernel, Singh, Balbir, Wilson, Matt

On 8/20/18 2:19 PM, van der Linden, Frank wrote:
> On 8/20/18 12:29 PM, Jens Axboe wrote:
>> On 8/20/18 1:08 PM, Jens Axboe wrote:
>>> On 8/20/18 11:34 AM, van der Linden, Frank wrote:
>>>> On 8/20/18 9:37 AM, Jens Axboe wrote:
>>>>> On 8/7/18 3:19 PM, Jens Axboe wrote:
>>>>>> On 8/7/18 3:12 PM, Anchal Agarwal wrote:
>>>>>>> On Tue, Aug 07, 2018 at 02:39:48PM -0600, Jens Axboe wrote:
>>>>>>>> On 8/7/18 2:12 PM, Anchal Agarwal wrote:
>>>>>>>>> On Tue, Aug 07, 2018 at 08:29:44AM -0600, Jens Axboe wrote:
>>>>>>>>>> On 8/1/18 4:09 PM, Jens Axboe wrote:
>>>>>>>>>>> On 8/1/18 11:06 AM, Anchal Agarwal wrote:
>>>>>>>>>>>> On Wed, Aug 01, 2018 at 09:14:50AM -0600, Jens Axboe wrote:
>>>>>>>>>>>>> On 7/31/18 3:34 PM, Anchal Agarwal wrote:
>>>>>>>>>>>>>> Hi folks,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This patch modifies commit e34cbd307477a
>>>>>>>>>>>>>> (blk-wbt: add general throttling mechanism)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am currently running a large bare metal instance (i3.metal)
>>>>>>>>>>>>>> on EC2 with 72 cores, 512GB of RAM and NVME drives, with a
>>>>>>>>>>>>>> 4.18 kernel. I have a workload that simulates a database
>>>>>>>>>>>>>> workload and I am running into lockup issues when writeback
>>>>>>>>>>>>>> throttling is enabled, with the hung task detector also
>>>>>>>>>>>>>> kicking in.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Crash dumps show that most CPUs (up to 50 of them) are
>>>>>>>>>>>>>> all trying to get the wbt wait queue lock while trying to add
>>>>>>>>>>>>>> themselves to it in __wbt_wait (see stack traces below).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [    0.948118] CPU: 45 PID: 0 Comm: swapper/45 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
>>>>>>>>>>>>>> [    0.948119] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
>>>>>>>>>>>>>> [    0.948120] task: ffff883f7878c000 task.stack: ffffc9000c69c000
>>>>>>>>>>>>>> [    0.948124] RIP: 0010:native_queued_spin_lock_slowpath+0xf8/0x1a0
>>>>>>>>>>>>>> [    0.948125] RSP: 0018:ffff883f7fcc3dc8 EFLAGS: 00000046
>>>>>>>>>>>>>> [    0.948126] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7fce2a00
>>>>>>>>>>>>>> [    0.948128] RDX: 000000000000001c RSI: 0000000000740001 RDI: ffff887f7709ca68
>>>>>>>>>>>>>> [    0.948129] RBP: 0000000000000002 R08: 0000000000b80000 R09: 0000000000000000
>>>>>>>>>>>>>> [    0.948130] R10: ffff883f7fcc3d78 R11: 000000000de27121 R12: 0000000000000002
>>>>>>>>>>>>>> [    0.948131] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
>>>>>>>>>>>>>> [    0.948132] FS:  0000000000000000(0000) GS:ffff883f7fcc0000(0000) knlGS:0000000000000000
>>>>>>>>>>>>>> [    0.948134] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>>>>>>>>>> [    0.948135] CR2: 000000c424c77000 CR3: 0000000002010005 CR4: 00000000003606e0
>>>>>>>>>>>>>> [    0.948136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>>>>>>>>>>> [    0.948137] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>>>>>>>>>>>> [    0.948138] Call Trace:
>>>>>>>>>>>>>> [    0.948139]  <IRQ>
>>>>>>>>>>>>>> [    0.948142]  do_raw_spin_lock+0xad/0xc0
>>>>>>>>>>>>>> [    0.948145]  _raw_spin_lock_irqsave+0x44/0x4b
>>>>>>>>>>>>>> [    0.948149]  ? __wake_up_common_lock+0x53/0x90
>>>>>>>>>>>>>> [    0.948150]  __wake_up_common_lock+0x53/0x90
>>>>>>>>>>>>>> [    0.948155]  wbt_done+0x7b/0xa0
>>>>>>>>>>>>>> [    0.948158]  blk_mq_free_request+0xb7/0x110
>>>>>>>>>>>>>> [    0.948161]  __blk_mq_complete_request+0xcb/0x140
>>>>>>>>>>>>>> [    0.948166]  nvme_process_cq+0xce/0x1a0 [nvme]
>>>>>>>>>>>>>> [    0.948169]  nvme_irq+0x23/0x50 [nvme]
>>>>>>>>>>>>>> [    0.948173]  __handle_irq_event_percpu+0x46/0x300
>>>>>>>>>>>>>> [    0.948176]  handle_irq_event_percpu+0x20/0x50
>>>>>>>>>>>>>> [    0.948179]  handle_irq_event+0x34/0x60
>>>>>>>>>>>>>> [    0.948181]  handle_edge_irq+0x77/0x190
>>>>>>>>>>>>>> [    0.948185]  handle_irq+0xaf/0x120
>>>>>>>>>>>>>> [    0.948188]  do_IRQ+0x53/0x110
>>>>>>>>>>>>>> [    0.948191]  common_interrupt+0x87/0x87
>>>>>>>>>>>>>> [    0.948192]  </IRQ>
>>>>>>>>>>>>>> ....
>>>>>>>>>>>>>> [    0.311136] CPU: 4 PID: 9737 Comm: run_linux_amd64 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
>>>>>>>>>>>>>> [    0.311137] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
>>>>>>>>>>>>>> [    0.311138] task: ffff883f6e6a8000 task.stack: ffffc9000f1ec000
>>>>>>>>>>>>>> [    0.311141] RIP: 0010:native_queued_spin_lock_slowpath+0xf5/0x1a0
>>>>>>>>>>>>>> [    0.311142] RSP: 0018:ffffc9000f1efa28 EFLAGS: 00000046
>>>>>>>>>>>>>> [    0.311144] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7f722a00
>>>>>>>>>>>>>> [    0.311145] RDX: 0000000000000035 RSI: 0000000000d80001 RDI: ffff887f7709ca68
>>>>>>>>>>>>>> [    0.311146] RBP: 0000000000000202 R08: 0000000000140000 R09: 0000000000000000
>>>>>>>>>>>>>> [    0.311147] R10: ffffc9000f1ef9d8 R11: 000000001a249fa0 R12: ffff887f7709ca68
>>>>>>>>>>>>>> [    0.311148] R13: ffffc9000f1efad0 R14: 0000000000000000 R15: ffff887f7709ca00
>>>>>>>>>>>>>> [    0.311149] FS:  000000c423f30090(0000) GS:ffff883f7f700000(0000) knlGS:0000000000000000
>>>>>>>>>>>>>> [    0.311150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>>>>>>>>>> [    0.311151] CR2: 00007feefcea4000 CR3: 0000007f7016e001 CR4: 00000000003606e0
>>>>>>>>>>>>>> [    0.311152] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>>>>>>>>>>> [    0.311153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>>>>>>>>>>>> [    0.311154] Call Trace:
>>>>>>>>>>>>>> [    0.311157]  do_raw_spin_lock+0xad/0xc0
>>>>>>>>>>>>>> [    0.311160]  _raw_spin_lock_irqsave+0x44/0x4b
>>>>>>>>>>>>>> [    0.311162]  ? prepare_to_wait_exclusive+0x28/0xb0
>>>>>>>>>>>>>> [    0.311164]  prepare_to_wait_exclusive+0x28/0xb0
>>>>>>>>>>>>>> [    0.311167]  wbt_wait+0x127/0x330
>>>>>>>>>>>>>> [    0.311169]  ? finish_wait+0x80/0x80
>>>>>>>>>>>>>> [    0.311172]  ? generic_make_request+0xda/0x3b0
>>>>>>>>>>>>>> [    0.311174]  blk_mq_make_request+0xd6/0x7b0
>>>>>>>>>>>>>> [    0.311176]  ? blk_queue_enter+0x24/0x260
>>>>>>>>>>>>>> [    0.311178]  ? generic_make_request+0xda/0x3b0
>>>>>>>>>>>>>> [    0.311181]  generic_make_request+0x10c/0x3b0
>>>>>>>>>>>>>> [    0.311183]  ? submit_bio+0x5c/0x110
>>>>>>>>>>>>>> [    0.311185]  submit_bio+0x5c/0x110
>>>>>>>>>>>>>> [    0.311197]  ? __ext4_journal_stop+0x36/0xa0 [ext4]
>>>>>>>>>>>>>> [    0.311210]  ext4_io_submit+0x48/0x60 [ext4]
>>>>>>>>>>>>>> [    0.311222]  ext4_writepages+0x810/0x11f0 [ext4]
>>>>>>>>>>>>>> [    0.311229]  ? do_writepages+0x3c/0xd0
>>>>>>>>>>>>>> [    0.311239]  ? ext4_mark_inode_dirty+0x260/0x260 [ext4]
>>>>>>>>>>>>>> [    0.311240]  do_writepages+0x3c/0xd0
>>>>>>>>>>>>>> [    0.311243]  ? _raw_spin_unlock+0x24/0x30
>>>>>>>>>>>>>> [    0.311245]  ? wbc_attach_and_unlock_inode+0x165/0x280
>>>>>>>>>>>>>> [    0.311248]  ? __filemap_fdatawrite_range+0xa3/0xe0
>>>>>>>>>>>>>> [    0.311250]  __filemap_fdatawrite_range+0xa3/0xe0
>>>>>>>>>>>>>> [    0.311253]  file_write_and_wait_range+0x34/0x90
>>>>>>>>>>>>>> [    0.311264]  ext4_sync_file+0x151/0x500 [ext4]
>>>>>>>>>>>>>> [    0.311267]  do_fsync+0x38/0x60
>>>>>>>>>>>>>> [    0.311270]  SyS_fsync+0xc/0x10
>>>>>>>>>>>>>> [    0.311272]  do_syscall_64+0x6f/0x170
>>>>>>>>>>>>>> [    0.311274]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In the original patch, wbt_done is waking up all the exclusive
>>>>>>>>>>>>>> processes in the wait queue, which can cause a thundering herd
>>>>>>>>>>>>>> if there is a large number of writer threads in the queue. The
>>>>>>>>>>>>>> original intention of the code seems to be to wake up one thread
>>>>>>>>>>>>>> only; however, it uses wake_up_all() in __wbt_done(), and then
>>>>>>>>>>>>>> uses the following check in __wbt_wait to have only one thread
>>>>>>>>>>>>>> actually get out of the wait loop:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> if (waitqueue_active(&rqw->wait) &&
>>>>>>>>>>>>>>             rqw->wait.head.next != &wait->entry)
>>>>>>>>>>>>>>                 return false;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The problem with this is that the wait entry in wbt_wait is
>>>>>>>>>>>>>> defined with DEFINE_WAIT, which uses the autoremove wakeup function.
>>>>>>>>>>>>>> That means that the above check is invalid - the wait entry will
>>>>>>>>>>>>>> have been removed from the queue already by the time we hit the
>>>>>>>>>>>>>> check in the loop.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Secondly, auto-removing the wait entries also means that the wait
>>>>>>>>>>>>>> queue essentially gets reordered "randomly" (e.g. threads re-add
>>>>>>>>>>>>>> themselves in the order they got to run after being woken up).
>>>>>>>>>>>>>> Additionally, new requests entering wbt_wait might overtake requests
>>>>>>>>>>>>>> that were queued earlier, because the wait queue will be
>>>>>>>>>>>>>> (temporarily) empty after the wake_up_all, so the waitqueue_active
>>>>>>>>>>>>>> check will not stop them. This can cause certain threads to starve
>>>>>>>>>>>>>> under high load.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The fix is to leave the woken up requests in the queue and remove
>>>>>>>>>>>>>> them in finish_wait() once the current thread breaks out of the
>>>>>>>>>>>>>> wait loop in __wbt_wait. This will ensure new requests always
>>>>>>>>>>>>>> end up at the back of the queue, and they won't overtake requests
>>>>>>>>>>>>>> that are already in the wait queue. With that change, the loop
>>>>>>>>>>>>>> in wbt_wait is also in line with many other wait loops in the kernel.
>>>>>>>>>>>>>> Waking up just one thread drastically reduces lock contention, as
>>>>>>>>>>>>>> does moving the wait queue add/remove out of the loop.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> A significant drop in lockdep's lock contention numbers is seen when
>>>>>>>>>>>>>> running the test application on the patched kernel.
>>>>>>>>>>>>> I like the patch, and a few weeks ago we independently discovered that
>>>>>>>>>>>>> the waitqueue list checking was bogus as well. My only worry is that
>>>>>>>>>>>>> changes like this can be delicate, meaning that it's easy to introduce
>>>>>>>>>>>>> stall conditions. What kind of testing did you push this through?
>>>>>>>>>>>>>
>>>>>>>>>>>>> -- 
>>>>>>>>>>>>> Jens Axboe
>>>>>>>>>>>>>
>>>>>>>>>>>> I ran the following tests on both real HW with NVME devices attached
>>>>>>>>>>>> and emulated NVME too:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. The test case I used to reproduce the issue, spawns a bunch of threads 
>>>>>>>>>>>>    to concurrently read and write files with random size and content. 
>>>>>>>>>>>>    Files are randomly fsync'd. The implementation is a FIFO queue of files. 
>>>>>>>>>>>>    When the queue fills the test starts to verify and remove the files. This 
>>>>>>>>>>>>    test will fail if there's a read, write, or hash check failure. It tests
>>>>>>>>>>>>    for file corruption when lots of small files are being read and written 
>>>>>>>>>>>>    with high concurrency.
>>>>>>>>>>>>
>>>>>>>>>>>> 2. Fio for random writes with a root NVME device of 200GB
>>>>>>>>>>>>   
>>>>>>>>>>>>   fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite --bs=4k 
>>>>>>>>>>>>   --direct=0 --size=10G --numjobs=2 --runtime=60 --group_reporting
>>>>>>>>>>>>   
>>>>>>>>>>>>   fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite --bs=4k
>>>>>>>>>>>>   --direct=0 --size=5G --numjobs=2 --runtime=30 --fsync=64 --group_reporting
>>>>>>>>>>>>   
>>>>>>>>>>>>   I did see an improvement in the bandwidth numbers reported on the patched
>>>>>>>>>>>>   kernel. 
>>>>>>>>>>>>
>>>>>>>>>>>> Do you have any test case/suite in mind that you would suggest I
>>>>>>>>>>>> run to be sure that the patch does not introduce any stall conditions?
>>>>>>>>>>> One thing that is always useful is to run xfstest, do a full run on
>>>>>>>>>>> the device. If that works, then do another full run, this time limiting
>>>>>>>>>>> the queue depth of the SCSI device to 1. If both of those pass, then
>>>>>>>>>>> I'd feel pretty good getting this applied for 4.19.
>>>>>>>>>> Did you get a chance to run this full test?
>>>>>>>>>>
>>>>>>>>>> -- 
>>>>>>>>>> Jens Axboe
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> Hi Jens,
>>>>>>>>> Yes, I did run the tests and was in the process of compiling concrete results.
>>>>>>>>> I tested the following environments against the xfs/auto group:
>>>>>>>>> 1. Vanilla 4.18-rc kernel
>>>>>>>>> 2. 4.18 kernel with the blk-wbt patch
>>>>>>>>> 3. 4.18 kernel with the blk-wbt patch + io_queue_depth=2. I
>>>>>>>>> understand you asked for a queue depth of 1 for the SCSI device; however, I have NVMe
>>>>>>>>> devices in my environment and 2 is the minimum value for io_queue_depth allowed
>>>>>>>>> according to the NVMe driver code. The results look pretty much the same, with no
>>>>>>>>> stalls or exceptional failures.
>>>>>>>>> xfs/auto ran 296-odd tests with 3 failures and 130-something "no runs".
>>>>>>>>> The remaining tests passed. "Skipped tests" were mostly due to missing features
>>>>>>>>> (e.g. reflink support on the scratch filesystem).
>>>>>>>>> The failures were consistent across runs on 3 different environments.
>>>>>>>>> I am also running the full test suite, but it is taking a long time as I am
>>>>>>>>> hitting a kernel BUG in xfs code in some generic tests. This BUG is not
>>>>>>>>> related to the patch and I see it in the vanilla kernel too. I am
>>>>>>>>> excluding these kinds of tests as they come up and re-running the
>>>>>>>>> suite; however, this process is time consuming.
>>>>>>>>> Do you have any specific tests in mind that you would like me 
>>>>>>>>> to run apart from what I have already tested above?
>>>>>>>> Thanks, I think that looks good. I'll get your patch applied for
>>>>>>>> 4.19.
>>>>>>>>
>>>>>>>> -- 
>>>>>>>> Jens Axboe
>>>>>>>>
>>>>>>>>
>>>>>>> Hi Jens,
>>>>>>> Thanks for accepting this. There is one small issue: I don't find any emails
>>>>>>> sent by me on the lkml mailing list. I am not sure why they didn't land there;
>>>>>>> all I can see are your responses. Do you want one of us to resend the patch
>>>>>>> or will you be able to do it?
>>>>>> That's odd, are you getting rejections on your emails? For reference, the
>>>>>> patch is here:
>>>>>>
>>>>>> http://git.kernel.dk/cgit/linux-block/commit/?h=for-4.19/block&id=2887e41b910bb14fd847cf01ab7a5993db989d88
>>>>> One issue with this, as far as I can tell. Right now we've switched to
>>>>> waking one task at a time, which is obviously more efficient. But if
>>>>> we do that with exclusive waits, then we have to ensure that this task
>>>>> makes progress. If we wake up a task, and then fail to get a queueing
>>>>> token, then we'll go back to sleep. We need to ensure that someone makes
>>>>> forward progress at this point. There are two ways I can see that
>>>>> happening:
>>>>>
>>>>> 1) The task woken _always_ gets to queue an IO
>>>>> 2) If the task woken is NOT allowed to queue an IO, then it must select
>>>>>    a new task to wake up. That new task is then subjected to rule 1 or 2
>>>>>    as well.
>>>>>
>>>>> For #1, it could be as simple as:
>>>>>
>>>>> if (slept || !rwb_enabled(rwb)) {
>>>>> 	atomic_inc(&rqw->inflight);
>>>>> 	break;
>>>>> }
>>>>>
>>>>> but this obviously won't always be fair. Might be good enough however,
>>>>> instead of having to eg replace the generic wait queues with a priority
>>>>> list/queue.
>>>>>
>>>>> Note that this isn't an entirely new issue, it's just so much easier to
>>>>> hit with the single wakeups.
>>>>>
>>>> Hi Jens,
>>>>
>>>> What is the scenario that you see under which the woken up task does not
>>>> get to run?
>>> That scenario is pretty easy to hit - let's say the next in line task
>>> has a queue limit of 1, and we currently have 4 pending. Task gets
>>> woken, goes back to sleep. Which should be totally fine. At some point
>>> we'll get below the limit, and allow the task to proceed. This will
>>> ensure forward progress.
>>>
>>>> The theory behind leaving the task on the wait queue is that the
>>>> waitqueue_active check in wbt_wait prevents new tasks from taking up a
>>>> slot in the queue (e.g. incrementing inflight). So, there should not be
>>>> a way for inflight to be incremented between the time the wake_up is
>>>> done and the task at the head of the wait queue runs. That's the idea
>>>> anyway :-) If we missed something, let us know.
>>> And that's a fine theory, I think it's a good improvement (and how it
>>> should have worked). I'm struggling to see where the issue is. Perhaps
>>> it's related to the wq active check. With fewer wakeups, we're more
>>> likely to hit a race there.
>>>
>>> I'll poke at it...
>> Trying something like this:
>>
>> http://git.kernel.dk/cgit/linux-block/log/?h=for-4.19/wbt
>>
> Ah, now I see what you mean.
> 
> This is the case where a task goes to sleep, not because the inflight
> limit has been reached, but simply because it needs to go to the back of
> the wait queue.
> 
> In that case, it should, for its first time inside the loop, not try to
> grab an inflight slot - since doing so means it could still race to
> overtake a task that got there earlier and is in the wait queue.
> 
> So what you are doing is keeping track of whether it got into the loop
> only because of queueing, and then not trying to grab a slot the first
> time around the loop.
> 
> I think that should work to fix that corner case.

I hope so, got tests running now and we'll see...

Outside of that, getting the matching memory barrier for the wq check
could also fix a race on the completion side.
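
Roughly this on the completion side, I'd think (sketch only):

	atomic_dec(&rqw->inflight);

	/*
	 * Pairs with the barrier implied by set_current_state() on the
	 * wait side: the inflight update must be visible before we sample
	 * the waitqueue, or waitqueue_active() can miss a waiter that is
	 * just being added and we'd skip the wakeup.
	 */
	smp_mb();

	if (waitqueue_active(&rqw->wait))
		wake_up(&rqw->wait);

wq_has_sleeper() would bundle that smp_mb() with the active check.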

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait
  2018-08-20 20:20                           ` Jens Axboe
@ 2018-08-20 22:42                             ` Balbir Singh
  2018-08-21  2:58                               ` Jens Axboe
  0 siblings, 1 reply; 38+ messages in thread
From: Balbir Singh @ 2018-08-20 22:42 UTC (permalink / raw)
  To: Jens Axboe
  Cc: van der Linden, Frank, Agarwal, Anchal, linux-block,
	linux-kernel, Wilson, Matt

On Mon, Aug 20, 2018 at 02:20:59PM -0600, Jens Axboe wrote:
> On 8/20/18 2:19 PM, van der Linden, Frank wrote:
> > On 8/20/18 12:29 PM, Jens Axboe wrote:
> >> On 8/20/18 1:08 PM, Jens Axboe wrote:
> >>> On 8/20/18 11:34 AM, van der Linden, Frank wrote:
> >>>> On 8/20/18 9:37 AM, Jens Axboe wrote:
> >>>>> On 8/7/18 3:19 PM, Jens Axboe wrote:
> >>>>>> On 8/7/18 3:12 PM, Anchal Agarwal wrote:
> >>>>>>> On Tue, Aug 07, 2018 at 02:39:48PM -0600, Jens Axboe wrote:
> >>>>>>>> On 8/7/18 2:12 PM, Anchal Agarwal wrote:
> >>>>>>>>> On Tue, Aug 07, 2018 at 08:29:44AM -0600, Jens Axboe wrote:
> >>>>>>>>>> On 8/1/18 4:09 PM, Jens Axboe wrote:
> >>>>>>>>>>> On 8/1/18 11:06 AM, Anchal Agarwal wrote:
> >>>>>>>>>>>> On Wed, Aug 01, 2018 at 09:14:50AM -0600, Jens Axboe wrote:
> >>>>>>>>>>>>> On 7/31/18 3:34 PM, Anchal Agarwal wrote:
> >>>>>>>>>>>>>> Hi folks,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> This patch modifies commit e34cbd307477a
> >>>>>>>>>>>>>> (blk-wbt: add general throttling mechanism)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I am currently running a large bare metal instance (i3.metal)
> >>>>>>>>>>>>>> on EC2 with 72 cores, 512GB of RAM and NVME drives, with a
> >>>>>>>>>>>>>> 4.18 kernel. I have a workload that simulates a database
> >>>>>>>>>>>>>> workload and I am running into lockup issues when writeback
> >>>>>>>>>>>>>> throttling is enabled, with the hung task detector also
> >>>>>>>>>>>>>> kicking in.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Crash dumps show that most CPUs (up to 50 of them) are
> >>>>>>>>>>>>>> all trying to get the wbt wait queue lock while trying to add
> >>>>>>>>>>>>>> themselves to it in __wbt_wait (see stack traces below).
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> [    0.948118] CPU: 45 PID: 0 Comm: swapper/45 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
> >>>>>>>>>>>>>> [    0.948119] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
> >>>>>>>>>>>>>> [    0.948120] task: ffff883f7878c000 task.stack: ffffc9000c69c000
> >>>>>>>>>>>>>> [    0.948124] RIP: 0010:native_queued_spin_lock_slowpath+0xf8/0x1a0
> >>>>>>>>>>>>>> [    0.948125] RSP: 0018:ffff883f7fcc3dc8 EFLAGS: 00000046
> >>>>>>>>>>>>>> [    0.948126] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7fce2a00
> >>>>>>>>>>>>>> [    0.948128] RDX: 000000000000001c RSI: 0000000000740001 RDI: ffff887f7709ca68
> >>>>>>>>>>>>>> [    0.948129] RBP: 0000000000000002 R08: 0000000000b80000 R09: 0000000000000000
> >>>>>>>>>>>>>> [    0.948130] R10: ffff883f7fcc3d78 R11: 000000000de27121 R12: 0000000000000002
> >>>>>>>>>>>>>> [    0.948131] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
> >>>>>>>>>>>>>> [    0.948132] FS:  0000000000000000(0000) GS:ffff883f7fcc0000(0000) knlGS:0000000000000000
> >>>>>>>>>>>>>> [    0.948134] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>>>>>>>>>>>>> [    0.948135] CR2: 000000c424c77000 CR3: 0000000002010005 CR4: 00000000003606e0
> >>>>>>>>>>>>>> [    0.948136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >>>>>>>>>>>>>> [    0.948137] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> >>>>>>>>>>>>>> [    0.948138] Call Trace:
> >>>>>>>>>>>>>> [    0.948139]  <IRQ>
> >>>>>>>>>>>>>> [    0.948142]  do_raw_spin_lock+0xad/0xc0
> >>>>>>>>>>>>>> [    0.948145]  _raw_spin_lock_irqsave+0x44/0x4b
> >>>>>>>>>>>>>> [    0.948149]  ? __wake_up_common_lock+0x53/0x90
> >>>>>>>>>>>>>> [    0.948150]  __wake_up_common_lock+0x53/0x90
> >>>>>>>>>>>>>> [    0.948155]  wbt_done+0x7b/0xa0
> >>>>>>>>>>>>>> [    0.948158]  blk_mq_free_request+0xb7/0x110
> >>>>>>>>>>>>>> [    0.948161]  __blk_mq_complete_request+0xcb/0x140
> >>>>>>>>>>>>>> [    0.948166]  nvme_process_cq+0xce/0x1a0 [nvme]
> >>>>>>>>>>>>>> [    0.948169]  nvme_irq+0x23/0x50 [nvme]
> >>>>>>>>>>>>>> [    0.948173]  __handle_irq_event_percpu+0x46/0x300
> >>>>>>>>>>>>>> [    0.948176]  handle_irq_event_percpu+0x20/0x50
> >>>>>>>>>>>>>> [    0.948179]  handle_irq_event+0x34/0x60
> >>>>>>>>>>>>>> [    0.948181]  handle_edge_irq+0x77/0x190
> >>>>>>>>>>>>>> [    0.948185]  handle_irq+0xaf/0x120
> >>>>>>>>>>>>>> [    0.948188]  do_IRQ+0x53/0x110
> >>>>>>>>>>>>>> [    0.948191]  common_interrupt+0x87/0x87
> >>>>>>>>>>>>>> [    0.948192]  </IRQ>
> >>>>>>>>>>>>>> ....
> >>>>>>>>>>>>>> [    0.311136] CPU: 4 PID: 9737 Comm: run_linux_amd64 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
> >>>>>>>>>>>>>> [    0.311137] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
> >>>>>>>>>>>>>> [    0.311138] task: ffff883f6e6a8000 task.stack: ffffc9000f1ec000
> >>>>>>>>>>>>>> [    0.311141] RIP: 0010:native_queued_spin_lock_slowpath+0xf5/0x1a0
> >>>>>>>>>>>>>> [    0.311142] RSP: 0018:ffffc9000f1efa28 EFLAGS: 00000046
> >>>>>>>>>>>>>> [    0.311144] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7f722a00
> >>>>>>>>>>>>>> [    0.311145] RDX: 0000000000000035 RSI: 0000000000d80001 RDI: ffff887f7709ca68
> >>>>>>>>>>>>>> [    0.311146] RBP: 0000000000000202 R08: 0000000000140000 R09: 0000000000000000
> >>>>>>>>>>>>>> [    0.311147] R10: ffffc9000f1ef9d8 R11: 000000001a249fa0 R12: ffff887f7709ca68
> >>>>>>>>>>>>>> [    0.311148] R13: ffffc9000f1efad0 R14: 0000000000000000 R15: ffff887f7709ca00
> >>>>>>>>>>>>>> [    0.311149] FS:  000000c423f30090(0000) GS:ffff883f7f700000(0000) knlGS:0000000000000000
> >>>>>>>>>>>>>> [    0.311150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>>>>>>>>>>>>> [    0.311151] CR2: 00007feefcea4000 CR3: 0000007f7016e001 CR4: 00000000003606e0
> >>>>>>>>>>>>>> [    0.311152] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >>>>>>>>>>>>>> [    0.311153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> >>>>>>>>>>>>>> [    0.311154] Call Trace:
> >>>>>>>>>>>>>> [    0.311157]  do_raw_spin_lock+0xad/0xc0
> >>>>>>>>>>>>>> [    0.311160]  _raw_spin_lock_irqsave+0x44/0x4b
> >>>>>>>>>>>>>> [    0.311162]  ? prepare_to_wait_exclusive+0x28/0xb0
> >>>>>>>>>>>>>> [    0.311164]  prepare_to_wait_exclusive+0x28/0xb0
> >>>>>>>>>>>>>> [    0.311167]  wbt_wait+0x127/0x330
> >>>>>>>>>>>>>> [    0.311169]  ? finish_wait+0x80/0x80
> >>>>>>>>>>>>>> [    0.311172]  ? generic_make_request+0xda/0x3b0
> >>>>>>>>>>>>>> [    0.311174]  blk_mq_make_request+0xd6/0x7b0
> >>>>>>>>>>>>>> [    0.311176]  ? blk_queue_enter+0x24/0x260
> >>>>>>>>>>>>>> [    0.311178]  ? generic_make_request+0xda/0x3b0
> >>>>>>>>>>>>>> [    0.311181]  generic_make_request+0x10c/0x3b0
> >>>>>>>>>>>>>> [    0.311183]  ? submit_bio+0x5c/0x110
> >>>>>>>>>>>>>> [    0.311185]  submit_bio+0x5c/0x110
> >>>>>>>>>>>>>> [    0.311197]  ? __ext4_journal_stop+0x36/0xa0 [ext4]
> >>>>>>>>>>>>>> [    0.311210]  ext4_io_submit+0x48/0x60 [ext4]
> >>>>>>>>>>>>>> [    0.311222]  ext4_writepages+0x810/0x11f0 [ext4]
> >>>>>>>>>>>>>> [    0.311229]  ? do_writepages+0x3c/0xd0
> >>>>>>>>>>>>>> [    0.311239]  ? ext4_mark_inode_dirty+0x260/0x260 [ext4]
> >>>>>>>>>>>>>> [    0.311240]  do_writepages+0x3c/0xd0
> >>>>>>>>>>>>>> [    0.311243]  ? _raw_spin_unlock+0x24/0x30
> >>>>>>>>>>>>>> [    0.311245]  ? wbc_attach_and_unlock_inode+0x165/0x280
> >>>>>>>>>>>>>> [    0.311248]  ? __filemap_fdatawrite_range+0xa3/0xe0
> >>>>>>>>>>>>>> [    0.311250]  __filemap_fdatawrite_range+0xa3/0xe0
> >>>>>>>>>>>>>> [    0.311253]  file_write_and_wait_range+0x34/0x90
> >>>>>>>>>>>>>> [    0.311264]  ext4_sync_file+0x151/0x500 [ext4]
> >>>>>>>>>>>>>> [    0.311267]  do_fsync+0x38/0x60
> >>>>>>>>>>>>>> [    0.311270]  SyS_fsync+0xc/0x10
> >>>>>>>>>>>>>> [    0.311272]  do_syscall_64+0x6f/0x170
> >>>>>>>>>>>>>> [    0.311274]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> In the original patch, wbt_done is waking up all the exclusive
> >>>>>>>>>>>>>> processes in the wait queue, which can cause a thundering herd
> >>>>>>>>>>>>>> if there is a large number of writer threads in the queue. The
> >>>>>>>>>>>>>> original intention of the code seems to be to wake up one thread
> >>>>>>>>>>>>>> only; however, it uses wake_up_all() in __wbt_done(), and then
> >>>>>>>>>>>>>> uses the following check in __wbt_wait to have only one thread
> >>>>>>>>>>>>>> actually get out of the wait loop:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> if (waitqueue_active(&rqw->wait) &&
> >>>>>>>>>>>>>>             rqw->wait.head.next != &wait->entry)
> >>>>>>>>>>>>>>                 return false;
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> The problem with this is that the wait entry in wbt_wait is
> >>>>>>>>>>>>>> defined with DEFINE_WAIT, which uses the autoremove wakeup function.
> >>>>>>>>>>>>>> That means that the above check is invalid - the wait entry will
> >>>>>>>>>>>>>> have been removed from the queue already by the time we hit the
> >>>>>>>>>>>>>> check in the loop.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Secondly, auto-removing the wait entries also means that the wait
> >>>>>>>>>>>>>> queue essentially gets reordered "randomly" (e.g. threads re-add
> >>>>>>>>>>>>>> themselves in the order they got to run after being woken up).
> >>>>>>>>>>>>>> Additionally, new requests entering wbt_wait might overtake requests
> >>>>>>>>>>>>>> that were queued earlier, because the wait queue will be
> >>>>>>>>>>>>>> (temporarily) empty after the wake_up_all, so the waitqueue_active
> >>>>>>>>>>>>>> check will not stop them. This can cause certain threads to starve
> >>>>>>>>>>>>>> under high load.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> The fix is to leave the woken up requests in the queue and remove
> >>>>>>>>>>>>>> them in finish_wait() once the current thread breaks out of the
> >>>>>>>>>>>>>> wait loop in __wbt_wait. This will ensure new requests always
> >>>>>>>>>>>>>> end up at the back of the queue, and they won't overtake requests
> >>>>>>>>>>>>>> that are already in the wait queue. With that change, the loop
> >>>>>>>>>>>>>> in wbt_wait is also in line with many other wait loops in the kernel.
> >>>>>>>>>>>>>> Waking up just one thread drastically reduces lock contention, as
> >>>>>>>>>>>>>> does moving the wait queue add/remove out of the loop.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> A significant drop in lockdep's lock contention numbers is seen when
> >>>>>>>>>>>>>> running the test application on the patched kernel.
> >>>>>>>>>>>>> I like the patch, and a few weeks ago we independently discovered that
> >>>>>>>>>>>>> the waitqueue list checking was bogus as well. My only worry is that
> >>>>>>>>>>>>> changes like this can be delicate, meaning that it's easy to introduce
> >>>>>>>>>>>>> stall conditions. What kind of testing did you push this through?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> -- 
> >>>>>>>>>>>>> Jens Axboe
> >>>>>>>>>>>>>
> >>>>>>>>>>>> I ran the following tests on both real HW with NVME devices attached
> >>>>>>>>>>>> and emulated NVME too:
> >>>>>>>>>>>>
> >>>>>>>>>>>> 1. The test case I used to reproduce the issue, spawns a bunch of threads 
> >>>>>>>>>>>>    to concurrently read and write files with random size and content. 
> >>>>>>>>>>>>    Files are randomly fsync'd. The implementation is a FIFO queue of files. 
> >>>>>>>>>>>>    When the queue fills the test starts to verify and remove the files. This 
> >>>>>>>>>>>>    test will fail if there's a read, write, or hash check failure. It tests
> >>>>>>>>>>>>    for file corruption when lots of small files are being read and written 
> >>>>>>>>>>>>    with high concurrency.
> >>>>>>>>>>>>
> >>>>>>>>>>>> 2. Fio for random writes with a root NVME device of 200GB
> >>>>>>>>>>>>   
> >>>>>>>>>>>>   fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite --bs=4k 
> >>>>>>>>>>>>   --direct=0 --size=10G --numjobs=2 --runtime=60 --group_reporting
> >>>>>>>>>>>>   
> >>>>>>>>>>>>   fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite --bs=4k
> >>>>>>>>>>>>   --direct=0 --size=5G --numjobs=2 --runtime=30 --fsync=64 --group_reporting
> >>>>>>>>>>>>   
> >>>>>>>>>>>>   I did see an improvement in the bandwidth numbers reported on the patched
> >>>>>>>>>>>>   kernel. 
> >>>>>>>>>>>>
> >>>>>>>>>>>> Do you have any test case/suite in mind that you would suggest I
> >>>>>>>>>>>> run to be sure that the patch does not introduce any stall conditions?
> >>>>>>>>>>> One thing that is always useful is to run xfstest, do a full run on
> >>>>>>>>>>> the device. If that works, then do another full run, this time limiting
> >>>>>>>>>>> the queue depth of the SCSI device to 1. If both of those pass, then
> >>>>>>>>>>> I'd feel pretty good getting this applied for 4.19.
> >>>>>>>>>> Did you get a chance to run this full test?
> >>>>>>>>>>
> >>>>>>>>>> -- 
> >>>>>>>>>> Jens Axboe
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>> Hi Jens,
> >>>>>>>>> Yes, I did run the tests and was in the process of compiling concrete results.
> >>>>>>>>> I tested the following environments against the xfs/auto group:
> >>>>>>>>> 1. Vanilla 4.18-rc kernel
> >>>>>>>>> 2. 4.18 kernel with the blk-wbt patch
> >>>>>>>>> 3. 4.18 kernel with the blk-wbt patch + io_queue_depth=2. I
> >>>>>>>>> understand you asked for a queue depth of 1 for the SCSI device; however, I have NVMe
> >>>>>>>>> devices in my environment and 2 is the minimum value for io_queue_depth allowed
> >>>>>>>>> according to the NVMe driver code. The results look pretty much the same, with no
> >>>>>>>>> stalls or exceptional failures.
> >>>>>>>>> xfs/auto ran 296-odd tests with 3 failures and 130-something "no runs".
> >>>>>>>>> The remaining tests passed. "Skipped tests" were mostly due to missing features
> >>>>>>>>> (e.g. reflink support on the scratch filesystem).
> >>>>>>>>> The failures were consistent across runs on 3 different environments.
> >>>>>>>>> I am also running the full test suite, but it is taking a long time as I am
> >>>>>>>>> hitting a kernel BUG in xfs code in some generic tests. This BUG is not
> >>>>>>>>> related to the patch and I see it in the vanilla kernel too. I am
> >>>>>>>>> excluding these kinds of tests as they come up and re-running the
> >>>>>>>>> suite; however, this process is time consuming.
> >>>>>>>>> Do you have any specific tests in mind that you would like me 
> >>>>>>>>> to run apart from what I have already tested above?
> >>>>>>>> Thanks, I think that looks good. I'll get your patch applied for
> >>>>>>>> 4.19.
> >>>>>>>>
> >>>>>>>> -- 
> >>>>>>>> Jens Axboe
> >>>>>>>>
> >>>>>>>>
> >>>>>>> Hi Jens,
> >>>>>>> Thanks for accepting this. There is one small issue: I don't find any emails
> >>>>>>> sent by me on the lkml mailing list. I am not sure why they didn't land there;
> >>>>>>> all I can see are your responses. Do you want one of us to resend the patch
> >>>>>>> or will you be able to do it?
> >>>>>> That's odd, are you getting rejections on your emails? For reference, the
> >>>>>> patch is here:
> >>>>>>
> >>>>>> http://git.kernel.dk/cgit/linux-block/commit/?h=for-4.19/block&id=2887e41b910bb14fd847cf01ab7a5993db989d88
> >>>>> One issue with this, as far as I can tell. Right now we've switched to
> >>>>> waking one task at a time, which is obviously more efficient. But if
> >>>>> we do that with exclusive waits, then we have to ensure that this task
> >>>>> makes progress. If we wake up a task, and then fail to get a queueing
> >>>>> token, then we'll go back to sleep. We need to ensure that someone makes
> >>>>> forward progress at this point. There are two ways I can see that
> >>>>> happening:
> >>>>>
> >>>>> 1) The task woken _always_ gets to queue an IO
> >>>>> 2) If the task woken is NOT allowed to queue an IO, then it must select
> >>>>>    a new task to wake up. That new task is then subjected to rule 1 or 2
> >>>>>    as well.
> >>>>>
> >>>>> For #1, it could be as simple as:
> >>>>>
> >>>>> if (slept || !rwb_enabled(rwb)) {
> >>>>> 	atomic_inc(&rqw->inflight);
> >>>>> 	break;
> >>>>> }
> >>>>>
> >>>>> but this obviously won't always be fair. Might be good enough however,
> >>>>> instead of having to eg replace the generic wait queues with a priority
> >>>>> list/queue.
> >>>>>
> >>>>> Note that this isn't an entirely new issue, it's just so much easier to
> >>>>> hit with the single wakeups.
> >>>>>
> >>>> Hi Jens,
> >>>>
> >>>> What is the scenario that you see under which the woken up task does not
> >>>> get to run?
> >>> That scenario is pretty easy to hit - let's say the next in line task
> >>> has a queue limit of 1, and we currently have 4 pending. Task gets
> >>> woken, goes back to sleep. Which should be totally fine. At some point
> >>> we'll get below the limit, and allow the task to proceed. This will
> >>> ensure forward progress.
> >>>
> >>>> The theory behind leaving the task on the wait queue is that the
> >>>> waitqueue_active check in wbt_wait prevents new tasks from taking up a
> >>>> slot in the queue (e.g. incrementing inflight). So, there should not be
> >>>> a way for inflight to be incremented between the time the wake_up is
> >>>> done and the task at the head of the wait queue runs. That's the idea
> >>>> anyway :-) If we missed something, let us know.
> >>> And that's a fine theory, I think it's a good improvement (and how it
> >>> should have worked). I'm struggling to see where the issue is. Perhaps
> >>> it's related to the wq active check. With fewer wakeups, we're more
> >>> likely to hit a race there.
> >>>
> >>> I'll poke at it...
> >> Trying something like this:
> >>
> >> http://git.kernel.dk/cgit/linux-block/log/?h=for-4.19/wbt
> >>
> > Ah, now I see what you mean.
> > 
> > This is the case where a task goes to sleep, not because the inflight
> > limit has been reached, but simply because it needs to go to the back of
> > the wait queue.
> > 
> > In that case, it should, for its first time inside the loop, not try to
> > grab an inflight slot - since doing so means it could still race to
> > overtake a task that got there earlier and is in the wait queue.
> > 
> > So what you are doing is keeping track of whether it got into the loop
> > only because of queueing, and then not trying to grab a slot the first
> > time around the loop.
> > 
> > I think that should work to fix that corner case.
> 
> I hope so, got tests running now and we'll see...
> 
> Outside of that, getting the matching memory barrier for the wq check
> could also fix a race on the completion side.
>

I thought all the wait_* and set_current_* and atomic_* had implicit barriers.
Are you referring to the rwb->wb_* values we consume on the completion side?

I was initially concerned about not dequeuing the task, but noticed that
wake_up_common seems to handle that well. I also looked for sources of
missed wakeups - the same task being notified twice, or a wakeup getting
lost - but could not hit one.

FYI: we ran a lock contention analysis and the waitqueue lock showed up
as having the largest contention, which disappeared after this patch.

Balbir Singh.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait
  2018-08-20 22:42                             ` Balbir Singh
@ 2018-08-21  2:58                               ` Jens Axboe
  2018-08-22  3:20                                 ` Jens Axboe
  0 siblings, 1 reply; 38+ messages in thread
From: Jens Axboe @ 2018-08-21  2:58 UTC (permalink / raw)
  To: Balbir Singh
  Cc: van der Linden, Frank, Agarwal, Anchal, linux-block,
	linux-kernel, Wilson, Matt

On 8/20/18 4:42 PM, Balbir Singh wrote:
> On Mon, Aug 20, 2018 at 02:20:59PM -0600, Jens Axboe wrote:
>> On 8/20/18 2:19 PM, van der Linden, Frank wrote:
>>> On 8/20/18 12:29 PM, Jens Axboe wrote:
>>>> On 8/20/18 1:08 PM, Jens Axboe wrote:
>>>>> On 8/20/18 11:34 AM, van der Linden, Frank wrote:
>>>>>> On 8/20/18 9:37 AM, Jens Axboe wrote:
>>>>>>> On 8/7/18 3:19 PM, Jens Axboe wrote:
>>>>>>>> On 8/7/18 3:12 PM, Anchal Agarwal wrote:
>>>>>>>>> On Tue, Aug 07, 2018 at 02:39:48PM -0600, Jens Axboe wrote:
>>>>>>>>>> On 8/7/18 2:12 PM, Anchal Agarwal wrote:
>>>>>>>>>>> On Tue, Aug 07, 2018 at 08:29:44AM -0600, Jens Axboe wrote:
>>>>>>>>>>>> On 8/1/18 4:09 PM, Jens Axboe wrote:
>>>>>>>>>>>>> On 8/1/18 11:06 AM, Anchal Agarwal wrote:
>>>>>>>>>>>>>> On Wed, Aug 01, 2018 at 09:14:50AM -0600, Jens Axboe wrote:
>>>>>>>>>>>>>>> On 7/31/18 3:34 PM, Anchal Agarwal wrote:
>>>>>>>>>>>>>>>> Hi folks,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This patch modifies commit e34cbd307477a
>>>>>>>>>>>>>>>> (blk-wbt: add general throttling mechanism)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I am currently running a large bare metal instance (i3.metal)
>>>>>>>>>>>>>>>> on EC2 with 72 cores, 512GB of RAM and NVME drives, with a
>>>>>>>>>>>>>>>> 4.18 kernel. I have a workload that simulates a database
>>>>>>>>>>>>>>>> workload and I am running into lockup issues when writeback
>>>>>>>>>>>>>>>> throttling is enabled, with the hung task detector also
>>>>>>>>>>>>>>>> kicking in.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Crash dumps show that most CPUs (up to 50 of them) are
>>>>>>>>>>>>>>>> all trying to get the wbt wait queue lock while trying to add
>>>>>>>>>>>>>>>> themselves to it in __wbt_wait (see stack traces below).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [    0.948118] CPU: 45 PID: 0 Comm: swapper/45 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
>>>>>>>>>>>>>>>> [    0.948119] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
>>>>>>>>>>>>>>>> [    0.948120] task: ffff883f7878c000 task.stack: ffffc9000c69c000
>>>>>>>>>>>>>>>> [    0.948124] RIP: 0010:native_queued_spin_lock_slowpath+0xf8/0x1a0
>>>>>>>>>>>>>>>> [    0.948125] RSP: 0018:ffff883f7fcc3dc8 EFLAGS: 00000046
>>>>>>>>>>>>>>>> [    0.948126] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7fce2a00
>>>>>>>>>>>>>>>> [    0.948128] RDX: 000000000000001c RSI: 0000000000740001 RDI: ffff887f7709ca68
>>>>>>>>>>>>>>>> [    0.948129] RBP: 0000000000000002 R08: 0000000000b80000 R09: 0000000000000000
>>>>>>>>>>>>>>>> [    0.948130] R10: ffff883f7fcc3d78 R11: 000000000de27121 R12: 0000000000000002
>>>>>>>>>>>>>>>> [    0.948131] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
>>>>>>>>>>>>>>>> [    0.948132] FS:  0000000000000000(0000) GS:ffff883f7fcc0000(0000) knlGS:0000000000000000
>>>>>>>>>>>>>>>> [    0.948134] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>>>>>>>>>>>> [    0.948135] CR2: 000000c424c77000 CR3: 0000000002010005 CR4: 00000000003606e0
>>>>>>>>>>>>>>>> [    0.948136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>>>>>>>>>>>>> [    0.948137] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>>>>>>>>>>>>>> [    0.948138] Call Trace:
>>>>>>>>>>>>>>>> [    0.948139]  <IRQ>
>>>>>>>>>>>>>>>> [    0.948142]  do_raw_spin_lock+0xad/0xc0
>>>>>>>>>>>>>>>> [    0.948145]  _raw_spin_lock_irqsave+0x44/0x4b
>>>>>>>>>>>>>>>> [    0.948149]  ? __wake_up_common_lock+0x53/0x90
>>>>>>>>>>>>>>>> [    0.948150]  __wake_up_common_lock+0x53/0x90
>>>>>>>>>>>>>>>> [    0.948155]  wbt_done+0x7b/0xa0
>>>>>>>>>>>>>>>> [    0.948158]  blk_mq_free_request+0xb7/0x110
>>>>>>>>>>>>>>>> [    0.948161]  __blk_mq_complete_request+0xcb/0x140
>>>>>>>>>>>>>>>> [    0.948166]  nvme_process_cq+0xce/0x1a0 [nvme]
>>>>>>>>>>>>>>>> [    0.948169]  nvme_irq+0x23/0x50 [nvme]
>>>>>>>>>>>>>>>> [    0.948173]  __handle_irq_event_percpu+0x46/0x300
>>>>>>>>>>>>>>>> [    0.948176]  handle_irq_event_percpu+0x20/0x50
>>>>>>>>>>>>>>>> [    0.948179]  handle_irq_event+0x34/0x60
>>>>>>>>>>>>>>>> [    0.948181]  handle_edge_irq+0x77/0x190
>>>>>>>>>>>>>>>> [    0.948185]  handle_irq+0xaf/0x120
>>>>>>>>>>>>>>>> [    0.948188]  do_IRQ+0x53/0x110
>>>>>>>>>>>>>>>> [    0.948191]  common_interrupt+0x87/0x87
>>>>>>>>>>>>>>>> [    0.948192]  </IRQ>
>>>>>>>>>>>>>>>> ....
>>>>>>>>>>>>>>>> [    0.311136] CPU: 4 PID: 9737 Comm: run_linux_amd64 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
>>>>>>>>>>>>>>>> [    0.311137] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
>>>>>>>>>>>>>>>> [    0.311138] task: ffff883f6e6a8000 task.stack: ffffc9000f1ec000
>>>>>>>>>>>>>>>> [    0.311141] RIP: 0010:native_queued_spin_lock_slowpath+0xf5/0x1a0
>>>>>>>>>>>>>>>> [    0.311142] RSP: 0018:ffffc9000f1efa28 EFLAGS: 00000046
>>>>>>>>>>>>>>>> [    0.311144] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7f722a00
>>>>>>>>>>>>>>>> [    0.311145] RDX: 0000000000000035 RSI: 0000000000d80001 RDI: ffff887f7709ca68
>>>>>>>>>>>>>>>> [    0.311146] RBP: 0000000000000202 R08: 0000000000140000 R09: 0000000000000000
>>>>>>>>>>>>>>>> [    0.311147] R10: ffffc9000f1ef9d8 R11: 000000001a249fa0 R12: ffff887f7709ca68
>>>>>>>>>>>>>>>> [    0.311148] R13: ffffc9000f1efad0 R14: 0000000000000000 R15: ffff887f7709ca00
>>>>>>>>>>>>>>>> [    0.311149] FS:  000000c423f30090(0000) GS:ffff883f7f700000(0000) knlGS:0000000000000000
>>>>>>>>>>>>>>>> [    0.311150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>>>>>>>>>>>> [    0.311151] CR2: 00007feefcea4000 CR3: 0000007f7016e001 CR4: 00000000003606e0
>>>>>>>>>>>>>>>> [    0.311152] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>>>>>>>>>>>>> [    0.311153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>>>>>>>>>>>>>> [    0.311154] Call Trace:
>>>>>>>>>>>>>>>> [    0.311157]  do_raw_spin_lock+0xad/0xc0
>>>>>>>>>>>>>>>> [    0.311160]  _raw_spin_lock_irqsave+0x44/0x4b
>>>>>>>>>>>>>>>> [    0.311162]  ? prepare_to_wait_exclusive+0x28/0xb0
>>>>>>>>>>>>>>>> [    0.311164]  prepare_to_wait_exclusive+0x28/0xb0
>>>>>>>>>>>>>>>> [    0.311167]  wbt_wait+0x127/0x330
>>>>>>>>>>>>>>>> [    0.311169]  ? finish_wait+0x80/0x80
>>>>>>>>>>>>>>>> [    0.311172]  ? generic_make_request+0xda/0x3b0
>>>>>>>>>>>>>>>> [    0.311174]  blk_mq_make_request+0xd6/0x7b0
>>>>>>>>>>>>>>>> [    0.311176]  ? blk_queue_enter+0x24/0x260
>>>>>>>>>>>>>>>> [    0.311178]  ? generic_make_request+0xda/0x3b0
>>>>>>>>>>>>>>>> [    0.311181]  generic_make_request+0x10c/0x3b0
>>>>>>>>>>>>>>>> [    0.311183]  ? submit_bio+0x5c/0x110
>>>>>>>>>>>>>>>> [    0.311185]  submit_bio+0x5c/0x110
>>>>>>>>>>>>>>>> [    0.311197]  ? __ext4_journal_stop+0x36/0xa0 [ext4]
>>>>>>>>>>>>>>>> [    0.311210]  ext4_io_submit+0x48/0x60 [ext4]
>>>>>>>>>>>>>>>> [    0.311222]  ext4_writepages+0x810/0x11f0 [ext4]
>>>>>>>>>>>>>>>> [    0.311229]  ? do_writepages+0x3c/0xd0
>>>>>>>>>>>>>>>> [    0.311239]  ? ext4_mark_inode_dirty+0x260/0x260 [ext4]
>>>>>>>>>>>>>>>> [    0.311240]  do_writepages+0x3c/0xd0
>>>>>>>>>>>>>>>> [    0.311243]  ? _raw_spin_unlock+0x24/0x30
>>>>>>>>>>>>>>>> [    0.311245]  ? wbc_attach_and_unlock_inode+0x165/0x280
>>>>>>>>>>>>>>>> [    0.311248]  ? __filemap_fdatawrite_range+0xa3/0xe0
>>>>>>>>>>>>>>>> [    0.311250]  __filemap_fdatawrite_range+0xa3/0xe0
>>>>>>>>>>>>>>>> [    0.311253]  file_write_and_wait_range+0x34/0x90
>>>>>>>>>>>>>>>> [    0.311264]  ext4_sync_file+0x151/0x500 [ext4]
>>>>>>>>>>>>>>>> [    0.311267]  do_fsync+0x38/0x60
>>>>>>>>>>>>>>>> [    0.311270]  SyS_fsync+0xc/0x10
>>>>>>>>>>>>>>>> [    0.311272]  do_syscall_64+0x6f/0x170
>>>>>>>>>>>>>>>> [    0.311274]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In the original patch, wbt_done is waking up all the exclusive
>>>>>>>>>>>>>>>> processes in the wait queue, which can cause a thundering herd
>>>>>>>>>>>>>>>> if there is a large number of writer threads in the queue. The
>>>>>>>>>>>>>>>> original intention of the code seems to be to wake up only one
>>>>>>>>>>>>>>>> thread; however, it uses wake_up_all() in __wbt_done(), and then
>>>>>>>>>>>>>>>> uses the following check in __wbt_wait to have only one thread
>>>>>>>>>>>>>>>> actually get out of the wait loop:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> if (waitqueue_active(&rqw->wait) &&
>>>>>>>>>>>>>>>>             rqw->wait.head.next != &wait->entry)
>>>>>>>>>>>>>>>>                 return false;
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The problem with this is that the wait entry in wbt_wait is
>>>>>>>>>>>>>>>> defined with DEFINE_WAIT, which uses the autoremove wakeup function.
>>>>>>>>>>>>>>>> That means that the above check is invalid - the wait entry will
>>>>>>>>>>>>>>>> have been removed from the queue already by the time we hit the
>>>>>>>>>>>>>>>> check in the loop.
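
(For reference, this is roughly what DEFINE_WAIT sets up, simplified from
include/linux/wait.h and kernel/sched/wait.c: the autoremove wake function
unlinks the entry from the queue as part of the wakeup itself, which is why
the head-of-queue check above never gets to see it.)

#define DEFINE_WAIT(name)                                       \
        struct wait_queue_entry name = {                        \
                .private = current,                             \
                .func    = autoremove_wake_function,            \
                .entry   = LIST_HEAD_INIT((name).entry),        \
        }

int autoremove_wake_function(struct wait_queue_entry *wq_entry,
                             unsigned mode, int sync, void *key)
{
        int ret = default_wake_function(wq_entry, mode, sync, key);

        /* on a successful wakeup, take the entry off the wait queue */
        if (ret)
                list_del_init(&wq_entry->entry);
        return ret;
}
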
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Secondly, auto-removing the wait entries also means that the wait
>>>>>>>>>>>>>>>> queue essentially gets reordered "randomly" (e.g. threads re-add
>>>>>>>>>>>>>>>> themselves in the order they got to run after being woken up).
>>>>>>>>>>>>>>>> Additionally, new requests entering wbt_wait might overtake requests
>>>>>>>>>>>>>>>> that were queued earlier, because the wait queue will be
>>>>>>>>>>>>>>>> (temporarily) empty after the wake_up_all, so the waitqueue_active
>>>>>>>>>>>>>>>> check will not stop them. This can cause certain threads to starve
>>>>>>>>>>>>>>>> under high load.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The fix is to leave the woken up requests in the queue and remove
>>>>>>>>>>>>>>>> them in finish_wait() once the current thread breaks out of the
>>>>>>>>>>>>>>>> wait loop in __wbt_wait. This will ensure new requests always
>>>>>>>>>>>>>>>> end up at the back of the queue, and they won't overtake requests
>>>>>>>>>>>>>>>> that are already in the wait queue. With that change, the loop
>>>>>>>>>>>>>>>> in wbt_wait is also in line with many other wait loops in the kernel.
>>>>>>>>>>>>>>>> Waking up just one thread drastically reduces lock contention, as
>>>>>>>>>>>>>>>> does moving the wait queue add/remove out of the loop.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> A significant drop in lockdep's lock contention numbers is seen when
>>>>>>>>>>>>>>>> running the test application on the patched kernel.
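
(To make the fix described above concrete, here is a rough sketch of the
reworked wait loop; it is illustrative only, not the literal patch.
may_queue() stands in for the inflight-limit check, and the queue-lock
dance around io_schedule() follows the original code. The wait entry is
added once, keeps its place at the head across wakeups, and is only
removed after the task breaks out of the loop.)

static void __wbt_wait(struct rq_wb *rwb, struct rq_wait *rqw,
                       unsigned long rw, spinlock_t *lock)
{
        DECLARE_WAITQUEUE(wait, current); /* default wake function: no auto-remove */

        if (!waitqueue_active(&rqw->wait) && may_queue(rwb, rqw, rw))
                return;

        add_wait_queue_exclusive(&rqw->wait, &wait);
        do {
                set_current_state(TASK_UNINTERRUPTIBLE);

                if (may_queue(rwb, rqw, rw))
                        break;

                if (lock) {
                        spin_unlock_irq(lock);
                        io_schedule();
                        spin_lock_irq(lock);
                } else {
                        io_schedule();
                }
        } while (1);

        __set_current_state(TASK_RUNNING);
        remove_wait_queue(&rqw->wait, &wait);
}
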
>>>>>>>>>>>>>>> I like the patch, and a few weeks ago we independently discovered that
>>>>>>>>>>>>>>> the waitqueue list checking was bogus as well. My only worry is that
>>>>>>>>>>>>>>> changes like this can be delicate, meaning that it's easy to introduce
>>>>>>>>>>>>>>> stall conditions. What kind of testing did you push this through?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>>> Jens Axboe
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I ran the following tests on both real HW with NVME devices attached
>>>>>>>>>>>>>> and emulated NVME too:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. The test case I used to reproduce the issue spawns a bunch of threads 
>>>>>>>>>>>>>>    to concurrently read and write files with random size and content. 
>>>>>>>>>>>>>>    Files are randomly fsync'd. The implementation is a FIFO queue of files. 
>>>>>>>>>>>>>>    When the queue fills, the test starts to verify and remove the files. This 
>>>>>>>>>>>>>>    test will fail if there's a read, write, or hash check failure. It tests
>>>>>>>>>>>>>>    for file corruption when lots of small files are being read and written 
>>>>>>>>>>>>>>    with high concurrency.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2. Fio for random writes with a root NVME device of 200GB
>>>>>>>>>>>>>>   
>>>>>>>>>>>>>>   fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite --bs=4k 
>>>>>>>>>>>>>>   --direct=0 --size=10G --numjobs=2 --runtime=60 --group_reporting
>>>>>>>>>>>>>>   
>>>>>>>>>>>>>>   fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite --bs=4k
>>>>>>>>>>>>>>   --direct=0 --size=5G --numjobs=2 --runtime=30 --fsync=64 --group_reporting
>>>>>>>>>>>>>>   
>>>>>>>>>>>>>>   I did see an improvement in the bandwidth numbers reported on the patched
>>>>>>>>>>>>>>   kernel. 
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Do you have any test case/suite in mind that you would suggest me to 
>>>>>>>>>>>>>> run to be sure that patch does not introduce any stall conditions?
>>>>>>>>>>>>> One thing that is always useful is to run xfstest, do a full run on
>>>>>>>>>>>>> the device. If that works, then do another full run, this time limiting
>>>>>>>>>>>>> the queue depth of the SCSI device to 1. If both of those pass, then
>>>>>>>>>>>>> I'd feel pretty good getting this applied for 4.19.
>>>>>>>>>>>> Did you get a chance to run this full test?
>>>>>>>>>>>>
>>>>>>>>>>>> -- 
>>>>>>>>>>>> Jens Axboe
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> Hi Jens,
>>>>>>>>>>> Yes, I did run the tests and was in the process of compiling concrete results.
>>>>>>>>>>> I tested the following environments against the xfs/auto group:
>>>>>>>>>>> 1. Vanilla 4.18.rc kernel
>>>>>>>>>>> 2. 4.18 kernel with the blk-wbt patch
>>>>>>>>>>> 3. 4.18 kernel with the blk-wbt patch + io_queue_depth=2. I
>>>>>>>>>>> understand you asked for a SCSI device queue depth of 1; however, I have NVME
>>>>>>>>>>> devices in my environment and 2 is the minimum value of io_queue_depth allowed
>>>>>>>>>>> by the NVME driver code. The results look pretty much the same, with no
>>>>>>>>>>> stalls or exceptional failures.
>>>>>>>>>>> xfs/auto ran 296-odd tests with 3 failures and 130-odd "no runs".
>>>>>>>>>>> The remaining tests passed. The "skipped tests" were mostly due to missing
>>>>>>>>>>> features (e.g. reflink support on the scratch filesystem).
>>>>>>>>>>> The failures were consistent across runs on 3 different environments.
>>>>>>>>>>> I am also running the full test suite, but it is taking a long time as I am
>>>>>>>>>>> hitting a kernel BUG in xfs code in some generic tests. This BUG is not
>>>>>>>>>>> related to the patch and I see it in the vanilla kernel too. I am in
>>>>>>>>>>> the process of excluding these tests as they come up and
>>>>>>>>>>> re-running the suite; however, this process is time-consuming.
>>>>>>>>>>> Do you have any specific tests in mind that you would like me 
>>>>>>>>>>> to run apart from what I have already tested above?
>>>>>>>>>> Thanks, I think that looks good. I'll get your patch applied for
>>>>>>>>>> 4.19.
>>>>>>>>>>
>>>>>>>>>> -- 
>>>>>>>>>> Jens Axboe
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> Hi Jens,
>>>>>>>>> Thanks for accepting this. There is one small issue: I don't find any emails
>>>>>>>>> sent by me on the lkml mailing list. I am not sure why they didn't land there;
>>>>>>>>> all I can see is your responses. Do you want one of us to resend the patch
>>>>>>>>> or will you be able to do it?
>>>>>>>> That's odd, are you getting rejections on your emails? For reference, the
>>>>>>>> patch is here:
>>>>>>>>
>>>>>>>> http://git.kernel.dk/cgit/linux-block/commit/?h=for-4.19/block&id=2887e41b910bb14fd847cf01ab7a5993db989d88
>>>>>>> One issue with this, as far as I can tell. Right now we've switched to
>>>>>>> waking one task at a time, which is obviously more efficient. But if
>>>>>>> we do that with exclusive waits, then we have to ensure that this task
>>>>>>> makes progress. If we wake up a task, and then fail to get a queueing
>>>>>>> token, then we'll go back to sleep. We need to ensure that someone makes
>>>>>>> forward progress at this point. There are two ways I can see that
>>>>>>> happening:
>>>>>>>
>>>>>>> 1) The task woken _always_ gets to queue an IO
>>>>>>> 2) If the task woken is NOT allowed to queue an IO, then it must select
>>>>>>>    a new task to wake up. That new task is then subjected to rule 1 or 2
>>>>>>>    as well.
>>>>>>>
>>>>>>> For #1, it could be as simple as:
>>>>>>>
>>>>>>> if (slept || !rwb_enabled(rwb)) {
>>>>>>> 	atomic_inc(&rqw->inflight);
>>>>>>> 	break;
>>>>>>> }
>>>>>>>
>>>>>>> but this obviously won't always be fair. Might be good enough however,
>>>>>>> instead of having to eg replace the generic wait queues with a priority
>>>>>>> list/queue.
>>>>>>>
>>>>>>> Note that this isn't an entirely new issue, it's just so much easier to
>>>>>>> hit with the single wakeups.
>>>>>>>
>>>>>> Hi Jens,
>>>>>>
>>>>>> What is the scenario that you see under which the woken up task does not
>>>>>> get to run?
>>>>> That scenario is pretty easy to hit - let's say the next in line task
>>>>> has a queue limit of 1, and we currently have 4 pending. Task gets
>>>>> woken, goes back to sleep. Which should be totally fine. At some point
>>>>> we'll get below the limit, and allow the task to proceed. This will
>>>>> ensure forward progress.
>>>>>
>>>>>> The theory behind leaving the task on the wait queue is that the
>>>>>> waitqueue_active check in wbt_wait prevents new tasks from taking up a
>>>>>> slot in the queue (e.g. incrementing inflight). So, there should not be
>>>>>> a way for inflight to be incremented between the time the wake_up is
>>>>>> done and the task at the head of the wait queue runs. That's the idea
>>>>>> anyway :-) If we missed something, let us know.
>>>>> And that's a fine theory, I think it's a good improvement (and how it
>>>>> should have worked). I'm struggling to see where the issue is. Perhaps
>>>>> it's related to the wq active check. With fewer wakeups, we're more
>>>>> likely to hit a race there.
>>>>>
>>>>> I'll poke at it...
>>>> Trying something like this:
>>>>
>>>> http://git.kernel.dk/cgit/linux-block/log/?h=for-4.19/wbt
>>>>
>>> Ah, now I see what you mean.
>>>
>>> This is the case where a task goes to sleep, not because the inflight
>>> limit has been reached, but simply because it needs to go to the back of
>>> the wait queue.
>>>
>>> In that case, it should, for its first time inside the loop, not try to
>>> decrement inflight - since that means it could still race to overtake a
>>> task that got there earlier and is in the wait queue.
>>>
>>> So what you are doing is keeping track of whether it got into the loop
>>> only because of queueing, and then you don't try to decrement inflight
>>> the first time around the loop.
>>>
>>> I think that should work to fix that corner case.
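
(A sketch of that idea, purely illustrative of the description above: a
flag remembers that we went to sleep only to keep our place in line, and
the first pass through the loop skips the inflight attempt so we cannot
overtake a waiter that queued ahead of us.)

        /* slept only to preserve FIFO order, not because we hit the limit? */
        bool fifo_sleep = waitqueue_active(&rqw->wait);

        add_wait_queue_exclusive(&rqw->wait, &wait);
        do {
                set_current_state(TASK_UNINTERRUPTIBLE);

                if (!fifo_sleep && may_queue(rwb, rqw, rw))
                        break;

                fifo_sleep = false; /* after one sleep, compete normally */
                io_schedule();
        } while (1);
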
>>
>> I hope so, got tests running now and we'll see...
>>
>> Outside of that, getting the matching memory barrier for the wq check
>> could also fix a race on the completion side.
>>
> 
> I thought all the wait_* and set_current_* and atomic_* had implicit barriers.
> Are you referring to the rwb->wb_* values we consume on the completion side?

Not waitqueue_active(), which is the one I was referring to. The additional
helper wq_has_sleeper() does.
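
(For reference, the difference, simplified from include/linux/wait.h:
waitqueue_active() is a bare lockless list check, while wq_has_sleeper()
adds the full barrier that pairs with the one the sleeper side implies via
set_current_state()/prepare_to_wait().)

static inline int waitqueue_active(struct wait_queue_head *wq_head)
{
        return !list_empty(&wq_head->head);
}

static inline bool wq_has_sleeper(struct wait_queue_head *wq_head)
{
        /* pairs with the barrier implied on the waiting side */
        smp_mb();
        return waitqueue_active(wq_head);
}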

> I was initially concerned about not dequeuing the task, but noticed that
> wake_up_common seems to handle that well. I looked for sources of missed wake
> up as well, notifying the same task twice and missing wakeups, but could
> not hit it.

It's better not to dequeue, since we want the task to stay at the head.
So I think all that makes sense, yet I can't find where it would be
missing either. The missing barrier _could_ explain it, especially
since the risk of hitting it should be higher now with single wakeups.

> FYI: We ran lock contention and the waitqueue showed up as having the
> largest contention, which disappeared after this patch.

Yeah, it's a good change for sure, we don't want everybody to wake up,
and then hammer on the lock both for the wq removal and then again when
most of them go back to sleep.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait
  2018-08-21  2:58                               ` Jens Axboe
@ 2018-08-22  3:20                                 ` Jens Axboe
  2018-08-22  4:01                                   ` Anchal Agarwal
  0 siblings, 1 reply; 38+ messages in thread
From: Jens Axboe @ 2018-08-22  3:20 UTC (permalink / raw)
  To: Balbir Singh
  Cc: van der Linden, Frank, Agarwal, Anchal, linux-block,
	linux-kernel, Wilson, Matt

On 8/20/18 8:58 PM, Jens Axboe wrote:
> On 8/20/18 4:42 PM, Balbir Singh wrote:
> [...]
>> I was initially concerned about not dequeuing the task, but noticed that
>> wake_up_common seems to handle that well. I looked for sources of missed wake
>> up as well, notifying the same task twice and missing wakeups, but could
>> not hit it.
> 
> It's better not to dequeue, since we want the task to stay at the head.
> So I think all that makes sense, yet I can't find where it would be
> missing either. The missing barrier _could_ explain it, especially
> since the risk of hitting it should be higher now with single wakeups.
> 
>> FYI: We ran lock contention and the waitqueue showed up as having the
>> largest contention, which disappeared after this patch.
> 
> Yeah, it's a good change for sure, we don't want everybody to wakeup,
> and then hammer on the lock both on wq removal and then again for
> most of them going back to sleep.

OK, I think I see it. The problem is that if a task gets woken up and
doesn't get to queue anything, it goes back to sleep. But the default
wake function has already removed it from the wait queue... So once that
happens, we're dead in the water. The problem isn't that we're now more
likely to hit the deadlock with the above change, it's that the above
change introduced this deadlock.

I'm testing a fix.

-- 
Jens Axboe
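
(One way a fix along those lines could look; this is a rough, illustrative
sketch rather than whatever was actually tested. The default wake function
is replaced with one that only dequeues and wakes the waiter once it has
secured an inflight slot, so a woken task can never be left off the queue
without a token; rq_wait_inc_below() and get_limit() are stand-ins for the
inflight accounting.)

struct wbt_wait_data {
        struct wait_queue_entry wq;
        struct task_struct *task;
        struct rq_wb *rwb;
        struct rq_wait *rqw;
        unsigned long rw;
        bool got_token;
};

static int wbt_wake_function(struct wait_queue_entry *curr, unsigned mode,
                             int wake_flags, void *key)
{
        struct wbt_wait_data *data = container_of(curr, struct wbt_wait_data, wq);

        /*
         * No slot free: leave the waiter on the queue, and return < 0 so
         * __wake_up_common() stops scanning instead of waking someone
         * further down the list out of order.
         */
        if (!rq_wait_inc_below(data->rqw, get_limit(data->rwb, data->rw)))
                return -1;

        data->got_token = true;
        list_del_init(&curr->entry);
        wake_up_process(data->task);
        return 1;
}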

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait
  2018-08-22  3:20                                 ` Jens Axboe
@ 2018-08-22  4:01                                   ` Anchal Agarwal
  2018-08-22  4:10                                     ` Jens Axboe
  0 siblings, 1 reply; 38+ messages in thread
From: Anchal Agarwal @ 2018-08-22  4:01 UTC (permalink / raw)
  To: Jens Axboe; +Cc: fllinden, sblbir, anchalag, msw, linux-block, linux-kernel

On Tue, Aug 21, 2018 at 09:20:06PM -0600, Jens Axboe wrote:
> [...]
> 
> OK, I think I see it. The problem is that if a task gets woken up and
> doesn't get to queue anything, it goes back to sleep. But the default
> wake function has already removed it from the wait queue... So once that
> happens, we're dead in the water. The problem isn't that we're now more
> likely to hit the deadlock with the above change, it's that the above
> change introduced this deadlock.
> 
> I'm testing a fix.
> 
> -- 
> Jens Axboe

Are you talking about default_wake_function? If so, then the woken task will
not be deleted from the waitqueue until after it gets scheduled; however, the
function used earlier with DEFINE_WAIT - autoremove_wake_function - does
delete the woken-up task from the waitqueue. Am I missing anything?

Thanks,
Anchal Agarwal 
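
(For reference on the question above, simplified from kernel/sched/core.c
and kernel/sched/wait.c: default_wake_function() never unlinks the entry
from the wait queue at all, it only wakes the task, leaving removal to the
waiter via remove_wait_queue()/finish_wait(); autoremove_wake_function(),
which DEFINE_WAIT uses, additionally unlinks the entry on a successful
wakeup.)

int default_wake_function(struct wait_queue_entry *curr, unsigned mode,
                          int wake_flags, void *key)
{
        return try_to_wake_up(curr->private, mode, wake_flags);
}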

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait
  2018-08-22  4:01                                   ` Anchal Agarwal
@ 2018-08-22  4:10                                     ` Jens Axboe
  2018-08-22 12:54                                       ` Holger Hoffstätte
  0 siblings, 1 reply; 38+ messages in thread
From: Jens Axboe @ 2018-08-22  4:10 UTC (permalink / raw)
  To: Anchal Agarwal; +Cc: fllinden, sblbir, msw, linux-block, linux-kernel

On 8/21/18 10:01 PM, Anchal Agarwal wrote:
> On Tue, Aug 21, 2018 at 09:20:06PM -0600, Jens Axboe wrote:
>> On 8/20/18 8:58 PM, Jens Axboe wrote:
>>> On 8/20/18 4:42 PM, Balbir Singh wrote:
>>>> On Mon, Aug 20, 2018 at 02:20:59PM -0600, Jens Axboe wrote:
>>>>> On 8/20/18 2:19 PM, van der Linden, Frank wrote:
>>>>>> On 8/20/18 12:29 PM, Jens Axboe wrote:
>>>>>>> On 8/20/18 1:08 PM, Jens Axboe wrote:
>>>>>>>> On 8/20/18 11:34 AM, van der Linden, Frank wrote:
>>>>>>>>> On 8/20/18 9:37 AM, Jens Axboe wrote:
>>>>>>>>>> On 8/7/18 3:19 PM, Jens Axboe wrote:
>>>>>>>>>>> On 8/7/18 3:12 PM, Anchal Agarwal wrote:
>>>>>>>>>>>> On Tue, Aug 07, 2018 at 02:39:48PM -0600, Jens Axboe wrote:
>>>>>>>>>>>>> On 8/7/18 2:12 PM, Anchal Agarwal wrote:
>>>>>>>>>>>>>> On Tue, Aug 07, 2018 at 08:29:44AM -0600, Jens Axboe wrote:
>>>>>>>>>>>>>>> On 8/1/18 4:09 PM, Jens Axboe wrote:
>>>>>>>>>>>>>>>> On 8/1/18 11:06 AM, Anchal Agarwal wrote:
>>>>>>>>>>>>>>>>> On Wed, Aug 01, 2018 at 09:14:50AM -0600, Jens Axboe wrote:
>>>>>>>>>>>>>>>>>> On 7/31/18 3:34 PM, Anchal Agarwal wrote:
>>>>>>>>>>>>>>>>>>> Hi folks,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> This patch modifies commit e34cbd307477a
>>>>>>>>>>>>>>>>>>> (blk-wbt: add general throttling mechanism)
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I am currently running a large bare metal instance (i3.metal)
>>>>>>>>>>>>>>>>>>> on EC2 with 72 cores, 512GB of RAM and NVME drives, with a
>>>>>>>>>>>>>>>>>>> 4.18 kernel. I have a workload that simulates a database
>>>>>>>>>>>>>>>>>>> workload and I am running into lockup issues when writeback
>>>>>>>>>>>>>>>>>>> throttling is enabled,with the hung task detector also
>>>>>>>>>>>>>>>>>>> kicking in.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Crash dumps show that most CPUs (up to 50 of them) are
>>>>>>>>>>>>>>>>>>> all trying to get the wbt wait queue lock while trying to add
>>>>>>>>>>>>>>>>>>> themselves to it in __wbt_wait (see stack traces below).
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> [    0.948118] CPU: 45 PID: 0 Comm: swapper/45 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
>>>>>>>>>>>>>>>>>>> [    0.948119] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
>>>>>>>>>>>>>>>>>>> [    0.948120] task: ffff883f7878c000 task.stack: ffffc9000c69c000
>>>>>>>>>>>>>>>>>>> [    0.948124] RIP: 0010:native_queued_spin_lock_slowpath+0xf8/0x1a0
>>>>>>>>>>>>>>>>>>> [    0.948125] RSP: 0018:ffff883f7fcc3dc8 EFLAGS: 00000046
>>>>>>>>>>>>>>>>>>> [    0.948126] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7fce2a00
>>>>>>>>>>>>>>>>>>> [    0.948128] RDX: 000000000000001c RSI: 0000000000740001 RDI: ffff887f7709ca68
>>>>>>>>>>>>>>>>>>> [    0.948129] RBP: 0000000000000002 R08: 0000000000b80000 R09: 0000000000000000
>>>>>>>>>>>>>>>>>>> [    0.948130] R10: ffff883f7fcc3d78 R11: 000000000de27121 R12: 0000000000000002
>>>>>>>>>>>>>>>>>>> [    0.948131] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
>>>>>>>>>>>>>>>>>>> [    0.948132] FS:  0000000000000000(0000) GS:ffff883f7fcc0000(0000) knlGS:0000000000000000
>>>>>>>>>>>>>>>>>>> [    0.948134] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>>>>>>>>>>>>>>> [    0.948135] CR2: 000000c424c77000 CR3: 0000000002010005 CR4: 00000000003606e0
>>>>>>>>>>>>>>>>>>> [    0.948136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>>>>>>>>>>>>>>>> [    0.948137] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>>>>>>>>>>>>>>>>> [    0.948138] Call Trace:
>>>>>>>>>>>>>>>>>>> [    0.948139]  <IRQ>
>>>>>>>>>>>>>>>>>>> [    0.948142]  do_raw_spin_lock+0xad/0xc0
>>>>>>>>>>>>>>>>>>> [    0.948145]  _raw_spin_lock_irqsave+0x44/0x4b
>>>>>>>>>>>>>>>>>>> [    0.948149]  ? __wake_up_common_lock+0x53/0x90
>>>>>>>>>>>>>>>>>>> [    0.948150]  __wake_up_common_lock+0x53/0x90
>>>>>>>>>>>>>>>>>>> [    0.948155]  wbt_done+0x7b/0xa0
>>>>>>>>>>>>>>>>>>> [    0.948158]  blk_mq_free_request+0xb7/0x110
>>>>>>>>>>>>>>>>>>> [    0.948161]  __blk_mq_complete_request+0xcb/0x140
>>>>>>>>>>>>>>>>>>> [    0.948166]  nvme_process_cq+0xce/0x1a0 [nvme]
>>>>>>>>>>>>>>>>>>> [    0.948169]  nvme_irq+0x23/0x50 [nvme]
>>>>>>>>>>>>>>>>>>> [    0.948173]  __handle_irq_event_percpu+0x46/0x300
>>>>>>>>>>>>>>>>>>> [    0.948176]  handle_irq_event_percpu+0x20/0x50
>>>>>>>>>>>>>>>>>>> [    0.948179]  handle_irq_event+0x34/0x60
>>>>>>>>>>>>>>>>>>> [    0.948181]  handle_edge_irq+0x77/0x190
>>>>>>>>>>>>>>>>>>> [    0.948185]  handle_irq+0xaf/0x120
>>>>>>>>>>>>>>>>>>> [    0.948188]  do_IRQ+0x53/0x110
>>>>>>>>>>>>>>>>>>> [    0.948191]  common_interrupt+0x87/0x87
>>>>>>>>>>>>>>>>>>> [    0.948192]  </IRQ>
>>>>>>>>>>>>>>>>>>> ....
>>>>>>>>>>>>>>>>>>> [    0.311136] CPU: 4 PID: 9737 Comm: run_linux_amd64 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
>>>>>>>>>>>>>>>>>>> [    0.311137] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
>>>>>>>>>>>>>>>>>>> [    0.311138] task: ffff883f6e6a8000 task.stack: ffffc9000f1ec000
>>>>>>>>>>>>>>>>>>> [    0.311141] RIP: 0010:native_queued_spin_lock_slowpath+0xf5/0x1a0
>>>>>>>>>>>>>>>>>>> [    0.311142] RSP: 0018:ffffc9000f1efa28 EFLAGS: 00000046
>>>>>>>>>>>>>>>>>>> [    0.311144] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7f722a00
>>>>>>>>>>>>>>>>>>> [    0.311145] RDX: 0000000000000035 RSI: 0000000000d80001 RDI: ffff887f7709ca68
>>>>>>>>>>>>>>>>>>> [    0.311146] RBP: 0000000000000202 R08: 0000000000140000 R09: 0000000000000000
>>>>>>>>>>>>>>>>>>> [    0.311147] R10: ffffc9000f1ef9d8 R11: 000000001a249fa0 R12: ffff887f7709ca68
>>>>>>>>>>>>>>>>>>> [    0.311148] R13: ffffc9000f1efad0 R14: 0000000000000000 R15: ffff887f7709ca00
>>>>>>>>>>>>>>>>>>> [    0.311149] FS:  000000c423f30090(0000) GS:ffff883f7f700000(0000) knlGS:0000000000000000
>>>>>>>>>>>>>>>>>>> [    0.311150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>>>>>>>>>>>>>>> [    0.311151] CR2: 00007feefcea4000 CR3: 0000007f7016e001 CR4: 00000000003606e0
>>>>>>>>>>>>>>>>>>> [    0.311152] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>>>>>>>>>>>>>>>> [    0.311153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>>>>>>>>>>>>>>>>> [    0.311154] Call Trace:
>>>>>>>>>>>>>>>>>>> [    0.311157]  do_raw_spin_lock+0xad/0xc0
>>>>>>>>>>>>>>>>>>> [    0.311160]  _raw_spin_lock_irqsave+0x44/0x4b
>>>>>>>>>>>>>>>>>>> [    0.311162]  ? prepare_to_wait_exclusive+0x28/0xb0
>>>>>>>>>>>>>>>>>>> [    0.311164]  prepare_to_wait_exclusive+0x28/0xb0
>>>>>>>>>>>>>>>>>>> [    0.311167]  wbt_wait+0x127/0x330
>>>>>>>>>>>>>>>>>>> [    0.311169]  ? finish_wait+0x80/0x80
>>>>>>>>>>>>>>>>>>> [    0.311172]  ? generic_make_request+0xda/0x3b0
>>>>>>>>>>>>>>>>>>> [    0.311174]  blk_mq_make_request+0xd6/0x7b0
>>>>>>>>>>>>>>>>>>> [    0.311176]  ? blk_queue_enter+0x24/0x260
>>>>>>>>>>>>>>>>>>> [    0.311178]  ? generic_make_request+0xda/0x3b0
>>>>>>>>>>>>>>>>>>> [    0.311181]  generic_make_request+0x10c/0x3b0
>>>>>>>>>>>>>>>>>>> [    0.311183]  ? submit_bio+0x5c/0x110
>>>>>>>>>>>>>>>>>>> [    0.311185]  submit_bio+0x5c/0x110
>>>>>>>>>>>>>>>>>>> [    0.311197]  ? __ext4_journal_stop+0x36/0xa0 [ext4]
>>>>>>>>>>>>>>>>>>> [    0.311210]  ext4_io_submit+0x48/0x60 [ext4]
>>>>>>>>>>>>>>>>>>> [    0.311222]  ext4_writepages+0x810/0x11f0 [ext4]
>>>>>>>>>>>>>>>>>>> [    0.311229]  ? do_writepages+0x3c/0xd0
>>>>>>>>>>>>>>>>>>> [    0.311239]  ? ext4_mark_inode_dirty+0x260/0x260 [ext4]
>>>>>>>>>>>>>>>>>>> [    0.311240]  do_writepages+0x3c/0xd0
>>>>>>>>>>>>>>>>>>> [    0.311243]  ? _raw_spin_unlock+0x24/0x30
>>>>>>>>>>>>>>>>>>> [    0.311245]  ? wbc_attach_and_unlock_inode+0x165/0x280
>>>>>>>>>>>>>>>>>>> [    0.311248]  ? __filemap_fdatawrite_range+0xa3/0xe0
>>>>>>>>>>>>>>>>>>> [    0.311250]  __filemap_fdatawrite_range+0xa3/0xe0
>>>>>>>>>>>>>>>>>>> [    0.311253]  file_write_and_wait_range+0x34/0x90
>>>>>>>>>>>>>>>>>>> [    0.311264]  ext4_sync_file+0x151/0x500 [ext4]
>>>>>>>>>>>>>>>>>>> [    0.311267]  do_fsync+0x38/0x60
>>>>>>>>>>>>>>>>>>> [    0.311270]  SyS_fsync+0xc/0x10
>>>>>>>>>>>>>>>>>>> [    0.311272]  do_syscall_64+0x6f/0x170
>>>>>>>>>>>>>>>>>>> [    0.311274]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> In the original patch, wbt_done is waking up all the exclusive
>>>>>>>>>>>>>>>>>>> processes in the wait queue, which can cause a thundering herd
>>>>>>>>>>>>>>>>>>> if there is a large number of writer threads in the queue. The
>>>>>>>>>>>>>>>>>>> original intention of the code seems to be to wake up one thread
>>>>>>>>>>>>>>>>>>> only however, it uses wake_up_all() in __wbt_done(), and then
>>>>>>>>>>>>>>>>>>> uses the following check in __wbt_wait to have only one thread
>>>>>>>>>>>>>>>>>>> actually get out of the wait loop:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> if (waitqueue_active(&rqw->wait) &&
>>>>>>>>>>>>>>>>>>>             rqw->wait.head.next != &wait->entry)
>>>>>>>>>>>>>>>>>>>                 return false;
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The problem with this is that the wait entry in wbt_wait is
>>>>>>>>>>>>>>>>>>> define with DEFINE_WAIT, which uses the autoremove wakeup function.
>>>>>>>>>>>>>>>>>>> That means that the above check is invalid - the wait entry will
>>>>>>>>>>>>>>>>>>> have been removed from the queue already by the time we hit the
>>>>>>>>>>>>>>>>>>> check in the loop.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Secondly, auto-removing the wait entries also means that the wait
>>>>>>>>>>>>>>>>>>> queue essentially gets reordered "randomly" (e.g. threads re-add
>>>>>>>>>>>>>>>>>>> themselves in the order they got to run after being woken up).
>>>>>>>>>>>>>>>>>>> Additionally, new requests entering wbt_wait might overtake requests
>>>>>>>>>>>>>>>>>>> that were queued earlier, because the wait queue will be
>>>>>>>>>>>>>>>>>>> (temporarily) empty after the wake_up_all, so the waitqueue_active
>>>>>>>>>>>>>>>>>>> check will not stop them. This can cause certain threads to starve
>>>>>>>>>>>>>>>>>>> under high load.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The fix is to leave the woken up requests in the queue and remove
>>>>>>>>>>>>>>>>>>> them in finish_wait() once the current thread breaks out of the
>>>>>>>>>>>>>>>>>>> wait loop in __wbt_wait. This will ensure new requests always
>>>>>>>>>>>>>>>>>>> end up at the back of the queue, and they won't overtake requests
>>>>>>>>>>>>>>>>>>> that are already in the wait queue. With that change, the loop
>>>>>>>>>>>>>>>>>>> in wbt_wait is also in line with many other wait loops in the kernel.
>>>>>>>>>>>>>>>>>>> Waking up just one thread drastically reduces lock contention, as
>>>>>>>>>>>>>>>>>>> does moving the wait queue add/remove out of the loop.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> A significant drop in lockdep's lock contention numbers is seen when
>>>>>>>>>>>>>>>>>>> running the test application on the patched kernel.
>>>>>>>>>>>>>>>>>> I like the patch, and a few weeks ago we independently discovered that
>>>>>>>>>>>>>>>>>> the waitqueue list checking was bogus as well. My only worry is that
>>>>>>>>>>>>>>>>>> changes like this can be delicate, meaning that it's easy to introduce
>>>>>>>>>>>>>>>>>> stall conditions. What kind of testing did you push this through?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>>>>>> Jens Axboe
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I ran the following tests on both real HW with NVME devices attached
>>>>>>>>>>>>>>>>> and emulated NVME too:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 1. The test case I used to reproduce the issue, spawns a bunch of threads 
>>>>>>>>>>>>>>>>>    to concurrently read and write files with random size and content. 
>>>>>>>>>>>>>>>>>    Files are randomly fsync'd. The implementation is a FIFO queue of files. 
>>>>>>>>>>>>>>>>>    When the queue fills the test starts to verify and remove the files. This 
>>>>>>>>>>>>>>>>>    test will fail if there's a read, write, or hash check failure. It tests
>>>>>>>>>>>>>>>>>    for file corruption when lots of small files are being read and written 
>>>>>>>>>>>>>>>>>    with high concurrency.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 2. Fio for random writes with a root NVME device of 200GB
>>>>>>>>>>>>>>>>>   
>>>>>>>>>>>>>>>>>   fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite --bs=4k 
>>>>>>>>>>>>>>>>>   --direct=0 --size=10G --numjobs=2 --runtime=60 --group_reporting
>>>>>>>>>>>>>>>>>   
>>>>>>>>>>>>>>>>>   fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite --bs=4k
>>>>>>>>>>>>>>>>>   --direct=0 --size=5G --numjobs=2 --runtime=30 --fsync=64 --group_reporting
>>>>>>>>>>>>>>>>>   
>>>>>>>>>>>>>>>>>   I did see an improvement in the bandwidth numbers reported on the patched
>>>>>>>>>>>>>>>>>   kernel. 
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Do you have any test case/suite in mind that you would suggest me to 
>>>>>>>>>>>>>>>>> run to be sure that patch does not introduce any stall conditions?
>>>>>>>>>>>>>>>> One thing that is always useful is to run xfstest, do a full run on
>>>>>>>>>>>>>>>> the device. If that works, then do another full run, this time limiting
>>>>>>>>>>>>>>>> the queue depth of the SCSI device to 1. If both of those pass, then
>>>>>>>>>>>>>>>> I'd feel pretty good getting this applied for 4.19.
>>>>>>>>>>>>>>> Did you get a chance to run this full test?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>>> Jens Axboe
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Jens,
>>>>>>>>>>>>>> Yes I did run the tests and was in the process of compiling concrete results
>>>>>>>>>>>>>> I tested following environments against xfs/auto group
>>>>>>>>>>>>>> 1. Vanilla 4.18.rc kernel
>>>>>>>>>>>>>> 2. 4.18 kernel with the blk-wbt patch
>>>>>>>>>>>>>> 3. 4.18 kernel with the blk-wbt patch + io_queue_depth=2. I 
>>>>>>>>>>>>>> understand you asked for queue depth for SCSI device=1 however, I have NVME 
>>>>>>>>>>>>>> devices in my environment and 2 is the minimum value for io_queue_depth allowed 
>>>>>>>>>>>>>> according to the NVME driver code. The results pretty much look same with no 
>>>>>>>>>>>>>> stalls or exceptional failures. 
>>>>>>>>>>>>>> xfs/auto ran 296 odd tests with 3 failures and 130 something "no runs". 
>>>>>>>>>>>>>> Remaining tests passed. "Skipped tests"  were mostly due to missing features
>>>>>>>>>>>>>> (eg: reflink support on scratch filesystem)
>>>>>>>>>>>>>> The failures were consistent across runs on 3 different environments. 
>>>>>>>>>>>>>> I am also running full test suite but it is taking long time as I am 
>>>>>>>>>>>>>> hitting kernel BUG in xfs code in some generic tests. This BUG is not 
>>>>>>>>>>>>>> related to the patch and  I see them in vanilla kernel too. I am in 
>>>>>>>>>>>>>> the process of excluding these kind of tests as they come and 
>>>>>>>>>>>>>> re-run the suite however, this proces is time taking. 
>>>>>>>>>>>>>> Do you have any specific tests in mind that you would like me 
>>>>>>>>>>>>>> to run apart from what I have already tested above?
>>>>>>>>>>>>> Thanks, I think that looks good. I'll get your patch applied for
>>>>>>>>>>>>> 4.19.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -- 
>>>>>>>>>>>>> Jens Axboe
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>> Hi Jens,
>>>>>>>>>>>> Thanks for accepting this. There is one small issue, I don't find any emails
>>>>>>>>>>>> send by me on the lkml mailing list. I am not sure why it didn't land there,
>>>>>>>>>>>> all I can see is your responses. Do you want one of us to resend the patch
>>>>>>>>>>>> or will you be able to do it?
>>>>>>>>>>> That's odd, are you getting rejections on your emails? For reference, the
>>>>>>>>>>> patch is here:
>>>>>>>>>>>
>>>>>>>>>>> http://git.kernel.dk/cgit/linux-block/commit/?h=for-4.19/block&id=2887e41b910bb14fd847cf01ab7a5993db989d88
>>>>>>>>>> One issue with this, as far as I can tell. Right now we've switched to
>>>>>>>>>> waking one task at the time, which is obviously more efficient. But if
>>>>>>>>>> we do that with exclusive waits, then we have to ensure that this task
>>>>>>>>>> makes progress. If we wake up a task, and then fail to get a queueing
>>>>>>>>>> token, then we'll go back to sleep. We need to ensure that someone makes
>>>>>>>>>> forward progress at this point. There are two ways I can see that
>>>>>>>>>> happening:
>>>>>>>>>>
>>>>>>>>>> 1) The task woken _always_ gets to queue an IO
>>>>>>>>>> 2) If the task woken is NOT allowed to queue an IO, then it must select
>>>>>>>>>>    a new task to wake up. That new task is then subjected to rule 1 or 2
>>>>>>>>>>    as well.
>>>>>>>>>>
>>>>>>>>>> For #1, it could be as simple as:
>>>>>>>>>>
>>>>>>>>>> if (slept || !rwb_enabled(rwb)) {
>>>>>>>>>> 	atomic_inc(&rqw->inflight);
>>>>>>>>>> 	break;
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> but this obviously won't always be fair. Might be good enough however,
>>>>>>>>>> instead of having to eg replace the generic wait queues with a priority
>>>>>>>>>> list/queue.
>>>>>>>>>>
>>>>>>>>>> Note that this isn't an entirely new issue, it's just so much easier to
>>>>>>>>>> hit with the single wakeups.
>>>>>>>>>>
>>>>>>>>> Hi Jens,
>>>>>>>>>
>>>>>>>>> What is the scenario that you see under which the woken up task does not
>>>>>>>>> get to run?
>>>>>>>> That scenario is pretty easy to hit - let's say the next in line task
>>>>>>>> has a queue limit of 1, and we currently have 4 pending. Task gets
>>>>>>>> woken, goes back to sleep. Which should be totally fine. At some point
>>>>>>>> we'll get below the limit, and allow the task to proceed. This will
>>>>>>>> ensure forward progress.
>>>>>>>>
>>>>>>>>> The theory behind leaving the task on the wait queue is that the
>>>>>>>>> waitqueue_active check in wbt_wait prevents new tasks from taking up a
>>>>>>>>> slot in the queue (e.g. incrementing inflight). So, there should not be
>>>>>>>>> a way for inflight to be incremented between the time the wake_up is
>>>>>>>>> done and the task at the head of the wait queue runs. That's the idea
>>>>>>>>> anyway :-) If we missed something, let us know.
>>>>>>>> And that's a fine theory, I think it's a good improvement (and how it
>>>>>>>> should have worked). I'm struggling to see where the issue is. Perhaps
>>>>>>>> it's related to the wq active check. With fewer wakeups, we're more
>>>>>>>> likely to hit a race there.
>>>>>>>>
>>>>>>>> I'll poke at it...
>>>>>>> Trying something like this:
>>>>>>>
>>>>>>> http://git.kernel.dk/cgit/linux-block/log/?h=for-4.19/wbt
>>>>>>>
>>>>>> Ah, now I see what you mean.
>>>>>>
>>>>>> This is the case where a task goes to sleep, not because the inflight
>>>>>> limit has been reached, but simply because it needs to go to the back of
>>>>>> the wait queue.
>>>>>>
>>>>>> In that case, it should, for its first time inside the loop, not try to
>>>>>> decrement inflight - since that means it could still race to overtake a
>>>>>> task that got there earlier and is in the wait queue.
>>>>>>
>>>>>> So what you are doing is keeping track of whether it got in to the loop
>>>>>> only because of queueing, and then you don't try to decrement inflight
>>>>>> the first time around the loop.
>>>>>>
>>>>>> I think that should work to fix that corner case.
>>>>>
>>>>> I hope so, got tests running now and we'll see...
>>>>>
>>>>> Outside of that, getting the matching memory barrier for the wq check
>>>>> could also fix a race on the completion side.
>>>>>
>>>>
>>>> I thought all the wait_* and set_current_* and atomic_* had implicit barriers.
>>>> Are you referring to the rwb->wb_* values we consume on the completion side?
>>>
>>> Not waitqueue_active(), which is the one I was referring to. The additional
>>> helper wq_has_sleeper() does.
>>>
>>>> I was initially concerned about not dequeuing the task, but noticed that
>>>> wake_up_common seems to handle that well. I looked for sources of missed wake
>>>> up as well, notifying the same task twice and missing wakeups, but could
>>>> not hit it.
>>>
>>> It's better not to dequeue, since we want the task to stay at the head.
>>> So I think all that makes sense, yet I can't find where it would be
>>> missing either. The missing barrier _could_ explain it, especially
>>> since the risk of hitting it should be higher now with single wakeups.
>>>
>>>> FYI: We ran lock contention and the waitqueue showed up as having the
>>>> largest contention, which disappeared after this patch.
>>>
>>> Yeah, it's a good change for sure, we don't want everybody to wakeup,
>>> and then hammer on the lock both on wq removal and then again for
>>> most of them going back to sleep.
>>
>> OK, I think I see it. The problem is that if a task gets woken up and
>> doesn't get to queue anything, it goes back to sleep. But the default
>> wake function has already removed it from the wait queue... So once that
>> happens, we're dead in the water. The problem isn't that we're now more
>> likely to hit the deadlock with the above change, it's that the above
>> change introduced this deadlock.
>>
>> I'm testing a fix.
>>
>> -- 
>> Jens Axboe
> 
> Are you talking about default_wake_function? If so, the woken task will
> not be deleted from the waitqueue until after it gets scheduled. However,
> the earlier function used in DEFINE_WAIT - autoremove_wake_function -
> does delete the woken-up task from the waitqueue. Am I missing anything?

The problem is actually in a backport of it, since it didn't do the
proper wait queue func change, hence it was still using
autoremove_wake_function. You are right that in mainline it looks fine.

If you have time, please look at the 3 patches I posted earlier today.
Those are for mainline, so should be OK :-)

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait
  2018-08-22  4:10                                     ` Jens Axboe
@ 2018-08-22 12:54                                       ` Holger Hoffstätte
  2018-08-22 14:27                                         ` Jens Axboe
  0 siblings, 1 reply; 38+ messages in thread
From: Holger Hoffstätte @ 2018-08-22 12:54 UTC (permalink / raw)
  To: Jens Axboe, Anchal Agarwal
  Cc: fllinden, sblbir, msw, linux-block, linux-kernel

On 08/22/18 06:10, Jens Axboe wrote:
> [...]
> If you have time, please look at the 3 patches I posted earlier today.
> Those are for mainline, so should be OK :-)

I'm just playing along at home but with those 3 I get repeatable
hangs & writeback not starting at all, but curiously *only* on my btrfs
device; for inexplicable reasons some other devices with ext4/xfs flush
properly. Yes, that surprised me too, but it's repeatable.
Now this may or may not have something to do with some of my in-testing
patches for btrfs itself, but if I remove those 3 wbt fixes, everything
is golden again. Not eager to repeat since it hangs sync & requires a
hard reboot.. :(
Just thought you'd like to know.

-h

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait
  2018-08-22 12:54                                       ` Holger Hoffstätte
@ 2018-08-22 14:27                                         ` Jens Axboe
  2018-08-22 16:42                                             ` van der Linden, Frank
  2018-08-22 17:28                                           ` Jens Axboe
  0 siblings, 2 replies; 38+ messages in thread
From: Jens Axboe @ 2018-08-22 14:27 UTC (permalink / raw)
  To: Holger Hoffstätte, Anchal Agarwal
  Cc: fllinden, sblbir, msw, linux-block, linux-kernel

On 8/22/18 6:54 AM, Holger Hoffstätte wrote:
> On 08/22/18 06:10, Jens Axboe wrote:
>> [...]
>> If you have time, please look at the 3 patches I posted earlier today.
>> Those are for mainline, so should be OK :-)
> 
> I'm just playing along at home but with those 3 I get repeatable
> hangs & writeback not starting at all, but curiously *only* on my btrfs
> device; for inexplicable reasons some other devices with ext4/xfs flush
> properly. Yes, that surprised me too, but it's repeatable.
> Now this may or may not have something to do with some of my in-testing
> patches for btrfs itself, but if I remove those 3 wbt fixes, everything
> is golden again. Not eager to repeat since it hangs sync & requires a
> hard reboot.. :(
> Just thought you'd like to know.

Thanks, that's very useful info! I'll see if I can reproduce that.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait
@ 2018-08-22 16:42                                             ` van der Linden, Frank
  0 siblings, 0 replies; 38+ messages in thread
From: van der Linden, Frank @ 2018-08-22 16:42 UTC (permalink / raw)
  To: Jens Axboe, Holger Hoffstätte, Agarwal, Anchal
  Cc: Singh, Balbir, Wilson, Matt, linux-block, linux-kernel

On 8/22/18 7:27 AM, Jens Axboe wrote:
> On 8/22/18 6:54 AM, Holger Hoffstätte wrote:
>> On 08/22/18 06:10, Jens Axboe wrote:
>>> [...]
>>> If you have time, please look at the 3 patches I posted earlier today.
>>> Those are for mainline, so should be OK :-)
>> I'm just playing along at home but with those 3 I get repeatable
>> hangs & writeback not starting at all, but curiously *only* on my btrfs
>> device; for inexplicable reasons some other devices with ext4/xfs flush
>> properly. Yes, that surprised me too, but it's repeatable.
>> Now this may or may not have something to do with some of my in-testing
>> patches for btrfs itself, but if I remove those 3 wbt fixes, everything
>> is golden again. Not eager to repeat since it hangs sync & requires a
>> hard reboot.. :(
>> Just thought you'd like to know.
> Thanks, that's very useful info! I'll see if I can reproduce that.
>
I think it might be useful to kind of give a dump of what we discussed
before this patch was sent, there was a little more than was in the
description.

We saw hangs and heavy lock contention in the wbt code under a specific
workload, on XFS. Crash dump analysis showed the following issues:

1) wbt_done uses wake_up_all, which causes a thundering herd
2) __wbt_wait sets up a wait queue with the auto remove wake function
(via DEFINE_WAIT), which caused two problems:
   * combined with the use of wake_up_all, the wait queue would
essentially be randomly reordered for tasks that did not get to run
   * the waitqueue_active check in may_queue was not valid with the auto
remove function, which could lead incoming tasks with requests to
overtake existing requests

1) was fixed by using a plain wake_up
2) was fixed by keeping tasks on the queue until they could break out of
the wait loop in __wbt_wait


The random reordering, causing task starvation in __wbt_wait, was the
main problem. Simply not using the auto remove wait function, e.g.
*only* changing DEFINE_WAIT(wait) to DEFINE_WAIT_FUNC(wait,
default_wake_function), fixed the hang / task starvation issue in our
tests. But there was still more lock contention than there should be, so
we also changed wake_up_all to wake_up.
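
In code terms the shape of that change in __wbt_wait is roughly the
following (simplified sketch, not the exact patch; may_queue() stands for
the existing inflight/head-of-queue check, and lock is the one optionally
passed down to wbt_wait):

	DEFINE_WAIT_FUNC(wait, default_wake_function);

	prepare_to_wait_exclusive(&rqw->wait, &wait, TASK_UNINTERRUPTIBLE);
	do {
		if (may_queue(rwb, rqw, &wait, rw))
			break;			/* got a queueing slot */

		if (lock) {
			spin_unlock_irq(lock);
			io_schedule();
			spin_lock_irq(lock);
		} else
			io_schedule();

		set_current_state(TASK_UNINTERRUPTIBLE);
	} while (1);
	finish_wait(&rqw->wait, &wait);		/* only now leave the queue */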

It might be useful to run your tests with only the DEFINE_WAIT change I
describe above added to the original code to see if that still has any
problems. That would give a good datapoint whether any remaining issues
are due to missed wakeups or not.

There is the issue of making forward progress, or at least making it
fast enough. With the changes as they stand now, you could come up with
a scenario where the throttling limit is hit, but then is raised. Since
there might still be a wait queue, you could end up putting each
incoming task to sleep, even though it's not needed.

One way to guarantee that the wait queue clears up as fast as possible,
without resorting to wake_up_all, is to use wake_up_nr, where the number
of tasks to wake up is (limit - inflight).
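
As a sketch of what that could look like (hypothetical - the helper name
and the use of wb_normal as the limit are illustration only, not existing
code):

	static void wbt_done_wake(struct rq_wb *rwb, struct rq_wait *rqw)
	{
		int inflight = atomic_dec_return(&rqw->inflight);
		int limit = rwb->wb_normal;	/* whichever limit applies */

		if (inflight < limit && wq_has_sleeper(&rqw->wait))
			wake_up_nr(&rqw->wait, limit - inflight);
	}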

Also, to avoid having tasks going back to sleep in the loop, you could
do what you already proposed - always just sleep at most once, and
unconditionally proceed after waking up. To avoid incoming tasks
overtaking the ones that are being woken up, you could have wbt_done
increment inflight, effectively reserving a spot for the tasks that are
about to be woken up.

Another thing I thought about was recording the number of waiters in the
wait queue, and modify the check from (inflight < limit) to (inflight <
(limit - nwaiters)), and no longer use any waitqueue_active checks.
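
As a rough sketch of that idea (hypothetical - an nwaiters counter does
not exist today and would have to be maintained as wait entries are added
and removed):

	/* submission path, sketch only */
	int budget = (int)limit - atomic_read(&rqw->nwaiters);

	if (budget > 0 && rq_wait_inc_below(rqw, budget))
		return;			/* got a slot without queueing */

	atomic_inc(&rqw->nwaiters);	/* we are about to sleep */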

The condition checks are of course complicated by the fact that
condition manipulation is not always done under the same lock (e.g.
wbt_wait can be called with a NULL lock).


So, these are just some of the things to consider here - maybe there's
nothing in there that you hadn't already considered, but I thought it'd
be useful to summarize them.

Thanks for looking in to this!

Frank



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait
  2018-08-22 14:27                                         ` Jens Axboe
  2018-08-22 16:42                                             ` van der Linden, Frank
@ 2018-08-22 17:28                                           ` Jens Axboe
  2018-08-22 19:12                                             ` Holger Hoffstätte
  1 sibling, 1 reply; 38+ messages in thread
From: Jens Axboe @ 2018-08-22 17:28 UTC (permalink / raw)
  To: Holger Hoffstätte, Anchal Agarwal
  Cc: fllinden, sblbir, msw, linux-block, linux-kernel

On 8/22/18 8:27 AM, Jens Axboe wrote:
> On 8/22/18 6:54 AM, Holger Hoffstätte wrote:
>> On 08/22/18 06:10, Jens Axboe wrote:
>>> [...]
>>> If you have time, please look at the 3 patches I posted earlier today.
>>> Those are for mainline, so should be OK :-)
>>
>> I'm just playing along at home but with those 3 I get repeatable
>> hangs & writeback not starting at all, but curiously *only* on my btrfs
>> device; for inexplicable reasons some other devices with ext4/xfs flush
>> properly. Yes, that surprised me too, but it's repeatable.
>> Now this may or may not have something to do with some of my in-testing
>> patches for btrfs itself, but if I remove those 3 wbt fixes, everything
>> is golden again. Not eager to repeat since it hangs sync & requires a
>> hard reboot.. :(
>> Just thought you'd like to know.
> 
> Thanks, that's very useful info! I'll see if I can reproduce that.

Any chance you can try with and see which patch is causing the issue?
I can't reproduce it here, seems solid.

Either that, or a reproducer would be great...

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait
  2018-08-22 16:42                                             ` van der Linden, Frank
@ 2018-08-22 17:30                                             ` Jens Axboe
  2018-08-22 20:26                                               ` Anchal Agarwal
  -1 siblings, 1 reply; 38+ messages in thread
From: Jens Axboe @ 2018-08-22 17:30 UTC (permalink / raw)
  To: van der Linden, Frank, Holger Hoffstätte, Agarwal, Anchal
  Cc: Singh, Balbir, Wilson, Matt, linux-block, linux-kernel

On 8/22/18 10:42 AM, van der Linden, Frank wrote:
> On 8/22/18 7:27 AM, Jens Axboe wrote:
>> On 8/22/18 6:54 AM, Holger Hoffstätte wrote:
>>> On 08/22/18 06:10, Jens Axboe wrote:
>>>> [...]
>>>> If you have time, please look at the 3 patches I posted earlier today.
>>>> Those are for mainline, so should be OK :-)
>>> I'm just playing along at home but with those 3 I get repeatable
>>> hangs & writeback not starting at all, but curiously *only* on my btrfs
>>> device; for inexplicable reasons some other devices with ext4/xfs flush
>>> properly. Yes, that surprised me too, but it's repeatable.
>>> Now this may or may not have something to do with some of my in-testing
>>> patches for btrfs itself, but if I remove those 3 wbt fixes, everything
>>> is golden again. Not eager to repeat since it hangs sync & requires a
>>> hard reboot.. :(
>>> Just thought you'd like to know.
>> Thanks, that's very useful info! I'll see if I can reproduce that.
>>
> I think it might be useful to kind of give a dump of what we discussed
> before this patch was sent, there was a little more than was in the
> description.
> 
> We saw hangs and heavy lock contention in the wbt code under a specific
> workload, on XFS. Crash dump analysis showed the following issues:
> 
> 1) wbt_done uses wake_up_all, which causes a thundering herd
> 2) __wbt_wait sets up a wait queue with the auto remove wake function
> (via DEFINE_WAIT), which caused two problems:
>    * combined with the use of wake_up_all, the wait queue would
> essentially be randomly reordered for tasks that did not get to run
>    * the waitqueue_active check in may_queue was not valid with the auto
> remove function, which could lead incoming tasks with requests to
> overtake existing requests
> 
> 1) was fixed by using a plain wake_up
> 2) was fixed by keeping tasks on the queue until they could break out of
> the wait loop in __wbt_wait
> 
> 
> The random reordering, causing task starvation in __wbt_wait, was the
> main problem. Simply not using the auto remove wait function, e.g.
> *only* changing DEFINE_WAIT(wait) to DEFINE_WAIT_FUNC(wait,
> default_wake_function), fixed the hang / task starvation issue in our
> tests. But there was still more lock contention than there should be, so
> we also changed wake_up_all to wake_up.
> 
> It might be useful to run your tests with only the DEFINE_WAIT change I
> describe above added to the original code to see if that still has any
> problems. That would give a good datapoint whether any remaining issues
> are due to missed wakeups or not.
> 
> There is the issue of making forward progress, or at least making it
> fast enough. With the changes as they stand now, you could come up with
> a scenario where the throttling limit is hit, but then is raised. Since
> there might still be a wait queue, you could end up putting each
> incoming task to sleep, even though it's not needed.
> 
> One way to guarantee that the wait queue clears up as fast as possible,
> without resorting to wakeup_all, is to use wakeup_nr, where the number
> of tasks to wake up is (limit - inflight).
> 
> Also, to avoid having tasks going back to sleep in the loop, you could
> do what you already proposed - always just sleep at most once, and
> unconditionally proceed after waking up. To avoid incoming tasks
> overtaking the ones that are being woken up, you could have wbt_done
> increment inflight, effectively reserving a spot for the tasks that are
> about to be woken up.
> 
> Another thing I thought about was recording the number of waiters in the
> wait queue, and modify the check from (inflight < limit) to (inflight <
> (limit - nwaiters)), and no longer use any waitqueue_active checks.
> 
> The condition checks are of course complicated by the fact that
> condition manipulation is not always done under the same lock (e.g.
> wbt_wait can be called with a NULL lock).
> 
> 
> So, these are just some of the things to consider here - maybe there's
> nothing in there that you hadn't already considered, but I thought it'd
> be useful to summarize them.
> 
> Thanks for looking in to this!

It turned out to be an unrelated problem with rq reordering in blk-mq,
mainline doesn't have it.

So I think the above change is safe and fine, but we definitely still
want the extra change of NOT allowing a queue token for the initial loop
inside __wbt_wait() for when we have current sleepers on the queue.
Without that, the initial check in __wbt_wait() is not useful at all.
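
In rough terms the idea is (sketch only, not the final patch; the wait
queue add/remove around the loop stays as described above):

	bool has_sleeper = wq_has_sleeper(&rqw->wait);

	do {
		/*
		 * Don't take a queueing token on the first pass if we only
		 * went to sleep because others were already waiting - the
		 * head of the queue gets first shot at it.
		 */
		if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
			break;

		io_schedule();
		has_sleeper = false;
	} while (1);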


-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait
  2018-08-22 17:28                                           ` Jens Axboe
@ 2018-08-22 19:12                                             ` Holger Hoffstätte
  2018-08-22 19:17                                               ` Jens Axboe
  0 siblings, 1 reply; 38+ messages in thread
From: Holger Hoffstätte @ 2018-08-22 19:12 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block

On 08/22/18 19:28, Jens Axboe wrote:
> On 8/22/18 8:27 AM, Jens Axboe wrote:
>> On 8/22/18 6:54 AM, Holger Hoffstätte wrote:
>>> On 08/22/18 06:10, Jens Axboe wrote:
>>>> [...]
>>>> If you have time, please look at the 3 patches I posted earlier today.
>>>> Those are for mainline, so should be OK :-)
>>>
>>> I'm just playing along at home but with those 3 I get repeatable
>>> hangs & writeback not starting at all, but curiously *only* on my btrfs
>>> device; for inexplicable reasons some other devices with ext4/xfs flush
>>> properly. Yes, that surprised me too, but it's repeatable.
>>> Now this may or may not have something to do with some of my in-testing
>>> patches for btrfs itself, but if I remove those 3 wbt fixes, everything
>>> is golden again. Not eager to repeat since it hangs sync & requires a
>>> hard reboot.. :(
>>> Just thought you'd like to know.
>>
>> Thanks, that's very useful info! I'll see if I can reproduce that.
> 
> Any chance you can try with and see which patch is causing the issue?
> I can't reproduce it here, seems solid.
> 
> Either that, or a reproducer would be great...

It's a hacked up custom tree but the following things have emerged so far:

- it's not btrfs.

- it also happens with ext4.

- I first suspected bfq on a nonrotational device disabling WBT by default,
but using deadline didn't help either. Can't even mkfs.ext4.

- I suspect - but do not know - that using xfs everywhere else is the
reason I got lucky, because xfs. :D

- it immediately happens with only the first patch
("move disable check into get_limit()")

So the obvious suspect is the new return of UINT_MAX from get_limit() to
__wbt_wait(). I first suspected that I mispatched something, but it's all
like in mainline or your tree. Even the recently moved-around atomic loop
inside rq_wait_inc_below() is 1:1 the same and looks like it should.
Now building mainline and see where that leads me.

cheers,
Holger

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait
  2018-08-22 19:12                                             ` Holger Hoffstätte
@ 2018-08-22 19:17                                               ` Jens Axboe
  2018-08-22 19:37                                                 ` Holger Hoffstätte
  0 siblings, 1 reply; 38+ messages in thread
From: Jens Axboe @ 2018-08-22 19:17 UTC (permalink / raw)
  To: Holger Hoffstätte; +Cc: linux-block

On 8/22/18 1:12 PM, Holger Hoffstätte wrote:
> On 08/22/18 19:28, Jens Axboe wrote:
>> On 8/22/18 8:27 AM, Jens Axboe wrote:
>>> On 8/22/18 6:54 AM, Holger Hoffstätte wrote:
>>>> On 08/22/18 06:10, Jens Axboe wrote:
>>>>> [...]
>>>>> If you have time, please look at the 3 patches I posted earlier today.
>>>>> Those are for mainline, so should be OK :-)
>>>>
>>>> I'm just playing along at home but with those 3 I get repeatable
>>>> hangs & writeback not starting at all, but curiously *only* on my btrfs
>>>> device; for inexplicable reasons some other devices with ext4/xfs flush
>>>> properly. Yes, that surprised me too, but it's repeatable.
>>>> Now this may or may not have something to do with some of my in-testing
>>>> patches for btrfs itself, but if I remove those 3 wbt fixes, everything
>>>> is golden again. Not eager to repeat since it hangs sync & requires a
>>>> hard reboot.. :(
>>>> Just thought you'd like to know.
>>>
>>> Thanks, that's very useful info! I'll see if I can reproduce that.
>>
>> Any chance you can try with and see which patch is causing the issue?
>> I can't reproduce it here, seems solid.
>>
>> Either that, or a reproducer would be great...
> 
> It's a hacked up custom tree but the following things have emerged so far:
> 
> - it's not btrfs.
> 
> - it also happens with ext4.
> 
> - I first suspected bfq on a nonrotational device disabling WBT by default,
> but using deadline didn't help either. Can't even mkfs.ext4.
> 
> - I suspect - but do not know - that using xfs everywhere else is the
> reason I got lucky, because xfs. :D
> 
> - it immediately happens with only the first patch
> ("move disable check into get_limit()")
> 
> So the obvious suspect is the new return of UINT_MAX from get_limit() to
> __wbt_wait(). I first suspected that I mispatched something, but it's all
> like in mainline or your tree. Even the recently moved-around atomic loop
> inside rq_wait_inc_below() is 1:1 the same and looks like it should.
> Now building mainline and see where that leads me.

I wonder if it's a signedness thing? Can you try and see if using INT_MAX
instead changes anything?

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait
  2018-08-22 19:17                                               ` Jens Axboe
@ 2018-08-22 19:37                                                 ` Holger Hoffstätte
  2018-08-22 19:46                                                   ` Jens Axboe
  0 siblings, 1 reply; 38+ messages in thread
From: Holger Hoffstätte @ 2018-08-22 19:37 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block

On 08/22/18 21:17, Jens Axboe wrote:
>> So the obvious suspect is the new return of UINT_MAX from get_limit() to
>> __wbt_wait(). I first suspected that I mispatched something, but it's all
>> like in mainline or your tree. Even the recently moved-around atomic loop
>> inside rq_wait_inc_below() is 1:1 the same and looks like it should.
>> Now building mainline and see where that leads me.

So mainline + your tree's last 4 patches works fine, as suspected.
It's all me, as usual.

> I wonder if it's a signedness thing? Can you try and see if using INT_MAX
> instead changes anything?

Beat me to it while I was rebooting ;-)
Exactly what I also found a minute ago:

$diff -rup linux-4.18.4/block/blk-rq-qos.c linux/block/blk-rq-qos.c
..
-bool rq_wait_inc_below(struct rq_wait *rq_wait, int limit)
+bool rq_wait_inc_below(struct rq_wait *rq_wait, unsigned int limit)
..

Moo! Patching now.

cheers
Holger

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait
  2018-08-22 19:37                                                 ` Holger Hoffstätte
@ 2018-08-22 19:46                                                   ` Jens Axboe
  2018-08-22 19:58                                                     ` Holger Hoffstätte
  0 siblings, 1 reply; 38+ messages in thread
From: Jens Axboe @ 2018-08-22 19:46 UTC (permalink / raw)
  To: Holger Hoffstätte; +Cc: linux-block

On 8/22/18 1:37 PM, Holger Hoffstätte wrote:
> On 08/22/18 21:17, Jens Axboe wrote:
>>> So the obvious suspect is the new return of UINT_MAX from get_limit() to
>>> __wbt_wait(). I first suspected that I mispatched something, but it's all
>>> like in mainline or your tree. Even the recently moved-around atomic loop
>>> inside rq_wait_inc_below() is 1:1 the same and looks like it should.
>>> Now building mainline and see where that leads me.
> 
> So mainline + your tree's last 4 patches works fine, as suspected.
> It's all me, as usual.

That's a relief!

>> I wonder if it's a signedness thing? Can you try and see if using INT_MAX
>> instead changes anything?
> 
> Beat me to it while I was rebooting ;-)
> Exactly what I also found a minute ago:
> 
> $diff -rup linux-4.18.4/block/blk-rq-qos.c linux/block/blk-rq-qos.c
> ..
> -bool rq_wait_inc_below(struct rq_wait *rq_wait, int limit)
> +bool rq_wait_inc_below(struct rq_wait *rq_wait, unsigned int limit)
> ..
> 
> Moo! Patching now.

At least we have an explanation for why it didn't work.
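
To spell it out, the failure mode is roughly this (simplified sketch; the
real code is rq_wait_inc_below()/atomic_inc_below() in block/blk-rq-qos.c,
cmpxchg retry loop omitted):

	static bool inc_below(atomic_t *v, int below)	/* signed limit: the bug */
	{
		int cur = atomic_read(v);

		/*
		 * get_limit() now returns UINT_MAX for "wbt disabled"; stuffed
		 * into a signed int that becomes -1, so cur >= below is always
		 * true and no request ever gets a queueing token.
		 */
		if (cur >= below)
			return false;

		return atomic_cmpxchg(v, cur, cur + 1) == cur;
	}

Making the depth/limit comparisons unsigned avoids that.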

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait
  2018-08-22 19:46                                                   ` Jens Axboe
@ 2018-08-22 19:58                                                     ` Holger Hoffstätte
  0 siblings, 0 replies; 38+ messages in thread
From: Holger Hoffstätte @ 2018-08-22 19:58 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block

On 08/22/18 21:46, Jens Axboe wrote:
> On 8/22/18 1:37 PM, Holger Hoffstätte wrote:
>> On 08/22/18 21:17, Jens Axboe wrote:
>>>> So the obvious suspect is the new return of UINT_MAX from get_limit() to
>>>> __wbt_wait(). I first suspected that I mispatched something, but it's all
>>>> like in mainline or your tree. Even the recently moved-around atomic loop
>>>> inside rq_wait_inc_below() is 1:1 the same and looks like it should.
>>>> Now building mainline and see where that leads me.
>>
>> So mainline + your tree's last 4 patches works fine, as suspected.
>> It's all me, as usual.
> 
> That's a relief!
> 
>>> I wonder if it's a signedness thing? Can you try and see if using INT_MAX
>>> instead changes anything?
>>
>> Beat me to it while I was rebooting ;-)
>> Exactly what I also found a minute ago:
>>
>> $diff -rup linux-4.18.4/block/blk-rq-qos.c linux/block/blk-rq-qos.c
>> ..
>> -bool rq_wait_inc_below(struct rq_wait *rq_wait, int limit)
>> +bool rq_wait_inc_below(struct rq_wait *rq_wait, unsigned int limit)
>> ..
>>
>> Moo! Patching now.
> 
> At least we have an explanation for why it didn't work.

Luckily I can still read and 22f17952c7 is conveniently called "blk-rq-qos:
make depth comparisons unsigned"! Needless to say with that thrown into the
mix all is good again. Sorry for the confusion & thanks for your patience.

cheers!
Holger

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait
  2018-08-22 17:30                                             ` Jens Axboe
@ 2018-08-22 20:26                                               ` Anchal Agarwal
  2018-08-22 21:05                                                 ` Jens Axboe
  0 siblings, 1 reply; 38+ messages in thread
From: Anchal Agarwal @ 2018-08-22 20:26 UTC (permalink / raw)
  To: Jens Axboe
  Cc: fllinden, sblbir, holger, msw, linux-block, linux-kernel, anchalag

On Wed, Aug 22, 2018 at 11:30:40AM -0600, Jens Axboe wrote:
> On 8/22/18 10:42 AM, van der Linden, Frank wrote:
> > On 8/22/18 7:27 AM, Jens Axboe wrote:
> >>> On 8/22/18 6:54 AM, Holger Hoffstätte wrote:
> >>> On 08/22/18 06:10, Jens Axboe wrote:
> >>>> [...]
> >>>> If you have time, please look at the 3 patches I posted earlier today.
> >>>> Those are for mainline, so should be OK :-)
> >>> I'm just playing along at home but with those 3 I get repeatable
> >>> hangs & writeback not starting at all, but curiously *only* on my btrfs
> >>> device; for inexplicable reasons some other devices with ext4/xfs flush
> >>> properly. Yes, that surprised me too, but it's repeatable.
> >>> Now this may or may not have something to do with some of my in-testing
> >>> patches for btrfs itself, but if I remove those 3 wbt fixes, everything
> >>> is golden again. Not eager to repeat since it hangs sync & requires a
> >>> hard reboot.. :(
> >>> Just thought you'd like to know.
> >> Thanks, that's very useful info! I'll see if I can reproduce that.
> >>
> > I think it might be useful to kind of give a dump of what we discussed
> > before this patch was sent, there was a little more than was in the
> > description.
> > 
> > We saw hangs and heavy lock contention in the wbt code under a specific
> > workload, on XFS. Crash dump analysis showed the following issues:
> > 
> > 1) wbt_done uses wake_up_all, which causes a thundering herd
> > 2) __wbt_wait sets up a wait queue with the auto remove wake function
> > (via DEFINE_WAIT), which caused two problems:
> >    * combined with the use of wake_up_all, the wait queue would
> > essentially be randomly reordered for tasks that did not get to run
> >    * the waitqueue_active check in may_queue was not valid with the auto
> > remove function, which could lead incoming tasks with requests to
> > overtake existing requests
> > 
> > 1) was fixed by using a plain wake_up
> > 2) was fixed by keeping tasks on the queue until they could break out of
> > the wait loop in __wbt_wait
> > 
> > 
> > The random reordering, causing task starvation in __wbt_wait, was the
> > main problem. Simply not using the auto remove wait function, e.g.
> > *only* changing DEFINE_WAIT(wait) to DEFINE_WAIT_FUNC(wait,
> > default_wake_function), fixed the hang / task starvation issue in our
> > tests. But there was still more lock contention than there should be, so
> > we also changed wake_up_all to wake_up.
> > 
> > It might be useful to run your tests with only the DEFINE_WAIT change I
> > describe above added to the original code to see if that still has any
> > problems. That would give a good datapoint whether any remaining issues
> > are due to missed wakeups or not.
> > 
> > There is the issue of making forward progress, or at least making it
> > fast enough. With the changes as they stand now, you could come up with
> > a scenario where the throttling limit is hit, but then is raised. Since
> > there might still be a wait queue, you could end up putting each
> > incoming task to sleep, even though it's not needed.
> > 
> > One way to guarantee that the wait queue clears up as fast as possible,
> > without resorting to wakeup_all, is to use wakeup_nr, where the number
> > of tasks to wake up is (limit - inflight).
> > 
> > Also, to avoid having tasks going back to sleep in the loop, you could
> > do what you already proposed - always just sleep at most once, and
> > unconditionally proceed after waking up. To avoid incoming tasks
> > overtaking the ones that are being woken up, you could have wbt_done
> > increment inflight, effectively reserving a spot for the tasks that are
> > about to be woken up.
> > 
> > Another thing I thought about was recording the number of waiters in the
> > wait queue, and modify the check from (inflight < limit) to (inflight <
> > (limit - nwaiters)), and no longer use any waitqueue_active checks.
> > 
> > The condition checks are of course complicated by the fact that
> > condition manipulation is not always done under the same lock (e.g.
> > wbt_wait can be called with a NULL lock).
> > 
> > 
> > So, these are just some of the things to consider here - maybe there's
> > nothing in there that you hadn't already considered, but I thought it'd
> > be useful to summarize them.
> > 
> > Thanks for looking in to this!
> 
> It turned out to be an unrelated problem with rq reordering in blk-mq,
> mainline doesn't have it.
> 
> So I think the above change is safe and fine, but we definitely still
> want the extra change of NOT allowing a queue token for the initial loop
> inside __wbt_wait() for when we have current sleepers on the queue.
> Without that, the initial check in __wbt_wait() is not useful at all.
> 
> 
> -- 
> Jens Axboe
> 
>

Hi Jens,
I tested your patches in my environment and they look good. There is no sudden increase in 
lock contention either. Thanks for catching the corner case though.

--
Anchal Agarwal

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait
  2018-08-22 20:26                                               ` Anchal Agarwal
@ 2018-08-22 21:05                                                 ` Jens Axboe
  0 siblings, 0 replies; 38+ messages in thread
From: Jens Axboe @ 2018-08-22 21:05 UTC (permalink / raw)
  To: Anchal Agarwal; +Cc: fllinden, sblbir, holger, msw, linux-block, linux-kernel

On 8/22/18 2:26 PM, Anchal Agarwal wrote:
> On Wed, Aug 22, 2018 at 11:30:40AM -0600, Jens Axboe wrote:
>> On 8/22/18 10:42 AM, van der Linden, Frank wrote:
>>> On 8/22/18 7:27 AM, Jens Axboe wrote:
>>>>> On 8/22/18 6:54 AM, Holger Hoffstätte wrote:
>>>>> On 08/22/18 06:10, Jens Axboe wrote:
>>>>>> [...]
>>>>>> If you have time, please look at the 3 patches I posted earlier today.
>>>>>> Those are for mainline, so should be OK :-)
>>>>> I'm just playing along at home but with those 3 I get repeatable
>>>>> hangs & writeback not starting at all, but curiously *only* on my btrfs
>>>>> device; for inexplicable reasons some other devices with ext4/xfs flush
>>>>> properly. Yes, that surprised me too, but it's repeatable.
>>>>> Now this may or may not have something to do with some of my in-testing
>>>>> patches for btrfs itself, but if I remove those 3 wbt fixes, everything
>>>>> is golden again. Not eager to repeat since it hangs sync & requires a
>>>>> hard reboot.. :(
>>>>> Just thought you'd like to know.
>>>> Thanks, that's very useful info! I'll see if I can reproduce that.
>>>>
>>> I think it might be useful to kind of give a dump of what we discussed
>>> before this patch was sent, there was a little more than was in the
>>> description.
>>>
>>> We saw hangs and heavy lock contention in the wbt code under a specific
>>> workload, on XFS. Crash dump analysis showed the following issues:
>>>
>>> 1) wbt_done uses wake_up_all, which causes a thundering herd
>>> 2) __wbt_wait sets up a wait queue with the auto remove wake function
>>> (via DEFINE_WAIT), which caused two problems:
>>>    * combined with the use of wake_up_all, the wait queue would
>>> essentially be randomly reordered for tasks that did not get to run
>>>    * the waitqueue_active check in may_queue was not valid with the auto
>>> remove function, which could lead incoming tasks with requests to
>>> overtake existing requests
>>>
>>> 1) was fixed by using a plain wake_up
>>> 2) was fixed by keeping tasks on the queue until they could break out of
>>> the wait loop in __wbt_wait
>>>
>>>
>>> The random reordering, causing task starvation in __wbt_wait, was the
>>> main problem. Simply not using the auto remove wait function, e.g.
>>> *only* changing DEFINE_WAIT(wait) to DEFINE_WAIT_FUNC(wait,
>>> default_wake_function), fixed the hang / task starvation issue in our
>>> tests. But there was still more lock contention than there should be, so
>>> we also changed wake_up_all to wake_up.
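>>>
>>> To make that concrete, here is a minimal sketch of the two changes
>>> (structure and helper names are from memory and may differ from the
>>> actual tree):
>>>
>>>     /* __wbt_wait(): keep the task on the waitqueue until it is done
>>>      * waiting, instead of letting the wakeup auto-remove it */
>>>     DEFINE_WAIT_FUNC(wait, default_wake_function); /* was DEFINE_WAIT(wait) */
>>>
>>>     do {
>>>             prepare_to_wait_exclusive(&rqw->wait, &wait, TASK_UNINTERRUPTIBLE);
>>>             if (may_queue(rwb, rqw, &wait, rw))
>>>                     break;
>>>             io_schedule();
>>>     } while (1);
>>>     finish_wait(&rqw->wait, &wait);
>>>
>>>     /* __wbt_done(): wake a single waiter instead of the whole queue */
>>>     if (waitqueue_active(&rqw->wait))
>>>             wake_up(&rqw->wait);            /* was wake_up_all(&rqw->wait) */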
>>>
>>> It might be useful to run your tests with only the DEFINE_WAIT change I
>>> describe above added to the original code to see if that still has any
>>> problems. That would give a good datapoint whether any remaining issues
>>> are due to missed wakeups or not.
>>>
>>> There is the issue of making forward progress, or at least making it
>>> fast enough. With the changes as they stand now, you could come up with
>>> a scenario where the throttling limit is hit, but then is raised. Since
>>> there might still be a wait queue, you could end up putting each
>>> incoming task to sleep, even though it's not needed.
>>>
>>> One way to guarantee that the wait queue clears up as fast as possible,
>>> without resorting to wake_up_all, is to use wake_up_nr, where the number
>>> of tasks to wake up is (limit - inflight).
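>>>
>>> Roughly (treating rqw->inflight and get_limit() as placeholder names):
>>>
>>>     int limit = get_limit(rwb, rw);
>>>     int inflight = atomic_read(&rqw->inflight);
>>>
>>>     /* wake just enough sleepers to use up the available slots */
>>>     if (inflight < limit && waitqueue_active(&rqw->wait))
>>>             wake_up_nr(&rqw->wait, limit - inflight);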
>>>
>>> Also, to avoid having tasks going back to sleep in the loop, you could
>>> do what you already proposed - always just sleep at most once, and
>>> unconditionally proceed after waking up. To avoid incoming tasks
>>> overtaking the ones that are being woken up, you could have wbt_done
>>> increment inflight, effectively reserving a spot for the tasks that are
>>> about to be woken up.
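>>>
>>> Sketched on the completion side (again, approximate names):
>>>
>>>     /* __wbt_done(): if someone is waiting, keep the token and hand it
>>>      * to the waiter we wake; otherwise release it as usual */
>>>     if (waitqueue_active(&rqw->wait))
>>>             wake_up(&rqw->wait);            /* woken task keeps this slot */
>>>     else
>>>             atomic_dec(&rqw->inflight);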
>>>
>>> Another thing I thought about was recording the number of waiters in the
>>> wait queue, and modify the check from (inflight < limit) to (inflight <
>>> (limit - nwaiters)), and no longer use any waitqueue_active checks.
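>>>
>>> I.e. something along these lines, with a hypothetical per-queue waiter
>>> count next to rqw->inflight:
>>>
>>>     atomic_inc(&rqw->nwaiters);             /* on entering __wbt_wait() */
>>>
>>>     /* admission test, replacing the waitqueue_active() check */
>>>     bool can_queue = atomic_read(&rqw->inflight) <
>>>                      get_limit(rwb, rw) - atomic_read(&rqw->nwaiters);
>>>
>>>     atomic_dec(&rqw->nwaiters);             /* once we stop waiting */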
>>>
>>> The condition checks are of course complicated by the fact that
>>> condition manipulation is not always done under the same lock (e.g.
>>> wbt_wait can be called with a NULL lock).
>>>
>>>
>>> So, these are just some of the things to consider here - maybe there's
>>> nothing in there that you hadn't already considered, but I thought it'd
>>> be useful to summarize them.
>>>
>>> Thanks for looking into this!
>>
>> It turned out to be an unrelated problem with rq reordering in blk-mq;
>> mainline doesn't have it.
>>
>> So I think the above change is safe and fine, but we definitely still
>> want the extra change of NOT allowing a queue token for the initial loop
>> inside __wbt_wait() for when we have current sleepers on the queue.
>> Without that, the initial check in __wbt_wait() is not useful at all.
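>>
>> Roughly (helper names approximate):
>>
>>     /* only take the lockless fast path if nobody is already sleeping
>>      * on the queue; otherwise fall through to the wait loop */
>>     if (!waitqueue_active(&rqw->wait) &&
>>         atomic_inc_below(&rqw->inflight, get_limit(rwb, rw)))
>>             return;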
>>
>>
>> -- 
>> Jens Axboe
>>
>>
> 
> Hi Jens,
> I tested your patches in my environment and they look good. There is no sudden increase in 
> lock contention either. Thanks for catching the corner case though.

Thanks for testing. Can I add your tested-by to the 3 patches?

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2018-08-22 21:05 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-31 21:34 [PATCH] blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait Anchal Agarwal
2018-07-31 22:02 ` Anchal Agarwal
2018-08-01 15:14 ` Jens Axboe
2018-08-01 17:06   ` Anchal Agarwal
2018-08-01 22:09     ` Jens Axboe
2018-08-07 14:29       ` Jens Axboe
2018-08-07 20:12         ` Anchal Agarwal
2018-08-07 20:39           ` Jens Axboe
2018-08-07 21:12             ` Anchal Agarwal
2018-08-07 21:19               ` Jens Axboe
2018-08-07 22:06                 ` Anchal Agarwal
2018-08-20 16:36                 ` Jens Axboe
2018-08-20 17:34                   ` van der Linden, Frank
2018-08-20 17:34                     ` van der Linden, Frank
2018-08-20 19:08                     ` Jens Axboe
2018-08-20 19:29                       ` Jens Axboe
2018-08-20 20:19                         ` van der Linden, Frank
2018-08-20 20:19                           ` van der Linden, Frank
2018-08-20 20:20                           ` Jens Axboe
2018-08-20 22:42                             ` Balbir Singh
2018-08-21  2:58                               ` Jens Axboe
2018-08-22  3:20                                 ` Jens Axboe
2018-08-22  4:01                                   ` Anchal Agarwal
2018-08-22  4:10                                     ` Jens Axboe
2018-08-22 12:54                                       ` Holger Hoffstätte
2018-08-22 14:27                                         ` Jens Axboe
2018-08-22 16:42                                           ` van der Linden, Frank
2018-08-22 16:42                                             ` van der Linden, Frank
2018-08-22 17:30                                             ` Jens Axboe
2018-08-22 20:26                                               ` Anchal Agarwal
2018-08-22 21:05                                                 ` Jens Axboe
2018-08-22 17:28                                           ` Jens Axboe
2018-08-22 19:12                                             ` Holger Hoffstätte
2018-08-22 19:17                                               ` Jens Axboe
2018-08-22 19:37                                                 ` Holger Hoffstätte
2018-08-22 19:46                                                   ` Jens Axboe
2018-08-22 19:58                                                     ` Holger Hoffstätte
2018-08-07 21:28               ` Matt Wilson
