[PATCH 1/1] [RFC] workqueue: fix ghost PENDING flag while doing MQ IO

* [PATCH 1/1] [RFC] workqueue: fix ghost PENDING flag while doing MQ IO
@ 2016-04-25 15:22 Roman Pen
  2016-04-25 15:48 ` Tejun Heo
  0 siblings, 1 reply; 15+ messages in thread
From: Roman Pen @ 2016-04-25 15:22 UTC (permalink / raw)
  Cc: Roman Pen, Tejun Heo, Jens Axboe, linux-block, linux-kernel

Hi,

This is RFC, because for last couple of days I hunt a mystery bug and
since now I do not have a strong feeling that the following story is
nothing but bug's trick and attempt to cheat me.

[Sorry, the whole story and explanation are quite long]

The bug is reproduced quite often on our server AMD Opteron(tm) 6386 SE
with 64 cpus and 8 NUMA nodes on 4.4.5 kernel.  What I observe is stalled
IO request in MQ ctx->rq_list with the following backtrace:

[  601.347452] INFO: task kworker/u129:5:1636 blocked for more than 120 seconds.
[  601.347574]       Tainted: G           O    4.4.5-1-storage+ #6
[  601.347651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  601.348142] kworker/u129:5  D ffff880803077988     0  1636      2 0x00000000
[  601.348519] Workqueue: ibnbd_server_fileio_wq ibnbd_dev_file_submit_io_worker [ibnbd_server]
[  601.348999]  ffff880803077988 ffff88080466b900 ffff8808033f9c80 ffff880803078000
[  601.349662]  ffff880807c95000 7fffffffffffffff ffffffff815b0920 ffff880803077ad0
[  601.350333]  ffff8808030779a0 ffffffff815b01d5 0000000000000000 ffff880803077a38
[  601.350965] Call Trace:
[  601.351203]  [<ffffffff815b0920>] ? bit_wait+0x60/0x60
[  601.351444]  [<ffffffff815b01d5>] schedule+0x35/0x80
[  601.351709]  [<ffffffff815b2dd2>] schedule_timeout+0x192/0x230
[  601.351958]  [<ffffffff812d43f7>] ? blk_flush_plug_list+0xc7/0x220
[  601.352208]  [<ffffffff810bd737>] ? ktime_get+0x37/0xa0
[  601.352446]  [<ffffffff815b0920>] ? bit_wait+0x60/0x60
[  601.352688]  [<ffffffff815af784>] io_schedule_timeout+0xa4/0x110
[  601.352951]  [<ffffffff815b3a4e>] ? _raw_spin_unlock_irqrestore+0xe/0x10
[  601.353196]  [<ffffffff815b093b>] bit_wait_io+0x1b/0x70
[  601.353440]  [<ffffffff815b056d>] __wait_on_bit+0x5d/0x90
[  601.353689]  [<ffffffff81127bd0>] wait_on_page_bit+0xc0/0xd0
[  601.353958]  [<ffffffff81096db0>] ? autoremove_wake_function+0x40/0x40
[  601.354200]  [<ffffffff81127cc4>] __filemap_fdatawait_range+0xe4/0x140
[  601.354441]  [<ffffffff81127d34>] filemap_fdatawait_range+0x14/0x30
[  601.354688]  [<ffffffff81129a9f>] filemap_write_and_wait_range+0x3f/0x70
[  601.354932]  [<ffffffff811ced3b>] blkdev_fsync+0x1b/0x50
[  601.355193]  [<ffffffff811c82d9>] vfs_fsync_range+0x49/0xa0
[  601.355432]  [<ffffffff811cf45a>] blkdev_write_iter+0xca/0x100
[  601.355679]  [<ffffffff81197b1a>] __vfs_write+0xaa/0xe0
[  601.355925]  [<ffffffff81198379>] vfs_write+0xa9/0x1a0
[  601.356164]  [<ffffffff811c59d8>] kernel_write+0x38/0x50

The underlying device is a null_blk, with default parameters:

root@pserver8:~# cat /sys/module/null_blk/parameters/queue_mode
2 # MQ
root@pserver8:~# cat /sys/module/null_blk/parameters/submit_queues
1 # number of hw queues
root@pserver8:~# cat /sys/module/null_blk/parameters/irqmode
1 # softirq

Verification that nullb0 has something inflight:

root@pserver8:~# cat /sys/block/nullb0/inflight
       0        1
root@pserver8:~# find /sys/block/nullb0/mq/0/cpu* -name rq_list -print -exec cat {} \;
...
/sys/block/nullb0/mq/0/cpu2/rq_list
CTX pending:
        ffff8838038e2400
...

GDB shows that request is alive, nothing is corrupted and all bitmaps
are valid:

root@pserver8:~# gdb /usr/src/linux-4.4.5-1-storage/vmlinux /proc/kcore
(gdb) p (struct request *)0xffff8838038e2400
$1 = (struct request *) 0xffff8838038e2400
(gdb) p *$1
...
  atomic_flags = 1,
  cpu = -1,
  __data_len = 98304,
  __sector = 192210432,
...
  bio = 0xffff8807ef64d000,
  biotail = 0xffff8807ef64df00,
  start_time = 4294963642,
...
(gdb) p $1->mq_ctx
$3 = (struct blk_mq_ctx *) 0xffffe8c7e0c9d300
(gdb) p $1->mq_ctx->cpu
$4 = 2
(gdb) p $1->q->queue_hw_ctx[$1->q->mq_map[2]]
$5 = (struct blk_mq_hw_ctx *) 0xffff8807f1967c00
(gdb) p $5->ctx_map
$6 = {
  size = 8,
  bits_per_word = 8,
  map = 0xffff8807f1bb6200
}
(gdb) p *$5->ctx_map.map@8
$18 = {{
    word = 4,
    depth = 8
  }, {
    word = 0,
    depth = 8
  }, {
    word = 0,
    depth = 8
  }, {
    word = 0,
    depth = 8
  }, {
    word = 0,
    depth = 8
  }, {
    word = 0,
    depth = 8
  }, {
    word = 0,
    depth = 8
  }, {
    word = 0,
    depth = 8
  }}
(gdb) p &$5->ctxs[0 + 2]->rq_list
$23 = (struct list_head *) 0xffffe8c7e0c9d308
(gdb) p $5->ctxs[0 + 2]->rq_list
$24 = {
  next = 0xffff8838038e2400,
  prev = 0xffff8838038e2400           <<<<< Our pended request
}

During debug it became clear that stalled request is always inserted in
the rq_list from the following path:

   save_stack_trace_tsk + 34
   blk_mq_insert_requests + 231
   blk_mq_flush_plug_list + 281
   blk_flush_plug_list + 199
   wait_on_page_bit + 192
   __filemap_fdatawait_range + 228
   filemap_fdatawait_range + 20
   filemap_write_and_wait_range + 63
   blkdev_fsync + 27
   vfs_fsync_range + 73
   blkdev_write_iter + 202
   __vfs_write + 170
   vfs_write + 169
   kernel_write + 56

So blk_flush_plug_list() was called with from_schedule == true.

If from_schedule is true, that means that finally blk_mq_insert_requests()
offloads execution of __blk_mq_run_hw_queue() and uses kblockd workqueue,
i.e. it calls kblockd_schedule_delayed_work_on().

That can mean, that we race with another CPU, which is about to execute
__blk_mq_run_hw_queue() work.

Ok, then I decided to trace deeper and to understand does the race exist,
and if yes, is that because of a probable pending hctx->run_work.

For further tracing I used 4 new events:

 1. insert of a request in a rq_list
 2. hctx marked as pending
 3. did kblockd_schedule_delayed_work_on() succeed or not
 4. actual execution of flush_busy_ctxs()

and inserted these events as in hunks below:

  @@ -690,6 +690,8 @@ static void flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list)
        struct blk_mq_ctx *ctx;
        int i;

 +       trace_hctx_flush_busy_ctxs(hctx);
 +
         for (i = 0; i < hctx->ctx_map.size; i++) {
                struct blk_align_bitmap *bm = &hctx->ctx_map.map[i];

  @@ -872,8 +876,9 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
                  put_cpu();
          }

 -       kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
 -                       &hctx->run_work, 0);
 +       queued = kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
 +                                                 &hctx->run_work, 0);
 +       trace_hctx_delayed_work_queued(hctx, queued);

  @@ -1047,8 +1052,10 @@ static void blk_mq_insert_requests(struct request_queue *q,
                 list_del_init(&rq->queuelist);
                 rq->mq_ctx = ctx;
                 __blk_mq_insert_req_list(hctx, ctx, rq, false);
 +               trace_hctx_insert_req_list(hctx, rq);
         }
         blk_mq_hctx_mark_pending(hctx, ctx);
 +       trace_hctx_mark_pending(hctx);
         spin_unlock(&ctx->lock);

I have to say, that I could not use kernel trace events framework
(i.e. this one include/trace/events), because bug of course immediately
disappears.

I wrote my own static lockless ring buffer with capacity for 64 events.

But I could not use any sorts of atomics - bug disappears.

My ring buffer was implemented per CPU, but then I had to sort all the
events out using some proper global accurate clock.

But I could not use sched_clock() - bug disappears.

I chose raw rdtsc(), and again, not rdtsc_ordered(), but unordered
variant. rdtsc_ordered() again "fixes" the bug.  So frankly I am not
sure can I trust rdtsc(), which again can be reordered, but ok, no
other option.

So here is the sorted output of the tracer of another stalled request:

  rdtsc=849688091004, cpu= 6, INS_REQ    hctx=ffff8800d46c0000, rq=ffff884000343600
  rdtsc=849688091304, cpu= 6, MARK_PEND  hctx=ffff8800d46c0000
  rdtsc=849688099595, cpu= 6, KBLK_SCHED hctx=ffff8800d46c0000, queued=1
  rdtsc=849688106102, cpu= 6, FLUSH_BUSY hctx=ffff8800d46c0000
  rdtsc=849688109586, cpu= 2, INS_REQ    hctx=ffff8800d46c0000, rq=ffff884000343cc0
  rdtsc=849688109957, cpu= 2, MARK_PEND  hctx=ffff8800d46c0000
  rdtsc=849688110921, cpu= 2, KBLK_SCHED hctx=ffff8800d46c0000, queued=0

I see always the same picture for all runs I did (of course CPUs number
and rq addresses can vary):

  CPU#6                                CPU#2
  reqeust ffff884000343600 inserted
  hctx marked as pended
  kblockd_schedule...() returns 1
  <schedule to kblockd workqueue>
  *** WORK_STRUCT_PENDING_BIT is cleared ***
  flush_busy_ctxs() is executed
                                       reqeust ffff884000343cc0 inserted
                                       hctx marked as pended
                                       kblockd_schedule...() returns 0
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                                       WTF?

As a result ffff884000343cc0 request pended forever.

According to the trace output I see that another CPU _always_ observes
WORK_STRUCT_PENDING_BIT as set for that hctx->run_work, even it was
cleared on another CPU.

Checking the workqueue.c code I see that clearing the bit is nothing
more, but atomic_long_set(), which is <mov> instruction. This
function:

  static inline void set_work_data()

In attempt to "fix" the mystery I replaced atomic_long_set() call with
atomic_long_xchg(), which is <xchg> instruction.

The problem has gone.

For me it looks that test_and_set_bit() (<lock btsl> instruction) does
not require flush of all CPU caches, which can be dirty after executing
of <mov> on another CPU.  But <xchg> really updates the memory and the
following execution of <lock btsl> observes that bit was cleared.

As a conculusion I can say, that I am lucky enough and can reproduce
this bug in several minutes on a specific load (I tried many other
simple loads using fio, even using btrecord/btreplay, no success).
And that easy reproduction on a specific load gives me some freedom
to test and then to be sure, that problem has gone.

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 kernel/workqueue.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 450c21f..60cd6db 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -621,7 +621,7 @@ static inline void set_work_data(struct work_struct *work, unsigned long data,
 				 unsigned long flags)
 {
 	WARN_ON_ONCE(!work_pending(work));
-	atomic_long_set(&work->data, data | flags | work_static(work));
+	atomic_long_xchg(&work->data, data | flags | work_static(work));
 }
 
 static void set_work_pwq(struct work_struct *work, struct pool_workqueue *pwq,
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 15+ messages in thread