linux-kernel.vger.kernel.org archive mirror
* [PATCH 1/1] [RFC] workqueue: fix ghost PENDING flag while doing MQ IO
@ 2016-04-25 15:22 Roman Pen
  2016-04-25 15:48 ` Tejun Heo
  0 siblings, 1 reply; 15+ messages in thread
From: Roman Pen @ 2016-04-25 15:22 UTC (permalink / raw)
  Cc: Roman Pen, Tejun Heo, Jens Axboe, linux-block, linux-kernel

Hi,

This is an RFC because, after hunting a mystery bug for the last couple
of days, I still cannot be sure that the following story is not just the
bug's trick and an attempt to cheat me.

[Sorry, the whole story and explanation are quite long]

The bug is reproduced quite often on our server, an AMD Opteron(tm) 6386 SE
with 64 CPUs and 8 NUMA nodes, on a 4.4.5 kernel.  What I observe is a
stalled IO request in the MQ ctx->rq_list with the following backtrace:

[  601.347452] INFO: task kworker/u129:5:1636 blocked for more than 120 seconds.
[  601.347574]       Tainted: G           O    4.4.5-1-storage+ #6
[  601.347651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  601.348142] kworker/u129:5  D ffff880803077988     0  1636      2 0x00000000
[  601.348519] Workqueue: ibnbd_server_fileio_wq ibnbd_dev_file_submit_io_worker [ibnbd_server]
[  601.348999]  ffff880803077988 ffff88080466b900 ffff8808033f9c80 ffff880803078000
[  601.349662]  ffff880807c95000 7fffffffffffffff ffffffff815b0920 ffff880803077ad0
[  601.350333]  ffff8808030779a0 ffffffff815b01d5 0000000000000000 ffff880803077a38
[  601.350965] Call Trace:
[  601.351203]  [<ffffffff815b0920>] ? bit_wait+0x60/0x60
[  601.351444]  [<ffffffff815b01d5>] schedule+0x35/0x80
[  601.351709]  [<ffffffff815b2dd2>] schedule_timeout+0x192/0x230
[  601.351958]  [<ffffffff812d43f7>] ? blk_flush_plug_list+0xc7/0x220
[  601.352208]  [<ffffffff810bd737>] ? ktime_get+0x37/0xa0
[  601.352446]  [<ffffffff815b0920>] ? bit_wait+0x60/0x60
[  601.352688]  [<ffffffff815af784>] io_schedule_timeout+0xa4/0x110
[  601.352951]  [<ffffffff815b3a4e>] ? _raw_spin_unlock_irqrestore+0xe/0x10
[  601.353196]  [<ffffffff815b093b>] bit_wait_io+0x1b/0x70
[  601.353440]  [<ffffffff815b056d>] __wait_on_bit+0x5d/0x90
[  601.353689]  [<ffffffff81127bd0>] wait_on_page_bit+0xc0/0xd0
[  601.353958]  [<ffffffff81096db0>] ? autoremove_wake_function+0x40/0x40
[  601.354200]  [<ffffffff81127cc4>] __filemap_fdatawait_range+0xe4/0x140
[  601.354441]  [<ffffffff81127d34>] filemap_fdatawait_range+0x14/0x30
[  601.354688]  [<ffffffff81129a9f>] filemap_write_and_wait_range+0x3f/0x70
[  601.354932]  [<ffffffff811ced3b>] blkdev_fsync+0x1b/0x50
[  601.355193]  [<ffffffff811c82d9>] vfs_fsync_range+0x49/0xa0
[  601.355432]  [<ffffffff811cf45a>] blkdev_write_iter+0xca/0x100
[  601.355679]  [<ffffffff81197b1a>] __vfs_write+0xaa/0xe0
[  601.355925]  [<ffffffff81198379>] vfs_write+0xa9/0x1a0
[  601.356164]  [<ffffffff811c59d8>] kernel_write+0x38/0x50

The underlying device is a null_blk, with default parameters:

root@pserver8:~# cat /sys/module/null_blk/parameters/queue_mode
2 # MQ
root@pserver8:~# cat /sys/module/null_blk/parameters/submit_queues
1 # number of hw queues
root@pserver8:~# cat /sys/module/null_blk/parameters/irqmode
1 # softirq

Verification that nullb0 has something in flight:

root@pserver8:~# cat /sys/block/nullb0/inflight
       0        1
root@pserver8:~# find /sys/block/nullb0/mq/0/cpu* -name rq_list -print -exec cat {} \;
...
/sys/block/nullb0/mq/0/cpu2/rq_list
CTX pending:
        ffff8838038e2400
...

GDB shows that the request is alive, nothing is corrupted, and all the
bitmaps are valid:

root@pserver8:~# gdb /usr/src/linux-4.4.5-1-storage/vmlinux /proc/kcore
(gdb) p (struct request *)0xffff8838038e2400
$1 = (struct request *) 0xffff8838038e2400
(gdb) p *$1
...
  atomic_flags = 1,
  cpu = -1,
  __data_len = 98304,
  __sector = 192210432,
...
  bio = 0xffff8807ef64d000,
  biotail = 0xffff8807ef64df00,
  start_time = 4294963642,
...
(gdb) p $1->mq_ctx
$3 = (struct blk_mq_ctx *) 0xffffe8c7e0c9d300
(gdb) p $1->mq_ctx->cpu
$4 = 2
(gdb) p $1->q->queue_hw_ctx[$1->q->mq_map[2]]
$5 = (struct blk_mq_hw_ctx *) 0xffff8807f1967c00
(gdb) p $5->ctx_map
$6 = {
  size = 8,
  bits_per_word = 8,
  map = 0xffff8807f1bb6200
}
(gdb) p *$5->ctx_map.map@8
$18 = {{
    word = 4,
    depth = 8
  }, {
    word = 0,
    depth = 8
  }, {
    word = 0,
    depth = 8
  }, {
    word = 0,
    depth = 8
  }, {
    word = 0,
    depth = 8
  }, {
    word = 0,
    depth = 8
  }, {
    word = 0,
    depth = 8
  }, {
    word = 0,
    depth = 8
  }}
(gdb) p &$5->ctxs[0 + 2]->rq_list
$23 = (struct list_head *) 0xffffe8c7e0c9d308
(gdb) p $5->ctxs[0 + 2]->rq_list
$24 = {
  next = 0xffff8838038e2400,
  prev = 0xffff8838038e2400           <<<<< Our pending request
}

During debugging it became clear that the stalled request is always
inserted into the rq_list from the following path:

   save_stack_trace_tsk + 34
   blk_mq_insert_requests + 231
   blk_mq_flush_plug_list + 281
   blk_flush_plug_list + 199
   wait_on_page_bit + 192
   __filemap_fdatawait_range + 228
   filemap_fdatawait_range + 20
   filemap_write_and_wait_range + 63
   blkdev_fsync + 27
   vfs_fsync_range + 73
   blkdev_write_iter + 202
   __vfs_write + 170
   vfs_write + 169
   kernel_write + 56

So blk_flush_plug_list() was called with from_schedule == true.

If from_schedule is true, blk_mq_insert_requests() offloads the
execution of __blk_mq_run_hw_queue() to the kblockd workqueue, i.e. it
calls kblockd_schedule_delayed_work_on().

That means we may race with another CPU, which is about to execute the
__blk_mq_run_hw_queue() work.
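
Roughly, the contract this race revolves around can be sketched in
userspace C11 terms (a sketch only; the *_stub names are mine and merely
approximate the kernel code): only the caller that wins
WORK_STRUCT_PENDING_BIT really queues the work, and a caller that gets 0
back must be able to rely on the already-pending instance seeing its
prior updates.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <assert.h>

#define WORK_STRUCT_PENDING_BIT 0

/* Stand-in for the work->data word of hctx->run_work. */
static atomic_ulong work_data_stub;

/* Queueing side: an atomic test-and-set of PENDING.  Only the winner
 * actually puts the work on the workqueue; a loser returns false and
 * must be able to count on the pending instance running after (and
 * therefore seeing) its updates. */
static bool queue_work_stub(void)
{
	unsigned long bit = 1UL << WORK_STRUCT_PENDING_BIT;

	if (atomic_fetch_or(&work_data_stub, bit) & bit)
		return false;	/* already PENDING: queueing "fails" */
	return true;		/* we own the queueing */
}

/* Dispatch side: the clear is a plain store with no barrier, which is
 * what atomic_long_set() compiles to. */
static void clear_pending_stub(void)
{
	atomic_store_explicit(&work_data_stub, 0, memory_order_relaxed);
}
```

Single-threaded the protocol behaves as expected; the trouble only
starts when the relaxed clear races with a test-and-set on another CPU.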

OK, so I decided to trace deeper, to understand whether the race exists
and, if yes, whether it is caused by a pending hctx->run_work.

For further tracing I used 4 new events:

 1. insertion of a request into an rq_list
 2. hctx marked as pending
 3. whether kblockd_schedule_delayed_work_on() succeeded or not
 4. the actual execution of flush_busy_ctxs()

and inserted these events as shown in the hunks below:

  @@ -690,6 +690,8 @@ static void flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list)
        struct blk_mq_ctx *ctx;
        int i;

 +       trace_hctx_flush_busy_ctxs(hctx);
 +
         for (i = 0; i < hctx->ctx_map.size; i++) {
                struct blk_align_bitmap *bm = &hctx->ctx_map.map[i];

  @@ -872,8 +876,9 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
                  put_cpu();
          }

 -       kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
 -                       &hctx->run_work, 0);
 +       queued = kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
 +                                                 &hctx->run_work, 0);
 +       trace_hctx_delayed_work_queued(hctx, queued);

  @@ -1047,8 +1052,10 @@ static void blk_mq_insert_requests(struct request_queue *q,
                 list_del_init(&rq->queuelist);
                 rq->mq_ctx = ctx;
                 __blk_mq_insert_req_list(hctx, ctx, rq, false);
 +               trace_hctx_insert_req_list(hctx, rq);
         }
         blk_mq_hctx_mark_pending(hctx, ctx);
 +       trace_hctx_mark_pending(hctx);
         spin_unlock(&ctx->lock);

I have to say that I could not use the kernel trace events framework
(i.e. include/trace/events), because then, of course, the bug
immediately disappears.

I wrote my own static lockless ring buffer with a capacity of 64 events.

But I could not use any sort of atomics - the bug disappears.

My ring buffer was implemented per CPU, but then I had to sort all the
events using some proper, globally accurate clock.

But I could not use sched_clock() - the bug disappears.

I chose raw rdtsc(), and again the unordered variant, not
rdtsc_ordered(); rdtsc_ordered() also "fixes" the bug.  So frankly I am
not sure I can trust rdtsc(), which can be reordered, but there was no
other option.
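
A minimal userspace sketch of such a per-CPU tracer (illustrative only;
the names and layout are mine, and a plain counter stands in for raw
rdtsc() so the sketch stays portable):

```c
#include <stdint.h>
#include <assert.h>

#define EVT_CAPACITY 64

enum evt_type { INS_REQ, MARK_PEND, KBLK_SCHED, FLUSH_BUSY };

struct evt {
	uint64_t tsc;		/* timestamp used to merge-sort per-CPU logs */
	int cpu;
	enum evt_type type;
	const void *hctx;	/* which hardware context the event refers to */
};

/* One ring per CPU; no atomics, no locks, no barriers, on purpose:
 * any of those perturbs the timing enough to hide the bug. */
struct evt_ring {
	struct evt ev[EVT_CAPACITY];
	unsigned int head;	/* plain counter; the index wraps around */
};

static uint64_t fake_tsc;	/* stand-in for raw rdtsc() in this sketch */

static void evt_record(struct evt_ring *ring, int cpu,
		       enum evt_type type, const void *hctx)
{
	struct evt *e = &ring->ev[ring->head++ % EVT_CAPACITY];

	e->tsc = ++fake_tsc;
	e->cpu = cpu;
	e->type = type;
	e->hctx = hctx;
}
```

After the run, the per-CPU rings are merged by the tsc field, which is
what produces a sorted log like the one below.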

So here is the sorted tracer output for another stalled request:

  rdtsc=849688091004, cpu= 6, INS_REQ    hctx=ffff8800d46c0000, rq=ffff884000343600
  rdtsc=849688091304, cpu= 6, MARK_PEND  hctx=ffff8800d46c0000
  rdtsc=849688099595, cpu= 6, KBLK_SCHED hctx=ffff8800d46c0000, queued=1
  rdtsc=849688106102, cpu= 6, FLUSH_BUSY hctx=ffff8800d46c0000
  rdtsc=849688109586, cpu= 2, INS_REQ    hctx=ffff8800d46c0000, rq=ffff884000343cc0
  rdtsc=849688109957, cpu= 2, MARK_PEND  hctx=ffff8800d46c0000
  rdtsc=849688110921, cpu= 2, KBLK_SCHED hctx=ffff8800d46c0000, queued=0

I always see the same picture for all the runs I did (of course the CPU
numbers and rq addresses can vary):

  CPU#6                                CPU#2
  request ffff884000343600 inserted
  hctx marked as pending
  kblockd_schedule...() returns 1
  <schedule to kblockd workqueue>
  *** WORK_STRUCT_PENDING_BIT is cleared ***
  flush_busy_ctxs() is executed
                                       request ffff884000343cc0 inserted
                                       hctx marked as pending
                                       kblockd_schedule...() returns 0
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                                       WTF?

As a result, request ffff884000343cc0 is pending forever.

According to the trace output I see that the other CPU _always_
observes WORK_STRUCT_PENDING_BIT as set for that hctx->run_work, even
though it was cleared on the first CPU.

Checking the workqueue.c code I see that clearing the bit is nothing
more than an atomic_long_set(), which is a plain <mov> instruction, in
this function:

  static inline void set_work_data()

In an attempt to "fix" the mystery I replaced the atomic_long_set()
call with atomic_long_xchg(), which is an <xchg> instruction.

The problem went away.

To me it looks as if test_and_set_bit() (the <lock btsl> instruction)
does not force a flush of all CPU caches, which can be dirty after the
<mov> executed on another CPU.  But <xchg> really updates the memory,
and the following execution of <lock btsl> observes that the bit was
cleared.
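
The difference between the two variants can be sketched in C11-atomic
terms (a userspace approximation, not the kernel's implementation): the
<mov> variant is a relaxed store with no ordering at all, while <xchg>
is an atomic read-modify-write that acts as a full memory barrier on
x86.

```c
#include <stdatomic.h>
#include <assert.h>

static atomic_long work_data;	/* stands in for work->data */

/* What atomic_long_set() does: a plain store, <mov> on x86.  No
 * barrier; later loads in program order may be satisfied before this
 * store becomes visible to other CPUs. */
static void clear_pending_mov(long data)
{
	atomic_store_explicit(&work_data, data, memory_order_relaxed);
}

/* What the RFC patch switches to: <xchg>, which on x86 carries an
 * implicit LOCK prefix and orders everything before and after it. */
static long clear_pending_xchg(long data)
{
	return atomic_exchange(&work_data, data);
}
```

Note that x86 caches are coherent either way; if the xchg variant
helps, it is because of ordering rather than cache flushing.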

In conclusion I can say that I am lucky enough to be able to reproduce
this bug in several minutes under a specific load (I tried many other
simple loads using fio, and even btrecord/btreplay, with no success).
That easy reproduction under a specific load gives me some freedom to
test and then to be sure that the problem is gone.

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 kernel/workqueue.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 450c21f..60cd6db 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -621,7 +621,7 @@ static inline void set_work_data(struct work_struct *work, unsigned long data,
 				 unsigned long flags)
 {
 	WARN_ON_ONCE(!work_pending(work));
-	atomic_long_set(&work->data, data | flags | work_static(work));
+	atomic_long_xchg(&work->data, data | flags | work_static(work));
 }
 
 static void set_work_pwq(struct work_struct *work, struct pool_workqueue *pwq,
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/1] [RFC] workqueue: fix ghost PENDING flag while doing MQ IO
  2016-04-25 15:22 [PATCH 1/1] [RFC] workqueue: fix ghost PENDING flag while doing MQ IO Roman Pen
@ 2016-04-25 15:48 ` Tejun Heo
  2016-04-25 16:00   ` Tejun Heo
                     ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Tejun Heo @ 2016-04-25 15:48 UTC (permalink / raw)
  To: Roman Pen; +Cc: Jens Axboe, linux-block, linux-kernel, David Howells

Hello, Roman.

On Mon, Apr 25, 2016 at 05:22:51PM +0200, Roman Pen wrote:
...
>   CPU#6                                CPU#2
>   request ffff884000343600 inserted
>   hctx marked as pending
>   kblockd_schedule...() returns 1
>   <schedule to kblockd workqueue>
>   *** WORK_STRUCT_PENDING_BIT is cleared ***
>   flush_busy_ctxs() is executed
>                                        request ffff884000343cc0 inserted
>                                        hctx marked as pending
>                                        kblockd_schedule...() returns 0
>                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>                                        WTF?
> 
> As a result, request ffff884000343cc0 is pending forever.
> 
> According to the trace output I see that the other CPU _always_
> observes WORK_STRUCT_PENDING_BIT as set for that hctx->run_work, even
> though it was cleared on the first CPU.
> 
> Checking the workqueue.c code I see that clearing the bit is nothing
> more than an atomic_long_set(), which is a plain <mov> instruction, in
> this function:
> 
>   static inline void set_work_data()
> 
> In an attempt to "fix" the mystery I replaced the atomic_long_set()
> call with atomic_long_xchg(), which is an <xchg> instruction.
> 
> The problem went away.
> 
> To me it looks as if test_and_set_bit() (the <lock btsl> instruction)
> does not force a flush of all CPU caches, which can be dirty after the
> <mov> executed on another CPU.  But <xchg> really updates the memory,
> and the following execution of <lock btsl> observes that the bit was
> cleared.
> 
> In conclusion I can say that I am lucky enough to be able to reproduce
> this bug in several minutes under a specific load (I tried many other
> simple loads using fio, and even btrecord/btreplay, with no success).
> That easy reproduction under a specific load gives me some freedom to
> test and then to be sure that the problem is gone.

Heh, excellent debugging.  I wonder how old this bug is.  cc'ing David
Howells who ISTR to have reported a similar issue.  The root problem
here, I think, is that PENDING is used to synchronize between
different queueing instances, but we don't have a proper memory barrier
after it.

	A				B

	queue (test_and_sets PENDING)
	dispatch (clears PENDING)
	execute				queue (test_and_sets PENDING)

So, for B, the guarantee must be that either A starts executing after
B's test_and_set or B's test_and_set succeeds; however, as we don't
have any memory barrier between dispatch and execute, there's nothing
preventing the processor from scheduling some memory fetch operations
from the execute stage before the clearing of PENDING - ie. A might
not see what B has done prior to queue even if B's test_and_set fails
indicating that A should.  Can you please test whether the following
patch fixes the issue?

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 2232ae3..8ec2b5e 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -666,6 +666,7 @@ static void set_work_pool_and_clear_pending(struct work_struct *work,
 	 */
 	smp_wmb();
 	set_work_data(work, (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT, 0);
+	smp_mb();
 }
 
 static void clear_work_data(struct work_struct *work)


-- 
tejun

* Re: [PATCH 1/1] [RFC] workqueue: fix ghost PENDING flag while doing MQ IO
  2016-04-25 15:48 ` Tejun Heo
@ 2016-04-25 16:00   ` Tejun Heo
  2016-04-25 16:40     ` Roman Penyaev
  2016-04-25 16:34   ` Roman Penyaev
  2016-04-26  1:22   ` Peter Hurley
  2 siblings, 1 reply; 15+ messages in thread
From: Tejun Heo @ 2016-04-25 16:00 UTC (permalink / raw)
  To: Roman Pen; +Cc: Jens Axboe, linux-block, linux-kernel, David Howells

On Mon, Apr 25, 2016 at 11:48:47AM -0400, Tejun Heo wrote:
> Heh, excellent debugging.  I wonder how old this bug is.  cc'ing David

I just went through the history.  So, we used to have clear_bit()
instead of atomic_long_set() but clear_bit() doesn't imply any barrier
either, so kudos to you.  You just debugged an issue which has most
likely been there for over a decade! :)

-- 
tejun

* Re: [PATCH 1/1] [RFC] workqueue: fix ghost PENDING flag while doing MQ IO
  2016-04-25 15:48 ` Tejun Heo
  2016-04-25 16:00   ` Tejun Heo
@ 2016-04-25 16:34   ` Roman Penyaev
  2016-04-25 17:03     ` Tejun Heo
  2016-04-26  1:22   ` Peter Hurley
  2 siblings, 1 reply; 15+ messages in thread
From: Roman Penyaev @ 2016-04-25 16:34 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jens Axboe, linux-block, linux-kernel, David Howells

Hello, Tejun,

On Mon, Apr 25, 2016 at 5:48 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Roman.
>
> On Mon, Apr 25, 2016 at 05:22:51PM +0200, Roman Pen wrote:
> ...
>>   CPU#6                                CPU#2
>>   request ffff884000343600 inserted
>>   hctx marked as pending
>>   kblockd_schedule...() returns 1
>>   <schedule to kblockd workqueue>
>>   *** WORK_STRUCT_PENDING_BIT is cleared ***
>>   flush_busy_ctxs() is executed
>>                                        request ffff884000343cc0 inserted
>>                                        hctx marked as pending
>>                                        kblockd_schedule...() returns 0
>>                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>                                        WTF?
>>
>> As a result, request ffff884000343cc0 is pending forever.
>>
>> According to the trace output I see that the other CPU _always_
>> observes WORK_STRUCT_PENDING_BIT as set for that hctx->run_work, even
>> though it was cleared on the first CPU.
>>
>> Checking the workqueue.c code I see that clearing the bit is nothing
>> more than an atomic_long_set(), which is a plain <mov> instruction, in
>> this function:
>>
>>   static inline void set_work_data()
>>
>> In an attempt to "fix" the mystery I replaced the atomic_long_set()
>> call with atomic_long_xchg(), which is an <xchg> instruction.
>>
>> The problem went away.
>>
>> To me it looks as if test_and_set_bit() (the <lock btsl> instruction)
>> does not force a flush of all CPU caches, which can be dirty after the
>> <mov> executed on another CPU.  But <xchg> really updates the memory,
>> and the following execution of <lock btsl> observes that the bit was
>> cleared.
>>
>> In conclusion I can say that I am lucky enough to be able to reproduce
>> this bug in several minutes under a specific load (I tried many other
>> simple loads using fio, and even btrecord/btreplay, with no success).
>> That easy reproduction under a specific load gives me some freedom to
>> test and then to be sure that the problem is gone.
>
> Heh, excellent debugging.  I wonder how old this bug is.  cc'ing David
> Howells who ISTR to have reported a similar issue.  The root problem
> here, I think, is that PENDING is used to synchronize between
> different queueing instances, but we don't have a proper memory
> barrier after it.
>
>         A                               B
>
>         queue (test_and_sets PENDING)
>         dispatch (clears PENDING)
>         execute                         queue (test_and_sets PENDING)
>
> So, for B, the guarantee must be that either A starts executing after
> B's test_and_set or B's test_and_set succeeds; however, as we don't
> have any memory barrier between dispatch and execute, there's nothing
> preventing the processor from scheduling some memory fetch operations
> from the execute stage before the clearing of PENDING - ie. A might
> not see what B has done prior to queue even if B's test_and_set fails
> indicating that A should.  Can you please test whether the following
> patch fixes the issue?

I can assure you that smp_mb() helps (at least when running for 30
minutes under IO). That was my first variant, but I did not like it
because I could not explain to myself:

1. why not smp_wmb()? We need to flush after the update.
   (I tried that also, and it does not help)

2. what protects us from this situation?

  CPU#0                  CPU#1
                         set_work_data()
  test_and_set_bit()
                         smp_mb()

And question 2 was crucial to me, because even a tiny delay "fixes"
the problem, e.g. ndelay also "fixes" the bug:

         smp_wmb();
         set_work_data(work, (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT, 0);
 +       ndelay(40);
  }

Why ndelay(40)? Because on this machine smp_mb() takes 40 ns on average.

--
Roman

>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 2232ae3..8ec2b5e 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -666,6 +666,7 @@ static void set_work_pool_and_clear_pending(struct work_struct *work,
>          */
>         smp_wmb();
>         set_work_data(work, (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT, 0);
> +       smp_mb();
>  }
>
>  static void clear_work_data(struct work_struct *work)
>
>
> --
> tejun

* Re: [PATCH 1/1] [RFC] workqueue: fix ghost PENDING flag while doing MQ IO
  2016-04-25 16:00   ` Tejun Heo
@ 2016-04-25 16:40     ` Roman Penyaev
  0 siblings, 0 replies; 15+ messages in thread
From: Roman Penyaev @ 2016-04-25 16:40 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jens Axboe, linux-block, linux-kernel, David Howells

On Mon, Apr 25, 2016 at 6:00 PM, Tejun Heo <tj@kernel.org> wrote:
> On Mon, Apr 25, 2016 at 11:48:47AM -0400, Tejun Heo wrote:
>> Heh, excellent debugging.  I wonder how old this bug is.  cc'ing David
>
> I just went through the history.  So, we used to have clear_bit()
> instead of atomic_long_set() but clear_bit() doesn't imply any barrier
> either, so kudos to you.  You just debugged an issue which has most
> likely been there for over a decade! :)

That's nice that we can finally fix it.
But as I was saying, I was just lucky enough to have a very easy
reproduction, so I could check all the crazy ideas I had :)

--
Roman

* Re: [PATCH 1/1] [RFC] workqueue: fix ghost PENDING flag while doing MQ IO
  2016-04-25 16:34   ` Roman Penyaev
@ 2016-04-25 17:03     ` Tejun Heo
  2016-04-25 17:39       ` Roman Penyaev
  0 siblings, 1 reply; 15+ messages in thread
From: Tejun Heo @ 2016-04-25 17:03 UTC (permalink / raw)
  To: Roman Penyaev; +Cc: Jens Axboe, linux-block, linux-kernel, David Howells

Hello, Roman.

On Mon, Apr 25, 2016 at 06:34:45PM +0200, Roman Penyaev wrote:
> I can assure you that smp_mb() helps (at least when running for 30
> minutes under IO). That was my first variant, but I did not like it
> because I could not explain to myself:
> 
> 1. why not smp_wmb()? We need to flush after the update.
>    (I tried that also, and it does not help)

Regardless of the success of queue_work(), the interface guarantees
that there will be at least one execution instance which sees whatever
updates the queuer has made prior to calling queue_work().  The
PENDING bit is what synchronizes these operations.

	A				B

					Make updates
	clear PENDING			test_and_set PENDING
	start execution

So, if B's test_and_set takes place before clearing of PENDING, what
should be guaranteed is that A's execution must be able to see B's
updates; however, as there's no barrier between "clear PENDING" and
"start execution", memory loads of execution can be scheduled before
clearing of PENDING which leads to a situation where B loses queueing
but its updates are not seen by the prior instance's execution.  It's
a classic "either a sees b (clear PENDING) or b sees a (prior
updates)" interlocking situation.
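
This is the classic store-buffering pattern.  A userspace C11 sketch
(the names are illustrative stand-ins, and seq_cst stores and loads
play the role of the operations around the full barrier): with full
ordering, the bad outcome (B's test_and_set fails while A's execution
misses B's updates) cannot happen.

```c
#include <stdatomic.h>
#include <pthread.h>
#include <assert.h>

static atomic_int pending = 1;	/* PENDING is set: the work was queued */
static atomic_int updates;	/* B's updates, e.g. the marked hctx */
static int r_exec, r_queue;

/* A: dispatch clears PENDING, then execution reads the updates.
 * seq_cst stands in for the clear + full barrier pairing. */
static void *side_a(void *arg)
{
	(void)arg;
	atomic_store(&pending, 0);	/* clear PENDING */
	r_exec = atomic_load(&updates);	/* start execution: read updates */
	return NULL;
}

/* B: make updates, then test_and_set PENDING (simplified to a load,
 * since we only care whether B still sees the bit set). */
static void *side_b(void *arg)
{
	(void)arg;
	atomic_store(&updates, 1);	/* make updates */
	r_queue = atomic_load(&pending);/* does queueing fail? */
	return NULL;
}
```

With seq_cst, r_queue == 1 together with r_exec == 0 is forbidden;
relax the four operations and exactly that lost-queueing outcome
becomes possible.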

> 2. what protects us from this situation?
> 
>   CPU#0                  CPU#1
>                          set_work_data()
>   test_and_set_bit()
>                          smp_mb()

The above would be completely fine as CPU#1's execution would see all
the changes CPU#0 has made up to that point.

> And question 2 was crucial to me, because even a tiny delay "fixes"
> the problem, e.g. ndelay also "fixes" the bug:
> 
>          smp_wmb();
>          set_work_data(work, (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT, 0);
>  +       ndelay(40);
>   }
> 
> Why ndelay(40)? Because on this machine smp_mb() takes 40 ns on average.

Yeah, this is the CPU rescheduling loads for the execution ahead of
the clearing of PENDING, and doing anything in between is likely to
reduce the chance of it happening drastically, but smp_mb() in between
is actually the right solution here.

Thanks.

-- 
tejun

* Re: [PATCH 1/1] [RFC] workqueue: fix ghost PENDING flag while doing MQ IO
  2016-04-25 17:03     ` Tejun Heo
@ 2016-04-25 17:39       ` Roman Penyaev
  2016-04-25 17:51         ` Tejun Heo
  0 siblings, 1 reply; 15+ messages in thread
From: Roman Penyaev @ 2016-04-25 17:39 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jens Axboe, linux-block, linux-kernel, David Howells

On Mon, Apr 25, 2016 at 7:03 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Roman.
>
> On Mon, Apr 25, 2016 at 06:34:45PM +0200, Roman Penyaev wrote:
>> I can assure you that smp_mb() helps (at least when running for 30
>> minutes under IO). That was my first variant, but I did not like it
>> because I could not explain to myself:
>>
>> 1. why not smp_wmb()? We need to flush after the update.
>>    (I tried that also, and it does not help)
>
> Regardless of the success of queue_work(), the interface guarantees
> that there will be at least one execution instance which sees whatever
> updates the queuer has made prior to calling queue_work().  The
> PENDING bit is what synchronizes these operations.
>
>         A                               B
>
>                                         Make updates
>         clear PENDING                   test_and_set PENDING
>         start execution
>
> So, if B's test_and_set takes place before clearing of PENDING, what
> should be guaranteed is that A's execution must be able to see B's
> updates; however, as there's no barrier between "clear PENDING" and
> "start execution", memory loads of execution can be scheduled before
> clearing of PENDING which leads to a situation where B loses queueing
> but its updates are not seen by the prior instance's execution.  It's
> a classic "either a sees b (clear PENDING) or b sees a (prior
> updates)" interlocking situation.

OK, that's clear now, thanks.  I was also confused by the spin lock,
which is released just after the pending bit is cleared:

   set_work_pool_and_clear_pending(work, pool->id);
   spin_unlock_irq(&pool->lock);
   ...
   worker->current_func(work);

But it seems the memory operations of the execution can leak in and
appear before the pending bit is cleared and the spin lock is released
(according to Documentation/memory-barriers.txt, (6) RELEASE operations).
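
In C11 terms the worker-side sequence looks roughly like this
(illustrative stand-ins only, not the kernel code); the release store
modelling spin_unlock_irq() constrains only the accesses *before* it:

```c
#include <stdatomic.h>
#include <assert.h>

static atomic_int pool_lock = 1;	/* held */
static atomic_long work_data = 1;	/* PENDING set */
static int work_arg;			/* data the queuer wrote */

static int worker_step(void)
{
	/* set_work_pool_and_clear_pending(): plain store, no barrier */
	atomic_store_explicit(&work_data, 0, memory_order_relaxed);

	/* spin_unlock_irq(&pool->lock): RELEASE ordering only; it keeps
	 * the store above from sinking below it, but does nothing to
	 * stop the load below from rising above both stores */
	atomic_store_explicit(&pool_lock, 0, memory_order_release);

	/* worker->current_func(work): this load may be satisfied before
	 * either store is visible to other CPUs */
	return work_arg;
}
```

Nothing in this sequence is a full barrier, so the execution's loads
can be reordered before the pending clear.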

>> 2. what protects us from this situation?
>>
>>   CPU#0                  CPU#1
>>                          set_work_data()
>>   test_and_set_bit()
>>                          smp_mb()
>
> The above would be completely fine as CPU#1's execution would see all
> the changes CPU#0 has made up to that point.
>
>> And question 2 was crucial to me, because even a tiny delay "fixes"
>> the problem, e.g. ndelay also "fixes" the bug:
>>
>>          smp_wmb();
>>          set_work_data(work, (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT, 0);
>>  +       ndelay(40);
>>   }
>>
>> Why ndelay(40)? Because on this machine smp_mb() takes 40 ns on average.
>
> Yeah, this is the CPU rescheduling loads for the execution ahead of
> the clearing of PENDING, and doing anything in between is likely to
> reduce the chance of it happening drastically, but smp_mb() in between
> is actually the right solution here.

Tejun, do you need an updated patch for that? With a proper smp_mb()?

--
Roman

* Re: [PATCH 1/1] [RFC] workqueue: fix ghost PENDING flag while doing MQ IO
  2016-04-25 17:39       ` Roman Penyaev
@ 2016-04-25 17:51         ` Tejun Heo
  0 siblings, 0 replies; 15+ messages in thread
From: Tejun Heo @ 2016-04-25 17:51 UTC (permalink / raw)
  To: Roman Penyaev; +Cc: Jens Axboe, linux-block, linux-kernel, David Howells

Hello, Roman.

On Mon, Apr 25, 2016 at 07:39:52PM +0200, Roman Penyaev wrote:
> OK, that's clear now, thanks.  I was also confused by the spin lock,
> which is released just after the pending bit is cleared:
> 
>    set_work_pool_and_clear_pending(work, pool->id);
>    spin_unlock_irq(&pool->lock);
>    ...
>    worker->current_func(work);
> 
> But it seems the memory operations of the execution can leak in and
> appear before the pending bit is cleared and the spin lock is released
> (according to Documentation/memory-barriers.txt, (6) RELEASE operations).

Yeap, release doesn't prevent following reads from creeping ahead of
it.

> Tejun, do you need an updated patch for that? With a proper smp_mb()?

Yes, and can you please also add a comment explaining what the mb is
interlocking?

Thanks a lot!

-- 
tejun

* Re: [PATCH 1/1] [RFC] workqueue: fix ghost PENDING flag while doing MQ IO
  2016-04-25 15:48 ` Tejun Heo
  2016-04-25 16:00   ` Tejun Heo
  2016-04-25 16:34   ` Roman Penyaev
@ 2016-04-26  1:22   ` Peter Hurley
  2016-04-26 15:15     ` Tejun Heo
  2 siblings, 1 reply; 15+ messages in thread
From: Peter Hurley @ 2016-04-26  1:22 UTC (permalink / raw)
  To: Tejun Heo, Roman Pen; +Cc: Jens Axboe, linux-block, linux-kernel, David Howells

On 04/25/2016 08:48 AM, Tejun Heo wrote:
> Hello, Roman.
> 
> On Mon, Apr 25, 2016 at 05:22:51PM +0200, Roman Pen wrote:
> ...
>>   CPU#6                                CPU#2
>>   request ffff884000343600 inserted
>>   hctx marked as pending
>>   kblockd_schedule...() returns 1
>>   <schedule to kblockd workqueue>
>>   *** WORK_STRUCT_PENDING_BIT is cleared ***
>>   flush_busy_ctxs() is executed
>>                                        request ffff884000343cc0 inserted
>>                                        hctx marked as pending
>>                                        kblockd_schedule...() returns 0
>>                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>                                        WTF?
>>
>> As a result, request ffff884000343cc0 is pending forever.
>>
>> According to the trace output I see that the other CPU _always_
>> observes WORK_STRUCT_PENDING_BIT as set for that hctx->run_work, even
>> though it was cleared on the first CPU.
>>
>> Checking the workqueue.c code I see that clearing the bit is nothing
>> more than an atomic_long_set(), which is a plain <mov> instruction, in
>> this function:
>>
>>   static inline void set_work_data()
>>
>> In an attempt to "fix" the mystery I replaced the atomic_long_set()
>> call with atomic_long_xchg(), which is an <xchg> instruction.
>>
>> The problem went away.
>>
>> To me it looks as if test_and_set_bit() (the <lock btsl> instruction)
>> does not force a flush of all CPU caches, which can be dirty after the
>> <mov> executed on another CPU.  But <xchg> really updates the memory,
>> and the following execution of <lock btsl> observes that the bit was
>> cleared.
>>
>> In conclusion I can say that I am lucky enough to be able to reproduce
>> this bug in several minutes under a specific load (I tried many other
>> simple loads using fio, and even btrecord/btreplay, with no success).
>> That easy reproduction under a specific load gives me some freedom to
>> test and then to be sure that the problem is gone.
> 
> Heh, excellent debugging.  I wonder how old this bug is.

This is the same bug I wrote about 2 yrs ago (but with the wrong fix).

http://lkml.iu.edu/hypermail/linux/kernel/1402.2/04697.html

Unfortunately I didn't have a reproducer at all :/


> cc'ing David
> Howells who ISTR to have reported a similar issue.  The root problem
> here, I think, is that PENDING is used to synchronize between
> different queueing instances, but we don't have a proper memory
> barrier after it.
> 
> 	A				B
> 
> 	queue (test_and_sets PENDING)
> 	dispatch (clears PENDING)
> 	execute				queue (test_and_sets PENDING)
> 
> So, for B, the guarantee must be that either A starts executing after
> B's test_and_set or B's test_and_set succeeds; however, as we don't
> have any memory barrier between dispatch and execute, there's nothing
> preventing the processor from scheduling some memory fetch operations
> from the execute stage before the clearing of PENDING - ie. A might
> not see what B has done prior to queue even if B's test_and_set fails
> indicating that A should.  Can you please test whether the following
> patch fixes the issue?
> 
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 2232ae3..8ec2b5e 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -666,6 +666,7 @@ static void set_work_pool_and_clear_pending(struct work_struct *work,
>  	 */
>  	smp_wmb();
>  	set_work_data(work, (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT, 0);
> +	smp_mb();
>  }

The atomic_long_xchg() patch has several benefits over the naked barrier:

1. set_work_pool_and_clear_pending() has the same requirements as
   clear_work_data(); note that both require write barrier before and
   full barrier after.

2. xchg() et al imply full barrier before and full barrier after.

3. The naked barriers could be removed, while improving efficiency.
   On x86, mov + mfence => xchg

4. Maybe fixes other hidden bugs.
   For example, I'm wondering if reordering with set_work_pwq/list_add_tail
   would be a problem; ie., what if work is visible on the worklist _before_
   data is initialized by set_work_pwq()?

Regards,
Peter Hurley

* Re: [PATCH 1/1] [RFC] workqueue: fix ghost PENDING flag while doing MQ IO
  2016-04-26  1:22   ` Peter Hurley
@ 2016-04-26 15:15     ` Tejun Heo
  2016-04-26 17:27       ` Peter Hurley
  0 siblings, 1 reply; 15+ messages in thread
From: Tejun Heo @ 2016-04-26 15:15 UTC (permalink / raw)
  To: Peter Hurley
  Cc: Roman Pen, Jens Axboe, linux-block, linux-kernel, David Howells

Hello, Peter.

On Mon, Apr 25, 2016 at 06:22:01PM -0700, Peter Hurley wrote:
> This is the same bug I wrote about 2 yrs ago (but with the wrong fix).
> 
> http://lkml.iu.edu/hypermail/linux/kernel/1402.2/04697.html
> 
> Unfortunately I didn't have a reproducer at all :/

Ah, bummer.

> The atomic_long_xchg() patch has several benefits over the naked barrier:
> 
> 1. set_work_pool_and_clear_pending() has the same requirements as
>    clear_work_data(); note that both require write barrier before and
>    full barrier after.

clear_work_data() is only used by __cancel_work_timer() and there's no
following execution or anything where rescheduling memory loads can
cause any issue.

> 2. xchg() et al imply full barrier before and full barrier after.
> 
> 3. The naked barriers could be removed, while improving efficiency.
>    On x86, mov + mfence => xchg

It's unlikely to make any measurable difference.  Is xchg() actually
cheaper than store + rmb?

> 4. Maybe fixes other hidden bugs.
>    For example, I'm wondering if reordering with set_work_pwq/list_add_tail
>    would be a problem; ie., what if work is visible on the worklist _before_
>    data is initialized by set_work_pwq()?

Worklist is always accessed under the pool lock.  The barrier comes
into play only because we're using bare PENDING bit for
synchronization.  I'm not necessarily against making all clearings of
PENDING be followed by an rmb or use xchg.  Reasons 2-4 are pretty
weak tho.

Thanks.

-- 
tejun

* Re: [PATCH 1/1] [RFC] workqueue: fix ghost PENDING flag while doing MQ IO
  2016-04-26 15:15     ` Tejun Heo
@ 2016-04-26 17:27       ` Peter Hurley
  2016-04-26 17:45         ` Tejun Heo
  0 siblings, 1 reply; 15+ messages in thread
From: Peter Hurley @ 2016-04-26 17:27 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Roman Pen, Jens Axboe, linux-block, linux-kernel, David Howells

Hi Tejun,

On 04/26/2016 08:15 AM, Tejun Heo wrote:
> Hello, Peter.
> 
> On Mon, Apr 25, 2016 at 06:22:01PM -0700, Peter Hurley wrote:
>> This is the same bug I wrote about 2 yrs ago (but with the wrong fix).
>>
>> http://lkml.iu.edu/hypermail/linux/kernel/1402.2/04697.html
>>
>> Unfortunately I didn't have a reproducer at all :/
> 
> Ah, bummer.
> 
>> The atomic_long_xchg() patch has several benefits over the naked barrier:
>>
>> 1. set_work_pool_and_clear_pending() has the same requirements as
>>    clear_work_data(); note that both require write barrier before and
>>    full barrier after.
> 
> clear_work_data() is only used by __cancel_work_timer() and there's no
> following execution or anything where rescheduling memory loads can
> cause any issue.

I meant this block:

	clear_work_data(work);

	/*
	 * Paired with prepare_to_wait() above so that either
	 * waitqueue_active() is visible here or !work_is_canceling() is
	 * visible there.
	 */
	smp_mb();


>> 2. xchg() et al imply full barrier before and full barrier after.
>>
>> 3. The naked barriers could be removed, while improving efficiency.
>>    On x86, mov + mfence => xchg
> 
> It's unlikely to make any measurable difference.  Is xchg() actually
> cheaper than store + rmb?

store + mfence (full barrier), yes. Roughly 2x faster.

https://lkml.org/lkml/2015/11/2/607


>> 4. Maybe fixes other hidden bugs.
>>    For example, I'm wondering if reordering with set_work_pwq/list_add_tail
>>    would be a problem; ie., what if work is visible on the worklist _before_
>>    data is initialized by set_work_pwq()?
> 
> Worklist is always accessed under the pool lock.

I see a lot of list_empty() checks while not holding pool lock, but I get
what you mean: that work visible on the worklist isn't actually inspected
without holding the pool lock.


> The barrier comes
> into play only because we're using bare PENDING bit for
> synchronization.

I see that it's only transitions of PENDING that need full barriers,
but it's not clear from the semantics under which conditions PENDING is
expected to already be set. If performance isn't the concern there,
maybe those uses (where PENDING is expected to already be set) should
check and warn?


> I'm not necessarily against making all clearings of
> PENDING be followed by an rmb or use xchg.  Reasons 2-4 are pretty
> weak tho.

I agree 2 and 3 are not the best reasons.
Actually, it looks like I'm in the minority anyway, and that style-wise,
naked barrier is preferred.

Regards,
Peter Hurley

* Re: [PATCH 1/1] [RFC] workqueue: fix ghost PENDING flag while doing MQ IO
  2016-04-26 17:27       ` Peter Hurley
@ 2016-04-26 17:45         ` Tejun Heo
  2016-04-26 20:07           ` Peter Hurley
  2016-04-27  5:50           ` Hannes Reinecke
  0 siblings, 2 replies; 15+ messages in thread
From: Tejun Heo @ 2016-04-26 17:45 UTC (permalink / raw)
  To: Peter Hurley
  Cc: Roman Pen, Jens Axboe, linux-block, linux-kernel, David Howells

Hello, Peter.

On Tue, Apr 26, 2016 at 10:27:59AM -0700, Peter Hurley wrote:
> > It's unlikely to make any measurable difference.  Is xchg() actually
> > cheaper than store + rmb?
> 
> store + mfence (full barrier), yes. Roughly 2x faster.
> 
> https://lkml.org/lkml/2015/11/2/607

Ah, didn't know that.  Thanks for the pointer.

> > I'm not necessarily against making all clearings of
> > PENDING be followed by an rmb or use xchg.  Reasons 2-4 are pretty
> > weak tho.
> 
> I agree 2 and 3 are not the best reasons.
> Actually, it looks like I'm in the minority anyway, and that style-wise,
> naked barrier is preferred.

As long as what's happening is clearly documented, I think either is
fine.  I'm gonna go with Roman's mb patch for -stable fix but think
it'd be nice to have a separate patch to consolidate the paths which
clear PENDING and make them use xchg.  If you can spin up a patch for
that, I'd be happy to apply it to wq/for-3.7.

Thanks.

-- 
tejun

* Re: [PATCH 1/1] [RFC] workqueue: fix ghost PENDING flag while doing MQ IO
  2016-04-26 17:45         ` Tejun Heo
@ 2016-04-26 20:07           ` Peter Hurley
  2016-04-27  5:50           ` Hannes Reinecke
  1 sibling, 0 replies; 15+ messages in thread
From: Peter Hurley @ 2016-04-26 20:07 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Roman Pen, Jens Axboe, linux-block, linux-kernel, David Howells

On 04/26/2016 10:45 AM, Tejun Heo wrote:
> As long as what's happening is clearly documented, I think either is
> fine.  I'm gonna go with Roman's mb patch for -stable fix but think
> it'd be nice to have a separate patch to consolidate the paths which
> clear PENDING and make them use xchg.  If you can spin up a patch for
> that, I'd be happy to apply it to wq/for-3.7.

Ok.

* Re: [PATCH 1/1] [RFC] workqueue: fix ghost PENDING flag while doing MQ IO
  2016-04-26 17:45         ` Tejun Heo
  2016-04-26 20:07           ` Peter Hurley
@ 2016-04-27  5:50           ` Hannes Reinecke
  2016-04-27 19:05             ` Tejun Heo
  1 sibling, 1 reply; 15+ messages in thread
From: Hannes Reinecke @ 2016-04-27  5:50 UTC (permalink / raw)
  To: Tejun Heo, Peter Hurley
  Cc: Roman Pen, Jens Axboe, linux-block, linux-kernel, David Howells

On 04/26/2016 07:45 PM, Tejun Heo wrote:
> Hello, Peter.
> 
> On Tue, Apr 26, 2016 at 10:27:59AM -0700, Peter Hurley wrote:
>>> It's unlikely to make any measurable difference.  Is xchg() actually
>>> cheaper than store + rmb?
>>
>> store + mfence (full barrier), yes. Roughly 2x faster.
>>
>> https://lkml.org/lkml/2015/11/2/607
> 
> Ah, didn't know that.  Thanks for the pointer.
> 
>>> I'm not necessarily against making all clearings of
>>> PENDING be followed by an rmb or use xchg.  Reasons 2-4 are pretty
>>> weak tho.
>>
>> I agree 2 and 3 are not the best reasons.
>> Actually, it looks like I'm in the minority anyway, and that style-wise,
>> naked barrier is preferred.
> 
> As long as what's happening is clearly documented, I think either is
> fine.  I'm gonna go with Roman's mb patch for -stable fix but think
> it'd be nice to have a separate patch to consolidate the paths which
> clear PENDING and make them use xchg.  If you can spin up a patch for
> that, I'd be happy to apply it to wq/for-3.7.
>                                          ^^^
Ah. Time warp.
I knew it would happen eventually :-)

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

* Re: [PATCH 1/1] [RFC] workqueue: fix ghost PENDING flag while doing MQ IO
  2016-04-27  5:50           ` Hannes Reinecke
@ 2016-04-27 19:05             ` Tejun Heo
  0 siblings, 0 replies; 15+ messages in thread
From: Tejun Heo @ 2016-04-27 19:05 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Peter Hurley, Roman Pen, Jens Axboe, linux-block, lkml, David Howells

On Wed, Apr 27, 2016 at 1:50 AM, Hannes Reinecke <hare@suse.de> wrote:
>> that, I'd be happy to apply it to wq/for-3.7.
>>                                          ^^^
> Ah. Time warp.
> I knew it would happen eventually :-)

lol wrong tree too. Sorry about that.

-- 
tejun

Thread overview: 15+ messages
2016-04-25 15:22 [PATCH 1/1] [RFC] workqueue: fix ghost PENDING flag while doing MQ IO Roman Pen
2016-04-25 15:48 ` Tejun Heo
2016-04-25 16:00   ` Tejun Heo
2016-04-25 16:40     ` Roman Penyaev
2016-04-25 16:34   ` Roman Penyaev
2016-04-25 17:03     ` Tejun Heo
2016-04-25 17:39       ` Roman Penyaev
2016-04-25 17:51         ` Tejun Heo
2016-04-26  1:22   ` Peter Hurley
2016-04-26 15:15     ` Tejun Heo
2016-04-26 17:27       ` Peter Hurley
2016-04-26 17:45         ` Tejun Heo
2016-04-26 20:07           ` Peter Hurley
2016-04-27  5:50           ` Hannes Reinecke
2016-04-27 19:05             ` Tejun Heo
