Re: [PATCH 1/1] [RFC] workqueue: fix ghost PENDING flag while doing MQ IO

From: Roman Penyaev <roman.penyaev@profitbricks.com>
To: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>,
	linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	David Howells <dhowells@redhat.com>
Subject: Re: [PATCH 1/1] [RFC] workqueue: fix ghost PENDING flag while doing MQ IO
Date: Mon, 25 Apr 2016 18:34:45 +0200	[thread overview]
Message-ID: <CAJrWOzDcY=HrpXHKU7sLO79CJ8H=J14ywfE70aMiu-+1kJKKXg@mail.gmail.com> (raw)
In-Reply-To: <20160425154847.GZ7822@mtj.duckdns.org>

Hello, Tejun,

On Mon, Apr 25, 2016 at 5:48 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Roman.
>
> On Mon, Apr 25, 2016 at 05:22:51PM +0200, Roman Pen wrote:
> ...
>>   CPU#6                                CPU#2
>>   reqeust ffff884000343600 inserted
>>   hctx marked as pended
>>   kblockd_schedule...() returns 1
>>   <schedule to kblockd workqueue>
>>   *** WORK_STRUCT_PENDING_BIT is cleared ***
>>   flush_busy_ctxs() is executed
>>                                        reqeust ffff884000343cc0 inserted
>>                                        hctx marked as pended
>>                                        kblockd_schedule...() returns 0
>>                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>                                        WTF?
>>
>> As a result ffff884000343cc0 request pended forever.
>>
>> According to the trace output I see that another CPU _always_ observes
>> WORK_STRUCT_PENDING_BIT as set for that hctx->run_work, even it was
>> cleared on another CPU.
>>
>> Checking the workqueue.c code I see that clearing the bit is nothing
>> more, but atomic_long_set(), which is <mov> instruction. This
>> function:
>>
>>   static inline void set_work_data()
>>
>> In attempt to "fix" the mystery I replaced atomic_long_set() call with
>> atomic_long_xchg(), which is <xchg> instruction.
>>
>> The problem has gone.
>>
>> For me it looks that test_and_set_bit() (<lock btsl> instruction) does
>> not require flush of all CPU caches, which can be dirty after executing
>> of <mov> on another CPU.  But <xchg> really updates the memory and the
>> following execution of <lock btsl> observes that bit was cleared.
>>
>> As a conculusion I can say, that I am lucky enough and can reproduce
>> this bug in several minutes on a specific load (I tried many other
>> simple loads using fio, even using btrecord/btreplay, no success).
>> And that easy reproduction on a specific load gives me some freedom
>> to test and then to be sure, that problem has gone.
>
> Heh, excellent debugging.  I wonder how old this bug is.  cc'ing David
> Howells who ISTR to have reported a similar issue.  The root problem
> here, I think, is that PENDING is used to synchronize between
> different queueing instances but we don't have proper memory barrier
> after it.
>
>         A                               B
>
>         queue (test_and_sets PENDING)
>         dispatch (clears PENDING)
>         execute                         queue (test_and_sets PENDING)
>
> So, for B, the guarantee must be that either A starts executing after
> B's test_and_set or B's test_and_set succeeds; however, as we don't
> have any memory barrier between dispatch and execute, there's nothing
> preventing the processor from scheduling some memory fetch operations
> from the execute stage before the clearing of PENDING - ie. A might
> not see what B has done prior to queue even if B's test_and_set fails
> indicating that A should.  Can you please test whether the following
> patch fixes the issue?

I can assure you that smp_mb() helps (at least running for 30 minutes
under IO). That was my first variant, but I did not like it because I
could not explain myself why:

1. not smp_wmb()? We need to do flush after an update.
   (I tried that also, and it does not help)

2. what protects us from this situation?

  CPU#0                  CPU#1
                         set_work_data()
  test_and_set_bit()
                         smp_mb()

And 2. question was crucial to me, because even tiny delay "fixes" the
problem, e.g. ndelay also "fixes" the bug:

         smp_wmb();
         set_work_data(work, (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT, 0);
 +       ndelay(40);
  }

Why ndelay(40)? Because on this machine smp_mb() takes 40 ns on average.

--
Roman

>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 2232ae3..8ec2b5e 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -666,6 +666,7 @@ static void set_work_pool_and_clear_pending(struct work_struct *work,
>          */
>         smp_wmb();
>         set_work_data(work, (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT, 0);
> +       smp_mb();
>  }
>
>  static void clear_work_data(struct work_struct *work)
>
>
> --
> tejun