Re: [PATCH 1/1] [RFC] workqueue: fix ghost PENDING flag while doing MQ IO

From: Tejun Heo <tj@kernel.org>
To: Roman Pen <roman.penyaev@profitbricks.com>
Cc: Jens Axboe <axboe@kernel.dk>,
	linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	David Howells <dhowells@redhat.com>
Subject: Re: [PATCH 1/1] [RFC] workqueue: fix ghost PENDING flag while doing MQ IO
Date: Mon, 25 Apr 2016 11:48:47 -0400	[thread overview]
Message-ID: <20160425154847.GZ7822@mtj.duckdns.org> (raw)
In-Reply-To: <1461597771-25352-1-git-send-email-roman.penyaev@profitbricks.com>

Hello, Roman.

On Mon, Apr 25, 2016 at 05:22:51PM +0200, Roman Pen wrote:
...
>   CPU#6                                CPU#2
>   reqeust ffff884000343600 inserted
>   hctx marked as pended
>   kblockd_schedule...() returns 1
>   <schedule to kblockd workqueue>
>   *** WORK_STRUCT_PENDING_BIT is cleared ***
>   flush_busy_ctxs() is executed
>                                        reqeust ffff884000343cc0 inserted
>                                        hctx marked as pended
>                                        kblockd_schedule...() returns 0
>                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>                                        WTF?
> 
> As a result ffff884000343cc0 request pended forever.
> 
> According to the trace output I see that another CPU _always_ observes
> WORK_STRUCT_PENDING_BIT as set for that hctx->run_work, even it was
> cleared on another CPU.
> 
> Checking the workqueue.c code I see that clearing the bit is nothing
> more, but atomic_long_set(), which is <mov> instruction. This
> function:
> 
>   static inline void set_work_data()
> 
> In attempt to "fix" the mystery I replaced atomic_long_set() call with
> atomic_long_xchg(), which is <xchg> instruction.
> 
> The problem has gone.
> 
> For me it looks that test_and_set_bit() (<lock btsl> instruction) does
> not require flush of all CPU caches, which can be dirty after executing
> of <mov> on another CPU.  But <xchg> really updates the memory and the
> following execution of <lock btsl> observes that bit was cleared.
> 
> As a conculusion I can say, that I am lucky enough and can reproduce
> this bug in several minutes on a specific load (I tried many other
> simple loads using fio, even using btrecord/btreplay, no success).
> And that easy reproduction on a specific load gives me some freedom
> to test and then to be sure, that problem has gone.

Heh, excellent debugging.  I wonder how old this bug is.  cc'ing David
Howells who ISTR to have reported a similar issue.  The root problem
here, I think, is that PENDING is used to synchronize between
different queueing instances but we don't have proper memory barrier
after it.

	A				B

	queue (test_and_sets PENDING)
	dispatch (clears PENDING)
	execute				queue (test_and_sets PENDING)

So, for B, the guarantee must be that either A starts executing after
B's test_and_set or B's test_and_set succeeds; however, as we don't
have any memory barrier between dispatch and execute, there's nothing
preventing the processor from scheduling some memory fetch operations
from the execute stage before the clearing of PENDING - ie. A might
not see what B has done prior to queue even if B's test_and_set fails
indicating that A should.  Can you please test whether the following
patch fixes the issue?

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 2232ae3..8ec2b5e 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -666,6 +666,7 @@ static void set_work_pool_and_clear_pending(struct work_struct *work,
 	 */
 	smp_wmb();
 	set_work_data(work, (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT, 0);
+	smp_mb();
 }
 
 static void clear_work_data(struct work_struct *work)


-- 
tejun