From: Tejun Heo <tj@kernel.org>
To: Roman Pen <roman.penyaev@profitbricks.com>
Cc: Jens Axboe <axboe@kernel.dk>,
	linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	David Howells <dhowells@redhat.com>
Subject: Re: [PATCH 1/1] [RFC] workqueue: fix ghost PENDING flag while doing MQ IO
Date: Mon, 25 Apr 2016 11:48:47 -0400
Message-ID: <20160425154847.GZ7822@mtj.duckdns.org>
In-Reply-To: <1461597771-25352-1-git-send-email-roman.penyaev@profitbricks.com>

Hello, Roman.

On Mon, Apr 25, 2016 at 05:22:51PM +0200, Roman Pen wrote:
...
>   CPU#6                                CPU#2
>   request ffff884000343600 inserted
>   hctx marked as pended
>   kblockd_schedule...() returns 1
>   <schedule to kblockd workqueue>
>   *** WORK_STRUCT_PENDING_BIT is cleared ***
>   flush_busy_ctxs() is executed
>                                        request ffff884000343cc0 inserted
>                                        hctx marked as pended
>                                        kblockd_schedule...() returns 0
>                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>                                        WTF?
> 
> As a result ffff884000343cc0 request pended forever.
> 
> According to the trace output I see that another CPU _always_ observes
> WORK_STRUCT_PENDING_BIT as set for that hctx->run_work, even it was
> cleared on another CPU.
> 
> Checking the workqueue.c code I see that clearing the bit is nothing
> more, but atomic_long_set(), which is <mov> instruction. This
> function:
> 
>   static inline void set_work_data()
> 
> In attempt to "fix" the mystery I replaced atomic_long_set() call with
> atomic_long_xchg(), which is <xchg> instruction.
> 
> The problem has gone.
> 
> For me it looks that test_and_set_bit() (<lock btsl> instruction) does
> not require flush of all CPU caches, which can be dirty after executing
> of <mov> on another CPU.  But <xchg> really updates the memory and the
> following execution of <lock btsl> observes that bit was cleared.
> 
> As a conclusion I can say, that I am lucky enough and can reproduce
> this bug in several minutes on a specific load (I tried many other
> simple loads using fio, even using btrecord/btreplay, no success).
> And that easy reproduction on a specific load gives me some freedom
> to test and then to be sure, that problem has gone.

Heh, excellent debugging.  I wonder how old this bug is.  cc'ing David
Howells who, ISTR, reported a similar issue.  The root problem here, I
think, is that PENDING is used to synchronize between different
queueing instances but we don't have a proper memory barrier after
clearing it.

	A				B

	queue (test_and_sets PENDING)
	dispatch (clears PENDING)
	execute				queue (test_and_sets PENDING)

So, for B, the guarantee must be that either A starts executing after
B's test_and_set or B's test_and_set succeeds; however, as we don't
have any memory barrier between dispatch and execute, there's nothing
preventing the processor from performing some of the execute stage's
memory loads before the clearing of PENDING becomes visible - i.e. A
might not see what B has done prior to queueing even if B's
test_and_set fails, indicating that A should.  Can you please test
whether the following patch fixes the issue?

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 2232ae3..8ec2b5e 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -666,6 +666,7 @@ static void set_work_pool_and_clear_pending(struct work_struct *work,
 	 */
 	smp_wmb();
 	set_work_data(work, (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT, 0);
+	smp_mb();
 }
 
 static void clear_work_data(struct work_struct *work)
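
For illustration only, the ordering requirement described above can be
modelled in userspace with C11 atomics.  This is a rough sketch under
stated assumptions, not kernel code: struct work_model, model_init(),
model_queue(), model_dispatch_and_execute() and model_selfcheck() are
hypothetical names, the payload field stands in for whatever state the
work function reads (e.g. the "hctx marked as pended" state), and
atomic_thread_fence(memory_order_seq_cst) stands in for smp_mb().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MODEL_PENDING 1UL	/* models WORK_STRUCT_PENDING_BIT (assumption) */

struct work_model {
	atomic_ulong data;	/* models work->data, bit 0 is PENDING */
	atomic_int payload;	/* models state the work function consumes */
};

/* Start in the state "A has queued the work": PENDING set, no payload. */
static void model_init(struct work_model *w)
{
	atomic_init(&w->data, MODEL_PENDING);
	atomic_init(&w->payload, 0);
}

/*
 * B side: publish the payload, then try to claim PENDING.
 * The fence plus fetch_or models a fully ordered test_and_set_bit().
 * Returns true if PENDING was clear, i.e. the work got queued again.
 */
static bool model_queue(struct work_model *w, int value)
{
	atomic_store_explicit(&w->payload, value, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	unsigned long old = atomic_fetch_or_explicit(&w->data, MODEL_PENDING,
						     memory_order_relaxed);
	return !(old & MODEL_PENDING);
}

/*
 * A side: clear PENDING with a plain store, then execute.
 * The fence plays the role of the smp_mb() added by the patch; without
 * it the payload load may be satisfied before the clear is visible.
 */
static int model_dispatch_and_execute(struct work_model *w)
{
	atomic_store_explicit(&w->data, 0UL, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&w->payload, memory_order_relaxed);
}

/* Walk through the two sequential interleavings of A and B. */
static void model_selfcheck(void)
{
	struct work_model w;

	/* A dispatches first: B finds PENDING clear and re-queues. */
	model_init(&w);
	assert(model_dispatch_and_execute(&w) == 0);
	assert(model_queue(&w, 42) == true);

	/* B queues first and fails: A must then observe B's payload. */
	model_init(&w);
	assert(model_queue(&w, 42) == false);
	assert(model_dispatch_and_execute(&w) == 42);
}
```

With a fence on both sides, at least one of the two outcomes must hold
when A and B race: B sees PENDING clear, or A sees B's payload.  The
self-check only exercises the two sequential orders; it does not
reproduce the race itself.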


-- 
tejun

Thread overview: 15+ messages
2016-04-25 15:22 [PATCH 1/1] [RFC] workqueue: fix ghost PENDING flag while doing MQ IO Roman Pen
2016-04-25 15:48 ` Tejun Heo [this message]
2016-04-25 16:00   ` Tejun Heo
2016-04-25 16:40     ` Roman Penyaev
2016-04-25 16:34   ` Roman Penyaev
2016-04-25 17:03     ` Tejun Heo
2016-04-25 17:39       ` Roman Penyaev
2016-04-25 17:51         ` Tejun Heo
2016-04-26  1:22   ` Peter Hurley
2016-04-26 15:15     ` Tejun Heo
2016-04-26 17:27       ` Peter Hurley
2016-04-26 17:45         ` Tejun Heo
2016-04-26 20:07           ` Peter Hurley
2016-04-27  5:50           ` Hannes Reinecke
2016-04-27 19:05             ` Tejun Heo
