From: Roman Penyaev <roman.penyaev@profitbricks.com>
To: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>,
	linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	David Howells <dhowells@redhat.com>
Subject: Re: [PATCH 1/1] [RFC] workqueue: fix ghost PENDING flag while doing MQ IO
Date: Mon, 25 Apr 2016 18:34:45 +0200
Message-ID: <CAJrWOzDcY=HrpXHKU7sLO79CJ8H=J14ywfE70aMiu-+1kJKKXg@mail.gmail.com>
In-Reply-To: <20160425154847.GZ7822@mtj.duckdns.org>

Hello, Tejun,

On Mon, Apr 25, 2016 at 5:48 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Roman.
>
> On Mon, Apr 25, 2016 at 05:22:51PM +0200, Roman Pen wrote:
> ...
>>   CPU#6                                CPU#2
>>   request ffff884000343600 inserted
>>   hctx marked as pended
>>   kblockd_schedule...() returns 1
>>   <schedule to kblockd workqueue>
>>   *** WORK_STRUCT_PENDING_BIT is cleared ***
>>   flush_busy_ctxs() is executed
>>                                        request ffff884000343cc0 inserted
>>                                        hctx marked as pended
>>                                        kblockd_schedule...() returns 0
>>                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>                                        WTF?
>>
>> As a result, request ffff884000343cc0 stays pending forever.
>>
>> According to the trace output, another CPU _always_ observes
>> WORK_STRUCT_PENDING_BIT as set for that hctx->run_work, even though it
>> was cleared on another CPU.
>>
>> Checking the workqueue.c code, I see that clearing the bit is nothing
>> more than atomic_long_set(), which is a <mov> instruction, in this
>> function:
>>
>>   static inline void set_work_data()
>>
>> In an attempt to "fix" the mystery I replaced the atomic_long_set() call
>> with atomic_long_xchg(), which is an <xchg> instruction.
>>
>> The problem is gone.
>>
>> To me it looks like test_and_set_bit() (a <lock btsl> instruction) does
>> not require a flush of all CPU caches, which can be dirty after <mov>
>> has executed on another CPU.  But <xchg> really updates the memory, and
>> the following <lock btsl> observes that the bit was cleared.
>>
>> In conclusion, I am lucky enough to be able to reproduce this bug
>> within several minutes on a specific load (I tried many other simple
>> loads using fio, even using btrecord/btreplay, with no success).
>> That easy reproduction on a specific load gives me some freedom to
>> test and then to be sure that the problem is gone.
>
> Heh, excellent debugging.  I wonder how old this bug is.  cc'ing David
> Howells who ISTR to have reported a similar issue.  The root problem
> here, I think, is that PENDING is used to synchronize between
> different queueing instances but we don't have proper memory barrier
> after it.
>
>         A                               B
>
>         queue (test_and_sets PENDING)
>         dispatch (clears PENDING)
>         execute                         queue (test_and_sets PENDING)
>
> So, for B, the guarantee must be that either A starts executing after
> B's test_and_set or B's test_and_set succeeds; however, as we don't
> have any memory barrier between dispatch and execute, there's nothing
> preventing the processor from scheduling some memory fetch operations
> from the execute stage before the clearing of PENDING - ie. A might
> not see what B has done prior to queue even if B's test_and_set fails
> indicating that A should.  Can you please test whether the following
> patch fixes the issue?

I can assure you that smp_mb() helps (at least when running for 30 minutes
under IO). That was my first variant, but I did not like it because I
could not explain to myself:

1. Why not smp_wmb()? We need to flush after an update.
   (I tried that as well, and it does not help.)

2. What protects us from this situation?

  CPU#0                  CPU#1
                         set_work_data()
  test_and_set_bit()
                         smp_mb()

Question 2 was crucial to me, because even a tiny delay "fixes" the
problem, e.g. ndelay() also "fixes" the bug:

         smp_wmb();
         set_work_data(work, (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT, 0);
 +       ndelay(40);
  }

Why ndelay(40)? Because on this machine smp_mb() takes 40 ns on average.
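
Question 1 above (why smp_wmb() is not enough while smp_mb() is) can be
illustrated with a toy store-buffer model. This is only a sketch under
stated assumptions: each CPU has a one-entry store buffer, a plain store
sits in that buffer until a full barrier or a locked instruction drains
it, and loads may run ahead of the buffered store. All names here
(struct cpu, request_lost(), the BARRIER_* values) are hypothetical and
are not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Shared variables of the toy model. */
enum { PENDING, HCTX_PENDED, NR_VARS };
enum barrier { BARRIER_NONE, BARRIER_WMB, BARRIER_MB };

struct cpu {                 /* hypothetical one-entry store buffer */
    bool has_buffered;       /* a store is waiting to reach memory */
    int var, val;            /* target variable and value of that store */
};

static int memory[NR_VARS];

/* Plain store: parked in the local buffer, invisible to the other CPU. */
static void store(struct cpu *c, int var, int val)
{
    assert(!c->has_buffered);    /* one entry is enough for this sketch */
    c->has_buffered = true;
    c->var = var;
    c->val = val;
}

/* Full barrier: drain the store buffer into shared memory. */
static void full_barrier(struct cpu *c)
{
    if (c->has_buffered) {
        memory[c->var] = c->val;
        c->has_buffered = false;
    }
}

/* Load: forward from the own store buffer, otherwise read memory. */
static int load(const struct cpu *c, int var)
{
    if (c->has_buffered && c->var == var)
        return c->val;
    return memory[var];
}

/* Locked read-modify-write: drains the own buffer, then acts on memory. */
static int test_and_set(struct cpu *c, int var)
{
    full_barrier(c);
    int old = memory[var];
    memory[var] = 1;
    return old;
}

/* Replays one interleaving; returns true if B's request is lost. */
static bool request_lost(enum barrier kind)
{
    struct cpu a = {0}, b = {0};

    memory[PENDING] = 1;         /* A's work item is already queued */
    memory[HCTX_PENDED] = 0;

    /* CPU A: dispatch clears PENDING ... */
    store(&a, PENDING, 0);
    if (kind == BARRIER_MB)
        full_barrier(&a);
    /* BARRIER_WMB: nothing to drain; it only orders stores among themselves */

    /* ... then the execute stage looks for pended requests. */
    bool a_saw_request = load(&a, HCTX_PENDED) != 0;

    /* CPU B: marks the hctx as pended, then tries to queue the work. */
    store(&b, HCTX_PENDED, 1);
    bool b_queued = test_and_set(&b, PENDING) == 0;

    full_barrier(&a);            /* A's store reaches memory at last */

    return !a_saw_request && !b_queued;
}
```

In this model request_lost() is true for BARRIER_NONE and BARRIER_WMB and
false for BARRIER_MB: only the full barrier makes the cleared PENDING
visible before A's later load runs. The model says nothing about real
timings, so it does not settle question 2 or the ndelay() observation.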

--
Roman

>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 2232ae3..8ec2b5e 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -666,6 +666,7 @@ static void set_work_pool_and_clear_pending(struct work_struct *work,
>          */
>         smp_wmb();
>         set_work_data(work, (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT, 0);
> +       smp_mb();
>  }
>
>  static void clear_work_data(struct work_struct *work)
>
>
> --
> tejun
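
For readers who want to try the barrier placement outside the kernel, the
patched function can be approximated with C11 atomics. This is a rough
user-space analogue, not the kernel implementation: struct work,
try_queue(), clear_pending(), PENDING_BIT and POOL_SHIFT are hypothetical
stand-ins, and the C11 memory model differs from the kernel's, so the
sketch only shows where the barriers sit.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define PENDING_BIT 1UL   /* stand-in for the PENDING flag */
#define POOL_SHIFT  5     /* hypothetical; the kernel uses WORK_OFFQ_POOL_SHIFT */

struct work {             /* hypothetical, not the kernel's work_struct */
    atomic_ulong data;
};

/* Rough analogue of the queueing side: true if this caller claimed PENDING. */
static bool try_queue(struct work *w)
{
    /* Sequentially consistent read-modify-write, in the role of test_and_set_bit(). */
    unsigned long old = atomic_fetch_or(&w->data, PENDING_BIT);
    return !(old & PENDING_BIT);
}

/* Rough analogue of set_work_pool_and_clear_pending() with the extra barrier. */
static void clear_pending(struct work *w, unsigned long pool_id)
{
    /* The pool id must fit above the flag bits. */
    assert(((pool_id << POOL_SHIFT) >> POOL_SHIFT) == pool_id);

    /* At least as strong as smp_wmb(): earlier stores come first. */
    atomic_thread_fence(memory_order_release);

    /* Plain store that overwrites the word and so clears PENDING. */
    atomic_store_explicit(&w->data, pool_id << POOL_SHIFT,
                          memory_order_relaxed);

    /* In the role of the added smp_mb(): later loads stay after the store. */
    atomic_thread_fence(memory_order_seq_cst);
}
```

A single-threaded walk through the state transitions: try_queue()
succeeds on an idle item, fails while PENDING is set, and succeeds again
after clear_pending().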
