Re: [PATCH] workqueue: Fix memory ordering race in queue_work*()

From: Will Deacon <will@kernel.org>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>,
	Tejun Heo <tj@kernel.org>,
	marcan@marcan.st, peterz@infradead.org, jirislaby@kernel.org,
	maz@kernel.org, mark.rutland@arm.com, boqun.feng@gmail.com,
	catalin.marinas@arm.com, oneukum@suse.com,
	roman.penyaev@profitbricks.com, asahi@lists.linux.dev,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, stable@vger.kernel.org
Subject: Re: [PATCH] workqueue: Fix memory ordering race in queue_work*()
Date: Tue, 16 Aug 2022 14:41:57 +0100	[thread overview]
Message-ID: <20220816134156.GB11202@willie-the-truck> (raw)
In-Reply-To: <CAHk-=wgSNiT5qJX53RHtWECsUiFq6d6VWYNAvu71ViOEan07yw@mail.gmail.com>

On Mon, Aug 15, 2022 at 10:27:10PM -0700, Linus Torvalds wrote:
> On Mon, Aug 15, 2022 at 9:15 PM Herbert Xu <herbert@gondor.apana.org.au> wrote:
> >
> > Please revert this as test_and_set_bit was always supposed to be
> > a full memory barrier.  This is an arch bug.
> 
> Yes, the bitops are kind of strange for various legacy reasons:
> 
>  - set/clear_bit is atomic, but without a memory barrier, and need a
> "smp_mb__before_atomic()" or similar for barriers
> 
>  - test_and_set/clear_bit() are atomic, _and_ are memory barriers
> 
>  - test_and_set_bit_lock and test_and_clear_bit_unlock are atomic and
> _weaker_ than full memory barriers, but sufficient for locking (ie
> acquire/release)
> 
> Does any of this make any sense at all? No. But those are the
> documented semantics exactly because people were worried about
> test_and_set_bit being used for locking, since on x86 all the atomics
> are also memory barriers.
> 
> From looking at it, the asm-generic implementation is a bit
> questionable, though. In particular, it does
> 
>         if (READ_ONCE(*p) & mask)
>                 return 1;
> 
> so it's *entirely* unordered for the "bit was already set" case.
> 
> That looks very wrong to me, since it basically means that the
> test_and_set_bit() can return "bit was already set" based on an old
> value - not a memory barrier at all.
> 
> So if you use "test_and_set_bit()" as some kind of "I've done my work,
> now I am going to set the bit to tell people to pick it up", then that
> early "bit was already set" code completely breaks it.
> 
> Now, arguably our atomic bitop semantics are very very odd, and it
> might be time to revisit them. But that code looks very very buggy to
> me.
> 
> The bug seems to go back to commit e986a0d6cb36 ("locking/atomics,
> asm-generic/bitops/atomic.h: Rewrite using atomic_*() APIs"), and the
> fix looks to be as simple as just removing that early READ_ONCE return
> case (test_and_clear has the same bug).
> 
> Will?

Right, this looks like it's all my fault, so sorry about that.

In an effort to replace the spinlock-based atomic bitops with a version
based on atomic instructions in e986a0d6cb36, I inadvertently added this
READ_ONCE() shortcut to test_and_set_bit() because at the time that's
what we had (incorrectly) documented in our attempts at cleaning things
up in this area. I confess that I have never been comfortable with the
comment for test_and_set_bit() prior to my problematic patch:

/**
 * test_and_set_bit - Set a bit and return its old value
 * @nr: Bit to set
 * @addr: Address to count from
 *
 * This operation is atomic and cannot be reordered.
 * It may be reordered on other architectures than x86.
 * It also implies a memory barrier.
 */

so while Peter and I were trying to improve the documentation for
atomics and memory barriers we clearly ended up making the wrong call
trying to treat this like e.g. a cmpxchg() (which has the
unordered-on-failure semantics).

It's worth noting that with the spinlock-based implementation (i.e.
prior to e986a0d6cb36) then we would have the same problem on
architectures that implement spinlocks with acquire/release semantics;
accesses from outside of the critical section can drift in and reorder
with each other there, so the conversion looked legitimate to me in
isolation and I vaguely remember going through callers looking for
potential issues. Alas, I obviously missed this case.

So it looks to me like we need to:

  1. Upgrade test_and_{set,clear}_bit() to have a full memory barrier
     regardless of the value which is read from memory. The lock/unlock
     flavours can remain as-is.

  2. Fix the documentation

  3. Figure out what to do about architectures building atomics out of
     spinlocks (probably ok as lock+unlock == full barrier there?)

  4. Accept my sincerest apologies for the mess!

> IOW, the proper fix for this should, I think, look something like this
> (whitespace mangled) thing:
> 
>    --- a/include/asm-generic/bitops/atomic.h
>    +++ b/include/asm-generic/bitops/atomic.h
>    @@ -39,9 +39,6 @@ arch_test_and_set_bit(
>         unsigned long mask = BIT_MASK(nr);
> 
>         p += BIT_WORD(nr);
>    -    if (READ_ONCE(*p) & mask)
>    -            return 1;
>    -
>         old = arch_atomic_long_fetch_or(mask, (atomic_long_t *)p);
>         return !!(old & mask);
>     }
>    @@ -53,9 +50,6 @@ arch_test_and_clear_bit
>         unsigned long mask = BIT_MASK(nr);
> 
>         p += BIT_WORD(nr);
>    -    if (!(READ_ONCE(*p) & mask))
>    -            return 0;
>    -
>         old = arch_atomic_long_fetch_andnot(mask, (atomic_long_t *)p);
>         return !!(old & mask);
>     }
> 
> but the above is not just whitespace-damaged, it's entirely untested
> and based purely on me looking at that code.

Yes, I think that's step 1, thanks! I'm a bit worried about the perf
numbers on the other thread, but we can get to the bottom of that
separately.

Will