Re: [PATCH v2] btrfs: relocation: Fix KASAN reports caused by extended reloc tree lifespan

From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: dsterba@suse.cz, Nikolay Borisov <nborisov@suse.com>,
	Josef Bacik <josef@toxicpanda.com>, Qu Wenruo <wqu@suse.com>,
	linux-btrfs@vger.kernel.org,
	Zygo Blaxell <ce3g8jdj@umail.furryterror.org>,
	David Sterba <dsterba@suse.com>
Subject: Re: [PATCH v2] btrfs: relocation: Fix KASAN reports caused by extended reloc tree lifespan
Date: Fri, 10 Jan 2020 08:21:46 +0800	[thread overview]
Message-ID: <f8458b9c-0b6c-024e-399d-ea530abd1204@gmx.com> (raw)
In-Reply-To: <20200109143742.GN3929@twin.jikos.cz>

On 2020/1/9 下午10:37, David Sterba wrote:
> On Thu, Jan 09, 2020 at 01:54:34PM +0800, Qu Wenruo wrote:
>>>> The way I read it is more like smp_rmb/smp_wmb, but for bits in this
>>>> case, so the smp_mb__before/after_atomic was only a syntactic sugar to
>>>> match that it's atomic bitops. I realize this could have caused some
>>>> confusion, however I still think that some sort of barrier is needed.
>>>
>>> There's an existing pattern used for serializing set/clear of
>>> BTRFS_ROOT_IN_TRANS_SETUP (record_root_in_trans,
>>> btrfs_record_root_in_trans).
>>>
>>> Once upon a time there were barriers like smp_mb__before_clear_bit but
>>> they got unified to just smp_mb__before_atomic for all set/clear
>>> operations, so I believe I was not all wrong with using them.
>>>
>> I used to believe the same fairy tail, that mb() works like a flush(),
>> but we're not living in a fairy tail.
>
> This sounds strange, by flush I was refering to internal CPU mechanism
> that makes all temporary changes visible to other cpus, so you're saying
> that this does not work as everybody expects?

Because no matter whether mb ensures that or not, the result is the same.

What would happen if the reader get schedule just before the mb()
command? Reader can still get older value.
That makes no difference if we had mb() or not.

>
>> What mb really does is keep certain ordering from happening.
>> And ordering means, we have at least *2* different memory accesses.
>
> Sorry but I think you're missing some pieces here. There are 2 memory
> accesses: set_bit/clear_bit (writer) and test_bit (reader).
>
> The same could be achieved by a plain variable, that it's a bit brings
> more confusion about what barrier should be really used.
>
> The pattern we want to use here is pretty standard. Read barrier before
> read and write barrier that separates the data change from the indicator
> of the change. If you line up the barriers, there's a clear line between
> the data changes and the indicator value.

Again, test_bit() can happen whenever they like, and smp_rmb() before
test_bit() makes no sense.

test_bit() can happen before reloc_root = NULL assign. after reloc_root
= NULL assign but before set_bit(), or after set_bit().

That smp_wmb() makes sure set_bit() won't happen before reloc_root, than
that's enough.
BTW, smp_mb__before_atomic() should be full mb, not just wmb or rmb.

>
> reloc_root = NULL
> smp_wmb                           smp_rmb()
>                                   test_bit()
>                                   ... here
> set_bit(DEAD)
>                                   ... or here, it'll be always
> 				  reloc_root == NULL, but it still could
> 				  be before or after set_bit
>
> You should also distinguish between mb() and smp_mb() (and the rmb/wmb).
> mb is a unconditional barrier, used to synchronize access to hardware io
> ports, suitable in drivers.
>
> We use smp_mb() because this serializes memory among multipe CPUs, when
> one changes memory but stores it to some temporary structures, while
> other CPUs don't see the effects. I'm sure you've read about that in the
> memory barrier docs.

Yes, but schedule can put that smp_rmb(); test_bit() line where ever
they want. So smp_rmb(); test_bit() can still get all the 3 difference
timing. It's that smp_mb__before_atomic() killing the 4th case, not the
smp_rmb().

>
>> It's not to ensure every reader get the same result, as there is no way
>> to do that. Read can happen whenever they want.
>
> That is true about the 'whenever', but what we need to guarantee here is
> that when the read happens the following condition will have view of the
> changes implied by the pairing barrier.

Still, if reader still gets the temporary value, it's the same as random
schedule timing.
What we need to ensure is the order, which is solely ensured by that
smb_mb__before_atomic().

Thanks,
Qu

>
>> So before we talk about mb, we first need to know which 2 memory
>> accesses we're talking about.
>> If there is even no 2 memory access, then there is no need for mb().
>>
>> E.g. for the mb implied by spinlock(), it's not to ensure the spinlock
>> counter reader, but to ensure the memory ordering between the spinlock
>> counter itself and the memory it's protecting.
>>
>> So for memory barrier around test_bit(), as long as the compiler is not
>> doing reordering, we don't need extra mb.
>
> This is not about compiler.
>
>> And if the compiler is really doing re-ordering for the
>> have_reloc_root(), then the compiler is doing something wrong, as that
>> would makes the test_bit() meaningless.