Re: [PATCH 2/4] fs: btrfs: fix data races in extent_write_cache_pages() - Jia-Ju Bai

From: Jia-Ju Bai <baijiaju1990@gmail.com>
To: dsterba@suse.cz, clm@fb.com, josef@toxicpanda.com,
	dsterba@suse.com, linux-btrfs@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH 2/4] fs: btrfs: fix data races in extent_write_cache_pages()
Date: Wed, 13 May 2020 10:17:10 +0800	[thread overview]
Message-ID: <06d64482-a263-668d-7ab1-9f411eb1c794@gmail.com> (raw)
In-Reply-To: <20200512215625.GE18421@twin.jikos.cz>

On 2020/5/13 5:56, David Sterba wrote:
> On Sat, May 09, 2020 at 01:27:01PM +0800, Jia-Ju Bai wrote:
>> The function extent_write_cache_pages is concurrently executed with
>> itself at runtime in the following call contexts:
>>
>> Thread 1:
>>    btrfs_sync_file()
>>      start_ordered_ops()
>>        btrfs_fdatawrite_range()
>>          btrfs_writepages() [via function pointer]
>>            extent_writepages()
>>              extent_write_cache_pages()
>>
>> Thread 2:
>>    btrfs_writepages()
>>      extent_writepages()
>>        extent_write_cache_pages()
>>
>> In extent_write_cache_pages():
>>    index = mapping->writeback_index;
>>    ...
>>    mapping->writeback_index = done_index;
>>
>> The accesses to mapping->writeback_index are not synchronized, and thus
>> data races for this value can occur.
>> These data races were found and actually reproduced by our concurrency
>> fuzzer.
>>
>> To fix these races, the spinlock mapping->private_lock is used to
>> protect the accesses to mapping->writeback_index.
>>
>> Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
>> ---
>>   fs/btrfs/extent_io.c | 7 ++++++-
>>   1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 39e45b8a5031..8c33a60bde1d 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -4160,7 +4160,9 @@ static int extent_write_cache_pages(struct address_space *mapping,
>>   
>>   	pagevec_init(&pvec);
>>   	if (wbc->range_cyclic) {
>> +		spin_lock(&mapping->private_lock);
>>   		index = mapping->writeback_index; /* Start from prev offset */
>> +		spin_unlock(&mapping->private_lock);
>>   		end = -1;
>>   		/*
>>   		 * Start from the beginning does not need to cycle over the
>> @@ -4271,8 +4273,11 @@ static int extent_write_cache_pages(struct address_space *mapping,
>>   			goto retry;
>>   	}
>>   
>> -	if (wbc->range_cyclic || (wbc->nr_to_write > 0 && range_whole))
>> +	if (wbc->range_cyclic || (wbc->nr_to_write > 0 && range_whole)) {
>> +		spin_lock(&mapping->private_lock);
>>   		mapping->writeback_index = done_index;
>> +		spin_unlock(&mapping->private_lock);
> I'm more and more curious what exactly is your fuzzer tool actualy
> reporting. Because adding the locks around the writeback index does not
> make any sense.
>
> The variable is of type unsigned long, this is written atomically so the
> only theoretical problem is on an achritecture that is not capable of
> storing that in one go, which means a lot more problems eg. because
> pointers are assumed to be the same width as unsigned long.
>
> So torn write is not possible and the lock leads to the same result as
> if it wasn't there and the read and write would happen not serialized by
> the spinlock but somewhere on the way from CPU caches to memory.
>
> CPU1                                   CPU2
>
> lock
> index = mapping->writeback_index
> unlock
>                                         lock
> 				       m->writeback_index = index;
> 				       unlock
>
> Is the same as
>
> CPU1                                   CPU2
>
>
> index = mapping->writeback_index
> 				       m->writeback_index = index;
>
> So maybe this makes your tool happy but there's no change from the
> correctness point of view, only added overhead from the lock/unlock
> calls.
>
> Lockless synchronization is a thing, using memory barriers etc., this
> was the case of some other patch, I think your tool needs to take that
> into account to give sensible results.

Thanks for the reply and explanation :)
I agree that only adding locks here makes no sense, because "index = 
mapping->writeback_index" can be still executed before or after 
"m->writeback_index = index" is executed.
So what is the expected order of the two statements? Read after write or 
write after read?

Best wishes,
Jia-Ju Bai