On 2020/7/28 10:52 PM, Hans van Kranenburg wrote:
> On 7/28/20 3:52 AM, Qu Wenruo wrote:
>>
>>
>> On 2020/7/28 8:51 AM, Qu Wenruo wrote:
>>>
>>>
>>> On 2020/7/28 1:17 AM, Hans van Kranenburg wrote:
>>>> Hi!
>>>>
>>>> [...]
>>>>>
>>>>> Since it's almost a dead CPU burner loop, regular sleep based lockup
>>>>> detector won't help much.
>>>>
>>>> Here's a flame graph of 180 seconds, taken from the kernel thread pid:
>>>>
>>>> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2020-07-27-perf-kworker-flush-btrfs.svg
>>>
>>> That's really awesome!
>>>
>>>>
>>>>> You can try trace events first to see which trace event get executed the
>>>>> most frequently, then try to add probe points to pin down the real cause.
>>>>
>>>> From the default collection, I already got the following, a few days
>>>> ago, by enabling find_free_extent and btrfs_cow_block:
>>>>
>>>> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2020-07-25-find_free_extent.txt
>>
>> This output is in fact pretty confusing, and maybe give you a false view
>> on the callers of find_free_extent().
>>
>> It always shows "EXTENT_TREE" as the owner, but that's due to a bad
>> decision on the trace event.
>>
>> I have submitted a patch addressing it, and added you to the CC.
>
> Oh right, thanks, that actually makes a lot of sense, lol.
>
> I was misled because at first sight I was thinking, "yeah, obviously,
> where else than in the extent tree are you going to do administration
> about allocated blocks.", and didn't realize yet it did make no sense.
>
>> Would you mind to re-capture the events with that patch?
>> So that we could have a clearer idea on which tree is having the most
>> amount of concurrency?
>
> Yes. I will do that and try to reproduce the symptoms with as few
> actions in parallel as possible.
>
> I've been away most of the day today, I will see how far I get later,
> otherwise continue tomorrow.
>
> What you *can* see in the current output already, however, is that the
> kworker/u16:3-13887 thing is doing all DATA work, while many different
> processes (rsync, btrfs receive) all do the find_free_extent work for
> METADATA. That's already an interesting difference.

That's normal; don't forget that btrfs does COW to modify its metadata.

After a data write reaches disk, btrfs then modifies metadata to
reflect that write, which causes a metadata reservation.

The good news is that btrfs does not always need to COW metadata: if
the needed tree blocks have already been CoWed in the current
transaction and not yet written to disk, they can be modified without
another COW.

So in theory, data reservations should be more frequent than metadata
ones.

>
> So, one direction to look into is who is all trying to grab that spin
> lock, since if it's per 'space' (which sounds logical, since the workers
> will never clash because a whole block group belongs to only 1 'space'),
> then I don't see why kworker/u16:18-11336 would spend 1/3 of it's time
> in a busy locking situation waiting, while it's the only process working
> on METADATA.
>
> But, I'll gather some more logs and pictures.
>
> Do you know (some RTFM pointer?) about a way to debug who's locking on
> the same thing? I didn't research that yet.

I have a bcc based script to do that kind of analysis, but it's mostly
for tree blocks:

https://github.com/adam900710/btrfs-profiler/blob/master/tree_lock_wait.py

Feel free to use that as a base to develop your own tool; a rough
sketch of the same idea is at the end of this mail.

BTW, this data allocation race reminds me that Josef and Nikolay
discussed this possible problem before, but we don't yet have a solid
solution for it.
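To make that concrete, here is a minimal, untested sketch of the same
idea in bcc: it histograms how long each task spends inside
btrfs_tree_lock(). It measures the whole call, contended or not, and
assumes btrfs_tree_lock() is a visible kprobe target on your kernel,
so treat it only as a starting point, not as a finished tool:

#!/usr/bin/env python
# Sketch: time spent inside btrfs_tree_lock(), per task.
# Measures the whole call (contended or not), so it is only a rough
# indicator of how much time goes into tree lock waiting.
from __future__ import print_function
from bcc import BPF
from time import sleep

prog = r"""
#include <uapi/linux/ptrace.h>
#include <linux/sched.h>

struct key_t {
    char comm[TASK_COMM_LEN];
};

BPF_HASH(start, u32, u64);              /* tid -> entry timestamp */
BPF_HISTOGRAM(dist_us);                 /* global latency histogram */
BPF_HASH(total_us, struct key_t, u64);  /* per-comm total time */

int lock_enter(struct pt_regs *ctx)
{
    u32 tid = bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();
    start.update(&tid, &ts);
    return 0;
}

int lock_return(struct pt_regs *ctx)
{
    u32 tid = bpf_get_current_pid_tgid();
    u64 *tsp = start.lookup(&tid);
    if (!tsp)
        return 0;
    u64 delta_us = (bpf_ktime_get_ns() - *tsp) / 1000;
    start.delete(&tid);

    struct key_t key = {};
    bpf_get_current_comm(&key.comm, sizeof(key.comm));
    total_us.increment(key, delta_us);
    dist_us.increment(bpf_log2l(delta_us));
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event="btrfs_tree_lock", fn_name="lock_enter")
b.attach_kretprobe(event="btrfs_tree_lock", fn_name="lock_return")

print("Tracing btrfs_tree_lock(), hit Ctrl-C to stop")
try:
    sleep(99999999)
except KeyboardInterrupt:
    pass

b["dist_us"].print_log2_hist("usecs")
print()
for k, v in sorted(b["total_us"].items(),
                   key=lambda kv: kv[1].value, reverse=True):
    print("%-20s %12d us" % (k.comm.decode("utf-8", "replace"), v.value))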
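And since the goal is mostly to collect data, here is another untested
sketch along the same lines: it counts find_free_extent() reservations
per task, split into DATA vs METADATA/SYSTEM, using the existing
btrfs:find_free_extent tracepoint. On the kernels I checked the flags
live in a tracepoint field named "data", but please verify that
against /sys/kernel/debug/tracing/events/btrfs/find_free_extent/format
before trusting the numbers:

#!/usr/bin/env python
# Sketch: count find_free_extent() calls per task, split into DATA vs
# METADATA/SYSTEM reservations, via the btrfs:find_free_extent
# tracepoint.  The "data" field name may differ between kernel versions.
from __future__ import print_function
from bcc import BPF
from time import sleep

prog = r"""
#include <linux/sched.h>

/* Block group type bit, from include/uapi/linux/btrfs_tree.h */
#define BTRFS_BLOCK_GROUP_DATA  (1ULL << 0)

struct key_t {
    char comm[TASK_COMM_LEN];
    u32 is_data;
};

BPF_HASH(counts, struct key_t, u64);

TRACEPOINT_PROBE(btrfs, find_free_extent)
{
    struct key_t key = {};
    bpf_get_current_comm(&key.comm, sizeof(key.comm));
    /* args->data holds the block group flags of this reservation */
    key.is_data = (args->data & BTRFS_BLOCK_GROUP_DATA) ? 1 : 0;
    counts.increment(key);
    return 0;
}
"""

b = BPF(text=prog)  # tracepoint probes are attached automatically
print("Counting btrfs:find_free_extent per task, hit Ctrl-C to stop")
try:
    sleep(99999999)
except KeyboardInterrupt:
    pass

for k, v in sorted(b["counts"].items(),
                   key=lambda kv: kv[1].value, reverse=True):
    kind = "DATA" if k.is_data else "META/SYS"
    print("%-20s %-8s %10d" % (k.comm.decode("utf-8", "replace"),
                               kind, v.value))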
So I guess what we can do now is just to collect enough data to prove
it's already a big problem.

Thanks,
Qu

>
> Hans
>
>> [...]