Re: [PATCH 6/8] xfs: introduce xfs_fs_destroy_super()

From: Qi Zheng <qi.zheng@linux.dev>
To: Christian Brauner <brauner@kernel.org>,
	"Darrick J. Wong" <djwong@kernel.org>,
	tkhai@ya.ru
Cc: Dave Chinner <david@fromorbit.com>,
	akpm@linux-foundation.org, roman.gushchin@linux.dev,
	vbabka@suse.cz, viro@zeniv.linux.org.uk, hughd@google.com,
	paulmck@kernel.org, muchun.song@linux.dev, linux-mm@kvack.org,
	linux-fsdevel@vger.kernel.org, linux-xfs@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	Qi Zheng <zhengqi.arch@bytedance.com>
Subject: Re: [PATCH 6/8] xfs: introduce xfs_fs_destroy_super()
Date: Mon, 5 Jun 2023 20:16:48 +0800	[thread overview]
Message-ID: <f7139906-3e2c-fa5c-b6ce-f7a8a0b1eefc@linux.dev> (raw)
In-Reply-To: <20230605-halbtags-gesplittet-c482c62cb2c9@brauner>

On 2023/6/5 19:50, Christian Brauner wrote:
> On Fri, Jun 02, 2023 at 08:15:32AM -0700, Darrick J. Wong wrote:
>> On Fri, Jun 02, 2023 at 11:13:09AM +0800, Qi Zheng wrote:
>>> Hi Dave,
>>>
>>> On 2023/6/2 07:06, Dave Chinner wrote:
>>>> On Thu, Jun 01, 2023 at 04:43:32PM +0800, Qi Zheng wrote:
>>>>> Hi Dave,
>>>>> On 2023/6/1 07:48, Dave Chinner wrote:
>>>>>> On Wed, May 31, 2023 at 09:57:40AM +0000, Qi Zheng wrote:
>>>>>>> From: Kirill Tkhai <tkhai@ya.ru>
>>>>>> I don't really like this ->destroy_super() callback, especially as
>>>>>> it's completely undocumented as to why it exists. This is purely a
>>>>>> work-around for handling extended filesystem superblock shrinker
>>>>>> functionality, yet there's nothing that tells the reader this.
>>>>>>
>>>>>> It also seems to imply that the superblock shrinker can continue to
>>>>>> run after the existing unregister_shrinker() call before ->kill_sb()
>>>>>> is called. This violates the assumption made in filesystems that the
>>>>>> superblock shrinkers have been stopped and will never run again
>>>>>> before ->kill_sb() is called. Hence ->kill_sb() implementations
>>>>>> assume there is nothing else accessing filesystem owned structures
>>>>>> and it can tear down internal structures safely.
>>>>>>
>>>>>> Realistically, the days of XFS using this superblock shrinker
>>>>>> extension are numbered. We've got a lot of the infrastructure we
>>>>>> need in place to get rid of the background inode reclaim
>>>>>> infrastructure that requires this shrinker extension, and it's on my
>>>>>> list of things that need to be addressed in the near future.
>>>>>>
>>>>>> In fact, now that I look at it, I think the shmem usage of this
>>>>>> superblock shrinker interface is broken - it returns SHRINK_STOP to
>>>>>> ->free_cached_objects(), but the only valid return value is the
>>>>>> number of objects freed (i.e. 0 is nothing freed). These special
>>>>>> superblock extension interfaces do not work like a normal
>>>>>> shrinker....
>>>>>>
>>>>>> Hence I think the shmem usage should be replaced with an separate
>>>>>> internal shmem shrinker that is managed by the filesystem itself
>>>>>> (similar to how XFS has multiple internal shrinkers).
>>>>>>
>>>>>> At this point, then the only user of this interface is (again) XFS.
>>>>>> Given this, adding new VFS methods for a single filesystem
>>>>>> for functionality that is planned to be removed is probably not the
>>>>>> best approach to solving the problem.
>>>>>
>>>>> Thanks for such a detailed analysis. Kirill Tkhai just proposeed a
>>>>> new method[1], I cc'd you on the email.
>>>>
>>>> I;ve just read through that thread, and I've looked at the original
>>>> patch that caused the regression.
>>>>
>>>> I'm a bit annoyed right now. Nobody cc'd me on the original patches
>>>> nor were any of the subsystems that use shrinkers were cc'd on the
>>>> patches that changed shrinker behaviour. I only find out about this
>>>
>>> Sorry about that, my mistake. I followed the results of
>>> scripts/get_maintainer.pl before.
>>
>> Sometimes I wonder if people who contribute a lot to a subsystem should
>> be more aggressive about listing themselves explicitly in MAINTAINERS
>> but then I look at the ~600 emails that came in while I was on vacation
>> for 6 days over a long weekend and ... shut up. :P
>>
>>>> because someone tries to fix something they broke by *breaking more
>>>> stuff* and not even realising how broken what they are proposing is.
>>>
>>> Yes, this slows down the speed of umount. But the benefit is that
>>> slab shrink becomes lockless, the mount operation and slab shrink no
>>> longer affect each other, and the IPC no longer drops significantly,
>>> etc.
>>
>> The lockless shrink seems like a good thing to have, but ... is it
>> really true that the superblock shrinker can still be running after
>> ->kill_sb?  /That/ is surprising to me.
> 
> So what's the plan here? If this causes issues for filesystems that rely
> on specific guarantees that are broken by the patch then either it needs
> a clean fix or a revert.

If the reduction in umount speed is really unacceptable, I think we can
try the patch[1] from Kirill Tkhai. At least the granularity of the
shrinker rwsem lock is reduced, and the file system code does not need
to be modified.

And IIUC, after clearing SHRINKER_REGISTERED under the write lock of
shrinker->rwsem, we can guarantee that the shrinker won't be running.
So synchronize_srcu() doesn't need to be called in unregister_shrinker()
anymore. So we don't need to split unregister_shrinker().

[1]. 
https://lore.kernel.org/lkml/bab60fe4-964c-43a6-ecce-4cbd4981d875@ya.ru/

> 
>>
>> --D
>>
>>> And I used bpftrace to measure the time consumption of
>>> unregister_shrinker():
>>>
>>> ```
>>> And I just tested it on a physical machine (Intel(R) Xeon(R) Platinum
>>> 8260 CPU @ 2.40GHz) and the results are as follows:
>>>
>>> 1) use synchronize_srcu():
>>>
>>> @ns[umount]:
>>> [8K, 16K)             83 |@@@@@@@       |
>>> [16K, 32K)           578
>>> |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>>> [32K, 64K)            78 |@@@@@@@       |
>>> [64K, 128K)            6 |       |
>>> [128K, 256K)           7 |       |
>>> [256K, 512K)          29 |@@       |
>>> [512K, 1M)            51 |@@@@      |
>>> [1M, 2M)              90 |@@@@@@@@       |
>>> [2M, 4M)              70 |@@@@@@      |
>>> [4M, 8M)               8 |      |
>>>
>>> 2) use synchronize_srcu_expedited():
>>>
>>> @ns[umount]:
>>> [8K, 16K)             31 |@@       |
>>> [16K, 32K)           803
>>> |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>>> [32K, 64K)           158 |@@@@@@@@@@       |
>>> [64K, 128K)            4 |       |
>>> [128K, 256K)           2 |       |
>>> [256K, 512K)           2 |       |
>>> ```
>>>
>>> With synchronize_srcu(), most of the time consumption is between 16us and
>>> 32us, the worst case between 4ms and 8ms. Is this totally
>>> unacceptable?
>>>
>>> This performance regression report comes from a stress test. Will the
>>> umount action be executed so frequently under real workloads?
>>>
>>> If there are really unacceptable, after applying the newly proposed
>>> method, umount will be as fast as before (or even faster).
>>>
>>> Thanks,
>>> Qi
>>>
>>>>
>>>> The previous code was not broken and it provided specific guarantees
>>>> to subsystems via unregister_shrinker(). From the above discussion,
>>>> it appears that the original authors of these changes either did not
>>>> know about or did not understand them, so that casts doubt in my
>>>> mind about the attempted solution and all the proposed fixes for it.
>>>>
>>>> I don't have the time right now unravel this mess and fully
>>>> understand the original problem, changes or the band-aids that are
>>>> being thrown around. We are also getting quite late in the cycle to
>>>> be doing major surgery to critical infrastructure, especially as it
>>>> gives so little time to review regression test whatever new solution
>>>> is proposed.
>>>>
>>>> Given this appears to be a change introduced in 6.4-rc1, I think the
>>>> right thing to do is to revert the change rather than make things
>>>> worse by trying to shove some "quick fix" into the kernel to address
>>>> it.
>>>>
>>>> Andrew, could you please sort out a series to revert this shrinker
>>>> infrastructure change and all the dependent hacks that have been
>>>> added to try to fix it so far?
>>>>
>>>> -Dave.
>>>
>>> -- 
>>> Thanks,
>>> Qi