Linux-NFS Archive on lore.kernel.org
* Re: [PATCH] mm: fix hanging shrinker management on long do_shrink_slab
       [not found] <20191129214541.3110-1-ptikhomirov@virtuozzo.com>
@ 2019-12-02 16:36 ` Andrey Ryabinin
  2019-12-03  0:13   ` Shakeel Butt
  2019-12-06  2:09   ` Dave Chinner
  0 siblings, 2 replies; 6+ messages in thread
From: Andrey Ryabinin @ 2019-12-02 16:36 UTC (permalink / raw)
  To: Pavel Tikhomirov, Andrew Morton, linux-kernel, cgroups, linux-mm
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Roman Gushchin,
	Shakeel Butt, Chris Down, Yang Shi, Tejun Heo, Thomas Gleixner,
	Kirill A . Shutemov, Konstantin Khorenko, Kirill Tkhai,
	Trond Myklebust, Anna Schumaker, J. Bruce Fields, Chuck Lever,
	linux-nfs, Alexander Viro, linux-fsdevel


On 11/30/19 12:45 AM, Pavel Tikhomirov wrote:
> We have a problem where shrinker_rwsem can be held for read for a long
> time in shrink_slab, while at the same time any process trying to
> manage shrinkers hangs.
> 
> The shrinker_rwsem is taken in shrink_slab while traversing shrinker_list.
> It tries to shrink something on nfs (hard-mounted), but the nfs server
> is already dead at that moment and the rpc will never succeed. Generally
> any shrinker can take significant time in do_shrink_slab, so it's a bad
> idea to hold the list lock here.
> 
> We have a similar problem in shrink_slab_memcg, except that we are
> traversing shrinker_map+shrinker_idr there.
> 
> The idea of the patch is to take a refcount on the chosen shrinker so
> it won't disappear, and to release shrinker_rwsem while we are in
> do_shrink_slab; after that we reacquire shrinker_rwsem, drop the
> refcount and continue the traversal.
> 
> We also need a wait_queue so that unregister_shrinker can wait for the
> refcnt to become zero. Only after that can we safely remove the
> shrinker from the list and idr, and free the shrinker.
> 
> I've reproduced the nfs hang in do_shrink_slab with the patch applied
> on a mainline (ms) kernel; all other mount/unmount operations complete
> without any hang.
> 
> Here is a reproduction on kernel without patch:
> 
> 1) Setup nfs on server node with some files in it (e.g. 200)
> 
> [server]# cat /etc/exports
> /vz/nfs2 *(ro,no_root_squash,no_subtree_check,async)
> 
> 2) Hard mount it on client node
> 
> [client]# mount -ohard 10.94.3.40:/vz/nfs2 /mnt
> 
> 3) Open some (e.g. 200) files on the mount
> 
> [client]# for i in $(find /mnt/ -type f | head -n 200); \
>   do setsid sleep 1000 &>/dev/null <$i & done
> 
> 4) Kill all openers
> 
> [client]# killall sleep -9
> 
> 5) Pull the network cable out on the client node
> 
> 6) Drop caches on the client; it will hang on nfs while holding
>   the shrinker_rwsem lock for read
> 
> [client]# echo 3 > /proc/sys/vm/drop_caches
> 
>   crash> bt ...
>   PID: 18739  TASK: ...  CPU: 3   COMMAND: "bash"
>    #0 [...] __schedule at ...
>    #1 [...] schedule at ...
>    #2 [...] rpc_wait_bit_killable at ... [sunrpc]
>    #3 [...] __wait_on_bit at ...
>    #4 [...] out_of_line_wait_on_bit at ...
>    #5 [...] _nfs4_proc_delegreturn at ... [nfsv4]
>    #6 [...] nfs4_proc_delegreturn at ... [nfsv4]
>    #7 [...] nfs_do_return_delegation at ... [nfsv4]
>    #8 [...] nfs4_evict_inode at ... [nfsv4]
>    #9 [...] evict at ...
>   #10 [...] dispose_list at ...
>   #11 [...] prune_icache_sb at ...
>   #12 [...] super_cache_scan at ...
>   #13 [...] do_shrink_slab at ...
>   #14 [...] shrink_slab at ...
>   #15 [...] drop_slab_node at ...
>   #16 [...] drop_slab at ...
>   #17 [...] drop_caches_sysctl_handler at ...
>   #18 [...] proc_sys_call_handler at ...
>   #19 [...] vfs_write at ...
>   #20 [...] ksys_write at ...
>   #21 [...] do_syscall_64 at ...
>   #22 [...] entry_SYSCALL_64_after_hwframe at ...
> 
> 7) All other mount/umount activity now hangs, unable to take
>   shrinker_rwsem for write.
> 
> [client]# mount -t tmpfs tmpfs /tmp
> 
>   crash> bt ...
>   PID: 5464   TASK: ...  CPU: 3   COMMAND: "mount"
>    #0 [...] __schedule at ...
>    #1 [...] schedule at ...
>    #2 [...] rwsem_down_write_slowpath at ...
>    #3 [...] prealloc_shrinker at ...
>    #4 [...] alloc_super at ...
>    #5 [...] sget at ...
>    #6 [...] mount_nodev at ...
>    #7 [...] legacy_get_tree at ...
>    #8 [...] vfs_get_tree at ...
>    #9 [...] do_mount at ...
>   #10 [...] ksys_mount at ...
>   #11 [...] __x64_sys_mount at ...
>   #12 [...] do_syscall_64 at ...
>   #13 [...] entry_SYSCALL_64_after_hwframe at ...
> 
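For reference, a minimal sketch of the traversal described in the quoted
patch, to make the proposed locking order concrete. The refcnt/wq fields
and the exact call sites below are illustrative assumptions, not the
actual patch:

/*
 * Assumed additions to struct shrinker (illustrative only):
 *	atomic_t		refcnt;		- set to 1 in register_shrinker()
 *	wait_queue_head_t	wq;		- woken when refcnt drops to zero
 */

static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
				 struct mem_cgroup *memcg, int priority)
{
	struct shrinker *shrinker;
	unsigned long freed = 0;

	if (!down_read_trylock(&shrinker_rwsem))
		return 0;

	list_for_each_entry(shrinker, &shrinker_list, list) {
		struct shrink_control sc = {
			.gfp_mask = gfp_mask,
			.nid = nid,
			.memcg = memcg,
		};

		/* Pin this shrinker, then drop the global lock for the slow part. */
		if (!atomic_inc_not_zero(&shrinker->refcnt))
			continue;	/* being unregistered, skip it */
		up_read(&shrinker_rwsem);

		freed += do_shrink_slab(&sc, shrinker, priority);

		/* Reacquire the lock before touching the list again. */
		down_read(&shrinker_rwsem);
		if (atomic_dec_and_test(&shrinker->refcnt))
			wake_up(&shrinker->wq);
	}

	up_read(&shrinker_rwsem);
	return freed;
}

void unregister_shrinker(struct shrinker *shrinker)
{
	/* Drop the initial reference; walkers can no longer take new ones. */
	if (!atomic_dec_and_test(&shrinker->refcnt))
		wait_event(shrinker->wq, atomic_read(&shrinker->refcnt) == 0);

	down_write(&shrinker_rwsem);
	list_del(&shrinker->list);
	up_write(&shrinker_rwsem);
	/* shrinker_idr/shrinker_map cleanup for memcg-aware shrinkers omitted */
}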
 

I don't think this patch solves the problem; it only fixes one minor symptom of it.
The actual problem here is the reclaim hang in nfs.
It means that any process, including kswapd, may go into nfs inode reclaim and get stuck there.

Even mount() itself has GFP_KERNEL allocations in its path, so it might still get stuck there even with your patch.

I think this should be handled at the nfs/vfs level by making inode eviction during reclaim more asynchronous.
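One hypothetical shape of that, purely as an illustration (none of these
names are an existing nfs/vfs interface): the shrinker would only detach
the inodes and hand the part that can block on the server to a workqueue:

static LIST_HEAD(deferred_evict_list);
static DEFINE_SPINLOCK(deferred_evict_lock);

static void deferred_evict_fn(struct work_struct *work)
{
	LIST_HEAD(list);

	spin_lock(&deferred_evict_lock);
	list_splice_init(&deferred_evict_list, &list);
	spin_unlock(&deferred_evict_lock);

	/* May block on the RPC, but no longer on anyone's reclaim path. */
	dispose_list(&list);
}
static DECLARE_WORK(deferred_evict_work, deferred_evict_fn);

/* Called from the shrinker instead of disposing synchronously. */
static void defer_dispose(struct list_head *to_evict)
{
	spin_lock(&deferred_evict_lock);
	list_splice_tail_init(to_evict, &deferred_evict_list);
	spin_unlock(&deferred_evict_lock);
	schedule_work(&deferred_evict_work);
}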


* Re: [PATCH] mm: fix hanging shrinker management on long do_shrink_slab
  2019-12-02 16:36 ` [PATCH] mm: fix hanging shrinker management on long do_shrink_slab Andrey Ryabinin
@ 2019-12-03  0:13   ` Shakeel Butt
  2019-12-03 11:03     ` Kirill Tkhai
  2019-12-06  2:09   ` Dave Chinner
  1 sibling, 1 reply; 6+ messages in thread
From: Shakeel Butt @ 2019-12-03  0:13 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Pavel Tikhomirov, Andrew Morton, LKML, Cgroups, Linux MM,
	Johannes Weiner, Michal Hocko, Vladimir Davydov, Roman Gushchin,
	Chris Down, Yang Shi, Tejun Heo, Thomas Gleixner,
	Kirill A . Shutemov, Konstantin Khorenko, Kirill Tkhai,
	Trond Myklebust, Anna Schumaker, J. Bruce Fields, Chuck Lever,
	linux-nfs, Alexander Viro, linux-fsdevel

On Mon, Dec 2, 2019 at 8:37 AM Andrey Ryabinin <aryabinin@virtuozzo.com> wrote:
>
>
> On 11/30/19 12:45 AM, Pavel Tikhomirov wrote:
> > We have a problem where shrinker_rwsem can be held for read for a long
> > time in shrink_slab, while at the same time any process trying to
> > manage shrinkers hangs.
> >
> > The shrinker_rwsem is taken in shrink_slab while traversing shrinker_list.
> > It tries to shrink something on nfs (hard-mounted), but the nfs server
> > is already dead at that moment and the rpc will never succeed. Generally
> > any shrinker can take significant time in do_shrink_slab, so it's a bad
> > idea to hold the list lock here.
> >
> > We have a similar problem in shrink_slab_memcg, except that we are
> > traversing shrinker_map+shrinker_idr there.
> >
> > The idea of the patch is to take a refcount on the chosen shrinker so
> > it won't disappear, and to release shrinker_rwsem while we are in
> > do_shrink_slab; after that we reacquire shrinker_rwsem, drop the
> > refcount and continue the traversal.
> >
> > We also need a wait_queue so that unregister_shrinker can wait for the
> > refcnt to become zero. Only after that can we safely remove the
> > shrinker from the list and idr, and free the shrinker.
> >
> > I've reproduced the nfs hang in do_shrink_slab with the patch applied
> > on a mainline (ms) kernel; all other mount/unmount operations complete
> > without any hang.
> >
> > Here is a reproduction on kernel without patch:
> >
> > 1) Setup nfs on server node with some files in it (e.g. 200)
> >
> > [server]# cat /etc/exports
> > /vz/nfs2 *(ro,no_root_squash,no_subtree_check,async)
> >
> > 2) Hard mount it on client node
> >
> > [client]# mount -ohard 10.94.3.40:/vz/nfs2 /mnt
> >
> > 3) Open some (e.g. 200) files on the mount
> >
> > [client]# for i in $(find /mnt/ -type f | head -n 200); \
> >   do setsid sleep 1000 &>/dev/null <$i & done
> >
> > 4) Kill all openers
> >
> > [client]# killall sleep -9
> >
> > 5) Pull the network cable out on the client node
> >
> > 6) Drop caches on the client; it will hang on nfs while holding
> >   the shrinker_rwsem lock for read
> >
> > [client]# echo 3 > /proc/sys/vm/drop_caches
> >
> >   crash> bt ...
> >   PID: 18739  TASK: ...  CPU: 3   COMMAND: "bash"
> >    #0 [...] __schedule at ...
> >    #1 [...] schedule at ...
> >    #2 [...] rpc_wait_bit_killable at ... [sunrpc]
> >    #3 [...] __wait_on_bit at ...
> >    #4 [...] out_of_line_wait_on_bit at ...
> >    #5 [...] _nfs4_proc_delegreturn at ... [nfsv4]
> >    #6 [...] nfs4_proc_delegreturn at ... [nfsv4]
> >    #7 [...] nfs_do_return_delegation at ... [nfsv4]
> >    #8 [...] nfs4_evict_inode at ... [nfsv4]
> >    #9 [...] evict at ...
> >   #10 [...] dispose_list at ...
> >   #11 [...] prune_icache_sb at ...
> >   #12 [...] super_cache_scan at ...
> >   #13 [...] do_shrink_slab at ...
> >   #14 [...] shrink_slab at ...
> >   #15 [...] drop_slab_node at ...
> >   #16 [...] drop_slab at ...
> >   #17 [...] drop_caches_sysctl_handler at ...
> >   #18 [...] proc_sys_call_handler at ...
> >   #19 [...] vfs_write at ...
> >   #20 [...] ksys_write at ...
> >   #21 [...] do_syscall_64 at ...
> >   #22 [...] entry_SYSCALL_64_after_hwframe at ...
> >
> > 7) All other mount/umount activity now hangs, unable to take
> >   shrinker_rwsem for write.
> >
> > [client]# mount -t tmpfs tmpfs /tmp
> >
> >   crash> bt ...
> >   PID: 5464   TASK: ...  CPU: 3   COMMAND: "mount"
> >    #0 [...] __schedule at ...
> >    #1 [...] schedule at ...
> >    #2 [...] rwsem_down_write_slowpath at ...
> >    #3 [...] prealloc_shrinker at ...
> >    #4 [...] alloc_super at ...
> >    #5 [...] sget at ...
> >    #6 [...] mount_nodev at ...
> >    #7 [...] legacy_get_tree at ...
> >    #8 [...] vfs_get_tree at ...
> >    #9 [...] do_mount at ...
> >   #10 [...] ksys_mount at ...
> >   #11 [...] __x64_sys_mount at ...
> >   #12 [...] do_syscall_64 at ...
> >   #13 [...] entry_SYSCALL_64_after_hwframe at ...
> >
>
>
> I don't think this patch solves the problem; it only fixes one minor symptom of it.
> The actual problem here is the reclaim hang in nfs.
> It means that any process, including kswapd, may go into nfs inode reclaim and get stuck there.
>
> Even mount() itself has GFP_KERNEL allocations in its path, so it might still get stuck there even with your patch.
>
> I think this should be handled at the nfs/vfs level by making inode eviction during reclaim more asynchronous.

Though I agree that we should be fixing shrinkers to not get stuck
(and be more async), I still think the problem this patch is solving
is worth fixing. On machines running multiple workloads, one job stuck
in a slab shrinker, blocking all other unrelated jobs that want
shrinker_rwsem, breaks the isolation and causes a DoS.

Shakeel


* Re: [PATCH] mm: fix hanging shrinker management on long do_shrink_slab
  2019-12-03  0:13   ` Shakeel Butt
@ 2019-12-03 11:03     ` Kirill Tkhai
  0 siblings, 0 replies; 6+ messages in thread
From: Kirill Tkhai @ 2019-12-03 11:03 UTC (permalink / raw)
  To: Shakeel Butt, Andrey Ryabinin
  Cc: Pavel Tikhomirov, Andrew Morton, LKML, Cgroups, Linux MM,
	Johannes Weiner, Michal Hocko, Vladimir Davydov, Roman Gushchin,
	Chris Down, Yang Shi, Tejun Heo, Thomas Gleixner,
	Kirill A . Shutemov, Konstantin Khorenko, Trond Myklebust,
	Anna Schumaker, J. Bruce Fields, Chuck Lever, linux-nfs,
	Alexander Viro, linux-fsdevel

On 03.12.2019 03:13, Shakeel Butt wrote:
> On Mon, Dec 2, 2019 at 8:37 AM Andrey Ryabinin <aryabinin@virtuozzo.com> wrote:
>>
>>
>> On 11/30/19 12:45 AM, Pavel Tikhomirov wrote:
>>> We have a problem where shrinker_rwsem can be held for read for a long
>>> time in shrink_slab, while at the same time any process trying to
>>> manage shrinkers hangs.
>>>
>>> The shrinker_rwsem is taken in shrink_slab while traversing shrinker_list.
>>> It tries to shrink something on nfs (hard-mounted), but the nfs server
>>> is already dead at that moment and the rpc will never succeed. Generally
>>> any shrinker can take significant time in do_shrink_slab, so it's a bad
>>> idea to hold the list lock here.
>>>
>>> We have a similar problem in shrink_slab_memcg, except that we are
>>> traversing shrinker_map+shrinker_idr there.
>>>
>>> The idea of the patch is to take a refcount on the chosen shrinker so
>>> it won't disappear, and to release shrinker_rwsem while we are in
>>> do_shrink_slab; after that we reacquire shrinker_rwsem, drop the
>>> refcount and continue the traversal.
>>>
>>> We also need a wait_queue so that unregister_shrinker can wait for the
>>> refcnt to become zero. Only after that can we safely remove the
>>> shrinker from the list and idr, and free the shrinker.
>>>
>>> I've reproduced the nfs hang in do_shrink_slab with the patch applied
>>> on a mainline (ms) kernel; all other mount/unmount operations complete
>>> without any hang.
>>>
>>> Here is a reproduction on kernel without patch:
>>>
>>> 1) Setup nfs on server node with some files in it (e.g. 200)
>>>
>>> [server]# cat /etc/exports
>>> /vz/nfs2 *(ro,no_root_squash,no_subtree_check,async)
>>>
>>> 2) Hard mount it on client node
>>>
>>> [client]# mount -ohard 10.94.3.40:/vz/nfs2 /mnt
>>>
>>> 3) Open some (e.g. 200) files on the mount
>>>
>>> [client]# for i in $(find /mnt/ -type f | head -n 200); \
>>>   do setsid sleep 1000 &>/dev/null <$i & done
>>>
>>> 4) Kill all openers
>>>
>>> [client]# killall sleep -9
>>>
>>> 5) Pull the network cable out on the client node
>>>
>>> 6) Drop caches on the client; it will hang on nfs while holding
>>>   the shrinker_rwsem lock for read
>>>
>>> [client]# echo 3 > /proc/sys/vm/drop_caches
>>>
>>>   crash> bt ...
>>>   PID: 18739  TASK: ...  CPU: 3   COMMAND: "bash"
>>>    #0 [...] __schedule at ...
>>>    #1 [...] schedule at ...
>>>    #2 [...] rpc_wait_bit_killable at ... [sunrpc]
>>>    #3 [...] __wait_on_bit at ...
>>>    #4 [...] out_of_line_wait_on_bit at ...
>>>    #5 [...] _nfs4_proc_delegreturn at ... [nfsv4]
>>>    #6 [...] nfs4_proc_delegreturn at ... [nfsv4]
>>>    #7 [...] nfs_do_return_delegation at ... [nfsv4]
>>>    #8 [...] nfs4_evict_inode at ... [nfsv4]
>>>    #9 [...] evict at ...
>>>   #10 [...] dispose_list at ...
>>>   #11 [...] prune_icache_sb at ...
>>>   #12 [...] super_cache_scan at ...
>>>   #13 [...] do_shrink_slab at ...
>>>   #14 [...] shrink_slab at ...
>>>   #15 [...] drop_slab_node at ...
>>>   #16 [...] drop_slab at ...
>>>   #17 [...] drop_caches_sysctl_handler at ...
>>>   #18 [...] proc_sys_call_handler at ...
>>>   #19 [...] vfs_write at ...
>>>   #20 [...] ksys_write at ...
>>>   #21 [...] do_syscall_64 at ...
>>>   #22 [...] entry_SYSCALL_64_after_hwframe at ...
>>>
>>> 7) All other mount/umount activity now hangs, unable to take
>>>   shrinker_rwsem for write.
>>>
>>> [client]# mount -t tmpfs tmpfs /tmp
>>>
>>>   crash> bt ...
>>>   PID: 5464   TASK: ...  CPU: 3   COMMAND: "mount"
>>>    #0 [...] __schedule at ...
>>>    #1 [...] schedule at ...
>>>    #2 [...] rwsem_down_write_slowpath at ...
>>>    #3 [...] prealloc_shrinker at ...
>>>    #4 [...] alloc_super at ...
>>>    #5 [...] sget at ...
>>>    #6 [...] mount_nodev at ...
>>>    #7 [...] legacy_get_tree at ...
>>>    #8 [...] vfs_get_tree at ...
>>>    #9 [...] do_mount at ...
>>>   #10 [...] ksys_mount at ...
>>>   #11 [...] __x64_sys_mount at ...
>>>   #12 [...] do_syscall_64 at ...
>>>   #13 [...] entry_SYSCALL_64_after_hwframe at ...
>>>
>>
>>
>> I don't think this patch solves the problem; it only fixes one minor symptom of it.
>> The actual problem here is the reclaim hang in nfs.
>> It means that any process, including kswapd, may go into nfs inode reclaim and get stuck there.
>>
>> Even mount() itself has GFP_KERNEL allocations in its path, so it might still get stuck there even with your patch.
>>
>> I think this should be handled at the nfs/vfs level by making inode eviction during reclaim more asynchronous.
> 
> Though I agree that we should be fixing shrinkers to not get stuck
> (and be more async), I still think the problem this patch is solving
> is worth fixing. On machines running multiple workloads, one job stuck
> in a slab shrinker, blocking all other unrelated jobs that want
> shrinker_rwsem, breaks the isolation and causes a DoS.

This keeps jobs isolated only until global reclaim occurs. In that case any
process (including kswapd) becomes stuck, and nothing will help. But (as was
mentioned on this list some time ago) many people configure their workloads
in a way that avoids global reclaim, and they may benefit from this.


* Re: [PATCH] mm: fix hanging shrinker management on long do_shrink_slab
  2019-12-02 16:36 ` [PATCH] mm: fix hanging shrinker management on long do_shrink_slab Andrey Ryabinin
  2019-12-03  0:13   ` Shakeel Butt
@ 2019-12-06  2:09   ` Dave Chinner
  2019-12-06 10:09     ` Michal Hocko
  2019-12-06 17:11     ` Shakeel Butt
  1 sibling, 2 replies; 6+ messages in thread
From: Dave Chinner @ 2019-12-06  2:09 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Pavel Tikhomirov, Andrew Morton, linux-kernel, cgroups, linux-mm,
	Johannes Weiner, Michal Hocko, Vladimir Davydov, Roman Gushchin,
	Shakeel Butt, Chris Down, Yang Shi, Tejun Heo, Thomas Gleixner,
	Kirill A . Shutemov, Konstantin Khorenko, Kirill Tkhai,
	Trond Myklebust, Anna Schumaker, J. Bruce Fields, Chuck Lever,
	linux-nfs, Alexander Viro, linux-fsdevel

[please cc me on future shrinker infrastructure modifications]

On Mon, Dec 02, 2019 at 07:36:03PM +0300, Andrey Ryabinin wrote:
> 
> On 11/30/19 12:45 AM, Pavel Tikhomirov wrote:
> > We have a problem where shrinker_rwsem can be held for read for a long
> > time in shrink_slab, while at the same time any process trying to
> > manage shrinkers hangs.
> > 
> > The shrinker_rwsem is taken in shrink_slab while traversing shrinker_list.
> > It tries to shrink something on nfs (hard-mounted), but the nfs server
> > is already dead at that moment and the rpc will never succeed. Generally
> > any shrinker can take significant time in do_shrink_slab, so it's a bad
> > idea to hold the list lock here.

registering/unregistering a shrinker is not a performance critical
task. If a shrinker is blocking for a long time, then we need to
work to fix the shrinker implementation because blocking is a much
bigger problem than just register/unregister.

> > The idea of the patch is to take a refcount on the chosen shrinker so
> > it won't disappear, and to release shrinker_rwsem while we are in
> > do_shrink_slab; after that we reacquire shrinker_rwsem, drop the
> > refcount and continue the traversal.

This is going to cause a *lot* of traffic on the shrinker rwsem.
It's already a pretty hot lock on large machines under memory
pressure (think thousands of tasks all doing direct reclaim across
hundreds of CPUs), and so changing them to cycle the rwsem on every
shrinker will only make this worse. Especially when we consider
that there may be hundreds to thousands of registered shrinker
instances on large machines.

As an example of how frequent cycling of a global lock in shrinker
instances causes issues, we used to take references to the superblock in
shrinker count invocations to guarantee its existence. This was found to
be a scalability limitation when lots of near-empty superblocks were
present in a system (see commit d23da150a37c ("fs/superblock: avoid
locking counting inodes and dentries before reclaiming them")).

This alleviated the problem for a while, but soon we had problems
with just taking a reference to the superblock in the callbacks that
did actual work. Hence we changed it to just take a per-superblock
rwsem to get rid of the global sb_lock spinlock in this path. See
commit eb6ef3df4faa ("trylock_super(): replacement for
grab_super_passive()". Now we don't have a scalability problem.

IOWs, we already know that cycling a global rwsem on every
individual shrinker invocation is going to cause noticeable
scalability problems. Hence I don't think that this sort of "cycle
the global rwsem faster to reduce [un]register latency" solution is
going to fly because of the runtime performance regressions it will
introduce....
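
For context, the current superblock scan path is roughly the following
(simplified from fs/super.c; details vary by kernel version): existence
is guaranteed by a per-sb trylock, and nothing global is cycled per
invocation:

static unsigned long super_cache_scan(struct shrinker *shrink,
				      struct shrink_control *sc)
{
	struct super_block *sb = container_of(shrink, struct super_block,
					      s_shrink);
	unsigned long freed = 0;

	/*
	 * trylock_super() only attempts down_read(&sb->s_umount) and checks
	 * that the superblock is still alive; it never blocks and it never
	 * takes a global lock.
	 */
	if (!trylock_super(sb))
		return SHRINK_STOP;

	/* ... prune dentries and inodes belonging to this sb only ... */

	up_read(&sb->s_umount);
	return freed;
}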

> I don't think this patch solves the problem; it only fixes one minor symptom of it.
> The actual problem here is the reclaim hang in nfs.

The nfs client is waiting on the NFS server to respond. It may
actually be that the server has hung, not the client...

> It means that any process, including kswapd, may go into nfs inode reclaim and get stuck there.

*nod*

> I think this should be handled at the nfs/vfs level by making inode eviction during reclaim more asynchronous.

That's what we are trying to do with similar blocking based issues
in XFS inode reclaim. It's not simple, though, because these days
memory reclaim is like a bowl full of spaghetti covered with a
delicious sauce of non-obvious heuristics and broken
functionality....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH] mm: fix hanging shrinker management on long do_shrink_slab
  2019-12-06  2:09   ` Dave Chinner
@ 2019-12-06 10:09     ` Michal Hocko
  2019-12-06 17:11     ` Shakeel Butt
  1 sibling, 0 replies; 6+ messages in thread
From: Michal Hocko @ 2019-12-06 10:09 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrey Ryabinin, Pavel Tikhomirov, Andrew Morton, linux-kernel,
	cgroups, linux-mm, Johannes Weiner, Vladimir Davydov,
	Roman Gushchin, Shakeel Butt, Chris Down, Yang Shi, Tejun Heo,
	Thomas Gleixner, Kirill A . Shutemov, Konstantin Khorenko,
	Kirill Tkhai, Trond Myklebust, Anna Schumaker, J. Bruce Fields,
	Chuck Lever, linux-nfs, Alexander Viro, linux-fsdevel

On Fri 06-12-19 13:09:53, Dave Chinner wrote:
> [please cc me on future shrinker infrastructure modifications]
> 
> On Mon, Dec 02, 2019 at 07:36:03PM +0300, Andrey Ryabinin wrote:
> > 
> > On 11/30/19 12:45 AM, Pavel Tikhomirov wrote:
> > > We have a problem where shrinker_rwsem can be held for read for a long
> > > time in shrink_slab, while at the same time any process trying to
> > > manage shrinkers hangs.
> > > 
> > > The shrinker_rwsem is taken in shrink_slab while traversing shrinker_list.
> > > It tries to shrink something on nfs (hard-mounted), but the nfs server
> > > is already dead at that moment and the rpc will never succeed. Generally
> > > any shrinker can take significant time in do_shrink_slab, so it's a bad
> > > idea to hold the list lock here.
> 
> registering/unregistering a shrinker is not a performance critical
> task.

We have had a bug report from a production system which stumbled over a
long [un]mount which was stuck on the shrinker [un]register path waiting
for the lock. This wouldn't be a big deal most of the time, but the
[un]mount was part of the login session in this case. This was on an older
distribution kernel, and e496612c5130 ("mm: do not stall register_shrinker()")
helped a bit, but not really for shrinkers that take a lot of time.

> If a shrinker is blocking for a long time, then we need to
> work to fix the shrinker implementation because blocking is a much
> bigger problem than just register/unregister.

Absolutely agreed here.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm: fix hanging shrinker management on long do_shrink_slab
  2019-12-06  2:09   ` Dave Chinner
  2019-12-06 10:09     ` Michal Hocko
@ 2019-12-06 17:11     ` Shakeel Butt
  1 sibling, 0 replies; 6+ messages in thread
From: Shakeel Butt @ 2019-12-06 17:11 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrey Ryabinin, Pavel Tikhomirov, Andrew Morton, LKML, Cgroups,
	Linux MM, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Roman Gushchin, Chris Down, Yang Shi, Tejun Heo, Thomas Gleixner,
	Kirill A . Shutemov, Konstantin Khorenko, Kirill Tkhai,
	Trond Myklebust, Anna Schumaker, J. Bruce Fields, Chuck Lever,
	linux-nfs, Alexander Viro, linux-fsdevel

On Thu, Dec 5, 2019 at 6:10 PM Dave Chinner <david@fromorbit.com> wrote:
>
> [please cc me on future shrinker infrastructure modifications]
>
> On Mon, Dec 02, 2019 at 07:36:03PM +0300, Andrey Ryabinin wrote:
> >
> > On 11/30/19 12:45 AM, Pavel Tikhomirov wrote:
> > > We have a problem where shrinker_rwsem can be held for read for a long
> > > time in shrink_slab, while at the same time any process trying to
> > > manage shrinkers hangs.
> > >
> > > The shrinker_rwsem is taken in shrink_slab while traversing shrinker_list.
> > > It tries to shrink something on nfs (hard-mounted), but the nfs server
> > > is already dead at that moment and the rpc will never succeed. Generally
> > > any shrinker can take significant time in do_shrink_slab, so it's a bad
> > > idea to hold the list lock here.
>
> registering/unregistering a shrinker is not a performance critical
> task.

Yes, not performance critical but it can cause isolation issues.

> If a shrinker is blocking for a long time, then we need to
> work to fix the shrinker implementation because blocking is a much
> bigger problem than just register/unregister.
>

Yes, we should be fixing the implementations of all shrinkers, and yes,
that is the bigger issue, but we can also fix the register/unregister
isolation issue in parallel. Fixing all shrinkers would be a tedious and
long task, and we should not block fixing the isolation issue on it.

> > > The idea of the patch is to take a refcount on the chosen shrinker so
> > > it won't disappear, and to release shrinker_rwsem while we are in
> > > do_shrink_slab; after that we reacquire shrinker_rwsem, drop the
> > > refcount and continue the traversal.
>
> This is going to cause a *lot* of traffic on the shrinker rwsem.
> It's already a pretty hot lock on large machines under memory
> pressure (think thousands of tasks all doing direct reclaim across
> hundreds of CPUs), and so changing them to cycle the rwsem on every
> shrinker will only make this worse. Especially when we consider
> that there may be hundreds to thousands of registered shrinker
> instances on large machines.
>
> As an example of how frequent cycling of a global lock in shrinker
> instances causes issues, we used to take references to the superblock in
> shrinker count invocations to guarantee its existence. This was found to
> be a scalability limitation when lots of near-empty superblocks were
> present in a system (see commit d23da150a37c ("fs/superblock: avoid
> locking counting inodes and dentries before reclaiming them")).
>
> This alleviated the problem for a while, but soon we had problems
> with just taking a reference to the superblock in the callbacks that
> did actual work. Hence we changed it to just take a per-superblock
> rwsem to get rid of the global sb_lock spinlock in this path. See
> commit eb6ef3df4faa ("trylock_super(): replacement for
> grab_super_passive()"). Now we don't have a scalability problem.
>
> IOWs, we already know that cycling a global rwsem on every
> individual shrinker invocation is going to cause noticeable
> scalability problems. Hence I don't think that this sort of "cycle
> the global rwsem faster to reduce [un]register latency" solution is
> going to fly because of the runtime performance regressions it will
> introduce....
>

I agree with your scalability concern (though others would argue to
first demonstrate the issue before adding more sophisticated scalable
code). Most memory reclaim code is written without performance or
scalability concerns in mind; maybe we should switch our thinking.

> > I don't think this patch solves the problem; it only fixes one minor symptom of it.
> > The actual problem here is the reclaim hang in nfs.
>
> The nfs client is waiting on the NFS server to respond. It may
> actually be that the server has hung, not the client...
>
> > It means that any process, including kswapd, may go into nfs inode reclaim and get stuck there.
>
> *nod*
>
> > I think this should be handled at the nfs/vfs level by making inode eviction during reclaim more asynchronous.
>
> That's what we are trying to do with similar blocking based issues
> in XFS inode reclaim. It's not simple, though, because these days
> memory reclaim is like a bowl full of spaghetti covered with a
> delicious sauce of non-obvious heuristics and broken
> functionality....
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

