All of lore.kernel.org
 help / color / mirror / Atom feed
* RCU stall in xfs_reclaim_inodes_ag
@ 2020-11-16 17:45 Alex Lyakas
  2020-11-16 21:30 ` Dave Chinner
  0 siblings, 1 reply; 5+ messages in thread
From: Alex Lyakas @ 2020-11-16 17:45 UTC (permalink / raw)
  To: linux-xfs

Greetings XFS community,

We had an RCU stall [1]. According to the code, it happened in 
radix_tree_gang_lookup_tag():

rcu_read_lock();
nr_found = radix_tree_gang_lookup_tag(
        &pag->pag_ici_root,
        (void **)batch, first_index,
        XFS_LOOKUP_BATCH,
        XFS_ICI_RECLAIM_TAG);


This XFS system has over 100M files. So perhaps looping inside the radix 
tree took too long, and it was happening in RCU read-side critical seciton. 
This is one of the possible causes for RCU stall.

This happened in kernel 4.14.99, but looking at latest mainline code, code 
is still the same.

Can anyone please advise how to address that? It is not possible to put 
cond_resched() inside the radix tree code, because it can be used with 
spinlocks, and perhaps other contexts where sleeping is not allowed.

Thanks,
Alex.


[1]
Nov 16 12:33:15.314001 vsa-0000002a-vc-1 kernel: [78961.809221] INFO: 
rcu_preempt detected stalls on CPUs/tasks:
Nov 16 12:33:15.314031 vsa-0000002a-vc-1 kernel: [78961.810785]     Tasks 
blocked on level-0 rcu_node (CPUs 0-4): P8854
Nov 16 12:33:15.314033 vsa-0000002a-vc-1 kernel: [78961.812141] 
(detected by 3, t=60004 jiffies, g=18947673, c=18947672, q=110854)
Nov 16 12:33:15.314036 vsa-0000002a-vc-1 kernel: [78961.814050] python 
R  running task        0  8854   8792 0x00000000
Nov 16 12:33:15.314038 vsa-0000002a-vc-1 kernel: [78961.814054] Call Trace:
Nov 16 12:33:15.314040 vsa-0000002a-vc-1 kernel: [78961.814067]  ? 
__schedule+0x290/0x8a0
Nov 16 12:33:15.314041 vsa-0000002a-vc-1 kernel: [78961.814070] 
preempt_schedule_irq+0x37/0x60
Nov 16 12:33:15.314043 vsa-0000002a-vc-1 kernel: [78961.814074] 
retint_kernel+0x1b/0x1d
Nov 16 12:33:15.314044 vsa-0000002a-vc-1 kernel: [78961.814079] RIP: 
0010:radix_tree_next_chunk+0x111/0x320
Nov 16 12:33:15.314045 vsa-0000002a-vc-1 kernel: [78961.814080] RSP: 
0018:ffffb31f5fbab7d0 EFLAGS: 00000287 ORIG_RAX: ffffffffffffff10
Nov 16 12:33:15.314047 vsa-0000002a-vc-1 kernel: [78961.814082] RAX: 
0000000000000033 RBX: 0000000000000010 RCX: 000000000000000c
Nov 16 12:33:15.314049 vsa-0000002a-vc-1 kernel: [78961.814083] RDX: 
0000000000000000 RSI: ffffb31f5fbab818 RDI: 00000000000b3000
Nov 16 12:33:15.314051 vsa-0000002a-vc-1 kernel: [78961.814084] RBP: 
ffffa07dd6d80b48 R08: ffffa08079fe56d0 R09: ffffa08079fe6481
Nov 16 12:33:15.314052 vsa-0000002a-vc-1 kernel: [78961.814085] R10: 
000000000000000c R11: ffffa08079fe5890 R12: 000000000000000c
Nov 16 12:33:15.314054 vsa-0000002a-vc-1 kernel: [78961.814086] R13: 
0000000000000040 R14: 0000000000000228 R15: 0000000000033000
Nov 16 12:33:15.314055 vsa-0000002a-vc-1 kernel: [78961.814090] 
radix_tree_gang_lookup_tag+0xd6/0x160
Nov 16 12:33:15.314056 vsa-0000002a-vc-1 kernel: [78961.814152] 
xfs_reclaim_inodes_ag+0xd5/0x310 [xfs]
Nov 16 12:33:15.314058 vsa-0000002a-vc-1 kernel: [78961.814199] 
xfs_reclaim_inodes_nr+0x3c/0x50 [xfs]
Nov 16 12:33:15.314059 vsa-0000002a-vc-1 kernel: [78961.814204] 
super_cache_scan+0x156/0x1a0
Nov 16 12:33:15.314061 vsa-0000002a-vc-1 kernel: [78961.814210] 
shrink_slab.part.51+0x1e8/0x400
Nov 16 12:33:15.314062 vsa-0000002a-vc-1 kernel: [78961.814213] 
shrink_node+0x123/0x310
Nov 16 12:33:15.314063 vsa-0000002a-vc-1 kernel: [78961.814216] 
do_try_to_free_pages+0xc3/0x330
Nov 16 12:33:15.314064 vsa-0000002a-vc-1 kernel: [78961.814218] 
try_to_free_pages+0xf4/0x1f0
Nov 16 12:33:15.314065 vsa-0000002a-vc-1 kernel: [78961.814224] 
__alloc_pages_slowpath+0x3a5/0xd50
Nov 16 12:33:15.314067 vsa-0000002a-vc-1 kernel: [78961.814228] 
__alloc_pages_nodemask+0x2a7/0x2d0
Nov 16 12:33:15.314068 vsa-0000002a-vc-1 kernel: [78961.814232] 
alloc_pages_vma+0x7c/0x1e0
Nov 16 12:33:15.314070 vsa-0000002a-vc-1 kernel: [78961.814236] 
__handle_mm_fault+0x90c/0x12f0
Nov 16 12:33:15.314071 vsa-0000002a-vc-1 kernel: [78961.814239] 
handle_mm_fault+0xb1/0x1e0
Nov 16 12:33:15.314072 vsa-0000002a-vc-1 kernel: [78961.814244] 
__do_page_fault+0x278/0x530
Nov 16 12:33:15.314073 vsa-0000002a-vc-1 kernel: [78961.814247]  ? 
async_page_fault+0x2f/0x50
Nov 16 12:33:15.314074 vsa-0000002a-vc-1 kernel: [78961.814248] 
async_page_fault+0x45/0x50


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: RCU stall in xfs_reclaim_inodes_ag
  2020-11-16 17:45 RCU stall in xfs_reclaim_inodes_ag Alex Lyakas
@ 2020-11-16 21:30 ` Dave Chinner
  2020-12-07 10:18   ` Alex Lyakas
  0 siblings, 1 reply; 5+ messages in thread
From: Dave Chinner @ 2020-11-16 21:30 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: linux-xfs

On Mon, Nov 16, 2020 at 07:45:46PM +0200, Alex Lyakas wrote:
> Greetings XFS community,
> 
> We had an RCU stall [1]. According to the code, it happened in
> radix_tree_gang_lookup_tag():
> 
> rcu_read_lock();
> nr_found = radix_tree_gang_lookup_tag(
>        &pag->pag_ici_root,
>        (void **)batch, first_index,
>        XFS_LOOKUP_BATCH,
>        XFS_ICI_RECLAIM_TAG);
> 
> 
> This XFS system has over 100M files. So perhaps looping inside the radix
> tree took too long, and it was happening in RCU read-side critical seciton.
> This is one of the possible causes for RCU stall.

Doubt it. According to the trace it was stalled for 60s, and a
radix tree walk of 100M entries only takes a second or two.

Further, unless you are using inode32, the inodes will be spread
across multiple radix trees and that makes the radix trees much
smaller and even less likely to take this long to run a traversal.

This could be made a little more efficient by adding a "last index"
parameter to tell the search where to stop (i.e. if the batch count
has not yet been reached), but in general that makes little
difference to the search because the radix tree walk finds the next
inodes in a few pointer chases...

> This happened in kernel 4.14.99, but looking at latest mainline code, code
> is still the same.

These inode radix trees have been used in XFS since 2008, and this
is the first time anyone has reported a stall like this, so I'm
doubtful that there is actually a general bug. My suspicion for such
a rare occurrence would be memory corruption of some kind or a
leaked atomic/rcu state in some other code on that CPU....

> Can anyone please advise how to address that? It is not possible to put
> cond_resched() inside the radix tree code, because it can be used with
> spinlocks, and perhaps other contexts where sleeping is not allowed.

I don't think there is a solution to this problem - it just
shouldn't happen in when everything is operating normally as it's
just a tag search on an indexed tree.

Hence even if there was a hack to stop stall warnings, it won't fix
whatever problem is leading to the rcu stall. The system will then
just spin burning CPU, and eventually something else will fail.

IOWs, unless you can reproduce this stall and find out what is wrong
in the radix tree that is leading to it looping forever, there's
likely nothing we can do to avoid this.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: RCU stall in xfs_reclaim_inodes_ag
  2020-11-16 21:30 ` Dave Chinner
@ 2020-12-07 10:18   ` Alex Lyakas
  2020-12-07 14:15     ` Brian Foster
  2020-12-08  8:07     ` Dave Chinner
  0 siblings, 2 replies; 5+ messages in thread
From: Alex Lyakas @ 2020-12-07 10:18 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

Hi Dave,

Thank you for your response.

We did some more investigations on the issue, and we have the following 
findings:

1) We tracked the max amount of inodes per AG radix tree. We found in our 
tests, that the max amount of inodes per AG radix tree was about 1.5M:
[xfs_reclaim_inodes_ag:1285] XFS(dm-79): AG[1368]: count=1384662 
reclaimable=58
[xfs_reclaim_inodes_ag:1285] XFS(dm-79): AG[1368]: count=1384630 
reclaimable=46
[xfs_reclaim_inodes_ag:1285] XFS(dm-79): AG[1368]: count=1384600 
reclaimable=16
[xfs_reclaim_inodes_ag:1285] XFS(dm-79): AG[1370]: count=1594500 
reclaimable=75
[xfs_reclaim_inodes_ag:1285] XFS(dm-79): AG[1370]: count=1594468 
reclaimable=55
[xfs_reclaim_inodes_ag:1285] XFS(dm-79): AG[1370]: count=1594436 
reclaimable=46
[xfs_reclaim_inodes_ag:1285] XFS(dm-79): AG[1370]: count=1594421 
reclaimable=42
(but the amount of reclaimable inodes is very small, as you can see).

Do you think this number is reasonable per radix tree?

2) This particular XFS instance is total of 500TB. However, the AG size in 
this case is 100GB. This is the AG size that we use, due to issues that we 
reported in https://www.spinics.net/lists/linux-xfs/msg06501.html,
where the "near" allocation algorithm was stuck for a long time scanning the 
free-space btrees. With smaller AG size, we don't see such issues.
But with 500TB filesystem, we now have 5000 AGs. As a result, we suspect 
(due to some instrumentation), that the looping over 5000 AGs in 
xfs_reclaim_inodes_ag() is what is causing the RCU stall for us. Although 
the code has cond_resched() call, but somehow the RCU stall still happens, 
and it always happens in this function, while searching the radix tree.

Thanks,
Alex.



-----Original Message----- 
From: Dave Chinner
Sent: Monday, November 16, 2020 11:30 PM
To: Alex Lyakas
Cc: linux-xfs@vger.kernel.org
Subject: Re: RCU stall in xfs_reclaim_inodes_ag

On Mon, Nov 16, 2020 at 07:45:46PM +0200, Alex Lyakas wrote:
> Greetings XFS community,
>
> We had an RCU stall [1]. According to the code, it happened in
> radix_tree_gang_lookup_tag():
>
> rcu_read_lock();
> nr_found = radix_tree_gang_lookup_tag(
>        &pag->pag_ici_root,
>        (void **)batch, first_index,
>        XFS_LOOKUP_BATCH,
>        XFS_ICI_RECLAIM_TAG);
>
>
> This XFS system has over 100M files. So perhaps looping inside the radix
> tree took too long, and it was happening in RCU read-side critical 
> seciton.
> This is one of the possible causes for RCU stall.

Doubt it. According to the trace it was stalled for 60s, and a
radix tree walk of 100M entries only takes a second or two.

Further, unless you are using inode32, the inodes will be spread
across multiple radix trees and that makes the radix trees much
smaller and even less likely to take this long to run a traversal.

This could be made a little more efficient by adding a "last index"
parameter to tell the search where to stop (i.e. if the batch count
has not yet been reached), but in general that makes little
difference to the search because the radix tree walk finds the next
inodes in a few pointer chases...

> This happened in kernel 4.14.99, but looking at latest mainline code, code
> is still the same.

These inode radix trees have been used in XFS since 2008, and this
is the first time anyone has reported a stall like this, so I'm
doubtful that there is actually a general bug. My suspicion for such
a rare occurrence would be memory corruption of some kind or a
leaked atomic/rcu state in some other code on that CPU....

> Can anyone please advise how to address that? It is not possible to put
> cond_resched() inside the radix tree code, because it can be used with
> spinlocks, and perhaps other contexts where sleeping is not allowed.

I don't think there is a solution to this problem - it just
shouldn't happen in when everything is operating normally as it's
just a tag search on an indexed tree.

Hence even if there was a hack to stop stall warnings, it won't fix
whatever problem is leading to the rcu stall. The system will then
just spin burning CPU, and eventually something else will fail.

IOWs, unless you can reproduce this stall and find out what is wrong
in the radix tree that is leading to it looping forever, there's
likely nothing we can do to avoid this.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com 


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: RCU stall in xfs_reclaim_inodes_ag
  2020-12-07 10:18   ` Alex Lyakas
@ 2020-12-07 14:15     ` Brian Foster
  2020-12-08  8:07     ` Dave Chinner
  1 sibling, 0 replies; 5+ messages in thread
From: Brian Foster @ 2020-12-07 14:15 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: Dave Chinner, linux-xfs

On Mon, Dec 07, 2020 at 12:18:13PM +0200, Alex Lyakas wrote:
> Hi Dave,
> 
> Thank you for your response.
> 
> We did some more investigations on the issue, and we have the following
> findings:
> 
> 1) We tracked the max amount of inodes per AG radix tree. We found in our
> tests, that the max amount of inodes per AG radix tree was about 1.5M:
> [xfs_reclaim_inodes_ag:1285] XFS(dm-79): AG[1368]: count=1384662
> reclaimable=58
> [xfs_reclaim_inodes_ag:1285] XFS(dm-79): AG[1368]: count=1384630
> reclaimable=46
> [xfs_reclaim_inodes_ag:1285] XFS(dm-79): AG[1368]: count=1384600
> reclaimable=16
> [xfs_reclaim_inodes_ag:1285] XFS(dm-79): AG[1370]: count=1594500
> reclaimable=75
> [xfs_reclaim_inodes_ag:1285] XFS(dm-79): AG[1370]: count=1594468
> reclaimable=55
> [xfs_reclaim_inodes_ag:1285] XFS(dm-79): AG[1370]: count=1594436
> reclaimable=46
> [xfs_reclaim_inodes_ag:1285] XFS(dm-79): AG[1370]: count=1594421
> reclaimable=42
> (but the amount of reclaimable inodes is very small, as you can see).
> 
> Do you think this number is reasonable per radix tree?
> 
> 2) This particular XFS instance is total of 500TB. However, the AG size in
> this case is 100GB. This is the AG size that we use, due to issues that we
> reported in https://www.spinics.net/lists/linux-xfs/msg06501.html,
> where the "near" allocation algorithm was stuck for a long time scanning the
> free-space btrees. With smaller AG size, we don't see such issues.
> But with 500TB filesystem, we now have 5000 AGs. As a result, we suspect
> (due to some instrumentation), that the looping over 5000 AGs in
> xfs_reclaim_inodes_ag() is what is causing the RCU stall for us. Although
> the code has cond_resched() call, but somehow the RCU stall still happens,
> and it always happens in this function, while searching the radix tree.
> 

Just a quick bit on the near more allocation algorithm.. the report at
the provided link refers to a v3.18 kernel, which apparently was old
even at the time of the report. Regardless, the allocation algorithm was
reworked in commit dc8e69bd7218 ("xfs: optimize near mode bnobt scans
with concurrent cntbt lookups") to address this kind of problem. That
patch landed in v5.5. It might be worth trying that to see if it allows
you to unroll the AG size mitigation (which seems to have just
transferred the original inefficiency from the allocation code to
management of excessive AG counts)...

Brian

> Thanks,
> Alex.
> 
> 
> 
> -----Original Message----- From: Dave Chinner
> Sent: Monday, November 16, 2020 11:30 PM
> To: Alex Lyakas
> Cc: linux-xfs@vger.kernel.org
> Subject: Re: RCU stall in xfs_reclaim_inodes_ag
> 
> On Mon, Nov 16, 2020 at 07:45:46PM +0200, Alex Lyakas wrote:
> > Greetings XFS community,
> > 
> > We had an RCU stall [1]. According to the code, it happened in
> > radix_tree_gang_lookup_tag():
> > 
> > rcu_read_lock();
> > nr_found = radix_tree_gang_lookup_tag(
> >        &pag->pag_ici_root,
> >        (void **)batch, first_index,
> >        XFS_LOOKUP_BATCH,
> >        XFS_ICI_RECLAIM_TAG);
> > 
> > 
> > This XFS system has over 100M files. So perhaps looping inside the radix
> > tree took too long, and it was happening in RCU read-side critical
> > seciton.
> > This is one of the possible causes for RCU stall.
> 
> Doubt it. According to the trace it was stalled for 60s, and a
> radix tree walk of 100M entries only takes a second or two.
> 
> Further, unless you are using inode32, the inodes will be spread
> across multiple radix trees and that makes the radix trees much
> smaller and even less likely to take this long to run a traversal.
> 
> This could be made a little more efficient by adding a "last index"
> parameter to tell the search where to stop (i.e. if the batch count
> has not yet been reached), but in general that makes little
> difference to the search because the radix tree walk finds the next
> inodes in a few pointer chases...
> 
> > This happened in kernel 4.14.99, but looking at latest mainline code, code
> > is still the same.
> 
> These inode radix trees have been used in XFS since 2008, and this
> is the first time anyone has reported a stall like this, so I'm
> doubtful that there is actually a general bug. My suspicion for such
> a rare occurrence would be memory corruption of some kind or a
> leaked atomic/rcu state in some other code on that CPU....
> 
> > Can anyone please advise how to address that? It is not possible to put
> > cond_resched() inside the radix tree code, because it can be used with
> > spinlocks, and perhaps other contexts where sleeping is not allowed.
> 
> I don't think there is a solution to this problem - it just
> shouldn't happen in when everything is operating normally as it's
> just a tag search on an indexed tree.
> 
> Hence even if there was a hack to stop stall warnings, it won't fix
> whatever problem is leading to the rcu stall. The system will then
> just spin burning CPU, and eventually something else will fail.
> 
> IOWs, unless you can reproduce this stall and find out what is wrong
> in the radix tree that is leading to it looping forever, there's
> likely nothing we can do to avoid this.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: RCU stall in xfs_reclaim_inodes_ag
  2020-12-07 10:18   ` Alex Lyakas
  2020-12-07 14:15     ` Brian Foster
@ 2020-12-08  8:07     ` Dave Chinner
  1 sibling, 0 replies; 5+ messages in thread
From: Dave Chinner @ 2020-12-08  8:07 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: linux-xfs

On Mon, Dec 07, 2020 at 12:18:13PM +0200, Alex Lyakas wrote:
> Hi Dave,
> 
> Thank you for your response.
> 
> We did some more investigations on the issue, and we have the following
> findings:
> 
> 1) We tracked the max amount of inodes per AG radix tree. We found in our
> tests, that the max amount of inodes per AG radix tree was about 1.5M:
> [xfs_reclaim_inodes_ag:1285] XFS(dm-79): AG[1368]: count=1384662
> reclaimable=58
> [xfs_reclaim_inodes_ag:1285] XFS(dm-79): AG[1368]: count=1384630
> reclaimable=46
> [xfs_reclaim_inodes_ag:1285] XFS(dm-79): AG[1368]: count=1384600
> reclaimable=16
> [xfs_reclaim_inodes_ag:1285] XFS(dm-79): AG[1370]: count=1594500
> reclaimable=75
> [xfs_reclaim_inodes_ag:1285] XFS(dm-79): AG[1370]: count=1594468
> reclaimable=55
> [xfs_reclaim_inodes_ag:1285] XFS(dm-79): AG[1370]: count=1594436
> reclaimable=46
> [xfs_reclaim_inodes_ag:1285] XFS(dm-79): AG[1370]: count=1594421
> reclaimable=42
> (but the amount of reclaimable inodes is very small, as you can see).
> 
> Do you think this number is reasonable per radix tree?

That's fine. I regularly run tests that push 10M+ inodes into a
single radix tree, and that generally doesn't even show up on the
profiles....

> 2) This particular XFS instance is total of 500TB. However, the AG size in
> this case is 100GB.

Ok. I run scalability tests on 500TB filesystems with 500AGs that
hammer inode reclaim, but it should be trivial to run them with
5000AGs. Hell, let's just run 50,000AGs to see if there's actually
an iteration problem in the shrinker...

(we really need async IO in mkfs for doing things like this!)

Yup, that hurts a bit on 5.10-rc7, but not significantly. Profile
with 16 CPUs turning over 250,000 inodes/s through the cache:

       - 3.17% xfs_fs_free_cached_objects                                                                                                         ▒
	  - xfs_reclaim_inodes_nr                                                                                                                 ▒
	     - 3.09% xfs_reclaim_inodes_ag                                                                                                        ▒
		- 0.91% _raw_spin_lock                                                                                                            ▒
		     0.87% do_raw_spin_lock                                                                                                       ▒
		- 0.71% _raw_spin_unlock                                                                                                          ▒
		   - 0.67% do_raw_spin_unlock                                                                                                     ▒
			__raw_callee_save___pv_queued_spin_unlock

Which indicate spinlocks are the largest CPU user in that path.
That's likley the radix tree spin locks when removing inodes from
the AG because the upstream code now allows multiple reclaimers to
operate on the same AG.

But even with that, there isn't any sign of holdoff latencies,
scanning delays, etc occurring inside the RCU critical section.
IOWs, bumping up the number of AGs massively shouldn't impact the
RCU code here as the RCU crictical region is inside the loop over
the AGs, not spanning the loop.

I don't know how old your kernel is, but maybe something is getting
stuck on a spinlock (per-inode or per-ag) inside the RCU section?
i.e. maybe you see a RCU stall because the code has livelocked or
has severe contention on a per-ag or inode spinlock inside the RCU
section?

I suspect you are going to need to profile the code when it is
running to some idea of what it is actually doing when the stalls
occur...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2020-12-08  8:08 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-16 17:45 RCU stall in xfs_reclaim_inodes_ag Alex Lyakas
2020-11-16 21:30 ` Dave Chinner
2020-12-07 10:18   ` Alex Lyakas
2020-12-07 14:15     ` Brian Foster
2020-12-08  8:07     ` Dave Chinner

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.