From: Dave Chinner <david@fromorbit.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: Zhaoyang Huang <huangzhaoyang@gmail.com>,
	"zhaoyang.huang" <zhaoyang.huang@unisoc.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	ke.wang@unisoc.com, steve.kang@unisoc.com,
	baocong.liu@unisoc.com, linux-fsdevel@vger.kernel.org
Subject: Re: [RFC PATCH] mm: move xa forward when run across zombie page
Date: Thu, 20 Oct 2022 10:42:56 +1100	[thread overview]
Message-ID: <20221019234256.GR2703033@dread.disaster.area> (raw)
In-Reply-To: <20221019224616.GP2703033@dread.disaster.area>

On Thu, Oct 20, 2022 at 09:46:16AM +1100, Dave Chinner wrote:
> On Thu, Oct 20, 2022 at 09:04:24AM +1100, Dave Chinner wrote:
> > On Wed, Oct 19, 2022 at 04:23:10PM +0100, Matthew Wilcox wrote:
> > > On Wed, Oct 19, 2022 at 09:30:42AM +1100, Dave Chinner wrote:
> > > > This is reading and writing the same amount of file data at the
> > > > application level, but once the data has been written and kicked out
> > > > of the page cache it seems to require an awful lot more read IO to
> > > > get it back to the application. i.e. this looks like mmap() is
> > > > readahead thrashing severely, and eventually it livelocks with this
> > > > sort of report:
> > > > 
> > > > [175901.982484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > > > [175901.985095] rcu:    Tasks blocked on level-1 rcu_node (CPUs 0-15): P25728
> > > > [175901.987996]         (detected by 0, t=97399871 jiffies, g=15891025, q=1972622 ncpus=32)
> > > > [175901.991698] task:test_write      state:R  running task     stack:12784 pid:25728 ppid: 25696 flags:0x00004002
> > > > [175901.995614] Call Trace:
> > > > [175901.996090]  <TASK>
> > > > [175901.996594]  ? __schedule+0x301/0xa30
> > > > [175901.997411]  ? sysvec_apic_timer_interrupt+0xb/0x90
> > > > [175901.998513]  ? sysvec_apic_timer_interrupt+0xb/0x90
> > > > [175901.999578]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> > > > [175902.000714]  ? xas_start+0x53/0xc0
> > > > [175902.001484]  ? xas_load+0x24/0xa0
> > > > [175902.002208]  ? xas_load+0x5/0xa0
> > > > [175902.002878]  ? __filemap_get_folio+0x87/0x340
> > > > [175902.003823]  ? filemap_fault+0x139/0x8d0
> > > > [175902.004693]  ? __do_fault+0x31/0x1d0
> > > > [175902.005372]  ? __handle_mm_fault+0xda9/0x17d0
> > > > [175902.006213]  ? handle_mm_fault+0xd0/0x2a0
> > > > [175902.006998]  ? exc_page_fault+0x1d9/0x810
> > > > [175902.007789]  ? asm_exc_page_fault+0x22/0x30
> > > > [175902.008613]  </TASK>
> > > > 
> > > > Given that filemap_fault on XFS is probably trying to map large
> > > > folios, I do wonder if this is a result of some kind of race with
> > > > teardown of a large folio...
> > > 
> > > It doesn't matter whether we're trying to map a large folio; it
> > > matters whether a large folio was previously created in the cache.
> > > Through the magic of readahead, it may well have been.  I suspect
> > > it's not teardown of a large folio, but splitting.  Removing a
> > > page from the page cache stores to the pointer in the XArray
> > > first (either NULL or a shadow entry), then decrements the refcount.
> > > 
> > > We must be observing a frozen folio.  There are a number of places
> > > in the MM which freeze a folio, but the obvious one is splitting.
> > > That looks like this:
> > > 
> > >         local_irq_disable();
> > >         if (mapping) {
> > >                 xas_lock(&xas);
> > > (...)
> > >         if (folio_ref_freeze(folio, 1 + extra_pins)) {
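
For reference, the ordering Matthew describes (store to the XArray
slot first, then drop the reference) looks roughly like this sketch of
the removal path, simplified from memory of the 6.1-era page cache
code rather than the exact upstream source:

	/* page_cache_delete(), called with the xa_lock held: */
	xas_store(&xas, shadow);	/* slot now NULL or a shadow entry */
	folio->mapping = NULL;
	...
	/* after the xa_lock has been dropped: */
	folio_put(folio);		/* the refcount drops only now */

A racing RCU lookup therefore sees either the old folio, which it can
still pin, or the shadow/NULL entry. A frozen folio is the odd case
out: it stays visible in the tree with a zero refcount until the
freezer thaws it.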
> > 
> > But the lookup is not doing anything to prevent the split on the
> > frozen page from making progress, right? It's not holding any folio
> > references, and it's not holding the mapping tree lock, either. So
> > how does the lookup in progress prevent the page split from making
> > progress?
> > 
> > 
> > > So one way to solve this might be to try to take the xa_lock on
> > > failure to get the refcount.  Otherwise a high-priority task
> > > might spin forever without a low-priority task getting the chance
> > > to finish the work being done while the folio is frozen.
> > 
> > IIUC, then you are saying that there is a scheduling priority
> > inversion because the lookup failure looping path doesn't yield the
> > CPU?
> > 
> > If so, how does taking the mapping tree spin lock on failure cause
> > the looping task to yield the CPU and hence allow the folio split to
> > make progress?
> > 
> > Also, AFAICT, the page split has disabled local interrupts, so it
> > should effectively be running with preemption disabled as it has
> > turned off the mechanism the scheduler can use to preempt it. The
> > page split can't sleep, either, because it holds the mapping tree
> > lock. Hence I can't see how a split-in-progress can be preempted in
> > the first place to cause a priority inversion livelock like this...
> > 
> > > ie this:
> > > 
> > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > index 08341616ae7a..ca0eed80580f 100644
> > > --- a/mm/filemap.c
> > > +++ b/mm/filemap.c
> > > @@ -1860,8 +1860,13 @@ static void *mapping_get_entry(struct address_space *mapping, pgoff_t index)
> > >  	if (!folio || xa_is_value(folio))
> > >  		goto out;
> > >  
> > > -	if (!folio_try_get_rcu(folio))
> > > +	if (!folio_try_get_rcu(folio)) {
> > > +		unsigned long flags;
> > > +
> > > +		xas_lock_irqsave(&xas, flags);
> > > +		xas_unlock_irqrestore(&xas, flags);
> > >  		goto repeat;
> > > +	}
> 
> As I suspected, this change did not prevent the livelock. It
> reproduced after just 650 test iterations (a few minutes) with this
> change in place.
> 
> > I would have thought:
> > 
> > 	if (!folio_try_get_rcu(folio)) {
> > 		rcu_read_unlock();
> > 		cond_resched();
> > 		rcu_read_lock();
> > 		goto repeat;
> > 	}
> > 
> > Would be the right way to yield the CPU to avoid priority
> > inversion related livelocks here...
> 
> I'm now trying this just to provide a data point that actually
> yielding the CPU avoids the downstream effects of the livelock (i.e.
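
For the record, the loop under test looks like this: the snippet above
dropped into mapping_get_entry(), with the surrounding lines
reconstructed from memory of 6.1-rc1 mm/filemap.c, so treat everything
outside the quoted hunk as approximate.

	rcu_read_lock();
repeat:
	xas_reset(&xas);
	folio = xas_load(&xas);
	if (xas_retry(&xas, folio))
		goto repeat;
	if (!folio || xa_is_value(folio))
		goto out;

	if (!folio_try_get_rcu(folio)) {
		/* Frozen/zombie folio: leave RCU, yield the CPU, retry. */
		rcu_read_unlock();
		cond_resched();
		rcu_read_lock();
		goto repeat;
	}

	if (unlikely(folio != xas_reload(&xas))) {
		folio_put(folio);
		goto repeat;
	}
out:
	rcu_read_unlock();
	return folio;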

This causes different problems to occur. First, a task hangs
evicting the page cache on unlink like so:

[ 1352.990246] INFO: task test_write:3572 blocked for more than 122 seconds.
[ 1352.991418]       Not tainted 6.1.0-rc1-dgc+ #1618
[ 1352.992378] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1352.994702] task:test_write      state:D stack:13088 pid:3572  ppid:3567   flags:0x00004002
[ 1352.996257] Call Trace:
[ 1352.996905]  <TASK>
[ 1352.997452]  __schedule+0x2f9/0xa20
[ 1352.998380]  schedule+0x5a/0xc0
[ 1352.999011]  io_schedule+0x42/0x70
[ 1352.999565]  folio_wait_bit_common+0x137/0x370
[ 1353.000553]  ? dio_warn_stale_pagecache.part.0+0x50/0x50
[ 1353.001425]  truncate_inode_pages_range+0x41a/0x470
[ 1353.002443]  evict+0x1ad/0x1c0
[ 1353.003199]  do_unlinkat+0x1db/0x2e0
[ 1353.003907]  __x64_sys_unlinkat+0x33/0x60
[ 1353.004569]  do_syscall_64+0x35/0x80
[ 1353.005354]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 1353.006315] RIP: 0033:0x7f03d6a42157
[ 1353.007088] RSP: 002b:00007ffde8d98388 EFLAGS: 00000206 ORIG_RAX: 0000000000000107
[ 1353.008367] RAX: ffffffffffffffda RBX: 0000000000000df4 RCX: 00007f03d6a42157
[ 1353.009748] RDX: 0000000000000000 RSI: 000055f8660415a0 RDI: 0000000000000007
[ 1353.011115] RBP: 0000000000000003 R08: 0000000000000000 R09: 00007ffde8d96105
[ 1353.012422] R10: 0000000000000158 R11: 0000000000000206 R12: 0000000000000158
[ 1353.013556] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 1353.014958]  </TASK>

And soon after:

[ 1607.746983] ------------[ cut here ]------------
[ 1607.748141] WARNING: CPU: 27 PID: 51385 at fs/iomap/buffered-io.c:80 iomap_page_release+0xaf/0xc0
[ 1607.751098] Modules linked in:
[ 1607.751619] CPU: 27 PID: 51385 Comm: test_write Not tainted 6.1.0-rc1-dgc+ #1618
[ 1607.753307] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014
[ 1607.755443] RIP: 0010:iomap_page_release+0xaf/0xc0
[ 1607.756496] Code: 48 c1 e8 02 83 e0 01 75 10 38 d0 75 21 5b 4c 89 e7 5d 41 5c e9 d2 be ee ff eb ee e8 bb e0 ec ff eb 88 48 8b 07 5b 5d 41 5c c3 <0f> 0b eb b1 0f 0b eb db 0f 0b eb b2 0f 1f 44 00 00 0f 1f 44 00 00
[ 1607.759643] RSP: 0018:ffffc900087cb5b0 EFLAGS: 00010206
[ 1607.760775] RAX: 0000000000200000 RBX: ffffea0004378000 RCX: 000000000000000c
[ 1607.762181] RDX: 0017ffffc0050081 RSI: ffff888145cb0178 RDI: ffffea0004378000
[ 1607.764408] RBP: 0000000000000200 R08: 0000000000200000 R09: ffffea0004378008
[ 1607.765560] R10: ffff888141988800 R11: 0000000000033900 R12: ffff88810200ee40
[ 1607.766988] R13: 0000000000000200 R14: ffffc900087cb8c0 R15: ffffea0004378008
[ 1607.768347] FS:  00007f03d5c00640(0000) GS:ffff88833bdc0000(0000) knlGS:0000000000000000
[ 1607.769756] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1607.771421] CR2: 00007f03cae00000 CR3: 0000000246226006 CR4: 0000000000770ee0
[ 1607.773298] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1607.775026] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1607.777341] PKRU: 55555554
[ 1607.777995] Call Trace:
[ 1607.778637]  <TASK>
[ 1607.779372]  iomap_release_folio+0x4d/0xa0
[ 1607.780510]  shrink_folio_list+0x9dd/0xb10
[ 1607.781931]  shrink_lruvec+0x558/0xb80
[ 1607.782922]  shrink_node+0x2ce/0x6f0
[ 1607.783993]  do_try_to_free_pages+0xd5/0x570
[ 1607.785031]  try_to_free_mem_cgroup_pages+0x105/0x220
[ 1607.786728]  reclaim_high+0xa6/0xf0
[ 1607.787672]  mem_cgroup_handle_over_high+0x97/0x290
[ 1607.789179]  try_charge_memcg+0x697/0x7f0
[ 1607.790100]  charge_memcg+0x35/0xd0
[ 1607.790930]  __mem_cgroup_charge+0x29/0x80
[ 1607.792884]  __filemap_add_folio+0x21a/0x460
[ 1607.794342]  ? scan_shadow_nodes+0x30/0x30
[ 1607.796052]  filemap_add_folio+0x37/0xa0
[ 1607.797036]  __filemap_get_folio+0x221/0x360
[ 1607.798468]  ? xas_load+0x5/0xa0
[ 1607.799417]  iomap_write_begin+0x103/0x580
[ 1607.800908]  ? filemap_dirty_folio+0x5c/0x80
[ 1607.801940]  ? iomap_write_end+0x101/0x250
[ 1607.803305]  iomap_file_buffered_write+0x17f/0x330
[ 1607.804700]  xfs_file_buffered_write+0xb1/0x2e0
[ 1607.806240]  ? xfs_file_buffered_write+0x2b2/0x2e0
[ 1607.807743]  vfs_write+0x2ca/0x3d0
[ 1607.808622]  __x64_sys_pwrite64+0x8c/0xc0
[ 1607.810502]  do_syscall_64+0x35/0x80
[ 1607.811511]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 1607.813116] RIP: 0033:0x7f03d6b40a87
....
[ 1607.925434] kernel BUG at mm/vmscan.c:1306!
[ 1607.926977] invalid opcode: 0000 [#1] PREEMPT SMP
[ 1607.928427] CPU: 8 PID: 51385 Comm: test_write Tainted: G        W          6.1.0-rc1-dgc+ #1618
[ 1607.930742] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014
[ 1607.932716] RIP: 0010:__remove_mapping+0x229/0x240
[ 1607.933648] Code: a9 00 00 08 00 75 27 48 8b 3b 48 81 c7 88 00 00 00 e8 db 64 c1 00 31 c0 e9 d6 fe ff ff 48 8b 3b e8 7c 29 0d 00 e9 27 ff ff ff <0f> 0b 0f 0b 48 8b 45 00 f6 c4 04 74 d0 31 c0 e9 b5 fe ff ff 0f 1f
[ 1607.937079] RSP: 0018:ffffc900087cb5d8 EFLAGS: 00010246
[ 1607.938050] RAX: 0017ffffc0010004 RBX: ffffc900087cb6e8 RCX: ffff888244bc5000
[ 1607.939443] RDX: 0000000000000001 RSI: ffffea0004378000 RDI: ffff888145cb0318
[ 1607.940980] RBP: ffffc900087cb648 R08: ffff888145cb0318 R09: ffffffff813be400
[ 1607.942300] R10: ffff88810200ee40 R11: 0000000000033900 R12: ffffea0004378000
[ 1607.943666] R13: 0000000000000200 R14: ffffc900087cb8c0 R15: ffffea0004378008
[ 1607.945004] FS:  00007f03d5c00640(0000) GS:ffff88823bc00000(0000) knlGS:0000000000000000
[ 1607.946492] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1607.947552] CR2: 00007efd1ae33da8 CR3: 0000000246226001 CR4: 0000000000770ee0
[ 1607.948866] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1607.950250] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1607.951738] PKRU: 55555554
[ 1607.952289] Call Trace:
[ 1607.952775]  <TASK>
[ 1607.953190]  shrink_folio_list+0x5cd/0xb10
[ 1607.954140]  shrink_lruvec+0x558/0xb80
[ 1607.954850]  shrink_node+0x2ce/0x6f0
[ 1607.955536]  do_try_to_free_pages+0xd5/0x570
[ 1607.956397]  try_to_free_mem_cgroup_pages+0x105/0x220
[ 1607.957503]  reclaim_high+0xa6/0xf0
[ 1607.958205]  mem_cgroup_handle_over_high+0x97/0x290
[ 1607.959149]  try_charge_memcg+0x697/0x7f0
[ 1607.959932]  charge_memcg+0x35/0xd0
[ 1607.960549]  __mem_cgroup_charge+0x29/0x80
[ 1607.961257]  __filemap_add_folio+0x21a/0x460
[ 1607.961912]  ? scan_shadow_nodes+0x30/0x30
[ 1607.962705]  filemap_add_folio+0x37/0xa0
[ 1607.963354]  __filemap_get_folio+0x221/0x360
[ 1607.964076]  ? xas_load+0x5/0xa0
[ 1607.964761]  iomap_write_begin+0x103/0x580
[ 1607.965510]  ? filemap_dirty_folio+0x5c/0x80
[ 1607.966349]  ? iomap_write_end+0x101/0x250
[ 1607.967185]  iomap_file_buffered_write+0x17f/0x330
[ 1607.968133]  xfs_file_buffered_write+0xb1/0x2e0
[ 1607.968948]  ? xfs_file_buffered_write+0x2b2/0x2e0
[ 1607.969783]  vfs_write+0x2ca/0x3d0
[ 1607.970326]  __x64_sys_pwrite64+0x8c/0xc0
[ 1607.971073]  do_syscall_64+0x35/0x80
[ 1607.971732]  entry_SYSCALL_64_after_hwframe+0x63/0xcd

which is:

1297  * Same as remove_mapping, but if the folio is removed from the mapping, it
1298  * gets returned with a refcount of 0.
1299  */
1300 static int __remove_mapping(struct address_space *mapping, struct folio *folio,
1301                             bool reclaimed, struct mem_cgroup *target_memcg)
1302 {
1303         int refcount;
1304         void *shadow = NULL;
1305
1306         BUG_ON(!folio_test_locked(folio));
1307         BUG_ON(mapping != folio_mapping(folio));
1308

tripping over an unlocked folio in shrink_folio_list()....
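
For context, shrink_folio_list() is supposed to hold the folio lock by
the time __remove_mapping() runs: it trylocks each folio before doing
anything with it. A sketch from memory of the 6.1 code, not the exact
source:

	if (!folio_trylock(folio))
		goto keep;
	...
	if (!mapping || !__remove_mapping(mapping, folio, true,
					  sc->target_mem_cgroup))
		goto keep_locked;

i.e. the folio lock is taken well before __remove_mapping() and held
across it, so the BUG_ON firing points at something unlocking the
folio out from under reclaim.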

And, finally, there's also a process spinning on a CPU in
truncate_inode_pages_range():

[ 3468.057406] CPU: 1 PID: 3579 Comm: test_write Tainted: G      D W          6.1.0-rc1-dgc+ #1618
[ 3468.057407] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014
[ 3468.057408] RIP: 0010:truncate_inode_pages_range+0x3c0/0x470
[ 3468.057411] Code: 8b 7c e5 00 41 f6 c6 01 0f 85 62 ff ff ff be a9 03 00 00 48 c7 c7 b8 62 7b 82 e8 0b 46 e9 ff 2e 2e 2e 31 c0 f0 49 0f ba 7
[ 3468.057412] RSP: 0018:ffffc90002cffd40 EFLAGS: 00000246
[ 3468.057414] RAX: 0000000000000000 RBX: ffffc90002cffdd8 RCX: ffff8882461b8000
[ 3468.057415] RDX: 0000000000000000 RSI: 00000000000003a9 RDI: ffffffff827b62b8
[ 3468.057415] RBP: ffff888246eee718 R08: ffffffffffffffc0 R09: fffffffffffffffe
[ 3468.057416] R10: 0000000000002838 R11: 0000000000002834 R12: 0000000000000004
[ 3468.057417] R13: ffffc90002cffd60 R14: ffffea0004378000 R15: 0000000000002810
[ 3468.057418] FS:  00007f03d6953740(0000) GS:ffff88813bc40000(0000) knlGS:0000000000000000
[ 3468.057420] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3468.057421] CR2: 00007f03d0100000 CR3: 0000000245e2c006 CR4: 0000000000770ee0
[ 3468.057422] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3468.057422] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 3468.057423] PKRU: 55555554
[ 3468.057424] Call Trace:
[ 3468.057424]  <TASK>
[ 3468.057426]  evict+0x1ad/0x1c0
[ 3468.057429]  do_unlinkat+0x1db/0x2e0
[ 3468.057432]  __x64_sys_unlinkat+0x33/0x60
[ 3468.057434]  do_syscall_64+0x35/0x80
[ 3468.057437]  entry_SYSCALL_64_after_hwframe+0x63/0xcd

So that's not the problem, either...

-Dave.
-- 
Dave Chinner
david@fromorbit.com
