All of lore.kernel.org
 help / color / mirror / Atom feed
From: Zhaoyang Huang <huangzhaoyang@gmail.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <willy@infradead.org>,
	"zhaoyang.huang" <zhaoyang.huang@unisoc.com>,
	 Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org,  linux-kernel@vger.kernel.org,
	steve.kang@unisoc.com, baocong.liu@unisoc.com,
	 linux-fsdevel@vger.kernel.org, Brian Foster <bfoster@redhat.com>,
	 Christoph Hellwig <hch@infradead.org>,
	David Hildenbrand <david@redhat.com>
Subject: Re: [RFC PATCH] mm: move xa forward when run across zombie page
Date: Thu, 11 Apr 2024 15:04:32 +0800	[thread overview]
Message-ID: <CAGWkznEsREG3Ok8zDwPkqJoHGv6_0f3KX1um-zL_VH=2EkOu9Q@mail.gmail.com> (raw)
In-Reply-To: <20221101071721.GV2703033@dread.disaster.area>

On Tue, Nov 1, 2022 at 3:17 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Thu, Oct 20, 2022 at 10:52:14PM +0100, Matthew Wilcox wrote:
> > On Thu, Oct 20, 2022 at 09:04:24AM +1100, Dave Chinner wrote:
> > > On Wed, Oct 19, 2022 at 04:23:10PM +0100, Matthew Wilcox wrote:
> > > > On Wed, Oct 19, 2022 at 09:30:42AM +1100, Dave Chinner wrote:
> > > > > This is reading and writing the same amount of file data at the
> > > > > application level, but once the data has been written and kicked out
> > > > > of the page cache it seems to require an awful lot more read IO to
> > > > > get it back to the application. i.e. this looks like mmap() is
> > > > > readahead thrashing severely, and eventually it livelocks with this
> > > > > sort of report:
> > > > >
> > > > > [175901.982484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > > > > [175901.985095] rcu:    Tasks blocked on level-1 rcu_node (CPUs 0-15): P25728
> > > > > [175901.987996]         (detected by 0, t=97399871 jiffies, g=15891025, q=1972622 ncpus=32)
> > > > > [175901.991698] task:test_write      state:R  running task     stack:12784 pid:25728 ppid: 25696 flags:0x00004002
> > > > > [175901.995614] Call Trace:
> > > > > [175901.996090]  <TASK>
> > > > > [175901.996594]  ? __schedule+0x301/0xa30
> > > > > [175901.997411]  ? sysvec_apic_timer_interrupt+0xb/0x90
> > > > > [175901.998513]  ? sysvec_apic_timer_interrupt+0xb/0x90
> > > > > [175901.999578]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> > > > > [175902.000714]  ? xas_start+0x53/0xc0
> > > > > [175902.001484]  ? xas_load+0x24/0xa0
> > > > > [175902.002208]  ? xas_load+0x5/0xa0
> > > > > [175902.002878]  ? __filemap_get_folio+0x87/0x340
> > > > > [175902.003823]  ? filemap_fault+0x139/0x8d0
> > > > > [175902.004693]  ? __do_fault+0x31/0x1d0
> > > > > [175902.005372]  ? __handle_mm_fault+0xda9/0x17d0
> > > > > [175902.006213]  ? handle_mm_fault+0xd0/0x2a0
> > > > > [175902.006998]  ? exc_page_fault+0x1d9/0x810
> > > > > [175902.007789]  ? asm_exc_page_fault+0x22/0x30
> > > > > [175902.008613]  </TASK>
> > > > >
> > > > > Given that filemap_fault on XFS is probably trying to map large
> > > > > folios, I do wonder if this is a result of some kind of race with
> > > > > teardown of a large folio...
> > > >
> > > > It doesn't matter whether we're trying to map a large folio; it
> > > > matters whether a large folio was previously created in the cache.
> > > > Through the magic of readahead, it may well have been.  I suspect
> > > > it's not teardown of a large folio, but splitting.  Removing a
> > > > page from the page cache stores to the pointer in the XArray
> > > > first (either NULL or a shadow entry), then decrements the refcount.
> > > >
> > > > We must be observing a frozen folio.  There are a number of places
> > > > in the MM which freeze a folio, but the obvious one is splitting.
> > > > That looks like this:
> > > >
> > > >         local_irq_disable();
> > > >         if (mapping) {
> > > >                 xas_lock(&xas);
> > > > (...)
> > > >         if (folio_ref_freeze(folio, 1 + extra_pins)) {
> > >
> > > But the lookup is not doing anything to prevent the split on the
> > > frozen page from making progress, right? It's not holding any folio
> > > references, and it's not holding the mapping tree lock, either. So
> > > how does the lookup in progress prevent the page split from making
> > > progress?
> >
> > My thinking was that it keeps hammering the ->refcount field in
> > struct folio.  That might prevent a thread on a different socket
> > from making forward progress.  In contrast, spinlocks are designed
> > to be fair under contention, so by spinning on an actual lock, we'd
> > remove contention on the folio.
> >
> > But I think the tests you've done refute that theory.  I'm all out of
> > ideas at the moment.  Either we have a frozen folio from somebody who
> > doesn't hold the lock, or we have someone who's left a frozen folio in
> > the page cache.  I'm leaning towards that explanation at the moment,
> > but I don't have a good suggestion for debugging.
>
> It's something else. I got gdb attached to qemu and single stepped
> the looping lookup. The context I caught this time is truncate after
> unlink:
>
> (gdb) bt
> #0  find_get_entry (mark=<optimized out>, max=<optimized out>, xas=<optimized out>) at mm/filemap.c:2014
> #1  find_lock_entries (mapping=mapping@entry=0xffff8882445e2118, start=start@entry=25089, end=end@entry=18446744073709551614,
>     fbatch=fbatch@entry=0xffffc900082a7dd8, indices=indices@entry=0xffffc900082a7d60) at mm/filemap.c:2095
> #2  0xffffffff8128f024 in truncate_inode_pages_range (mapping=mapping@entry=0xffff8882445e2118, lstart=lstart@entry=0, lend=lend@entry=-1)
>     at mm/truncate.c:364
> #3  0xffffffff8128f452 in truncate_inode_pages (lstart=0, mapping=0xffff8882445e2118) at mm/truncate.c:452
> #4  0xffffffff8136335d in evict (inode=inode@entry=0xffff8882445e1f78) at fs/inode.c:666
> #5  0xffffffff813636cc in iput_final (inode=0xffff8882445e1f78) at fs/inode.c:1747
> #6  0xffffffff81355b8b in do_unlinkat (dfd=dfd@entry=10, name=0xffff88834170e000) at fs/namei.c:4326
> #7  0xffffffff81355cc3 in __do_sys_unlinkat (flag=<optimized out>, pathname=<optimized out>, dfd=<optimized out>) at fs/namei.c:4362
> #8  __se_sys_unlinkat (flag=<optimized out>, pathname=<optimized out>, dfd=<optimized out>) at fs/namei.c:4355
> #9  __x64_sys_unlinkat (regs=<optimized out>) at fs/namei.c:4355
> #10 0xffffffff81e92e35 in do_syscall_x64 (nr=<optimized out>, regs=0xffffc900082a7f58) at arch/x86/entry/common.c:50
> #11 do_syscall_64 (regs=0xffffc900082a7f58, nr=<optimized out>) at arch/x86/entry/common.c:80
> #12 0xffffffff82000087 in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:120
> #13 0x0000000000000000 in ?? ()
>
> The find_lock_entries() call is being asked to start at index
> 25089, and we are spinning on a folio we find because
> folio_try_get_rcu(folio) is failing - the folio ref count is zero.
>
> The xas state on lookup is:
>
> (gdb) p *xas
> $6 = {xa = 0xffff8882445e2120, xa_index = 25092, xa_shift = 0 '\000', xa_sibs = 0 '\000', xa_offset = 4 '\004', xa_pad = 0 '\000',
>   xa_node = 0xffff888144c15918, xa_alloc = 0x0 <fixed_percpu_data>, xa_update = 0x0 <fixed_percpu_data>, xa_lru = 0x0 <fixed_percpu_data>
>
> indicating that we are trying to look up index 25092 (3 pages
> further in than the start of the batch), and the folio that this
> keeps returning is this:
>
> (gdb) p *folio
> $7 = {{{flags = 24769796876795904, {lru = {next = 0xffffea0005690008, prev = 0xffff88823ffd5f50}, {__filler = 0xffffea0005690008,
>           mlock_count = 1073569616}}, mapping = 0x0 <fixed_percpu_data>, index = 18688, private = 0x8 <fixed_percpu_data+8>, _mapcount = {
>         counter = -129}, _refcount = {counter = 0}, memcg_data = 0}, page = {flags = 24769796876795904, {{{lru = {next = 0xffffea0005690008,
>               prev = 0xffff88823ffd5f50}, {__filler = 0xffffea0005690008, mlock_count = 1073569616}, buddy_list = {
>               next = 0xffffea0005690008, prev = 0xffff88823ffd5f50}, pcp_list = {next = 0xffffea0005690008, prev = 0xffff88823ffd5f50}},
>           mapping = 0x0 <fixed_percpu_data>, index = 18688, private = 8}, {pp_magic = 18446719884544507912, pp = 0xffff88823ffd5f50,
>           _pp_mapping_pad = 0, dma_addr = 18688, {dma_addr_upper = 8, pp_frag_count = {counter = 8}}}, {
>           compound_head = 18446719884544507912, compound_dtor = 80 'P', compound_order = 95 '_', compound_mapcount = {counter = -30590},
>           compound_pincount = {counter = 0}, compound_nr = 0}, {_compound_pad_1 = 18446719884544507912,
>           _compound_pad_2 = 18446612691733536592, deferred_list = {next = 0x0 <fixed_percpu_data>,
>             prev = 0x4900 <irq_stack_backing_store+10496>}}, {_pt_pad_1 = 18446719884544507912, pmd_huge_pte = 0xffff88823ffd5f50,
>           _pt_pad_2 = 0, {pt_mm = 0x4900 <irq_stack_backing_store+10496>, pt_frag_refcount = {counter = 18688}},
>           ptl = 0x8 <fixed_percpu_data+8>}, {pgmap = 0xffffea0005690008, zone_device_data = 0xffff88823ffd5f50}, callback_head = {
>           next = 0xffffea0005690008, func = 0xffff88823ffd5f50}}, {_mapcount = {counter = -129}, page_type = 4294967167}, _refcount = {
>         counter = 0}, memcg_data = 0}}, _flags_1 = 24769796876795904, __head = 0, _folio_dtor = 3 '\003', _folio_order = 8 '\b',
>   _total_mapcount = {counter = -1}, _pincount = {counter = 0}, _folio_nr_pages = 0}
> (gdb)
>
> The folio has a NULL mapping, and an index of 18688, which means
> even if it was not a folio that has been invalidated or freed, the
> index is way outside the range we are looking for.
>
> If I step it round the lookup loop, xas does not change, and the
> same folio is returned every time through the loop. Perhaps
> the mapping tree itself might be corrupt???
>
> It's simple enough to stop the machine once it has become stuck to
> observe the iteration and dump structures, just tell me what you
> need to know from here...
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

This bug emerges again and I would like to propose a reproduce
sequence of this bug which has nothing to do with scheduler stuff (
this could be wrong and sorry for wasting your time if so)

Thread_isolate:
1. alloc_contig_range->isolate_migratepages_block isolate a certain of
pages to cc->migratepages
       (folio has refcount: 1 + n (alloc_pages, page_cache))

2. alloc_contig_range->migrate_pages->folio_ref_freeze(folio, 1 +
extra_pins) set the folio->refcnt to 0

3. alloc_contig_range->migrate_pages->xas_split split the folios to
each slot as folio from slot[offset] to slot[offset + sibs]

Thread_truncate:
4. enter the livelock by the chain below
      rcu_read_lock();
        find_get_entry
            folio = xas_find
            if(!folio_try_get_rcu)
                xas_reset;
      rcu_read_unlock();

4'. alloc_contig_range->migrate_pages->__split_huge_page which will
modify folio's refcnt to 2 and breaks the livelock but is blocked by
lruvec->lock's contention

If the above call chain makes sense, could we solve this by below
modification which has split_folio and __split_huge_page be atomic by
taking lruvec->lock earlier than now.

int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
                                     unsigned int new_order)
{

+        lruvec = folio_lruvec_lock(folio);
                if (mapping) {
                        int nr = folio_nr_pages(folio);

                        xas_split(&xas, folio, folio_order(folio));
                        if (folio_test_pmd_mappable(folio) &&
                            new_order < HPAGE_PMD_ORDER) {
                                if (folio_test_swapbacked(folio)) {
                                        __lruvec_stat_mod_folio(folio,
                                                        NR_SHMEM_THPS, -nr);
                                } else {
                                        __lruvec_stat_mod_folio(folio,
                                                        NR_FILE_THPS, -nr);
                                        filemap_nr_thps_dec(mapping);
                                }
                        }
                }

                __split_huge_page(page, list, end, new_order);
+        folio_lruvec_unlock(folio);

  reply	other threads:[~2024-04-11  7:04 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-10-14  5:30 [RFC PATCH] mm: move xa forward when run across zombie page zhaoyang.huang
2022-10-14 12:11 ` Matthew Wilcox
2022-10-17  5:34   ` Zhaoyang Huang
2022-10-17  6:58     ` Zhaoyang Huang
2022-10-17 15:55     ` Matthew Wilcox
2022-10-18  2:52       ` Zhaoyang Huang
2022-10-18  3:09         ` Matthew Wilcox
2022-10-18 22:30           ` Dave Chinner
2022-10-19  1:16             ` Dave Chinner
2022-10-19  4:47               ` Dave Chinner
2022-10-19  5:48                 ` Zhaoyang Huang
2022-10-19 13:06                   ` Matthew Wilcox
2022-10-20  1:27                     ` Zhaoyang Huang
2022-10-26 19:49                   ` Matthew Wilcox
2022-10-27  1:57                     ` Zhaoyang Huang
2022-10-19 11:49             ` Brian Foster
2022-10-20  2:04               ` Dave Chinner
2022-10-20  3:12                 ` Zhaoyang Huang
2022-10-19 15:23             ` Matthew Wilcox
2022-10-19 22:04               ` Dave Chinner
2022-10-19 22:46                 ` Dave Chinner
2022-10-19 23:42                   ` Dave Chinner
2022-10-20 21:52                 ` Matthew Wilcox
2022-10-26  8:38                   ` Zhaoyang Huang
2022-10-26 14:38                     ` Matthew Wilcox
2022-10-26 16:01                   ` Matthew Wilcox
2022-10-28  4:05                     ` Dave Chinner
2022-11-01  7:17                   ` Dave Chinner
2024-04-11  7:04                     ` Zhaoyang Huang [this message]
2022-10-21 21:37 Pulavarty, Badari
2022-10-21 22:31 ` Matthew Wilcox
2022-10-21 22:40   ` Pulavarty, Badari
2022-10-31 19:25   ` Pulavarty, Badari
2022-10-31 19:39     ` Hugh Dickins
2022-10-31 21:33       ` Pulavarty, Badari

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAGWkznEsREG3Ok8zDwPkqJoHGv6_0f3KX1um-zL_VH=2EkOu9Q@mail.gmail.com' \
    --to=huangzhaoyang@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=baocong.liu@unisoc.com \
    --cc=bfoster@redhat.com \
    --cc=david@fromorbit.com \
    --cc=david@redhat.com \
    --cc=hch@infradead.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=steve.kang@unisoc.com \
    --cc=willy@infradead.org \
    --cc=zhaoyang.huang@unisoc.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.