linux-mm.kvack.org archive mirror
From: Yongkai Wu <nic.wuyk@gmail.com>
To: mike.kravetz@oracle.com
Cc: mhocko@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] mm/hugetl.c: keep the page mapping info when free_huge_page() hit the VM_BUG_ON_PAGE
Date: Thu, 15 Nov 2018 16:30:32 +0800	[thread overview]
Message-ID: <CAJtqMcYzA6c1pTrWPcPETsJchOjpJS8iXVhDAJyWuVGCA4gKuA@mail.gmail.com> (raw)
In-Reply-To: <dc39308b-1b9e-0cce-471c-64f94f631f97@oracle.com>


Dear Maintainer,
Below are the details of the issue I encountered.
It may be a bit long to read, but I have tried to explain it thoroughly.



We use CentOS 7.4 (kernel version 3.10.0-693.el7.x86_64).
Several days ago I hit a kernel panic; the kernel log showed:

[759990.616719] EDAC MC2: 51 CE memory read error on
CPU_SrcID#1_MC#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x4dd4f69
offset:0xcc0 grain:32 syndrome:0x0 -  OVERFLOW err_code:0101:0091 socket:1
imc:0 rank:0 bg:1 ba:1 row:1371a col:1a8)
[759990.627721] soft offline: 0x4dd4f69: migration failed 1, type
6fffff00008000
[759990.627743] ------------[ cut here ]------------
[759990.627763] kernel BUG at
/data/rpmbuild/BUILD/kernel-3.10.0/kernel-3.10.0/mm/hugetlb.c:1250!
[759990.628390] CPU: 27 PID: 457768 Comm: mcelog
[759990.628413] Hardware name: Lenovo HR650X           /HR650X     , BIOS
HR6N0333 06/23/2018
[759990.628433] task: ffff882f7f4f4f10 ti: ffff883326f10000 task.ti:
ffff883326f10000
[759990.628452] RIP: 0010:[<ffffffff811cc094>]  [<ffffffff811cc094>]
free_huge_page+0x1e4/0x200
[759990.628479] RSP: 0018:ffff883326f13d80  EFLAGS: 00010213
[759990.628493] RAX: 0000000000000001 RBX: ffffea0137000000 RCX:
0000000000000012
[759990.628511] RDX: 0000000040000000 RSI: ffffffff81f55500 RDI:
ffffea0137000000
[759990.628529] RBP: ffff883326f13da8 R08: ffffffff81f4e0e8 R09:
0000000000000000
......
[759990.628741] Call Trace:
[759990.628752]  [<ffffffff8169fe6a>] __put_compound_page+0x1f/0x22
[759990.628768]  [<ffffffff8169fea2>] put_compound_page+0x35/0x174
[759990.628786]  [<ffffffff811905a5>] put_page+0x45/0x50
[759990.629591]  [<ffffffff811d09a0>] putback_active_hugepage+0xd0/0xf0
[759990.630365]  [<ffffffff811fa0eb>] soft_offline_page+0x4db/0x580
[759990.631134]  [<ffffffff81453595>] store_soft_offline_page+0xa5/0xe0
[759990.631900]  [<ffffffff8143a298>] dev_attr_store+0x18/0x30
[759990.632660]  [<ffffffff81280486>] sysfs_write_file+0xc6/0x140
[759990.633409]  [<ffffffff8120101d>] vfs_write+0xbd/0x1e0
[759990.634148]  [<ffffffff81201e2f>] SyS_write+0x7f/0xe0
[759990.634870]  [<ffffffff816b4849>] system_call_fastpath+0x16/0x1b




I checked the coredump by disassembling free_huge_page():
0xffffffff811cbf78 <free_huge_page+0xc8>:       cmp    $0xffffffff,%eax
0xffffffff811cbf7b <free_huge_page+0xcb>:       jne    0xffffffff811cc094
<free_huge_page+0x1e4>
......
0xffffffff811cc094 <free_huge_page+0x1e4>:    ud2


and checked the source code. From that I could only tell that the panic
reason was: page->_count == 0 but page->_mapcount == 1, so we hit
BUG_ON(page_mapcount(page));
but I could not get any further clue as to how the issue happened.
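For reference, the relevant checks in the 3.10-era free_huge_page() look
roughly like this (abbreviated; exact line numbers differ per build). Note
that page->mapping is cleared before the BUG_ON fires, which is why the
coredump alone could not tell me which file the page belonged to, and why
my patch keeps the mapping info:

static void free_huge_page(struct page *page)
{
    ......
    set_page_private(page, 0);
    page->mapping = NULL;            /* mapping info is lost here */
    BUG_ON(page_count(page));
    BUG_ON(page_mapcount(page));     /* ==> the check we hit at hugetlb.c:1250 */
    ......
}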




So I modified the code as the patch shows, applied the new kernel to our
production line and waited a few days; the issue then occurred again on
another server.
This time, by analysing the coredump with the crash tool, I could recover
the full path of the file that triggered the issue.
For example:
crash> page.mapping ffffea02f9000000
      mapping = 0xffff88b098ae8160
crash> address_space.host 0xffff88b098ae8160
      host = 0xffff88b098ae8010
crash> inode.i_dentry 0xffff88b098ae8010
        i_dentry = {
              first = 0xffff88b0bbeb58b0
        }
crash> dentry.d_name.name -l dentry.d_alias 0xffff88b0bbeb58b0
       d_name.name = 0xffff88b0bbeb5838 "file_a"

So I know the issue happened when doing soft offline on a page of the
file "file_a".
I can also recover the whole file path by walking dentry.d_parent and
checking each dentry name, as shown below.
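For example, one level up (the d_parent value and the name shown here are
hypothetical; read d_parent from the previous dentry and repeat until
reaching the root):

crash> dentry.d_parent 0xffff88b0bbeb58b0
        d_parent = 0xffff88b0bbeb5100
crash> dentry.d_name.name 0xffff88b0bbeb5100
        d_name.name = 0xffff88b0bbeb5138 "dir_a"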




Checking with another team, I learned that their user component keeps
file_a in use all the time,
so page->_mapcount not being -1 looks normal; it is page->_count == 0 that
is abnormal at that point.

I guessed that if I triggered a soft offline on the physical address of a
page used by file_a, the issue might reproduce.
So I wrote a user application that mmaps file_a and obtains the physical
address of the page; the key steps are as follows:
      fd = open(FILE_A_PATH, O_RDWR, 0666);
      buf = mmap(NULL, pagesize, PROT_READ, MAP_SHARED, fd, 0);
      phys_addr = vtop((unsigned long long)buf);
In the function vtop(), I use /proc/pid/pagemap to get the physical
address of the page.
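For completeness, here is a minimal sketch of what my vtop() does (vtop is
my own helper, not a kernel API; this assumes the process has the
privilege to read real PFNs from /proc/self/pagemap):

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

static unsigned long long vtop(unsigned long long vaddr)
{
    uint64_t entry = 0;
    long pagesize = sysconf(_SC_PAGESIZE);
    int fd = open("/proc/self/pagemap", O_RDONLY);

    if (fd < 0)
        return 0;
    /* One 64-bit pagemap entry per virtual page. */
    if (pread(fd, &entry, sizeof(entry),
              (vaddr / pagesize) * sizeof(entry)) != sizeof(entry)) {
        close(fd);
        return 0;
    }
    close(fd);
    if (!(entry & (1ULL << 63)))    /* bit 63: page present */
        return 0;
    /* Bits 0-54 hold the PFN; add the offset within the page. */
    return (entry & ((1ULL << 55) - 1)) * pagesize + (vaddr % pagesize);
}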

Suppose the physical address is 0xbe40000000; then I can trigger a soft
offline on that address:
        echo 0xbe40000000 > /sys/devices/system/memory/soft_offline_page
After triggering it two or more times, the issue reproduces.





Then I used systemtap to probe page->_count and page->_mapcount in the
functions soft_offline_page(), putback_active_hugepage() and
migrate_pages(). Part of my systemtap script:

function get_page_mapcount:long (page:long) %{
    struct page *page;

    page = (struct page *)STAP_ARG_page;
    if (page == NULL)
        STAP_RETURN(0);
    else
        /* page_mapcount() returns _mapcount + 1, so subtract 1
         * to get the raw page->_mapcount value. */
        STAP_RETURN(page_mapcount(page) - 1);
%}

probe kernel.function("migrate_pages")
{
    page2 = get_page_from_migrate_pages($from);
    printf("Now exec migrate_pages -- page=%p,pfn=%ld,phy_addr=0x%lx,page_flags=0x%lx\n",
           page2, get_page_pfn(page2), get_page_phy_addr(page2), get_page_flags(page2));
    printf("page->mapping=%p,page->_count=%d,page->_mapcount=%d\n",
           get_page_mapping(page2), get_page_count(page2), get_page_mapcount(page2));
    print_backtrace();
}
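(The get_page_* helpers above are my own embedded-C functions and are not
shown in full. As an illustration, get_page_count() might look like this
sketch, using the kernel's page_count(), which reads page->_count in 3.10:)

function get_page_count:long (page:long) %{
    struct page *page = (struct page *)STAP_ARG_page;

    if (page == NULL)
        STAP_RETURN(0);
    else
        /* page_count() is atomic_read(&page->_count) in 3.10. */
        STAP_RETURN(page_count(page));
%}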

Then I triggered soft offline again to reproduce the issue, and finally
found the root cause:
In CentOS 7.4, when we enter soft_offline_huge_page(), get_any_page() has
increased page->_count by 1
(isolate_huge_page() also increments page->_count by 1, but the following
put_page() releases that reference).

Because we use 1 GB hugepages, hugepage_migration_supported() always
returns false,
so soft_offline_huge_page() --> migrate_pages() -->
unmap_and_move_huge_page() calls putback_active_hugepage() to decrease
page->_count by 1, just as the code shows:
static int unmap_and_move_huge_page(new_page_t get_new_page, ...)
{
......
    if (!hugepage_migration_supported(page_hstate(hpage))) {
        putback_active_hugepage(hpage);  /* ==> decreases page->_count by 1 */
        return -ENOSYS;
    }
......

Then, when we return to soft_offline_huge_page(), page->_count is
decreased by 1 again by putback_active_hugepage():
static int soft_offline_huge_page(struct page *page, int flags)
{
    ret = migrate_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL,
                MIGRATE_SYNC, MR_MEMORY_FAILURE);
    if (ret) {
        pr_info("soft offline: %#lx: migration failed %d, type %lx\n",
            pfn, ret, page->flags);
        putback_active_hugepage(hpage);  /* ==> decreases page->_count by 1 again */
        ......
    } else {
        ......
    }
}


So we can see that when soft_offline_page() is called on a 1 GB hugepage,
page->_count is abnormally decreased by 1!
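
To summarize the reference counting on this failure path (as I understand
it from the traces above):

    get_any_page()                                page->_count += 1
    unmap_and_move_huge_page()
        -> putback_active_hugepage()              page->_count -= 1
    soft_offline_huge_page() error path
        -> putback_active_hugepage()              page->_count -= 1
    ----------------------------------------------------------------
    net effect per soft offline attempt           page->_count -= 1

After enough attempts, page->_count reaches 0 while the file is still
mapped, so the final put_page() frees the hugepage and free_huge_page()
hits BUG_ON(page_mapcount(page)).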



I removed one putback_active_hugepage() call in soft_offline_huge_page()
to fix this issue on our side.

I also checked the latest kernel code on GitHub (4.19); this issue seems
to have already been fixed there by the following code:
static int soft_offline_huge_page(struct page *page, int flags)
{
    ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
                MIGRATE_SYNC, MR_MEMORY_FAILURE);
    if (ret) {
        pr_info("soft offline: %#lx: hugepage migration failed %d, type %lx (%pGp)\n",
            pfn, ret, page->flags, &page->flags);
        if (!list_empty(&pagelist))      /* ==> this check seems to fix the issue I hit */
            putback_movable_pages(&pagelist);
        if (ret > 0)
            ret = -EIO;
    } else {
        ......
    }
}
However, I could not find a corresponding bug-fix report or commit log.


Thread overview: 8+ messages
2018-11-13 12:38 [PATCH] mm/hugetl.c: keep the page mapping info when free_huge_page() hit the VM_BUG_ON_PAGE Yongkai Wu
2018-11-13 13:04 ` Michal Hocko
2018-11-13 15:12   ` Yongkai Wu
2018-11-13 15:24     ` Michal Hocko
2018-11-13 18:04     ` Mike Kravetz
2018-11-14 15:37       ` Yongkai Wu
2018-11-15  8:30       ` Yongkai Wu [this message]
2018-11-15  8:52         ` Michal Hocko
