Dear Maintainer,

The following is the detailed information about an issue I hit. It may be tedious to read, but I will try to explain it as fully as I can.

We use CentOS 7.4 (kernel version 3.10.0-693.el7.x86_64). Several days ago I hit a kernel panic; the kernel log showed:

[759990.616719] EDAC MC2: 51 CE memory read error on CPU_SrcID#1_MC#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x4dd4f69 offset:0xcc0 grain:32 syndrome:0x0 - OVERFLOW err_code:0101:0091 socket:1 imc:0 rank:0 bg:1 ba:1 row:1371a col:1a8)
[759990.627721] soft offline: 0x4dd4f69: migration failed 1, type 6fffff00008000
[759990.627743] ------------[ cut here ]------------
[759990.627763] kernel BUG at /data/rpmbuild/BUILD/kernel-3.10.0/kernel-3.10.0/mm/hugetlb.c:1250!
[759990.628390] CPU: 27 PID: 457768 Comm: mcelog
[759990.628413] Hardware name: Lenovo HR650X /HR650X , BIOS HR6N0333 06/23/2018
[759990.628433] task: ffff882f7f4f4f10 ti: ffff883326f10000 task.ti: ffff883326f10000
[759990.628452] RIP: 0010:[] [] free_huge_page+0x1e4/0x200
[759990.628479] RSP: 0018:ffff883326f13d80 EFLAGS: 00010213
[759990.628493] RAX: 0000000000000001 RBX: ffffea0137000000 RCX: 0000000000000012
[759990.628511] RDX: 0000000040000000 RSI: ffffffff81f55500 RDI: ffffea0137000000
[759990.628529] RBP: ffff883326f13da8 R08: ffffffff81f4e0e8 R09: 0000000000000000
......
[759990.628741] Call Trace:
[759990.628752] [] __put_compound_page+0x1f/0x22
[759990.628768] [] put_compound_page+0x35/0x174
[759990.628786] [] put_page+0x45/0x50
[759990.629591] [] putback_active_hugepage+0xd0/0xf0
[759990.630365] [] soft_offline_page+0x4db/0x580
[759990.631134] [] store_soft_offline_page+0xa5/0xe0
[759990.631900] [] dev_attr_store+0x18/0x30
[759990.632660] [] sysfs_write_file+0xc6/0x140
[759990.633409] [] vfs_write+0xbd/0x1e0
[759990.634148] [] SyS_write+0x7f/0xe0
[759990.634870] [] system_call_fastpath+0x16/0x1b

I examined the coredump by disassembling free_huge_page():

0xffffffff811cbf78 : cmp $0xffffffff,%eax
0xffffffff811cbf7b : jne 0xffffffff811cc094
......
0xffffffff811cc094 : ud2

Checking this against the source code, I could only determine the immediate reason for the panic: page->_count = 0 but page->_mapcount = 1, so we hit BUG_ON(page_mapcount(page)) (the source excerpt is quoted below). But I could not get any further clue about how the issue happened.

So I modified the code as the patch shows, deployed the new kernel to our production line, and waited a few days; the issue then occurred again on another server. This time, by analyzing the coredump with the crash tool, I could identify the file that triggered the issue. For example:

crash> page.mapping ffffea02f9000000
  mapping = 0xffff88b098ae8160
crash> address_space.host 0xffff88b098ae8160
  host = 0xffff88b098ae8010
crash> inode.i_dentry 0xffff88b098ae8010
  i_dentry = {
    first = 0xffff88b0bbeb58b0
  }
crash> dentry.d_name.name -l dentry.d_alias 0xffff88b0bbeb58b0
  d_name.name = 0xffff88b0bbeb5838 "file_a"

So I knew the issue happened while soft-offlining a page of the file "file_a", and I could recover the full file path by walking dentry.d_parent and checking each dentry's name.

After checking with another team, I learned that their user-space component keeps file_a in use all the time, so page->_mapcount not being -1 looks normal, while page->_count = 0 at that moment is abnormal.

My guess was that if I triggered a soft offline against the physical address of a page used by file_a, the issue might reproduce.
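For reference, the check that fired looks roughly like this in the 3.10-series source (abridged here from memory, so the exact layout may differ; it matches the disassembly above, which compares _mapcount against -1, i.e. 0xffffffff, before reaching the ud2):

/* mm/hugetlb.c, 3.10 series (abridged).  free_huge_page() is the
 * compound-page destructor, invoked by put_page() once page->_count
 * has dropped to zero. */
void free_huge_page(struct page *page)
{
        ......
        set_page_private(page, 0);
        page->mapping = NULL;
        BUG_ON(page_count(page));       /* _count must already be 0 here */
        BUG_ON(page_mapcount(page));    /* the page must be unmapped;
                                         * page_mapcount() returns
                                         * _mapcount + 1, so with
                                         * _mapcount == 1 this is the
                                         * line-1250 check that fired */
        ......
}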
To test that guess, I wrote a user application that mmaps file_a and obtains the physical address of one of its pages; the key steps are as follows:

        fd = open(FILE_A_PATH, O_RDWR, 0666);
        buf = mmap(NULL, pagesize, PROT_READ, MAP_SHARED, fd, 0);
        phys_addr = vtop((unsigned long long)buf);

In the function vtop(), I use "/proc/pid/pagemap" to look up the physical address of the page (a sketch of this helper is attached at the end of this mail).

Suppose the physical address is 0xbe40000000; then I can trigger a soft offline on that address:

        echo 0xbe40000000 > /sys/devices/system/memory/soft_offline_page

After I trigger this two or more times, the issue reproduces.

I then used systemtap to probe page->_count and page->_mapcount in the functions soft_offline_page(), putback_active_hugepage(), and migrate_pages(). Part of my systemtap script (it embeds C code, so it has to run in guru mode, i.e. stap -g):

        function get_page_mapcount:long (page:long) %{
                struct page *page;

                page = (struct page *)STAP_ARG_page;
                if (page == NULL)
                        STAP_RETURN(0);
                else
                        /* raw _mapcount is page_mapcount() - 1 */
                        STAP_RETURN(page_mapcount(page) - 1);
        %}

        probe kernel.function("migrate_pages")
        {
                page2 = get_page_from_migrate_pages($from);
                printf("Now exec migrate_pages -- page=%p,pfn=%ld,phy_addr=0x%lx,page_flags=0x%lx\n",
                       page2, get_page_pfn(page2), get_page_phy_addr(page2),
                       get_page_flags(page2));
                printf("page->mapping=%p,page->_count=%d,page->_mapcount=%d\n",
                       get_page_mapping(page2), get_page_count(page2),
                       get_page_mapcount(page2));
                print_backtrace();
        }

Then I triggered the soft offline again to reproduce the issue, and finally found the root cause.

In CentOS 7.4, by the time we run into soft_offline_huge_page(), the only extra reference on the page is the single one taken by get_any_page() (isolate_huge_page() also increments page->_count, but the following put_page() releases that reference again). Because we use 1 GB hugepages, hugepage_migration_supported() always returns false, so soft_offline_huge_page() --> migrate_pages() --> unmap_and_move_huge_page() calls putback_active_hugepage(), which decreases page->_count by 1, as the code shows:

        static int unmap_and_move_huge_page(new_page_t get_new_page, /* ... */)
        {
                ......
                if (!hugepage_migration_supported(page_hstate(hpage))) {
                        putback_active_hugepage(hpage); // ==> will decrease page->_count by 1
                        return -ENOSYS;
                }
                ......

Then, when control returns to soft_offline_huge_page(), page->_count is decreased by 1 a second time, again by putback_active_hugepage():

        static int soft_offline_huge_page(struct page *page, int flags)
        {
                ......
                ret = migrate_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL,
                                    MIGRATE_SYNC, MR_MEMORY_FAILURE);
                if (ret) {
                        pr_info("soft offline: %#lx: migration failed %d, type %lx\n",
                                pfn, ret, page->flags);
                        putback_active_hugepage(hpage); // ==> here will decrease page->_count by 1 again
                        ......
                } else {
                        ......
                }
        }

So every call to soft_offline_page() on a 1 GB hugepage erroneously decreases page->_count by 1. Each trigger therefore leaks one reference decrement, which explains why two or more triggers were needed: once _count finally drops to 0 while the page is still mapped, the next put_page() runs free_huge_page() and hits BUG_ON(page_mapcount(page)).

[ I removed one putback_active_hugepage() call in soft_offline_huge_page() to fix this issue on our side. ]

I also checked the latest kernel code on GitHub (4.19); it seems this issue is already fixed there by the following code:

        static int soft_offline_huge_page(struct page *page, int flags)
        {
                ......
                ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
                                    MIGRATE_SYNC, MR_MEMORY_FAILURE);
                if (ret) {
                        pr_info("soft offline: %#lx: hugepage migration failed %d, type %lx (%pGp)\n",
                                pfn, ret, page->flags, &page->flags);
                        if (!list_empty(&pagelist)) // ==> this check seems to fix the issue I hit
                                putback_movable_pages(&pagelist);
                        if (ret > 0)
                                ret = -EIO;
                } else {
                        ......
                }

This works because putback_active_hugepage() inside unmap_and_move_huge_page() already removes the page from pagelist, so the list_empty() check skips the second putback. But I cannot find a corresponding bug-fix report or commit log.
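As mentioned above, here is a minimal sketch of the vtop() helper from my reproducer (error handling trimmed for brevity; note that recent kernels zero the PFN field of pagemap unless the reader has CAP_SYS_ADMIN, so it should run as root):

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Translate a user virtual address to a physical address via
 * /proc/self/pagemap: one 64-bit entry per base-size virtual page,
 * bit 63 = page present, bits 0-54 = PFN.  pagemap is indexed by the
 * base page size even for hugetlb mappings, so the entry covering the
 * mapped buffer still yields a PFN inside the backing hugepage. */
unsigned long long vtop(unsigned long long vaddr)
{
        long pagesize = sysconf(_SC_PAGESIZE);
        uint64_t entry = 0;
        int fd = open("/proc/self/pagemap", O_RDONLY);

        if (fd < 0)
                return 0;
        /* each virtual page has one 8-byte entry */
        if (pread(fd, &entry, sizeof(entry),
                  (vaddr / pagesize) * sizeof(entry)) != sizeof(entry)) {
                close(fd);
                return 0;
        }
        close(fd);
        if (!(entry & (1ULL << 63)))    /* page not present */
                return 0;
        /* physical address = PFN * pagesize + offset within the page */
        return (entry & ((1ULL << 55) - 1)) * pagesize + vaddr % pagesize;
}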