* BUG() due to "mm: put_and_wait_on_page_locked() while page is migrated" @ 2019-01-23 4:29 Qian Cai 2019-01-23 9:30 ` Michal Hocko 0 siblings, 1 reply; 8+ messages in thread From: Qian Cai @ 2019-01-23 4:29 UTC (permalink / raw) To: hughd, Andrea Arcangeli, Michal Hocko, torvalds, vbabka, akpm; +Cc: Linux-MM Running LTP migrate_pages03 [1] a few times triggering BUG() below on an arm64 ThunderX2 server. Reverted the commit 9a1ea439b16b9 ("mm: put_and_wait_on_page_locked() while page is migrated") allows it to run continuously. put_and_wait_on_page_locked wait_on_page_bit_common put_page put_page_testzero VM_BUG_ON_PAGE(page_ref_count(page) == 0, page); [1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/migrate_pages/migrate_pages03.c [ 1304.643587] page:ffff7fe0226ff000 count:2 mapcount:0 mapping:ffff8095c3406d58 index:0x7 [ 1304.652082] xfs_address_space_operations [xfs] [ 1304.652104] name:"libc-2.28.so" [ 1304.656653] flags: 0x7ffffc00000887(locked|waiters|referenced|uptodate|arch_1) [ 1304.667134] raw: 007ffffc00000887 ffff7fe0227bac88 ffff7fe02261cd88 0000000000000000 [ 1304.674894] raw: 0000000000000007 0000000000000000 00000002ffffffff ffff80082039b080 [ 1304.682652] page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0) [ 1304.689553] page->mem_cgroup:ffff80082039b080 [ 1304.693932] page allocated via order 0, migratetype Movable, gfp_mask 0x62124a(GFP_NOFS|__GFP_HIGHMEM|__GFP_NOWARN|__GFP_NORETRY|__GFP_HARDWALL|__GFP_MOVABLE) [ 1304.708137] get_page_from_freelist+0x2cec/0x30c0 [ 1304.712864] __alloc_pages_nodemask+0x350/0x22d0 [ 1304.717504] alloc_pages_current+0x154/0x158 [ 1304.721795] __page_cache_alloc+0x274/0x27c [ 1304.726001] __do_page_cache_readahead+0x1e4/0x380 [ 1304.730812] filemap_fault+0x540/0x1204 [ 1304.734882] __xfs_filemap_fault+0x714/0x734 [xfs] [ 1304.739893] xfs_filemap_fault+0xe4/0xfc [xfs] [ 1304.744440] __do_fault+0x294/0x5dc [ 1304.747950] do_fault+0x324/0x1360 [ 1304.751370] 
__handle_mm_fault+0x9a8/0xb90 [ 1304.755481] handle_mm_fault+0x610/0x614 [ 1304.759423] do_page_fault+0x530/0x818 [ 1304.763188] do_translation_fault+0x88/0xe8 [ 1304.767388] do_mem_abort+0x78/0x168 [ 1304.770979] do_el0_ia_bp_hardening+0x7c/0x8c [ 1304.775351] page has been migrated, last migrate reason: syscall_or_cpuset [ 1304.782294] ------------[ cut here ]------------ [ 1304.786904] kernel BUG at include/linux/mm.h:546! [ 1304.791728] Internal error: Oops - BUG: 0 [#1] SMP [ 1304.796513] Modules linked in: thunderx2_pmu ip_tables xfs libcrc32c sd_mod ahci libahci mlx5_core libata dm_mirror dm_region_hash dm_log dm_mod efivarfs [ 1304.810256] CPU: 248 PID: 10307 Comm: 0anacron Kdump: loaded Not tainted 5.0.0-rc3+ #1 [ 1304.818163] Hardware name: HPE Apollo 70 /C01_APACHE_MB , BIOS L50_5.13_1.0.6 07/10/2018 [ 1304.827980] pstate: 10400009 (nzcV daif +PAN -UAO) [ 1304.832764] pc : put_and_wait_on_page_locked+0x4c8/0x5f0 [ 1304.838067] lr : put_and_wait_on_page_locked+0x4c8/0x5f0 [ 1304.843369] sp : ffff8094fbf57960 [ 1304.846674] x29: ffff8094fbf57960 x28: ffff2000117f7f90 [ 1304.851978] x27: ffff2000117f7f88 x26: ffff2000117fd6c8 [ 1304.857281] x25: 0000000000000001 x24: ffff8094fbf579f0 [ 1304.862584] x23: 1ffff0129f7eaf3a x22: ffff2000117f7f50 [ 1304.867887] x21: dfff200000000000 x20: ffff7fe0226ff034 [ 1304.873190] x19: ffff7fe0226ff000 x18: 0000000000000000 [ 1304.878493] x17: 0000000000000000 x16: 0000000000000000 [ 1304.883795] x15: 0000000000000000 x14: 46475f5f7c4c4c41 [ 1304.889098] x13: 57445241485f5046 x12: ffff0400025bb4d9 [ 1304.894401] x11: 1fffe400025bb4d9 x10: 5f7c4e5241574f4e [ 1304.899703] x9 : dfff200000000000 x8 : 6c6c616373797320 [ 1304.905006] x7 : 0000000000000000 x6 : ffff20001021b024 [ 1304.910309] x5 : 0000000000000000 x4 : 0000000000000000 [ 1304.915612] x3 : 0000000000000000 x2 : 29c8834f768b6d00 [ 1304.920914] x1 : 29c8834f768b6d00 x0 : 0000000000000000 [ 1304.926220] Process 0anacron (pid: 10307, stack limit = 0x00000000e3061c7a) [ 
1304.933172] Call trace: [ 1304.935610] put_and_wait_on_page_locked+0x4c8/0x5f0 [ 1304.940577] __migration_entry_wait+0x238/0x260 [ 1304.945098] migration_entry_wait+0xfc/0x110 [ 1304.949361] do_swap_page+0x1b0/0x198c [ 1304.953101] __handle_mm_fault+0x9a0/0xb90 [ 1304.957188] handle_mm_fault+0x610/0x614 [ 1304.961102] do_page_fault+0x530/0x818 [ 1304.964842] do_translation_fault+0x88/0xe8 [ 1304.969016] do_mem_abort+0x78/0x168 [ 1304.972583] do_el0_ia_bp_hardening+0x7c/0x8c [ 1304.976931] el0_ia+0x1c/0x20 [ 1304.979893] Code: 91298021 91128021 aa1303e0 940282fb (d4210000) ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: BUG() due to "mm: put_and_wait_on_page_locked() while page is migrated" 2019-01-23 4:29 BUG() due to "mm: put_and_wait_on_page_locked() while page is migrated" Qian Cai @ 2019-01-23 9:30 ` Michal Hocko 2019-01-25 4:19 ` Hugh Dickins 0 siblings, 1 reply; 8+ messages in thread From: Michal Hocko @ 2019-01-23 9:30 UTC (permalink / raw) To: Qian Cai; +Cc: hughd, Andrea Arcangeli, torvalds, vbabka, akpm, Linux-MM On Tue 22-01-19 23:29:04, Qian Cai wrote: > Running LTP migrate_pages03 [1] a few times triggering BUG() below on an arm64 > ThunderX2 server. Reverted the commit 9a1ea439b16b9 ("mm: > put_and_wait_on_page_locked() while page is migrated") allows it to run > continuously. > > put_and_wait_on_page_locked > wait_on_page_bit_common > put_page > put_page_testzero > VM_BUG_ON_PAGE(page_ref_count(page) == 0, page); > > [1] > https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/migrate_pages/migrate_pages03.c > > [ 1304.643587] page:ffff7fe0226ff000 count:2 mapcount:0 mapping:ffff8095c3406d58 index:0x7 > [ 1304.652082] xfs_address_space_operations [xfs] [...] > [ 1304.682652] page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0) This looks like a page reference count imbalance to me. The page seems to have been freed when the migration code (wait_on_page_bit_common) called put_page, and was immediately reused for an xfs allocation, which is why we see its ref count==2. But I fail to see how that is possible, as __migration_entry_wait already does get_page_unless_zero, so the imbalance must have been preexisting. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: BUG() due to "mm: put_and_wait_on_page_locked() while page is migrated" @ 2019-01-25 4:19 ` Hugh Dickins 0 siblings, 0 replies; 8+ messages in thread From: Hugh Dickins @ 2019-01-25 4:19 UTC (permalink / raw) To: Qian Cai Cc: Michal Hocko, hughd, Andrea Arcangeli, torvalds, vbabka, akpm, Linux-MM On Wed, 23 Jan 2019, Michal Hocko wrote: > On Tue 22-01-19 23:29:04, Qian Cai wrote: > > Running LTP migrate_pages03 [1] a few times triggering BUG() below on an arm64 > > ThunderX2 server. Reverted the commit 9a1ea439b16b9 ("mm: > > put_and_wait_on_page_locked() while page is migrated") allows it to run > > continuously. > > > > put_and_wait_on_page_locked > > wait_on_page_bit_common > > put_page > > put_page_testzero > > VM_BUG_ON_PAGE(page_ref_count(page) == 0, page); > > > > [1] > > https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/migrate_pages/migrate_pages03.c > > > > [ 1304.643587] page:ffff7fe0226ff000 count:2 mapcount:0 mapping:ffff8095c3406d58 index:0x7 > > [ 1304.652082] xfs_address_space_operations [xfs] > [...] > > [ 1304.682652] page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0) > > This looks like a page reference countimbalance to me. The page seemed > to be freed at the the migration code (wait_on_page_bit_common) called > put_page and immediatelly got reused for xfs allocation and that is why > we see its ref count==2. But I fail to see how that is possible as > __migration_entry_wait already does get_page_unless_zero so the > imbalance must have been preexisting. This report worried me, but I've thought around it, and agree with Michal that it must be reflecting a preexisting refcount imbalance - preexisting in the sense that the imbalance occurred sometime before reaching put_and_wait_on_page_locked(), and in the sense that the bug causing the imbalance came in before the put_and_wait_on_page_locked() commit, perhaps even long ago. 
If it is a software bug at all - I wonder if any other hardware shows the same issue - I have not seen it on x86 (though I wasn't using xfs), nor heard of anyone else reporting it - but thank you for doing so, it could be important. But I (probably) disagree with Michal about the page being freed and reused for xfs allocation. I have no proof, but I think the likelihood is that the page shown is the old xfs page (from libc-2.28.so, I see) which is currently being migrated. I realize that "last migrate reason: syscall_or_cpuset" would not get set until later, but I think it's left over from the previous migration: migrate_pages03 looks like it's migrating pages back and forth repeatedly. What I think happened is that something at some time earlier did a mistaken put_page() on the page. Then __migration_entry_wait() raced with migrate_page_move_mapping(), in such a way that get_page_unless_zero() then briefly raised the page's refcount to expected_count, so migration was able to freeze the page (set its refcount transiently to 0). Then put_and_wait_on_page_locked() reached the put_page() in wait_on_page_bit_common() while migration still had the refcount frozen at 0, and bang, your crash. But how come reverting the put_and_wait commit appears to fix it for you? That puzzled me: for a while I expected you then to see an equally visible crash in the old put_page() after wait_on_page_locked(), or else at the migration end where it puts the page afterwards (putback_lru_page perhaps). I guess the answer comes from that "libc-2.28.so". This page is one of those very popular pages which were next-to-impossible to migrate before the put_and_wait commit, because they are so widely mapped, and their migration entries so frequently faulted, that migration could not freeze them. (With enough migration waiters to outweigh the off-by-one of the incorrect refcount.) 
Being so widely used, the refcount imbalance on that page would (I think) only show up when unmounting the root at shutdown: easily missed. So I think you've identified that the put_and_wait commit has exposed an existing bug, and it may be very tedious to track down where that is. Maybe the bug is itself triggered by migrate_pages03, but quite likely not. Hugh ^ permalink raw reply [flat|nested] 8+ messages in thread
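The interleaving hypothesized in the message above can be replayed in a small userspace model. This is only a sketch, not kernel code: `struct page_model`, `put_page_checked()` and `replay_race()` are hypothetical names invented here, the helpers merely mimic the behaviour the thread attributes to get_page_unless_zero(), the refcount freeze in migrate_page_move_mapping(), and the put_page() check, and the steps run in one fixed order rather than on real concurrent CPUs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page: only the refcount matters here. */
struct page_model {
    atomic_int refcount;
};

/* Model of get_page_unless_zero(): take a reference unless the count is 0. */
static bool get_page_unless_zero(struct page_model *p)
{
    int old = atomic_load(&p->refcount);

    while (old != 0) {
        /* On failure, 'old' is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&p->refcount, &old, old + 1))
            return true;
    }
    return false;
}

/* Model of the freeze: succeeds only if refcount == expected, then sets 0. */
static bool page_ref_freeze(struct page_model *p, int expected)
{
    int old = expected;

    return atomic_compare_exchange_strong(&p->refcount, &old, 0);
}

/*
 * Model of put_page(): returns false where the kernel would hit
 * VM_BUG_ON_PAGE(page_ref_count(page) == 0), otherwise drops one reference.
 */
static bool put_page_checked(struct page_model *p)
{
    if (atomic_load(&p->refcount) == 0)
        return false;
    atomic_fetch_sub(&p->refcount, 1);
    return true;
}

/*
 * Replay the suspected sequence. 'true_refs' is the number of legitimate
 * references (what migration expects); 'stray_puts' is the number of
 * mistaken earlier put_page() calls. Returns true if the waiter's final
 * put finds the refcount frozen at 0, i.e. the reported BUG().
 */
static bool replay_race(int true_refs, int stray_puts)
{
    struct page_model page;
    int expected_count = true_refs;

    assert(true_refs > 0 && stray_puts >= 0 && stray_puts < true_refs);
    atomic_init(&page.refcount, true_refs - stray_puts);

    /* 1. The migration-entry waiter takes its transient reference. */
    if (!get_page_unless_zero(&page))
        return false;

    /* 2. Migration tries to freeze; a correct count makes this fail. */
    if (!page_ref_freeze(&page, expected_count)) {
        put_page_checked(&page);    /* waiter drops its reference normally */
        return false;
    }

    /* 3. The waiter drops its reference while the count is frozen at 0. */
    return !put_page_checked(&page);
}
```

In this model `replay_race(3, 0)` returns false, because the waiter's extra reference makes the freeze fail, while `replay_race(3, 1)` returns true: the single stray put cancels the waiter's reference, the freeze succeeds, and the waiter's put then sees a count of 0.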
* Re: BUG() due to "mm: put_and_wait_on_page_locked() while page is migrated" 2019-01-25 4:19 ` Hugh Dickins (?) @ 2019-01-25 4:31 ` Qian Cai 2019-01-25 8:51 ` Michal Hocko -1 siblings, 1 reply; 8+ messages in thread From: Qian Cai @ 2019-01-25 4:31 UTC (permalink / raw) To: Hugh Dickins Cc: Michal Hocko, Andrea Arcangeli, torvalds, vbabka, akpm, Linux-MM On 1/24/19 11:19 PM, Hugh Dickins wrote: > On Wed, 23 Jan 2019, Michal Hocko wrote: >> On Tue 22-01-19 23:29:04, Qian Cai wrote: >>> Running LTP migrate_pages03 [1] a few times triggering BUG() below on an arm64 >>> ThunderX2 server. Reverted the commit 9a1ea439b16b9 ("mm: >>> put_and_wait_on_page_locked() while page is migrated") allows it to run >>> continuously. >>> >>> put_and_wait_on_page_locked >>> wait_on_page_bit_common >>> put_page >>> put_page_testzero >>> VM_BUG_ON_PAGE(page_ref_count(page) == 0, page); >>> >>> [1] >>> https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/migrate_pages/migrate_pages03.c >>> >>> [ 1304.643587] page:ffff7fe0226ff000 count:2 mapcount:0 mapping:ffff8095c3406d58 index:0x7 >>> [ 1304.652082] xfs_address_space_operations [xfs] >> [...] >>> [ 1304.682652] page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0) >> >> This looks like a page reference countimbalance to me. The page seemed >> to be freed at the the migration code (wait_on_page_bit_common) called >> put_page and immediatelly got reused for xfs allocation and that is why >> we see its ref count==2. But I fail to see how that is possible as >> __migration_entry_wait already does get_page_unless_zero so the >> imbalance must have been preexisting. 
> > This report worried me, but I've thought around it, and agree with > Michal that it must be reflecting a preexisting refcount imbalance - > preexisting in the sense that the imbalance occurred sometime before > reaching put_and_wait_on_page_locked(), and in the sense that the bug > causing the imbalance came in before the put_and_wait_on_page_locked() > commit, perhaps even long ago. > > If it is a software bug at all - I wonder if any other hardware shows > the same issue - I have not seen it on x86 (though I wasn't using xfs), > nor heard of anyone else reporting it - but thank you for doing so, > it could be important. > > But I (probably) disagree with Michal about the page being freed and > reused for xfs allocation. I have no proof, but I think the likelihood > is that the page shown is the old xfs page (from libc-2.28.so, I see) > which is currently being migrated. > > I realize that "last migrate reason: syscall_or_cpuset" would not get > set until later, but I think it's left over from the previous migration: > migrate_pages03 looks like it's migrating pages back and forth repeatedly. > > What I think happened is that something at some time earlier did a > mistaken put_page() on the page. Then __migration_entry_wait() raced > with migrate_page_move_mapping(), in such a way that get_page_unless_zero() > then briefly raised the page's refcount to expected_count, so migration was > able to freeze the page (set its refcount transiently to 0). Then put_and > _wait_on_page_locked() reached the put_page() in wait_on_page_bit_common() > while migration still had the refcount frozen at 0, and bang, your crash. > > But how come reverting the put_and_wait commit appears to fix it for you? > That puzzled me, for a while I expected you then to see an equally visible > crash in the old put_page() after wait_on_page_locked(), or else at the > migration end where it puts the page afterwards (putback_lru_page perhaps). 
> > I guess the answer comes from that "libc-2.28.so". This page is one of > those very popular pages which were next-to-impossible to migrate before > the put_and_wait commit, because they are so widely mapped, and their > migration entries so frequently faulted, that migration could not freeze > them. (With enough migration waiters to outweigh the off-by-one of the > incorrect refcount.) > > Being so widely used, the refcount imbalance on that page would (I think) > only show up when unmounting the root at shutdown: easily missed. > > So I think you've identified that the put_and_wait commit has exposed > an existing bug, and it may be very tedious to track down where that is. > Maybe the bug is itself triggered by migrate_pages03, but quite likely not. It looks like the put_and_wait commit just makes the bug easier to reproduce, as I have finally been able to reproduce it (via a different path) after 50+ runs of migrate_pages03 on one of the affected machines, even with the commit reverted. 
[17890.870176] page:ffff7fe02563c780 count:0 mapcount:0 mapping:ffff800803ce6d58 index:0x1 [17890.879190] xfs_address_space_operations [xfs] [17890.879196] name:"ld-2.28.so" [17890.883724] flags: 0x17ffffc00000807(locked|referenced|uptodate|arch_1) [17890.893376] raw: 017ffffc00000807 ffff8094df8a7c40 ffff7fe02561a948 0000000000000000 [17890.901111] raw: 0000000000000001 0000000000000000 00000002ffffffff ffff80082039b080 [17890.908845] page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0) [17890.915710] page->mem_cgroup:ffff80082039b080 [17890.920065] page allocated via order 0, migratetype Movable, gfp_mask 0x62124a(GFP_NOFS|__GFP_HIGHMEM|__GFP_NOWARN|__GFP_NORETRY|__GFP_HARDWALL|__GFP_MOVABLE) [17890.934245] get_page_from_freelist+0x2d34/0x310c [17890.938943] __alloc_pages_nodemask+0x350/0x22d0 [17890.943559] alloc_pages_current+0x154/0x158 [17890.947821] __page_cache_alloc+0x274/0x27c [17890.952002] __do_page_cache_readahead+0x1e4/0x380 [17890.956785] ondemand_readahead+0x790/0x97c [17890.960961] page_cache_sync_readahead+0x2c8/0x2cc [17890.965743] generic_file_buffered_read+0x2b4/0x143c [17890.970700] generic_file_read_iter+0x298/0x2e4 [17890.975433] xfs_file_buffered_aio_read+0x5a0/0x5d0 [xfs] [17890.981034] xfs_file_read_iter+0x574/0x580 [xfs] [17890.985735] __vfs_read+0x478/0x4e8 [17890.989216] vfs_read+0xe4/0x1fc [17890.992436] kernel_read+0xa8/0x110 [17890.995923] load_elf_binary+0x92c/0x1b28 [17890.999932] search_binary_handler+0x138/0x4dc [17891.004368] page has been migrated, last migrate reason: syscall_or_cpuset [17891.011294] ------------[ cut here ]------------ [17891.015903] kernel BUG at include/linux/mm.h:546! 
[17891.020860] Internal error: Oops - BUG: 0 [#1] SMP [17891.025645] Modules linked in: thunderx2_pmu ip_tables xfs libcrc32c sd_mod ahci mlx5_core libahci libata dm_mirror dm_region_hash dm_log dm_mod efivarfs [17891.039390] CPU: 230 PID: 10606 Comm: bash Kdump: loaded Not tainted 5.0.0-rc3+ #3 [17891.046950] Hardware name: HPE Apollo 70 /C01_APACHE_MB , BIOS L50_5.13_1.0.6 07/10/2018 [17891.056767] pstate: 10400089 (nzcV daIf +PAN -UAO) [17891.061553] pc : release_pages+0x1e8/0xdbc [17891.065641] lr : release_pages+0x1e8/0xdbc [17891.069727] sp : ffff8095580574d0 [17891.073032] x29: ffff8095580574d0 x28: 1fffeffc04afb720 [17891.078336] x27: 0000000000000001 x26: ffff7fe025602848 [17891.083639] x25: 0000000000000000 x24: 0000000000000034 [17891.088942] x23: ffff8095580575a0 x22: ffff80977c3b8400 [17891.094245] x21: ffff7fe02563c7b4 x20: dfff200000000000 [17891.099548] x19: ffff7fe02563c780 x18: 0000000000000000 [17891.104851] x17: 0000000000000000 x16: 0000000000000000 [17891.110153] x15: 0000000000000000 x14: 4f4d5f5046475f5f [17891.115455] x13: 7c4c4c4157445241 x12: ffff0400026894e1 [17891.120758] x11: 1fffe400026894e1 x10: 5046475f5f7c4e52 [17891.126061] x9 : dfff200000000000 x8 : 737973203a6e6f73 [17891.131364] x7 : 0000000000000000 x6 : ffff20001021cc54 [17891.136666] x5 : 0000000000000000 x4 : 0000000000000000 [17891.141969] x3 : 0000000000000000 x2 : 29c8834f768b6d00 [17891.147275] x1 : 29c8834f768b6d00 x0 : 0000000000000000 [17891.152585] Process bash (pid: 10606, stack limit = 0x0000000036931683) [17891.159190] Call trace: [17891.161633] release_pages+0x1e8/0xdbc [17891.165379] free_pages_and_swap_cache+0x60/0x200 [17891.170081] tlb_flush_mmu_free+0xac/0xe4 [17891.174083] tlb_flush_mmu+0x22c/0x37c [17891.177824] arch_tlb_finish_mmu+0x158/0x260 [17891.182086] tlb_finish_mmu+0x8c/0xcc [17891.185741] exit_mmap+0x268/0x334 [17891.189139] mmput+0x118/0x2c8 [17891.192187] flush_old_exec+0x3a8/0x4fc [17891.196016] load_elf_binary+0x430/0x1b28 [17891.200019] 
search_binary_handler+0x138/0x4dc [17891.204454] load_script+0x45c/0x484 [17891.208022] search_binary_handler+0x138/0x4dc [17891.212459] __do_execve_file+0x1144/0x1808 [17891.216634] do_execve+0x40/0x50 [17891.219855] __arm64_sys_execve+0x8c/0xa0 [17891.223867] el0_svc_handler+0x258/0x304 [17891.227785] el0_svc+0x8/0xc [17891.230660] Code: 91168021 911d0021 aa1303e0 9401a40a (d4210000) ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: BUG() due to "mm: put_and_wait_on_page_locked() while page is migrated" 2019-01-25 4:31 ` Qian Cai @ 2019-01-25 8:51 ` Michal Hocko 2019-01-26 3:17 ` Qian Cai 0 siblings, 1 reply; 8+ messages in thread From: Michal Hocko @ 2019-01-25 8:51 UTC (permalink / raw) To: Qian Cai; +Cc: Hugh Dickins, Andrea Arcangeli, torvalds, vbabka, akpm, Linux-MM On Thu 24-01-19 23:31:46, Qian Cai wrote: [...] > It looks like the put_and_wait commit just make the bug easier to reproduce, as > it has finally been able to reproduce it (via a different path) after 50+ runs > of migrate_pages03 on one of the affected machines even with the commit reverted. OK, great. This makes it a little bit less of a head scratcher then. Could you confirm whether this is FS specific please? I will go and check the migration path. Maybe we are doing something wrong there, but it would definitely be good to know whether the underlying fs is really relevant. Thanks! -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: BUG() due to "mm: put_and_wait_on_page_locked() while page is migrated" 2019-01-25 8:51 ` Michal Hocko @ 2019-01-26 3:17 ` Qian Cai 2019-02-05 19:08 ` Hugh Dickins 0 siblings, 1 reply; 8+ messages in thread From: Qian Cai @ 2019-01-26 3:17 UTC (permalink / raw) To: Michal Hocko Cc: Hugh Dickins, Andrea Arcangeli, torvalds, vbabka, akpm, Linux-MM On 1/25/19 3:51 AM, Michal Hocko wrote: > On Thu 24-01-19 23:31:46, Qian Cai wrote: > [...] >> It looks like the put_and_wait commit just make the bug easier to reproduce, as >> it has finally been able to reproduce it (via a different path) after 50+ runs >> of migrate_pages03 on one of the affected machines even with the commit reverted. > > OK, great. This makes it a little bit less of a head scratcher then. > Could you confirm whether this is FS specific please? I will go and > check the migration path. Maybe we doing something wrong there but it > would be definitely good to know whether the underlying fs is really > relevant. Thanks! > So, I reinstalled everything using an ext4 rootfs, and now I can no longer reproduce it... ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: BUG() due to "mm: put_and_wait_on_page_locked() while page is migrated" 2019-01-26 3:17 ` Qian Cai @ 2019-02-05 19:08 ` Hugh Dickins 0 siblings, 0 replies; 8+ messages in thread From: Hugh Dickins @ 2019-02-05 19:08 UTC (permalink / raw) To: Qian Cai Cc: Artem Savkov, Baoquan He, Michal Hocko, Hugh Dickins, Andrea Arcangeli, torvalds, vbabka, akpm, Linux-MM On Fri, 25 Jan 2019, Qian Cai wrote: > On 1/25/19 3:51 AM, Michal Hocko wrote: > > On Thu 24-01-19 23:31:46, Qian Cai wrote: > > [...] > >> It looks like the put_and_wait commit just make the bug easier to reproduce, as > >> it has finally been able to reproduce it (via a different path) after 50+ runs > >> of migrate_pages03 on one of the affected machines even with the commit reverted. > > > > OK, great. This makes it a little bit less of a head scratcher then. > > Could you confirm whether this is FS specific please? I will go and > > check the migration path. Maybe we doing something wrong there but it > > would be definitely good to know whether the underlying fs is really > > relevant. Thanks! > > > > So, I reinstalled everything using an ext4 rootfs, and then it becomes > impossible to reproduce it anymore... Just to wrap up this thread: Artem Savkov has identified 5.0-rc5 commit 8e47a457321c "iomap: get/put the page in iomap_page_create/release()" as fixing this issue on xfs (iomap), and Cai verified it, in another thread: https://marc.info/?l=linux-kernel&m=154927160417473&w=2 ^ permalink raw reply [flat|nested] 8+ messages in thread
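The title of the fixing commit suggests a simple pairing rule: take a page reference when private data is attached and drop it when the data is released. The sketch below models that rule in userspace C. Everything in it is hypothetical (`struct page_model`, `attach_private()`, `detach_private()`, `migration_expected_count()`, `counts_agree()`), and the assumption that migration expects one extra reference for a page carrying private data is stated here for illustration rather than taken from this thread; the commit itself is the reference for the real change.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page. */
struct page_model {
    int refcount;
    void *private_data;
};

/* Attach private data; 'take_ref' models whether a get_page() is paired with it. */
static void attach_private(struct page_model *p, void *data, bool take_ref)
{
    if (take_ref)
        p->refcount++;
    p->private_data = data;
}

/* Release private data; 'drop_ref' models the matching put_page(). */
static void detach_private(struct page_model *p, bool drop_ref)
{
    p->private_data = NULL;
    if (drop_ref) {
        assert(p->refcount > 0);
        p->refcount--;
    }
}

/*
 * Assumption for this sketch: migration expects the ordinary holders plus
 * one extra reference whenever the page has private data attached.
 */
static int migration_expected_count(const struct page_model *p, int holders)
{
    return holders + (p->private_data != NULL ? 1 : 0);
}

/* Does the refcount match migration's expectation after attaching data? */
static bool counts_agree(bool paired)
{
    static int dummy_private;   /* placeholder payload */
    struct page_model page = { .refcount = 2, .private_data = NULL };
    int holders = 2;            /* two ordinary references already held */

    attach_private(&page, &dummy_private, paired);
    return page.refcount == migration_expected_count(&page, holders);
}

/* Attach followed by release should leave the refcount where it started. */
static bool roundtrip_is_balanced(void)
{
    static int dummy_private;
    struct page_model page = { .refcount = 2, .private_data = NULL };

    attach_private(&page, &dummy_private, true);
    detach_private(&page, true);
    return page.refcount == 2 && page.private_data == NULL;
}
```

Under these assumptions `counts_agree(true)` holds and `counts_agree(false)` does not: without the paired reference the count sits one below what migration expects, which is the kind of off-by-one the earlier messages describe as a preexisting imbalance.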