* BUG() due to "mm: put_and_wait_on_page_locked() while page is migrated"
@ 2019-01-23  4:29 Qian Cai
  2019-01-23  9:30 ` Michal Hocko
  0 siblings, 1 reply; 8+ messages in thread
From: Qian Cai @ 2019-01-23  4:29 UTC (permalink / raw)
  To: hughd, Andrea Arcangeli, Michal Hocko, torvalds, vbabka, akpm; +Cc: Linux-MM

Running LTP migrate_pages03 [1] a few times triggers the BUG() below on an arm64
ThunderX2 server. Reverting commit 9a1ea439b16b9 ("mm:
put_and_wait_on_page_locked() while page is migrated") allows it to run
continuously.

put_and_wait_on_page_locked
  wait_on_page_bit_common
    put_page
      put_page_testzero
        VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);

[1]
https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/migrate_pages/migrate_pages03.c

[ 1304.643587] page:ffff7fe0226ff000 count:2 mapcount:0 mapping:ffff8095c3406d58 index:0x7
[ 1304.652082] xfs_address_space_operations [xfs]
[ 1304.652104] name:"libc-2.28.so"
[ 1304.656653] flags: 0x7ffffc00000887(locked|waiters|referenced|uptodate|arch_1)
[ 1304.667134] raw: 007ffffc00000887 ffff7fe0227bac88 ffff7fe02261cd88 0000000000000000
[ 1304.674894] raw: 0000000000000007 0000000000000000 00000002ffffffff ffff80082039b080
[ 1304.682652] page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
[ 1304.689553] page->mem_cgroup:ffff80082039b080
[ 1304.693932] page allocated via order 0, migratetype Movable, gfp_mask 0x62124a(GFP_NOFS|__GFP_HIGHMEM|__GFP_NOWARN|__GFP_NORETRY|__GFP_HARDWALL|__GFP_MOVABLE)
[ 1304.708137]  get_page_from_freelist+0x2cec/0x30c0
[ 1304.712864]  __alloc_pages_nodemask+0x350/0x22d0
[ 1304.717504]  alloc_pages_current+0x154/0x158
[ 1304.721795]  __page_cache_alloc+0x274/0x27c
[ 1304.726001]  __do_page_cache_readahead+0x1e4/0x380
[ 1304.730812]  filemap_fault+0x540/0x1204
[ 1304.734882]  __xfs_filemap_fault+0x714/0x734 [xfs]
[ 1304.739893]  xfs_filemap_fault+0xe4/0xfc [xfs]
[ 1304.744440]  __do_fault+0x294/0x5dc
[ 1304.747950]  do_fault+0x324/0x1360
[ 1304.751370]  __handle_mm_fault+0x9a8/0xb90
[ 1304.755481]  handle_mm_fault+0x610/0x614
[ 1304.759423]  do_page_fault+0x530/0x818
[ 1304.763188]  do_translation_fault+0x88/0xe8
[ 1304.767388]  do_mem_abort+0x78/0x168
[ 1304.770979]  do_el0_ia_bp_hardening+0x7c/0x8c
[ 1304.775351] page has been migrated, last migrate reason: syscall_or_cpuset
[ 1304.782294] ------------[ cut here ]------------
[ 1304.786904] kernel BUG at include/linux/mm.h:546!
[ 1304.791728] Internal error: Oops - BUG: 0 [#1] SMP
[ 1304.796513] Modules linked in: thunderx2_pmu ip_tables xfs libcrc32c sd_mod ahci libahci mlx5_core libata dm_mirror dm_region_hash dm_log dm_mod efivarfs
[ 1304.810256] CPU: 248 PID: 10307 Comm: 0anacron Kdump: loaded Not tainted 5.0.0-rc3+ #1
[ 1304.818163] Hardware name: HPE Apollo 70             /C01_APACHE_MB         , BIOS L50_5.13_1.0.6 07/10/2018
[ 1304.827980] pstate: 10400009 (nzcV daif +PAN -UAO)
[ 1304.832764] pc : put_and_wait_on_page_locked+0x4c8/0x5f0
[ 1304.838067] lr : put_and_wait_on_page_locked+0x4c8/0x5f0
[ 1304.843369] sp : ffff8094fbf57960
[ 1304.846674] x29: ffff8094fbf57960 x28: ffff2000117f7f90
[ 1304.851978] x27: ffff2000117f7f88 x26: ffff2000117fd6c8
[ 1304.857281] x25: 0000000000000001 x24: ffff8094fbf579f0
[ 1304.862584] x23: 1ffff0129f7eaf3a x22: ffff2000117f7f50
[ 1304.867887] x21: dfff200000000000 x20: ffff7fe0226ff034
[ 1304.873190] x19: ffff7fe0226ff000 x18: 0000000000000000
[ 1304.878493] x17: 0000000000000000 x16: 0000000000000000
[ 1304.883795] x15: 0000000000000000 x14: 46475f5f7c4c4c41
[ 1304.889098] x13: 57445241485f5046 x12: ffff0400025bb4d9
[ 1304.894401] x11: 1fffe400025bb4d9 x10: 5f7c4e5241574f4e
[ 1304.899703] x9 : dfff200000000000 x8 : 6c6c616373797320
[ 1304.905006] x7 : 0000000000000000 x6 : ffff20001021b024
[ 1304.910309] x5 : 0000000000000000 x4 : 0000000000000000
[ 1304.915612] x3 : 0000000000000000 x2 : 29c8834f768b6d00
[ 1304.920914] x1 : 29c8834f768b6d00 x0 : 0000000000000000
[ 1304.926220] Process 0anacron (pid: 10307, stack limit = 0x00000000e3061c7a)
[ 1304.933172] Call trace:
[ 1304.935610]  put_and_wait_on_page_locked+0x4c8/0x5f0
[ 1304.940577]  __migration_entry_wait+0x238/0x260
[ 1304.945098]  migration_entry_wait+0xfc/0x110
[ 1304.949361]  do_swap_page+0x1b0/0x198c
[ 1304.953101]  __handle_mm_fault+0x9a0/0xb90
[ 1304.957188]  handle_mm_fault+0x610/0x614
[ 1304.961102]  do_page_fault+0x530/0x818
[ 1304.964842]  do_translation_fault+0x88/0xe8
[ 1304.969016]  do_mem_abort+0x78/0x168
[ 1304.972583]  do_el0_ia_bp_hardening+0x7c/0x8c
[ 1304.976931]  el0_ia+0x1c/0x20
[ 1304.979893] Code: 91298021 91128021 aa1303e0 940282fb (d4210000)

* Re: BUG() due to "mm: put_and_wait_on_page_locked() while page is migrated"
  2019-01-23  4:29 BUG() due to "mm: put_and_wait_on_page_locked() while page is migrated" Qian Cai
@ 2019-01-23  9:30 ` Michal Hocko
  2019-01-25  4:19     ` Hugh Dickins
  0 siblings, 1 reply; 8+ messages in thread
From: Michal Hocko @ 2019-01-23  9:30 UTC (permalink / raw)
  To: Qian Cai; +Cc: hughd, Andrea Arcangeli, torvalds, vbabka, akpm, Linux-MM

On Tue 22-01-19 23:29:04, Qian Cai wrote:
> Running LTP migrate_pages03 [1] a few times triggers the BUG() below on an arm64
> ThunderX2 server. Reverting commit 9a1ea439b16b9 ("mm:
> put_and_wait_on_page_locked() while page is migrated") allows it to run
> continuously.
> 
> put_and_wait_on_page_locked
>   wait_on_page_bit_common
>     put_page
>       put_page_testzero
>         VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
> 
> [1]
> https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/migrate_pages/migrate_pages03.c
> 
> [ 1304.643587] page:ffff7fe0226ff000 count:2 mapcount:0 mapping:ffff8095c3406d58 index:0x7
> [ 1304.652082] xfs_address_space_operations [xfs]
[...]
> [ 1304.682652] page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)

This looks like a page reference count imbalance to me. The page seems
to have been freed when the migration code (wait_on_page_bit_common) called
put_page, and was immediately reused for an xfs allocation, which is why
we see its ref count == 2. But I fail to see how that is possible, as
__migration_entry_wait already does get_page_unless_zero, so the
imbalance must have been preexisting.
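
To make that argument concrete, here is a minimal user-space sketch of the
two helpers involved. It is a single-threaded model, not kernel code:
struct model_page and the model_* names are made up for illustration, the
real helpers operate atomically on struct page, and the -1 return value
merely stands in for the VM_BUG_ON_PAGE() inside put_page_testzero().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page: only the reference count is modelled. */
struct model_page {
	int refcount;
};

/* Take a reference unless the count is already 0 (page free or frozen). */
static bool model_get_page_unless_zero(struct model_page *page)
{
	/* A negative count would mean the model itself is broken. */
	assert(page->refcount >= 0);

	if (page->refcount == 0)
		return false;
	page->refcount++;
	return true;
}

/*
 * Drop a reference. Returns 1 if it was the last one, 0 if references
 * remain, and -1 where the kernel would hit
 * VM_BUG_ON_PAGE(page_ref_count(page) == 0, page).
 */
static int model_put_page_testzero(struct model_page *page)
{
	if (page->refcount == 0)
		return -1;
	return --page->refcount == 0;
}
```

In this model a get that succeeds is always followed by a put that sees a
count of at least 1, so the pair cannot trip the check by itself. Something
else has to bring the count to 0 in between.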
-- 
Michal Hocko
SUSE Labs

* Re: BUG() due to "mm: put_and_wait_on_page_locked() while page is migrated"
@ 2019-01-25  4:19     ` Hugh Dickins
  0 siblings, 0 replies; 8+ messages in thread
From: Hugh Dickins @ 2019-01-25  4:19 UTC (permalink / raw)
  To: Qian Cai
  Cc: Michal Hocko, hughd, Andrea Arcangeli, torvalds, vbabka, akpm, Linux-MM

On Wed, 23 Jan 2019, Michal Hocko wrote:
> On Tue 22-01-19 23:29:04, Qian Cai wrote:
> > Running LTP migrate_pages03 [1] a few times triggers the BUG() below on an arm64
> > ThunderX2 server. Reverting commit 9a1ea439b16b9 ("mm:
> > put_and_wait_on_page_locked() while page is migrated") allows it to run
> > continuously.
> > 
> > put_and_wait_on_page_locked
> >   wait_on_page_bit_common
> >     put_page
> >       put_page_testzero
> >         VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
> > 
> > [1]
> > https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/migrate_pages/migrate_pages03.c
> > 
> > [ 1304.643587] page:ffff7fe0226ff000 count:2 mapcount:0 mapping:ffff8095c3406d58 index:0x7
> > [ 1304.652082] xfs_address_space_operations [xfs]
> [...]
> > [ 1304.682652] page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
> 
> This looks like a page reference count imbalance to me. The page seems
> to have been freed when the migration code (wait_on_page_bit_common) called
> put_page, and was immediately reused for an xfs allocation, which is why
> we see its ref count == 2. But I fail to see how that is possible, as
> __migration_entry_wait already does get_page_unless_zero, so the
> imbalance must have been preexisting.

This report worried me, but I've thought around it, and agree with
Michal that it must be reflecting a preexisting refcount imbalance -
preexisting in the sense that the imbalance occurred sometime before
reaching put_and_wait_on_page_locked(), and in the sense that the bug
causing the imbalance came in before the put_and_wait_on_page_locked()
commit, perhaps even long ago.

If it is a software bug at all - I wonder if any other hardware shows
the same issue - I have not seen it on x86 (though I wasn't using xfs),
nor heard of anyone else reporting it - but thank you for doing so,
it could be important.

But I (probably) disagree with Michal about the page being freed and
reused for xfs allocation. I have no proof, but I think the likelihood
is that the page shown is the old xfs page (from libc-2.28.so, I see)
which is currently being migrated.

I realize that "last migrate reason: syscall_or_cpuset" would not get 
set until later, but I think it's left over from the previous migration:
migrate_pages03 looks like it's migrating pages back and forth repeatedly.

What I think happened is that something at some time earlier did a
mistaken put_page() on the page.  Then __migration_entry_wait() raced
with migrate_page_move_mapping(), in such a way that get_page_unless_zero()
then briefly raised the page's refcount to expected_count, so migration was
able to freeze the page (set its refcount transiently to 0).  Then
put_and_wait_on_page_locked() reached the put_page() in wait_on_page_bit_common()
while migration still had the refcount frozen at 0, and bang, your crash.
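
That interleaving can be replayed in a small user-space sketch. Everything
in it is an illustration under stated assumptions: the model_* helpers and
replay_race() are hypothetical names, the steps run sequentially rather
than concurrently, and -1 stands in for the VM_BUG_ON_PAGE() firing.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page: only the reference count is modelled. */
struct model_page {
	int refcount;
};

/* Waiter side: take a reference unless the count is 0. */
static bool model_get_page_unless_zero(struct model_page *page)
{
	if (page->refcount == 0)
		return false;
	page->refcount++;
	return true;
}

/* Migration side: freeze (set to 0) only if the count is exactly as expected. */
static bool model_page_ref_freeze(struct model_page *page, int expected_count)
{
	if (page->refcount != expected_count)
		return false;
	page->refcount = 0;
	return true;
}

/* Returns 1 for the last reference, 0 otherwise, -1 where the BUG would fire. */
static int model_put_page_testzero(struct model_page *page)
{
	if (page->refcount == 0)
		return -1;
	return --page->refcount == 0;
}

/*
 * Replay the suspected interleaving on a page that has already lost
 * missing_refs references to an earlier, unrelated put_page().
 */
static int replay_race(int expected_count, int missing_refs)
{
	struct model_page page = { .refcount = expected_count - missing_refs };

	assert(missing_refs >= 0 && expected_count > missing_refs);

	/* 1. __migration_entry_wait() takes its reference. */
	if (!model_get_page_unless_zero(&page))
		return 0;

	/* 2. migrate_page_move_mapping() tries to freeze; it may or may not succeed. */
	(void)model_page_ref_freeze(&page, expected_count);

	/* 3. put_and_wait_on_page_locked() drops the waiter's reference. */
	return model_put_page_testzero(&page);
}
```

With a correct count the waiter's extra reference makes the freeze fail and
the put is harmless; with one reference missing, the waiter's reference is
exactly what lets the freeze succeed, and the put then finds 0.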

But how come reverting the put_and_wait commit appears to fix it for you?
That puzzled me, for a while I expected you then to see an equally visible
crash in the old put_page() after wait_on_page_locked(), or else at the
migration end where it puts the page afterwards (putback_lru_page perhaps).

I guess the answer comes from that "libc-2.28.so".  This page is one of
those very popular pages which were next-to-impossible to migrate before
the put_and_wait commit, because they are so widely mapped, and their
migration entries so frequently faulted, that migration could not freeze
them.  (With enough migration waiters to outweigh the off-by-one of the
incorrect refcount.)
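
The arithmetic behind that guess can be written down as a tiny sketch. The
function name and all of its parameters are made up for illustration, and
the model only counts references; it says nothing about timing.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Would migration's freeze succeed? All parameters are illustrative:
 *   expected_count    - the count migration expects to see
 *   missing_refs      - references lost to an earlier bogus put_page()
 *   sleeping          - waiters asleep on the page lock
 *   in_window         - waiters between get_page_unless_zero() and put_page()
 *   drop_before_sleep - true models put_and_wait_on_page_locked(), false
 *                       models the old wait_on_page_locked(); put_page();
 */
static bool model_freeze_succeeds(int expected_count, int missing_refs,
				  int sleeping, int in_window,
				  bool drop_before_sleep)
{
	int count;

	assert(expected_count > 0 && missing_refs >= 0);
	assert(sleeping >= 0 && in_window >= 0);

	count = expected_count - missing_refs + in_window;
	if (!drop_before_sleep)
		count += sleeping;	/* old code: every sleeper pins the page */

	return count == expected_count;
}
```

With the old behaviour and, say, 8 sleepers, the count is far above what
migration expects whether or not a reference is missing, so the page is
never frozen. With the new behaviour the sleepers no longer count, and a
page that is one reference short freezes exactly when one waiter is caught
between its get and its put.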

Being so widely used, the refcount imbalance on that page would (I think)
only show up when unmounting the root at shutdown: easily missed.

So I think you've identified that the put_and_wait commit has exposed
an existing bug, and it may be very tedious to track down where that is.
Maybe the bug is itself triggered by migrate_pages03, but quite likely not.

Hugh

* Re: BUG() due to "mm: put_and_wait_on_page_locked() while page is migrated"
  2019-01-25  4:19     ` Hugh Dickins
  (?)
@ 2019-01-25  4:31     ` Qian Cai
  2019-01-25  8:51       ` Michal Hocko
  -1 siblings, 1 reply; 8+ messages in thread
From: Qian Cai @ 2019-01-25  4:31 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Michal Hocko, Andrea Arcangeli, torvalds, vbabka, akpm, Linux-MM



On 1/24/19 11:19 PM, Hugh Dickins wrote:
> On Wed, 23 Jan 2019, Michal Hocko wrote:
>> On Tue 22-01-19 23:29:04, Qian Cai wrote:
>>> Running LTP migrate_pages03 [1] a few times triggers the BUG() below on an arm64
>>> ThunderX2 server. Reverting commit 9a1ea439b16b9 ("mm:
>>> put_and_wait_on_page_locked() while page is migrated") allows it to run
>>> continuously.
>>>
>>> put_and_wait_on_page_locked
>>>   wait_on_page_bit_common
>>>     put_page
>>>       put_page_testzero
>>>         VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
>>>
>>> [1]
>>> https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/migrate_pages/migrate_pages03.c
>>>
>>> [ 1304.643587] page:ffff7fe0226ff000 count:2 mapcount:0 mapping:ffff8095c3406d58 index:0x7
>>> [ 1304.652082] xfs_address_space_operations [xfs]
>> [...]
>>> [ 1304.682652] page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
>>
>> This looks like a page reference count imbalance to me. The page seems
>> to have been freed when the migration code (wait_on_page_bit_common) called
>> put_page, and was immediately reused for an xfs allocation, which is why
>> we see its ref count == 2. But I fail to see how that is possible, as
>> __migration_entry_wait already does get_page_unless_zero, so the
>> imbalance must have been preexisting.
> 
> This report worried me, but I've thought around it, and agree with
> Michal that it must be reflecting a preexisting refcount imbalance -
> preexisting in the sense that the imbalance occurred sometime before
> reaching put_and_wait_on_page_locked(), and in the sense that the bug
> causing the imbalance came in before the put_and_wait_on_page_locked()
> commit, perhaps even long ago.
> 
> If it is a software bug at all - I wonder if any other hardware shows
> the same issue - I have not seen it on x86 (though I wasn't using xfs),
> nor heard of anyone else reporting it - but thank you for doing so,
> it could be important.
> 
> But I (probably) disagree with Michal about the page being freed and
> reused for xfs allocation. I have no proof, but I think the likelihood
> is that the page shown is the old xfs page (from libc-2.28.so, I see)
> which is currently being migrated.
> 
> I realize that "last migrate reason: syscall_or_cpuset" would not get 
> set until later, but I think it's left over from the previous migration:
> migrate_pages03 looks like it's migrating pages back and forth repeatedly.
> 
> What I think happened is that something at some time earlier did a
> mistaken put_page() on the page.  Then __migration_entry_wait() raced
> with migrate_page_move_mapping(), in such a way that get_page_unless_zero()
> then briefly raised the page's refcount to expected_count, so migration was
> able to freeze the page (set its refcount transiently to 0).  Then
> put_and_wait_on_page_locked() reached the put_page() in wait_on_page_bit_common()
> while migration still had the refcount frozen at 0, and bang, your crash.
> 
> But how come reverting the put_and_wait commit appears to fix it for you?
> That puzzled me, for a while I expected you then to see an equally visible
> crash in the old put_page() after wait_on_page_locked(), or else at the
> migration end where it puts the page afterwards (putback_lru_page perhaps).
> 
> I guess the answer comes from that "libc-2.28.so".  This page is one of
> those very popular pages which were next-to-impossible to migrate before
> the put_and_wait commit, because they are so widely mapped, and their
> migration entries so frequently faulted, that migration could not freeze
> them.  (With enough migration waiters to outweigh the off-by-one of the
> incorrect refcount.)
> 
> Being so widely used, the refcount imbalance on that page would (I think)
> only show up when unmounting the root at shutdown: easily missed.
> 
> So I think you've identified that the put_and_wait commit has exposed
> an existing bug, and it may be very tedious to track down where that is.
> Maybe the bug is itself triggered by migrate_pages03, but quite likely not.

It looks like the put_and_wait commit just makes the bug easier to reproduce, as
I have finally been able to reproduce it (via a different path) after 50+ runs
of migrate_pages03 on one of the affected machines even with the commit reverted.

[17890.870176] page:ffff7fe02563c780 count:0 mapcount:0 mapping:ffff800803ce6d58 index:0x1
[17890.879190] xfs_address_space_operations [xfs]
[17890.879196] name:"ld-2.28.so"
[17890.883724] flags: 0x17ffffc00000807(locked|referenced|uptodate|arch_1)
[17890.893376] raw: 017ffffc00000807 ffff8094df8a7c40 ffff7fe02561a948 0000000000000000
[17890.901111] raw: 0000000000000001 0000000000000000 00000002ffffffff ffff80082039b080
[17890.908845] page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
[17890.915710] page->mem_cgroup:ffff80082039b080
[17890.920065] page allocated via order 0, migratetype Movable, gfp_mask 0x62124a(GFP_NOFS|__GFP_HIGHMEM|__GFP_NOWARN|__GFP_NORETRY|__GFP_HARDWALL|__GFP_MOVABLE)
[17890.934245]  get_page_from_freelist+0x2d34/0x310c
[17890.938943]  __alloc_pages_nodemask+0x350/0x22d0
[17890.943559]  alloc_pages_current+0x154/0x158
[17890.947821]  __page_cache_alloc+0x274/0x27c
[17890.952002]  __do_page_cache_readahead+0x1e4/0x380
[17890.956785]  ondemand_readahead+0x790/0x97c
[17890.960961]  page_cache_sync_readahead+0x2c8/0x2cc
[17890.965743]  generic_file_buffered_read+0x2b4/0x143c
[17890.970700]  generic_file_read_iter+0x298/0x2e4
[17890.975433]  xfs_file_buffered_aio_read+0x5a0/0x5d0 [xfs]
[17890.981034]  xfs_file_read_iter+0x574/0x580 [xfs]
[17890.985735]  __vfs_read+0x478/0x4e8
[17890.989216]  vfs_read+0xe4/0x1fc
[17890.992436]  kernel_read+0xa8/0x110
[17890.995923]  load_elf_binary+0x92c/0x1b28
[17890.999932]  search_binary_handler+0x138/0x4dc
[17891.004368] page has been migrated, last migrate reason: syscall_or_cpuset
[17891.011294] ------------[ cut here ]------------
[17891.015903] kernel BUG at include/linux/mm.h:546!
[17891.020860] Internal error: Oops - BUG: 0 [#1] SMP
[17891.025645] Modules linked in: thunderx2_pmu ip_tables xfs libcrc32c sd_mod ahci mlx5_core libahci libata dm_mirror dm_region_hash dm_log dm_mod efivarfs
[17891.039390] CPU: 230 PID: 10606 Comm: bash Kdump: loaded Not tainted 5.0.0-rc3+ #3
[17891.046950] Hardware name: HPE Apollo 70             /C01_APACHE_MB         , BIOS L50_5.13_1.0.6 07/10/2018
[17891.056767] pstate: 10400089 (nzcV daIf +PAN -UAO)
[17891.061553] pc : release_pages+0x1e8/0xdbc
[17891.065641] lr : release_pages+0x1e8/0xdbc
[17891.069727] sp : ffff8095580574d0
[17891.073032] x29: ffff8095580574d0 x28: 1fffeffc04afb720
[17891.078336] x27: 0000000000000001 x26: ffff7fe025602848
[17891.083639] x25: 0000000000000000 x24: 0000000000000034
[17891.088942] x23: ffff8095580575a0 x22: ffff80977c3b8400
[17891.094245] x21: ffff7fe02563c7b4 x20: dfff200000000000
[17891.099548] x19: ffff7fe02563c780 x18: 0000000000000000
[17891.104851] x17: 0000000000000000 x16: 0000000000000000
[17891.110153] x15: 0000000000000000 x14: 4f4d5f5046475f5f
[17891.115455] x13: 7c4c4c4157445241 x12: ffff0400026894e1
[17891.120758] x11: 1fffe400026894e1 x10: 5046475f5f7c4e52
[17891.126061] x9 : dfff200000000000 x8 : 737973203a6e6f73
[17891.131364] x7 : 0000000000000000 x6 : ffff20001021cc54
[17891.136666] x5 : 0000000000000000 x4 : 0000000000000000
[17891.141969] x3 : 0000000000000000 x2 : 29c8834f768b6d00
[17891.147275] x1 : 29c8834f768b6d00 x0 : 0000000000000000
[17891.152585] Process bash (pid: 10606, stack limit = 0x0000000036931683)
[17891.159190] Call trace:
[17891.161633]  release_pages+0x1e8/0xdbc
[17891.165379]  free_pages_and_swap_cache+0x60/0x200
[17891.170081]  tlb_flush_mmu_free+0xac/0xe4
[17891.174083]  tlb_flush_mmu+0x22c/0x37c
[17891.177824]  arch_tlb_finish_mmu+0x158/0x260
[17891.182086]  tlb_finish_mmu+0x8c/0xcc
[17891.185741]  exit_mmap+0x268/0x334
[17891.189139]  mmput+0x118/0x2c8
[17891.192187]  flush_old_exec+0x3a8/0x4fc
[17891.196016]  load_elf_binary+0x430/0x1b28
[17891.200019]  search_binary_handler+0x138/0x4dc
[17891.204454]  load_script+0x45c/0x484
[17891.208022]  search_binary_handler+0x138/0x4dc
[17891.212459]  __do_execve_file+0x1144/0x1808
[17891.216634]  do_execve+0x40/0x50
[17891.219855]  __arm64_sys_execve+0x8c/0xa0
[17891.223867]  el0_svc_handler+0x258/0x304
[17891.227785]  el0_svc+0x8/0xc
[17891.230660] Code: 91168021 911d0021 aa1303e0 9401a40a (d4210000)

* Re: BUG() due to "mm: put_and_wait_on_page_locked() while page is migrated"
  2019-01-25  4:31     ` Qian Cai
@ 2019-01-25  8:51       ` Michal Hocko
  2019-01-26  3:17         ` Qian Cai
  0 siblings, 1 reply; 8+ messages in thread
From: Michal Hocko @ 2019-01-25  8:51 UTC (permalink / raw)
  To: Qian Cai; +Cc: Hugh Dickins, Andrea Arcangeli, torvalds, vbabka, akpm, Linux-MM

On Thu 24-01-19 23:31:46, Qian Cai wrote:
[...]
> It looks like the put_and_wait commit just makes the bug easier to reproduce, as
> I have finally been able to reproduce it (via a different path) after 50+ runs
> of migrate_pages03 on one of the affected machines even with the commit reverted.

OK, great. This makes it a little bit less of a head scratcher then.
Could you confirm whether this is FS specific please? I will go and
check the migration path. Maybe we are doing something wrong there, but it
would definitely be good to know whether the underlying fs is really
relevant. Thanks!
-- 
Michal Hocko
SUSE Labs

* Re: BUG() due to "mm: put_and_wait_on_page_locked() while page is migrated"
  2019-01-25  8:51       ` Michal Hocko
@ 2019-01-26  3:17         ` Qian Cai
  2019-02-05 19:08           ` Hugh Dickins
  0 siblings, 1 reply; 8+ messages in thread
From: Qian Cai @ 2019-01-26  3:17 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hugh Dickins, Andrea Arcangeli, torvalds, vbabka, akpm, Linux-MM



On 1/25/19 3:51 AM, Michal Hocko wrote:
> On Thu 24-01-19 23:31:46, Qian Cai wrote:
> [...]
>> It looks like the put_and_wait commit just makes the bug easier to reproduce, as
>> I have finally been able to reproduce it (via a different path) after 50+ runs
>> of migrate_pages03 on one of the affected machines even with the commit reverted.
> 
> OK, great. This makes it a little bit less of a head scratcher then.
> Could you confirm whether this is FS specific please? I will go and
> check the migration path. Maybe we are doing something wrong there, but it
> would definitely be good to know whether the underlying fs is really
> relevant. Thanks!
> 

So, I reinstalled everything using an ext4 rootfs, and with that it has
become impossible to reproduce...

* Re: BUG() due to "mm: put_and_wait_on_page_locked() while page is migrated"
  2019-01-26  3:17         ` Qian Cai
@ 2019-02-05 19:08           ` Hugh Dickins
  0 siblings, 0 replies; 8+ messages in thread
From: Hugh Dickins @ 2019-02-05 19:08 UTC (permalink / raw)
  To: Qian Cai
  Cc: Artem Savkov, Baoquan He, Michal Hocko, Hugh Dickins,
	Andrea Arcangeli, torvalds, vbabka, akpm, Linux-MM

On Fri, 25 Jan 2019, Qian Cai wrote:
> On 1/25/19 3:51 AM, Michal Hocko wrote:
> > On Thu 24-01-19 23:31:46, Qian Cai wrote:
> > [...]
> >> It looks like the put_and_wait commit just makes the bug easier to reproduce, as
> >> I have finally been able to reproduce it (via a different path) after 50+ runs
> >> of migrate_pages03 on one of the affected machines even with the commit reverted.
> > 
> > OK, great. This makes it a little bit less of a head scratcher then.
> > Could you confirm whether this is FS specific please? I will go and
> > check the migration path. Maybe we are doing something wrong there, but it
> > would definitely be good to know whether the underlying fs is really
> > relevant. Thanks!
> > 
> 
> So, I reinstalled everything using an ext4 rootfs, and with that it has
> become impossible to reproduce...

Just to wrap up this thread: Artem Savkov has identified 5.0-rc5 commit
8e47a457321c "iomap: get/put the page in iomap_page_create/release()"
as fixing this issue on xfs (iomap), and Cai has verified that, in another thread:
https://marc.info/?l=linux-kernel&m=154927160417473&w=2
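
For anyone reading this later, a rough user-space sketch of why a missing
get/put around private data would look like an off-by-one to migration. It
rests on an assumption not spelled out in this thread: that migration
expects one reference from whoever isolated the page, one from the page
cache, and one more when the page has private data attached. The struct
and the model_* names are hypothetical, not the iomap or migrate API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page. */
struct model_page {
	int refcount;
	bool has_mapping;
	bool has_private;
};

/*
 * Assumed shape of migration's expectation: one reference for the caller,
 * one for the page cache, one for private data.
 */
static int model_expected_count(const struct model_page *page)
{
	return 1 + (page->has_mapping ? 1 : 0) + (page->has_private ? 1 : 0);
}

/* Attach private data; take_ref selects the behaviour with or without a get. */
static void model_attach_private(struct model_page *page, bool take_ref)
{
	assert(!page->has_private);
	page->has_private = true;
	if (take_ref)
		page->refcount++;
}

/* Detach private data; drop_ref must mirror what attach did. */
static void model_detach_private(struct model_page *page, bool drop_ref)
{
	assert(page->has_private);
	page->has_private = false;
	if (drop_ref)
		page->refcount--;
}

/* How many references the page is short of the expectation (0 is balanced). */
static int model_shortfall(const struct model_page *page)
{
	return model_expected_count(page) - page->refcount;
}
```

Under that assumption, attaching private data without taking a reference
leaves the page one short, which matches the kind of imbalance guessed at
earlier in the thread; taking and dropping the reference symmetrically
keeps the shortfall at 0.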


end of thread, other threads:[~2019-02-05 19:08 UTC | newest]

Thread overview: 8+ messages
2019-01-23  4:29 BUG() due to "mm: put_and_wait_on_page_locked() while page is migrated" Qian Cai
2019-01-23  9:30 ` Michal Hocko
2019-01-25  4:19   ` Hugh Dickins
2019-01-25  4:19     ` Hugh Dickins
2019-01-25  4:31     ` Qian Cai
2019-01-25  8:51       ` Michal Hocko
2019-01-26  3:17         ` Qian Cai
2019-02-05 19:08           ` Hugh Dickins
