* BUG: Bad page state in process dirtyc0w_child @ 2020-09-13 1:54 ` Qian Cai 0 siblings, 0 replies; 17+ messages in thread From: Qian Cai @ 2020-09-13 1:54 UTC (permalink / raw) To: Heiko Carstens, Vasily Gorbik, Christian Borntraeger Cc: linux-s390, linux-mm, linux-kernel Occasionally, running this LTP test will trigger an error below on s390: https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/security/dirtyc0w/dirtyc0w.c https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/security/dirtyc0w/dirtyc0w_child.c this .config: https://gitlab.com/cailca/linux-mm/-/blob/master/s390.config [ 6970.253173] LTP: starting dirtyc0w [ 6971.599102] BUG: Bad page state in process dirtyc0w_child pfn:8865d [ 6971.599867] page:000000001a8328d7 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x8865d [ 6971.599876] flags: 0x400000000008000e(referenced|uptodate|dirty|swapbacked) [ 6971.599886] raw: 400000000008000e 0000000000000100 0000000000000122 0000000000000000 [ 6971.599893] raw: 0000000000000000 0000000000000000 ffffffff00000000 0000000000000000 [ 6971.599900] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set [ 6971.599906] Modules linked in: loop kvm ip_tables x_tables dm_mirror dm_region_hash dm_log dm_mod [last unloaded: dummy_del_mod] [ 6971.599952] CPU: 1 PID: 65238 Comm: dirtyc0w_child Tainted: G O 5.9.0-rc4-next-20200909 #1 [ 6971.599959] Hardware name: IBM 2964 N96 400 (z/VM 6.4.0) [ 6971.599964] Call Trace: [ 6971.599979] [<0000000073aec038>] show_stack+0x158/0x1f0 [ 6971.599986] [<0000000073af724a>] dump_stack+0x1f2/0x238 [ 6971.599994] [<0000000072ed086a>] bad_page+0x1ba/0x1c0 [ 6971.600000] [<0000000072ed20c4>] free_pcp_prepare+0x4fc/0x658 [ 6971.600006] [<0000000072ed96a6>] free_unref_page+0xae/0x158 [ 6971.600013] [<0000000072e8286a>] unmap_page_range+0xb62/0x1df8 [ 6971.600019] [<0000000072e83bbc>] unmap_single_vma+0xbc/0x1c8 [ 6971.600025] [<0000000072e8418e>] zap_page_range+0x176/0x230 [ 6971.600033] [<0000000072eece8e>] do_madvise+0xfde/0x1270 [ 6971.600039] [<0000000072eed50a>] __s390x_sys_madvise+0x72/0x98 [ 6971.600047] [<0000000073b1cce4>] system_call+0xdc/0x278 [ 6971.600053] 2 locks held by dirtyc0w_child/65238: [ 6971.600058] #0: 000000013442fa18 (&mm->mmap_lock){++++}-{3:3}, at: do_madvise+0x17a/0x1270 [ 6971.600432] #1: 00000001343f9060 (ptlock_ptr(page)#2){+.+.}-{2:2}, at: unmap_page_range+0x640/0x1df8 [ 6971.600487] Disabling lock debugging due to kernel taint Once it happens, running it again will trigger in on another PFN. [39717.085115] BUG: Bad page state in process dirtyc0w_child pfn:af065 Any thoughts? ^ permalink raw reply [flat|nested] 17+ messages in thread
* BUG: Bad page state in process dirtyc0w_child @ 2020-09-13 1:54 ` Qian Cai 0 siblings, 0 replies; 17+ messages in thread From: Qian Cai @ 2020-09-13 1:54 UTC (permalink / raw) To: Heiko Carstens, Vasily Gorbik, Christian Borntraeger Cc: linux-s390, linux-mm, linux-kernel Occasionally, running this LTP test will trigger an error below on s390: https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/security/dirtyc0w/dirtyc0w.c https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/security/dirtyc0w/dirtyc0w_child.c this .config: https://gitlab.com/cailca/linux-mm/-/blob/master/s390.config [ 6970.253173] LTP: starting dirtyc0w [ 6971.599102] BUG: Bad page state in process dirtyc0w_child pfn:8865d [ 6971.599867] page:000000001a8328d7 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x8865d [ 6971.599876] flags: 0x400000000008000e(referenced|uptodate|dirty|swapbacked) [ 6971.599886] raw: 400000000008000e 0000000000000100 0000000000000122 0000000000000000 [ 6971.599893] raw: 0000000000000000 0000000000000000 ffffffff00000000 0000000000000000 [ 6971.599900] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set [ 6971.599906] Modules linked in: loop kvm ip_tables x_tables dm_mirror dm_region_hash dm_log dm_mod [last unloaded: dummy_del_mod] [ 6971.599952] CPU: 1 PID: 65238 Comm: dirtyc0w_child Tainted: G O 5.9.0-rc4-next-20200909 #1 [ 6971.599959] Hardware name: IBM 2964 N96 400 (z/VM 6.4.0) [ 6971.599964] Call Trace: [ 6971.599979] [<0000000073aec038>] show_stack+0x158/0x1f0 [ 6971.599986] [<0000000073af724a>] dump_stack+0x1f2/0x238 [ 6971.599994] [<0000000072ed086a>] bad_page+0x1ba/0x1c0 [ 6971.600000] [<0000000072ed20c4>] free_pcp_prepare+0x4fc/0x658 [ 6971.600006] [<0000000072ed96a6>] free_unref_page+0xae/0x158 [ 6971.600013] [<0000000072e8286a>] unmap_page_range+0xb62/0x1df8 [ 6971.600019] [<0000000072e83bbc>] unmap_single_vma+0xbc/0x1c8 [ 6971.600025] [<0000000072e8418e>] zap_page_range+0x176/0x230 [ 6971.600033] [<0000000072eece8e>] do_madvise+0xfde/0x1270 [ 6971.600039] [<0000000072eed50a>] __s390x_sys_madvise+0x72/0x98 [ 6971.600047] [<0000000073b1cce4>] system_call+0xdc/0x278 [ 6971.600053] 2 locks held by dirtyc0w_child/65238: [ 6971.600058] #0: 000000013442fa18 (&mm->mmap_lock){++++}-{3:3}, at: do_madvise+0x17a/0x1270 [ 6971.600432] #1: 00000001343f9060 (ptlock_ptr(page)#2){+.+.}-{2:2}, at: unmap_page_range+0x640/0x1df8 [ 6971.600487] Disabling lock debugging due to kernel taint Once it happens, running it again will trigger in on another PFN. [39717.085115] BUG: Bad page state in process dirtyc0w_child pfn:af065 Any thoughts? ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: BUG: Bad page state in process dirtyc0w_child 2020-09-13 1:54 ` Qian Cai (?) @ 2020-09-16 14:28 ` Heiko Carstens 2020-09-22 17:03 ` Gerald Schaefer -1 siblings, 1 reply; 17+ messages in thread From: Heiko Carstens @ 2020-09-16 14:28 UTC (permalink / raw) To: Qian Cai, Alexander Gordeev, Gerald Schaefer Cc: Vasily Gorbik, Christian Borntraeger, linux-s390, linux-mm, linux-kernel On Sat, Sep 12, 2020 at 09:54:12PM -0400, Qian Cai wrote: > Occasionally, running this LTP test will trigger an error below on > s390: > https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/security/dirtyc0w/dirtyc0w.c > https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/security/dirtyc0w/dirtyc0w_child.c > > this .config: > https://gitlab.com/cailca/linux-mm/-/blob/master/s390.config > > [ 6970.253173] LTP: starting dirtyc0w > [ 6971.599102] BUG: Bad page state in process dirtyc0w_child pfn:8865d > [ 6971.599867] page:000000001a8328d7 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x8865d > [ 6971.599876] flags: 0x400000000008000e(referenced|uptodate|dirty|swapbacked) > [ 6971.599886] raw: 400000000008000e 0000000000000100 0000000000000122 0000000000000000 > [ 6971.599893] raw: 0000000000000000 0000000000000000 ffffffff00000000 0000000000000000 > [ 6971.599900] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set > [ 6971.599906] Modules linked in: loop kvm ip_tables x_tables dm_mirror dm_region_hash dm_log dm_mod [last unloaded: dummy_del_mod] > [ 6971.599952] CPU: 1 PID: 65238 Comm: dirtyc0w_child Tainted: G O 5.9.0-rc4-next-20200909 #1 > [ 6971.599959] Hardware name: IBM 2964 N96 400 (z/VM 6.4.0) > [ 6971.599964] Call Trace: > [ 6971.599979] [<0000000073aec038>] show_stack+0x158/0x1f0 > [ 6971.599986] [<0000000073af724a>] dump_stack+0x1f2/0x238 > [ 6971.599994] [<0000000072ed086a>] bad_page+0x1ba/0x1c0 > [ 6971.600000] [<0000000072ed20c4>] free_pcp_prepare+0x4fc/0x658 > [ 6971.600006] [<0000000072ed96a6>] free_unref_page+0xae/0x158 > [ 6971.600013] [<0000000072e8286a>] unmap_page_range+0xb62/0x1df8 > [ 6971.600019] [<0000000072e83bbc>] unmap_single_vma+0xbc/0x1c8 > [ 6971.600025] [<0000000072e8418e>] zap_page_range+0x176/0x230 > [ 6971.600033] [<0000000072eece8e>] do_madvise+0xfde/0x1270 > [ 6971.600039] [<0000000072eed50a>] __s390x_sys_madvise+0x72/0x98 > [ 6971.600047] [<0000000073b1cce4>] system_call+0xdc/0x278 > [ 6971.600053] 2 locks held by dirtyc0w_child/65238: > [ 6971.600058] #0: 000000013442fa18 (&mm->mmap_lock){++++}-{3:3}, at: do_madvise+0x17a/0x1270 > [ 6971.600432] #1: 00000001343f9060 (ptlock_ptr(page)#2){+.+.}-{2:2}, at: unmap_page_range+0x640/0x1df8 > [ 6971.600487] Disabling lock debugging due to kernel taint > > Once it happens, running it again will trigger in on another PFN. > > [39717.085115] BUG: Bad page state in process dirtyc0w_child pfn:af065 > > Any thoughts? Alexander, Gerald, could you take a look? ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: BUG: Bad page state in process dirtyc0w_child 2020-09-16 14:28 ` Heiko Carstens @ 2020-09-22 17:03 ` Gerald Schaefer 2020-09-23 13:39 ` Gerald Schaefer 0 siblings, 1 reply; 17+ messages in thread From: Gerald Schaefer @ 2020-09-22 17:03 UTC (permalink / raw) To: Heiko Carstens Cc: Qian Cai, Alexander Gordeev, Vasily Gorbik, Christian Borntraeger, linux-s390, linux-mm, linux-kernel On Wed, 16 Sep 2020 16:28:06 +0200 Heiko Carstens <hca@linux.ibm.com> wrote: > On Sat, Sep 12, 2020 at 09:54:12PM -0400, Qian Cai wrote: > > Occasionally, running this LTP test will trigger an error below on > > s390: > > https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/security/dirtyc0w/dirtyc0w.c > > https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/security/dirtyc0w/dirtyc0w_child.c > > > > this .config: > > https://gitlab.com/cailca/linux-mm/-/blob/master/s390.config > > > > [ 6970.253173] LTP: starting dirtyc0w > > [ 6971.599102] BUG: Bad page state in process dirtyc0w_child pfn:8865d > > [ 6971.599867] page:000000001a8328d7 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x8865d > > [ 6971.599876] flags: 0x400000000008000e(referenced|uptodate|dirty|swapbacked) > > [ 6971.599886] raw: 400000000008000e 0000000000000100 0000000000000122 0000000000000000 > > [ 6971.599893] raw: 0000000000000000 0000000000000000 ffffffff00000000 0000000000000000 > > [ 6971.599900] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set > > [ 6971.599906] Modules linked in: loop kvm ip_tables x_tables dm_mirror dm_region_hash dm_log dm_mod [last unloaded: dummy_del_mod] > > [ 6971.599952] CPU: 1 PID: 65238 Comm: dirtyc0w_child Tainted: G O 5.9.0-rc4-next-20200909 #1 > > [ 6971.599959] Hardware name: IBM 2964 N96 400 (z/VM 6.4.0) > > [ 6971.599964] Call Trace: > > [ 6971.599979] [<0000000073aec038>] show_stack+0x158/0x1f0 > > [ 6971.599986] [<0000000073af724a>] dump_stack+0x1f2/0x238 > > [ 6971.599994] [<0000000072ed086a>] bad_page+0x1ba/0x1c0 > > [ 6971.600000] [<0000000072ed20c4>] free_pcp_prepare+0x4fc/0x658 > > [ 6971.600006] [<0000000072ed96a6>] free_unref_page+0xae/0x158 > > [ 6971.600013] [<0000000072e8286a>] unmap_page_range+0xb62/0x1df8 > > [ 6971.600019] [<0000000072e83bbc>] unmap_single_vma+0xbc/0x1c8 > > [ 6971.600025] [<0000000072e8418e>] zap_page_range+0x176/0x230 > > [ 6971.600033] [<0000000072eece8e>] do_madvise+0xfde/0x1270 > > [ 6971.600039] [<0000000072eed50a>] __s390x_sys_madvise+0x72/0x98 > > [ 6971.600047] [<0000000073b1cce4>] system_call+0xdc/0x278 > > [ 6971.600053] 2 locks held by dirtyc0w_child/65238: > > [ 6971.600058] #0: 000000013442fa18 (&mm->mmap_lock){++++}-{3:3}, at: do_madvise+0x17a/0x1270 > > [ 6971.600432] #1: 00000001343f9060 (ptlock_ptr(page)#2){+.+.}-{2:2}, at: unmap_page_range+0x640/0x1df8 > > [ 6971.600487] Disabling lock debugging due to kernel taint > > > > Once it happens, running it again will trigger in on another PFN. > > > > [39717.085115] BUG: Bad page state in process dirtyc0w_child pfn:af065 > > > > Any thoughts? > > Alexander, Gerald, could you take a look? Thanks for reporting. From the header of dirtyc0w.c it seems that this is testing some gup behavior. Given that we have an issue with gup_fast on s390, this could be related. I'll try to reproduce and do more analysis. A fix for our gup_fast issue is also in linux-next now, as of 2020-09-20, but it was not yet included in your kernel version 5.9.0-rc4-next-20200909. So if this is related to the gup_fast issue, it should not occur again with linux-next kernels after 2020-09-20. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: BUG: Bad page state in process dirtyc0w_child 2020-09-22 17:03 ` Gerald Schaefer @ 2020-09-23 13:39 ` Gerald Schaefer 2020-09-23 20:00 ` Linus Torvalds 2020-09-24 2:06 ` Qian Cai 0 siblings, 2 replies; 17+ messages in thread From: Gerald Schaefer @ 2020-09-23 13:39 UTC (permalink / raw) To: Linus Torvalds, Peter Xu Cc: Heiko Carstens, Qian Cai, Alexander Gordeev, Vasily Gorbik, Christian Borntraeger, linux-s390, linux-mm, linux-kernel On Tue, 22 Sep 2020 19:03:50 +0200 Gerald Schaefer <gerald.schaefer@linux.ibm.com> wrote: > On Wed, 16 Sep 2020 16:28:06 +0200 > Heiko Carstens <hca@linux.ibm.com> wrote: > > > On Sat, Sep 12, 2020 at 09:54:12PM -0400, Qian Cai wrote: > > > Occasionally, running this LTP test will trigger an error below on > > > s390: > > > https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/security/dirtyc0w/dirtyc0w.c > > > https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/security/dirtyc0w/dirtyc0w_child.c > > > > > > this .config: > > > https://gitlab.com/cailca/linux-mm/-/blob/master/s390.config > > > > > > [ 6970.253173] LTP: starting dirtyc0w > > > [ 6971.599102] BUG: Bad page state in process dirtyc0w_child pfn:8865d > > > [ 6971.599867] page:000000001a8328d7 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x8865d > > > [ 6971.599876] flags: 0x400000000008000e(referenced|uptodate|dirty|swapbacked) > > > [ 6971.599886] raw: 400000000008000e 0000000000000100 0000000000000122 0000000000000000 > > > [ 6971.599893] raw: 0000000000000000 0000000000000000 ffffffff00000000 0000000000000000 > > > [ 6971.599900] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set > > > [ 6971.599906] Modules linked in: loop kvm ip_tables x_tables dm_mirror dm_region_hash dm_log dm_mod [last unloaded: dummy_del_mod] > > > [ 6971.599952] CPU: 1 PID: 65238 Comm: dirtyc0w_child Tainted: G O 5.9.0-rc4-next-20200909 #1 > > > [ 6971.599959] Hardware name: IBM 2964 N96 400 (z/VM 6.4.0) > > > [ 6971.599964] Call Trace: > > > [ 6971.599979] [<0000000073aec038>] show_stack+0x158/0x1f0 > > > [ 6971.599986] [<0000000073af724a>] dump_stack+0x1f2/0x238 > > > [ 6971.599994] [<0000000072ed086a>] bad_page+0x1ba/0x1c0 > > > [ 6971.600000] [<0000000072ed20c4>] free_pcp_prepare+0x4fc/0x658 > > > [ 6971.600006] [<0000000072ed96a6>] free_unref_page+0xae/0x158 > > > [ 6971.600013] [<0000000072e8286a>] unmap_page_range+0xb62/0x1df8 > > > [ 6971.600019] [<0000000072e83bbc>] unmap_single_vma+0xbc/0x1c8 > > > [ 6971.600025] [<0000000072e8418e>] zap_page_range+0x176/0x230 > > > [ 6971.600033] [<0000000072eece8e>] do_madvise+0xfde/0x1270 > > > [ 6971.600039] [<0000000072eed50a>] __s390x_sys_madvise+0x72/0x98 > > > [ 6971.600047] [<0000000073b1cce4>] system_call+0xdc/0x278 > > > [ 6971.600053] 2 locks held by dirtyc0w_child/65238: > > > [ 6971.600058] #0: 000000013442fa18 (&mm->mmap_lock){++++}-{3:3}, at: do_madvise+0x17a/0x1270 > > > [ 6971.600432] #1: 00000001343f9060 (ptlock_ptr(page)#2){+.+.}-{2:2}, at: unmap_page_range+0x640/0x1df8 > > > [ 6971.600487] Disabling lock debugging due to kernel taint > > > > > > Once it happens, running it again will trigger in on another PFN. > > > > > > [39717.085115] BUG: Bad page state in process dirtyc0w_child pfn:af065 > > > > > > Any thoughts? > > > > Alexander, Gerald, could you take a look? > > Thanks for reporting. From the header of dirtyc0w.c it seems that this > is testing some gup behavior. Given that we have an issue with gup_fast > on s390, this could be related. I'll try to reproduce and do more > analysis. > > A fix for our gup_fast issue is also in linux-next now, as of 2020-09-20, > but it was not yet included in your kernel version 5.9.0-rc4-next-20200909. > So if this is related to the gup_fast issue, it should not occur again > with linux-next kernels after 2020-09-20. OK, I can now reproduce this, and unfortunately also with the gup_fast fix, so it is something different. Bisecting is a bit hard, as it will not always show immediately, sometimes takes up to an hour. Still, I think I found the culprit, merge commit b25d1dc9474e "Merge branch 'simplify-do_wp_page'". Without those 4 patches, it works fine, running over night. Not sure why this only shows on s390, should not be architecture-specific, but we do often see subtle races earlier than others due to hypervisor impact. The first commit 09854ba94c6a ("mm: do_wp_page() simplification") already introduces this error. The dirtyc0w_child test seems to play with cow and racing madvise(MADV_DONTNEED), but I have not yet fully understood it and also not the changes from commit 09854ba94c6a. As Linus already mentioned in the merge commit message, this is some bad timing for such a change, so I don't want to delay this further with trying to understand it better before reporting. Maybe Peter or Linus can spot some obvious issue. One thing that seems strange to me is that the page flags from the bad page state output are (uptodate|swapbacked), see below, or (referenced|uptodate|dirty|swapbacked) in the original report. But IIUC, that should not qualify for the "PAGE_FLAGS_CHECK_AT_FREE flag(s) set" reason. So it seems that the flags have changed between check_free_page() and __dump_page(), which would be very odd. Or maybe some issue with compound pages, because __dump_page() looks at head->flags. [ 1863.237707] BUG: Bad page state in process dirtyc0w_child pfn:58527d [ 1863.237721] page:000000008866956b refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x58527d [ 1863.237727] flags: 0x3ffff00000080004(uptodate|swapbacked) [ 1863.237734] raw: 3ffff00000080004 0000000000000100 0000000000000122 0000000000000000 [ 1863.237738] raw: 0000000000000000 0000000000000000 ffffffff00000000 0000000000000000 [ 1863.237742] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set [ 1863.237745] Modules linked in: [ 1863.237752] CPU: 16 PID: 9074 Comm: dirtyc0w_child Tainted: G B 5.9.0-rc6-00020-geff48ddeab78-dirty #104 [ 1863.237756] Hardware name: IBM 3906 M03 703 (LPAR) [ 1863.237759] Call Trace: [ 1863.237768] [<0000000000115f28>] show_stack+0x100/0x158 [ 1863.237775] [<000000000096b41a>] dump_stack+0xa2/0xd8 [ 1863.237781] [<00000000003d497c>] bad_page+0xdc/0x140 [ 1863.237785] [<00000000003d5b62>] free_pcp_prepare+0x31a/0x360 [ 1863.237789] [<00000000003d906a>] free_unref_page+0x32/0xb8 [ 1863.237794] [<00000000003b05f4>] zap_p4d_range+0x64c/0xcf8 [ 1863.237797] [<00000000003b0e7a>] unmap_page_range+0x9a/0x110 [ 1863.237801] [<00000000003b0f84>] unmap_single_vma+0x94/0x100 [ 1863.237805] [<00000000003b14c2>] zap_page_range+0x14a/0x1f0 [ 1863.237809] [<00000000003e3a24>] do_madvise+0x75c/0x918 [ 1863.237812] [<00000000003e3c06>] __s390x_sys_madvise+0x26/0x38 [ 1863.237817] [<0000000000d280d4>] system_call+0xe0/0x2c0 [ 1863.237820] INFO: lockdep is turned off. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: BUG: Bad page state in process dirtyc0w_child 2020-09-23 13:39 ` Gerald Schaefer @ 2020-09-23 20:00 ` Linus Torvalds 2020-09-24 2:06 ` Qian Cai 1 sibling, 0 replies; 17+ messages in thread From: Linus Torvalds @ 2020-09-23 20:00 UTC (permalink / raw) To: Gerald Schaefer Cc: Peter Xu, Heiko Carstens, Qian Cai, Alexander Gordeev, Vasily Gorbik, Christian Borntraeger, linux-s390, Linux-MM, Linux Kernel Mailing List On Wed, Sep 23, 2020 at 6:39 AM Gerald Schaefer <gerald.schaefer@linux.ibm.com> wrote: > > OK, I can now reproduce this, and unfortunately also with the gup_fast > fix, so it is something different. Bisecting is a bit hard, as it will > not always show immediately, sometimes takes up to an hour. > > Still, I think I found the culprit, merge commit b25d1dc9474e "Merge > branch 'simplify-do_wp_page'". Without those 4 patches, it works fine, > running over night. Odd, but I have a strong suspicion that the "do_wp_page() simplification" only ends up removing serialization that then hides some existing thing. > Not sure why this only shows on s390, should not be architecture-specific, > but we do often see subtle races earlier than others due to hypervisor > impact. Yeah, so if it needs very particular timing, maybe the s390 page table handling together with the hypervisor interfaces ends up being more likely to trigger this, and thus the different timings at do_wp_page() then ends up showing it. > One thing that seems strange to me is that the page flags from the > bad page state output are (uptodate|swapbacked), see below, or > (referenced|uptodate|dirty|swapbacked) in the original report. But IIUC, > that should not qualify for the "PAGE_FLAGS_CHECK_AT_FREE flag(s) set" > reason. So it seems that the flags have changed between check_free_page() > and __dump_page(), which would be very odd. Or maybe some issue with > compound pages, because __dump_page() looks at head->flags. The odd thing is that all of this _should_ be serialized by the page table lock, as far as I can tell. From your trace, it looks very much like it's do_madvise() -> zap_pte_range() (your stack trace only has zap_p4d_range mentioned, but all the lower levels are inlined) that races with presumably fast-gup. But zap_pte_range() has the pte lock, and fast-gup is - by design - not allowed to change the page state other than taking a reference to it, and should do that with a "try_get" operation, so even taking the reference should never ever race with somebody doing the final free. IOW, the fast-GUP code does that page = pte_page(pte); head = try_grab_compound_head(page, 1, flags); if (!head) goto pte_unmap; if (unlikely(pte_val(pte) != pte_val(*ptep))) { put_compound_head(head, 1, flags); goto pte_unmap; } where the important part is that "try_grab_compound_head()" which does the whole careful atomic "increase page count only if it wasn't zero". See page_cache_add_speculative(). So the rule is - if you hold the page table lock, you can just do "get_page(pte_page())" directly, because you know the pte cannot go away from under you - if you are fast-gup, the pte *can* go away from under you, so you need to do that very careful "get page unless it's gone" dance but I don't see us violating that. There's maybe some interesting memory ordering in the above case, but it does atomic_add_unless() which is ordered, and s390 is strongly ordered anyway, isn't it? (Yes, and it doesn't do the atomic stuff at all if TINY_RCU is set, but that's only set for non-preemptible UP kernels, so that doesn't matter). So if zap_page_range() races with fast-gup, then either zap_page_range() wins the race and puts the page - but then fast-gup won't touch it, or fast-gup wins and gets a reference to the page, and then zap_page_range() will clear it and drop the ref to it, but it won't be the final ref. Your dump seems to show that zap_page_range() *did* drop the final ref, but something is racing with it to the point of actually modifying the page flags. Odd. And the do_wp_page() change itself shouldn't be directly involved, because that's all done under the page table lock. But it obviously does change the page locking a lot, and changes timing a lot. And in fact, the zap_pte_range() code itself doesn't take the page lock (and cannot, because it's all under the page table spinlock). So it does smell like timing to me. But possibly with some s390-specific twist to it. Ooh. One thing that is *very* different about s390 is that it frees the page directly, and doesn't batch things up to happen after the TLB flush. Maybe THAT is the difference? Not that I can tell why it should matter, for all the reasons outlines above. But on x86-64, the __tlb_remove_page() function just adds the page to the "free this later" TLB flush structure, and if it fills up it does the TLB flush and then does the actual batched page freeing outside the page table lock. And that *has* been one of the things that the fast-gup code depended on. We even have a big comment about it: /* * Disable interrupts. The nested form is used, in order to allow * full, general purpose use of this routine. * * With interrupts disabled, we block page table pages from being * freed from under us. See struct mmu_table_batch comments in * include/asm-generic/tlb.h for more details. * * We do not adopt an rcu_read_lock(.) here as we also want to * block IPIs that come from THPs splitting. */ and maybe that whole thing doesn't hold true for s390 at all. Linus ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: BUG: Bad page state in process dirtyc0w_child @ 2020-09-23 20:00 ` Linus Torvalds 0 siblings, 0 replies; 17+ messages in thread From: Linus Torvalds @ 2020-09-23 20:00 UTC (permalink / raw) To: Gerald Schaefer Cc: Peter Xu, Heiko Carstens, Qian Cai, Alexander Gordeev, Vasily Gorbik, Christian Borntraeger, linux-s390, Linux-MM, Linux Kernel Mailing List On Wed, Sep 23, 2020 at 6:39 AM Gerald Schaefer <gerald.schaefer@linux.ibm.com> wrote: > > OK, I can now reproduce this, and unfortunately also with the gup_fast > fix, so it is something different. Bisecting is a bit hard, as it will > not always show immediately, sometimes takes up to an hour. > > Still, I think I found the culprit, merge commit b25d1dc9474e "Merge > branch 'simplify-do_wp_page'". Without those 4 patches, it works fine, > running over night. Odd, but I have a strong suspicion that the "do_wp_page() simplification" only ends up removing serialization that then hides some existing thing. > Not sure why this only shows on s390, should not be architecture-specific, > but we do often see subtle races earlier than others due to hypervisor > impact. Yeah, so if it needs very particular timing, maybe the s390 page table handling together with the hypervisor interfaces ends up being more likely to trigger this, and thus the different timings at do_wp_page() then ends up showing it. > One thing that seems strange to me is that the page flags from the > bad page state output are (uptodate|swapbacked), see below, or > (referenced|uptodate|dirty|swapbacked) in the original report. But IIUC, > that should not qualify for the "PAGE_FLAGS_CHECK_AT_FREE flag(s) set" > reason. So it seems that the flags have changed between check_free_page() > and __dump_page(), which would be very odd. Or maybe some issue with > compound pages, because __dump_page() looks at head->flags. The odd thing is that all of this _should_ be serialized by the page table lock, as far as I can tell. From your trace, it looks very much like it's do_madvise() -> zap_pte_range() (your stack trace only has zap_p4d_range mentioned, but all the lower levels are inlined) that races with presumably fast-gup. But zap_pte_range() has the pte lock, and fast-gup is - by design - not allowed to change the page state other than taking a reference to it, and should do that with a "try_get" operation, so even taking the reference should never ever race with somebody doing the final free. IOW, the fast-GUP code does that page = pte_page(pte); head = try_grab_compound_head(page, 1, flags); if (!head) goto pte_unmap; if (unlikely(pte_val(pte) != pte_val(*ptep))) { put_compound_head(head, 1, flags); goto pte_unmap; } where the important part is that "try_grab_compound_head()" which does the whole careful atomic "increase page count only if it wasn't zero". See page_cache_add_speculative(). So the rule is - if you hold the page table lock, you can just do "get_page(pte_page())" directly, because you know the pte cannot go away from under you - if you are fast-gup, the pte *can* go away from under you, so you need to do that very careful "get page unless it's gone" dance but I don't see us violating that. There's maybe some interesting memory ordering in the above case, but it does atomic_add_unless() which is ordered, and s390 is strongly ordered anyway, isn't it? (Yes, and it doesn't do the atomic stuff at all if TINY_RCU is set, but that's only set for non-preemptible UP kernels, so that doesn't matter). So if zap_page_range() races with fast-gup, then either zap_page_range() wins the race and puts the page - but then fast-gup won't touch it, or fast-gup wins and gets a reference to the page, and then zap_page_range() will clear it and drop the ref to it, but it won't be the final ref. Your dump seems to show that zap_page_range() *did* drop the final ref, but something is racing with it to the point of actually modifying the page flags. Odd. And the do_wp_page() change itself shouldn't be directly involved, because that's all done under the page table lock. But it obviously does change the page locking a lot, and changes timing a lot. And in fact, the zap_pte_range() code itself doesn't take the page lock (and cannot, because it's all under the page table spinlock). So it does smell like timing to me. But possibly with some s390-specific twist to it. Ooh. One thing that is *very* different about s390 is that it frees the page directly, and doesn't batch things up to happen after the TLB flush. Maybe THAT is the difference? Not that I can tell why it should matter, for all the reasons outlines above. But on x86-64, the __tlb_remove_page() function just adds the page to the "free this later" TLB flush structure, and if it fills up it does the TLB flush and then does the actual batched page freeing outside the page table lock. And that *has* been one of the things that the fast-gup code depended on. We even have a big comment about it: /* * Disable interrupts. The nested form is used, in order to allow * full, general purpose use of this routine. * * With interrupts disabled, we block page table pages from being * freed from under us. See struct mmu_table_batch comments in * include/asm-generic/tlb.h for more details. * * We do not adopt an rcu_read_lock(.) here as we also want to * block IPIs that come from THPs splitting. */ and maybe that whole thing doesn't hold true for s390 at all. Linus ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: BUG: Bad page state in process dirtyc0w_child 2020-09-23 20:00 ` Linus Torvalds (?) @ 2020-09-23 21:33 ` Gerald Schaefer 2020-09-23 21:50 ` Linus Torvalds -1 siblings, 1 reply; 17+ messages in thread From: Gerald Schaefer @ 2020-09-23 21:33 UTC (permalink / raw) To: Linus Torvalds Cc: Peter Xu, Heiko Carstens, Qian Cai, Alexander Gordeev, Vasily Gorbik, Christian Borntraeger, linux-s390, Linux-MM, Linux Kernel Mailing List On Wed, 23 Sep 2020 13:00:45 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote: [...] > > Ooh. One thing that is *very* different about s390 is that it frees > the page directly, and doesn't batch things up to happen after the TLB > flush. > > Maybe THAT is the difference? Not that I can tell why it should > matter, for all the reasons outlines above. But on x86-64, the > __tlb_remove_page() function just adds the page to the "free this > later" TLB flush structure, and if it fills up it does the TLB flush > and then does the actual batched page freeing outside the page table > lock. > > And that *has* been one of the things that the fast-gup code depended > on. We even have a big comment about it: > > /* > * Disable interrupts. The nested form is used, in order to allow > * full, general purpose use of this routine. > * > * With interrupts disabled, we block page table pages from being > * freed from under us. See struct mmu_table_batch comments in > * include/asm-generic/tlb.h for more details. > * > * We do not adopt an rcu_read_lock(.) here as we also want to > * block IPIs that come from THPs splitting. > */ > > and maybe that whole thing doesn't hold true for s390 at all. Thanks, very nice walk-through, need some time to digest this. The TLB aspect is interesting, and we do have our own __tlb_remove_page_size(), which directly calls free_page_and_swap_cache() instead of the generic batched approach. I faintly remember that we also did have some batched and rcu_sched based approach. It seems there was some rework with commit 9de7d833e370 ("s390/tlb: Convert to generic mmu_gather") and discussion in https://lore.kernel.org/linux-arch/20180918125151.31744-1-schwidefsky@de.ibm.com/ Given the comment you mentioned and also this one from mm/gup.c: /* * Fast GUP * * get_user_pages_fast attempts to pin user pages by walking the page * tables directly and avoids taking locks. Thus the walker needs to be * protected from page table pages being freed from under it, and should * block any THP splits. * * One way to achieve this is to have the walker disable interrupts, and * rely on IPIs from the TLB flushing code blocking before the page table * pages are freed. This is unsuitable for architectures that do not need * to broadcast an IPI when invalidating TLBs. * * Another way to achieve this is to batch up page table containing pages * belonging to more than one mm_user, then rcu_sched a callback to free those * pages. Disabling interrupts will allow the fast_gup walker to both block * the rcu_sched callback, and an IPI that we broadcast for splitting THPs * (which is a relatively rare event). The code below adopts this strategy. It really sounds like we might have yet another issue with (common code) gup_fast, and non-batched freeing of pagetable pages, though I wonder how that would have worked with the s390-specific version of gup_fast before. Certainly worth investigating, thanks a lot for the hint! Meanwhile, out of curiosity, while I still fail to comprehend commit 09854ba94c6a ("mm: do_wp_page() simplification") in its entirety, there is one detail that I find most confusing: the unlock_page() has moved behind the wp_page_reuse(), while it was the other way round before. Does it simply not really matter, or was it done on purpose, possibly related to the "get rid of the nasty serialization on the page lock" description? ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: BUG: Bad page state in process dirtyc0w_child 2020-09-23 21:33 ` Gerald Schaefer @ 2020-09-23 21:50 ` Linus Torvalds 0 siblings, 0 replies; 17+ messages in thread From: Linus Torvalds @ 2020-09-23 21:50 UTC (permalink / raw) To: Gerald Schaefer Cc: Peter Xu, Heiko Carstens, Qian Cai, Alexander Gordeev, Vasily Gorbik, Christian Borntraeger, linux-s390, Linux-MM, Linux Kernel Mailing List On Wed, Sep 23, 2020 at 2:33 PM Gerald Schaefer <gerald.schaefer@linux.ibm.com> wrote: > > Thanks, very nice walk-through, need some time to digest this. The TLB > aspect is interesting, and we do have our own __tlb_remove_page_size(), > which directly calls free_page_and_swap_cache() instead of the generic > batched approach. So I don't think it's the free_page_and_swap_cache() itself that is the problem. As mentioned, the actual pages themselves should be handled by the reference counting being atomic. The interrupt disable is really about just the page *tables* being free'd - not the final page level. So the issue is that at least on x86-64, we have the serialization that we will only free the page tables after a cross-CPU IPI has flushed the TLB. I think s390 just RCU-free's the page tables instead, which should fix it. So I think this is special, and s390 is very different from x86, but I don't think it's the problem. In fact, I think you pinpointed the real issue: > Meanwhile, out of curiosity, while I still fail to comprehend commit > 09854ba94c6a ("mm: do_wp_page() simplification") in its entirety, there > is one detail that I find most confusing: the unlock_page() has moved > behind the wp_page_reuse(), while it was the other way round before. You know what? That was just a mistake, and I think you may actually have hit the real cause of the problem. It means that we keep the page locked until after we do the pte_unmap_unlock(), so now we have no guarantees that we hold the page referecne. And then we unlock it - while somebody else might be freeing it. So somebody is freeing a locked page just as we're unlocking it, and that matches the problem you see exactly: the debug thing will hit because the last free happened while locked, and then by the time the printout happens it has become unlocked so it doesn't show any more. Duh. Would you mind testing just moving the unlock_page() back to before the wp_page_reuse()? Does that make your debug check go away? Linus ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: BUG: Bad page state in process dirtyc0w_child @ 2020-09-23 21:50 ` Linus Torvalds 0 siblings, 0 replies; 17+ messages in thread From: Linus Torvalds @ 2020-09-23 21:50 UTC (permalink / raw) To: Gerald Schaefer Cc: Peter Xu, Heiko Carstens, Qian Cai, Alexander Gordeev, Vasily Gorbik, Christian Borntraeger, linux-s390, Linux-MM, Linux Kernel Mailing List On Wed, Sep 23, 2020 at 2:33 PM Gerald Schaefer <gerald.schaefer@linux.ibm.com> wrote: > > Thanks, very nice walk-through, need some time to digest this. The TLB > aspect is interesting, and we do have our own __tlb_remove_page_size(), > which directly calls free_page_and_swap_cache() instead of the generic > batched approach. So I don't think it's the free_page_and_swap_cache() itself that is the problem. As mentioned, the actual pages themselves should be handled by the reference counting being atomic. The interrupt disable is really about just the page *tables* being free'd - not the final page level. So the issue is that at least on x86-64, we have the serialization that we will only free the page tables after a cross-CPU IPI has flushed the TLB. I think s390 just RCU-free's the page tables instead, which should fix it. So I think this is special, and s390 is very different from x86, but I don't think it's the problem. In fact, I think you pinpointed the real issue: > Meanwhile, out of curiosity, while I still fail to comprehend commit > 09854ba94c6a ("mm: do_wp_page() simplification") in its entirety, there > is one detail that I find most confusing: the unlock_page() has moved > behind the wp_page_reuse(), while it was the other way round before. You know what? That was just a mistake, and I think you may actually have hit the real cause of the problem. It means that we keep the page locked until after we do the pte_unmap_unlock(), so now we have no guarantees that we hold the page referecne. And then we unlock it - while somebody else might be freeing it. So somebody is freeing a locked page just as we're unlocking it, and that matches the problem you see exactly: the debug thing will hit because the last free happened while locked, and then by the time the printout happens it has become unlocked so it doesn't show any more. Duh. Would you mind testing just moving the unlock_page() back to before the wp_page_reuse()? Does that make your debug check go away? Linus ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: BUG: Bad page state in process dirtyc0w_child 2020-09-23 21:50 ` Linus Torvalds (?) @ 2020-09-23 22:02 ` Gerald Schaefer 2020-09-24 12:06 ` Gerald Schaefer -1 siblings, 1 reply; 17+ messages in thread From: Gerald Schaefer @ 2020-09-23 22:02 UTC (permalink / raw) To: Linus Torvalds Cc: Peter Xu, Heiko Carstens, Qian Cai, Alexander Gordeev, Vasily Gorbik, Christian Borntraeger, linux-s390, Linux-MM, Linux Kernel Mailing List On Wed, 23 Sep 2020 14:50:36 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Wed, Sep 23, 2020 at 2:33 PM Gerald Schaefer > <gerald.schaefer@linux.ibm.com> wrote: > > > > Thanks, very nice walk-through, need some time to digest this. The TLB > > aspect is interesting, and we do have our own __tlb_remove_page_size(), > > which directly calls free_page_and_swap_cache() instead of the generic > > batched approach. > > So I don't think it's the free_page_and_swap_cache() itself that is the problem. > > As mentioned, the actual pages themselves should be handled by the > reference counting being atomic. > > The interrupt disable is really about just the page *tables* being > free'd - not the final page level. > > So the issue is that at least on x86-64, we have the serialization > that we will only free the page tables after a cross-CPU IPI has > flushed the TLB. > > I think s390 just RCU-free's the page tables instead, which should fix it. > > So I think this is special, and s390 is very different from x86, but I > don't think it's the problem. > > In fact, I think you pinpointed the real issue: > > > Meanwhile, out of curiosity, while I still fail to comprehend commit > > 09854ba94c6a ("mm: do_wp_page() simplification") in its entirety, there > > is one detail that I find most confusing: the unlock_page() has moved > > behind the wp_page_reuse(), while it was the other way round before. > > You know what? That was just a mistake, and I think you may actually > have hit the real cause of the problem. > > It means that we keep the page locked until after we do the > pte_unmap_unlock(), so now we have no guarantees that we hold the page > referecne. > > And then we unlock it - while somebody else might be freeing it. > > So somebody is freeing a locked page just as we're unlocking it, and > that matches the problem you see exactly: the debug thing will hit > because the last free happened while locked, and then by the time the > printout happens it has become unlocked so it doesn't show any more. > > Duh. > > Would you mind testing just moving the unlock_page() back to before > the wp_page_reuse()? Sure, I'll give it a try running over the night again. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: BUG: Bad page state in process dirtyc0w_child 2020-09-23 22:02 ` Gerald Schaefer @ 2020-09-24 12:06 ` Gerald Schaefer 2020-09-24 15:58 ` Linus Torvalds 0 siblings, 1 reply; 17+ messages in thread From: Gerald Schaefer @ 2020-09-24 12:06 UTC (permalink / raw) To: Linus Torvalds Cc: Peter Xu, Heiko Carstens, Qian Cai, Alexander Gordeev, Vasily Gorbik, Christian Borntraeger, linux-s390, Linux-MM, Linux Kernel Mailing List On Thu, 24 Sep 2020 00:02:26 +0200 Gerald Schaefer <gerald.schaefer@linux.ibm.com> wrote: > On Wed, 23 Sep 2020 14:50:36 -0700 > Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > On Wed, Sep 23, 2020 at 2:33 PM Gerald Schaefer > > <gerald.schaefer@linux.ibm.com> wrote: > > > > > > Thanks, very nice walk-through, need some time to digest this. The TLB > > > aspect is interesting, and we do have our own __tlb_remove_page_size(), > > > which directly calls free_page_and_swap_cache() instead of the generic > > > batched approach. > > > > So I don't think it's the free_page_and_swap_cache() itself that is the problem. > > > > As mentioned, the actual pages themselves should be handled by the > > reference counting being atomic. > > > > The interrupt disable is really about just the page *tables* being > > free'd - not the final page level. > > > > So the issue is that at least on x86-64, we have the serialization > > that we will only free the page tables after a cross-CPU IPI has > > flushed the TLB. > > > > I think s390 just RCU-free's the page tables instead, which should fix it. > > > > So I think this is special, and s390 is very different from x86, but I > > don't think it's the problem. Ah of course, I got confused by freeing pagetable pages vs. the pages themselves. For the pagetable pages we actually use the generic tlb_remove_table_(sync_)one, including the IPI-synchronizing smp_call_function (CONFIG_MMU_GATHER_RCU_TABLE_FREE=y). The "s390 magic" then only starts in our own __tlb_remove_table, where we take care of the special 2K vs. 4K pagetable stuff. Thanks a lot for this very valuable abstract of "who is who and why" in pagetable memory management :-) > > > > In fact, I think you pinpointed the real issue: > > > > > Meanwhile, out of curiosity, while I still fail to comprehend commit > > > 09854ba94c6a ("mm: do_wp_page() simplification") in its entirety, there > > > is one detail that I find most confusing: the unlock_page() has moved > > > behind the wp_page_reuse(), while it was the other way round before. > > > > You know what? That was just a mistake, and I think you may actually > > have hit the real cause of the problem. > > > > It means that we keep the page locked until after we do the > > pte_unmap_unlock(), so now we have no guarantees that we hold the page > > referecne. > > > > And then we unlock it - while somebody else might be freeing it. > > > > So somebody is freeing a locked page just as we're unlocking it, and > > that matches the problem you see exactly: the debug thing will hit > > because the last free happened while locked, and then by the time the > > printout happens it has become unlocked so it doesn't show any more. > > > > Duh. > > > > Would you mind testing just moving the unlock_page() back to before > > the wp_page_reuse()? > > Sure, I'll give it a try running over the night again. It's all good now, no more occurrences with unlock_page() before wp_page_reuse(). ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: BUG: Bad page state in process dirtyc0w_child 2020-09-24 12:06 ` Gerald Schaefer @ 2020-09-24 15:58 ` Linus Torvalds 0 siblings, 0 replies; 17+ messages in thread From: Linus Torvalds @ 2020-09-24 15:58 UTC (permalink / raw) To: Gerald Schaefer, Alex Shi Cc: Peter Xu, Heiko Carstens, Qian Cai, Alexander Gordeev, Vasily Gorbik, Christian Borntraeger, linux-s390, Linux-MM, Linux Kernel Mailing List On Thu, Sep 24, 2020 at 5:06 AM Gerald Schaefer <gerald.schaefer@linux.ibm.com> wrote: > > It's all good now, no more occurrences with unlock_page() before > wp_page_reuse(). Thanks for the confirmation. When you pointed at that unlock_page() change, I was pretty sure that was it ("D'oh!"), but it's always good to have that verification. Committed, Linus ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: BUG: Bad page state in process dirtyc0w_child @ 2020-09-24 15:58 ` Linus Torvalds 0 siblings, 0 replies; 17+ messages in thread From: Linus Torvalds @ 2020-09-24 15:58 UTC (permalink / raw) To: Gerald Schaefer, Alex Shi Cc: Peter Xu, Heiko Carstens, Qian Cai, Alexander Gordeev, Vasily Gorbik, Christian Borntraeger, linux-s390, Linux-MM, Linux Kernel Mailing List On Thu, Sep 24, 2020 at 5:06 AM Gerald Schaefer <gerald.schaefer@linux.ibm.com> wrote: > > It's all good now, no more occurrences with unlock_page() before > wp_page_reuse(). Thanks for the confirmation. When you pointed at that unlock_page() change, I was pretty sure that was it ("D'oh!"), but it's always good to have that verification. Committed, Linus ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: BUG: Bad page state in process dirtyc0w_child 2020-09-23 13:39 ` Gerald Schaefer @ 2020-09-24 2:06 ` Qian Cai 2020-09-24 2:06 ` Qian Cai 1 sibling, 0 replies; 17+ messages in thread From: Qian Cai @ 2020-09-24 2:06 UTC (permalink / raw) To: Gerald Schaefer, Linus Torvalds, Peter Xu Cc: Heiko Carstens, Alexander Gordeev, Vasily Gorbik, Christian Borntraeger, linux-s390, linux-mm, linux-kernel, Alex Shi On Wed, 2020-09-23 at 15:39 +0200, Gerald Schaefer wrote: > OK, I can now reproduce this, and unfortunately also with the gup_fast > fix, so it is something different. Bisecting is a bit hard, as it will > not always show immediately, sometimes takes up to an hour. > > Still, I think I found the culprit, merge commit b25d1dc9474e "Merge > branch 'simplify-do_wp_page'". Without those 4 patches, it works fine, > running over night. > > Not sure why this only shows on s390, should not be architecture-specific, > but we do often see subtle races earlier than others due to hypervisor > impact. Apparently, someone can reproduce something similar on x86 as well: https://lore.kernel.org/linux-mm/c41149a8-211e-390b-af1d-d5eee690fecb@linux.alibaba.com/ Probably, Alex could revert the bad commits and confirm it there. > > The first commit 09854ba94c6a ("mm: do_wp_page() simplification") already > introduces this error. The dirtyc0w_child test seems to play with cow > and racing madvise(MADV_DONTNEED), but I have not yet fully understood > it and also not the changes from commit 09854ba94c6a. As Linus already > mentioned in the merge commit message, this is some bad timing for such > a change, so I don't want to delay this further with trying to understand > it better before reporting. Maybe Peter or Linus can spot some obvious > issue. > > One thing that seems strange to me is that the page flags from the > bad page state output are (uptodate|swapbacked), see below, or > (referenced|uptodate|dirty|swapbacked) in the original report. But IIUC, > that should not qualify for the "PAGE_FLAGS_CHECK_AT_FREE flag(s) set" > reason. So it seems that the flags have changed between check_free_page() > and __dump_page(), which would be very odd. Or maybe some issue with > compound pages, because __dump_page() looks at head->flags. > > [ 1863.237707] BUG: Bad page state in process dirtyc0w_child pfn:58527d > [ 1863.237721] page:000000008866956b refcount:0 mapcount:0 > mapping:0000000000000000 index:0x0 pfn:0x58527d > [ 1863.237727] flags: 0x3ffff00000080004(uptodate|swapbacked) > [ 1863.237734] raw: 3ffff00000080004 0000000000000100 0000000000000122 > 0000000000000000 > [ 1863.237738] raw: 0000000000000000 0000000000000000 ffffffff00000000 > 0000000000000000 > [ 1863.237742] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set > [ 1863.237745] Modules linked in: > [ 1863.237752] CPU: 16 PID: 9074 Comm: dirtyc0w_child Tainted: > G B 5.9.0-rc6-00020-geff48ddeab78-dirty #104 > [ 1863.237756] Hardware name: IBM 3906 M03 703 (LPAR) > [ 1863.237759] Call Trace: > [ 1863.237768] [<0000000000115f28>] show_stack+0x100/0x158 > [ 1863.237775] [<000000000096b41a>] dump_stack+0xa2/0xd8 > [ 1863.237781] [<00000000003d497c>] bad_page+0xdc/0x140 > [ 1863.237785] [<00000000003d5b62>] free_pcp_prepare+0x31a/0x360 > [ 1863.237789] [<00000000003d906a>] free_unref_page+0x32/0xb8 > [ 1863.237794] [<00000000003b05f4>] zap_p4d_range+0x64c/0xcf8 > [ 1863.237797] [<00000000003b0e7a>] unmap_page_range+0x9a/0x110 > [ 1863.237801] [<00000000003b0f84>] unmap_single_vma+0x94/0x100 > [ 1863.237805] [<00000000003b14c2>] zap_page_range+0x14a/0x1f0 > [ 1863.237809] [<00000000003e3a24>] do_madvise+0x75c/0x918 > [ 1863.237812] [<00000000003e3c06>] __s390x_sys_madvise+0x26/0x38 > [ 1863.237817] [<0000000000d280d4>] system_call+0xe0/0x2c0 > [ 1863.237820] INFO: lockdep is turned off. > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: BUG: Bad page state in process dirtyc0w_child @ 2020-09-24 2:06 ` Qian Cai 0 siblings, 0 replies; 17+ messages in thread From: Qian Cai @ 2020-09-24 2:06 UTC (permalink / raw) To: Gerald Schaefer, Linus Torvalds, Peter Xu Cc: Heiko Carstens, Alexander Gordeev, Vasily Gorbik, Christian Borntraeger, linux-s390, linux-mm, linux-kernel, Alex Shi On Wed, 2020-09-23 at 15:39 +0200, Gerald Schaefer wrote: > OK, I can now reproduce this, and unfortunately also with the gup_fast > fix, so it is something different. Bisecting is a bit hard, as it will > not always show immediately, sometimes takes up to an hour. > > Still, I think I found the culprit, merge commit b25d1dc9474e "Merge > branch 'simplify-do_wp_page'". Without those 4 patches, it works fine, > running over night. > > Not sure why this only shows on s390, should not be architecture-specific, > but we do often see subtle races earlier than others due to hypervisor > impact. Apparently, someone can reproduce something similar on x86 as well: https://lore.kernel.org/linux-mm/c41149a8-211e-390b-af1d-d5eee690fecb@linux.alibaba.com/ Probably, Alex could revert the bad commits and confirm it there. > > The first commit 09854ba94c6a ("mm: do_wp_page() simplification") already > introduces this error. The dirtyc0w_child test seems to play with cow > and racing madvise(MADV_DONTNEED), but I have not yet fully understood > it and also not the changes from commit 09854ba94c6a. As Linus already > mentioned in the merge commit message, this is some bad timing for such > a change, so I don't want to delay this further with trying to understand > it better before reporting. Maybe Peter or Linus can spot some obvious > issue. > > One thing that seems strange to me is that the page flags from the > bad page state output are (uptodate|swapbacked), see below, or > (referenced|uptodate|dirty|swapbacked) in the original report. But IIUC, > that should not qualify for the "PAGE_FLAGS_CHECK_AT_FREE flag(s) set" > reason. So it seems that the flags have changed between check_free_page() > and __dump_page(), which would be very odd. Or maybe some issue with > compound pages, because __dump_page() looks at head->flags. > > [ 1863.237707] BUG: Bad page state in process dirtyc0w_child pfn:58527d > [ 1863.237721] page:000000008866956b refcount:0 mapcount:0 > mapping:0000000000000000 index:0x0 pfn:0x58527d > [ 1863.237727] flags: 0x3ffff00000080004(uptodate|swapbacked) > [ 1863.237734] raw: 3ffff00000080004 0000000000000100 0000000000000122 > 0000000000000000 > [ 1863.237738] raw: 0000000000000000 0000000000000000 ffffffff00000000 > 0000000000000000 > [ 1863.237742] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set > [ 1863.237745] Modules linked in: > [ 1863.237752] CPU: 16 PID: 9074 Comm: dirtyc0w_child Tainted: > G B 5.9.0-rc6-00020-geff48ddeab78-dirty #104 > [ 1863.237756] Hardware name: IBM 3906 M03 703 (LPAR) > [ 1863.237759] Call Trace: > [ 1863.237768] [<0000000000115f28>] show_stack+0x100/0x158 > [ 1863.237775] [<000000000096b41a>] dump_stack+0xa2/0xd8 > [ 1863.237781] [<00000000003d497c>] bad_page+0xdc/0x140 > [ 1863.237785] [<00000000003d5b62>] free_pcp_prepare+0x31a/0x360 > [ 1863.237789] [<00000000003d906a>] free_unref_page+0x32/0xb8 > [ 1863.237794] [<00000000003b05f4>] zap_p4d_range+0x64c/0xcf8 > [ 1863.237797] [<00000000003b0e7a>] unmap_page_range+0x9a/0x110 > [ 1863.237801] [<00000000003b0f84>] unmap_single_vma+0x94/0x100 > [ 1863.237805] [<00000000003b14c2>] zap_page_range+0x14a/0x1f0 > [ 1863.237809] [<00000000003e3a24>] do_madvise+0x75c/0x918 > [ 1863.237812] [<00000000003e3c06>] __s390x_sys_madvise+0x26/0x38 > [ 1863.237817] [<0000000000d280d4>] system_call+0xe0/0x2c0 > [ 1863.237820] INFO: lockdep is turned off. > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: BUG: Bad page state in process dirtyc0w_child 2020-09-24 2:06 ` Qian Cai (?) @ 2020-09-24 14:25 ` Alex Shi -1 siblings, 0 replies; 17+ messages in thread From: Alex Shi @ 2020-09-24 14:25 UTC (permalink / raw) To: Qian Cai, Gerald Schaefer, Linus Torvalds, Peter Xu Cc: Heiko Carstens, Alexander Gordeev, Vasily Gorbik, Christian Borntraeger, linux-s390, linux-mm, linux-kernel 在 2020/9/24 上午10:06, Qian Cai 写道: > On Wed, 2020-09-23 at 15:39 +0200, Gerald Schaefer wrote: >> OK, I can now reproduce this, and unfortunately also with the gup_fast >> fix, so it is something different. Bisecting is a bit hard, as it will >> not always show immediately, sometimes takes up to an hour. >> >> Still, I think I found the culprit, merge commit b25d1dc9474e "Merge >> branch 'simplify-do_wp_page'". Without those 4 patches, it works fine, >> running over night. >> >> Not sure why this only shows on s390, should not be architecture-specific, >> but we do often see subtle races earlier than others due to hypervisor >> impact. > > Apparently, someone can reproduce something similar on x86 as well: > > https://lore.kernel.org/linux-mm/c41149a8-211e-390b-af1d-d5eee690fecb@linux.alibaba.com/ > > Probably, Alex could revert the bad commits and confirm it there. > >> >> The first commit 09854ba94c6a ("mm: do_wp_page() simplification") already yes, after revert this commit, the BUG disappears. Thanks Alex >> introduces this error. The dirtyc0w_child test seems to play with cow >> and racing madvise(MADV_DONTNEED), but I have not yet fully understood >> it and also not the changes from commit 09854ba94c6a. As Linus already >> mentioned in the merge commit message, this is some bad timing for such >> a change, so I don't want to delay this further with trying to understand >> it better before reporting. Maybe Peter or Linus can spot some obvious >> issue. >> >> One thing that seems strange to me is that the page flags from the >> bad page state output are (uptodate|swapbacked), see below, or >> (referenced|uptodate|dirty|swapbacked) in the original report. But IIUC, >> that should not qualify for the "PAGE_FLAGS_CHECK_AT_FREE flag(s) set" >> reason. So it seems that the flags have changed between check_free_page() >> and __dump_page(), which would be very odd. Or maybe some issue with >> compound pages, because __dump_page() looks at head->flags. >> >> [ 1863.237707] BUG: Bad page state in process dirtyc0w_child pfn:58527d >> [ 1863.237721] page:000000008866956b refcount:0 mapcount:0 >> mapping:0000000000000000 index:0x0 pfn:0x58527d >> [ 1863.237727] flags: 0x3ffff00000080004(uptodate|swapbacked) >> [ 1863.237734] raw: 3ffff00000080004 0000000000000100 0000000000000122 >> 0000000000000000 >> [ 1863.237738] raw: 0000000000000000 0000000000000000 ffffffff00000000 >> 0000000000000000 >> [ 1863.237742] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set >> [ 1863.237745] Modules linked in: >> [ 1863.237752] CPU: 16 PID: 9074 Comm: dirtyc0w_child Tainted: >> G B 5.9.0-rc6-00020-geff48ddeab78-dirty #104 >> [ 1863.237756] Hardware name: IBM 3906 M03 703 (LPAR) >> [ 1863.237759] Call Trace: >> [ 1863.237768] [<0000000000115f28>] show_stack+0x100/0x158 >> [ 1863.237775] [<000000000096b41a>] dump_stack+0xa2/0xd8 >> [ 1863.237781] [<00000000003d497c>] bad_page+0xdc/0x140 >> [ 1863.237785] [<00000000003d5b62>] free_pcp_prepare+0x31a/0x360 >> [ 1863.237789] [<00000000003d906a>] free_unref_page+0x32/0xb8 >> [ 1863.237794] [<00000000003b05f4>] zap_p4d_range+0x64c/0xcf8 >> [ 1863.237797] [<00000000003b0e7a>] unmap_page_range+0x9a/0x110 >> [ 1863.237801] [<00000000003b0f84>] unmap_single_vma+0x94/0x100 >> [ 1863.237805] [<00000000003b14c2>] zap_page_range+0x14a/0x1f0 >> [ 1863.237809] [<00000000003e3a24>] do_madvise+0x75c/0x918 >> [ 1863.237812] [<00000000003e3c06>] __s390x_sys_madvise+0x26/0x38 >> [ 1863.237817] [<0000000000d280d4>] system_call+0xe0/0x2c0 >> [ 1863.237820] INFO: lockdep is turned off. >> ^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2020-09-24 15:58 UTC | newest] Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2020-09-13 1:54 BUG: Bad page state in process dirtyc0w_child Qian Cai 2020-09-13 1:54 ` Qian Cai 2020-09-16 14:28 ` Heiko Carstens 2020-09-22 17:03 ` Gerald Schaefer 2020-09-23 13:39 ` Gerald Schaefer 2020-09-23 20:00 ` Linus Torvalds 2020-09-23 20:00 ` Linus Torvalds 2020-09-23 21:33 ` Gerald Schaefer 2020-09-23 21:50 ` Linus Torvalds 2020-09-23 21:50 ` Linus Torvalds 2020-09-23 22:02 ` Gerald Schaefer 2020-09-24 12:06 ` Gerald Schaefer 2020-09-24 15:58 ` Linus Torvalds 2020-09-24 15:58 ` Linus Torvalds 2020-09-24 2:06 ` Qian Cai 2020-09-24 2:06 ` Qian Cai 2020-09-24 14:25 ` Alex Shi
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.