* kernel BUG at include/linux/swapops.h:497!
@ 2022-12-01 16:58 David Hildenbrand
  2022-12-01 17:48 ` Yang Shi
  0 siblings, 1 reply; 7+ messages in thread
From: David Hildenbrand @ 2022-12-01 16:58 UTC (permalink / raw)
  To: linux-mm, linux-kernel; +Cc: Hugh Dickins, Yang Shi, Peter Xu

Hi,

running COW tests (in mm-unstable) on x86-pae with 8GiB, I am able to trigger the
following BUG on latest upstream:

root@debian:/mnt/scratch/linux/tools/testing/selftests/vm# ./cow
# [INFO] detected THP size: 2048 KiB
# [INFO] detected hugetlb size: 2048 KiB
# [INFO] huge zeropage is enabled
TAP version 13
1..147
# [INFO] Anonymous memory tests in private mappings
# [RUN] Basic COW after fork() ... with base page
ok 1 No leak from parent into child
# [RUN] Basic COW after fork() ... with swapped out base page
ok 2 No leak from parent into child
# [RUN] Basic COW after fork() ... with THP
ok 3 No leak from parent into child
# [RUN] Basic COW after fork() ... with swapped-out THP
Segmentation fault


[  879.314600] kernel BUG at include/linux/swapops.h:497!
[  879.314615] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[  879.314624] CPU: 7 PID: 746 Comm: cow Tainted: G            E      6.1.0-rc7+ #5
[  879.314631] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.fc36 04/01/2014
[  879.314634] EIP: pagemap_pmd_range+0x644/0x650
[  879.314645] Code: 00 00 00 00 66 90 89 ce b9 00 f0 ff ff e9 ff fb ff ff 89 d8 31 db e8 1b c2 52 00 e9 23 fb ff ff e8 51 80 56 00 e9 b6 fe ff ff <0f> 0b bf 00 f0 ff ff e9 38 fa ff ff 3e 8d 74 26 00 55 89 e5 57 31
[  879.314651] EAX: ee2bd000 EBX: 00000002 ECX: ee2bd000 EDX: 00000000
[  879.314656] ESI: f54b9ed4 EDI: 0001f400 EBP: f54b9db4 ESP: f54b9d68
[  879.314660] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00010246
[  879.314670] CR0: 80050033 CR2: b7a00000 CR3: 357452a0 CR4: 00350ef0
[  879.314675] Call Trace:
[  879.314681]  ? madvise_free_pte_range+0x720/0x720
[  879.314689]  ? smaps_pte_range+0x4b0/0x4b0
[  879.314694]  walk_pgd_range+0x325/0x720
[  879.314701]  ? mt_find+0x1d6/0x3a0
[  879.314710]  __walk_page_range+0x164/0x170
[  879.314716]  walk_page_range+0xf9/0x170
[  879.314720]  ? __kmem_cache_alloc_node+0x2a8/0x340
[  879.314729]  pagemap_read+0x124/0x280
[  879.314738]  ? default_llseek+0xf1/0x160
[  879.314747]  ? smaps_account+0x1d0/0x1d0
[  879.314754]  vfs_read+0x90/0x290
[  879.314760]  ? do_madvise.part.0+0x24b/0x390
[  879.314765]  ? debug_smp_processor_id+0x12/0x20
[  879.314773]  ksys_pread64+0x58/0x90
[  879.314778]  __ia32_sys_ia32_pread64+0x1b/0x20
[  879.314787]  __do_fast_syscall_32+0x4c/0xc0
[  879.314796]  do_fast_syscall_32+0x29/0x60
[  879.314803]  do_SYSENTER_32+0x15/0x20
[  879.314809]  entry_SYSENTER_32+0x98/0xf1
[  879.314815] EIP: 0xb7f36559
[  879.314820] Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90 8d 76
[  879.314825] EAX: ffffffda EBX: 00000003 ECX: bff00a50 EDX: 00000008
[  879.314829] ESI: 005bd000 EDI: 00000000 EBP: b7f1c000 ESP: bff00a00
[  879.314833] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000246
[  879.314840] Modules linked in: intel_rapl_msr(E) intel_rapl_common(E) intel_pmc_core(E) kvm_intel(E) kvm(E) irqbypass(E) aesni_intel(E) libaes(E) crypto_simd(E) cryptd(E) rfkill(E) snd_pcm(E) snd_timer(E) joydev(E) snd(E) soundcore(E) sg(E) evdev(E) pcspkr(E) serio_raw(E) qemu_fw_cfg(E) parport_pc(E) ppdev(E) lp(E) parport(E) fuse(E) configfs(E) ip_tables(E) x_tables(E) autofs4(E) ext4(E) crc32c_generic(E) crc16(E) mbcache(E) jbd2(E) sd_mod(E) t10_pi(E) crc64_rocksoft(E) crc64(E) crc_t10dif(E) sr_mod(E) crct10dif_generic(E) cdrom(E) crct10dif_common(E) bochs(E) drm_vram_helper(E) drm_ttm_helper(E) ttm(E) drm_kms_helper(E) ata_generic(E) ata_piix(E) crc32_pclmul(E) libata(E) crc32c_intel(E) drm(E) e1000(E) scsi_mod(E) psmouse(E) i2c_piix4(E) scsi_common(E) floppy(E) button(E)
[  879.314936] ---[ end trace 0000000000000000 ]---
[  879.314940] EIP: pagemap_pmd_range+0x644/0x650
[  879.314944] Code: 00 00 00 00 66 90 89 ce b9 00 f0 ff ff e9 ff fb ff ff 89 d8 31 db e8 1b c2 52 00 e9 23 fb ff ff e8 51 80 56 00 e9 b6 fe ff ff <0f> 0b bf 00 f0 ff ff e9 38 fa ff ff 3e 8d 74 26 00 55 89 e5 57 31
[  879.314949] EAX: ee2bd000 EBX: 00000002 ECX: ee2bd000 EDX: 00000000
[  879.314953] ESI: f54b9ed4 EDI: 0001f400 EBP: f54b9db4 ESP: f54b9d68
[  879.314957] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00010246
[  879.314961] CR0: 80050033 CR2: b7a00000 CR3: 357452a0 CR4: 00350ef0


Reading /proc/self/pagemap in the THP test case seems to trigger the
   BUG_ON(is_migration_entry(entry) && !PageLocked(p));
in pfn_swap_entry_to_page().
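
For reference, a minimal userspace sketch of such a pagemap read. It assumes
the entry layout described in Documentation/admin-guide/mm/pagemap.rst (one
64-bit entry per virtual page, bit 63 = present, bit 62 = swapped). The helper
names are made up for illustration; this is not the code the cow selftest uses:

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Flag bits of a pagemap entry, per the kernel's pagemap documentation. */
#define PM_PRESENT (UINT64_C(1) << 63)
#define PM_SWAP    (UINT64_C(1) << 62)

/*
 * Hypothetical helper: read the pagemap entry that covers vaddr in the
 * calling process. Returns 0 on success, -1 on any error.
 */
static int pagemap_read_entry(const void *vaddr, uint64_t *entry)
{
	long page_size = sysconf(_SC_PAGESIZE);
	off_t off;
	ssize_t n;
	int fd;

	assert(entry != NULL);
	if (page_size <= 0)
		return -1;

	fd = open("/proc/self/pagemap", O_RDONLY);
	if (fd < 0)
		return -1;

	/* One 8-byte entry per virtual page, indexed by virtual page number. */
	off = (off_t)(((uintptr_t)vaddr / (uintptr_t)page_size) * sizeof(uint64_t));
	n = pread(fd, entry, sizeof(*entry), off);
	close(fd);

	return n == (ssize_t)sizeof(*entry) ? 0 : -1;
}

/* Hypothetical helpers: decode the two flag bits used above. */
static int pagemap_is_present(uint64_t entry)
{
	return (entry & PM_PRESENT) != 0;
}

static int pagemap_is_swapped(uint64_t entry)
{
	return (entry & PM_SWAP) != 0;
}
```

A caller would pass the address of a page inside the test mapping and then
check pagemap_is_swapped() on the result. In the trace above it is this kind
of pread() that ends up in pagemap_pmd_range().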

I did not have time to cherry pick (slow machine) or look into details.
And I don't remember seeing that BUG on 64-bit during my tests so far.

Having a migration entry in the swap testcase is kind-of weird. But maybe it's
related to THP splitting (which would, however, also be weird). I'd have expected
a swap entry ... hopefully our swap type doesn't get corrupted.


Something slightly related was reported for -next a couple of months ago:

https://lore.kernel.org/all/11765.1657004484@turing-police/


-- 
Thanks,

David / dhildenb



* Re: kernel BUG at include/linux/swapops.h:497!
  2022-12-01 16:58 kernel BUG at include/linux/swapops.h:497! David Hildenbrand
@ 2022-12-01 17:48 ` Yang Shi
  2022-12-01 18:14   ` Yang Shi
  0 siblings, 1 reply; 7+ messages in thread
From: Yang Shi @ 2022-12-01 17:48 UTC (permalink / raw)
  To: David Hildenbrand; +Cc: linux-mm, linux-kernel, Hugh Dickins, Peter Xu

On Thu, Dec 1, 2022 at 8:58 AM David Hildenbrand <david@redhat.com> wrote:
> Reading /proc/self/pagemap in THP test case seems to trigger the
>    BUG_ON(is_migration_entry(entry) && !PageLocked(p));
> in pfn_swap_entry_to_page().
>
> I did not have time to cherry pick (slow machine) or look into details.
> And I don't remember seeing that BUG 64bit yet during my tests.
>
> Having a migration entry in the swap testcase is kind-of weird. But maybe it's
> related to THP splitting (which would, however, also be weird). I'd have expected
> a swap entry ... hopefully our swap type doesn't get corrupted.

I'm on a slow machine too... anyway some hints off the top of my head.

First of all, I don't think we will see a real swap PMD entry: even
though THP swap is supported, the transhuge PMD is split by
try_to_unmap(), if I remember correctly. So we should only be able to
see a regular PMD, a transhuge PMD, a migration PMD or a PROT_NONE PMD
(if autonuma is on).

Secondly, THP splitting doesn't convert a transhuge PMD to a migration
PMD either; it just splits the transhuge PMD and then converts every
single PTE to a migration PTE.

Thirdly, before pfn_swap_entry_to_page() is called, the code does check
whether the swap PMD is a migration PMD; if it is not, a VM_BUG is
triggered.

So it seems like a migration PMD is fine. The problem seems to be that
the page is not locked when doing migration, IIUC.


* Re: kernel BUG at include/linux/swapops.h:497!
  2022-12-01 17:48 ` Yang Shi
@ 2022-12-01 18:14   ` Yang Shi
  2022-12-02 12:36     ` David Hildenbrand
  0 siblings, 1 reply; 7+ messages in thread
From: Yang Shi @ 2022-12-01 18:14 UTC (permalink / raw)
  To: David Hildenbrand; +Cc: linux-mm, linux-kernel, Hugh Dickins, Peter Xu

On Thu, Dec 1, 2022 at 9:48 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Thu, Dec 1, 2022 at 8:58 AM David Hildenbrand <david@redhat.com> wrote:
> > Reading /proc/self/pagemap in THP test case seems to trigger the
> >    BUG_ON(is_migration_entry(entry) && !PageLocked(p));
> > in pfn_swap_entry_to_page().
> >
> > I did not have time to cherry pick (slow machine) or look into details.
> > And I don't remember seeing that BUG 64bit yet during my tests.
> >
> > Having a migration entry in the swap testcase is kind-of weird. But maybe it's
> > related to THP splitting (which would, however, also be weird). I'd have expected
> > a swap entry ... hopefully our swap type doesn't get corrupted.
>
> I'm on a slow machine too... anyway some hints off the top of my head.
>
> First of all, I don't think we will see a real swap PMD entry since
> even though THP swap is supported the transhuge PMD is split by
> try_to_unmap() if I remember correctly. So we should just be able to
> see a regular PMD, a transhuge PMD, a migration PMD or a PROT_NONE PMD
> (if autonuma is on).
>
> Secondly, THP splitting doesn't convert transhuge PMD to migration PMD
> either, it just splits transhuge PMD then converts every single PTEs
> to migration PTEs.
>
> Thirdly, before pfn_swap_entry_to_page() is called, it does check
> whether the swap PMD is migration PMD or not, if it is not a VM_BUG is
> triggered.
>
> So it seems like a migration PMD is fine. The problem seems like the
> page is not locked when doing migration IIUC.

From a quick look at the migration code, I don't see where the page
would be unlocked, unless I'm missing something. So it may be helpful to
dump the page.


* Re: kernel BUG at include/linux/swapops.h:497!
  2022-12-01 18:14   ` Yang Shi
@ 2022-12-02 12:36     ` David Hildenbrand
  2022-12-05 15:14       ` David Hildenbrand
  0 siblings, 1 reply; 7+ messages in thread
From: David Hildenbrand @ 2022-12-02 12:36 UTC (permalink / raw)
  To: Yang Shi; +Cc: linux-mm, linux-kernel, Hugh Dickins, Peter Xu

On 01.12.22 19:14, Yang Shi wrote:
> On Thu, Dec 1, 2022 at 9:48 AM Yang Shi <shy828301@gmail.com> wrote:
>>
>> On Thu, Dec 1, 2022 at 8:58 AM David Hildenbrand <david@redhat.com> wrote:
>>> Reading /proc/self/pagemap in THP test case seems to trigger the
>>>     BUG_ON(is_migration_entry(entry) && !PageLocked(p));
>>> in pfn_swap_entry_to_page().
>>>
>>> I did not have time to cherry pick (slow machine) or look into details.
>>> And I don't remember seeing that BUG 64bit yet during my tests.
>>>
>>> Having a migration entry in the swap testcase is kind-of weird. But maybe it's
>>> related to THP splitting (which would, however, also be weird). I'd have expected
>>> a swap entry ... hopefully our swap type doesn't get corrupted.
>>
>> I'm on a slow machine too... anyway some hints off the top of my head.
>>
>> First of all, I don't think we will see a real swap PMD entry since
>> even though THP swap is supported the transhuge PMD is split by
>> try_to_unmap() if I remember correctly. So we should just be able to
>> see a regular PMD, a transhuge PMD, a migration PMD or a PROT_NONE PMD
>> (if autonuma is on).

Yes.

>>
>> Secondly, THP splitting doesn't convert transhuge PMD to migration PMD
>> either, it just splits transhuge PMD then converts every single PTEs
>> to migration PTEs.

Right.

>>
>> Thirdly, before pfn_swap_entry_to_page() is called, it does check
>> whether the swap PMD is migration PMD or not, if it is not a VM_BUG is
>> triggered.
>>
>> So it seems like a migration PMD is fine. The problem seems like the
>> page is not locked when doing migration IIUC.
> 
> A quick look at the migration code, I don't see the page is unlocked
> if I don't miss something. So it may be helpful to dump the page.

It is highly unlikely that we have migration happening here, because

1) This triggers 100% on the first try
2) The machine is essentially idle with 7 GiB of free memory.

I'll try digging a bit into what exactly is happening here, dumping the
PMD entry first.


Thanks!

-- 
Thanks,

David / dhildenb



* Re: kernel BUG at include/linux/swapops.h:497!
  2022-12-02 12:36     ` David Hildenbrand
@ 2022-12-05 15:14       ` David Hildenbrand
  2022-12-05 23:14         ` Yang Shi
  0 siblings, 1 reply; 7+ messages in thread
From: David Hildenbrand @ 2022-12-05 15:14 UTC (permalink / raw)
  To: Yang Shi; +Cc: linux-mm, linux-kernel, Hugh Dickins, Peter Xu

On 02.12.22 13:36, David Hildenbrand wrote:
> On 01.12.22 19:14, Yang Shi wrote:
>> On Thu, Dec 1, 2022 at 9:48 AM Yang Shi <shy828301@gmail.com> wrote:
>>>
>>> On Thu, Dec 1, 2022 at 8:58 AM David Hildenbrand <david@redhat.com> wrote:
>>>> Reading /proc/self/pagemap in THP test case seems to trigger the
>>>>      BUG_ON(is_migration_entry(entry) && !PageLocked(p));
>>>> in pfn_swap_entry_to_page().
>>>>
>>>> I did not have time to cherry pick (slow machine) or look into details.
>>>> And I don't remember seeing that BUG 64bit yet during my tests.
>>>>
>>>> Having a migration entry in the swap testcase is kind-of weird. But maybe it's
>>>> related to THP splitting (which would, however, also be weird). I'd have expected
>>>> a swap entry ... hopefully our swap type doesn't get corrupted.
>>>
>>> I'm on a slow machine too... anyway some hints off the top of my head.
>>>
>>> First of all, I don't think we will see a real swap PMD entry since
>>> even though THP swap is supported the transhuge PMD is split by
>>> try_to_unmap() if I remember correctly. So we should just be able to
>>> see a regular PMD, a transhuge PMD, a migration PMD or a PROT_NONE PMD
>>> (if autonuma is on).
> 
> Yes.
> 
>>>
>>> Secondly, THP splitting doesn't convert transhuge PMD to migration PMD
>>> either, it just splits transhuge PMD then converts every single PTEs
>>> to migration PTEs.
> 
> Right.
> 
>>>
>>> Thirdly, before pfn_swap_entry_to_page() is called, it does check
>>> whether the swap PMD is migration PMD or not, if it is not a VM_BUG is
>>> triggered.
>>>
>>> So it seems like a migration PMD is fine. The problem seems like the
>>> page is not locked when doing migration IIUC.
>>
>> A quick look at the migration code, I don't see the page is unlocked
>> if I don't miss something. So it may be helpful to dump the page.
> 
> It is highly unlikely that we have migration happening here, because
> 
> 1) This triggers 100% on the first try
> 2) The machine is essentially idle with 7 GiB of free memory.
> 
> I'll try digging a bit what exactly is happening here, dumping the PMD
> entry first.

Turns out that 32bit x86 doesn't even support PMD migration. We're 
stumbling over a PTE holding a migration entry and the underlying page 
was indeed unlocked. Turns out we fail to remove the migration entries 
we temporarily installed while splitting the THP. Splitting code 
doesn't/cannot notice that and unlocks the now-split page(s).
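
For reference, a minimal user-space sketch of the kind of read that reaches this path (the oops shows an 8-byte pread64 on the pagemap file). The helper names pagemap_offset() and pagemap_entry() are made up for illustration and are not taken from the selftest. The bit positions follow the documented pagemap format (bit 63 present, bit 62 swapped).

```c
#define _FILE_OFFSET_BITS 64	/* pagemap offsets can exceed 2 GiB on 32-bit */
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Documented pagemap bits: 63 = page present, 62 = page swapped. */
#define PM_PRESENT	(1ULL << 63)
#define PM_SWAP		(1ULL << 62)

/* Hypothetical helper: file offset of the 64-bit entry covering addr. */
static off_t pagemap_offset(const void *addr, long pagesize)
{
	return (off_t)((uintptr_t)addr / (uintptr_t)pagesize) *
	       (off_t)sizeof(uint64_t);
}

/*
 * Hypothetical helper: read the pagemap entry for addr in this process.
 * Returns 0 on success, -1 on failure. In the kernel this read walks the
 * page tables via pagemap_pmd_range(), the function named in the oops.
 */
static int pagemap_entry(const void *addr, uint64_t *entry)
{
	long pagesize = sysconf(_SC_PAGESIZE);
	ssize_t n;
	int fd;

	if (pagesize <= 0)
		return -1;
	fd = open("/proc/self/pagemap", O_RDONLY);
	if (fd < 0)
		return -1;
	n = pread(fd, entry, sizeof(*entry), pagemap_offset(addr, pagesize));
	close(fd);
	return n == (ssize_t)sizeof(*entry) ? 0 : -1;
}
```

On an affected kernel, such a read over a PTE that still holds a stale migration entry is enough to hit the BUG_ON; no migration has to be in progress.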

I just sent a fix.
-- 
Thanks,

David / dhildenb


* Re: kernel BUG at include/linux/swapops.h:497!
  2022-12-05 15:14       ` David Hildenbrand
@ 2022-12-05 23:14         ` Yang Shi
  2022-12-06  9:00           ` David Hildenbrand
  0 siblings, 1 reply; 7+ messages in thread
From: Yang Shi @ 2022-12-05 23:14 UTC (permalink / raw)
  To: David Hildenbrand; +Cc: linux-mm, linux-kernel, Hugh Dickins, Peter Xu

On Mon, Dec 5, 2022 at 7:14 AM David Hildenbrand <david@redhat.com> wrote:
>
> Turns out that 32bit x86 doesn't even support PMD migration. We're
> stumbling over a PTE holding a migration entry and the underlying page
> was indeed unlocked. Turns out we fail to remove the migration entries
> we temporarily installed while splitting the THP. Splitting code
> doesn't/cannot notice that and unlocks the now-split page(s).

Thanks for debugging this. IIUC the real call trace should be:

pagemap_pmd_range ->
    pte_to_pagemap_entry ->
        pfn_swap_entry_to_page <--- BUG

I thought it was due to the migration PMD entry in the first place.
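
To spell out the condition at the bottom of that chain, here is a simplified user-space model of the failure sequence described above. All model_* names are hypothetical stand-ins and not kernel code. Only the asserted condition, is_migration_entry(entry) && !PageLocked(p), is taken from the report.

```c
#include <stdbool.h>

/* Hypothetical, simplified view of one PTE after the THP split. */
enum model_type { MODEL_PRESENT, MODEL_MIGRATION };

struct model_pte {
	enum model_type type;	/* what the PTE currently holds */
	bool page_locked;	/* lock state of the page it refers to */
};

/* Mirrors BUG_ON(is_migration_entry(entry) && !PageLocked(p)). */
static bool model_would_bug(const struct model_pte *pte)
{
	return pte->type == MODEL_MIGRATION && !pte->page_locked;
}

/*
 * Model of the split: migration entries are installed while the pages are
 * locked, only the first 'removed' of them are replaced again, and then
 * every page is unlocked regardless.
 */
static void model_split(struct model_pte *ptes, int n, int removed)
{
	int i;

	for (i = 0; i < n; i++) {
		ptes[i].type = MODEL_MIGRATION;
		ptes[i].page_locked = true;
	}
	for (i = 0; i < removed && i < n; i++)
		ptes[i].type = MODEL_PRESENT;
	for (i = 0; i < n; i++)
		ptes[i].page_locked = false;
}
```

With removed < n, the leftover entries satisfy model_would_bug(), which matches the observation that the page behind the migration PTE was unlocked.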

>
> I just sent a fix.
> --
> Thanks,
>
> David / dhildenb
>

* Re: kernel BUG at include/linux/swapops.h:497!
  2022-12-05 23:14         ` Yang Shi
@ 2022-12-06  9:00           ` David Hildenbrand
  0 siblings, 0 replies; 7+ messages in thread
From: David Hildenbrand @ 2022-12-06  9:00 UTC (permalink / raw)
  To: Yang Shi; +Cc: linux-mm, linux-kernel, Hugh Dickins, Peter Xu

On 06.12.22 00:14, Yang Shi wrote:
> On Mon, Dec 5, 2022 at 7:14 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> Turns out that 32bit x86 doesn't even support PMD migration. We're
>> stumbling over a PTE holding a migration entry and the underlying page
>> was indeed unlocked. Turns out we fail to remove the migration entries
>> we temporarily installed while splitting the THP. Splitting code
>> doesn't/cannot notice that and unlocks the now-split page(s).
> 
> Thanks for debugging this. IIUC the real call trace should be:
> 
> pagemap_pmd_range ->
>      pte_to_pagemap_entry ->
>          pfn_swap_entry_to_page <--- BUG
> 
> I thought it was due to the migration PMD entry in the first place.

Yeah, me too. Unfortunately, the system swallowed that part of the call 
trace.

-- 
Thanks,

David / dhildenb


Thread overview: 7+ messages
2022-12-01 16:58 kernel BUG at include/linux/swapops.h:497! David Hildenbrand
2022-12-01 17:48 ` Yang Shi
2022-12-01 18:14   ` Yang Shi
2022-12-02 12:36     ` David Hildenbrand
2022-12-05 15:14       ` David Hildenbrand
2022-12-05 23:14         ` Yang Shi
2022-12-06  9:00           ` David Hildenbrand
