linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [BUG] at include/linux/page-flags.h:415 (PageTransHuge)
@ 2014-01-10 18:23 Daniel Borkmann
  2014-01-11  6:22 ` Andrew Morton
  0 siblings, 1 reply; 13+ messages in thread
From: Daniel Borkmann @ 2014-01-10 18:23 UTC (permalink / raw)
  To: linux-kernel

This is being reliably triggered for each mmaped() packet(7)
socket from user space, basically during unmapping resp.
closing the TX socket.

I believe due to some change in transparent hugepages code ?

When I disable transparent hugepages, everything works fine,
no BUG triggered.

I'd be happy to test patches.

With using default kernel config:

[   63.887947] kernel BUG at include/linux/page-flags.h:415!
[   63.889296] invalid opcode: 0000 [#4] SMP
[   63.890637] Modules linked in: fuse ebtable_nat xt_CHECKSUM bridge stp llc rfcomm bnep nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6table_nat nf_nat_ipv6 ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_filter ebtables ip6table_filter ip6_tables snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm iwlwifi snd_page_alloc btusb snd_timer bluetooth cfg80211 wmi joydev thinkpad_acpi snd iTCO_wdt iTCO_vendor_support pcspkr e1000e tpm_tis tpm uvcvideo i2c_i801 soundcore sdhci_pci sdhci lpc_ich rfkill videobuf2_vmalloc videobuf2_memops videobuf2_core mmc_core mfd_core ptp videodev pps_core media uinput i915
[   63.895055]  i2c_algo_bit drm_kms_helper drm i2c_core video
[   63.896529] CPU: 2 PID: 1598 Comm: trafgen Tainted: G      D      3.13.0-rc6+ #12
[   63.898010] Hardware name: LENOVO 2429BP3/2429BP3, BIOS G4ET37WW (1.12 ) 05/29/2012
[   63.899494] task: ffff8801eca6c1a0 ti: ffff88020e694000 task.ti: ffff88020e694000
[   63.900988] RIP: 0010:[<ffffffff8168a492>]  [<ffffffff8168a492>] PageTransHuge.part.11+0x4/0x6
[   63.902498] RSP: 0018:ffff88020e695df8  EFLAGS: 00010282
[   63.903996] RAX: 005fffc000008004 RBX: ffff88020254cac8 RCX: 00000000078ed340
[   63.905492] RDX: ffffffff8116b3c6 RSI: 0000000000000001 RDI: ffff88020cf80640
[   63.906992] RBP: ffff88020e695df8 R08: 0000000000000000 R09: 0000000000000000
[   63.908485] R10: ffff8801eca6c1a0 R11: 00000000100000fb R12: ffffea00078ed340
[   63.909970] R13: ffff88020e695f48 R14: 00007f274a5f3000 R15: 00007f274a5f4000
[   63.911456] FS:  00007f274e5f3740(0000) GS:ffff88021e280000(0000) knlGS:0000000000000000
[   63.912946] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   63.914430] CR2: 0000003724a35a90 CR3: 00000001eabb6000 CR4: 00000000001407e0
[   63.915919] Stack:
[   63.917404]  ffff88020e695ee0 ffffffff8116fa7a 0000000ed78ec480 00007f274e5f2fff
[   63.918913]  0000000000000001 00007f274e5f3000 00000000810c864b ffff8801fb16aca8
[   63.920430]  0000000000000000 0000000000000000 ffff88020e695e58 0000000000000046
[   63.921938] Call Trace:
[   63.923451]  [<ffffffff8116fa7a>] munlock_vma_pages_range+0x2ea/0x2f0
[   63.924978]  [<ffffffff810a07bd>] ? trace_hardirqs_off+0xd/0x10
[   63.926454]  [<ffffffff81171b32>] ? vma_merge+0xc2/0x330
[   63.927870]  [<ffffffff8116fb7c>] mlock_fixup+0xfc/0x190
[   63.929288]  [<ffffffff8116fdc7>] do_mlockall+0x87/0xc0
[   63.930702]  [<ffffffff811702bf>] sys_munlockall+0x2f/0x50
[   63.932117]  [<ffffffff8169df52>] system_call_fastpath+0x16/0x1b
[   63.933531] Code: c1 e0 06 48 29 d8 eb 02 31 c0 5b 41 5c 5d c3 55 48 89 e5 0f 0b 55 48 89 e5 0f 0b 55 48 89 e5 0f 0b 55 48 89 e5 0f 0b 55 48 89 e5 <0f> 0b 55 48 89 e5 0f 0b 55 48 89 e5 0f 0b 55 48 8b 07 31 c9 48
[   63.935103] RIP  [<ffffffff8168a492>] PageTransHuge.part.11+0x4/0x6
[   63.936580]  RSP <ffff88020e695df8>
[   63.938021] ---[ end trace 67b7aa3fba09186d ]---

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [BUG] at include/linux/page-flags.h:415 (PageTransHuge)
  2014-01-10 18:23 [BUG] at include/linux/page-flags.h:415 (PageTransHuge) Daniel Borkmann
@ 2014-01-11  6:22 ` Andrew Morton
  2014-01-11 13:32   ` Daniel Borkmann
  0 siblings, 1 reply; 13+ messages in thread
From: Andrew Morton @ 2014-01-11  6:22 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: linux-kernel, Vlastimil Babka, Michel Lespinasse

On Fri, 10 Jan 2014 19:23:26 +0100 Daniel Borkmann <borkmann@iogearbox.net> wrote:

> This is being reliably triggered for each mmaped() packet(7)
> socket from user space, basically during unmapping resp.
> closing the TX socket.
> 
> I believe due to some change in transparent hugepages code ?
> 
> When I disable transparent hugepages, everything works fine,
> no BUG triggered.
> 
> I'd be happy to test patches.

Did the inclusion of c424be1cbbf852e46acc8 ("mm: munlock: fix a bug
where THP tail page is encountered") in current mainline fix this?

> With using default kernel config:
> 
> [   63.887947] kernel BUG at include/linux/page-flags.h:415!
> [   63.889296] invalid opcode: 0000 [#4] SMP
> [   63.890637] Modules linked in: fuse ebtable_nat xt_CHECKSUM bridge stp llc rfcomm bnep nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6table_nat nf_nat_ipv6 ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_filter ebtables ip6table_filter ip6_tables snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm iwlwifi snd_page_alloc btusb snd_timer bluetooth cfg80211 wmi joydev thinkpad_acpi snd iTCO_wdt iTCO_vendor_support pcspkr e1000e tpm_tis tpm uvcvideo i2c_i801 soundcore sdhci_pci sdhci lpc_ich rfkill videobuf2_vmalloc videobuf2_memops videobuf2_core mmc_core mfd_core ptp videodev pps_core media uinput i915
> [   63.895055]  i2c_algo_bit drm_kms_helper drm i2c_core video
> [   63.896529] CPU: 2 PID: 1598 Comm: trafgen Tainted: G      D      3.13.0-rc6+ #12
> [   63.898010] Hardware name: LENOVO 2429BP3/2429BP3, BIOS G4ET37WW (1.12 ) 05/29/2012
> [   63.899494] task: ffff8801eca6c1a0 ti: ffff88020e694000 task.ti: ffff88020e694000
> [   63.900988] RIP: 0010:[<ffffffff8168a492>]  [<ffffffff8168a492>] PageTransHuge.part.11+0x4/0x6
> [   63.902498] RSP: 0018:ffff88020e695df8  EFLAGS: 00010282
> [   63.903996] RAX: 005fffc000008004 RBX: ffff88020254cac8 RCX: 00000000078ed340
> [   63.905492] RDX: ffffffff8116b3c6 RSI: 0000000000000001 RDI: ffff88020cf80640
> [   63.906992] RBP: ffff88020e695df8 R08: 0000000000000000 R09: 0000000000000000
> [   63.908485] R10: ffff8801eca6c1a0 R11: 00000000100000fb R12: ffffea00078ed340
> [   63.909970] R13: ffff88020e695f48 R14: 00007f274a5f3000 R15: 00007f274a5f4000
> [   63.911456] FS:  00007f274e5f3740(0000) GS:ffff88021e280000(0000) knlGS:0000000000000000
> [   63.912946] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   63.914430] CR2: 0000003724a35a90 CR3: 00000001eabb6000 CR4: 00000000001407e0
> [   63.915919] Stack:
> [   63.917404]  ffff88020e695ee0 ffffffff8116fa7a 0000000ed78ec480 00007f274e5f2fff
> [   63.918913]  0000000000000001 00007f274e5f3000 00000000810c864b ffff8801fb16aca8
> [   63.920430]  0000000000000000 0000000000000000 ffff88020e695e58 0000000000000046
> [   63.921938] Call Trace:
> [   63.923451]  [<ffffffff8116fa7a>] munlock_vma_pages_range+0x2ea/0x2f0
> [   63.924978]  [<ffffffff810a07bd>] ? trace_hardirqs_off+0xd/0x10
> [   63.926454]  [<ffffffff81171b32>] ? vma_merge+0xc2/0x330
> [   63.927870]  [<ffffffff8116fb7c>] mlock_fixup+0xfc/0x190
> [   63.929288]  [<ffffffff8116fdc7>] do_mlockall+0x87/0xc0
> [   63.930702]  [<ffffffff811702bf>] sys_munlockall+0x2f/0x50
> [   63.932117]  [<ffffffff8169df52>] system_call_fastpath+0x16/0x1b
> [   63.933531] Code: c1 e0 06 48 29 d8 eb 02 31 c0 5b 41 5c 5d c3 55 48 89 e5 0f 0b 55 48 89 e5 0f 0b 55 48 89 e5 0f 0b 55 48 89 e5 0f 0b 55 48 89 e5 <0f> 0b 55 48 89 e5 0f 0b 55 48 89 e5 0f 0b 55 48 8b 07 31 c9 48
> [   63.935103] RIP  [<ffffffff8168a492>] PageTransHuge.part.11+0x4/0x6
> [   63.936580]  RSP <ffff88020e695df8>
> [   63.938021] ---[ end trace 67b7aa3fba09186d ]---


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [BUG] at include/linux/page-flags.h:415 (PageTransHuge)
  2014-01-11  6:22 ` Andrew Morton
@ 2014-01-11 13:32   ` Daniel Borkmann
  2014-01-13 10:16     ` Vlastimil Babka
  0 siblings, 1 reply; 13+ messages in thread
From: Daniel Borkmann @ 2014-01-11 13:32 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Vlastimil Babka, Michel Lespinasse

On 01/11/2014 07:22 AM, Andrew Morton wrote:
> On Fri, 10 Jan 2014 19:23:26 +0100 Daniel Borkmann <borkmann@iogearbox.net> wrote:
>
>> This is being reliably triggered for each mmaped() packet(7)
>> socket from user space, basically during unmapping resp.
>> closing the TX socket.
>>
>> I believe due to some change in transparent hugepages code ?
>>
>> When I disable transparent hugepages, everything works fine,
>> no BUG triggered.
>>
>> I'd be happy to test patches.
>
> Did the inclusion of c424be1cbbf852e46acc8 ("mm: munlock: fix a bug
> where THP tail page is encountered") in current mainline fix this?

Thanks for your answer Andrew!

Hm, I just cherry-picked that onto current net-next as I have some work
there, and this time I got ...

(User space uses packet mmap() and mlockall(MCL_CURRENT | MCL_FUTURE)
  and on shutdown munlockall() ...)

[   63.863672] ------------[ cut here ]------------
[   63.863702] kernel BUG at mm/mlock.c:507!
[   63.863721] invalid opcode: 0000 [#1] SMP
[   63.863743] Modules linked in: fuse ebtable_nat xt_CHECKSUM nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6table_nat nf_nat_ipv6 ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack bridge ebtable_filter ebtables stp llc ip6table_filter ip6_tables rfcomm bnep snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec iwlwifi cfg80211 snd_hwdep btusb snd_seq bluetooth sdhci_pci snd_seq_device e1000e tpm_tis snd_pcm thinkpad_acpi sdhci ptp tpm uvcvideo pps_core snd_page_alloc snd_timer snd rfkill mmc_core iTCO_wdt iTCO_vendor_support lpc_ich mfd_core soundcore joydev wmi videobuf2_vmalloc videobuf2_memops videobuf2_core i2c_i801 pcspkr videodev media uinput i915
[   63.864152]  i2c_algo_bit drm_kms_helper drm i2c_core video
[   63.864181] CPU: 1 PID: 1617 Comm: trafgen Not tainted 3.13.0-rc6+ #15
[   63.864209] Hardware name: LENOVO 2429BP3/2429BP3, BIOS G4ET37WW (1.12 ) 05/29/2012
[   63.864242] task: ffff8801ee060000 ti: ffff8800b5954000 task.ti: ffff8800b5954000
[   63.864274] RIP: 0010:[<ffffffff8116fa9a>]  [<ffffffff8116fa9a>] munlock_vma_pages_range+0x2ea/0x2f0
[   63.864318] RSP: 0018:ffff8800b5955e08  EFLAGS: 00010202
[   63.864341] RAX: 00000000000001ff RBX: ffff8800b58f7508 RCX: 0000000000000034
[   63.864372] RDX: 00000007f0708992 RSI: ffffea0002c3e700 RDI: ffffea0002c3e700
[   63.864402] RBP: ffff8800b5955ee0 R08: 3800000000000000 R09: a8000b0f9c000000
[   63.864432] R10: 57ffdef066c3e700 R11: ffffff5cfb00c14a R12: ffffea0002c3e700
[   63.864462] R13: ffff8800b5955f48 R14: 00007f0708992000 R15: 00007f0708992000
[   63.864492] FS:  00007f0708b92740(0000) GS:ffff88021e240000(0000) knlGS:0000000000000000
[   63.864526] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   63.864551] CR2: 00007f33bb373000 CR3: 00000000b2a2c000 CR4: 00000000001407e0
[   63.864581] Stack:
[   63.864593]  ffff8800b5955ed0 00007f0708b91fff 00007f0708b92000 ffff8800b5955e48
[   63.864632]  000001ff810c864b ffff8801ee060000 0000000000000000 0000000000000000
[   63.864669]  ffff8800b5955e58 ffff8801ee060000 0000000700000086 ffff8801ee060000
[   63.864708] Call Trace:
[   63.864724]  [<ffffffff816956bc>] ? _raw_spin_unlock_irq+0x2c/0x30
[   63.864754]  [<ffffffff81171b52>] ? vma_merge+0xc2/0x330
[   63.864786]  [<ffffffff8116fb9c>] mlock_fixup+0xfc/0x190
[   63.864812]  [<ffffffff8116fde7>] do_mlockall+0x87/0xc0
[   63.864836]  [<ffffffff811702df>] sys_munlockall+0x2f/0x50
[   63.864873]  [<ffffffff8169e192>] system_call_fastpath+0x16/0x1b
[   63.864898] Code: d7 48 89 95 28 ff ff ff e8 a4 04 fe ff 84 c0 48 8b 95 28 ff ff ff 0f 85 5a ff ff ff e9 46 ff ff ff e8 3f ac 51 00 e8 34 ac 51 00 <0f> 0b 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55
[   63.865114] RIP  [<ffffffff8116fa9a>] munlock_vma_pages_range+0x2ea/0x2f0
[   63.865148]  RSP <ffff8800b5955e08>
[   63.874968] ------------[ cut here ]------------

... when I find some time, I'll try with normal torvalds' tree, maybe some
other patches are missing as well, not sure right now.

Thanks anyway!

>> With using default kernel config:
>>
>> [   63.887947] kernel BUG at include/linux/page-flags.h:415!
>> [   63.889296] invalid opcode: 0000 [#4] SMP
>> [   63.890637] Modules linked in: fuse ebtable_nat xt_CHECKSUM bridge stp llc rfcomm bnep nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6table_nat nf_nat_ipv6 ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_filter ebtables ip6table_filter ip6_tables snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm iwlwifi snd_page_alloc btusb snd_timer bluetooth cfg80211 wmi joydev thinkpad_acpi snd iTCO_wdt iTCO_vendor_support pcspkr e1000e tpm_tis tpm uvcvideo i2c_i801 soundcore sdhci_pci sdhci lpc_ich rfkill videobuf2_vmalloc videobuf2_memops videobuf2_core mmc_core mfd_core ptp videodev pps_core media uinput i915
>> [   63.895055]  i2c_algo_bit drm_kms_helper drm i2c_core video
>> [   63.896529] CPU: 2 PID: 1598 Comm: trafgen Tainted: G      D      3.13.0-rc6+ #12
>> [   63.898010] Hardware name: LENOVO 2429BP3/2429BP3, BIOS G4ET37WW (1.12 ) 05/29/2012
>> [   63.899494] task: ffff8801eca6c1a0 ti: ffff88020e694000 task.ti: ffff88020e694000
>> [   63.900988] RIP: 0010:[<ffffffff8168a492>]  [<ffffffff8168a492>] PageTransHuge.part.11+0x4/0x6
>> [   63.902498] RSP: 0018:ffff88020e695df8  EFLAGS: 00010282
>> [   63.903996] RAX: 005fffc000008004 RBX: ffff88020254cac8 RCX: 00000000078ed340
>> [   63.905492] RDX: ffffffff8116b3c6 RSI: 0000000000000001 RDI: ffff88020cf80640
>> [   63.906992] RBP: ffff88020e695df8 R08: 0000000000000000 R09: 0000000000000000
>> [   63.908485] R10: ffff8801eca6c1a0 R11: 00000000100000fb R12: ffffea00078ed340
>> [   63.909970] R13: ffff88020e695f48 R14: 00007f274a5f3000 R15: 00007f274a5f4000
>> [   63.911456] FS:  00007f274e5f3740(0000) GS:ffff88021e280000(0000) knlGS:0000000000000000
>> [   63.912946] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [   63.914430] CR2: 0000003724a35a90 CR3: 00000001eabb6000 CR4: 00000000001407e0
>> [   63.915919] Stack:
>> [   63.917404]  ffff88020e695ee0 ffffffff8116fa7a 0000000ed78ec480 00007f274e5f2fff
>> [   63.918913]  0000000000000001 00007f274e5f3000 00000000810c864b ffff8801fb16aca8
>> [   63.920430]  0000000000000000 0000000000000000 ffff88020e695e58 0000000000000046
>> [   63.921938] Call Trace:
>> [   63.923451]  [<ffffffff8116fa7a>] munlock_vma_pages_range+0x2ea/0x2f0
>> [   63.924978]  [<ffffffff810a07bd>] ? trace_hardirqs_off+0xd/0x10
>> [   63.926454]  [<ffffffff81171b32>] ? vma_merge+0xc2/0x330
>> [   63.927870]  [<ffffffff8116fb7c>] mlock_fixup+0xfc/0x190
>> [   63.929288]  [<ffffffff8116fdc7>] do_mlockall+0x87/0xc0
>> [   63.930702]  [<ffffffff811702bf>] sys_munlockall+0x2f/0x50
>> [   63.932117]  [<ffffffff8169df52>] system_call_fastpath+0x16/0x1b
>> [   63.933531] Code: c1 e0 06 48 29 d8 eb 02 31 c0 5b 41 5c 5d c3 55 48 89 e5 0f 0b 55 48 89 e5 0f 0b 55 48 89 e5 0f 0b 55 48 89 e5 0f 0b 55 48 89 e5 <0f> 0b 55 48 89 e5 0f 0b 55 48 89 e5 0f 0b 55 48 8b 07 31 c9 48
>> [   63.935103] RIP  [<ffffffff8168a492>] PageTransHuge.part.11+0x4/0x6
>> [   63.936580]  RSP <ffff88020e695df8>
>> [   63.938021] ---[ end trace 67b7aa3fba09186d ]---
>


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [BUG] at include/linux/page-flags.h:415 (PageTransHuge)
  2014-01-11 13:32   ` Daniel Borkmann
@ 2014-01-13 10:16     ` Vlastimil Babka
  2014-01-13 11:39       ` Daniel Borkmann
  0 siblings, 1 reply; 13+ messages in thread
From: Vlastimil Babka @ 2014-01-13 10:16 UTC (permalink / raw)
  To: Daniel Borkmann, Andrew Morton; +Cc: linux-kernel, Michel Lespinasse

On 01/11/2014 02:32 PM, Daniel Borkmann wrote:
> On 01/11/2014 07:22 AM, Andrew Morton wrote:
>> On Fri, 10 Jan 2014 19:23:26 +0100 Daniel Borkmann <borkmann@iogearbox.net> wrote:
>>
>>> This is being reliably triggered for each mmaped() packet(7)
>>> socket from user space, basically during unmapping resp.
>>> closing the TX socket.
>>>
>>> I believe due to some change in transparent hugepages code ?
>>>
>>> When I disable transparent hugepages, everything works fine,
>>> no BUG triggered.
>>>
>>> I'd be happy to test patches.
>>
>> Did the inclusion of c424be1cbbf852e46acc8 ("mm: munlock: fix a bug
>> where THP tail page is encountered") in current mainline fix this?
> 
> Thanks for your answer Andrew!
> 
> Hm, I just cherry-picked that onto current net-next as I have some work
> there, and this time I got ...
> 
> (User space uses packet mmap() and mlockall(MCL_CURRENT | MCL_FUTURE)
>    and on shutdown munlockall() ...)
> 
> [   63.863672] ------------[ cut here ]------------
> [   63.863702] kernel BUG at mm/mlock.c:507!
> [   63.863721] invalid opcode: 0000 [#1] SMP
> [   63.863743] Modules linked in: fuse ebtable_nat xt_CHECKSUM nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6table_nat nf_nat_ipv6 ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack bridge ebtable_filter ebtables stp llc ip6table_filter ip6_tables rfcomm bnep snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec iwlwifi cfg80211 snd_hwdep btusb snd_seq bluetooth sdhci_pci snd_seq_device e1000e tpm_tis snd_pcm thinkpad_acpi sdhci ptp tpm uvcvideo pps_core snd_page_alloc snd_timer snd rfkill mmc_core iTCO_wdt iTCO_vendor_support lpc_ich mfd_core soundcore joydev wmi videobuf2_vmalloc videobuf2_memops videobuf2_core i2c_i801 pcspkr videodev media uinput i915
> [   63.864152]  i2c_algo_bit drm_kms_helper drm i2c_core video
> [   63.864181] CPU: 1 PID: 1617 Comm: trafgen Not tainted 3.13.0-rc6+ #15
> [   63.864209] Hardware name: LENOVO 2429BP3/2429BP3, BIOS G4ET37WW (1.12 ) 05/29/2012
> [   63.864242] task: ffff8801ee060000 ti: ffff8800b5954000 task.ti: ffff8800b5954000
> [   63.864274] RIP: 0010:[<ffffffff8116fa9a>]  [<ffffffff8116fa9a>] munlock_vma_pages_range+0x2ea/0x2f0
> [   63.864318] RSP: 0018:ffff8800b5955e08  EFLAGS: 00010202
> [   63.864341] RAX: 00000000000001ff RBX: ffff8800b58f7508 RCX: 0000000000000034
> [   63.864372] RDX: 00000007f0708992 RSI: ffffea0002c3e700 RDI: ffffea0002c3e700
> [   63.864402] RBP: ffff8800b5955ee0 R08: 3800000000000000 R09: a8000b0f9c000000
> [   63.864432] R10: 57ffdef066c3e700 R11: ffffff5cfb00c14a R12: ffffea0002c3e700
> [   63.864462] R13: ffff8800b5955f48 R14: 00007f0708992000 R15: 00007f0708992000
> [   63.864492] FS:  00007f0708b92740(0000) GS:ffff88021e240000(0000) knlGS:0000000000000000
> [   63.864526] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   63.864551] CR2: 00007f33bb373000 CR3: 00000000b2a2c000 CR4: 00000000001407e0
> [   63.864581] Stack:
> [   63.864593]  ffff8800b5955ed0 00007f0708b91fff 00007f0708b92000 ffff8800b5955e48
> [   63.864632]  000001ff810c864b ffff8801ee060000 0000000000000000 0000000000000000
> [   63.864669]  ffff8800b5955e58 ffff8801ee060000 0000000700000086 ffff8801ee060000
> [   63.864708] Call Trace:
> [   63.864724]  [<ffffffff816956bc>] ? _raw_spin_unlock_irq+0x2c/0x30
> [   63.864754]  [<ffffffff81171b52>] ? vma_merge+0xc2/0x330
> [   63.864786]  [<ffffffff8116fb9c>] mlock_fixup+0xfc/0x190
> [   63.864812]  [<ffffffff8116fde7>] do_mlockall+0x87/0xc0
> [   63.864836]  [<ffffffff811702df>] sys_munlockall+0x2f/0x50
> [   63.864873]  [<ffffffff8169e192>] system_call_fastpath+0x16/0x1b
> [   63.864898] Code: d7 48 89 95 28 ff ff ff e8 a4 04 fe ff 84 c0 48 8b 95 28 ff ff ff 0f 85 5a ff ff ff e9 46 ff ff ff e8 3f ac 51 00 e8 34 ac 51 00 <0f> 0b 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55
> [   63.865114] RIP  [<ffffffff8116fa9a>] munlock_vma_pages_range+0x2ea/0x2f0
> [   63.865148]  RSP <ffff8800b5955e08>
> [   63.874968] ------------[ cut here ]------------
> 
> ... when I find some time, I'll try with normal torvalds' tree, maybe some
> other patches are missing as well, not sure right now.

Uh so the triggered assertion is the one added by this very patch, and there are no more changes wrt this in mainline.

If you can still try debug patches, please try this. Thanks.

From: Vlastimil Babka <vbabka@suse.cz>
Date: Mon, 13 Jan 2014 11:13:53 +0100
Subject: [PATCH] debug munlock_vma_pages_range

---
 mm/mlock.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/mm/mlock.c b/mm/mlock.c
index c59c420..7d0e29a 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -448,12 +448,14 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
 void munlock_vma_pages_range(struct vm_area_struct *vma,
 			     unsigned long start, unsigned long end)
 {
+	unsigned long orig_start = start;
+	unsigned long page_increm = 0;
+
 	vma->vm_flags &= ~VM_LOCKED;
 
 	while (start < end) {
 		struct page *page = NULL;
 		unsigned int page_mask;
-		unsigned long page_increm;
 		struct pagevec pvec;
 		struct zone *zone;
 		int zoneid;
@@ -504,7 +506,23 @@ void munlock_vma_pages_range(struct vm_area_struct *vma,
 			}
 		}
 		/* It's a bug to munlock in the middle of a THP page */
-		VM_BUG_ON((start >> PAGE_SHIFT) & page_mask);
+		if ((start >> PAGE_SHIFT) & page_mask) {
+			dump_page(page);
+			printk("start=%lu pfn=%lu orig_start=%lu "
+			       "prev_page_increm=%lu page_mask=%u "
+			       "vm_start=%lu vm_end=%lu vm_flags=%lu\n",
+				start, page_to_pfn(page), orig_start,
+				page_increm, page_mask,
+				vma->vm_start, vma->vm_end,
+				vma->vm_flags);
+			if (PageTail(page)) {
+				struct page *first_page = page->first_page;
+				printk("first_page pfn=%lu\n",
+						page_to_pfn(first_page));
+				dump_page(first_page);
+			}
+			VM_BUG_ON(true);
+		}
 		page_increm = 1 + page_mask;
 		start += page_increm * PAGE_SIZE;
 next:
-- 
1.8.4



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [BUG] at include/linux/page-flags.h:415 (PageTransHuge)
  2014-01-13 10:16     ` Vlastimil Babka
@ 2014-01-13 11:39       ` Daniel Borkmann
  2014-01-15 14:27         ` Vlastimil Babka
  0 siblings, 1 reply; 13+ messages in thread
From: Daniel Borkmann @ 2014-01-13 11:39 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: Andrew Morton, linux-kernel, Michel Lespinasse

On 01/13/2014 11:16 AM, Vlastimil Babka wrote:
> On 01/11/2014 02:32 PM, Daniel Borkmann wrote:
>> On 01/11/2014 07:22 AM, Andrew Morton wrote:
>>> On Fri, 10 Jan 2014 19:23:26 +0100 Daniel Borkmann <borkmann@iogearbox.net> wrote:
>>>
>>>> This is being reliably triggered for each mmaped() packet(7)
>>>> socket from user space, basically during unmapping resp.
>>>> closing the TX socket.
>>>>
>>>> I believe due to some change in transparent hugepages code ?
>>>>
>>>> When I disable transparent hugepages, everything works fine,
>>>> no BUG triggered.
>>>>
>>>> I'd be happy to test patches.
>>>
>>> Did the inclusion of c424be1cbbf852e46acc8 ("mm: munlock: fix a bug
>>> where THP tail page is encountered") in current mainline fix this?
>>
>> Thanks for your answer Andrew!
>>
>> Hm, I just cherry-picked that onto current net-next as I have some work
>> there, and this time I got ...
>>
>> (User space uses packet mmap() and mlockall(MCL_CURRENT | MCL_FUTURE)
>>     and on shutdown munlockall() ...)
>>
>> [   63.863672] ------------[ cut here ]------------
>> [   63.863702] kernel BUG at mm/mlock.c:507!
>> [   63.863721] invalid opcode: 0000 [#1] SMP
>> [   63.863743] Modules linked in: fuse ebtable_nat xt_CHECKSUM nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6table_nat nf_nat_ipv6 ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack bridge ebtable_filter ebtables stp llc ip6table_filter ip6_tables rfcomm bnep snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec iwlwifi cfg80211 snd_hwdep btusb snd_seq bluetooth sdhci_pci snd_seq_device e1000e tpm_tis snd_pcm thinkpad_acpi sdhci ptp tpm uvcvideo pps_core snd_page_alloc snd_timer snd rfkill mmc_core iTCO_wdt iTCO_vendor_support lpc_ich mfd_core soundcore joydev wmi videobuf2_vmalloc videobuf2_memops videobuf2_core i2c_i801 pcspkr videodev media uinput i915
>> [   63.864152]  i2c_algo_bit drm_kms_helper drm i2c_core video
>> [   63.864181] CPU: 1 PID: 1617 Comm: trafgen Not tainted 3.13.0-rc6+ #15
>> [   63.864209] Hardware name: LENOVO 2429BP3/2429BP3, BIOS G4ET37WW (1.12 ) 05/29/2012
>> [   63.864242] task: ffff8801ee060000 ti: ffff8800b5954000 task.ti: ffff8800b5954000
>> [   63.864274] RIP: 0010:[<ffffffff8116fa9a>]  [<ffffffff8116fa9a>] munlock_vma_pages_range+0x2ea/0x2f0
>> [   63.864318] RSP: 0018:ffff8800b5955e08  EFLAGS: 00010202
>> [   63.864341] RAX: 00000000000001ff RBX: ffff8800b58f7508 RCX: 0000000000000034
>> [   63.864372] RDX: 00000007f0708992 RSI: ffffea0002c3e700 RDI: ffffea0002c3e700
>> [   63.864402] RBP: ffff8800b5955ee0 R08: 3800000000000000 R09: a8000b0f9c000000
>> [   63.864432] R10: 57ffdef066c3e700 R11: ffffff5cfb00c14a R12: ffffea0002c3e700
>> [   63.864462] R13: ffff8800b5955f48 R14: 00007f0708992000 R15: 00007f0708992000
>> [   63.864492] FS:  00007f0708b92740(0000) GS:ffff88021e240000(0000) knlGS:0000000000000000
>> [   63.864526] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [   63.864551] CR2: 00007f33bb373000 CR3: 00000000b2a2c000 CR4: 00000000001407e0
>> [   63.864581] Stack:
>> [   63.864593]  ffff8800b5955ed0 00007f0708b91fff 00007f0708b92000 ffff8800b5955e48
>> [   63.864632]  000001ff810c864b ffff8801ee060000 0000000000000000 0000000000000000
>> [   63.864669]  ffff8800b5955e58 ffff8801ee060000 0000000700000086 ffff8801ee060000
>> [   63.864708] Call Trace:
>> [   63.864724]  [<ffffffff816956bc>] ? _raw_spin_unlock_irq+0x2c/0x30
>> [   63.864754]  [<ffffffff81171b52>] ? vma_merge+0xc2/0x330
>> [   63.864786]  [<ffffffff8116fb9c>] mlock_fixup+0xfc/0x190
>> [   63.864812]  [<ffffffff8116fde7>] do_mlockall+0x87/0xc0
>> [   63.864836]  [<ffffffff811702df>] sys_munlockall+0x2f/0x50
>> [   63.864873]  [<ffffffff8169e192>] system_call_fastpath+0x16/0x1b
>> [   63.864898] Code: d7 48 89 95 28 ff ff ff e8 a4 04 fe ff 84 c0 48 8b 95 28 ff ff ff 0f 85 5a ff ff ff e9 46 ff ff ff e8 3f ac 51 00 e8 34 ac 51 00 <0f> 0b 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55
>> [   63.865114] RIP  [<ffffffff8116fa9a>] munlock_vma_pages_range+0x2ea/0x2f0
>> [   63.865148]  RSP <ffff8800b5955e08>
>> [   63.874968] ------------[ cut here ]------------
>>
>> ... when I find some time, I'll try with normal torvalds' tree, maybe some
>> other patches are missing as well, not sure right now.
>
> Uh so the triggered assertion is the one added by this very patch, and there are no more changes wrt this in mainline.
>
> If you can still try debug patches, please try this. Thanks.

Yes, thanks, I'll come back to you some time by today.

> From: Vlastimil Babka <vbabka@suse.cz>
> Date: Mon, 13 Jan 2014 11:13:53 +0100
> Subject: [PATCH] debug munlock_vma_pages_range
>
> ---
>   mm/mlock.c | 22 ++++++++++++++++++++--
>   1 file changed, 20 insertions(+), 2 deletions(-)
>
> diff --git a/mm/mlock.c b/mm/mlock.c
> index c59c420..7d0e29a 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -448,12 +448,14 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
>   void munlock_vma_pages_range(struct vm_area_struct *vma,
>   			     unsigned long start, unsigned long end)
>   {
> +	unsigned long orig_start = start;
> +	unsigned long page_increm = 0;
> +
>   	vma->vm_flags &= ~VM_LOCKED;
>
>   	while (start < end) {
>   		struct page *page = NULL;
>   		unsigned int page_mask;
> -		unsigned long page_increm;
>   		struct pagevec pvec;
>   		struct zone *zone;
>   		int zoneid;
> @@ -504,7 +506,23 @@ void munlock_vma_pages_range(struct vm_area_struct *vma,
>   			}
>   		}
>   		/* It's a bug to munlock in the middle of a THP page */
> -		VM_BUG_ON((start >> PAGE_SHIFT) & page_mask);
> +		if ((start >> PAGE_SHIFT) & page_mask) {
> +			dump_page(page);
> +			printk("start=%lu pfn=%lu orig_start=%lu "
> +			       "prev_page_increm=%lu page_mask=%u "
> +			       "vm_start=%lu vm_end=%lu vm_flags=%lu\n",
> +				start, page_to_pfn(page), orig_start,
> +				page_increm, page_mask,
> +				vma->vm_start, vma->vm_end,
> +				vma->vm_flags);
> +			if (PageTail(page)) {
> +				struct page *first_page = page->first_page;
> +				printk("first_page pfn=%lu\n",
> +						page_to_pfn(first_page));
> +				dump_page(first_page);
> +			}
> +			VM_BUG_ON(true);
> +		}
>   		page_increm = 1 + page_mask;
>   		start += page_increm * PAGE_SIZE;
>   next:
>


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [BUG] at include/linux/page-flags.h:415 (PageTransHuge)
  2014-01-13 11:39       ` Daniel Borkmann
@ 2014-01-15 14:27         ` Vlastimil Babka
  2014-01-15 16:06           ` Daniel Borkmann
  0 siblings, 1 reply; 13+ messages in thread
From: Vlastimil Babka @ 2014-01-15 14:27 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Andrew Morton, linux-kernel, Michel Lespinasse, linux-mm, Jared Hulbert

On 01/13/2014 12:39 PM, Daniel Borkmann wrote:
> On 01/13/2014 11:16 AM, Vlastimil Babka wrote:
>> On 01/11/2014 02:32 PM, Daniel Borkmann wrote:
>>> On 01/11/2014 07:22 AM, Andrew Morton wrote:
>>>> On Fri, 10 Jan 2014 19:23:26 +0100 Daniel Borkmann <borkmann@iogearbox.net> wrote:
>>>>
>>>>> This is being reliably triggered for each mmaped() packet(7)
>>>>> socket from user space, basically during unmapping resp.
>>>>> closing the TX socket.
>>>>>
>>>>> I believe due to some change in transparent hugepages code ?
>>>>>
>>>>> When I disable transparent hugepages, everything works fine,
>>>>> no BUG triggered.
>>>>>
>>>>> I'd be happy to test patches.
>>>>
>>>> Did the inclusion of c424be1cbbf852e46acc8 ("mm: munlock: fix a bug
>>>> where THP tail page is encountered") in current mainline fix this?
>>>
>>> Thanks for your answer Andrew!
>>>
>>> Hm, I just cherry-picked that onto current net-next as I have some work
>>> there, and this time I got ...
>>>
>>> (User space uses packet mmap() and mlockall(MCL_CURRENT | MCL_FUTURE)
>>>      and on shutdown munlockall() ...)
>>>
>>> [   63.863672] ------------[ cut here ]------------
>>> [   63.863702] kernel BUG at mm/mlock.c:507!
>>> [   63.863721] invalid opcode: 0000 [#1] SMP
>>> [   63.863743] Modules linked in: fuse ebtable_nat xt_CHECKSUM nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6table_nat nf_nat_ipv6 ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack bridge ebtable_filter ebtables stp llc ip6table_filter ip6_tables rfcomm bnep snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec iwlwifi cfg80211 snd_hwdep btusb snd_seq bluetooth sdhci_pci snd_seq_device e1000e tpm_tis snd_pcm thinkpad_acpi sdhci ptp tpm uvcvideo pps_core snd_page_alloc snd_timer snd rfkill mmc_core iTCO_wdt iTCO_vendor_support lpc_ich mfd_core soundcore joydev wmi videobuf2_vmalloc videobuf2_memops videobuf2_core i2c_i801 pcspkr videodev media uinput i915
>>> [   63.864152]  i2c_algo_bit drm_kms_helper drm i2c_core video
>>> [   63.864181] CPU: 1 PID: 1617 Comm: trafgen Not tainted 3.13.0-rc6+ #15
>>> [   63.864209] Hardware name: LENOVO 2429BP3/2429BP3, BIOS G4ET37WW (1.12 ) 05/29/2012
>>> [   63.864242] task: ffff8801ee060000 ti: ffff8800b5954000 task.ti: ffff8800b5954000
>>> [   63.864274] RIP: 0010:[<ffffffff8116fa9a>]  [<ffffffff8116fa9a>] munlock_vma_pages_range+0x2ea/0x2f0
>>> [   63.864318] RSP: 0018:ffff8800b5955e08  EFLAGS: 00010202
>>> [   63.864341] RAX: 00000000000001ff RBX: ffff8800b58f7508 RCX: 0000000000000034
>>> [   63.864372] RDX: 00000007f0708992 RSI: ffffea0002c3e700 RDI: ffffea0002c3e700
>>> [   63.864402] RBP: ffff8800b5955ee0 R08: 3800000000000000 R09: a8000b0f9c000000
>>> [   63.864432] R10: 57ffdef066c3e700 R11: ffffff5cfb00c14a R12: ffffea0002c3e700
>>> [   63.864462] R13: ffff8800b5955f48 R14: 00007f0708992000 R15: 00007f0708992000
>>> [   63.864492] FS:  00007f0708b92740(0000) GS:ffff88021e240000(0000) knlGS:0000000000000000
>>> [   63.864526] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [   63.864551] CR2: 00007f33bb373000 CR3: 00000000b2a2c000 CR4: 00000000001407e0
>>> [   63.864581] Stack:
>>> [   63.864593]  ffff8800b5955ed0 00007f0708b91fff 00007f0708b92000 ffff8800b5955e48
>>> [   63.864632]  000001ff810c864b ffff8801ee060000 0000000000000000 0000000000000000
>>> [   63.864669]  ffff8800b5955e58 ffff8801ee060000 0000000700000086 ffff8801ee060000
>>> [   63.864708] Call Trace:
>>> [   63.864724]  [<ffffffff816956bc>] ? _raw_spin_unlock_irq+0x2c/0x30
>>> [   63.864754]  [<ffffffff81171b52>] ? vma_merge+0xc2/0x330
>>> [   63.864786]  [<ffffffff8116fb9c>] mlock_fixup+0xfc/0x190
>>> [   63.864812]  [<ffffffff8116fde7>] do_mlockall+0x87/0xc0
>>> [   63.864836]  [<ffffffff811702df>] sys_munlockall+0x2f/0x50
>>> [   63.864873]  [<ffffffff8169e192>] system_call_fastpath+0x16/0x1b
>>> [   63.864898] Code: d7 48 89 95 28 ff ff ff e8 a4 04 fe ff 84 c0 48 8b 95 28 ff ff ff 0f 85 5a ff ff ff e9 46 ff ff ff e8 3f ac 51 00 e8 34 ac 51 00 <0f> 0b 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55
>>> [   63.865114] RIP  [<ffffffff8116fa9a>] munlock_vma_pages_range+0x2ea/0x2f0
>>> [   63.865148]  RSP <ffff8800b5955e08>
>>> [   63.874968] ------------[ cut here ]------------
>>>
>>> ... when I find some time, I'll try with normal torvalds' tree, maybe some
>>> other patches are missing as well, not sure right now.
>>
>> Uh so the triggered assertion is the one added by this very patch, and there are no more changes wrt this in mainline.
>>
>> If you can still try debug patches, please try this. Thanks.
> 
> Yes, thanks, I'll come back to you some time by today.

Daniel sent me (off-list) instructions to reproduce:

> Then in the kernel source tree, you'll find:
> 
>    tools/testing/selftests/net/
> 
> There, just do a 'make' and run ./psock_tpacket

It reproduces deterministically in mainline since 3.12, i.e. my munlock
performance series. Based on the initial debug output, I've expanded the
debug patch below a bit:

>> From: Vlastimil Babka <vbabka@suse.cz>
>> Date: Mon, 13 Jan 2014 11:13:53 +0100
>> Subject: [PATCH] debug munlock_vma_pages_range
>>
>> ---
>>    mm/mlock.c | 22 ++++++++++++++++++++--
>>    1 file changed, 20 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/mlock.c b/mm/mlock.c
>> index c59c420..7d0e29a 100644
>> --- a/mm/mlock.c
>> +++ b/mm/mlock.c
>> @@ -448,12 +448,14 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
>>    void munlock_vma_pages_range(struct vm_area_struct *vma,
>>    			     unsigned long start, unsigned long end)
>>    {
>> +	unsigned long orig_start = start;
>> +	unsigned long page_increm = 0;
>> +
>>    	vma->vm_flags &= ~VM_LOCKED;
>>
>>    	while (start < end) {
>>    		struct page *page = NULL;
>>    		unsigned int page_mask;
>> -		unsigned long page_increm;
>>    		struct pagevec pvec;
>>    		struct zone *zone;
>>    		int zoneid;
>> @@ -504,7 +506,23 @@ void munlock_vma_pages_range(struct vm_area_struct *vma,
>>    			}
>>    		}
>>    		/* It's a bug to munlock in the middle of a THP page */
>> -		VM_BUG_ON((start >> PAGE_SHIFT) & page_mask);
>> +		if ((start >> PAGE_SHIFT) & page_mask) {
>> +			dump_page(page);
>> +			printk("start=%lu pfn=%lu orig_start=%lu "
>> +			       "prev_page_increm=%lu page_mask=%u "
>> +			       "vm_start=%lu vm_end=%lu vm_flags=%lu\n",
>> +				start, page_to_pfn(page), orig_start,
>> +				page_increm, page_mask,
>> +				vma->vm_start, vma->vm_end,
>> +				vma->vm_flags);
+                        printk("vm_ops=%pF, open=%pF, fault=%pF, remap_pages=%pF\n", vma->vm_ops,
+                                vma->vm_ops->open, vma->vm_ops->fault, vma->vm_ops->remap_pages);
+                        if (PageCompound(page)) {
+                                printk("page is compound with order=%d\n", compound_order(page));
+                        }
>> +			if (PageTail(page)) {
>> +				struct page *first_page = page->first_page;
>> +				printk("first_page pfn=%lu\n",
>> +						page_to_pfn(first_page));
>> +				dump_page(first_page);
>> +			}
>> +			VM_BUG_ON(true);
>> +		}
>>    		page_increm = 1 + page_mask;
>>    		start += page_increm * PAGE_SIZE;
>>    next:
>>

And got output like this:

page:ffffea0002474a40 count:5 mapcount:1 mapping:          (null) index:0x0
page flags: 0x100000000004004(referenced|head)
start=140242647736320 pfn=682616 orig_start=140242647736320 prev_page_increm=0 page_mask=511 vm_start=140242647736320 vm_end=140242651930624 vm_flags=268435707
vm_ops=packet_mmap_ops+0x0/0xfffffffffffff8e0 [af_packet], open=packet_mm_open+0x0/0x30 [af_packet], fault=          (null), remap_pages=          (null)
page is compound with order=2

Observations:
- address 140242647736320 is where the vma starts, and is not aligned to 512 pages
  (so it cannot be a THP head which the munlock expects). Yet there is a head page
  that triggers the PageTransHuge() and consequently hpage_nr_pages() in munlock_vma_page()
  That's why page_mask is determined to be 511 and the code thinks it's in the
  middle of a THP page.
- in fact, the page is a compound page with order=2
- the VM flags (except (may)read/write) are VM_SHARED and VM_MIXEDMAP
- the vma was mmapped by packet_mmap() (net/packet/af_packet.c) which uses
  vm_insert_page(), which adds the VM_MIXEDMAP flag
- the buffers that are mapped were allocated by alloc_one_pg_vec_page()
  where flags indeed include __GFP_COMP

So clearly there is a way to have mlock/munlock operate on a vma that contains
compound pages and confuse the checks for PageTransHuge().

The checks for THP in munlock came with commit ff6a6da60b89 ("mm: accelerate munlock()
treatment of THP pages"), i.e. since 3.9, but did not trigger a bug. It however
makes munlock_vma_pages_range() skip pages until the next 512-pages-aligned page,
when it encounters a head page. If the head page is of smaller order and is followed
by normal LRU pages (theoretically, I'm not sure if that's possible, or done anywhere),
they wouldn't get munlocked.

My commit 7225522bb429 ("mm: munlock: batch non-THP page isolation and
munlock+putback using pagevec") (since 3.12) has added a new PageTransHuge() check
that can trigger on tail pages of the compound page here. Commit c424be1cbbf852e46acc8
("mm: munlock: fix a bug where THP tail page is encountered") in current rc's removes
one class of bugs here, but still non-THP compound pages are not expected in mlock/munlock,
which leads to this assertion failing.

The question is what is the correct fix, and I'm not that familiar with VM_MIXEDMAP
to decide.

Option 1: mlocking VM_MIXEDMAP vma's has no sense. They should be treated like VM_PFNMAP
          and added to VM_SPECIAL, which makes m(un)lock skip them completely.

Option 2: if indeed VM_MIXEDMAP can contain PageLRU pages for which mlocking is useful,
          VM_NO_THP should be checked in munlock before attempting PageTransHuge() and
          friends. VM_NO_THP already contains VM_MIXEDMAP, so knowing that there can be
          no THP means we don't try optimize for it and no unexpected head pages trip us.

Thoughts?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [BUG] at include/linux/page-flags.h:415 (PageTransHuge)
  2014-01-15 14:27         ` Vlastimil Babka
@ 2014-01-15 16:06           ` Daniel Borkmann
  2014-01-31 14:40             ` Vlastimil Babka
  0 siblings, 1 reply; 13+ messages in thread
From: Daniel Borkmann @ 2014-01-15 16:06 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, linux-kernel, Michel Lespinasse, linux-mm,
	Jared Hulbert, netdev

[keeping netdev in loop as well]

On 01/15/2014 03:27 PM, Vlastimil Babka wrote:
> On 01/13/2014 12:39 PM, Daniel Borkmann wrote:
>> On 01/13/2014 11:16 AM, Vlastimil Babka wrote:
>>> On 01/11/2014 02:32 PM, Daniel Borkmann wrote:
>>>> On 01/11/2014 07:22 AM, Andrew Morton wrote:
>>>>> On Fri, 10 Jan 2014 19:23:26 +0100 Daniel Borkmann <borkmann@iogearbox.net> wrote:
>>>>>
>>>>>> This is being reliably triggered for each mmaped() packet(7)
>>>>>> socket from user space, basically during unmapping resp.
>>>>>> closing the TX socket.
>>>>>>
>>>>>> I believe due to some change in transparent hugepages code ?
>>>>>>
>>>>>> When I disable transparent hugepages, everything works fine,
>>>>>> no BUG triggered.
>>>>>>
>>>>>> I'd be happy to test patches.
>>>>>
>>>>> Did the inclusion of c424be1cbbf852e46acc8 ("mm: munlock: fix a bug
>>>>> where THP tail page is encountered") in current mainline fix this?
>>>>
>>>> Thanks for your answer Andrew!
>>>>
>>>> Hm, I just cherry-picked that onto current net-next as I have some work
>>>> there, and this time I got ...
>>>>
>>>> (User space uses packet mmap() and mlockall(MCL_CURRENT | MCL_FUTURE)
>>>>       and on shutdown munlockall() ...)
>>>>
>>>> [   63.863672] ------------[ cut here ]------------
>>>> [   63.863702] kernel BUG at mm/mlock.c:507!
>>>> [   63.863721] invalid opcode: 0000 [#1] SMP
>>>> [   63.863743] Modules linked in: fuse ebtable_nat xt_CHECKSUM nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6table_nat nf_nat_ipv6 ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack bridge ebtable_filter ebtables stp llc ip6table_filter ip6_tables rfcomm bnep snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec iwlwifi cfg80211 snd_hwdep btusb snd_seq bluetooth sdhci_pci snd_seq_device e1000e tpm_tis snd_pcm thinkpad_acpi sdhci ptp tpm uvcvideo pps_core snd_page_alloc snd_timer snd rfkill mmc_core iTCO_wdt iTCO_vendor_support lpc_ich mfd_core soundcore joydev wmi videobuf2_vmalloc videobuf2_memops videobuf2_core i2c_i801 pcspkr videodev media uinput i915
>>>> [   63.864152]  i2c_algo_bit drm_kms_helper drm i2c_core video
>>>> [   63.864181] CPU: 1 PID: 1617 Comm: trafgen Not tainted 3.13.0-rc6+ #15
>>>> [   63.864209] Hardware name: LENOVO 2429BP3/2429BP3, BIOS G4ET37WW (1.12 ) 05/29/2012
>>>> [   63.864242] task: ffff8801ee060000 ti: ffff8800b5954000 task.ti: ffff8800b5954000
>>>> [   63.864274] RIP: 0010:[<ffffffff8116fa9a>]  [<ffffffff8116fa9a>] munlock_vma_pages_range+0x2ea/0x2f0
>>>> [   63.864318] RSP: 0018:ffff8800b5955e08  EFLAGS: 00010202
>>>> [   63.864341] RAX: 00000000000001ff RBX: ffff8800b58f7508 RCX: 0000000000000034
>>>> [   63.864372] RDX: 00000007f0708992 RSI: ffffea0002c3e700 RDI: ffffea0002c3e700
>>>> [   63.864402] RBP: ffff8800b5955ee0 R08: 3800000000000000 R09: a8000b0f9c000000
>>>> [   63.864432] R10: 57ffdef066c3e700 R11: ffffff5cfb00c14a R12: ffffea0002c3e700
>>>> [   63.864462] R13: ffff8800b5955f48 R14: 00007f0708992000 R15: 00007f0708992000
>>>> [   63.864492] FS:  00007f0708b92740(0000) GS:ffff88021e240000(0000) knlGS:0000000000000000
>>>> [   63.864526] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [   63.864551] CR2: 00007f33bb373000 CR3: 00000000b2a2c000 CR4: 00000000001407e0
>>>> [   63.864581] Stack:
>>>> [   63.864593]  ffff8800b5955ed0 00007f0708b91fff 00007f0708b92000 ffff8800b5955e48
>>>> [   63.864632]  000001ff810c864b ffff8801ee060000 0000000000000000 0000000000000000
>>>> [   63.864669]  ffff8800b5955e58 ffff8801ee060000 0000000700000086 ffff8801ee060000
>>>> [   63.864708] Call Trace:
>>>> [   63.864724]  [<ffffffff816956bc>] ? _raw_spin_unlock_irq+0x2c/0x30
>>>> [   63.864754]  [<ffffffff81171b52>] ? vma_merge+0xc2/0x330
>>>> [   63.864786]  [<ffffffff8116fb9c>] mlock_fixup+0xfc/0x190
>>>> [   63.864812]  [<ffffffff8116fde7>] do_mlockall+0x87/0xc0
>>>> [   63.864836]  [<ffffffff811702df>] sys_munlockall+0x2f/0x50
>>>> [   63.864873]  [<ffffffff8169e192>] system_call_fastpath+0x16/0x1b
>>>> [   63.864898] Code: d7 48 89 95 28 ff ff ff e8 a4 04 fe ff 84 c0 48 8b 95 28 ff ff ff 0f 85 5a ff ff ff e9 46 ff ff ff e8 3f ac 51 00 e8 34 ac 51 00 <0f> 0b 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55
>>>> [   63.865114] RIP  [<ffffffff8116fa9a>] munlock_vma_pages_range+0x2ea/0x2f0
>>>> [   63.865148]  RSP <ffff8800b5955e08>
>>>> [   63.874968] ------------[ cut here ]------------
>>>>
>>>> ... when I find some time, I'll try with normal torvalds' tree, maybe some
>>>> other patches are missing as well, not sure right now.
>>>
>>> Uh so the triggered assertion is the one added by this very patch, and there are no more changes wrt this in mainline.
>>>
>>> If you can still try debug patches, please try this. Thanks.
>>
>> Yes, thanks, I'll come back to you some time by today.
>
> Daniel sent me (off-list) instructions to reproduce:
>
>> Then in the kernel source tree, you'll find:
>>
>>     tools/testing/selftests/net/
>>
>> There, just do a 'make' and run ./psock_tpacket
>
> It reproduces deterministically in mainline since 3.12, i.e. my munlock
> performance series. Based on the initial debug output, I've expanded the
> debug patch below a bit:
>
>>> From: Vlastimil Babka <vbabka@suse.cz>
>>> Date: Mon, 13 Jan 2014 11:13:53 +0100
>>> Subject: [PATCH] debug munlock_vma_pages_range
>>>
>>> ---
>>>     mm/mlock.c | 22 ++++++++++++++++++++--
>>>     1 file changed, 20 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/mlock.c b/mm/mlock.c
>>> index c59c420..7d0e29a 100644
>>> --- a/mm/mlock.c
>>> +++ b/mm/mlock.c
>>> @@ -448,12 +448,14 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
>>>     void munlock_vma_pages_range(struct vm_area_struct *vma,
>>>     			     unsigned long start, unsigned long end)
>>>     {
>>> +	unsigned long orig_start = start;
>>> +	unsigned long page_increm = 0;
>>> +
>>>     	vma->vm_flags &= ~VM_LOCKED;
>>>
>>>     	while (start < end) {
>>>     		struct page *page = NULL;
>>>     		unsigned int page_mask;
>>> -		unsigned long page_increm;
>>>     		struct pagevec pvec;
>>>     		struct zone *zone;
>>>     		int zoneid;
>>> @@ -504,7 +506,23 @@ void munlock_vma_pages_range(struct vm_area_struct *vma,
>>>     			}
>>>     		}
>>>     		/* It's a bug to munlock in the middle of a THP page */
>>> -		VM_BUG_ON((start >> PAGE_SHIFT) & page_mask);
>>> +		if ((start >> PAGE_SHIFT) & page_mask) {
>>> +			dump_page(page);
>>> +			printk("start=%lu pfn=%lu orig_start=%lu "
>>> +			       "prev_page_increm=%lu page_mask=%u "
>>> +			       "vm_start=%lu vm_end=%lu vm_flags=%lu\n",
>>> +				start, page_to_pfn(page), orig_start,
>>> +				page_increm, page_mask,
>>> +				vma->vm_start, vma->vm_end,
>>> +				vma->vm_flags);
> +                        printk("vm_ops=%pF, open=%pF, fault=%pF, remap_pages=%pF\n", vma->vm_ops,
> +                                vma->vm_ops->open, vma->vm_ops->fault, vma->vm_ops->remap_pages);
> +                        if (PageCompound(page)) {
> +                                printk("page is compound with order=%d\n", compound_order(page));
> +                        }
>>> +			if (PageTail(page)) {
>>> +				struct page *first_page = page->first_page;
>>> +				printk("first_page pfn=%lu\n",
>>> +						page_to_pfn(first_page));
>>> +				dump_page(first_page);
>>> +			}
>>> +			VM_BUG_ON(true);
>>> +		}
>>>     		page_increm = 1 + page_mask;
>>>     		start += page_increm * PAGE_SIZE;
>>>     next:
>>>
>
> And got output like this:
>
> page:ffffea0002474a40 count:5 mapcount:1 mapping:          (null) index:0x0
> page flags: 0x100000000004004(referenced|head)
> start=140242647736320 pfn=682616 orig_start=140242647736320 prev_page_increm=0 page_mask=511 vm_start=140242647736320 vm_end=140242651930624 vm_flags=268435707
> vm_ops=packet_mmap_ops+0x0/0xfffffffffffff8e0 [af_packet], open=packet_mm_open+0x0/0x30 [af_packet], fault=          (null), remap_pages=          (null)
> page is compound with order=2
>
> Observations:
> - address 140242647736320 is where the vma starts, and is not aligned to 512 pages
>    (so it cannot be a THP head which the munlock expects). Yet there is a head page
>    that triggers the PageTransHuge() and consequently hpage_nr_pages() in munlock_vma_page()
>    That's why page_mask is determined to be 511 and the code thinks it's in the
>    middle of a THP page.
> - in fact, the page is a compound page with order=2
> - the VM flags (except (may)read/write) are VM_SHARED and VM_MIXEDMAP
> - the vma was mmapped by packet_mmap() (net/packet/af_packet.c) which uses
>    vm_insert_page(), which adds the VM_MIXEDMAP flag
> - the buffers that are mapped were allocated by alloc_one_pg_vec_page()
>    where flags indeed include __GFP_COMP
>
> So clearly there is a way to have mlock/munlock operate on a vma that contains
> compound pages and confuse the checks for PageTransHuge().
>
> The checks for THP in munlock came with commit ff6a6da60b89 ("mm: accelerate munlock()
> treatment of THP pages"), i.e. since 3.9, but did not trigger a bug. It however
> makes munlock_vma_pages_range() skip pages until the next 512-pages-aligned page,
> when it encounters a head page. If the head page is of smaller order and is followed
> by normal LRU pages (theoretically, I'm not sure if that's possible, or done anywhere),
> they wouldn't get munlocked.
>
> My commit 7225522bb429 ("mm: munlock: batch non-THP page isolation and
> munlock+putback using pagevec") (since 3.12) has added a new PageTransHuge() check
> that can trigger on tail pages of the compound page here. Commit c424be1cbbf852e46acc8
> ("mm: munlock: fix a bug where THP tail page is encountered") in current rc's removes
> one class of bugs here, but still non-THP compound pages are not expected in mlock/munlock,
> which leads to this assertion failing.
>
> The question is what is the correct fix, and I'm not that familiar with VM_MIXEDMAP
> to decide.
>
> Option 1: mlocking VM_MIXEDMAP vma's has no sense. They should be treated like VM_PFNMAP
>            and added to VM_SPECIAL, which makes m(un)lock skip them completely.
>
> Option 2: if indeed VM_MIXEDMAP can contain PageLRU pages for which mlocking is useful,
>            VM_NO_THP should be checked in munlock before attempting PageTransHuge() and
>            friends. VM_NO_THP already contains VM_MIXEDMAP, so knowing that there can be
>            no THP means we don't try optimize for it and no unexpected head pages trip us.
>
> Thoughts?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [BUG] at include/linux/page-flags.h:415 (PageTransHuge)
  2014-01-15 16:06           ` Daniel Borkmann
@ 2014-01-31 14:40             ` Vlastimil Babka
  2014-01-31 14:58               ` Thomas Hellstrom
  2014-02-07 18:58               ` Hannes Frederic Sowa
  0 siblings, 2 replies; 13+ messages in thread
From: Vlastimil Babka @ 2014-01-31 14:40 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Andrew Morton, linux-kernel, Michel Lespinasse, linux-mm,
	Jared Hulbert, netdev, Thomas Hellstrom, John David Anglin,
	HATAYAMA Daisuke, Konstantin Khlebnikov, Carsten Otte,
	Peter Zijlstra

On 01/15/2014 05:06 PM, Daniel Borkmann wrote:
> [keeping netdev in loop as well]
> 
> On 01/15/2014 03:27 PM, Vlastimil Babka wrote:
>> On 01/13/2014 12:39 PM, Daniel Borkmann wrote:
>>> On 01/13/2014 11:16 AM, Vlastimil Babka wrote:
>>>> On 01/11/2014 02:32 PM, Daniel Borkmann wrote:
>>>>> On 01/11/2014 07:22 AM, Andrew Morton wrote:
>>>>>> On Fri, 10 Jan 2014 19:23:26 +0100 Daniel Borkmann <borkmann@iogearbox.net> wrote:
>>>>>>
>>>>>>> This is being reliably triggered for each mmaped() packet(7)
>>>>>>> socket from user space, basically during unmapping resp.
>>>>>>> closing the TX socket.
>>>>>>>
>>>>>>> I believe due to some change in transparent hugepages code ?
>>>>>>>
>>>>>>> When I disable transparent hugepages, everything works fine,
>>>>>>> no BUG triggered.
>>>>>>>
>>>>>>> I'd be happy to test patches.
>>>>>>
>>>>>> Did the inclusion of c424be1cbbf852e46acc8 ("mm: munlock: fix a bug
>>>>>> where THP tail page is encountered") in current mainline fix this?
>>>>>
>>>>> Thanks for your answer Andrew!
>>>>>
>>>>> Hm, I just cherry-picked that onto current net-next as I have some work
>>>>> there, and this time I got ...
>>>>>
>>>>> (User space uses packet mmap() and mlockall(MCL_CURRENT | MCL_FUTURE)
>>>>>        and on shutdown munlockall() ...)
>>>>>
>>>>> [   63.863672] ------------[ cut here ]------------
>>>>> [   63.863702] kernel BUG at mm/mlock.c:507!
>>>>> [   63.863721] invalid opcode: 0000 [#1] SMP
>>>>> [   63.863743] Modules linked in: fuse ebtable_nat xt_CHECKSUM nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6table_nat nf_nat_ipv6 ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack bridge ebtable_filter ebtables stp llc ip6table_filter ip6_tables rfcomm bnep snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec iwlwifi cfg80211 snd_hwdep btusb snd_seq bluetooth sdhci_pci snd_seq_device e1000e tpm_tis snd_pcm thinkpad_acpi sdhci ptp tpm uvcvideo pps_core snd_page_alloc snd_timer snd rfkill mmc_core iTCO_wdt iTCO_vendor_support lpc_ich mfd_core soundcore joydev wmi videobuf2_vmalloc videobuf2_memops videobuf2_core i2c_i801 pcspkr videodev media uinput i915
>>>>> [   63.864152]  i2c_algo_bit drm_kms_helper drm i2c_core video
>>>>> [   63.864181] CPU: 1 PID: 1617 Comm: trafgen Not tainted 3.13.0-rc6+ #15
>>>>> [   63.864209] Hardware name: LENOVO 2429BP3/2429BP3, BIOS G4ET37WW (1.12 ) 05/29/2012
>>>>> [   63.864242] task: ffff8801ee060000 ti: ffff8800b5954000 task.ti: ffff8800b5954000
>>>>> [   63.864274] RIP: 0010:[<ffffffff8116fa9a>]  [<ffffffff8116fa9a>] munlock_vma_pages_range+0x2ea/0x2f0
>>>>> [   63.864318] RSP: 0018:ffff8800b5955e08  EFLAGS: 00010202
>>>>> [   63.864341] RAX: 00000000000001ff RBX: ffff8800b58f7508 RCX: 0000000000000034
>>>>> [   63.864372] RDX: 00000007f0708992 RSI: ffffea0002c3e700 RDI: ffffea0002c3e700
>>>>> [   63.864402] RBP: ffff8800b5955ee0 R08: 3800000000000000 R09: a8000b0f9c000000
>>>>> [   63.864432] R10: 57ffdef066c3e700 R11: ffffff5cfb00c14a R12: ffffea0002c3e700
>>>>> [   63.864462] R13: ffff8800b5955f48 R14: 00007f0708992000 R15: 00007f0708992000
>>>>> [   63.864492] FS:  00007f0708b92740(0000) GS:ffff88021e240000(0000) knlGS:0000000000000000
>>>>> [   63.864526] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> [   63.864551] CR2: 00007f33bb373000 CR3: 00000000b2a2c000 CR4: 00000000001407e0
>>>>> [   63.864581] Stack:
>>>>> [   63.864593]  ffff8800b5955ed0 00007f0708b91fff 00007f0708b92000 ffff8800b5955e48
>>>>> [   63.864632]  000001ff810c864b ffff8801ee060000 0000000000000000 0000000000000000
>>>>> [   63.864669]  ffff8800b5955e58 ffff8801ee060000 0000000700000086 ffff8801ee060000
>>>>> [   63.864708] Call Trace:
>>>>> [   63.864724]  [<ffffffff816956bc>] ? _raw_spin_unlock_irq+0x2c/0x30
>>>>> [   63.864754]  [<ffffffff81171b52>] ? vma_merge+0xc2/0x330
>>>>> [   63.864786]  [<ffffffff8116fb9c>] mlock_fixup+0xfc/0x190
>>>>> [   63.864812]  [<ffffffff8116fde7>] do_mlockall+0x87/0xc0
>>>>> [   63.864836]  [<ffffffff811702df>] sys_munlockall+0x2f/0x50
>>>>> [   63.864873]  [<ffffffff8169e192>] system_call_fastpath+0x16/0x1b
>>>>> [   63.864898] Code: d7 48 89 95 28 ff ff ff e8 a4 04 fe ff 84 c0 48 8b 95 28 ff ff ff 0f 85 5a ff ff ff e9 46 ff ff ff e8 3f ac 51 00 e8 34 ac 51 00 <0f> 0b 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55
>>>>> [   63.865114] RIP  [<ffffffff8116fa9a>] munlock_vma_pages_range+0x2ea/0x2f0
>>>>> [   63.865148]  RSP <ffff8800b5955e08>
>>>>> [   63.874968] ------------[ cut here ]------------
>>>>>
>>>>> ... when I find some time, I'll try with normal torvalds' tree, maybe some
>>>>> other patches are missing as well, not sure right now.
>>>>
>>>> Uh so the triggered assertion is the one added by this very patch, and there are no more changes wrt this in mainline.
>>>>
>>>> If you can still try debug patches, please try this. Thanks.
>>>
>>> Yes, thanks, I'll come back to you some time by today.
>>
>> Daniel sent me (off-list) instructions to reproduce:
>>
>>> Then in the kernel source tree, you'll find:
>>>
>>>      tools/testing/selftests/net/
>>>
>>> There, just do a 'make' and run ./psock_tpacket
>>
>> It reproduces deterministically in mainline since 3.12, i.e. my munlock
>> performance series. Based on the initial debug output, I've expanded the
>> debug patch below a bit:
>>
>>>> From: Vlastimil Babka <vbabka@suse.cz>
>>>> Date: Mon, 13 Jan 2014 11:13:53 +0100
>>>> Subject: [PATCH] debug munlock_vma_pages_range
>>>>
>>>> ---
>>>>      mm/mlock.c | 22 ++++++++++++++++++++--
>>>>      1 file changed, 20 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/mm/mlock.c b/mm/mlock.c
>>>> index c59c420..7d0e29a 100644
>>>> --- a/mm/mlock.c
>>>> +++ b/mm/mlock.c
>>>> @@ -448,12 +448,14 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
>>>>      void munlock_vma_pages_range(struct vm_area_struct *vma,
>>>>      			     unsigned long start, unsigned long end)
>>>>      {
>>>> +	unsigned long orig_start = start;
>>>> +	unsigned long page_increm = 0;
>>>> +
>>>>      	vma->vm_flags &= ~VM_LOCKED;
>>>>
>>>>      	while (start < end) {
>>>>      		struct page *page = NULL;
>>>>      		unsigned int page_mask;
>>>> -		unsigned long page_increm;
>>>>      		struct pagevec pvec;
>>>>      		struct zone *zone;
>>>>      		int zoneid;
>>>> @@ -504,7 +506,23 @@ void munlock_vma_pages_range(struct vm_area_struct *vma,
>>>>      			}
>>>>      		}
>>>>      		/* It's a bug to munlock in the middle of a THP page */
>>>> -		VM_BUG_ON((start >> PAGE_SHIFT) & page_mask);
>>>> +		if ((start >> PAGE_SHIFT) & page_mask) {
>>>> +			dump_page(page);
>>>> +			printk("start=%lu pfn=%lu orig_start=%lu "
>>>> +			       "prev_page_increm=%lu page_mask=%u "
>>>> +			       "vm_start=%lu vm_end=%lu vm_flags=%lu\n",
>>>> +				start, page_to_pfn(page), orig_start,
>>>> +				page_increm, page_mask,
>>>> +				vma->vm_start, vma->vm_end,
>>>> +				vma->vm_flags);
>> +                        printk("vm_ops=%pF, open=%pF, fault=%pF, remap_pages=%pF\n", vma->vm_ops,
>> +                                vma->vm_ops->open, vma->vm_ops->fault, vma->vm_ops->remap_pages);
>> +                        if (PageCompound(page)) {
>> +                                printk("page is compound with order=%d\n", compound_order(page));
>> +                        }
>>>> +			if (PageTail(page)) {
>>>> +				struct page *first_page = page->first_page;
>>>> +				printk("first_page pfn=%lu\n",
>>>> +						page_to_pfn(first_page));
>>>> +				dump_page(first_page);
>>>> +			}
>>>> +			VM_BUG_ON(true);
>>>> +		}
>>>>      		page_increm = 1 + page_mask;
>>>>      		start += page_increm * PAGE_SIZE;
>>>>      next:
>>>>
>>
>> And got output like this:
>>
>> page:ffffea0002474a40 count:5 mapcount:1 mapping:          (null) index:0x0
>> page flags: 0x100000000004004(referenced|head)
>> start=140242647736320 pfn=682616 orig_start=140242647736320 prev_page_increm=0 page_mask=511 vm_start=140242647736320 vm_end=140242651930624 vm_flags=268435707
>> vm_ops=packet_mmap_ops+0x0/0xfffffffffffff8e0 [af_packet], open=packet_mm_open+0x0/0x30 [af_packet], fault=          (null), remap_pages=          (null)
>> page is compound with order=2
>>
>> Observations:
>> - address 140242647736320 is where the vma starts, and is not aligned to 512 pages
>>     (so it cannot be a THP head which the munlock expects). Yet there is a head page
>>     that triggers the PageTransHuge() and consequently hpage_nr_pages() in munlock_vma_page()
>>     That's why page_mask is determined to be 511 and the code thinks it's in the
>>     middle of a THP page.
>> - in fact, the page is a compound page with order=2
>> - the VM flags (except (may)read/write) are VM_SHARED and VM_MIXEDMAP
>> - the vma was mmapped by packet_mmap() (net/packet/af_packet.c) which uses
>>     vm_insert_page(), which adds the VM_MIXEDMAP flag
>> - the buffers that are mapped were allocated by alloc_one_pg_vec_page()
>>     where flags indeed include __GFP_COMP
>>
>> So clearly there is a way to have mlock/munlock operate on a vma that contains
>> compound pages and confuse the checks for PageTransHuge().
>>
>> The checks for THP in munlock came with commit ff6a6da60b89 ("mm: accelerate munlock()
>> treatment of THP pages"), i.e. since 3.9, but did not trigger a bug. It however
>> makes munlock_vma_pages_range() skip pages until the next 512-pages-aligned page,
>> when it encounters a head page. If the head page is of smaller order and is followed
>> by normal LRU pages (theoretically, I'm not sure if that's possible, or done anywhere),
>> they wouldn't get munlocked.
>>
>> My commit 7225522bb429 ("mm: munlock: batch non-THP page isolation and
>> munlock+putback using pagevec") (since 3.12) has added a new PageTransHuge() check
>> that can trigger on tail pages of the compound page here. Commit c424be1cbbf852e46acc8
>> ("mm: munlock: fix a bug where THP tail page is encountered") in current rc's removes
>> one class of bugs here, but still non-THP compound pages are not expected in mlock/munlock,
>> which leads to this assertion failing.
>>
>> The question is what is the correct fix, and I'm not that familiar with VM_MIXEDMAP
>> to decide.
>>
>> Option 1: mlocking VM_MIXEDMAP vma's has no sense. They should be treated like VM_PFNMAP
>>             and added to VM_SPECIAL, which makes m(un)lock skip them completely.
>>
>> Option 2: if indeed VM_MIXEDMAP can contain PageLRU pages for which mlocking is useful,
>>             VM_NO_THP should be checked in munlock before attempting PageTransHuge() and
>>             friends. VM_NO_THP already contains VM_MIXEDMAP, so knowing that there can be
>>             no THP means we don't try optimize for it and no unexpected head pages trip us.
>>
>> Thoughts?

OK, here's a RFC patch to hopefully help get us somewhere. I went for
Option1, as I didn't see anyone using VM_MIXEDMAP also for LRU pages,
and Option2 was ugly to implement and also seemed quite arbitrary. I'm
not sure if making VM_MIXEDMAP also non-mergeable this way is an issue
though.

Actually a nice third option would be that the vma in question would
include also another flag (other than VM_MIXEDMAP) that would declare it
can contain compound pages. But I didn't find any such rule documented,
perhaps VM_IO would be a candidate?

I added people that touched VM_MIXEDMAP in the past to CC.


-------8<-------
From: Vlastimil Babka <vbabka@suse.cz>
Date: Fri, 31 Jan 2014 11:50:21 +0100
Subject: [PATCH] mm: include VM_MIXEDMAP flag in the VM_SPECIAL list to avoid
 m(un)locking

Daniel Borkmann reported a bug with VM_BUG_ON assertions failing where
munlock_vma_pages_range() thinks it's unexpectedly in the middle of a THP page.
This can be reproduced in tools/testing/selftests/net/ by running make and
then ./psock_tpacket.

The problem is that an order=2 compound page (allocated by
alloc_one_pg_vec_page() is part of the munlocked VM_MIXEDMAP vma (mapped by
packet_mmap()) and mistaken for a THP page and assumed to be order=9.

The checks for THP in munlock came with commit ff6a6da60b89 ("mm: accelerate
munlock() treatment of THP pages"), i.e. since 3.9, but did not trigger a bug.
It just makes munlock_vma_pages_range() skip such compound pages until the next
512-pages-aligned page, when it encounters a head page. This is however not a
problem for vma's where mlocking has no effect anyway, but it can distort the
accounting.
Since commit 7225522bb ("mm: munlock: batch non-THP page isolation and
munlock+putback using pagevec") this can trigger a VM_BUG_ON in PageTransHuge()
check.

This patch fixes the issue by adding VM_MIXEDMAP flag to VM_SPECIAL - a list of
flags that make vma's non-mlockable and non-mergeable. The reasoning is that
VM_MIXEDMAP vma's are similar to VM_PFNMAP, which is already on the VM_SPECIAL
list, and both are intended for non-LRU pages where mlocking makes no sense
anyway.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Daniel Borkmann <borkmann@iogearbox.net>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: John David Anglin <dave.anglin@bell.net>
Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jared Hulbert <jaredeh@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/mm.h | 2 +-
 mm/huge_memory.c   | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f28f46e..f9b04ac 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -175,7 +175,7 @@ extern unsigned int kobjsize(const void *objp);
  * Special vmas that are non-mergable, non-mlock()able.
  * Note: mm/huge_memory.c VM_NO_THP depends on this definition.
  */
-#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_PFNMAP)
+#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_PFNMAP | VM_MIXEDMAP)
 
 /*
  * mapping from the currently active vm_flags protection bits (the
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 82166bf..f32fffb 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1963,7 +1963,7 @@ out:
 	return ret;
 }
 
-#define VM_NO_THP (VM_SPECIAL|VM_MIXEDMAP|VM_HUGETLB|VM_SHARED|VM_MAYSHARE)
+#define VM_NO_THP (VM_SPECIAL|VM_HUGETLB|VM_SHARED|VM_MAYSHARE)
 
 int hugepage_madvise(struct vm_area_struct *vma,
 		     unsigned long *vm_flags, int advice)
-- 
1.8.4.5



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [BUG] at include/linux/page-flags.h:415 (PageTransHuge)
  2014-01-31 14:40             ` Vlastimil Babka
@ 2014-01-31 14:58               ` Thomas Hellstrom
  2014-01-31 15:25                 ` Vlastimil Babka
  2014-02-07 18:58               ` Hannes Frederic Sowa
  1 sibling, 1 reply; 13+ messages in thread
From: Thomas Hellstrom @ 2014-01-31 14:58 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Daniel Borkmann, Andrew Morton, linux-kernel, Michel Lespinasse,
	linux-mm, Jared Hulbert, netdev, John David Anglin,
	HATAYAMA Daisuke, Konstantin Khlebnikov, Carsten Otte,
	Peter Zijlstra

On 01/31/2014 03:40 PM, Vlastimil Babka wrote:
> On 01/15/2014 05:06 PM, Daniel Borkmann wrote:
>> [keeping netdev in loop as well]
>>
>> On 01/15/2014 03:27 PM, Vlastimil Babka wrote:
>>> On 01/13/2014 12:39 PM, Daniel Borkmann wrote:
>>>> On 01/13/2014 11:16 AM, Vlastimil Babka wrote:
>>>>> On 01/11/2014 02:32 PM, Daniel Borkmann wrote:
>>>>>> On 01/11/2014 07:22 AM, Andrew Morton wrote:
>>>>>>> On Fri, 10 Jan 2014 19:23:26 +0100 Daniel Borkmann <borkmann@iogearbox.net> wrote:
>>>>>>>
>>>>>>>> This is being reliably triggered for each mmaped() packet(7)
>>>>>>>> socket from user space, basically during unmapping resp.
>>>>>>>> closing the TX socket.
>>>>>>>>
>>>>>>>> I believe due to some change in transparent hugepages code ?
>>>>>>>>
>>>>>>>> When I disable transparent hugepages, everything works fine,
>>>>>>>> no BUG triggered.
>>>>>>>>
>>>>>>>> I'd be happy to test patches.
>>>>>>> Did the inclusion of c424be1cbbf852e46acc8 ("mm: munlock: fix a bug
>>>>>>> where THP tail page is encountered") in current mainline fix this?
>>>>>> Thanks for your answer Andrew!
>>>>>>
>>>>>> Hm, I just cherry-picked that onto current net-next as I have some work
>>>>>> there, and this time I got ...
>>>>>>
>>>>>> (User space uses packet mmap() and mlockall(MCL_CURRENT | MCL_FUTURE)
>>>>>>        and on shutdown munlockall() ...)
>>>>>>
>>>>>> [   63.863672] ------------[ cut here ]------------
>>>>>> [   63.863702] kernel BUG at mm/mlock.c:507!
>>>>>> [   63.863721] invalid opcode: 0000 [#1] SMP
>>>>>> [   63.863743] Modules linked in: fuse ebtable_nat xt_CHECKSUM nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6table_nat nf_nat_ipv6 ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack bridge ebtable_filter ebtables stp llc ip6table_filter ip6_tables rfcomm bnep snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec iwlwifi cfg80211 snd_hwdep btusb snd_seq bluetooth sdhci_pci snd_seq_device e1000e tpm_tis snd_pcm thinkpad_acpi sdhci ptp tpm uvcvideo pps_core snd_page_alloc snd_timer snd rfkill mmc_core iTCO_wdt iTCO_vendor_support lpc_ich mfd_core soundcore joydev wmi videobuf2_vmalloc videobuf2_memops videobuf2_core i2c_i801 pcspkr videodev media uinput i915
>>>>>> [   63.864152]  i2c_algo_bit drm_kms_helper drm i2c_core video
>>>>>> [   63.864181] CPU: 1 PID: 1617 Comm: trafgen Not tainted 3.13.0-rc6+ #15
>>>>>> [   63.864209] Hardware name: LENOVO 2429BP3/2429BP3, BIOS G4ET37WW (1.12 ) 05/29/2012
>>>>>> [   63.864242] task: ffff8801ee060000 ti: ffff8800b5954000 task.ti: ffff8800b5954000
>>>>>> [   63.864274] RIP: 0010:[<ffffffff8116fa9a>]  [<ffffffff8116fa9a>] munlock_vma_pages_range+0x2ea/0x2f0
>>>>>> [   63.864318] RSP: 0018:ffff8800b5955e08  EFLAGS: 00010202
>>>>>> [   63.864341] RAX: 00000000000001ff RBX: ffff8800b58f7508 RCX: 0000000000000034
>>>>>> [   63.864372] RDX: 00000007f0708992 RSI: ffffea0002c3e700 RDI: ffffea0002c3e700
>>>>>> [   63.864402] RBP: ffff8800b5955ee0 R08: 3800000000000000 R09: a8000b0f9c000000
>>>>>> [   63.864432] R10: 57ffdef066c3e700 R11: ffffff5cfb00c14a R12: ffffea0002c3e700
>>>>>> [   63.864462] R13: ffff8800b5955f48 R14: 00007f0708992000 R15: 00007f0708992000
>>>>>> [   63.864492] FS:  00007f0708b92740(0000) GS:ffff88021e240000(0000) knlGS:0000000000000000
>>>>>> [   63.864526] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>> [   63.864551] CR2: 00007f33bb373000 CR3: 00000000b2a2c000 CR4: 00000000001407e0
>>>>>> [   63.864581] Stack:
>>>>>> [   63.864593]  ffff8800b5955ed0 00007f0708b91fff 00007f0708b92000 ffff8800b5955e48
>>>>>> [   63.864632]  000001ff810c864b ffff8801ee060000 0000000000000000 0000000000000000
>>>>>> [   63.864669]  ffff8800b5955e58 ffff8801ee060000 0000000700000086 ffff8801ee060000
>>>>>> [   63.864708] Call Trace:
>>>>>> [   63.864724]  [<ffffffff816956bc>] ? _raw_spin_unlock_irq+0x2c/0x30
>>>>>> [   63.864754]  [<ffffffff81171b52>] ? vma_merge+0xc2/0x330
>>>>>> [   63.864786]  [<ffffffff8116fb9c>] mlock_fixup+0xfc/0x190
>>>>>> [   63.864812]  [<ffffffff8116fde7>] do_mlockall+0x87/0xc0
>>>>>> [   63.864836]  [<ffffffff811702df>] sys_munlockall+0x2f/0x50
>>>>>> [   63.864873]  [<ffffffff8169e192>] system_call_fastpath+0x16/0x1b
>>>>>> [   63.864898] Code: d7 48 89 95 28 ff ff ff e8 a4 04 fe ff 84 c0 48 8b 95 28 ff ff ff 0f 85 5a ff ff ff e9 46 ff ff ff e8 3f ac 51 00 e8 34 ac 51 00 <0f> 0b 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55
>>>>>> [   63.865114] RIP  [<ffffffff8116fa9a>] munlock_vma_pages_range+0x2ea/0x2f0
>>>>>> [   63.865148]  RSP <ffff8800b5955e08>
>>>>>> [   63.874968] ------------[ cut here ]------------
>>>>>>
>>>>>> ... when I find some time, I'll try with normal torvalds' tree, maybe some
>>>>>> other patches are missing as well, not sure right now.
>>>>> Uh so the triggered assertion is the one added by this very patch, and there are no more changes wrt this in mainline.
>>>>>
>>>>> If you can still try debug patches, please try this. Thanks.
>>>> Yes, thanks, I'll come back to you some time by today.
>>> Daniel sent me (off-list) instructions to reproduce:
>>>
>>>> Then in the kernel source tree, you'll find:
>>>>
>>>>      tools/testing/selftests/net/
>>>>
>>>> There, just do a 'make' and run ./psock_tpacket
>>> It reproduces deterministically in mainline since 3.12, i.e. my munlock
>>> performance series. Based on the initial debug output, I've expanded the
>>> debug patch below a bit:
>>>
>>>>> From: Vlastimil Babka <vbabka@suse.cz>
>>>>> Date: Mon, 13 Jan 2014 11:13:53 +0100
>>>>> Subject: [PATCH] debug munlock_vma_pages_range
>>>>>
>>>>> ---
>>>>>      mm/mlock.c | 22 ++++++++++++++++++++--
>>>>>      1 file changed, 20 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/mm/mlock.c b/mm/mlock.c
>>>>> index c59c420..7d0e29a 100644
>>>>> --- a/mm/mlock.c
>>>>> +++ b/mm/mlock.c
>>>>> @@ -448,12 +448,14 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
>>>>>      void munlock_vma_pages_range(struct vm_area_struct *vma,
>>>>>      			     unsigned long start, unsigned long end)
>>>>>      {
>>>>> +	unsigned long orig_start = start;
>>>>> +	unsigned long page_increm = 0;
>>>>> +
>>>>>      	vma->vm_flags &= ~VM_LOCKED;
>>>>>
>>>>>      	while (start < end) {
>>>>>      		struct page *page = NULL;
>>>>>      		unsigned int page_mask;
>>>>> -		unsigned long page_increm;
>>>>>      		struct pagevec pvec;
>>>>>      		struct zone *zone;
>>>>>      		int zoneid;
>>>>> @@ -504,7 +506,23 @@ void munlock_vma_pages_range(struct vm_area_struct *vma,
>>>>>      			}
>>>>>      		}
>>>>>      		/* It's a bug to munlock in the middle of a THP page */
>>>>> -		VM_BUG_ON((start >> PAGE_SHIFT) & page_mask);
>>>>> +		if ((start >> PAGE_SHIFT) & page_mask) {
>>>>> +			dump_page(page);
>>>>> +			printk("start=%lu pfn=%lu orig_start=%lu "
>>>>> +			       "prev_page_increm=%lu page_mask=%u "
>>>>> +			       "vm_start=%lu vm_end=%lu vm_flags=%lu\n",
>>>>> +				start, page_to_pfn(page), orig_start,
>>>>> +				page_increm, page_mask,
>>>>> +				vma->vm_start, vma->vm_end,
>>>>> +				vma->vm_flags);
>>> +                        printk("vm_ops=%pF, open=%pF, fault=%pF, remap_pages=%pF\n", vma->vm_ops,
>>> +                                vma->vm_ops->open, vma->vm_ops->fault, vma->vm_ops->remap_pages);
>>> +                        if (PageCompound(page)) {
>>> +                                printk("page is compound with order=%d\n", compound_order(page));
>>> +                        }
>>>>> +			if (PageTail(page)) {
>>>>> +				struct page *first_page = page->first_page;
>>>>> +				printk("first_page pfn=%lu\n",
>>>>> +						page_to_pfn(first_page));
>>>>> +				dump_page(first_page);
>>>>> +			}
>>>>> +			VM_BUG_ON(true);
>>>>> +		}
>>>>>      		page_increm = 1 + page_mask;
>>>>>      		start += page_increm * PAGE_SIZE;
>>>>>      next:
>>>>>
>>> And got output like this:
>>>
>>> page:ffffea0002474a40 count:5 mapcount:1 mapping:          (null) index:0x0
>>> page flags: 0x100000000004004(referenced|head)
>>> start=140242647736320 pfn=682616 orig_start=140242647736320 prev_page_increm=0 page_mask=511 vm_start=140242647736320 vm_end=140242651930624 vm_flags=268435707
>>> vm_ops=packet_mmap_ops+0x0/0xfffffffffffff8e0 [af_packet], open=packet_mm_open+0x0/0x30 [af_packet], fault=          (null), remap_pages=          (null)
>>> page is compound with order=2
>>>
>>> Observations:
>>> - address 140242647736320 is where the vma starts, and is not aligned to 512 pages
>>>     (so it cannot be a THP head which the munlock expects). Yet there is a head page
>>>     that triggers the PageTransHuge() and consequently hpage_nr_pages() in munlock_vma_page()
>>>     That's why page_mask is determined to be 511 and the code thinks it's in the
>>>     middle of a THP page.
>>> - in fact, the page is a compound page with order=2
>>> - the VM flags (except (may)read/write) are VM_SHARED and VM_MIXEDMAP
>>> - the vma was mmapped by packet_mmap() (net/packet/af_packet.c) which uses
>>>     vm_insert_page(), which adds the VM_MIXEDMAP flag
>>> - the buffers that are mapped were allocated by alloc_one_pg_vec_page()
>>>     where flags indeed include __GFP_COMP
>>>
>>> So clearly there is a way to have mlock/munlock operate on a vma that contains
>>> compound pages and confuse the checks for PageTransHuge().
>>>
>>> The checks for THP in munlock came with commit ff6a6da60b89 ("mm: accelerate munlock()
>>> treatment of THP pages"), i.e. since 3.9, but did not trigger a bug. It however
>>> makes munlock_vma_pages_range() skip pages until the next 512-pages-aligned page,
>>> when it encounters a head page. If the head page is of smaller order and is followed
>>> by normal LRU pages (theoretically, I'm not sure if that's possible, or done anywhere),
>>> they wouldn't get munlocked.
>>>
>>> My commit 7225522bb429 ("mm: munlock: batch non-THP page isolation and
>>> munlock+putback using pagevec") (since 3.12) has added a new PageTransHuge() check
>>> that can trigger on tail pages of the compound page here. Commit c424be1cbbf852e46acc8
>>> ("mm: munlock: fix a bug where THP tail page is encountered") in current rc's removes
>>> one class of bugs here, but still non-THP compound pages are not expected in mlock/munlock,
>>> which leads to this assertion failing.
>>>
>>> The question is what is the correct fix, and I'm not that familiar with VM_MIXEDMAP
>>> to decide.
>>>
>>> Option 1: mlocking VM_MIXEDMAP vma's has no sense. They should be treated like VM_PFNMAP
>>>             and added to VM_SPECIAL, which makes m(un)lock skip them completely.
>>>
>>> Option 2: if indeed VM_MIXEDMAP can contain PageLRU pages for which mlocking is useful,
>>>             VM_NO_THP should be checked in munlock before attempting PageTransHuge() and
>>>             friends. VM_NO_THP already contains VM_MIXEDMAP, so knowing that there can be
>>>             no THP means we don't try optimize for it and no unexpected head pages trip us.
>>>
>>> Thoughts?
> OK, here's a RFC patch to hopefully help get us somewhere. I went for
> Option1, as I didn't see anyone using VM_MIXEDMAP also for LRU pages,
> and Option2 was ugly to implement and also seemed quite arbitrary. I'm
> not sure if making VM_MIXEDMAP also non-mergeable this way is an issue
> though.
>
>

Hi!

Forgive an ignorant question, but are anonymous COW'd pages LRU pages?
The reason I'm asking is that TTM VM_MIXEDMAP vmas may contain such pages.

/Thomas

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [BUG] at include/linux/page-flags.h:415 (PageTransHuge)
  2014-01-31 14:58               ` Thomas Hellstrom
@ 2014-01-31 15:25                 ` Vlastimil Babka
  2014-01-31 15:35                   ` Thomas Hellstrom
  0 siblings, 1 reply; 13+ messages in thread
From: Vlastimil Babka @ 2014-01-31 15:25 UTC (permalink / raw)
  To: Thomas Hellstrom
  Cc: Daniel Borkmann, Andrew Morton, linux-kernel, Michel Lespinasse,
	linux-mm, Jared Hulbert, netdev, John David Anglin,
	HATAYAMA Daisuke, Konstantin Khlebnikov, Carsten Otte,
	Peter Zijlstra

On 01/31/2014 03:58 PM, Thomas Hellstrom wrote:
> On 01/31/2014 03:40 PM, Vlastimil Babka wrote:
>> On 01/15/2014 05:06 PM, Daniel Borkmann wrote:
>>> [keeping netdev in loop as well]
>>>
>>> On 01/15/2014 03:27 PM, Vlastimil Babka wrote:
>>>> On 01/13/2014 12:39 PM, Daniel Borkmann wrote:
>>>>> On 01/13/2014 11:16 AM, Vlastimil Babka wrote:
>>>>>> On 01/11/2014 02:32 PM, Daniel Borkmann wrote:
>>>>>>> On 01/11/2014 07:22 AM, Andrew Morton wrote:
>>>>>>>> On Fri, 10 Jan 2014 19:23:26 +0100 Daniel Borkmann <borkmann@iogearbox.net> wrote:
>>>>>>>>
>>>>>>>>> This is being reliably triggered for each mmaped() packet(7)
>>>>>>>>> socket from user space, basically during unmapping resp.
>>>>>>>>> closing the TX socket.
>>>>>>>>>
>>>>>>>>> I believe due to some change in transparent hugepages code ?
>>>>>>>>>
>>>>>>>>> When I disable transparent hugepages, everything works fine,
>>>>>>>>> no BUG triggered.
>>>>>>>>>
>>>>>>>>> I'd be happy to test patches.
>>>>>>>> Did the inclusion of c424be1cbbf852e46acc8 ("mm: munlock: fix a bug
>>>>>>>> where THP tail page is encountered") in current mainline fix this?
>>>>>>> Thanks for your answer Andrew!
>>>>>>>
>>>>>>> Hm, I just cherry-picked that onto current net-next as I have some work
>>>>>>> there, and this time I got ...
>>>>>>>
>>>>>>> (User space uses packet mmap() and mlockall(MCL_CURRENT | MCL_FUTURE)
>>>>>>>         and on shutdown munlockall() ...)
>>>>>>>
>>>>>>> [   63.863672] ------------[ cut here ]------------
>>>>>>> [   63.863702] kernel BUG at mm/mlock.c:507!
>>>>>>> [   63.863721] invalid opcode: 0000 [#1] SMP
>>>>>>> [   63.863743] Modules linked in: fuse ebtable_nat xt_CHECKSUM nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6table_nat nf_nat_ipv6 ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack bridge ebtable_filter ebtables stp llc ip6table_filter ip6_tables rfcomm bnep snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec iwlwifi cfg80211 snd_hwdep btusb snd_seq bluetooth sdhci_pci snd_seq_device e1000e tpm_tis snd_pcm thinkpad_acpi sdhci ptp tpm uvcvideo pps_core snd_page_alloc snd_timer snd rfkill mmc_core iTCO_wdt iTCO_vendor_support lpc_ich mfd_core soundcore joydev wmi videobuf2_vmalloc videobuf2_memops videobuf2_core i2c_i801 pcspkr videodev media uinput i915
>>>>>>> [   63.864152]  i2c_algo_bit drm_kms_helper drm i2c_core video
>>>>>>> [   63.864181] CPU: 1 PID: 1617 Comm: trafgen Not tainted 3.13.0-rc6+ #15
>>>>>>> [   63.864209] Hardware name: LENOVO 2429BP3/2429BP3, BIOS G4ET37WW (1.12 ) 05/29/2012
>>>>>>> [   63.864242] task: ffff8801ee060000 ti: ffff8800b5954000 task.ti: ffff8800b5954000
>>>>>>> [   63.864274] RIP: 0010:[<ffffffff8116fa9a>]  [<ffffffff8116fa9a>] munlock_vma_pages_range+0x2ea/0x2f0
>>>>>>> [   63.864318] RSP: 0018:ffff8800b5955e08  EFLAGS: 00010202
>>>>>>> [   63.864341] RAX: 00000000000001ff RBX: ffff8800b58f7508 RCX: 0000000000000034
>>>>>>> [   63.864372] RDX: 00000007f0708992 RSI: ffffea0002c3e700 RDI: ffffea0002c3e700
>>>>>>> [   63.864402] RBP: ffff8800b5955ee0 R08: 3800000000000000 R09: a8000b0f9c000000
>>>>>>> [   63.864432] R10: 57ffdef066c3e700 R11: ffffff5cfb00c14a R12: ffffea0002c3e700
>>>>>>> [   63.864462] R13: ffff8800b5955f48 R14: 00007f0708992000 R15: 00007f0708992000
>>>>>>> [   63.864492] FS:  00007f0708b92740(0000) GS:ffff88021e240000(0000) knlGS:0000000000000000
>>>>>>> [   63.864526] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>>> [   63.864551] CR2: 00007f33bb373000 CR3: 00000000b2a2c000 CR4: 00000000001407e0
>>>>>>> [   63.864581] Stack:
>>>>>>> [   63.864593]  ffff8800b5955ed0 00007f0708b91fff 00007f0708b92000 ffff8800b5955e48
>>>>>>> [   63.864632]  000001ff810c864b ffff8801ee060000 0000000000000000 0000000000000000
>>>>>>> [   63.864669]  ffff8800b5955e58 ffff8801ee060000 0000000700000086 ffff8801ee060000
>>>>>>> [   63.864708] Call Trace:
>>>>>>> [   63.864724]  [<ffffffff816956bc>] ? _raw_spin_unlock_irq+0x2c/0x30
>>>>>>> [   63.864754]  [<ffffffff81171b52>] ? vma_merge+0xc2/0x330
>>>>>>> [   63.864786]  [<ffffffff8116fb9c>] mlock_fixup+0xfc/0x190
>>>>>>> [   63.864812]  [<ffffffff8116fde7>] do_mlockall+0x87/0xc0
>>>>>>> [   63.864836]  [<ffffffff811702df>] sys_munlockall+0x2f/0x50
>>>>>>> [   63.864873]  [<ffffffff8169e192>] system_call_fastpath+0x16/0x1b
>>>>>>> [   63.864898] Code: d7 48 89 95 28 ff ff ff e8 a4 04 fe ff 84 c0 48 8b 95 28 ff ff ff 0f 85 5a ff ff ff e9 46 ff ff ff e8 3f ac 51 00 e8 34 ac 51 00 <0f> 0b 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55
>>>>>>> [   63.865114] RIP  [<ffffffff8116fa9a>] munlock_vma_pages_range+0x2ea/0x2f0
>>>>>>> [   63.865148]  RSP <ffff8800b5955e08>
>>>>>>> [   63.874968] ------------[ cut here ]------------
>>>>>>>
>>>>>>> ... when I find some time, I'll try with normal torvalds' tree, maybe some
>>>>>>> other patches are missing as well, not sure right now.
>>>>>> Uh so the triggered assertion is the one added by this very patch, and there are no more changes wrt this in mainline.
>>>>>>
>>>>>> If you can still try debug patches, please try this. Thanks.
>>>>> Yes, thanks, I'll come back to you some time by today.
>>>> Daniel sent me (off-list) instructions to reproduce:
>>>>
>>>>> Then in the kernel source tree, you'll find:
>>>>>
>>>>>       tools/testing/selftests/net/
>>>>>
>>>>> There, just do a 'make' and run ./psock_tpacket
>>>> It reproduces deterministically in mainline since 3.12, i.e. my munlock
>>>> performance series. Based on the initial debug output, I've expanded the
>>>> debug patch below a bit:
>>>>
>>>>>> From: Vlastimil Babka <vbabka@suse.cz>
>>>>>> Date: Mon, 13 Jan 2014 11:13:53 +0100
>>>>>> Subject: [PATCH] debug munlock_vma_pages_range
>>>>>>
>>>>>> ---
>>>>>>       mm/mlock.c | 22 ++++++++++++++++++++--
>>>>>>       1 file changed, 20 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/mlock.c b/mm/mlock.c
>>>>>> index c59c420..7d0e29a 100644
>>>>>> --- a/mm/mlock.c
>>>>>> +++ b/mm/mlock.c
>>>>>> @@ -448,12 +448,14 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
>>>>>>       void munlock_vma_pages_range(struct vm_area_struct *vma,
>>>>>>       			     unsigned long start, unsigned long end)
>>>>>>       {
>>>>>> +	unsigned long orig_start = start;
>>>>>> +	unsigned long page_increm = 0;
>>>>>> +
>>>>>>       	vma->vm_flags &= ~VM_LOCKED;
>>>>>>
>>>>>>       	while (start < end) {
>>>>>>       		struct page *page = NULL;
>>>>>>       		unsigned int page_mask;
>>>>>> -		unsigned long page_increm;
>>>>>>       		struct pagevec pvec;
>>>>>>       		struct zone *zone;
>>>>>>       		int zoneid;
>>>>>> @@ -504,7 +506,23 @@ void munlock_vma_pages_range(struct vm_area_struct *vma,
>>>>>>       			}
>>>>>>       		}
>>>>>>       		/* It's a bug to munlock in the middle of a THP page */
>>>>>> -		VM_BUG_ON((start >> PAGE_SHIFT) & page_mask);
>>>>>> +		if ((start >> PAGE_SHIFT) & page_mask) {
>>>>>> +			dump_page(page);
>>>>>> +			printk("start=%lu pfn=%lu orig_start=%lu "
>>>>>> +			       "prev_page_increm=%lu page_mask=%u "
>>>>>> +			       "vm_start=%lu vm_end=%lu vm_flags=%lu\n",
>>>>>> +				start, page_to_pfn(page), orig_start,
>>>>>> +				page_increm, page_mask,
>>>>>> +				vma->vm_start, vma->vm_end,
>>>>>> +				vma->vm_flags);
>>>> +                        printk("vm_ops=%pF, open=%pF, fault=%pF, remap_pages=%pF\n", vma->vm_ops,
>>>> +                                vma->vm_ops->open, vma->vm_ops->fault, vma->vm_ops->remap_pages);
>>>> +                        if (PageCompound(page)) {
>>>> +                                printk("page is compound with order=%d\n", compound_order(page));
>>>> +                        }
>>>>>> +			if (PageTail(page)) {
>>>>>> +				struct page *first_page = page->first_page;
>>>>>> +				printk("first_page pfn=%lu\n",
>>>>>> +						page_to_pfn(first_page));
>>>>>> +				dump_page(first_page);
>>>>>> +			}
>>>>>> +			VM_BUG_ON(true);
>>>>>> +		}
>>>>>>       		page_increm = 1 + page_mask;
>>>>>>       		start += page_increm * PAGE_SIZE;
>>>>>>       next:
>>>>>>
>>>> And got output like this:
>>>>
>>>> page:ffffea0002474a40 count:5 mapcount:1 mapping:          (null) index:0x0
>>>> page flags: 0x100000000004004(referenced|head)
>>>> start=140242647736320 pfn=682616 orig_start=140242647736320 prev_page_increm=0 page_mask=511 vm_start=140242647736320 vm_end=140242651930624 vm_flags=268435707
>>>> vm_ops=packet_mmap_ops+0x0/0xfffffffffffff8e0 [af_packet], open=packet_mm_open+0x0/0x30 [af_packet], fault=          (null), remap_pages=          (null)
>>>> page is compound with order=2
>>>>
>>>> Observations:
>>>> - address 140242647736320 is where the vma starts, and is not aligned to 512 pages
>>>>      (so it cannot be a THP head which the munlock expects). Yet there is a head page
>>>>      that triggers the PageTransHuge() and consequently hpage_nr_pages() in munlock_vma_page()
>>>>      That's why page_mask is determined to be 511 and the code thinks it's in the
>>>>      middle of a THP page.
>>>> - in fact, the page is a compound page with order=2
>>>> - the VM flags (except (may)read/write) are VM_SHARED and VM_MIXEDMAP
>>>> - the vma was mmapped by packet_mmap() (net/packet/af_packet.c) which uses
>>>>      vm_insert_page(), which adds the VM_MIXEDMAP flag
>>>> - the buffers that are mapped were allocated by alloc_one_pg_vec_page()
>>>>      where flags indeed include __GFP_COMP
>>>>
>>>> So clearly there is a way to have mlock/munlock operate on a vma that contains
>>>> compound pages and confuse the checks for PageTransHuge().
>>>>
>>>> The checks for THP in munlock came with commit ff6a6da60b89 ("mm: accelerate munlock()
>>>> treatment of THP pages"), i.e. since 3.9, but did not trigger a bug. It however
>>>> makes munlock_vma_pages_range() skip pages until the next 512-pages-aligned page,
>>>> when it encounters a head page. If the head page is of smaller order and is followed
>>>> by normal LRU pages (theoretically, I'm not sure if that's possible, or done anywhere),
>>>> they wouldn't get munlocked.
>>>>
>>>> My commit 7225522bb429 ("mm: munlock: batch non-THP page isolation and
>>>> munlock+putback using pagevec") (since 3.12) has added a new PageTransHuge() check
>>>> that can trigger on tail pages of the compound page here. Commit c424be1cbbf852e46acc8
>>>> ("mm: munlock: fix a bug where THP tail page is encountered") in current rc's removes
>>>> one class of bugs here, but still non-THP compound pages are not expected in mlock/munlock,
>>>> which leads to this assertion failing.
>>>>
>>>> The question is what is the correct fix, and I'm not that familiar with VM_MIXEDMAP
>>>> to decide.
>>>>
>>>> Option 1: mlocking VM_MIXEDMAP vma's has no sense. They should be treated like VM_PFNMAP
>>>>              and added to VM_SPECIAL, which makes m(un)lock skip them completely.
>>>>
>>>> Option 2: if indeed VM_MIXEDMAP can contain PageLRU pages for which mlocking is useful,
>>>>              VM_NO_THP should be checked in munlock before attempting PageTransHuge() and
>>>>              friends. VM_NO_THP already contains VM_MIXEDMAP, so knowing that there can be
>>>>              no THP means we don't try optimize for it and no unexpected head pages trip us.
>>>>
>>>> Thoughts?
>> OK, here's a RFC patch to hopefully help get us somewhere. I went for
>> Option1, as I didn't see anyone using VM_MIXEDMAP also for LRU pages,
>> and Option2 was ugly to implement and also seemed quite arbitrary. I'm
>> not sure if making VM_MIXEDMAP also non-mergeable this way is an issue
>> though.
>>
>>
>
> Hi!
>
> Forgive an ignorant question, but are anonymous COW'd pages LRU pages?

I believe so, but I've checked where VM_MIXEDMAP is used in TTM and it 
seems all those vma's are also VM_IO which means they are already 
included in VM_SPECIAL and this change won't affect them.

Vlastimil

> The reason I'm asking is that TTM VM_MIXEDMAP vmas may contain such pages.
>
> /Thomas
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [BUG] at include/linux/page-flags.h:415 (PageTransHuge)
  2014-01-31 15:25                 ` Vlastimil Babka
@ 2014-01-31 15:35                   ` Thomas Hellstrom
  0 siblings, 0 replies; 13+ messages in thread
From: Thomas Hellstrom @ 2014-01-31 15:35 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Daniel Borkmann, Andrew Morton, linux-kernel, Michel Lespinasse,
	linux-mm, Jared Hulbert, netdev, John David Anglin,
	HATAYAMA Daisuke, Konstantin Khlebnikov, Carsten Otte,
	Peter Zijlstra

On 01/31/2014 04:25 PM, Vlastimil Babka wrote:
> On 01/31/2014 03:58 PM, Thomas Hellstrom wrote:
>> On 01/31/2014 03:40 PM, Vlastimil Babka wrote:
>>> On 01/15/2014 05:06 PM, Daniel Borkmann wrote:
>>>> [keeping netdev in loop as well]
>>>>
>>>> On 01/15/2014 03:27 PM, Vlastimil Babka wrote:
>>>>> On 01/13/2014 12:39 PM, Daniel Borkmann wrote:
>>>>>> On 01/13/2014 11:16 AM, Vlastimil Babka wrote:
>>>>>>> On 01/11/2014 02:32 PM, Daniel Borkmann wrote:
>>>>>>>> On 01/11/2014 07:22 AM, Andrew Morton wrote:
>>>>>>>>> On Fri, 10 Jan 2014 19:23:26 +0100 Daniel Borkmann
>>>>>>>>> <borkmann@iogearbox.net> wrote:
>>>>>>>>>
>>>>>>>>>> This is being reliably triggered for each mmaped() packet(7)
>>>>>>>>>> socket from user space, basically during unmapping resp.
>>>>>>>>>> closing the TX socket.
>>>>>>>>>>
>>>>>>>>>> I believe due to some change in transparent hugepages code ?
>>>>>>>>>>
>>>>>>>>>> When I disable transparent hugepages, everything works fine,
>>>>>>>>>> no BUG triggered.
>>>>>>>>>>
>>>>>>>>>> I'd be happy to test patches.
>>>>>>>>> Did the inclusion of c424be1cbbf852e46acc8 ("mm: munlock: fix
>>>>>>>>> a bug
>>>>>>>>> where THP tail page is encountered") in current mainline fix
>>>>>>>>> this?
>>>>>>>> Thanks for your answer Andrew!
>>>>>>>>
>>>>>>>> Hm, I just cherry-picked that onto current net-next as I have
>>>>>>>> some work
>>>>>>>> there, and this time I got ...
>>>>>>>>
>>>>>>>> (User space uses packet mmap() and mlockall(MCL_CURRENT |
>>>>>>>> MCL_FUTURE)
>>>>>>>>         and on shutdown munlockall() ...)
>>>>>>>>
>>>>>>>> [   63.863672] ------------[ cut here ]------------
>>>>>>>> [   63.863702] kernel BUG at mm/mlock.c:507!
>>>>>>>> [   63.863721] invalid opcode: 0000 [#1] SMP
>>>>>>>> [   63.863743] Modules linked in: fuse ebtable_nat xt_CHECKSUM
>>>>>>>> nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE
>>>>>>>> ip6table_nat nf_nat_ipv6 ip6table_mangle ip6t_REJECT
>>>>>>>> nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat
>>>>>>>> iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack
>>>>>>>> nf_conntrack bridge ebtable_filter ebtables stp llc
>>>>>>>> ip6table_filter ip6_tables rfcomm bnep snd_hda_codec_hdmi
>>>>>>>> snd_hda_codec_realtek snd_hda_intel snd_hda_codec iwlwifi
>>>>>>>> cfg80211 snd_hwdep btusb snd_seq bluetooth sdhci_pci
>>>>>>>> snd_seq_device e1000e tpm_tis snd_pcm thinkpad_acpi sdhci ptp
>>>>>>>> tpm uvcvideo pps_core snd_page_alloc snd_timer snd rfkill
>>>>>>>> mmc_core iTCO_wdt iTCO_vendor_support lpc_ich mfd_core
>>>>>>>> soundcore joydev wmi videobuf2_vmalloc videobuf2_memops
>>>>>>>> videobuf2_core i2c_i801 pcspkr videodev media uinput i915
>>>>>>>> [   63.864152]  i2c_algo_bit drm_kms_helper drm i2c_core video
>>>>>>>> [   63.864181] CPU: 1 PID: 1617 Comm: trafgen Not tainted
>>>>>>>> 3.13.0-rc6+ #15
>>>>>>>> [   63.864209] Hardware name: LENOVO 2429BP3/2429BP3, BIOS
>>>>>>>> G4ET37WW (1.12 ) 05/29/2012
>>>>>>>> [   63.864242] task: ffff8801ee060000 ti: ffff8800b5954000
>>>>>>>> task.ti: ffff8800b5954000
>>>>>>>> [   63.864274] RIP: 0010:[<ffffffff8116fa9a>] 
>>>>>>>> [<ffffffff8116fa9a>] munlock_vma_pages_range+0x2ea/0x2f0
>>>>>>>> [   63.864318] RSP: 0018:ffff8800b5955e08  EFLAGS: 00010202
>>>>>>>> [   63.864341] RAX: 00000000000001ff RBX: ffff8800b58f7508 RCX:
>>>>>>>> 0000000000000034
>>>>>>>> [   63.864372] RDX: 00000007f0708992 RSI: ffffea0002c3e700 RDI:
>>>>>>>> ffffea0002c3e700
>>>>>>>> [   63.864402] RBP: ffff8800b5955ee0 R08: 3800000000000000 R09:
>>>>>>>> a8000b0f9c000000
>>>>>>>> [   63.864432] R10: 57ffdef066c3e700 R11: ffffff5cfb00c14a R12:
>>>>>>>> ffffea0002c3e700
>>>>>>>> [   63.864462] R13: ffff8800b5955f48 R14: 00007f0708992000 R15:
>>>>>>>> 00007f0708992000
>>>>>>>> [   63.864492] FS:  00007f0708b92740(0000)
>>>>>>>> GS:ffff88021e240000(0000) knlGS:0000000000000000
>>>>>>>> [   63.864526] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>>>> [   63.864551] CR2: 00007f33bb373000 CR3: 00000000b2a2c000 CR4:
>>>>>>>> 00000000001407e0
>>>>>>>> [   63.864581] Stack:
>>>>>>>> [   63.864593]  ffff8800b5955ed0 00007f0708b91fff
>>>>>>>> 00007f0708b92000 ffff8800b5955e48
>>>>>>>> [   63.864632]  000001ff810c864b ffff8801ee060000
>>>>>>>> 0000000000000000 0000000000000000
>>>>>>>> [   63.864669]  ffff8800b5955e58 ffff8801ee060000
>>>>>>>> 0000000700000086 ffff8801ee060000
>>>>>>>> [   63.864708] Call Trace:
>>>>>>>> [   63.864724]  [<ffffffff816956bc>] ?
>>>>>>>> _raw_spin_unlock_irq+0x2c/0x30
>>>>>>>> [   63.864754]  [<ffffffff81171b52>] ? vma_merge+0xc2/0x330
>>>>>>>> [   63.864786]  [<ffffffff8116fb9c>] mlock_fixup+0xfc/0x190
>>>>>>>> [   63.864812]  [<ffffffff8116fde7>] do_mlockall+0x87/0xc0
>>>>>>>> [   63.864836]  [<ffffffff811702df>] sys_munlockall+0x2f/0x50
>>>>>>>> [   63.864873]  [<ffffffff8169e192>]
>>>>>>>> system_call_fastpath+0x16/0x1b
>>>>>>>> [   63.864898] Code: d7 48 89 95 28 ff ff ff e8 a4 04 fe ff 84
>>>>>>>> c0 48 8b 95 28 ff ff ff 0f 85 5a ff ff ff e9 46 ff ff ff e8 3f
>>>>>>>> ac 51 00 e8 34 ac 51 00 <0f> 0b 0f 1f 40 00 0f 1f 44 00 00 55
>>>>>>>> 48 89 e5 41 57 41 56 41 55
>>>>>>>> [   63.865114] RIP  [<ffffffff8116fa9a>]
>>>>>>>> munlock_vma_pages_range+0x2ea/0x2f0
>>>>>>>> [   63.865148]  RSP <ffff8800b5955e08>
>>>>>>>> [   63.874968] ------------[ cut here ]------------
>>>>>>>>
>>>>>>>> ... when I find some time, I'll try with normal torvalds' tree,
>>>>>>>> maybe some
>>>>>>>> other patches are missing as well, not sure right now.
>>>>>>> Uh so the triggered assertion is the one added by this very
>>>>>>> patch, and there are no more changes wrt this in mainline.
>>>>>>>
>>>>>>> If you can still try debug patches, please try this. Thanks.
>>>>>> Yes, thanks, I'll come back to you some time by today.
>>>>> Daniel sent me (off-list) instructions to reproduce:
>>>>>
>>>>>> Then in the kernel source tree, you'll find:
>>>>>>
>>>>>>       tools/testing/selftests/net/
>>>>>>
>>>>>> There, just do a 'make' and run ./psock_tpacket
>>>>> It reproduces deterministically in mainline since 3.12, i.e. my
>>>>> munlock
>>>>> performance series. Based on the initial debug output, I've
>>>>> expanded the
>>>>> debug patch below a bit:
>>>>>
>>>>>>> From: Vlastimil Babka <vbabka@suse.cz>
>>>>>>> Date: Mon, 13 Jan 2014 11:13:53 +0100
>>>>>>> Subject: [PATCH] debug munlock_vma_pages_range
>>>>>>>
>>>>>>> ---
>>>>>>>       mm/mlock.c | 22 ++++++++++++++++++++--
>>>>>>>       1 file changed, 20 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/mm/mlock.c b/mm/mlock.c
>>>>>>> index c59c420..7d0e29a 100644
>>>>>>> --- a/mm/mlock.c
>>>>>>> +++ b/mm/mlock.c
>>>>>>> @@ -448,12 +448,14 @@ static unsigned long
>>>>>>> __munlock_pagevec_fill(struct pagevec *pvec,
>>>>>>>       void munlock_vma_pages_range(struct vm_area_struct *vma,
>>>>>>>                        unsigned long start, unsigned long end)
>>>>>>>       {
>>>>>>> +    unsigned long orig_start = start;
>>>>>>> +    unsigned long page_increm = 0;
>>>>>>> +
>>>>>>>           vma->vm_flags &= ~VM_LOCKED;
>>>>>>>
>>>>>>>           while (start < end) {
>>>>>>>               struct page *page = NULL;
>>>>>>>               unsigned int page_mask;
>>>>>>> -        unsigned long page_increm;
>>>>>>>               struct pagevec pvec;
>>>>>>>               struct zone *zone;
>>>>>>>               int zoneid;
>>>>>>> @@ -504,7 +506,23 @@ void munlock_vma_pages_range(struct
>>>>>>> vm_area_struct *vma,
>>>>>>>                   }
>>>>>>>               }
>>>>>>>               /* It's a bug to munlock in the middle of a THP
>>>>>>> page */
>>>>>>> -        VM_BUG_ON((start >> PAGE_SHIFT) & page_mask);
>>>>>>> +        if ((start >> PAGE_SHIFT) & page_mask) {
>>>>>>> +            dump_page(page);
>>>>>>> +            printk("start=%lu pfn=%lu orig_start=%lu "
>>>>>>> +                   "prev_page_increm=%lu page_mask=%u "
>>>>>>> +                   "vm_start=%lu vm_end=%lu vm_flags=%lu\n",
>>>>>>> +                start, page_to_pfn(page), orig_start,
>>>>>>> +                page_increm, page_mask,
>>>>>>> +                vma->vm_start, vma->vm_end,
>>>>>>> +                vma->vm_flags);
>>>>> +                        printk("vm_ops=%pF, open=%pF, fault=%pF,
>>>>> remap_pages=%pF\n", vma->vm_ops,
>>>>> +                                vma->vm_ops->open,
>>>>> vma->vm_ops->fault, vma->vm_ops->remap_pages);
>>>>> +                        if (PageCompound(page)) {
>>>>> +                                printk("page is compound with
>>>>> order=%d\n", compound_order(page));
>>>>> +                        }
>>>>>>> +            if (PageTail(page)) {
>>>>>>> +                struct page *first_page = page->first_page;
>>>>>>> +                printk("first_page pfn=%lu\n",
>>>>>>> +                        page_to_pfn(first_page));
>>>>>>> +                dump_page(first_page);
>>>>>>> +            }
>>>>>>> +            VM_BUG_ON(true);
>>>>>>> +        }
>>>>>>>               page_increm = 1 + page_mask;
>>>>>>>               start += page_increm * PAGE_SIZE;
>>>>>>>       next:
>>>>>>>
>>>>> And got output like this:
>>>>>
>>>>> page:ffffea0002474a40 count:5 mapcount:1 mapping:          (null)
>>>>> index:0x0
>>>>> page flags: 0x100000000004004(referenced|head)
>>>>> start=140242647736320 pfn=682616 orig_start=140242647736320
>>>>> prev_page_increm=0 page_mask=511 vm_start=140242647736320
>>>>> vm_end=140242651930624 vm_flags=268435707
>>>>> vm_ops=packet_mmap_ops+0x0/0xfffffffffffff8e0 [af_packet],
>>>>> open=packet_mm_open+0x0/0x30 [af_packet], fault=          (null),
>>>>> remap_pages=          (null)
>>>>> page is compound with order=2
>>>>>
>>>>> Observations:
>>>>> - address 140242647736320 is where the vma starts, and is not
>>>>> aligned to 512 pages
>>>>>      (so it cannot be a THP head which the munlock expects). Yet
>>>>> there is a head page
>>>>>      that triggers the PageTransHuge() and consequently
>>>>> hpage_nr_pages() in munlock_vma_page()
>>>>>      That's why page_mask is determined to be 511 and the code
>>>>> thinks it's in the
>>>>>      middle of a THP page.
>>>>> - in fact, the page is a compound page with order=2
>>>>> - the VM flags (except (may)read/write) are VM_SHARED and VM_MIXEDMAP
>>>>> - the vma was mmapped by packet_mmap() (net/packet/af_packet.c)
>>>>> which uses
>>>>>      vm_insert_page(), which adds the VM_MIXEDMAP flag
>>>>> - the buffers that are mapped were allocated by
>>>>> alloc_one_pg_vec_page()
>>>>>      where flags indeed include __GFP_COMP
>>>>>
>>>>> So clearly there is a way to have mlock/munlock operate on a vma
>>>>> that contains
>>>>> compound pages and confuse the checks for PageTransHuge().
>>>>>
>>>>> The checks for THP in munlock came with commit ff6a6da60b89 ("mm:
>>>>> accelerate munlock()
>>>>> treatment of THP pages"), i.e. since 3.9, but did not trigger a
>>>>> bug. It however
>>>>> makes munlock_vma_pages_range() skip pages until the next
>>>>> 512-pages-aligned page,
>>>>> when it encounters a head page. If the head page is of smaller
>>>>> order and is followed
>>>>> by normal LRU pages (theoretically, I'm not sure if that's
>>>>> possible, or done anywhere),
>>>>> they wouldn't get munlocked.
>>>>>
>>>>> My commit 7225522bb429 ("mm: munlock: batch non-THP page isolation
>>>>> and
>>>>> munlock+putback using pagevec") (since 3.12) has added a new
>>>>> PageTransHuge() check
>>>>> that can trigger on tail pages of the compound page here. Commit
>>>>> c424be1cbbf852e46acc8
>>>>> ("mm: munlock: fix a bug where THP tail page is encountered") in
>>>>> current rc's removes
>>>>> one class of bugs here, but still non-THP compound pages are not
>>>>> expected in mlock/munlock,
>>>>> which leads to this assertion failing.
>>>>>
>>>>> The question is what is the correct fix, and I'm not that familiar
>>>>> with VM_MIXEDMAP
>>>>> to decide.
>>>>>
>>>>> Option 1: mlocking VM_MIXEDMAP vma's has no sense. They should be
>>>>> treated like VM_PFNMAP
>>>>>              and added to VM_SPECIAL, which makes m(un)lock skip
>>>>> them completely.
>>>>>
>>>>> Option 2: if indeed VM_MIXEDMAP can contain PageLRU pages for
>>>>> which mlocking is useful,
>>>>>              VM_NO_THP should be checked in munlock before
>>>>> attempting PageTransHuge() and
>>>>>              friends. VM_NO_THP already contains VM_MIXEDMAP, so
>>>>> knowing that there can be
>>>>>              no THP means we don't try optimize for it and no
>>>>> unexpected head pages trip us.
>>>>>
>>>>> Thoughts?
>>> OK, here's a RFC patch to hopefully help get us somewhere. I went for
>>> Option1, as I didn't see anyone using VM_MIXEDMAP also for LRU pages,
>>> and Option2 was ugly to implement and also seemed quite arbitrary. I'm
>>> not sure if making VM_MIXEDMAP also non-mergeable this way is an issue
>>> though.
>>>
>>>
>>
>> Hi!
>>
>> Forgive an ignorant question, but are anonymous COW'd pages LRU pages?
>
> I believe so, but I've checked where VM_MIXEDMAP is used in TTM and it
> seems all those vma's are also VM_IO which means they are already
> included in VM_SPECIAL and this change won't affect them.
>
> Vlastimil
>

OK. Thanks.

/Thomas




>> The reason I'm asking is that TTM VM_MIXEDMAP vmas may contain such
>> pages.
>>
>> /Thomas
>>
>> -- 
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to majordomo@kvack.org.  For more info on Linux MM,
>> see:
>> https://urldefense.proofpoint.com/v1/url?u=http://www.linux-mm.org/&k=oIvRg1%2BdGAgOoM1BIlLLqw%3D%3D%0A&r=l5Ago9ekmVFZ3c4M6eauqrJWGwjf6fTb%2BP3CxbBFkVM%3D%0A&m=Wlb%2FkDdCXWj2m8QNoBUogfTl0sK0cH2%2BOONacP0U1SE%3D%0A&s=84b13d34ca94efa34cd185d7ae467b18377a92330174671ac93dbb2948cda967
>> .
>> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [BUG] at include/linux/page-flags.h:415 (PageTransHuge)
  2014-01-31 14:40             ` Vlastimil Babka
  2014-01-31 14:58               ` Thomas Hellstrom
@ 2014-02-07 18:58               ` Hannes Frederic Sowa
  2014-02-12 12:02                 ` Daniel Borkmann
  1 sibling, 1 reply; 13+ messages in thread
From: Hannes Frederic Sowa @ 2014-02-07 18:58 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Daniel Borkmann, Andrew Morton, linux-kernel, Michel Lespinasse,
	linux-mm, Jared Hulbert, netdev, Thomas Hellstrom,
	John David Anglin, HATAYAMA Daisuke, Konstantin Khlebnikov,
	Carsten Otte, Peter Zijlstra

Hi!

On Fri, Jan 31, 2014 at 03:40:38PM +0100, Vlastimil Babka wrote:
> From: Vlastimil Babka <vbabka@suse.cz>
> Date: Fri, 31 Jan 2014 11:50:21 +0100
> Subject: [PATCH] mm: include VM_MIXEDMAP flag in the VM_SPECIAL list to avoid
>  m(un)locking
> 
> Daniel Borkmann reported a bug with VM_BUG_ON assertions failing where
> munlock_vma_pages_range() thinks it's unexpectedly in the middle of a THP page.
> This can be reproduced in tools/testing/selftests/net/ by running make and
> then ./psock_tpacket.
> 
> The problem is that an order=2 compound page (allocated by
> alloc_one_pg_vec_page() is part of the munlocked VM_MIXEDMAP vma (mapped by
> packet_mmap()) and mistaken for a THP page and assumed to be order=9.
> 
> The checks for THP in munlock came with commit ff6a6da60b89 ("mm: accelerate
> munlock() treatment of THP pages"), i.e. since 3.9, but did not trigger a bug.
> It just makes munlock_vma_pages_range() skip such compound pages until the next
> 512-pages-aligned page, when it encounters a head page. This is however not a
> problem for vma's where mlocking has no effect anyway, but it can distort the
> accounting.
> Since commit 7225522bb ("mm: munlock: batch non-THP page isolation and
> munlock+putback using pagevec") this can trigger a VM_BUG_ON in PageTransHuge()
> check.
> 
> This patch fixes the issue by adding VM_MIXEDMAP flag to VM_SPECIAL - a list of
> flags that make vma's non-mlockable and non-mergeable. The reasoning is that
> VM_MIXEDMAP vma's are similar to VM_PFNMAP, which is already on the VM_SPECIAL
> list, and both are intended for non-LRU pages where mlocking makes no sense
> anyway.

I also ran into this problem and wanted to ask what the status of this
patch is? Does it need further testing? I can surely help with that. ;)

Thanks,

  Hannes


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [BUG] at include/linux/page-flags.h:415 (PageTransHuge)
  2014-02-07 18:58               ` Hannes Frederic Sowa
@ 2014-02-12 12:02                 ` Daniel Borkmann
  0 siblings, 0 replies; 13+ messages in thread
From: Daniel Borkmann @ 2014-02-12 12:02 UTC (permalink / raw)
  To: Vlastimil Babka, Andrew Morton, linux-kernel, Michel Lespinasse,
	linux-mm, Jared Hulbert, netdev, Thomas Hellstrom,
	John David Anglin, HATAYAMA Daisuke, Konstantin Khlebnikov,
	Carsten Otte, Peter Zijlstra

Hi Vlastimil,

On 02/07/2014 07:58 PM, Hannes Frederic Sowa wrote:
> On Fri, Jan 31, 2014 at 03:40:38PM +0100, Vlastimil Babka wrote:
>> From: Vlastimil Babka <vbabka@suse.cz>
>> Date: Fri, 31 Jan 2014 11:50:21 +0100
>> Subject: [PATCH] mm: include VM_MIXEDMAP flag in the VM_SPECIAL list to avoid
>>   m(un)locking
>>
>> Daniel Borkmann reported a bug with VM_BUG_ON assertions failing where
>> munlock_vma_pages_range() thinks it's unexpectedly in the middle of a THP page.
>> This can be reproduced in tools/testing/selftests/net/ by running make and
>> then ./psock_tpacket.
>>
>> The problem is that an order=2 compound page (allocated by
>> alloc_one_pg_vec_page() is part of the munlocked VM_MIXEDMAP vma (mapped by
>> packet_mmap()) and mistaken for a THP page and assumed to be order=9.
>>
>> The checks for THP in munlock came with commit ff6a6da60b89 ("mm: accelerate
>> munlock() treatment of THP pages"), i.e. since 3.9, but did not trigger a bug.
>> It just makes munlock_vma_pages_range() skip such compound pages until the next
>> 512-pages-aligned page, when it encounters a head page. This is however not a
>> problem for vma's where mlocking has no effect anyway, but it can distort the
>> accounting.
>> Since commit 7225522bb ("mm: munlock: batch non-THP page isolation and
>> munlock+putback using pagevec") this can trigger a VM_BUG_ON in PageTransHuge()
>> check.
>>
>> This patch fixes the issue by adding VM_MIXEDMAP flag to VM_SPECIAL - a list of
>> flags that make vma's non-mlockable and non-mergeable. The reasoning is that
>> VM_MIXEDMAP vma's are similar to VM_PFNMAP, which is already on the VM_SPECIAL
>> list, and both are intended for non-LRU pages where mlocking makes no sense
>> anyway.

Thanks a lot for your efforts.

Is your patch queued up somewhere for mainline and stable?

> I also ran into this problem and wanted to ask what the status of this
> patch is? Does it need further testing? I can surely help with that. ;)
>
> Thanks,
>
>    Hannes
>


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2014-02-12 12:03 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-01-10 18:23 [BUG] at include/linux/page-flags.h:415 (PageTransHuge) Daniel Borkmann
2014-01-11  6:22 ` Andrew Morton
2014-01-11 13:32   ` Daniel Borkmann
2014-01-13 10:16     ` Vlastimil Babka
2014-01-13 11:39       ` Daniel Borkmann
2014-01-15 14:27         ` Vlastimil Babka
2014-01-15 16:06           ` Daniel Borkmann
2014-01-31 14:40             ` Vlastimil Babka
2014-01-31 14:58               ` Thomas Hellstrom
2014-01-31 15:25                 ` Vlastimil Babka
2014-01-31 15:35                   ` Thomas Hellstrom
2014-02-07 18:58               ` Hannes Frederic Sowa
2014-02-12 12:02                 ` Daniel Borkmann

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).