linux-kernel.vger.kernel.org archive mirror
* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
@ 2011-10-12 18:12 Paweł Sikora
  2011-10-13 23:16 ` Hugh Dickins
  0 siblings, 1 reply; 72+ messages in thread
From: Paweł Sikora @ 2011-10-12 18:12 UTC (permalink / raw)
  To: Hugh Dickins, linux-mm; +Cc: jpiszcz, arekm, linux-kernel

Hi Hugh,
I'm resending my previous private email with a larger cc list, as you've requested.


Last weekend my server died again (processes stuck for 22/23s!), but this time I have more logs for you.
 
On my dual-Opteron machines I have non-standard settings:
- DISABLED swap space (fast user process killing is better for my autotest farm than long disk swapping
                       and 64GB ecc-ram is enough for my processing).
- vm.overcommit_memory = 2,
- vm.overcommit_ratio = 100.

After the initial BUG_ON (pasted below) there's a flood of 'rcu_sched_state detected stalls / CPU#X stuck for 22s!' messages.
(full compressed log is available at: http://pluto.agmk.net/kernel/kernel.bz2)

Oct  9 08:06:43 hal kernel: [408578.629070] ------------[ cut here ]------------
Oct  9 08:06:43 hal kernel: [408578.629143] kernel BUG at include/linux/swapops.h:105!
Oct  9 08:06:43 hal kernel: [408578.629143] invalid opcode: 0000 [#1] SMP
Oct  9 08:06:43 hal kernel: [408578.629143] CPU 14
Oct  9 08:06:43 hal kernel: [408578.629143] Modules linked in: nfs fscache binfmt_misc nfsd lockd nfs_acl auth_rpcgss sunrpc ipmi_si ipmi_devintf ipmi_msghandler sch_sfq iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter xt_TCPMSS xt_tcpudp iptable_mangle ip_tables ip6table_filter ip6_tables x_tables ext4 jbd2 crc16 raid10 raid0 dm_mod uvesafb autofs4 dummy aoe joydev usbhid hid ide_cd_mod cdrom ata_generic pata_acpi ohci_hcd pata_atiixp sp5100_tco ide_pci_generic ssb ehci_hcd igb i2c_piix4 pcmcia evdev atiixp pcmcia_core psmouse ide_core usbcore i2c_core amd64_edac_mod mmc_core edac_core pcspkr sg edac_mce_amd button serio_raw processor dca ghes k10temp hwmon hed sd_mod crc_t10dif raid1 md_mod ext3 jbd mbcache ahci libahci libata scsi_mod [last unloaded: scsi_wait_scan]
Oct  9 08:06:43 hal kernel: [408578.629143]
Oct  9 08:06:43 hal kernel: [408578.629143] Pid: 29214, comm: bitgen Not tainted 3.0.4 #5 Supermicro H8DGU/H8DGU
Oct  9 08:06:43 hal kernel: [408578.629143] RIP: 0010:[<ffffffff81127b76>]  [<ffffffff81127b76>] migration_entry_wait+0x156/0x160
Oct  9 08:06:43 hal kernel: [408578.629143] RSP: 0000:ffff88021cee7d18  EFLAGS: 00010246
Oct  9 08:06:43 hal kernel: [408578.629143] RAX: 0a00000000080068 RBX: ffffea001d1dbe70 RCX: ffff880c02d18978
Oct  9 08:06:43 hal kernel: [408578.629143] RDX: 0000000000851a42 RSI: ffff880d8fe33618 RDI: ffffea002a09dd50
Oct  9 08:06:43 hal kernel: [408578.629143] RBP: ffff88021cee7d38 R08: ffff880d8fe33618 R09: 0000000000000028
Oct  9 08:06:43 hal kernel: [408578.629143] R10: ffff881006eb0f00 R11: f800000000851a42 R12: ffffea002a09dd40
Oct  9 08:06:43 hal kernel: [408578.629143] R13: 0000000c02d18978 R14: 00000000d872f000 R15: ffff880d8fe33618
Oct  9 08:06:43 hal kernel: [408578.629143] FS:  00007f864d7fa700(0000) GS:ffff880c1fc80000(0063) knlGS:00000000f432a910
Oct  9 08:06:43 hal kernel: [408578.629143] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
Oct  9 08:06:43 hal kernel: [408578.629143] CR2: 00000000d872f000 CR3: 000000100668c000 CR4: 00000000000006e0
Oct  9 08:06:43 hal kernel: [408578.629143] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Oct  9 08:06:43 hal kernel: [408578.629143] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Oct  9 08:06:43 hal kernel: [408578.629143] Process bitgen (pid: 29214, threadinfo ffff88021cee6000, task ffff880407b06900)
Oct  9 08:06:43 hal kernel: [408578.629143] Stack:
Oct  9 08:06:43 hal kernel: [408578.629143]  00000000dcef4000 ffff880c00968c98 ffff880c02d18978 000000010a34843e
Oct  9 08:06:43 hal kernel: [408578.629143]  ffff88021cee7de8 ffffffff811016a1 ffff88021cee7d78 ffff881006eb0f00
Oct  9 08:06:43 hal kernel: [408578.629143]  ffff88021cee7d98 ffffffff810feee2 ffff880c06f0d170 8000000b98d14067
Oct  9 08:06:43 hal kernel: [408578.629143] Call Trace:
Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff811016a1>] handle_pte_fault+0xae1/0xaf0
Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff810feee2>] ? __pte_alloc+0x42/0x120
Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff8112c26b>] ? do_huge_pmd_anonymous_page+0xab/0x310
Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff81102a31>] handle_mm_fault+0x181/0x310
Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff81106097>] ? vma_adjust+0x537/0x570
Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff81109a05>] ? do_mremap+0x2d5/0x570
Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff81421d5f>] page_fault+0x1f/0x30
Oct  9 08:06:43 hal kernel: [408578.629143] Code: 80 00 00 00 00 31 f6 48 89 df e8 e6 58 fb ff eb d7 85 c9 0f 84 44 ff ff ff 8d 51 01 89 c8 f0 0f b1 16 39 c1 90 74 b5 89 c1 eb e6 <0f> 0b 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 83 ec 20 48 85 ff
Oct  9 08:06:43 hal kernel: [408578.629143] RIP  [<ffffffff81127b76>] migration_entry_wait+0x156/0x160
Oct  9 08:06:43 hal kernel: [408578.629143]  RSP <ffff88021cee7d18>
Oct  9 08:06:43 hal kernel: [408578.642823] ---[ end trace 0a37362301163711 ]---
Oct  9 08:07:10 hal kernel: [408605.283257] BUG: soft lockup - CPU#12 stuck for 23s! [par:29801]
Oct  9 08:07:10 hal kernel: [408605.283367] Modules linked in: nfs fscache binfmt_misc nfsd lockd nfs_acl auth_rpcgss sunrpc ipmi_si ipmi_devintf ipmi_msghandler sch_sfq iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter xt_TCPMSS xt_tcpudp iptable_mangle ip_tables ip6table_filter ip6_tables x_tables ext4 jbd2 crc16 raid10 raid0 dm_mod uvesafb autofs4 dummy aoe joydev usbhid hid ide_cd_mod cdrom ata_generic pata_acpi ohci_hcd pata_atiixp sp5100_tco ide_pci_generic ssb ehci_hcd igb i2c_piix4 pcmcia evdev atiixp pcmcia_core psmouse ide_core usbcore i2c_core amd64_edac_mod mmc_core edac_core pcspkr sg edac_mce_amd button serio_raw processor dca ghes k10temp hwmon hed sd_mod crc_t10dif raid1 md_mod ext3 jbd mbcache ahci libahci libata scsi_mod [last unloaded: scsi_wait_scan]
Oct  9 08:07:10 hal kernel: [408605.285807] CPU 12
Oct  9 08:07:10 hal kernel: [408605.285807] Modules linked in: nfs fscache binfmt_misc nfsd lockd nfs_acl auth_rpcgss sunrpc ipmi_si ipmi_devintf ipmi_msghandler sch_sfq iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter xt_TCPMSS xt_tcpudp iptable_mangle ip_tables ip6table_filter ip6_tables x_tables ext4 jbd2 crc16 raid10 raid0 dm_mod uvesafb autofs4 dummy aoe joydev usbhid hid ide_cd_mod cdrom ata_generic pata_acpi ohci_hcd pata_atiixp sp5100_tco ide_pci_generic ssb ehci_hcd igb i2c_piix4 pcmcia evdev atiixp pcmcia_core psmouse ide_core usbcore i2c_core amd64_edac_mod mmc_core edac_core pcspkr sg edac_mce_amd button serio_raw processor dca ghes k10temp hwmon hed sd_mod crc_t10dif raid1 md_mod ext3 jbd mbcache ahci libahci libata scsi_mod [last unloaded: scsi_wait_scan]
Oct  9 08:07:10 hal kernel: [408605.285807]
Oct  9 08:07:10 hal kernel: [408605.285807] Pid: 29801, comm: par Tainted: G      D     3.0.4 #5 Supermicro H8DGU/H8DGU
Oct  9 08:07:10 hal kernel: [408605.285807] RIP: 0010:[<ffffffff814216a4>]  [<ffffffff814216a4>] _raw_spin_lock+0x14/0x20
Oct  9 08:07:10 hal kernel: [408605.285807] RSP: 0018:ffff880c02def808  EFLAGS: 00000293
Oct  9 08:07:10 hal kernel: [408605.285807] RAX: 0000000000000b09 RBX: ffffea002741f6b8 RCX: ffff880000000000
Oct  9 08:07:10 hal kernel: [408605.285807] RDX: ffffea0000000000 RSI: 000000002a09dd40 RDI: ffffea002a09dd50
Oct  9 08:07:10 hal kernel: [408605.285807] RBP: ffff880c02def808 R08: 0000000000000000 R09: ffff880f2e4f4d70
Oct  9 08:07:10 hal kernel: [408605.285807] R10: ffff880c1fffbe00 R11: 0000000000000050 R12: ffffffff8142988e
Oct  9 08:07:10 hal kernel: [408605.285807] R13: ffff880c02def7b8 R14: ffffffff810e63dc R15: ffff880c02def7b8
Oct  9 08:07:10 hal kernel: [408605.285807] FS:  00007fe6b677c720(0000) GS:ffff880c1fc00000(0063) knlGS:00000000f6e0d910
Oct  9 08:07:10 hal kernel: [408605.285807] CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
Oct  9 08:07:10 hal kernel: [408605.285807] CR2: 00000000dd40012c CR3: 00000009f6b78000 CR4: 00000000000006e0
Oct  9 08:07:10 hal kernel: [408605.285807] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Oct  9 08:07:10 hal kernel: [408605.285807] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Oct  9 08:07:10 hal kernel: [408605.285807] Process par (pid: 29801, threadinfo ffff880c02dee000, task ffff880c07c90700)
Oct  9 08:07:10 hal kernel: [408605.285807] Stack:
Oct  9 08:07:10 hal kernel: [408605.285807]  ffff880c02def858 ffffffff8110a2d7 ffffea001fc16fe0 ffff880c02d183a8
Oct  9 08:07:10 hal kernel: [408605.285807]  ffff880c02def8c8 ffffea001fc16fa8 ffff880c00968c98 ffff881006eb0f00
Oct  9 08:07:10 hal kernel: [408605.285807]  0000000000000301 00000000d8675000 ffff880c02def8c8 ffffffff8110aa6a
Oct  9 08:07:10 hal kernel: [408605.285807] Call Trace:
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110a2d7>] __page_check_address+0x107/0x1a0
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110aa6a>] try_to_unmap_one+0x3a/0x420
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110be44>] try_to_unmap_anon+0xb4/0x130
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110bf75>] try_to_unmap+0x65/0x80
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff811285d0>] migrate_pages+0x310/0x4c0
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e93c2>] ? ____pagevec_lru_add+0x12/0x20
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111cbf0>] ? ftrace_define_fields_mm_compaction_isolate_template+0x70/0x70
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111d5da>] compact_zone+0x52a/0x8c0
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810f9919>] ? zone_statistics+0x99/0xc0
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111dade>] compact_zone_order+0x7e/0xb0
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e46a8>] ? get_page_from_freelist+0x3b8/0x7e0
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111dbcd>] try_to_compact_pages+0xbd/0xf0
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e5148>] __alloc_pages_direct_compact+0xa8/0x180
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e56c5>] __alloc_pages_nodemask+0x4a5/0x7f0
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e9698>] ? lru_cache_add_lru+0x28/0x50
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110a92d>] ? page_add_new_anon_rmap+0x9d/0xb0
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111b865>] alloc_pages_vma+0x95/0x180
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8112c2f8>] do_huge_pmd_anonymous_page+0x138/0x310
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff81102ace>] handle_mm_fault+0x21e/0x310
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff81001716>] ? __switch_to+0x1e6/0x2c0
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8141f178>] ? schedule+0x308/0xa10
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff811077a7>] ? do_mmap_pgoff+0x357/0x370
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110790d>] ? sys_mmap_pgoff+0x14d/0x220
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff81421d5f>] page_fault+0x1f/0x30
Oct  9 08:07:10 hal kernel: [408605.285807] Code: 0f b6 c2 85 c0 0f 95 c0 0f b6 c0 5d c3 66 2e 0f 1f 84 00 00 00 00 00 55 b8 00 01 00 00 48 89 e5 f0 66 0f c1 07 38 e0 74 06 f3 90 <8a> 07 eb f6 5d c3 66 0f 1f 44 00 00 55 48 89 e5 9c 58 fa ba 00
Oct  9 08:07:10 hal kernel: [408605.285807] Call Trace:
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110a2d7>] __page_check_address+0x107/0x1a0
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110aa6a>] try_to_unmap_one+0x3a/0x420
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110be44>] try_to_unmap_anon+0xb4/0x130
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110bf75>] try_to_unmap+0x65/0x80
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff811285d0>] migrate_pages+0x310/0x4c0
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e93c2>] ? ____pagevec_lru_add+0x12/0x20
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111cbf0>] ? ftrace_define_fields_mm_compaction_isolate_template+0x70/0x70
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111d5da>] compact_zone+0x52a/0x8c0
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810f9919>] ? zone_statistics+0x99/0xc0
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111dade>] compact_zone_order+0x7e/0xb0
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e46a8>] ? get_page_from_freelist+0x3b8/0x7e0
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111dbcd>] try_to_compact_pages+0xbd/0xf0
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e5148>] __alloc_pages_direct_compact+0xa8/0x180
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e56c5>] __alloc_pages_nodemask+0x4a5/0x7f0
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e9698>] ? lru_cache_add_lru+0x28/0x50
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110a92d>] ? page_add_new_anon_rmap+0x9d/0xb0
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111b865>] alloc_pages_vma+0x95/0x180
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8112c2f8>] do_huge_pmd_anonymous_page+0x138/0x310
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff81102ace>] handle_mm_fault+0x21e/0x310
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff81001716>] ? __switch_to+0x1e6/0x2c0
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8141f178>] ? schedule+0x308/0xa10
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff811077a7>] ? do_mmap_pgoff+0x357/0x370
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110790d>] ? sys_mmap_pgoff+0x14d/0x220
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff81421d5f>] page_fault+0x1f/0x30

BR,
Paweł.


* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-12 18:12 kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110 Paweł Sikora
@ 2011-10-13 23:16 ` Hugh Dickins
  2011-10-13 23:30   ` Hugh Dickins
                     ` (3 more replies)
  0 siblings, 4 replies; 72+ messages in thread
From: Hugh Dickins @ 2011-10-13 23:16 UTC (permalink / raw)
  To: Pawel Sikora
  Cc: Andrew Morton, Mel Gorman, Andrea Arcangeli, linux-mm, jpiszcz,
	arekm, linux-kernel

[ Subject refers to a different, unexplained 3.0 bug from Pawel ]

On Wed, 12 Oct 2011, Pawel Sikora wrote:

> Hi Hugh,
> I'm resending my previous private email with a larger cc list, as you've requested.

Thanks, yes, on this one I think I do have an answer;
and we ought to bring Mel and Andrea in too.

> 
> Last weekend my server died again (processes stuck for 22/23s!), but this time I have more logs for you.
>  
> On my dual-Opteron machines I have non-standard settings:
> - DISABLED swap space (fast user process killing is better for my autotest farm than long disk swapping
>                        and 64GB ecc-ram is enough for my processing).
> - vm.overcommit_memory = 2,
> - vm.overcommit_ratio = 100.
> 
> After the initial BUG_ON (pasted below) there's a flood of 'rcu_sched_state detected stalls / CPU#X stuck for 22s!' messages.

Yes, those are just a tiresome consequence of exiting from a BUG
while holding the page table lock(s).

> (full compressed log is available at: http://pluto.agmk.net/kernel/kernel.bz2)
> 
> Oct  9 08:06:43 hal kernel: [408578.629070] ------------[ cut here ]------------
> Oct  9 08:06:43 hal kernel: [408578.629143] kernel BUG at include/linux/swapops.h:105!
> Oct  9 08:06:43 hal kernel: [408578.629143] invalid opcode: 0000 [#1] SMP
> Oct  9 08:06:43 hal kernel: [408578.629143] CPU 14
[ I'm deleting that irrelevant long line list of modules ]
> Oct  9 08:06:43 hal kernel: [408578.629143]
> Oct  9 08:06:43 hal kernel: [408578.629143] Pid: 29214, comm: bitgen Not tainted 3.0.4 #5 Supermicro H8DGU/H8DGU
> Oct  9 08:06:43 hal kernel: [408578.629143] RIP: 0010:[<ffffffff81127b76>]  [<ffffffff81127b76>] migration_entry_wait+0x156/0x160
> Oct  9 08:06:43 hal kernel: [408578.629143] RSP: 0000:ffff88021cee7d18  EFLAGS: 00010246
> Oct  9 08:06:43 hal kernel: [408578.629143] RAX: 0a00000000080068 RBX: ffffea001d1dbe70 RCX: ffff880c02d18978
> Oct  9 08:06:43 hal kernel: [408578.629143] RDX: 0000000000851a42 RSI: ffff880d8fe33618 RDI: ffffea002a09dd50
> Oct  9 08:06:43 hal kernel: [408578.629143] RBP: ffff88021cee7d38 R08: ffff880d8fe33618 R09: 0000000000000028
> Oct  9 08:06:43 hal kernel: [408578.629143] R10: ffff881006eb0f00 R11: f800000000851a42 R12: ffffea002a09dd40
> Oct  9 08:06:43 hal kernel: [408578.629143] R13: 0000000c02d18978 R14: 00000000d872f000 R15: ffff880d8fe33618
> Oct  9 08:06:43 hal kernel: [408578.629143] FS:  00007f864d7fa700(0000) GS:ffff880c1fc80000(0063) knlGS:00000000f432a910
> Oct  9 08:06:43 hal kernel: [408578.629143] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
> Oct  9 08:06:43 hal kernel: [408578.629143] CR2: 00000000d872f000 CR3: 000000100668c000 CR4: 00000000000006e0
> Oct  9 08:06:43 hal kernel: [408578.629143] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Oct  9 08:06:43 hal kernel: [408578.629143] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Oct  9 08:06:43 hal kernel: [408578.629143] Process bitgen (pid: 29214, threadinfo ffff88021cee6000, task ffff880407b06900)
> Oct  9 08:06:43 hal kernel: [408578.629143] Stack:
> Oct  9 08:06:43 hal kernel: [408578.629143]  00000000dcef4000 ffff880c00968c98 ffff880c02d18978 000000010a34843e
> Oct  9 08:06:43 hal kernel: [408578.629143]  ffff88021cee7de8 ffffffff811016a1 ffff88021cee7d78 ffff881006eb0f00
> Oct  9 08:06:43 hal kernel: [408578.629143]  ffff88021cee7d98 ffffffff810feee2 ffff880c06f0d170 8000000b98d14067
> Oct  9 08:06:43 hal kernel: [408578.629143] Call Trace:
> Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff811016a1>] handle_pte_fault+0xae1/0xaf0
> Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff810feee2>] ? __pte_alloc+0x42/0x120
> Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff8112c26b>] ? do_huge_pmd_anonymous_page+0xab/0x310
> Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff81102a31>] handle_mm_fault+0x181/0x310
> Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff81106097>] ? vma_adjust+0x537/0x570
> Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
> Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff81109a05>] ? do_mremap+0x2d5/0x570
> Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff81421d5f>] page_fault+0x1f/0x30
> Oct  9 08:06:43 hal kernel: [408578.629143] Code: 80 00 00 00 00 31 f6 48 89 df e8 e6 58 fb ff eb d7 85 c9 0f 84 44 ff ff ff 8d 51 01 89 c8 f0 0f b1 16 39 c1 90 74 b5 89 c1 eb e6 <0f> 0b 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 83 ec 20 48 85 ff
> Oct  9 08:06:43 hal kernel: [408578.629143] RIP  [<ffffffff81127b76>] migration_entry_wait+0x156/0x160
> Oct  9 08:06:43 hal kernel: [408578.629143]  RSP <ffff88021cee7d18>
> Oct  9 08:06:43 hal kernel: [408578.642823] ---[ end trace 0a37362301163711 ]---
> Oct  9 08:07:10 hal kernel: [408605.283257] BUG: soft lockup - CPU#12 stuck for 23s! [par:29801]
> Oct  9 08:07:10 hal kernel: [408605.285807] CPU 12
> Oct  9 08:07:10 hal kernel: [408605.285807]
> Oct  9 08:07:10 hal kernel: [408605.285807] Pid: 29801, comm: par Tainted: G      D     3.0.4 #5 Supermicro H8DGU/H8DGU
> Oct  9 08:07:10 hal kernel: [408605.285807] RIP: 0010:[<ffffffff814216a4>]  [<ffffffff814216a4>] _raw_spin_lock+0x14/0x20
> Oct  9 08:07:10 hal kernel: [408605.285807] RSP: 0018:ffff880c02def808  EFLAGS: 00000293
> Oct  9 08:07:10 hal kernel: [408605.285807] RAX: 0000000000000b09 RBX: ffffea002741f6b8 RCX: ffff880000000000
> Oct  9 08:07:10 hal kernel: [408605.285807] RDX: ffffea0000000000 RSI: 000000002a09dd40 RDI: ffffea002a09dd50
> Oct  9 08:07:10 hal kernel: [408605.285807] RBP: ffff880c02def808 R08: 0000000000000000 R09: ffff880f2e4f4d70
> Oct  9 08:07:10 hal kernel: [408605.285807] R10: ffff880c1fffbe00 R11: 0000000000000050 R12: ffffffff8142988e
> Oct  9 08:07:10 hal kernel: [408605.285807] R13: ffff880c02def7b8 R14: ffffffff810e63dc R15: ffff880c02def7b8
> Oct  9 08:07:10 hal kernel: [408605.285807] FS:  00007fe6b677c720(0000) GS:ffff880c1fc00000(0063) knlGS:00000000f6e0d910
> Oct  9 08:07:10 hal kernel: [408605.285807] CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
> Oct  9 08:07:10 hal kernel: [408605.285807] CR2: 00000000dd40012c CR3: 00000009f6b78000 CR4: 00000000000006e0
> Oct  9 08:07:10 hal kernel: [408605.285807] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Oct  9 08:07:10 hal kernel: [408605.285807] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Oct  9 08:07:10 hal kernel: [408605.285807] Process par (pid: 29801, threadinfo ffff880c02dee000, task ffff880c07c90700)
> Oct  9 08:07:10 hal kernel: [408605.285807] Stack:
> Oct  9 08:07:10 hal kernel: [408605.285807]  ffff880c02def858 ffffffff8110a2d7 ffffea001fc16fe0 ffff880c02d183a8
> Oct  9 08:07:10 hal kernel: [408605.285807]  ffff880c02def8c8 ffffea001fc16fa8 ffff880c00968c98 ffff881006eb0f00
> Oct  9 08:07:10 hal kernel: [408605.285807]  0000000000000301 00000000d8675000 ffff880c02def8c8 ffffffff8110aa6a
> Oct  9 08:07:10 hal kernel: [408605.285807] Call Trace:
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110a2d7>] __page_check_address+0x107/0x1a0
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110aa6a>] try_to_unmap_one+0x3a/0x420
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110be44>] try_to_unmap_anon+0xb4/0x130
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110bf75>] try_to_unmap+0x65/0x80
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff811285d0>] migrate_pages+0x310/0x4c0
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e93c2>] ? ____pagevec_lru_add+0x12/0x20
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111cbf0>] ? ftrace_define_fields_mm_compaction_isolate_template+0x70/0x70
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111d5da>] compact_zone+0x52a/0x8c0
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810f9919>] ? zone_statistics+0x99/0xc0
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111dade>] compact_zone_order+0x7e/0xb0
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e46a8>] ? get_page_from_freelist+0x3b8/0x7e0
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111dbcd>] try_to_compact_pages+0xbd/0xf0
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e5148>] __alloc_pages_direct_compact+0xa8/0x180
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e56c5>] __alloc_pages_nodemask+0x4a5/0x7f0
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e9698>] ? lru_cache_add_lru+0x28/0x50
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110a92d>] ? page_add_new_anon_rmap+0x9d/0xb0
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111b865>] alloc_pages_vma+0x95/0x180
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8112c2f8>] do_huge_pmd_anonymous_page+0x138/0x310
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff81102ace>] handle_mm_fault+0x21e/0x310
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff81001716>] ? __switch_to+0x1e6/0x2c0
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8141f178>] ? schedule+0x308/0xa10
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff811077a7>] ? do_mmap_pgoff+0x357/0x370
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110790d>] ? sys_mmap_pgoff+0x14d/0x220
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff81421d5f>] page_fault+0x1f/0x30
> Oct  9 08:07:10 hal kernel: [408605.285807] Code: 0f b6 c2 85 c0 0f 95 c0 0f b6 c0 5d c3 66 2e 0f 1f 84 00 00 00 00 00 55 b8 00 01 00 00 48 89 e5 f0 66 0f c1 07 38 e0 74 06 f3 90 <8a> 07 eb f6 5d c3 66 0f 1f 44 00 00 55 48 89 e5 9c 58 fa ba 00

I guess this is the only time you've seen this?  In which case, ideally
I would try to devise a testcase to demonstrate the issue below instead;
but that may involve more ingenuity than I can find time for; let's
see if people approve of this patch anyway (it applies to 3.1 or 3.0,
and earlier releases except that i_mmap_mutex used to be i_mmap_lock).


[PATCH] mm: add anon_vma locking to mremap move

I don't usually pay much attention to the stale "? " addresses in
stack backtraces, but this lucky report from Pawel Sikora hints that
mremap's move_ptes() has inadequate locking against page migration.

 3.0 BUG_ON(!PageLocked(p)) in migration_entry_to_page():
 kernel BUG at include/linux/swapops.h:105!
 RIP: 0010:[<ffffffff81127b76>]  [<ffffffff81127b76>]
                       migration_entry_wait+0x156/0x160
  [<ffffffff811016a1>] handle_pte_fault+0xae1/0xaf0
  [<ffffffff810feee2>] ? __pte_alloc+0x42/0x120
  [<ffffffff8112c26b>] ? do_huge_pmd_anonymous_page+0xab/0x310
  [<ffffffff81102a31>] handle_mm_fault+0x181/0x310
  [<ffffffff81106097>] ? vma_adjust+0x537/0x570
  [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
  [<ffffffff81109a05>] ? do_mremap+0x2d5/0x570
  [<ffffffff81421d5f>] page_fault+0x1f/0x30

mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
and pagetable locks, were good enough before page migration (with its
requirement that every migration entry be found) came in; and enough
while migration always held mmap_sem.  But not enough nowadays, when
there's memory hotremove and compaction: anon_vma lock is also needed,
to make sure a migration entry is not dodging around behind our back.

It appears that Mel's a8bef8ff6ea1 "mm: migration: avoid race between
shift_arg_pages() and rmap_walk() during migration by not migrating
temporary stacks" was actually a workaround for this in the special
common case of exec's use of move_pagetables(); and we should probably
now remove that VM_STACK_INCOMPLETE_SETUP stuff as a separate cleanup.

Reported-by: Pawel Sikora <pluto@agmk.net>
Cc: stable@kernel.org
Signed-off-by: Hugh Dickins <hughd@google.com>
---

 mm/mremap.c |    5 +++++
 1 file changed, 5 insertions(+)

--- 3.1-rc9/mm/mremap.c	2011-07-21 19:17:23.000000000 -0700
+++ linux/mm/mremap.c	2011-10-13 14:36:25.097780974 -0700
@@ -77,6 +77,7 @@ static void move_ptes(struct vm_area_str
 		unsigned long new_addr)
 {
 	struct address_space *mapping = NULL;
+	struct anon_vma *anon_vma = vma->anon_vma;
 	struct mm_struct *mm = vma->vm_mm;
 	pte_t *old_pte, *new_pte, pte;
 	spinlock_t *old_ptl, *new_ptl;
@@ -95,6 +96,8 @@ static void move_ptes(struct vm_area_str
 		mapping = vma->vm_file->f_mapping;
 		mutex_lock(&mapping->i_mmap_mutex);
 	}
+	if (anon_vma)
+		anon_vma_lock(anon_vma);
 
 	/*
 	 * We don't have to worry about the ordering of src and dst
@@ -121,6 +124,8 @@ static void move_ptes(struct vm_area_str
 		spin_unlock(new_ptl);
 	pte_unmap(new_pte - 1);
 	pte_unmap_unlock(old_pte - 1, old_ptl);
+	if (anon_vma)
+		anon_vma_unlock(anon_vma);
 	if (mapping)
 		mutex_unlock(&mapping->i_mmap_mutex);
 	mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end);


* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-13 23:16 ` Hugh Dickins
@ 2011-10-13 23:30   ` Hugh Dickins
  2011-10-16 16:11     ` Christoph Hellwig
  2011-10-16 23:54     ` Andrea Arcangeli
  2011-10-16 22:37   ` Linus Torvalds
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 72+ messages in thread
From: Hugh Dickins @ 2011-10-13 23:30 UTC (permalink / raw)
  To: Pawel Sikora
  Cc: Andrew Morton, Mel Gorman, Andrea Arcangeli, linux-mm, jpiszcz,
	arekm, linux-kernel

[ Subject refers to a different, unexplained 3.0 bug from Pawel ]
[ Resend with correct address for linux-mm@kvack.org ]

On Wed, 12 Oct 2011, Pawel Sikora wrote:

> Hi Hugh,
> I'm resending my previous private email with a larger cc list, as you've requested.

Thanks, yes, on this one I think I do have an answer;
and we ought to bring Mel and Andrea in too.

> 
> Last weekend my server died again (processes stuck for 22/23s!), but this time I have more logs for you.
>  
> On my dual-Opteron machines I have non-standard settings:
> - DISABLED swap space (fast user process killing is better for my autotest farm than long disk swapping
>                        and 64GB ecc-ram is enough for my processing).
> - vm.overcommit_memory = 2,
> - vm.overcommit_ratio = 100.
> 
> After the initial BUG_ON (pasted below) there's a flood of 'rcu_sched_state detected stalls / CPU#X stuck for 22s!' messages.

Yes, those are just a tiresome consequence of exiting from a BUG
while holding the page table lock(s).

> (full compressed log is available at: http://pluto.agmk.net/kernel/kernel.bz2)
> 
> Oct  9 08:06:43 hal kernel: [408578.629070] ------------[ cut here ]------------
> Oct  9 08:06:43 hal kernel: [408578.629143] kernel BUG at include/linux/swapops.h:105!
> Oct  9 08:06:43 hal kernel: [408578.629143] invalid opcode: 0000 [#1] SMP
> Oct  9 08:06:43 hal kernel: [408578.629143] CPU 14
[ I'm deleting that irrelevant long line list of modules ]
> Oct  9 08:06:43 hal kernel: [408578.629143]
> Oct  9 08:06:43 hal kernel: [408578.629143] Pid: 29214, comm: bitgen Not tainted 3.0.4 #5 Supermicro H8DGU/H8DGU
> Oct  9 08:06:43 hal kernel: [408578.629143] RIP: 0010:[<ffffffff81127b76>]  [<ffffffff81127b76>] migration_entry_wait+0x156/0x160
> Oct  9 08:06:43 hal kernel: [408578.629143] RSP: 0000:ffff88021cee7d18  EFLAGS: 00010246
> Oct  9 08:06:43 hal kernel: [408578.629143] RAX: 0a00000000080068 RBX: ffffea001d1dbe70 RCX: ffff880c02d18978
> Oct  9 08:06:43 hal kernel: [408578.629143] RDX: 0000000000851a42 RSI: ffff880d8fe33618 RDI: ffffea002a09dd50
> Oct  9 08:06:43 hal kernel: [408578.629143] RBP: ffff88021cee7d38 R08: ffff880d8fe33618 R09: 0000000000000028
> Oct  9 08:06:43 hal kernel: [408578.629143] R10: ffff881006eb0f00 R11: f800000000851a42 R12: ffffea002a09dd40
> Oct  9 08:06:43 hal kernel: [408578.629143] R13: 0000000c02d18978 R14: 00000000d872f000 R15: ffff880d8fe33618
> Oct  9 08:06:43 hal kernel: [408578.629143] FS:  00007f864d7fa700(0000) GS:ffff880c1fc80000(0063) knlGS:00000000f432a910
> Oct  9 08:06:43 hal kernel: [408578.629143] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
> Oct  9 08:06:43 hal kernel: [408578.629143] CR2: 00000000d872f000 CR3: 000000100668c000 CR4: 00000000000006e0
> Oct  9 08:06:43 hal kernel: [408578.629143] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Oct  9 08:06:43 hal kernel: [408578.629143] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Oct  9 08:06:43 hal kernel: [408578.629143] Process bitgen (pid: 29214, threadinfo ffff88021cee6000, task ffff880407b06900)
> Oct  9 08:06:43 hal kernel: [408578.629143] Stack:
> Oct  9 08:06:43 hal kernel: [408578.629143]  00000000dcef4000 ffff880c00968c98 ffff880c02d18978 000000010a34843e
> Oct  9 08:06:43 hal kernel: [408578.629143]  ffff88021cee7de8 ffffffff811016a1 ffff88021cee7d78 ffff881006eb0f00
> Oct  9 08:06:43 hal kernel: [408578.629143]  ffff88021cee7d98 ffffffff810feee2 ffff880c06f0d170 8000000b98d14067
> Oct  9 08:06:43 hal kernel: [408578.629143] Call Trace:
> Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff811016a1>] handle_pte_fault+0xae1/0xaf0
> Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff810feee2>] ? __pte_alloc+0x42/0x120
> Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff8112c26b>] ? do_huge_pmd_anonymous_page+0xab/0x310
> Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff81102a31>] handle_mm_fault+0x181/0x310
> Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff81106097>] ? vma_adjust+0x537/0x570
> Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
> Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff81109a05>] ? do_mremap+0x2d5/0x570
> Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff81421d5f>] page_fault+0x1f/0x30
> Oct  9 08:06:43 hal kernel: [408578.629143] Code: 80 00 00 00 00 31 f6 48 89 df e8 e6 58 fb ff eb d7 85 c9 0f 84 44 ff ff ff 8d 51 01 89 c8 f0 0f b1 16 39 c1 90 74 b5 89 c1 eb e6 <0f> 0b 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 83 ec 20 48 85 ff
> Oct  9 08:06:43 hal kernel: [408578.629143] RIP  [<ffffffff81127b76>] migration_entry_wait+0x156/0x160
> Oct  9 08:06:43 hal kernel: [408578.629143]  RSP <ffff88021cee7d18>
> Oct  9 08:06:43 hal kernel: [408578.642823] ---[ end trace 0a37362301163711 ]---
> Oct  9 08:07:10 hal kernel: [408605.283257] BUG: soft lockup - CPU#12 stuck for 23s! [par:29801]
> Oct  9 08:07:10 hal kernel: [408605.285807] CPU 12
> Oct  9 08:07:10 hal kernel: [408605.285807]
> Oct  9 08:07:10 hal kernel: [408605.285807] Pid: 29801, comm: par Tainted: G      D     3.0.4 #5 Supermicro H8DGU/H8DGU
> Oct  9 08:07:10 hal kernel: [408605.285807] RIP: 0010:[<ffffffff814216a4>]  [<ffffffff814216a4>] _raw_spin_lock+0x14/0x20
> Oct  9 08:07:10 hal kernel: [408605.285807] RSP: 0018:ffff880c02def808  EFLAGS: 00000293
> Oct  9 08:07:10 hal kernel: [408605.285807] RAX: 0000000000000b09 RBX: ffffea002741f6b8 RCX: ffff880000000000
> Oct  9 08:07:10 hal kernel: [408605.285807] RDX: ffffea0000000000 RSI: 000000002a09dd40 RDI: ffffea002a09dd50
> Oct  9 08:07:10 hal kernel: [408605.285807] RBP: ffff880c02def808 R08: 0000000000000000 R09: ffff880f2e4f4d70
> Oct  9 08:07:10 hal kernel: [408605.285807] R10: ffff880c1fffbe00 R11: 0000000000000050 R12: ffffffff8142988e
> Oct  9 08:07:10 hal kernel: [408605.285807] R13: ffff880c02def7b8 R14: ffffffff810e63dc R15: ffff880c02def7b8
> Oct  9 08:07:10 hal kernel: [408605.285807] FS:  00007fe6b677c720(0000) GS:ffff880c1fc00000(0063) knlGS:00000000f6e0d910
> Oct  9 08:07:10 hal kernel: [408605.285807] CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
> Oct  9 08:07:10 hal kernel: [408605.285807] CR2: 00000000dd40012c CR3: 00000009f6b78000 CR4: 00000000000006e0
> Oct  9 08:07:10 hal kernel: [408605.285807] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Oct  9 08:07:10 hal kernel: [408605.285807] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Oct  9 08:07:10 hal kernel: [408605.285807] Process par (pid: 29801, threadinfo ffff880c02dee000, task ffff880c07c90700)
> Oct  9 08:07:10 hal kernel: [408605.285807] Stack:
> Oct  9 08:07:10 hal kernel: [408605.285807]  ffff880c02def858 ffffffff8110a2d7 ffffea001fc16fe0 ffff880c02d183a8
> Oct  9 08:07:10 hal kernel: [408605.285807]  ffff880c02def8c8 ffffea001fc16fa8 ffff880c00968c98 ffff881006eb0f00
> Oct  9 08:07:10 hal kernel: [408605.285807]  0000000000000301 00000000d8675000 ffff880c02def8c8 ffffffff8110aa6a
> Oct  9 08:07:10 hal kernel: [408605.285807] Call Trace:
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110a2d7>] __page_check_address+0x107/0x1a0
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110aa6a>] try_to_unmap_one+0x3a/0x420
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110be44>] try_to_unmap_anon+0xb4/0x130
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110bf75>] try_to_unmap+0x65/0x80
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff811285d0>] migrate_pages+0x310/0x4c0
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e93c2>] ? ____pagevec_lru_add+0x12/0x20
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111cbf0>] ? ftrace_define_fields_mm_compaction_isolate_template+0x70/0x70
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111d5da>] compact_zone+0x52a/0x8c0
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810f9919>] ? zone_statistics+0x99/0xc0
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111dade>] compact_zone_order+0x7e/0xb0
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e46a8>] ? get_page_from_freelist+0x3b8/0x7e0
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111dbcd>] try_to_compact_pages+0xbd/0xf0
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e5148>] __alloc_pages_direct_compact+0xa8/0x180
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e56c5>] __alloc_pages_nodemask+0x4a5/0x7f0
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e9698>] ? lru_cache_add_lru+0x28/0x50
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110a92d>] ? page_add_new_anon_rmap+0x9d/0xb0
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111b865>] alloc_pages_vma+0x95/0x180
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8112c2f8>] do_huge_pmd_anonymous_page+0x138/0x310
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff81102ace>] handle_mm_fault+0x21e/0x310
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff81001716>] ? __switch_to+0x1e6/0x2c0
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8141f178>] ? schedule+0x308/0xa10
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff811077a7>] ? do_mmap_pgoff+0x357/0x370
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110790d>] ? sys_mmap_pgoff+0x14d/0x220
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff81421d5f>] page_fault+0x1f/0x30
> Oct  9 08:07:10 hal kernel: [408605.285807] Code: 0f b6 c2 85 c0 0f 95 c0 0f b6 c0 5d c3 66 2e 0f 1f 84 00 00 00 00 00 55 b8 00 01 00 00 48 89 e5 f0 66 0f c1 07 38 e0 74 06 f3 90 <8a> 07 eb f6 5d c3 66 0f 1f 44 00 00 55 48 89 e5 9c 58 fa ba 00

I guess this is the only time you've seen this?  In which case, ideally
I would try to devise a testcase to demonstrate the issue below instead;
but that may involve more ingenuity than I can find time for; let's
see if people approve of this patch anyway (it applies to 3.1 or 3.0,
and earlier releases except that i_mmap_mutex used to be i_mmap_lock).


[PATCH] mm: add anon_vma locking to mremap move

I don't usually pay much attention to the stale "? " addresses in
stack backtraces, but this lucky report from Pawel Sikora hints that
mremap's move_ptes() has inadequate locking against page migration.

 3.0 BUG_ON(!PageLocked(p)) in migration_entry_to_page():
 kernel BUG at include/linux/swapops.h:105!
 RIP: 0010:[<ffffffff81127b76>]  [<ffffffff81127b76>]
                       migration_entry_wait+0x156/0x160
  [<ffffffff811016a1>] handle_pte_fault+0xae1/0xaf0
  [<ffffffff810feee2>] ? __pte_alloc+0x42/0x120
  [<ffffffff8112c26b>] ? do_huge_pmd_anonymous_page+0xab/0x310
  [<ffffffff81102a31>] handle_mm_fault+0x181/0x310
  [<ffffffff81106097>] ? vma_adjust+0x537/0x570
  [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
  [<ffffffff81109a05>] ? do_mremap+0x2d5/0x570
  [<ffffffff81421d5f>] page_fault+0x1f/0x30

mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
and pagetable locks, were good enough before page migration (with its
requirement that every migration entry be found) came in; and enough
while migration always held mmap_sem.  But not enough nowadays, when
there's memory hotremove and compaction: anon_vma lock is also needed,
to make sure a migration entry is not dodging around behind our back.

It appears that Mel's a8bef8ff6ea1 "mm: migration: avoid race between
shift_arg_pages() and rmap_walk() during migration by not migrating
temporary stacks" was actually a workaround for this in the special
common case of exec's use of move_pagetables(); and we should probably
now remove that VM_STACK_INCOMPLETE_SETUP stuff as a separate cleanup.

Reported-by: Pawel Sikora <pluto@agmk.net>
Cc: stable@kernel.org
Signed-off-by: Hugh Dickins <hughd@google.com>
---

 mm/mremap.c |    5 +++++
 1 file changed, 5 insertions(+)

--- 3.1-rc9/mm/mremap.c	2011-07-21 19:17:23.000000000 -0700
+++ linux/mm/mremap.c	2011-10-13 14:36:25.097780974 -0700
@@ -77,6 +77,7 @@ static void move_ptes(struct vm_area_str
 		unsigned long new_addr)
 {
 	struct address_space *mapping = NULL;
+	struct anon_vma *anon_vma = vma->anon_vma;
 	struct mm_struct *mm = vma->vm_mm;
 	pte_t *old_pte, *new_pte, pte;
 	spinlock_t *old_ptl, *new_ptl;
@@ -95,6 +96,8 @@ static void move_ptes(struct vm_area_str
 		mapping = vma->vm_file->f_mapping;
 		mutex_lock(&mapping->i_mmap_mutex);
 	}
+	if (anon_vma)
+		anon_vma_lock(anon_vma);
 
 	/*
 	 * We don't have to worry about the ordering of src and dst
@@ -121,6 +124,8 @@ static void move_ptes(struct vm_area_str
 		spin_unlock(new_ptl);
 	pte_unmap(new_pte - 1);
 	pte_unmap_unlock(old_pte - 1, old_ptl);
+	if (anon_vma)
+		anon_vma_unlock(anon_vma);
 	if (mapping)
 		mutex_unlock(&mapping->i_mmap_mutex);
 	mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end);


* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-13 23:30   ` Hugh Dickins
@ 2011-10-16 16:11     ` Christoph Hellwig
  2011-10-16 23:54     ` Andrea Arcangeli
  1 sibling, 0 replies; 72+ messages in thread
From: Christoph Hellwig @ 2011-10-16 16:11 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Pawel Sikora, Andrew Morton, Mel Gorman, Andrea Arcangeli,
	linux-mm, jpiszcz, arekm, linux-kernel, Anders Ossowicki

Btw, 

Anders Ossowicki reported a very similar soft lockup on 2.6.38 recently,
although without a BUG_ON beforehand.

Here is the pointer: https://lkml.org/lkml/2011/10/11/87


* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-13 23:16 ` Hugh Dickins
  2011-10-13 23:30   ` Hugh Dickins
@ 2011-10-16 22:37   ` Linus Torvalds
  2011-10-17  3:02     ` Hugh Dickins
  2011-10-18 19:17   ` Paweł Sikora
  2011-10-19  7:30   ` Mel Gorman
  3 siblings, 1 reply; 72+ messages in thread
From: Linus Torvalds @ 2011-10-16 22:37 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Pawel Sikora, Andrew Morton, Mel Gorman, Andrea Arcangeli,
	linux-mm, jpiszcz, arekm, linux-kernel

What's the status of this thing? Is it stable/3.1 material? Do we have
ack/nak's for it? Anybody?

                               Linus

On Thu, Oct 13, 2011 at 4:16 PM, Hugh Dickins <hughd@google.com> wrote:
>
> [PATCH] mm: add anon_vma locking to mremap move
>
> I don't usually pay much attention to the stale "? " addresses in
> stack backtraces, but this lucky report from Pawel Sikora hints that
> mremap's move_ptes() has inadequate locking against page migration.
>
>  3.0 BUG_ON(!PageLocked(p)) in migration_entry_to_page():
>  kernel BUG at include/linux/swapops.h:105!
>  RIP: 0010:[<ffffffff81127b76>]  [<ffffffff81127b76>]
>                       migration_entry_wait+0x156/0x160
>  [<ffffffff811016a1>] handle_pte_fault+0xae1/0xaf0
>  [<ffffffff810feee2>] ? __pte_alloc+0x42/0x120
>  [<ffffffff8112c26b>] ? do_huge_pmd_anonymous_page+0xab/0x310
>  [<ffffffff81102a31>] handle_mm_fault+0x181/0x310
>  [<ffffffff81106097>] ? vma_adjust+0x537/0x570
>  [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
>  [<ffffffff81109a05>] ? do_mremap+0x2d5/0x570
>  [<ffffffff81421d5f>] page_fault+0x1f/0x30
>
> mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
> and pagetable locks, were good enough before page migration (with its
> requirement that every migration entry be found) came in; and enough
> while migration always held mmap_sem.  But not enough nowadays, when
> there's memory hotremove and compaction: anon_vma lock is also needed,
> to make sure a migration entry is not dodging around behind our back.
>
> It appears that Mel's a8bef8ff6ea1 "mm: migration: avoid race between
> shift_arg_pages() and rmap_walk() during migration by not migrating
> temporary stacks" was actually a workaround for this in the special
> common case of exec's use of move_pagetables(); and we should probably
> now remove that VM_STACK_INCOMPLETE_SETUP stuff as a separate cleanup.
>
> Reported-by: Pawel Sikora <pluto@agmk.net>
> Cc: stable@kernel.org
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>
>  mm/mremap.c |    5 +++++
>  1 file changed, 5 insertions(+)
>
> --- 3.1-rc9/mm/mremap.c 2011-07-21 19:17:23.000000000 -0700
> +++ linux/mm/mremap.c   2011-10-13 14:36:25.097780974 -0700
> @@ -77,6 +77,7 @@ static void move_ptes(struct vm_area_str
>                unsigned long new_addr)
>  {
>        struct address_space *mapping = NULL;
> +       struct anon_vma *anon_vma = vma->anon_vma;
>        struct mm_struct *mm = vma->vm_mm;
>        pte_t *old_pte, *new_pte, pte;
>        spinlock_t *old_ptl, *new_ptl;
> @@ -95,6 +96,8 @@ static void move_ptes(struct vm_area_str
>                mapping = vma->vm_file->f_mapping;
>                mutex_lock(&mapping->i_mmap_mutex);
>        }
> +       if (anon_vma)
> +               anon_vma_lock(anon_vma);
>
>        /*
>         * We don't have to worry about the ordering of src and dst
> @@ -121,6 +124,8 @@ static void move_ptes(struct vm_area_str
>                spin_unlock(new_ptl);
>        pte_unmap(new_pte - 1);
>        pte_unmap_unlock(old_pte - 1, old_ptl);
> +       if (anon_vma)
> +               anon_vma_unlock(anon_vma);
>        if (mapping)
>                mutex_unlock(&mapping->i_mmap_mutex);
>        mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end);


* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-13 23:30   ` Hugh Dickins
  2011-10-16 16:11     ` Christoph Hellwig
@ 2011-10-16 23:54     ` Andrea Arcangeli
  2011-10-17 18:51       ` Hugh Dickins
  2011-10-20  9:11       ` Nai Xia
  1 sibling, 2 replies; 72+ messages in thread
From: Andrea Arcangeli @ 2011-10-16 23:54 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Pawel Sikora, Andrew Morton, Mel Gorman, linux-mm, jpiszcz,
	arekm, linux-kernel

On Thu, Oct 13, 2011 at 04:30:09PM -0700, Hugh Dickins wrote:
> mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
> and pagetable locks, were good enough before page migration (with its
> requirement that every migration entry be found) came in; and enough
> while migration always held mmap_sem.  But not enough nowadays, when
> there's memory hotremove and compaction: anon_vma lock is also needed,
> to make sure a migration entry is not dodging around behind our back.

For things like migrate and split_huge_page, the anon_vma layer must
guarantee the page is reachable by rmap walk at all times regardless
if it's at the old or new address.

This shall be guaranteed by the copy_vma called by move_vma well
before move_page_tables/move_ptes can run.

copy_vma obviously takes the anon_vma lock to insert the new "dst" vma
into the anon_vma chain structures (vma_link does that), before any
pte can be moved.

Because we keep two vmas mapped, one on the src range and one on the
dst range, each with a vma->vm_pgoff that is valid for the page (the
page doesn't change its page->index), the rmap walk should always find
_all_ of the page's ptes at any given time.
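
As a rough sketch of that invariant (simplified from vma_address() /
page_check_address(), not the exact kernel source), the rmap walk
derives, for each vma on the page's anon_vma chain, the address to
probe from the unchanged page->index and that vma's vm_pgoff:

	/*
	 * Sketch only: where would this anon page be mapped if it were
	 * mapped by this particular vma?
	 */
	static unsigned long sketch_vma_address(struct page *page,
						struct vm_area_struct *vma)
	{
		pgoff_t pgoff = page->index;	/* never changed by mremap */
		unsigned long address;

		address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
		if (address < vma->vm_start || address >= vma->vm_end)
			return -EFAULT;		/* page not covered by this vma */
		return address;
	}

Since copy_vma gives the dst vma a vm_pgoff chosen so that the same
page->index resolves to the new virtual address, probing the src vma
covers the old location and probing the dst vma covers the new one.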

There may be other variables at play, like whether the order of
insertion in the anon_vma chain matches our direction of copying and
removing the old ptes. But I think the double locking of the PT lock
should make the order in the anon_vma chain absolutely irrelevant (the
rmap_walk obviously takes the PT lock too), and furthermore the
anon_vma_chain insertion order is likely favorable anyway (the dst vma
is inserted last and checked last). But it shouldn't matter.

Another thing could be the copy_vma vma_merge branch succeeding
(returning non-NULL), but I doubt we risk falling into that one. For
the rmap_walk to keep working on both the src and dst ranges, the two
vmas' vm_pgoff values must be different, so we can't possibly be OK if
there's just 1 vma covering the whole range. I rule that case out
because the pgoff passed to copy_vma is different from the vma->vm_pgoff
of the vma given to copy_vma, so vma_merge can't possibly succeed.

Yet another point to investigate is the point where we tear down the
old vma and leave the new vma generated by copy_vma established.
That's apparently taken care of by do_munmap in move_vma, so that
should be safe too, as munmap is safe in the first place.

Overall I don't think this patch is needed and it seems a noop.

> It appears that Mel's a8bef8ff6ea1 "mm: migration: avoid race between
> shift_arg_pages() and rmap_walk() during migration by not migrating
> temporary stacks" was actually a workaround for this in the special
> common case of exec's use of move_pagetables(); and we should probably
> now remove that VM_STACK_INCOMPLETE_SETUP stuff as a separate cleanup.

I don't think this patch can help with that. The problem of execve vs
rmap_walk is that there's one single vma covering both the src and dst
virtual ranges while execve runs move_page_tables. So there is no
possible way that rmap_walk can be guaranteed to find _all_ ptes
mapping a page if there's just one vma mapping either the src or dst
range while move_page_tables runs. No additional locking whatsoever
can fix that bug, because we miss a vma (well, modulo locking that
prevents rmap_walk from running at all until we're finished with
execve, which is more or less what VM_STACK_INCOMPLETE_SETUP does...).

The only way to fix this is to prevent migrate (or any other rmap_walk
user that requires 100% reliability from the rmap layer; for example
swap doesn't require 100% reliability and can still run and gracefully
fail at finding the pte) from running while we're moving pagetables in
execve. And that's what Mel's above-mentioned patch does.
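
Conceptually (paraphrasing from memory, not the literal diff of
a8bef8ff6ea1), the migration-side rmap walk simply skips such a vma
while exec is still shuffling it:

	/* sketch only: inside the walk over the page's anon_vma chain */
	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
		struct vm_area_struct *vma = avc->vma;

		/*
		 * exec is moving this stack with a single vma covering
		 * both the old and new ranges, so a migration walk cannot
		 * be complete here; skip it instead of waiting on it.
		 */
		if ((vma->vm_flags & VM_STACK_INCOMPLETE_SETUP) ==
						VM_STACK_INCOMPLETE_SETUP)
			continue;
		/* ... normal try_to_unmap_one() processing ... */
	}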

The other way to fix that bug, which I implemented, was to do copy_vma
in execve, so that we still have both the src and dst ranges of
move_page_tables covered by 2 (not 1) vmas, each with the proper
vma->vm_pgoff; my approach fixed that bug as well, but it requires a
vma allocation in execve, so it was dropped in favor of Mel's patch,
which is totally fine as both approaches fix the bug equally well,
even if we now have to deal with this special case of rmap_walk
sometimes having false negatives while that vm_flags bit is set (the
important thing is that after VM_STACK_INCOMPLETE_SETUP has been
cleared it won't ever be set again for the whole lifetime of the vma).

I may be missing something; I've only done a short review so far, just
so the patch doesn't get merged if it's not needed. I mean, I think it
needs a bit more looking at... The fact that the i_mmap_mutex was taken
but the anon_vma lock was not (while in every other place both are
needed) certainly makes the patch look correct, but I think that's just
a misleading coincidence.


* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-16 22:37   ` Linus Torvalds
@ 2011-10-17  3:02     ` Hugh Dickins
  2011-10-17  3:09       ` Linus Torvalds
  0 siblings, 1 reply; 72+ messages in thread
From: Hugh Dickins @ 2011-10-17  3:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Pawel Sikora, Andrew Morton, Mel Gorman, Andrea Arcangeli,
	linux-mm, jpiszcz, arekm, linux-kernel

I've not read through and digested Andrea's reply yet, but I'd say
this is not something we need to rush into 3.1 at the last moment,
before it's been fully considered: the bug here is hard to hit,
ancient, made more likely in 2.6.35 by compaction and in 2.6.38 by
THP's reliance on compaction, but not a regression in 3.1 at all - let
it wait until stable.

Hugh

On Sun, Oct 16, 2011 at 3:37 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> What's the status of this thing? Is it stable/3.1 material? Do we have
> ack/nak's for it? Anybody?
>
>                               Linus
>
> On Thu, Oct 13, 2011 at 4:16 PM, Hugh Dickins <hughd@google.com> wrote:
>>
>> [PATCH] mm: add anon_vma locking to mremap move
>>
>> I don't usually pay much attention to the stale "? " addresses in
>> stack backtraces, but this lucky report from Pawel Sikora hints that
>> mremap's move_ptes() has inadequate locking against page migration.
>>
>>  3.0 BUG_ON(!PageLocked(p)) in migration_entry_to_page():
>>  kernel BUG at include/linux/swapops.h:105!
>>  RIP: 0010:[<ffffffff81127b76>]  [<ffffffff81127b76>]
>>                       migration_entry_wait+0x156/0x160
>>  [<ffffffff811016a1>] handle_pte_fault+0xae1/0xaf0
>>  [<ffffffff810feee2>] ? __pte_alloc+0x42/0x120
>>  [<ffffffff8112c26b>] ? do_huge_pmd_anonymous_page+0xab/0x310
>>  [<ffffffff81102a31>] handle_mm_fault+0x181/0x310
>>  [<ffffffff81106097>] ? vma_adjust+0x537/0x570
>>  [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
>>  [<ffffffff81109a05>] ? do_mremap+0x2d5/0x570
>>  [<ffffffff81421d5f>] page_fault+0x1f/0x30
>>
>> mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
>> and pagetable locks, were good enough before page migration (with its
>> requirement that every migration entry be found) came in; and enough
>> while migration always held mmap_sem.  But not enough nowadays, when
>> there's memory hotremove and compaction: anon_vma lock is also needed,
>> to make sure a migration entry is not dodging around behind our back.
>>
>> It appears that Mel's a8bef8ff6ea1 "mm: migration: avoid race between
>> shift_arg_pages() and rmap_walk() during migration by not migrating
>> temporary stacks" was actually a workaround for this in the special
>> common case of exec's use of move_pagetables(); and we should probably
>> now remove that VM_STACK_INCOMPLETE_SETUP stuff as a separate cleanup.
>>
>> Reported-by: Pawel Sikora <pluto@agmk.net>
>> Cc: stable@kernel.org
>> Signed-off-by: Hugh Dickins <hughd@google.com>
>> ---
>>
>>  mm/mremap.c |    5 +++++
>>  1 file changed, 5 insertions(+)
>>
>> --- 3.1-rc9/mm/mremap.c 2011-07-21 19:17:23.000000000 -0700
>> +++ linux/mm/mremap.c   2011-10-13 14:36:25.097780974 -0700
>> @@ -77,6 +77,7 @@ static void move_ptes(struct vm_area_str
>>                unsigned long new_addr)
>>  {
>>        struct address_space *mapping = NULL;
>> +       struct anon_vma *anon_vma = vma->anon_vma;
>>        struct mm_struct *mm = vma->vm_mm;
>>        pte_t *old_pte, *new_pte, pte;
>>        spinlock_t *old_ptl, *new_ptl;
>> @@ -95,6 +96,8 @@ static void move_ptes(struct vm_area_str
>>                mapping = vma->vm_file->f_mapping;
>>                mutex_lock(&mapping->i_mmap_mutex);
>>        }
>> +       if (anon_vma)
>> +               anon_vma_lock(anon_vma);
>>
>>        /*
>>         * We don't have to worry about the ordering of src and dst
>> @@ -121,6 +124,8 @@ static void move_ptes(struct vm_area_str
>>                spin_unlock(new_ptl);
>>        pte_unmap(new_pte - 1);
>>        pte_unmap_unlock(old_pte - 1, old_ptl);
>> +       if (anon_vma)
>> +               anon_vma_unlock(anon_vma);
>>        if (mapping)
>>                mutex_unlock(&mapping->i_mmap_mutex);
>>        mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end);
>


* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-17  3:02     ` Hugh Dickins
@ 2011-10-17  3:09       ` Linus Torvalds
  0 siblings, 0 replies; 72+ messages in thread
From: Linus Torvalds @ 2011-10-17  3:09 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Pawel Sikora, Andrew Morton, Mel Gorman, Andrea Arcangeli,
	linux-mm, jpiszcz, arekm, linux-kernel

On Sun, Oct 16, 2011 at 8:02 PM, Hugh Dickins <hughd@google.com> wrote:
> I've not read through and digested Andrea's reply yet, but I'd say
> this is not something we need to rush into 3.1 at the last moment,
> before it's been fully considered: the bug here is hard to hit,
> ancient, made more likely in 2.6.35 by compaction and in 2.6.38 by
> THP's reliance on compaction, but not a regression in 3.1 at all - let
> it wait until stable.

Ok, thanks. Just wanted to check.

                     Linus


* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-16 23:54     ` Andrea Arcangeli
@ 2011-10-17 18:51       ` Hugh Dickins
  2011-10-17 22:05         ` Andrea Arcangeli
  2011-10-19  7:43         ` Mel Gorman
  2011-10-20  9:11       ` Nai Xia
  1 sibling, 2 replies; 72+ messages in thread
From: Hugh Dickins @ 2011-10-17 18:51 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Pawel Sikora, Andrew Morton, Mel Gorman, Linus Torvalds,
	linux-mm, jpiszcz, arekm, linux-kernel

On Mon, 17 Oct 2011, Andrea Arcangeli wrote:
> On Thu, Oct 13, 2011 at 04:30:09PM -0700, Hugh Dickins wrote:
> > mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
> > and pagetable locks, were good enough before page migration (with its
> > requirement that every migration entry be found) came in; and enough
> > while migration always held mmap_sem.  But not enough nowadays, when
> > there's memory hotremove and compaction: anon_vma lock is also needed,
> > to make sure a migration entry is not dodging around behind our back.
> 
> For things like migrate and split_huge_page, the anon_vma layer must
> guarantee the page is reachable by rmap walk at all times regardless
> if it's at the old or new address.
> 
> This shall be guaranteed by the copy_vma called by move_vma well
> before move_page_tables/move_ptes can run.
> 
> copy_vma obviously takes the anon_vma lock to insert the new "dst" vma
> into the anon_vma chains structures (vma_link does that). That before
> any pte can be moved.
> 
> Because we keep two vmas mapped on both src and dst range, with
> different vma->vm_pgoff that is valid for the page (the page doesn't
> change its page->index) the page should always find _all_ its pte at
> any given time.
> 
> There may be other variables at play like the order of insertion in
> the anon_vma chain matches our direction of copy and removal of the
> old pte. But I think the double locking of the PT lock should make the
> order in the anon_vma chain absolutely irrelevant (the rmap_walk
> obviously takes the PT lock too), and furthermore likely the
> anon_vma_chain insertion is favorable (the dst vma is inserted last
> and checked last). But it shouldn't matter.

Thanks a lot for thinking it over.  I _almost_ agree with you, except
there's one aspect that I forgot to highlight in the patch comment:
remove_migration_pte() behaves as page_check_address() does by default,
it peeks to see if what it wants is there _before_ taking ptlock.

And therefore, I think, it is possible that during mremap move, the swap
pte is in neither of the locations it tries at the instant it peeks there.

We could put a stop to that: see plausible alternative patch below.
Though I have dithered from one to the other and back, I think on the
whole I still prefer the anon_vma locking in move_ptes(): we don't care
too deeply about the speed of mremap, but we do care about the speed of
exec, and this does add another lock/unlock there, but it will always
be uncontended; whereas the patch at the migration end could be adding
a contended and unnecessary lock.

Oh, I don't know which, you vote - if you now agree there is a problem.
I'll sign off the migrate.c one if you prefer it.  But no hurry.

> 
> Another thing could be the copy_vma vma_merge branch succeeding
> (returning not NULL) but I doubt we risk to fall into that one. For
> the rmap_walk to be always working on both the src and dst
> vma->vma_pgoff the pgoff must be different so we can't possibly be ok
> if there's just 1 vma covering the whole range. I exclude this could
> be the case because the pgoff passed to copy_vma is different than the
> vma->vm_pgoff given to copy_vma, so vma_merge can't possibly succeed.
> 
> Yet another point to investigate is the point where we teardown the
> old vma and we leave the new vma generated by copy_vma
> established. That's apparently taken care of by do_munmap in move_vma
> so that shall be safe too as munmap is safe in the first place.
> 
> Overall I don't think this patch is needed and it seems a noop.
> 
> > It appears that Mel's a8bef8ff6ea1 "mm: migration: avoid race between
> > shift_arg_pages() and rmap_walk() during migration by not migrating
> > temporary stacks" was actually a workaround for this in the special
> > common case of exec's use of move_pagetables(); and we should probably
> > now remove that VM_STACK_INCOMPLETE_SETUP stuff as a separate cleanup.
> 
> I don't think this patch can help with that, the problem of execve vs
> rmap_walk is that there's 1 single vma existing for src and dst
> virtual ranges while execve runs move_page_tables. So there is no
> possible way that rmap_walk will be guaranteed to find _all_ ptes
> mapping a page if there's just one vma mapping either the src or dst
> range while move_page_table runs. No addition of locking whatsoever
> can fix that bug because we miss a vma (well modulo locking that
> prevents rmap_walk to run at all, until we're finished with execve,
> which is more or less what VM_STACK_INCOMPLETE_SETUP does...).
> 
> The only way is to fix this is prevent migrate (or any other rmap_walk
> user that requires 100% reliability from the rmap layer, for example
> swap doesn't require 100% reliability and can still run and gracefully
> fail at finding the pte) while we're moving pagetables in execve. And
> that's what Mel's above mentioned patch does.

Thanks for explaining, yes, you're right.

> 
> The other way to fix that bug that I implemented was to do copy_vma in
> execve, so that we still have both src and dst ranges of
> move_page_tables covered by 2 (not 1) vma, each with the proper
> vma->vm_pgoff, so my approach fixed that bug as well (but requires a
> vma allocation in execve so it was dropped in favor of Mel's patch
> which is totally fine with as both approaches fixes the bug equally
> well, even if now we've to deal with this special case of sometime
> rmap_walk having false negatives if the vma_flags is set, and the
> important thing is that after VM_STACK_INCOMPLETE_SETUP has been
> cleared it won't ever be set again for the whole lifetime of the vma).

I think your two-vmas approach is more aesthetically pleasing (and
matches mremap), but can see that Mel's vmaflag hack^Htechnique ends up
more economical.  It is a bit sad that we lose that all-pages-swappable
condition for unlimited args, for a brief moment, but I think no memory
allocations are made in that interval, so I guess it's fine.

Hugh

> 
> I may be missing something, I did a short review so far, just so the
> patch doesn't get merged if not needed. I mean I think it needs a bit
> more looks on it... The fact the i_mmap_mutex was taken but the
> anon_vma lock was not taken (while in every other place they both are
> needed) certainly makes the patch look correct, but that's just a
> misleading coincidence I think.
> 

--- 3.1-rc9/mm/migrate.c	2011-07-21 19:17:23.000000000 -0700
+++ linux/mm/migrate.c	2011-10-17 11:21:48.923826334 -0700
@@ -119,12 +119,6 @@ static int remove_migration_pte(struct p
 			goto out;
 
 		ptep = pte_offset_map(pmd, addr);
-
-		if (!is_swap_pte(*ptep)) {
-			pte_unmap(ptep);
-			goto out;
-		}
-
 		ptl = pte_lockptr(mm, pmd);
 	}
 


* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-17 18:51       ` Hugh Dickins
@ 2011-10-17 22:05         ` Andrea Arcangeli
  2011-10-19  7:43         ` Mel Gorman
  1 sibling, 0 replies; 72+ messages in thread
From: Andrea Arcangeli @ 2011-10-17 22:05 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Pawel Sikora, Andrew Morton, Mel Gorman, Linus Torvalds,
	linux-mm, jpiszcz, arekm, linux-kernel

On Mon, Oct 17, 2011 at 11:51:00AM -0700, Hugh Dickins wrote:
> Thanks a lot for thinking it over.  I _almost_ agree with you, except
> there's one aspect that I forgot to highlight in the patch comment:
> remove_migration_pte() behaves as page_check_address() does by default,
> it peeks to see if what it wants is there _before_ taking ptlock.
> 
> And therefore, I think, it is possible that during mremap move, the swap
> pte is in neither of the locations it tries at the instant it peeks there.

I see what you mean, I didn't realize you were fixing that race.
During mremap, for a few CPU cycles (which may expand if interrupted
by an irq), the migration entry only lives in the kernel stack of the
process doing mremap. So the rmap_walk may just loop quickly and
locklessly, not see it, and return, while mremap holds both PT locks
(src and dst pte).
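
[ To make the window concrete, here is roughly the core of the
pte-moving loop in 3.0-era move_ptes(), abridged from memory rather
than quoted exactly; between the clear and the set the entry exists
only in the local 'pte' variable, while both page table locks are held: ]

	for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
				   new_pte++, new_addr += PAGE_SIZE) {
		if (pte_none(*old_pte))
			continue;
		pte = ptep_clear_flush(vma, old_addr, old_pte);
		/* a migration entry is now visible only in 'pte' */
		pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
		set_pte_at(mm, new_addr, new_pte, pte);
	}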

Now, getting an irq at exactly that point of the migrate cycle doesn't
sound too likely, but we still must fix this race.

Maybe whoever needs 100% reliability should not go lockless, looping
all over the vmas without taking the PT lock: it's the PT lock that
serializes against the pte "moving" functions, which normally do, in
order, ptep_clear_flush(src_ptep); set_pte_at(dst_ptep).
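
[ As an illustration of the race, not actual code (old_ptep/new_ptep
are just shorthand here): because the lockless peek on the rmap side
never tries to take either PT lock, both probes can land inside that
clear/set window: ]

	mremap CPU (move_ptes, both PT locks held):
		1. pte = ptep_clear_flush(old_ptep)
	migration CPU (rmap_walk, peeking without the PT locks):
		2. remove_migration_pte() on the src vma: sees pte_none, skips
		3. remove_migration_pte() on the dst vma: sees pte_none, skips
	mremap CPU:
		4. set_pte_at(new_ptep, pte)

	The migration entry is never found and gets left behind, which is
	what later trips the BUG in migration_entry_wait().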

For example I never thought of optimizing __split_huge_page_splitting:
that one must be reliable, so I never felt it could be safe to go
lockless there.

So I think it's better to fix migrate, as there may be other places
like mremap. Whoever can't afford failure should do the PT locking.

But maybe it's possible to find good reasons to fix the race in the
other way too.

> We could put a stop to that: see plausible alternative patch below.
> Though I have dithered from one to the other and back, I think on the
> whole I still prefer the anon_vma locking in move_ptes(): we don't care
> too deeply about the speed of mremap, but we do care about the speed of
> exec, and this does add another lock/unlock there, but it will always
> be uncontended; whereas the patch at the migration end could be adding
> a contended and unnecessary lock.
> 
> Oh, I don't know which, you vote - if you now agree there is a problem.
> I'll sign off the migrate.c one if you prefer it.  But no hurry.

Adding more locking in migrate rather than in the mremap fast path
should be better performance-wise. Java GC uses mremap. migrate is
somewhat less performance critical, but I guess there may be other
workloads where migrate runs more often than mremap. It also depends
on the false positive ratio of rmap_walk: if that's normally low, the
patch to migrate may actually amount to an optimization, while the
mremap patch can't possibly speed anything up.

In short I'm slightly more inclined to prefer the fix to migrate, and
to enforce that all rmap walkers which can't afford failure do not go
lockless and speculative on the ptes, but take the lock before checking
whether the pte they're searching for is there.
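
[ For reference, roughly how the tail of remove_migration_pte() ends up
looking with the migrate.c change quoted above, paraphrased from memory
of the 3.0-era source rather than quoted exactly: the is_swap_pte()
test is only made once the page table lock is held, which serializes
the check against mremap's pte move: ]

	ptep = pte_offset_map(pmd, addr);
	ptl = pte_lockptr(mm, pmd);
	spin_lock(ptl);
	pte = *ptep;
	if (!is_swap_pte(pte))
		goto unlock;
	entry = pte_to_swp_entry(pte);
	if (!is_migration_entry(entry) || migration_entry_to_page(entry) != old)
		goto unlock;
	/* ... install the real pte for the new page and fix up its rmap ... */
unlock:
	pte_unmap_unlock(ptep, ptl);
out:
	return SWAP_AGAIN;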


* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-13 23:16 ` Hugh Dickins
  2011-10-13 23:30   ` Hugh Dickins
  2011-10-16 22:37   ` Linus Torvalds
@ 2011-10-18 19:17   ` Paweł Sikora
  2011-10-19  7:30   ` Mel Gorman
  3 siblings, 0 replies; 72+ messages in thread
From: Paweł Sikora @ 2011-10-18 19:17 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Mel Gorman, Andrea Arcangeli, linux-mm, jpiszcz,
	arekm, linux-kernel

On Friday 14 of October 2011 01:16:01 Hugh Dickins wrote:
> [ Subject refers to a different, unexplained 3.0 bug from Pawel ]
> 
> On Wed, 12 Oct 2011, Pawel Sikora wrote:
> 
> > Hi Hugh,
> > i'm resending previous private email with larger cc list as you've requested.
> 
> Thanks, yes, on this one I think I do have an answer;
> and we ought to bring Mel and Andrea in too.
> 
> > 
> > in the last weekend my server died again (processes stuck for 22/23s!) but this time i have more logs for you.
> >  
> > on my dual-opteron machines i have non-standard settings:
> > - DISABLED swap space (fast user process killing is better for my autotest farm than long disk swapping
> >                        and 64GB ecc-ram is enough for my processing).
> > - vm.overcommit_memory = 2,
> > - vm.overcommit_ratio = 100.
> > 
> > after initial BUG_ON (pasted below) there's a flood of 'rcu_sched_state detected stalls / CPU#X stuck for 22s!'
> 
> Yes, those are just a tiresome consequence of exiting from a BUG
> while holding the page table lock(s).
> 
> > (full compressed log is available at: http://pluto.agmk.net/kernel/kernel.bz2)
> > 
> > Oct  9 08:06:43 hal kernel: [408578.629070] ------------[ cut here ]------------
> > Oct  9 08:06:43 hal kernel: [408578.629143] kernel BUG at include/linux/swapops.h:105!
> > Oct  9 08:06:43 hal kernel: [408578.629143] invalid opcode: 0000 [#1] SMP
> > Oct  9 08:06:43 hal kernel: [408578.629143] CPU 14
> [ I'm deleting that irrelevant long line list of modules ]
> > Oct  9 08:06:43 hal kernel: [408578.629143]
> > Oct  9 08:06:43 hal kernel: [408578.629143] Pid: 29214, comm: bitgen Not tainted 3.0.4 #5 Supermicro H8DGU/H8DGU
> > Oct  9 08:06:43 hal kernel: [408578.629143] RIP: 0010:[<ffffffff81127b76>]  [<ffffffff81127b76>] migration_entry_wait+0x156/0x160
> > Oct  9 08:06:43 hal kernel: [408578.629143] RSP: 0000:ffff88021cee7d18  EFLAGS: 00010246
> > Oct  9 08:06:43 hal kernel: [408578.629143] RAX: 0a00000000080068 RBX: ffffea001d1dbe70 RCX: ffff880c02d18978
> > Oct  9 08:06:43 hal kernel: [408578.629143] RDX: 0000000000851a42 RSI: ffff880d8fe33618 RDI: ffffea002a09dd50
> > Oct  9 08:06:43 hal kernel: [408578.629143] RBP: ffff88021cee7d38 R08: ffff880d8fe33618 R09: 0000000000000028
> > Oct  9 08:06:43 hal kernel: [408578.629143] R10: ffff881006eb0f00 R11: f800000000851a42 R12: ffffea002a09dd40
> > Oct  9 08:06:43 hal kernel: [408578.629143] R13: 0000000c02d18978 R14: 00000000d872f000 R15: ffff880d8fe33618
> > Oct  9 08:06:43 hal kernel: [408578.629143] FS:  00007f864d7fa700(0000) GS:ffff880c1fc80000(0063) knlGS:00000000f432a910
> > Oct  9 08:06:43 hal kernel: [408578.629143] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
> > Oct  9 08:06:43 hal kernel: [408578.629143] CR2: 00000000d872f000 CR3: 000000100668c000 CR4: 00000000000006e0
> > Oct  9 08:06:43 hal kernel: [408578.629143] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > Oct  9 08:06:43 hal kernel: [408578.629143] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > Oct  9 08:06:43 hal kernel: [408578.629143] Process bitgen (pid: 29214, threadinfo ffff88021cee6000, task ffff880407b06900)
> > Oct  9 08:06:43 hal kernel: [408578.629143] Stack:
> > Oct  9 08:06:43 hal kernel: [408578.629143]  00000000dcef4000 ffff880c00968c98 ffff880c02d18978 000000010a34843e
> > Oct  9 08:06:43 hal kernel: [408578.629143]  ffff88021cee7de8 ffffffff811016a1 ffff88021cee7d78 ffff881006eb0f00
> > Oct  9 08:06:43 hal kernel: [408578.629143]  ffff88021cee7d98 ffffffff810feee2 ffff880c06f0d170 8000000b98d14067
> > Oct  9 08:06:43 hal kernel: [408578.629143] Call Trace:
> > Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff811016a1>] handle_pte_fault+0xae1/0xaf0
> > Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff810feee2>] ? __pte_alloc+0x42/0x120
> > Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff8112c26b>] ? do_huge_pmd_anonymous_page+0xab/0x310
> > Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff81102a31>] handle_mm_fault+0x181/0x310
> > Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff81106097>] ? vma_adjust+0x537/0x570
> > Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
> > Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff81109a05>] ? do_mremap+0x2d5/0x570
> > Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff81421d5f>] page_fault+0x1f/0x30
> > Oct  9 08:06:43 hal kernel: [408578.629143] Code: 80 00 00 00 00 31 f6 48 89 df e8 e6 58 fb ff eb d7 85 c9 0f 84 44 ff ff ff 8d 51 01 89 c8 f0 0f b1 16 39 c1 90 74 b5 89 c1 eb e6 <0f> 0b 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 83 ec 20 48 85 ff
> > Oct  9 08:06:43 hal kernel: [408578.629143] RIP  [<ffffffff81127b76>] migration_entry_wait+0x156/0x160
> > Oct  9 08:06:43 hal kernel: [408578.629143]  RSP <ffff88021cee7d18>
> > Oct  9 08:06:43 hal kernel: [408578.642823] ---[ end trace 0a37362301163711 ]---
> > Oct  9 08:07:10 hal kernel: [408605.283257] BUG: soft lockup - CPU#12 stuck for 23s! [par:29801]
> > Oct  9 08:07:10 hal kernel: [408605.285807] CPU 12
> > Oct  9 08:07:10 hal kernel: [408605.285807]
> > Oct  9 08:07:10 hal kernel: [408605.285807] Pid: 29801, comm: par Tainted: G      D     3.0.4 #5 Supermicro H8DGU/H8DGU
> > Oct  9 08:07:10 hal kernel: [408605.285807] RIP: 0010:[<ffffffff814216a4>]  [<ffffffff814216a4>] _raw_spin_lock+0x14/0x20
> > Oct  9 08:07:10 hal kernel: [408605.285807] RSP: 0018:ffff880c02def808  EFLAGS: 00000293
> > Oct  9 08:07:10 hal kernel: [408605.285807] RAX: 0000000000000b09 RBX: ffffea002741f6b8 RCX: ffff880000000000
> > Oct  9 08:07:10 hal kernel: [408605.285807] RDX: ffffea0000000000 RSI: 000000002a09dd40 RDI: ffffea002a09dd50
> > Oct  9 08:07:10 hal kernel: [408605.285807] RBP: ffff880c02def808 R08: 0000000000000000 R09: ffff880f2e4f4d70
> > Oct  9 08:07:10 hal kernel: [408605.285807] R10: ffff880c1fffbe00 R11: 0000000000000050 R12: ffffffff8142988e
> > Oct  9 08:07:10 hal kernel: [408605.285807] R13: ffff880c02def7b8 R14: ffffffff810e63dc R15: ffff880c02def7b8
> > Oct  9 08:07:10 hal kernel: [408605.285807] FS:  00007fe6b677c720(0000) GS:ffff880c1fc00000(0063) knlGS:00000000f6e0d910
> > Oct  9 08:07:10 hal kernel: [408605.285807] CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
> > Oct  9 08:07:10 hal kernel: [408605.285807] CR2: 00000000dd40012c CR3: 00000009f6b78000 CR4: 00000000000006e0
> > Oct  9 08:07:10 hal kernel: [408605.285807] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > Oct  9 08:07:10 hal kernel: [408605.285807] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > Oct  9 08:07:10 hal kernel: [408605.285807] Process par (pid: 29801, threadinfo ffff880c02dee000, task ffff880c07c90700)
> > Oct  9 08:07:10 hal kernel: [408605.285807] Stack:
> > Oct  9 08:07:10 hal kernel: [408605.285807]  ffff880c02def858 ffffffff8110a2d7 ffffea001fc16fe0 ffff880c02d183a8
> > Oct  9 08:07:10 hal kernel: [408605.285807]  ffff880c02def8c8 ffffea001fc16fa8 ffff880c00968c98 ffff881006eb0f00
> > Oct  9 08:07:10 hal kernel: [408605.285807]  0000000000000301 00000000d8675000 ffff880c02def8c8 ffffffff8110aa6a
> > Oct  9 08:07:10 hal kernel: [408605.285807] Call Trace:
> > Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110a2d7>] __page_check_address+0x107/0x1a0
> > Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110aa6a>] try_to_unmap_one+0x3a/0x420
> > Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110be44>] try_to_unmap_anon+0xb4/0x130
> > Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110bf75>] try_to_unmap+0x65/0x80
> > Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff811285d0>] migrate_pages+0x310/0x4c0
> > Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e93c2>] ? ____pagevec_lru_add+0x12/0x20
> > Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111cbf0>] ? ftrace_define_fields_mm_compaction_isolate_template+0x70/0x70
> > Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111d5da>] compact_zone+0x52a/0x8c0
> > Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810f9919>] ? zone_statistics+0x99/0xc0
> > Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111dade>] compact_zone_order+0x7e/0xb0
> > Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e46a8>] ? get_page_from_freelist+0x3b8/0x7e0
> > Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111dbcd>] try_to_compact_pages+0xbd/0xf0
> > Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e5148>] __alloc_pages_direct_compact+0xa8/0x180
> > Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e56c5>] __alloc_pages_nodemask+0x4a5/0x7f0
> > Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e9698>] ? lru_cache_add_lru+0x28/0x50
> > Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110a92d>] ? page_add_new_anon_rmap+0x9d/0xb0
> > Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111b865>] alloc_pages_vma+0x95/0x180
> > Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8112c2f8>] do_huge_pmd_anonymous_page+0x138/0x310
> > Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff81102ace>] handle_mm_fault+0x21e/0x310
> > Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff81001716>] ? __switch_to+0x1e6/0x2c0
> > Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
> > Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8141f178>] ? schedule+0x308/0xa10
> > Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff811077a7>] ? do_mmap_pgoff+0x357/0x370
> > Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110790d>] ? sys_mmap_pgoff+0x14d/0x220
> > Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff81421d5f>] page_fault+0x1f/0x30
> > Oct  9 08:07:10 hal kernel: [408605.285807] Code: 0f b6 c2 85 c0 0f 95 c0 0f b6 c0 5d c3 66 2e 0f 1f 84 00 00 00 00 00 55 b8 00 01 00 00 48 89 e5 f0 66 0f c1 07 38 e0 74 06 f3 90 <8a> 07 eb f6 5d c3 66 0f 1f 44 00 00 55 48 89 e5 9c 58 fa ba 00
> 
> I guess this is the only time you've seen this?  In which case, ideally
> I would try to devise a testcase to demonstrate the issue below instead;
> but that may involve more ingenuity than I can find time for, let's see
> see if people approve of this patch anyway (it applies to 3.1 or 3.0,
> and earlier releases except that i_mmap_mutex used to be i_mmap_lock).
> 
> 
> [PATCH] mm: add anon_vma locking to mremap move
> 
> I don't usually pay much attention to the stale "? " addresses in
> stack backtraces, but this lucky report from Pawel Sikora hints that
> mremap's move_ptes() has inadequate locking against page migration.
> 
>  3.0 BUG_ON(!PageLocked(p)) in migration_entry_to_page():
>  kernel BUG at include/linux/swapops.h:105!
>  RIP: 0010:[<ffffffff81127b76>]  [<ffffffff81127b76>]
>                        migration_entry_wait+0x156/0x160
>   [<ffffffff811016a1>] handle_pte_fault+0xae1/0xaf0
>   [<ffffffff810feee2>] ? __pte_alloc+0x42/0x120
>   [<ffffffff8112c26b>] ? do_huge_pmd_anonymous_page+0xab/0x310
>   [<ffffffff81102a31>] handle_mm_fault+0x181/0x310
>   [<ffffffff81106097>] ? vma_adjust+0x537/0x570
>   [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
>   [<ffffffff81109a05>] ? do_mremap+0x2d5/0x570
>   [<ffffffff81421d5f>] page_fault+0x1f/0x30
> 
> mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
> and pagetable locks, were good enough before page migration (with its
> requirement that every migration entry be found) came in; and enough
> while migration always held mmap_sem.  But not enough nowadays, when
> there's memory hotremove and compaction: anon_vma lock is also needed,
> to make sure a migration entry is not dodging around behind our back.
> 
> It appears that Mel's a8bef8ff6ea1 "mm: migration: avoid race between
> shift_arg_pages() and rmap_walk() during migration by not migrating
> temporary stacks" was actually a workaround for this in the special
> common case of exec's use of move_pagetables(); and we should probably
> now remove that VM_STACK_INCOMPLETE_SETUP stuff as a separate cleanup.
> 
> Reported-by: Pawel Sikora <pluto@agmk.net>
> Cc: stable@kernel.org
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
> 
>  mm/mremap.c |    5 +++++
>  1 file changed, 5 insertions(+)
> 
> --- 3.1-rc9/mm/mremap.c	2011-07-21 19:17:23.000000000 -0700
> +++ linux/mm/mremap.c	2011-10-13 14:36:25.097780974 -0700
> @@ -77,6 +77,7 @@ static void move_ptes(struct vm_area_str
>  		unsigned long new_addr)
>  {
>  	struct address_space *mapping = NULL;
> +	struct anon_vma *anon_vma = vma->anon_vma;
>  	struct mm_struct *mm = vma->vm_mm;
>  	pte_t *old_pte, *new_pte, pte;
>  	spinlock_t *old_ptl, *new_ptl;
> @@ -95,6 +96,8 @@ static void move_ptes(struct vm_area_str
>  		mapping = vma->vm_file->f_mapping;
>  		mutex_lock(&mapping->i_mmap_mutex);
>  	}
> +	if (anon_vma)
> +		anon_vma_lock(anon_vma);
>  
>  	/*
>  	 * We don't have to worry about the ordering of src and dst
> @@ -121,6 +124,8 @@ static void move_ptes(struct vm_area_str
>  		spin_unlock(new_ptl);
>  	pte_unmap(new_pte - 1);
>  	pte_unmap_unlock(old_pte - 1, old_ptl);
> +	if (anon_vma)
> +		anon_vma_unlock(anon_vma);
>  	if (mapping)
>  		mutex_unlock(&mapping->i_mmap_mutex);
>  	mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end);
> 

Hi,

1).
with this patch applied to the vanilla 3.0.6 kernel my opterons have been working stably for ~4 days so far.
nice :)

2).
with this patch i can't reproduce the soft-lockup described at https://lkml.org/lkml/2011/8/30/112
nice :)

3).
now i've started more tests with this patch + 3.0.4 + vserver 2.3.1 to check the possibly related lockups
described on the vserver mailing list http://list.linux-vserver.org/archive?mss:5264:201108:odomikkjgoemcaomgidl
and in the lkml archive https://lkml.org/lkml/2011/5/23/398

1h of uptime and still going...


* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-13 23:16 ` Hugh Dickins
                     ` (2 preceding siblings ...)
  2011-10-18 19:17   ` Paweł Sikora
@ 2011-10-19  7:30   ` Mel Gorman
  2011-10-21 12:44     ` Mel Gorman
  3 siblings, 1 reply; 72+ messages in thread
From: Mel Gorman @ 2011-10-19  7:30 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Pawel Sikora, Andrew Morton, Andrea Arcangeli, linux-mm, jpiszcz,
	arekm, linux-kernel

On Thu, Oct 13, 2011 at 04:16:01PM -0700, Hugh Dickins wrote:
> <SNIP>
> 
> I guess this is the only time you've seen this?  In which case, ideally
> I would try to devise a testcase to demonstrate the issue below instead;

Considering that mremap workloads have been tested fairly heavily and
this hasn't triggered before (or at least not reported), I would not be
confident it can be easily reproduced. Maybe reproducing is easier if
interrupts are also high.

> but that may involve more ingenuity than I can find time for, let's see
> see if people approve of this patch anyway (it applies to 3.1 or 3.0,
> and earlier releases except that i_mmap_mutex used to be i_mmap_lock).
> 
> 
> [PATCH] mm: add anon_vma locking to mremap move
> 
> I don't usually pay much attention to the stale "? " addresses in
> stack backtraces, but this lucky report from Pawel Sikora hints that
> mremap's move_ptes() has inadequate locking against page migration.
> 
>  3.0 BUG_ON(!PageLocked(p)) in migration_entry_to_page():
>  kernel BUG at include/linux/swapops.h:105!

This check is triggered if migration PTEs are left behind. In the few
cases I saw this during compaction development, it was because a VMA was
unreachable during remove_migration_pte. With the anon_vma changes, the
locking during VMA insertion is meant to protect it and the order VMAs
are linked is important so the right anon_vma lock is found.

I don't think it is an unreachable VMA problem because if it was, the
problem would trigger much more frequently and not be exclusive to
mremap.

>  RIP: 0010:[<ffffffff81127b76>]  [<ffffffff81127b76>]
>                        migration_entry_wait+0x156/0x160
>   [<ffffffff811016a1>] handle_pte_fault+0xae1/0xaf0
>   [<ffffffff810feee2>] ? __pte_alloc+0x42/0x120
>   [<ffffffff8112c26b>] ? do_huge_pmd_anonymous_page+0xab/0x310
>   [<ffffffff81102a31>] handle_mm_fault+0x181/0x310
>   [<ffffffff81106097>] ? vma_adjust+0x537/0x570
>   [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
>   [<ffffffff81109a05>] ? do_mremap+0x2d5/0x570
>   [<ffffffff81421d5f>] page_fault+0x1f/0x30
> 
> mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
> and pagetable locks, were good enough before page migration (with its
> requirement that every migration entry be found) came in; and enough
> while migration always held mmap_sem.  But not enough nowadays, when
> there's memory hotremove and compaction: anon_vma lock is also needed,
> to make sure a migration entry is not dodging around behind our back.
> 

migration holds the anon_vma lock while it unmaps the pages and keeps holding
it until after remove_migration_ptes is called.  There are two anon vmas
that should exist during mremap that were created for the move. They
should not be able to disappear while migration runs and right now, I'm
not seeing how the VMA can get lost :(

I think a consequence of this patch is that migration and mremap are now
serialised by anon_vma lock. As a result, it might still fix the problem
if there is some race between mremap and migration simply by stopping
them playing with each other.

> It appears that Mel's a8bef8ff6ea1 "mm: migration: avoid race between
> shift_arg_pages() and rmap_walk() during migration by not migrating
> temporary stacks" was actually a workaround for this in the special
> common case of exec's use of move_pagetables(); and we should probably
> now remove that VM_STACK_INCOMPLETE_SETUP stuff as a separate cleanup.
> 

The problem was that there was only one VMA for two page table
ranges. The neater fix was to create a second VMA but that required a
kmalloc and additional VMA work during exec which was considered too
heavy. VM_STACK_INCOMPLETE_SETUP is less clean but it is faster.


* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-17 18:51       ` Hugh Dickins
  2011-10-17 22:05         ` Andrea Arcangeli
@ 2011-10-19  7:43         ` Mel Gorman
  2011-10-19 13:39           ` Linus Torvalds
  1 sibling, 1 reply; 72+ messages in thread
From: Mel Gorman @ 2011-10-19  7:43 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Pawel Sikora, Andrew Morton, Linus Torvalds,
	linux-mm, jpiszcz, arekm, linux-kernel

On Mon, Oct 17, 2011 at 11:51:00AM -0700, Hugh Dickins wrote:
> On Mon, 17 Oct 2011, Andrea Arcangeli wrote:
> > On Thu, Oct 13, 2011 at 04:30:09PM -0700, Hugh Dickins wrote:
> > > mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
> > > and pagetable locks, were good enough before page migration (with its
> > > requirement that every migration entry be found) came in; and enough
> > > while migration always held mmap_sem.  But not enough nowadays, when
> > > there's memory hotremove and compaction: anon_vma lock is also needed,
> > > to make sure a migration entry is not dodging around behind our back.
> > 
> > For things like migrate and split_huge_page, the anon_vma layer must
> > guarantee the page is reachable by rmap walk at all times regardless
> > if it's at the old or new address.
> > 
> > This shall be guaranteed by the copy_vma called by move_vma well
> > before move_page_tables/move_ptes can run.
> > 
> > copy_vma obviously takes the anon_vma lock to insert the new "dst" vma
> > into the anon_vma chains structures (vma_link does that). That before
> > any pte can be moved.
> > 
> > Because we keep two vmas mapped on both src and dst range, with
> > different vma->vm_pgoff that is valid for the page (the page doesn't
> > change its page->index) the page should always find _all_ its pte at
> > any given time.
> > 
> > There may be other variables at play like the order of insertion in
> > the anon_vma chain matches our direction of copy and removal of the
> > old pte. But I think the double locking of the PT lock should make the
> > order in the anon_vma chain absolutely irrelevant (the rmap_walk
> > obviously takes the PT lock too), and furthermore likely the
> > anon_vma_chain insertion is favorable (the dst vma is inserted last
> > and checked last). But it shouldn't matter.
> 
> Thanks a lot for thinking it over.  I _almost_ agree with you, except
> there's one aspect that I forgot to highlight in the patch comment:
> remove_migration_pte() behaves as page_check_address() does by default,
> it peeks to see if what it wants is there _before_ taking ptlock.
> 
> And therefore, I think, it is possible that during mremap move, the swap
> pte is in neither of the locations it tries at the instant it peeks there.
> 

I should have read the rest of the thread before responding :/ .

This makes more sense and is a relief in a sense. There is nothing known
wrong with the VMA locking or ordering. The correct PTE is found but it is
in the wrong state.

> We could put a stop to that: see plausible alternative patch below.
> Though I have dithered from one to the other and back, I think on the
> whole I still prefer the anon_vma locking in move_ptes(): we don't care
> too deeply about the speed of mremap, but we do care about the speed of

I still think the anon_vma lock serialises mremap and migration. If that
is correct, it could cause things like huge page collapsing to stall mremap
operations. That might cause slowdowns in JVMs during GC, which is undesirable.

> exec, and this does add another lock/unlock there, but it will always
> be uncontended; whereas the patch at the migration end could be adding
> a contended and unnecessary lock.
> 
> Oh, I don't know which, you vote - if you now agree there is a problem.
> I'll sign off the migrate.c one if you prefer it.  But no hurry.
> 

My vote is with the migration change. While there are occasionally
patches to make migration go faster, I don't consider it a hot path.
mremap may be used intensively by JVMs so I'd loathe to hurt it.

Thanks Hugh.

-- 
Mel Gorman
SUSE Labs


* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-19  7:43         ` Mel Gorman
@ 2011-10-19 13:39           ` Linus Torvalds
  2011-10-19 19:42             ` Hugh Dickins
  0 siblings, 1 reply; 72+ messages in thread
From: Linus Torvalds @ 2011-10-19 13:39 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Hugh Dickins, Andrea Arcangeli, Pawel Sikora, Andrew Morton,
	linux-mm, jpiszcz, arekm, linux-kernel

On Wed, Oct 19, 2011 at 12:43 AM, Mel Gorman <mgorman@suse.de> wrote:
>
> My vote is with the migration change. While there are occasionally
> patches to make migration go faster, I don't consider it a hot path.
> mremap may be used intensively by JVMs so I'd loathe to hurt it.

Ok, everybody seems to like that more, and it removes code rather than
adds it, so I certainly prefer it too. Paweł, can you test that other
patch (to mm/migrate.c) that Hugh posted? Instead of the mremap vma
locking patch that you already verified for your setup?

Hugh - that one didn't have a changelog/sign-off, so if you could
write that up, and Paweł's testing is successful, I can apply it...
Looks like we have acks from both Andrea and Mel.

                  Linus


* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-19 13:39           ` Linus Torvalds
@ 2011-10-19 19:42             ` Hugh Dickins
  2011-10-20  6:30               ` Paweł Sikora
  2011-10-20 12:51               ` Nai Xia
  0 siblings, 2 replies; 72+ messages in thread
From: Hugh Dickins @ 2011-10-19 19:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mel Gorman, Andrea Arcangeli, Pawel Sikora, Andrew Morton,
	linux-mm, jpiszcz, arekm, linux-kernel

On Wed, 19 Oct 2011, Linus Torvalds wrote:
> On Wed, Oct 19, 2011 at 12:43 AM, Mel Gorman <mgorman@suse.de> wrote:
> >
> > My vote is with the migration change. While there are occasionally
> > patches to make migration go faster, I don't consider it a hot path.
> > mremap may be used intensively by JVMs so I'd loathe to hurt it.
> 
> Ok, everybody seems to like that more, and it removes code rather than
> adds it, so I certainly prefer it too. Pawel, can you test that other
> patch (to mm/migrate.c) that Hugh posted? Instead of the mremap vma
> locking patch that you already verified for your setup?
> 
> Hugh - that one didn't have a changelog/sign-off, so if you could
> write that up, and Pawel's testing is successful, I can apply it...
> Looks like we have acks from both Andrea and Mel.

Yes, I'm glad to have that input from Andrea and Mel, thank you.

Here we go.  I can't add a Tested-by since Pawel was reporting on the
alternative patch, but perhaps you'll be able to add that in later.

I may have read too much into Pawel's mail, but it sounded like he
would have expected an eponymous find_get_pages() lockup by now,
and was pleased that this patch appeared to have cured that.

I've spent quite a while trying to explain find_get_pages() lockup by
a missed migration entry, but I just don't see it: I don't expect this
(or the alternative) patch to do anything to fix that problem.  I won't
mind if it magically goes away, but I expect we'll need more info from
the debug patch I sent Justin a couple of days ago.

Ah, I'd better send the patch separately as
"[PATCH] mm: fix race between mremap and removing migration entry":
Pawel's "l" makes my old alpine setup choose quoted printable when
I reply to your mail.

Hugh


* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-19 19:42             ` Hugh Dickins
@ 2011-10-20  6:30               ` Paweł Sikora
  2011-10-20  6:51                 ` Linus Torvalds
  2011-10-21  6:54                 ` Nai Xia
  2011-10-20 12:51               ` Nai Xia
  1 sibling, 2 replies; 72+ messages in thread
From: Paweł Sikora @ 2011-10-20  6:30 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, Mel Gorman, Andrea Arcangeli, Andrew Morton,
	linux-mm, jpiszcz, arekm, linux-kernel

On Wednesday 19 of October 2011 21:42:15 Hugh Dickins wrote:
> On Wed, 19 Oct 2011, Linus Torvalds wrote:
> > On Wed, Oct 19, 2011 at 12:43 AM, Mel Gorman <mgorman@suse.de> wrote:
> > >
> > > My vote is with the migration change. While there are occasionally
> > > patches to make migration go faster, I don't consider it a hot path.
> > > mremap may be used intensively by JVMs so I'd loathe to hurt it.
> > 
> > Ok, everybody seems to like that more, and it removes code rather than
> > adds it, so I certainly prefer it too. Pawel, can you test that other
> > patch (to mm/migrate.c) that Hugh posted? Instead of the mremap vma
> > locking patch that you already verified for your setup?
> > 
> > Hugh - that one didn't have a changelog/sign-off, so if you could
> > write that up, and Pawel's testing is successful, I can apply it...
> > Looks like we have acks from both Andrea and Mel.
> 
> Yes, I'm glad to have that input from Andrea and Mel, thank you.
> 
> Here we go.  I can't add a Tested-by since Pawel was reporting on the
> alternative patch, but perhaps you'll be able to add that in later.
> 
> I may have read too much into Pawel's mail, but it sounded like he
> would have expected an eponymous find_get_pages() lockup by now,
> and was pleased that this patch appeared to have cured that.
> 
> I've spent quite a while trying to explain find_get_pages() lockup by
> a missed migration entry, but I just don't see it: I don't expect this
> (or the alternative) patch to do anything to fix that problem.  I won't
> mind if it magically goes away, but I expect we'll need more info from
> the debug patch I sent Justin a couple of days ago.

the latest patch (mm/migrate.c) applied on 3.0.4 also survives points
1) and 2) described previously (https://lkml.org/lkml/2011/10/18/427),
so please apply it to the upstream/stable git tree.

on the other hand, neither patch helps with the 3.0.4+vserver host soft-lockup,
which dies within a few hours of stressing. iirc this lockup started with 2.6.38.
is there any major change in the memory management area in 2.6.38 that i can bisect
and test with vserver?

BR,
Paweł.


* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-20  6:30               ` Paweł Sikora
@ 2011-10-20  6:51                 ` Linus Torvalds
  2011-10-21  6:54                 ` Nai Xia
  1 sibling, 0 replies; 72+ messages in thread
From: Linus Torvalds @ 2011-10-20  6:51 UTC (permalink / raw)
  To: Paweł Sikora
  Cc: Hugh Dickins, Mel Gorman, Andrea Arcangeli, Andrew Morton,
	linux-mm, jpiszcz, arekm, linux-kernel

2011/10/19 Paweł Sikora <pluto@agmk.net>:
>
> the latest patch (mm/migrate.c) applied on 3.0.4 also survives points
> 1) and 2) described previously (https://lkml.org/lkml/2011/10/18/427),
> so please apply it to the upstream/stable git tree.

Ok, thanks, applied and pushed out.

> from the other side, both patches don't help for 3.0.4+vserver host soft-lock
> which dies in few hours of stressing. iirc this lock has started with 2.6.38.
> is there any major change in memory managment area in 2.6.38 that i can bisect
> and test with vserver?

I suspect you'd be best off simply just doing a full bisect. Yes, if
2.6.37 is the last known working kernel for you, and 38 breaks, that's
a lot of commits (about 10k, to be exact), and it will take an
annoying number of reboots and tests, but assuming you don't hit any
problems, it should still be "only" about 14 bisection points or so.

You could *try* to minimize the bisect by only looking at commits that
change mm/, but quite frankly, partial tree bisects tend to not be all
that reliable. But if you want to try, you could do basically

   git bisect start mm/
   git bisect good v2.6.37
   git bisect bad v2.6.38

and go from there. That will try to do a more specific bisect, and you
should have fewer test points, but the end result really is much less
reliable. But it might help narrow things down a bit.
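
[ For reference -- generic git usage rather than anything from the
original mail -- each step of such a bisect then just gets marked
after testing, for example: ]

   # build and boot the kernel git has checked out, run the stress test, then:
   git bisect good     # if this kernel survived the stressing
   git bisect bad      # if it soft-locked again
   # repeat until git names the first bad commit, then clean up with:
   git bisect reset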

             Linus


* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-16 23:54     ` Andrea Arcangeli
  2011-10-17 18:51       ` Hugh Dickins
@ 2011-10-20  9:11       ` Nai Xia
  2011-10-21 15:56         ` Mel Gorman
  1 sibling, 1 reply; 72+ messages in thread
From: Nai Xia @ 2011-10-20  9:11 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hugh Dickins, Pawel Sikora, Andrew Morton, Mel Gorman, linux-mm,
	jpiszcz, arekm, linux-kernel

Hi Andrea,

On Mon, Oct 17, 2011 at 7:54 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> On Thu, Oct 13, 2011 at 04:30:09PM -0700, Hugh Dickins wrote:
>> mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
>> and pagetable locks, were good enough before page migration (with its
>> requirement that every migration entry be found) came in; and enough
>> while migration always held mmap_sem.  But not enough nowadays, when
>> there's memory hotremove and compaction: anon_vma lock is also needed,
>> to make sure a migration entry is not dodging around behind our back.
>
> For things like migrate and split_huge_page, the anon_vma layer must
> guarantee the page is reachable by rmap walk at all times regardless
> if it's at the old or new address.
>
> This shall be guaranteed by the copy_vma called by move_vma well
> before move_page_tables/move_ptes can run.
>
> copy_vma obviously takes the anon_vma lock to insert the new "dst" vma
> into the anon_vma chains structures (vma_link does that). That before
> any pte can be moved.
>
> Because we keep two vmas mapped on both src and dst range, with
> different vma->vm_pgoff that is valid for the page (the page doesn't
> change its page->index) the page should always find _all_ its pte at
> any given time.
>
> There may be other variables at play like the order of insertion in
> the anon_vma chain matches our direction of copy and removal of the
> old pte. But I think the double locking of the PT lock should make the
> order in the anon_vma chain absolutely irrelevant (the rmap_walk
> obviously takes the PT lock too), and furthermore likely the
> anon_vma_chain insertion is favorable (the dst vma is inserted last
> and checked last). But it shouldn't matter.

I happened to be reading this code last week.

And I do think this order matters; the reason is quite similar to why we
need i_mmap_lock in move_ptes():
If rmap_walk goes dst--->src, then when it first looks into dst, ok, the
pte is not there, and it happily skips it and releases the PTL.
Then, just before it looks into src, move_ptes() comes in, takes the locks
and moves the pte from src to dst. And then when rmap_walk() looks
into src, it finds an empty pte again. The pte is still there,
but rmap_walk() missed it!
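
[ For reference, roughly what the anon rmap walk looks like around
2.6.38/3.0, paraphrased from memory rather than quoted exactly: it just
iterates the anon_vma chain in list order and calls back (e.g.
remove_migration_pte()) on each vma in turn, so the position of a vma
in that list, relative to the direction of the pte move, is what
decides whether the entry can slip past the walk: ]

	anon_vma = page_anon_vma(page);
	anon_vma_lock(anon_vma);
	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
		struct vm_area_struct *vma = avc->vma;
		unsigned long address = vma_address(page, vma);
		if (address == -EFAULT)
			continue;
		ret = rmap_one(page, vma, address, arg);
		if (ret != SWAP_AGAIN)
			break;
	}
	anon_vma_unlock(anon_vma);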

IMO, this can really happen in case vma_merge() succeeds.
Imagine that the src vma is faulted late and in anon_vma_prepare()
it gets the same anon_vma as an existing vma (call it evil_vma) through
find_mergeable_anon_vma().  This can potentially make the vma_merge() in
copy_vma() return evil_vma on some new relocation request. But src_vma
is really linked _after_ evil_vma/new_vma/dst_vma.
In this way, the ordering protocol of the anon_vma chain is broken.
This should be a rare case, because I think in most cases
if two VMAs satisfy reusable_anon_vma() they have already been merged.

What do you think?

And if my reasoning is sound and this bug is really triggered by it,
Hugh's first patch should be the right fix :)


Regards,

Nai Xia

>
> Another thing could be the copy_vma vma_merge branch succeeding
> (returning not NULL) but I doubt we risk to fall into that one. For
> the rmap_walk to be always working on both the src and dst
> vma->vma_pgoff the pgoff must be different so we can't possibly be ok
> if there's just 1 vma covering the whole range. I exclude this could
> be the case because the pgoff passed to copy_vma is different than the
> vma->vm_pgoff given to copy_vma, so vma_merge can't possibly succeed.
>
> Yet another point to investigate is the point where we teardown the
> old vma and we leave the new vma generated by copy_vma
> established. That's apparently taken care of by do_munmap in move_vma
> so that shall be safe too as munmap is safe in the first place.
>
> Overall I don't think this patch is needed and it seems a noop.
>
>> It appears that Mel's a8bef8ff6ea1 "mm: migration: avoid race between
>> shift_arg_pages() and rmap_walk() during migration by not migrating
>> temporary stacks" was actually a workaround for this in the special
>> common case of exec's use of move_pagetables(); and we should probably
>> now remove that VM_STACK_INCOMPLETE_SETUP stuff as a separate cleanup.
>
> I don't think this patch can help with that, the problem of execve vs
> rmap_walk is that there's 1 single vma existing for src and dst
> virtual ranges while execve runs move_page_tables. So there is no
> possible way that rmap_walk will be guaranteed to find _all_ ptes
> mapping a page if there's just one vma mapping either the src or dst
> range while move_page_table runs. No addition of locking whatsoever
> can fix that bug because we miss a vma (well modulo locking that
> prevents rmap_walk to run at all, until we're finished with execve,
> which is more or less what VM_STACK_INCOMPLETE_SETUP does...).
>
> The only way is to fix this is prevent migrate (or any other rmap_walk
> user that requires 100% reliability from the rmap layer, for example
> swap doesn't require 100% reliability and can still run and gracefully
> fail at finding the pte) while we're moving pagetables in execve. And
> that's what Mel's above mentioned patch does.
>
> The other way to fix that bug that I implemented was to do copy_vma in
> execve, so that we still have both src and dst ranges of
> move_page_tables covered by 2 (not 1) vma, each with the proper
> vma->vm_pgoff, so my approach fixed that bug as well (but requires a
> vma allocation in execve so it was dropped in favor of Mel's patch
> which is totally fine with as both approaches fixes the bug equally
> well, even if now we've to deal with this special case of sometime
> rmap_walk having false negatives if the vma_flags is set, and the
> important thing is that after VM_STACK_INCOMPLETE_SETUP has been
> cleared it won't ever be set again for the whole lifetime of the vma).
>
> I may be missing something, I did a short review so far, just so the
> patch doesn't get merged if not needed. I mean I think it needs a bit
> more looks on it... The fact the i_mmap_mutex was taken but the
> anon_vma lock was not taken (while in every other place they both are
> needed) certainly makes the patch look correct, but that's just a
> misleading coincidence I think.
>
>


* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-19 19:42             ` Hugh Dickins
  2011-10-20  6:30               ` Paweł Sikora
@ 2011-10-20 12:51               ` Nai Xia
       [not found]                 ` <CANsGZ6a6_q8+88FRV2froBsVEq7GhtKd9fRnB-0M2MD3a7tnSw@mail.gmail.com>
  1 sibling, 1 reply; 72+ messages in thread
From: Nai Xia @ 2011-10-20 12:51 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, Mel Gorman, Andrea Arcangeli, Pawel Sikora,
	Andrew Morton, linux-mm, jpiszcz, arekm, linux-kernel

On Thursday 20 October 2011 03:42:15 Hugh Dickins wrote:
> On Wed, 19 Oct 2011, Linus Torvalds wrote:
> > On Wed, Oct 19, 2011 at 12:43 AM, Mel Gorman <mgorman@suse.de> wrote:
> > >
> > > My vote is with the migration change. While there are occasionally
> > > patches to make migration go faster, I don't consider it a hot path.
> > > mremap may be used intensively by JVMs so I'd loathe to hurt it.
> > 
> > Ok, everybody seems to like that more, and it removes code rather than
> > adds it, so I certainly prefer it too. Pawel, can you test that other
> > patch (to mm/migrate.c) that Hugh posted? Instead of the mremap vma
> > locking patch that you already verified for your setup?
> > 
> > Hugh - that one didn't have a changelog/sign-off, so if you could
> > write that up, and Pawel's testing is successful, I can apply it...
> > Looks like we have acks from both Andrea and Mel.
> 
> Yes, I'm glad to have that input from Andrea and Mel, thank you.
> 
> Here we go.  I can't add a Tested-by since Pawel was reporting on the
> alternative patch, but perhaps you'll be able to add that in later.
> 
> I may have read too much into Pawel's mail, but it sounded like he
> would have expected an eponymous find_get_pages() lockup by now,
> and was pleased that this patch appeared to have cured that.
> 
> I've spent quite a while trying to explain find_get_pages() lockup by
> a missed migration entry, but I just don't see it: I don't expect this
> (or the alternative) patch to do anything to fix that problem.  I won't
> mind if it magically goes away, but I expect we'll need more info from
> the debug patch I sent Justin a couple of days ago.

Hi Hugh, 

Will you please look into my explanation in my reply to Andrea in this thread
and see if it's what you are seeking?


Thanks,

Nai Xia


> 
> Ah, I'd better send the patch separately as
> "[PATCH] mm: fix race between mremap and removing migration entry":
> Pawel's "l" makes my old alpine setup choose quoted printable when
> I reply to your mail.
> 
> Hugh
> 
> 


* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
       [not found]                 ` <CANsGZ6a6_q8+88FRV2froBsVEq7GhtKd9fRnB-0M2MD3a7tnSw@mail.gmail.com>
@ 2011-10-21  6:22                   ` Nai Xia
  2011-10-21  8:07                     ` Pawel Sikora
  0 siblings, 1 reply; 72+ messages in thread
From: Nai Xia @ 2011-10-21  6:22 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: arekm, Linus Torvalds, linux-mm, Mel Gorman, jpiszcz,
	linux-kernel, Andrew Morton, Pawel Sikora, Andrea Arcangeli

On Fri, Oct 21, 2011 at 2:36 AM, Hugh Dickins <hughd@google.com> wrote:
> I'm travelling at the moment, my brain is not in gear, the source is not in
> front of me, and I'm not used to typing on my phone much!  Excuses, excuses
>
> I flip between thinking you are right, and I'm a fool, and thinking you are
> wrong, and I'm still a fool.

Ha, well, human brains are all weak at thoroughly searching a racing
state space, while automated model checking is still far from applicable
to complex real-world code like the kernel source. Maybe some day someone
will come up with a human-guided, computer-aided tool to help us search
the combinations of all involved code paths to validate a specific
high-level logic assertion.


>
> Please work it out with Linus, Andrea and Mel: I may not be able to reply
> for a couple of days - thanks.

OK.

And as a side note: since I notice that Pawel's workload may include OOM,
I'd like to give an imaginary series of events that may trigger such a bug.

1.  do_brk() wants to expand a vma, but vma_merge() fails because of a
transient ENOMEM; do_brk() instead succeeds in creating a new vma at the boundary.

    vma_a           vma_b
|----------------|---------------------|

2.  a page fault in vma_b gives it an anon_vma; then a page fault in vma_a
reuses the anon_vma of vma_b.


3.   vma_a is remapped to somewhere irrelevant; a new vma_c is created
and linked by anon_vma_clone(). In the anon_vma chain of vma_b,
vma_c is linked after vma_b:

    vma_a           vma_b                   vma_c
|----------------|---------------------|   |==============|

           vma_b                   vma_c
|---------------------|   |==============|



4.  vma_c is remapped back to the original place where vma_a was.
Ok, vma_merge() in copy_vma() says that this request can be merged
into vma_b, and it returns vma_b.

5. move_page_tables() moves from vma_c to vma_b, and races with rmap_walk.
The reversed ordering of vma_b and vma_c in the anon_vma chain makes
rmap_walk miss an entry in the way I explained.

Well, it seems a very tricky construction, but it also seems possible to me.
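
[ Spelling out step 5 as an interleaving, under the assumption above
that vma_b (the destination) now sits before vma_c (the source) in the
anon_vma chain; note that taking the PT lock for each check doesn't
help here, only serializing the whole walk against the move would: ]

	rmap_walk (visits the chain in order: ..., vma_b, vma_c):
		1. checks vma_b under its PTL: no migration pte there yet, moves on
	move_ptes (moving vma_c -> vma_b):
		2. takes both PTLs, moves the migration pte from vma_c to vma_b, unlocks
	rmap_walk:
		3. checks vma_c under its PTL: nothing there either
	-> the entry now sits under vma_b, which the walk has already passed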

Will Linus, Andrea and Mel, or anyone else, please look into my construction
and judge whether it's valid?

Thanks

Nai Xia

>
> Hugh
>
> On Oct 20, 2011 5:51 AM, "Nai Xia" <nai.xia@gmail.com> wrote:
>>
>> On Thursday 20 October 2011 03:42:15 Hugh Dickins wrote:
>> > On Wed, 19 Oct 2011, Linus Torvalds wrote:
>> > > On Wed, Oct 19, 2011 at 12:43 AM, Mel Gorman <mgorman@suse.de> wrote:
>> > > >
>> > > > My vote is with the migration change. While there are occasionally
>> > > > patches to make migration go faster, I don't consider it a hot path.
>> > > > mremap may be used intensively by JVMs so I'd loathe to hurt it.
>> > >
>> > > Ok, everybody seems to like that more, and it removes code rather than
>> > > adds it, so I certainly prefer it too. Pawel, can you test that other
>> > > patch (to mm/migrate.c) that Hugh posted? Instead of the mremap vma
>> > > locking patch that you already verified for your setup?
>> > >
>> > > Hugh - that one didn't have a changelog/sign-off, so if you could
>> > > write that up, and Pawel's testing is successful, I can apply it...
>> > > Looks like we have acks from both Andrea and Mel.
>> >
>> > Yes, I'm glad to have that input from Andrea and Mel, thank you.
>> >
>> > Here we go.  I can't add a Tested-by since Pawel was reporting on the
>> > alternative patch, but perhaps you'll be able to add that in later.
>> >
>> > I may have read too much into Pawel's mail, but it sounded like he
>> > would have expected an eponymous find_get_pages() lockup by now,
>> > and was pleased that this patch appeared to have cured that.
>> >
>> > I've spent quite a while trying to explain find_get_pages() lockup by
>> > a missed migration entry, but I just don't see it: I don't expect this
>> > (or the alternative) patch to do anything to fix that problem.  I won't
>> > mind if it magically goes away, but I expect we'll need more info from
>> > the debug patch I sent Justin a couple of days ago.
>>
>> Hi Hugh,
>>
>> Will you please look into my explanation in my reply to Andrea in this
>> thread
>> and see if it's what you are seeking?
>>
>>
>> Thanks,
>>
>> Nai Xia
>>
>>
>> >
>> > Ah, I'd better send the patch separately as
>> > "[PATCH] mm: fix race between mremap and removing migration entry":
>> > Pawel's "l" makes my old alpine setup choose quoted printable when
>> > I reply to your mail.
>> >
>> > Hugh
>> >
>> >
>


* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-20  6:30               ` Paweł Sikora
  2011-10-20  6:51                 ` Linus Torvalds
@ 2011-10-21  6:54                 ` Nai Xia
  2011-10-21  7:35                   ` Pawel Sikora
  1 sibling, 1 reply; 72+ messages in thread
From: Nai Xia @ 2011-10-21  6:54 UTC (permalink / raw)
  To: Paweł Sikora
  Cc: Hugh Dickins, Linus Torvalds, Mel Gorman, Andrea Arcangeli,
	Andrew Morton, linux-mm, jpiszcz, arekm, linux-kernel

2011/10/20 Paweł Sikora <pluto@agmk.net>:
> On Wednesday 19 of October 2011 21:42:15 Hugh Dickins wrote:
>> On Wed, 19 Oct 2011, Linus Torvalds wrote:
>> > On Wed, Oct 19, 2011 at 12:43 AM, Mel Gorman <mgorman@suse.de> wrote:
>> > >
>> > > My vote is with the migration change. While there are occasionally
>> > > patches to make migration go faster, I don't consider it a hot path.
>> > > mremap may be used intensively by JVMs so I'd loathe to hurt it.
>> >
>> > Ok, everybody seems to like that more, and it removes code rather than
>> > adds it, so I certainly prefer it too. Pawel, can you test that other
>> > patch (to mm/migrate.c) that Hugh posted? Instead of the mremap vma
>> > locking patch that you already verified for your setup?
>> >
>> > Hugh - that one didn't have a changelog/sign-off, so if you could
>> > write that up, and Pawel's testing is successful, I can apply it...
>> > Looks like we have acks from both Andrea and Mel.
>>
>> Yes, I'm glad to have that input from Andrea and Mel, thank you.
>>
>> Here we go.  I can't add a Tested-by since Pawel was reporting on the
>> alternative patch, but perhaps you'll be able to add that in later.
>>
>> I may have read too much into Pawel's mail, but it sounded like he
>> would have expected an eponymous find_get_pages() lockup by now,
>> and was pleased that this patch appeared to have cured that.
>>
>> I've spent quite a while trying to explain find_get_pages() lockup by
>> a missed migration entry, but I just don't see it: I don't expect this
>> (or the alternative) patch to do anything to fix that problem.  I won't
>> mind if it magically goes away, but I expect we'll need more info from
>> the debug patch I sent Justin a couple of days ago.
>
> the latest patch (mm/migrate.c) applied on 3.0.4 also survives points
> 1) and 2) described previously (https://lkml.org/lkml/2011/10/18/427),
> so please apply it to the upstream/stable git tree.
>
> from the other side, both patches don't help for 3.0.4+vserver host soft-lock

Hi Paweł,

Did your "both" mean that you applied each patch and ran the tests separately,
or that you applied both patches and ran them together?

Maybe there is more than one bug at play producing the same effect;
not fixing all of them wouldn't help at all.

Thanks,

Nai Xia


> which dies in few hours of stressing. iirc this lock has started with 2.6.38.
> is there any major change in memory managment area in 2.6.38 that i can bisect
> and test with vserver?
>
> BR,
> Paweł.
>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-21  6:54                 ` Nai Xia
@ 2011-10-21  7:35                   ` Pawel Sikora
  0 siblings, 0 replies; 72+ messages in thread
From: Pawel Sikora @ 2011-10-21  7:35 UTC (permalink / raw)
  To: Nai Xia
  Cc: Hugh Dickins, Linus Torvalds, Mel Gorman, Andrea Arcangeli,
	Andrew Morton, linux-mm, jpiszcz, arekm, linux-kernel

On Friday 21 of October 2011 14:54:29 Nai Xia wrote:
> 2011/10/20 Paweł Sikora <pluto@agmk.net>:
> > On Wednesday 19 of October 2011 21:42:15 Hugh Dickins wrote:
> >> On Wed, 19 Oct 2011, Linus Torvalds wrote:
> >> > On Wed, Oct 19, 2011 at 12:43 AM, Mel Gorman <mgorman@suse.de> wrote:
> >> > >
> >> > > My vote is with the migration change. While there are occasionally
> >> > > patches to make migration go faster, I don't consider it a hot path.
> >> > > mremap may be used intensively by JVMs so I'd loathe to hurt it.
> >> >
> >> > Ok, everybody seems to like that more, and it removes code rather than
> >> > adds it, so I certainly prefer it too. Pawel, can you test that other
> >> > patch (to mm/migrate.c) that Hugh posted? Instead of the mremap vma
> >> > locking patch that you already verified for your setup?
> >> >
> >> > Hugh - that one didn't have a changelog/sign-off, so if you could
> >> > write that up, and Pawel's testing is successful, I can apply it...
> >> > Looks like we have acks from both Andrea and Mel.
> >>
> >> Yes, I'm glad to have that input from Andrea and Mel, thank you.
> >>
> >> Here we go.  I can't add a Tested-by since Pawel was reporting on the
> >> alternative patch, but perhaps you'll be able to add that in later.
> >>
> >> I may have read too much into Pawel's mail, but it sounded like he
> >> would have expected an eponymous find_get_pages() lockup by now,
> >> and was pleased that this patch appeared to have cured that.
> >>
> >> I've spent quite a while trying to explain find_get_pages() lockup by
> >> a missed migration entry, but I just don't see it: I don't expect this
> >> (or the alternative) patch to do anything to fix that problem.  I won't
> >> mind if it magically goes away, but I expect we'll need more info from
> >> the debug patch I sent Justin a couple of days ago.
> >
> > the latest patch (mm/migrate.c) applied on 3.0.4 also survives points
> > 1) and 2) described previously (https://lkml.org/lkml/2011/10/18/427),
> > so please apply it to the upstream/stable git tree.
> >
> > from the other side, both patches don't help for 3.0.4+vserver host soft-lock
> 
> Hi Paweł,
> 
> Did your "both" mean that you applied each patch and run the tests separately,

Yes, I've tested Hugh's patches separately.

> Maybe there were more than one bugs dancing but having a same effect,
> not fixing all of them wouldn't help at all.

I suppose the vserver patch only exposes some tricky bug introduced in 2.6.38.


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-21  6:22                   ` Nai Xia
@ 2011-10-21  8:07                     ` Pawel Sikora
  2011-10-21  9:07                       ` Nai Xia
  0 siblings, 1 reply; 72+ messages in thread
From: Pawel Sikora @ 2011-10-21  8:07 UTC (permalink / raw)
  To: Nai Xia
  Cc: Hugh Dickins, arekm, Linus Torvalds, linux-mm, Mel Gorman,
	jpiszcz, linux-kernel, Andrew Morton, Andrea Arcangeli

On Friday 21 of October 2011 14:22:37 Nai Xia wrote:

> And as a side note. Since I notice that Pawel's workload may include OOM,

my last tests on the patched (3.0.4 + migrate.c fix + vserver) kernel produce full cpu load
on dual 8-core opterons, as on this htop screenshot -> http://pluto.agmk.net/kernel/screen1.png
afaics the userspace applications usually don't use more than half of the physical memory
and the so-called "cache" on the htop bar doesn't reach 100%.

the patched kernel with CONFIG_TRANSPARENT_HUGEPAGE disabled (a new thing in 2.6.38)
died at night, so now i'm going to disable CONFIG_COMPACTION/MIGRATION as well in the next
steps and stress this machine again...


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-21  8:07                     ` Pawel Sikora
@ 2011-10-21  9:07                       ` Nai Xia
  2011-10-21 21:36                         ` Paweł Sikora
  0 siblings, 1 reply; 72+ messages in thread
From: Nai Xia @ 2011-10-21  9:07 UTC (permalink / raw)
  To: Pawel Sikora
  Cc: Hugh Dickins, arekm, Linus Torvalds, linux-mm, Mel Gorman,
	jpiszcz, linux-kernel, Andrew Morton, Andrea Arcangeli

On Fri, Oct 21, 2011 at 4:07 PM, Pawel Sikora <pluto@agmk.net> wrote:
> On Friday 21 of October 2011 14:22:37 Nai Xia wrote:
>
>> And as a side note. Since I notice that Pawel's workload may include OOM,
>
> my last tests on patched (3.0.4 + migrate.c fix + vserver) kernel produce full cpu load
> on dual 8-cores opterons like on this htop screenshot -> http://pluto.agmk.net/kernel/screen1.png
> afaics all userspace applications usualy don't use more than half of physical memory
> and so called "cache" on htop bar doesn't reach the 100%.

OK, did you log any OOM killing if there was some memory usage burst?
But, well, my OOM reasoning above is a direct shortcut to an imagined
root cause: the "adjacent VMAs which should have been merged but in fact
were not merged" case.
Maybe there are other cases that can lead to this, or maybe it's
a totally different bug....

But I still think that if my reasoning is sound, similar bad things will
happen again some time in the future,
even if it was not your case here...

>
> the patched kernel with disabled CONFIG_TRANSPARENT_HUGEPAGE (new thing in 2.6.38)
> died at night, so now i'm going to disable also CONFIG_COMPACTION/MIGRATION in next
> steps and stress this machine again...

OK, it's smart to narrow down the range first....

>
>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-19  7:30   ` Mel Gorman
@ 2011-10-21 12:44     ` Mel Gorman
  0 siblings, 0 replies; 72+ messages in thread
From: Mel Gorman @ 2011-10-21 12:44 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Pawel Sikora, Andrew Morton, Andrea Arcangeli, linux-mm, jpiszcz,
	arekm, linux-kernel

On Wed, Oct 19, 2011 at 09:30:36AM +0200, Mel Gorman wrote:
> >  RIP: 0010:[<ffffffff81127b76>]  [<ffffffff81127b76>]
> >                        migration_entry_wait+0x156/0x160
> >   [<ffffffff811016a1>] handle_pte_fault+0xae1/0xaf0
> >   [<ffffffff810feee2>] ? __pte_alloc+0x42/0x120
> >   [<ffffffff8112c26b>] ? do_huge_pmd_anonymous_page+0xab/0x310
> >   [<ffffffff81102a31>] handle_mm_fault+0x181/0x310
> >   [<ffffffff81106097>] ? vma_adjust+0x537/0x570
> >   [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
> >   [<ffffffff81109a05>] ? do_mremap+0x2d5/0x570
> >   [<ffffffff81421d5f>] page_fault+0x1f/0x30
> > 
> > mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
> > and pagetable locks, were good enough before page migration (with its
> > requirement that every migration entry be found) came in; and enough
> > while migration always held mmap_sem.  But not enough nowadays, when
> > there's memory hotremove and compaction: anon_vma lock is also needed,
> > to make sure a migration entry is not dodging around behind our back.
> > 
> 
> migration holds the anon_vma lock while it unmaps the pages and keeps holding
> it until after remove_migration_ptes is called. 

I reread this today and realised I was sloppy with my writing. migration
holds the anon_vma lock while it unmaps the pages. It also holds the
anon_vma lock during remove_migration_ptes. For the migration operation,
a reference count is held on anon_vma but not the lock itself.

> There are two anon vmas
> that should exist during mremap that were created for the move. They
> should not be able to disappear while migration runs and right now,

And what is preventing them disappearing is not the lock but the
reference count.
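
To make that distinction concrete, here is a small userspace analogue (a
sketch with invented names, not kernel code): the reference count is what
keeps the structure alive across the whole operation, while the lock only
needs to be held around each individual walk.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

/* stand-in for struct anon_vma: a lock plus a lifetime reference count */
struct fake_anon_vma {
	pthread_mutex_t lock;
	int refcount;
};

static struct fake_anon_vma *av_get(struct fake_anon_vma *av)
{
	__sync_fetch_and_add(&av->refcount, 1);
	return av;
}

static void av_put(struct fake_anon_vma *av)
{
	if (__sync_sub_and_fetch(&av->refcount, 1) == 0) {
		pthread_mutex_destroy(&av->lock);
		free(av);
	}
}

/* each rmap walk takes the lock only for its own duration */
static void rmap_walk_sketch(struct fake_anon_vma *av, const char *what)
{
	pthread_mutex_lock(&av->lock);
	printf("%s: walking the vma chain with the lock held\n", what);
	pthread_mutex_unlock(&av->lock);
}

int main(void)
{
	struct fake_anon_vma *av = calloc(1, sizeof(*av));

	if (!av)
		return 1;
	pthread_mutex_init(&av->lock, NULL);
	av->refcount = 1;

	av_get(av);                       /* the operation pins the object */
	rmap_walk_sketch(av, "unmap");    /* lock taken and dropped here   */
	/* the page copy itself happens with no lock held, only the pin   */
	rmap_walk_sketch(av, "remove migration ptes");
	av_put(av);                       /* unpin when the operation ends */

	av_put(av);                       /* drop the original reference   */
	return 0;
}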

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-20  9:11       ` Nai Xia
@ 2011-10-21 15:56         ` Mel Gorman
  2011-10-21 17:21           ` Nai Xia
  2011-10-21 17:41           ` Andrea Arcangeli
  0 siblings, 2 replies; 72+ messages in thread
From: Mel Gorman @ 2011-10-21 15:56 UTC (permalink / raw)
  To: Nai Xia
  Cc: Andrea Arcangeli, Hugh Dickins, Pawel Sikora, Andrew Morton,
	linux-mm, jpiszcz, arekm, linux-kernel

On Thu, Oct 20, 2011 at 05:11:28PM +0800, Nai Xia wrote:
> On Mon, Oct 17, 2011 at 7:54 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> > On Thu, Oct 13, 2011 at 04:30:09PM -0700, Hugh Dickins wrote:
> >> mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
> >> and pagetable locks, were good enough before page migration (with its
> >> requirement that every migration entry be found) came in; and enough
> >> while migration always held mmap_sem.  But not enough nowadays, when
> >> there's memory hotremove and compaction: anon_vma lock is also needed,
> >> to make sure a migration entry is not dodging around behind our back.
> >
> > For things like migrate and split_huge_page, the anon_vma layer must
> > guarantee the page is reachable by rmap walk at all times regardless
> > if it's at the old or new address.
> >
> > This shall be guaranteed by the copy_vma called by move_vma well
> > before move_page_tables/move_ptes can run.
> >
> > copy_vma obviously takes the anon_vma lock to insert the new "dst" vma
> > into the anon_vma chains structures (vma_link does that). That before
> > any pte can be moved.
> >
> > Because we keep two vmas mapped on both src and dst range, with
> > different vma->vm_pgoff that is valid for the page (the page doesn't
> > change its page->index) the page should always find _all_ its pte at
> > any given time.
> >
> > There may be other variables at play like the order of insertion in
> > the anon_vma chain matches our direction of copy and removal of the
> > old pte. But I think the double locking of the PT lock should make the
> > order in the anon_vma chain absolutely irrelevant (the rmap_walk
> > obviously takes the PT lock too), and furthermore likely the
> > anon_vma_chain insertion is favorable (the dst vma is inserted last
> > and checked last). But it shouldn't matter.
> 
> I happened to be reading these code last week.
> 
> And I do think this order matters, the reason is just quite similar why we
> need i_mmap_lock in move_ptes():
> If rmap_walk goes dst--->src, then when it first look into dst, ok, the

You might be right in that the ordering matters. We do link new VMAs at
the end of the list in anon_vma_chain_list so remove_migrate_ptes should
be walking from src->dst.

If remove_migrate_pte finds src first, it will remove the pte and the
correct version will get copied. If move_ptes runs between when
remove_migrate_ptes moves from src to dst, then the PTE at dst will
still be correct.
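
As a sanity check of that argument, a tiny standalone simulation (plain
userspace C with invented names, not kernel code): with the walk visiting
src before dst, the entry is found no matter where the move lands in the
interleaving.

#include <stdbool.h>
#include <stdio.h>

/* one run of the race: the walk visits src then dst (the intended order);
 * move_step says after how many visits mremap's move_ptes() runs */
static bool walk_finds_entry(int move_step)
{
	bool src = true, dst = false, seen = false;
	int step = 0;

	if (move_step == step) { dst = src; src = false; }  /* move before the walk */
	seen |= src;                                        /* walk visits src */
	step++;
	if (move_step == step) { dst = src; src = false; }  /* move between visits */
	seen |= dst;                                        /* walk visits dst */
	return seen;
}

int main(void)
{
	for (int move_step = 0; move_step <= 1; move_step++)
		printf("move after %d visit(s): entry %s\n", move_step,
		       walk_finds_entry(move_step) ? "found" : "missed");
	return 0;
}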

> pte is not there, and it happily skip it and release the PTL.
> Then just before it look into src, move_ptes() comes in, takes the locks
> and moves the pte from src to dst. And then when rmap_walk() look
> into src,  it will find an empty pte again. The pte is still there,
> but rmap_walk() missed it !
> 

I believe the ordering is correct though and protects us in this case.

> IMO, this can really happen in case of vma_merge() succeeding.
> Imagine that src vma is lately faulted and in anon_vma_prepare()
> it got a same anon_vma with an existing vma ( named evil_vma )through
> find_mergeable_anon_vma().  This can potentially make the vma_merge() in
> copy_vma() return with evil_vma on some new relocation request. But src_vma
> is really linked _after_  evil_vma/new_vma/dst_vma.
> In this way, the ordering protocol  of anon_vma chain is broken.
> This should be a rare case because I think in most cases
> if two VMAs can reusable_anon_vma() they were already merged.
> 
> How do you think  ?
> 

Despite the comments in anon_vma_compatible(), I would expect that VMAs
that can share an anon_vma from find_mergeable_anon_vma() will also get
merged. When the new VMA is created, it will be linked in the usual
manner and the oldest->newest ordering is what is required. That's not
that important though.

What is important is if mremap is moving src to a dst that is adjacent
to another anon_vma. If src has never been faulted, it's not an issue
because there are also no migration PTEs. If src has been faulted, then
is_mergeable_anon_vma() should fail as anon_vma1 != anon_vma2 and they
are not compatible. The ordering is preserved and we are still ok.
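
A rough sketch of the rule being relied on here (simplified; the real
is_mergeable_anon_vma() in mm/mmap.c looks at the vmas as well): anon_vmas
are only treated as compatible for merging when one side has none yet or
both already share the same one.

#include <stdbool.h>
#include <stdio.h>

struct anon_vma;	/* opaque; only identity matters for this sketch */

/* simplified stand-in for the is_mergeable_anon_vma() rule discussed above */
static bool anon_vmas_compatible(struct anon_vma *a1, struct anon_vma *a2)
{
	return !a1 || !a2 || a1 == a2;
}

int main(void)
{
	struct anon_vma *src_av = (struct anon_vma *)0x1;
	struct anon_vma *dst_av = (struct anon_vma *)0x2;

	/* src never faulted: no anon_vma, and no migration ptes to miss either */
	printf("unfaulted src mergeable: %d\n",
	       anon_vmas_compatible(NULL, dst_av));

	/* src faulted with its own anon_vma: not compatible, no merge,
	 * so the src-before-dst ordering is preserved */
	printf("faulted src mergeable:   %d\n",
	       anon_vmas_compatible(src_av, dst_av));
	return 0;
}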

All that said, while I don't think there is a problem, I can't convince
myself 100% of it. Andrea, can you spot a flaw?

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-21 15:56         ` Mel Gorman
@ 2011-10-21 17:21           ` Nai Xia
  2011-10-21 17:41           ` Andrea Arcangeli
  1 sibling, 0 replies; 72+ messages in thread
From: Nai Xia @ 2011-10-21 17:21 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrea Arcangeli, Hugh Dickins, Pawel Sikora, Andrew Morton,
	linux-mm, jpiszcz, arekm, linux-kernel

On Fri, Oct 21, 2011 at 11:56 PM, Mel Gorman <mgorman@suse.de> wrote:
> On Thu, Oct 20, 2011 at 05:11:28PM +0800, Nai Xia wrote:
>> On Mon, Oct 17, 2011 at 7:54 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>> > On Thu, Oct 13, 2011 at 04:30:09PM -0700, Hugh Dickins wrote:
>> >> mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
>> >> and pagetable locks, were good enough before page migration (with its
>> >> requirement that every migration entry be found) came in; and enough
>> >> while migration always held mmap_sem.  But not enough nowadays, when
>> >> there's memory hotremove and compaction: anon_vma lock is also needed,
>> >> to make sure a migration entry is not dodging around behind our back.
>> >
>> > For things like migrate and split_huge_page, the anon_vma layer must
>> > guarantee the page is reachable by rmap walk at all times regardless
>> > if it's at the old or new address.
>> >
>> > This shall be guaranteed by the copy_vma called by move_vma well
>> > before move_page_tables/move_ptes can run.
>> >
>> > copy_vma obviously takes the anon_vma lock to insert the new "dst" vma
>> > into the anon_vma chains structures (vma_link does that). That before
>> > any pte can be moved.
>> >
>> > Because we keep two vmas mapped on both src and dst range, with
>> > different vma->vm_pgoff that is valid for the page (the page doesn't
>> > change its page->index) the page should always find _all_ its pte at
>> > any given time.
>> >
>> > There may be other variables at play like the order of insertion in
>> > the anon_vma chain matches our direction of copy and removal of the
>> > old pte. But I think the double locking of the PT lock should make the
>> > order in the anon_vma chain absolutely irrelevant (the rmap_walk
>> > obviously takes the PT lock too), and furthermore likely the
>> > anon_vma_chain insertion is favorable (the dst vma is inserted last
>> > and checked last). But it shouldn't matter.
>>
>> I happened to be reading these code last week.
>>
>> And I do think this order matters, the reason is just quite similar why we
>> need i_mmap_lock in move_ptes():
>> If rmap_walk goes dst--->src, then when it first look into dst, ok, the
>
> You might be right in that the ordering matters. We do link new VMAs at
> the end of the list in anon_vma_chain_list so remove_migrate_ptes should
> be walking from src->dst.
>
> If remove_migrate_pte finds src first, it will remove the pte and the
> correct version will get copied. If move_ptes runs between when
> remove_migrate_ptes moves from src to dst, then the PTE at dst will
> still be correct.
>
>> pte is not there, and it happily skip it and release the PTL.
>> Then just before it look into src, move_ptes() comes in, takes the locks
>> and moves the pte from src to dst. And then when rmap_walk() look
>> into src,  it will find an empty pte again. The pte is still there,
>> but rmap_walk() missed it !
>>
>
> I believe the ordering is correct though and protects us in this case.
>
>> IMO, this can really happen in case of vma_merge() succeeding.
>> Imagine that src vma is lately faulted and in anon_vma_prepare()
>> it got a same anon_vma with an existing vma ( named evil_vma )through
>> find_mergeable_anon_vma().  This can potentially make the vma_merge() in
>> copy_vma() return with evil_vma on some new relocation request. But src_vma
>> is really linked _after_  evil_vma/new_vma/dst_vma.
>> In this way, the ordering protocol  of anon_vma chain is broken.
>> This should be a rare case because I think in most cases
>> if two VMAs can reusable_anon_vma() they were already merged.
>>
>> How do you think  ?
>>
>
> Despite the comments in anon_vma_compatible(), I would expect that VMAs
> that can share an anon_vma from find_mergeable_anon_vma() will also get
> merged. When the new VMA is created, it will be linked in the usual
> manner and the oldest->newest ordering is what is required. That's not
> that important though.
>
> What is important is if mremap is moving src to a dst that is adjacent
> to another anon_vma. If src has never been faulted, it's not an issue
> because there are also no migration PTEs. If src has been faulted, then
> is_mergeable_anon_vma() should fail as anon_vma1 != anon_vma2 and they
> are not compatible. The ordering is preserved and we are still ok.

Hi Mel,

Thanks for the input. I agree with _almost_ all of your reasoning above.

But there is a tricky series of events I mentioned in
https://lkml.org/lkml/2011/10/21/14
which, I think, can really lead to anon_vma1 == anon_vma2 in this case.
That series of events starts from a failure where do_brk() fails on
vma_merge() due to ENOMEM, rare as that may be. And I am still not sure
whether there are any other corner cases in which VMAs that "should be
merged" just sit there side by side for some reason -- normally that does
not trigger BUGs, so it may be hard to detect in a real workload.

Please refer to my link; I think the construction is clear, unless I have
missed something subtle.

Thanks,

Nai Xia
>
> All that said, while I don't think there is a problem, I can't convince
> myself 100% of it. Andrea, can you spot a flaw?
>
> --
> Mel Gorman
> SUSE Labs
>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-21 15:56         ` Mel Gorman
  2011-10-21 17:21           ` Nai Xia
@ 2011-10-21 17:41           ` Andrea Arcangeli
  2011-10-21 22:50             ` Andrea Arcangeli
  2011-10-22  5:07             ` kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110 Nai Xia
  1 sibling, 2 replies; 72+ messages in thread
From: Andrea Arcangeli @ 2011-10-21 17:41 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Nai Xia, Hugh Dickins, Pawel Sikora, Andrew Morton, linux-mm,
	jpiszcz, arekm, linux-kernel

On Fri, Oct 21, 2011 at 05:56:32PM +0200, Mel Gorman wrote:
> On Thu, Oct 20, 2011 at 05:11:28PM +0800, Nai Xia wrote:
> > On Mon, Oct 17, 2011 at 7:54 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> > > On Thu, Oct 13, 2011 at 04:30:09PM -0700, Hugh Dickins wrote:
> > >> mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
> > >> and pagetable locks, were good enough before page migration (with its
> > >> requirement that every migration entry be found) came in; and enough
> > >> while migration always held mmap_sem.  But not enough nowadays, when
> > >> there's memory hotremove and compaction: anon_vma lock is also needed,
> > >> to make sure a migration entry is not dodging around behind our back.
> > >
> > > For things like migrate and split_huge_page, the anon_vma layer must
> > > guarantee the page is reachable by rmap walk at all times regardless
> > > if it's at the old or new address.
> > >
> > > This shall be guaranteed by the copy_vma called by move_vma well
> > > before move_page_tables/move_ptes can run.
> > >
> > > copy_vma obviously takes the anon_vma lock to insert the new "dst" vma
> > > into the anon_vma chains structures (vma_link does that). That before
> > > any pte can be moved.
> > >
> > > Because we keep two vmas mapped on both src and dst range, with
> > > different vma->vm_pgoff that is valid for the page (the page doesn't
> > > change its page->index) the page should always find _all_ its pte at
> > > any given time.
> > >
> > > There may be other variables at play like the order of insertion in
> > > the anon_vma chain matches our direction of copy and removal of the
> > > old pte. But I think the double locking of the PT lock should make the
> > > order in the anon_vma chain absolutely irrelevant (the rmap_walk
> > > obviously takes the PT lock too), and furthermore likely the
> > > anon_vma_chain insertion is favorable (the dst vma is inserted last
> > > and checked last). But it shouldn't matter.
> > 
> > I happened to be reading these code last week.
> > 
> > And I do think this order matters, the reason is just quite similar why we
> > need i_mmap_lock in move_ptes():
> > If rmap_walk goes dst--->src, then when it first look into dst, ok, the
> 
> You might be right in that the ordering matters. We do link new VMAs at

Yes, I also think the ordering matters, as I mentioned in the previous email
that Nai answered.

> the end of the list in anon_vma_chain_list so remove_migrate_ptes should
> be walking from src->dst.

Correct. As I mentioned in that previous email that Nai answered,
that would only not be ok if vma_merge succeeds, and I haven't changed my
mind about that...

copy_vma is only called by mremap, so supposedly that path can
trigger. It looks like I was wrong about vma_merge not being able to succeed
in copy_vma, and if it does succeed I still think it's a problem, as we have
no ordering guarantee.

The only other place that depends on the anon_vma_chain order is fork,
and there no vma_merge can happen, so that is safe.

> If remove_migrate_pte finds src first, it will remove the pte and the
> correct version will get copied. If move_ptes runs between when
> remove_migrate_ptes moves from src to dst, then the PTE at dst will
> still be correct.

The problem is rmap_walk will search dst before src. So it will do
nothing on dst. Then mremap moves the pte from src to dst. When rmap
walk then checks "src" it finds nothing again.
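
A matching standalone simulation (again plain userspace C with invented
names, not kernel code) of that bad interleaving, with the chain
effectively ordered dst-before-src:

#include <stdbool.h>
#include <stdio.h>

int main(void)
{
	bool src_has_entry = true;	/* the migration entry sits at the old address */
	bool dst_has_entry = false;
	bool walk_saw_entry = false;

	/* the rmap walk visits dst first (bad ordering) and finds it empty */
	walk_saw_entry |= dst_has_entry;

	/* mremap's move_ptes() runs now, under its own PT locks */
	dst_has_entry = src_has_entry;
	src_has_entry = false;

	/* the walk then visits src and finds it already empty: entry missed */
	walk_saw_entry |= src_has_entry;

	printf("migration entry found by the walk: %s\n",
	       walk_saw_entry ? "yes" : "no (missed)");
	return 0;
}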

> > pte is not there, and it happily skip it and release the PTL.
> > Then just before it look into src, move_ptes() comes in, takes the locks
> > and moves the pte from src to dst. And then when rmap_walk() look
> > into src,  it will find an empty pte again. The pte is still there,
> > but rmap_walk() missed it !
> > 
> 
> I believe the ordering is correct though and protects us in this case.

Normally it is, the only problem is vma_merge succeeding I think.

> > IMO, this can really happen in case of vma_merge() succeeding.
> > Imagine that src vma is lately faulted and in anon_vma_prepare()
> > it got a same anon_vma with an existing vma ( named evil_vma )through
> > find_mergeable_anon_vma().  This can potentially make the vma_merge() in
> > copy_vma() return with evil_vma on some new relocation request. But src_vma
> > is really linked _after_  evil_vma/new_vma/dst_vma.
> > In this way, the ordering protocol  of anon_vma chain is broken.
> > This should be a rare case because I think in most cases
> > if two VMAs can reusable_anon_vma() they were already merged.
> > 
> > How do you think  ?
> > 

I tried to understand the above scenario yesterday, but with 12 hours
of travel on me I just couldn't.

Yesterday, however, I thought of another, simpler case:

part of a vma is moved with mremap elsewhere. Then it is moved back to
its original place. So then vma_merge will succeed, and the "src" of
mremap is now queued last in anon_vma_chain, wrong ordering.
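
As a concrete illustration of that scenario, a minimal userspace sketch
(not from this thread, written only for illustration) that moves one page
of a mapping away and then back to its original place, which is the case
where vma_merge() can succeed inside copy_vma() during the second move:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t page = sysconf(_SC_PAGESIZE);

	/* three-page anonymous mapping, faulted in so it has an anon_vma */
	char *base = mmap(NULL, 3 * page, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (base == MAP_FAILED)
		return 1;
	memset(base, 0xaa, 3 * page);

	/* move the middle page away: copy_vma() creates a new vma, linked last */
	char *away = mremap(base + page, page, page, MREMAP_MAYMOVE);
	if (away == MAP_FAILED)
		return 1;

	/* move it back to its original place: vma_merge() can now succeed, and
	 * the "src" of this second move ends up last in the anon_vma chain */
	char *back = mremap(away, page, page,
			    MREMAP_MAYMOVE | MREMAP_FIXED, base + page);
	if (back == MAP_FAILED)
		return 1;

	printf("moved back to %p (original %p)\n",
	       (void *)back, (void *)(base + page));
	return 0;
}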

Today I read an email from Nai who apparently showed the same scenario
I was thinking of, without evil vmas or anything.

I have a hard time imagining a vma_merge succeeding on a vma that
isn't going back to its original place. The vm_pgoff + vma->anon_vma
checks should keep some linearity, so going back to the original place
sounds like the only way vma_merge can succeed in copy_vma. But it can
still happen in that case, I think (so I'm not sure how the above scenario
with an evil_vma could ever happen if it has a different anon_vma and
it's not part of a vma that is going back to its original place, as in
the second scenario Nai also posted about).

That Nai and I came up with the same scenario hypothesis independently
(Nai's second hypothesis, not the first one quoted above), plus copy_vma
doing vma_merge and being called only by mremap, makes it sound like it
can really happen.

> Despite the comments in anon_vma_compatible(), I would expect that VMAs
> that can share an anon_vma from find_mergeable_anon_vma() will also get
> merged. When the new VMA is created, it will be linked in the usual
> manner and the oldest->newest ordering is what is required. That's not
> that important though.
> 
> What is important is if mremap is moving src to a dst that is adjacent
> to another anon_vma. If src has never been faulted, it's not an issue
> because there are also no migration PTEs. If src has been faulted, then
> is_mergeable_anon_vma() should fail as anon_vma1 != anon_vma2 and they
> are not compatible. The ordering is preserved and we are still ok.

I was thinking along these lines; the only pitfall should be when
something is moved and put back into its original place. When it is
moved, a new vma is created and queued last. When it's put back to its
original location, vma_merge will succeed, and "src" is now the
previous "dst" so queued last and that breaks.

> All that said, while I don't think there is a problem, I can't convince
> myself 100% of it. Andrea, can you spot a flaw?

I think Nai is correct, though only with the second hypothesis.

We have two options:

1) we remove the vma_merge call from copy_vma and we do the vma_merge
manually after mremap succeeds (so then we're as safe as fork is and we
rely on the ordering). No locks, but we'll just do one more allocation
for one additional temporary vma that will be removed after mremap
has completed.

2) Hugh's original fix.

The first option is probably faster and preferable; the vma_merge there
should only trigger when putting things back to their origin, I suspect, and
never with random mremaps, though I'm not sure how common it is to put things
back to their origin. If we're in a hurry we can merge Hugh's patch and
optimize it later. We can still retain the migrate fix if we intend to
go with option 1 later. I didn't much like migrate doing
speculative access on ptes that it can't afford to miss or it'll crash anyway.

That said, the fix merged upstream is 99% certain to fix things in
practice already, so I doubt we're in a hurry. And if things go wrong
these issues don't go unnoticed, and they shouldn't corrupt memory even
if they trigger. I'm 100% certain it can't do damage (other than a BUG_ON)
for split_huge_page, as I count the pmds encountered in the rmap_walk
when I set the splitting bit, and I compare that count with
page_mapcount and BUG_ON if they don't match, and later I repeat the
same comparison in the second rmap_walk that establishes the pte and
downgrades the hugepmd to pmd, and BUG_ON again if they don't match
the previous rmap_walk count. It may be possible to trigger the
BUG_ON with some malicious activity, but it won't be too easy either
because it's not an instant thing: a race still has to trigger, and
it's hard to reproduce.

The anon_vma lock is quite a wide lock, as it's shared by all the parents'
anon_vma_chains too; a slab allocation from the local cpu may actually be
faster in some conditions (even when the slab allocation is
superfluous). But then I'm not sure. So I'm not against applying Hugh's
fix even for the long run. I wouldn't git revert the migration change,
but then if we go with Hugh's fix it would probably be safe.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-21  9:07                       ` Nai Xia
@ 2011-10-21 21:36                         ` Paweł Sikora
  2011-10-22  6:21                           ` Nai Xia
  0 siblings, 1 reply; 72+ messages in thread
From: Paweł Sikora @ 2011-10-21 21:36 UTC (permalink / raw)
  To: Nai Xia
  Cc: Hugh Dickins, arekm, Linus Torvalds, linux-mm, Mel Gorman,
	jpiszcz, linux-kernel, Andrew Morton, Andrea Arcangeli

On Friday 21 of October 2011 11:07:56 Nai Xia wrote:
> On Fri, Oct 21, 2011 at 4:07 PM, Pawel Sikora <pluto@agmk.net> wrote:
> > On Friday 21 of October 2011 14:22:37 Nai Xia wrote:
> >
> >> And as a side note. Since I notice that Pawel's workload may include OOM,
> >
> > my last tests on patched (3.0.4 + migrate.c fix + vserver) kernel produce full cpu load
> > on dual 8-cores opterons like on this htop screenshot -> http://pluto.agmk.net/kernel/screen1.png
> > afaics all userspace applications usualy don't use more than half of physical memory
> > and so called "cache" on htop bar doesn't reach the 100%.
> 
> OK,did you logged any OOM killing if there was some memory usage burst?
> But, well my above OOM reasoning is a direct short cut to imagined
> root cause of "adjacent VMAs which
> should have been merged but in fact not merged" case.
> Maybe there are other cases that can lead to this or maybe it's
> totally another bug....

i don't see any OOM killing with my conservative settings
(vm.overcommit_memory=2, vm.overcommit_ratio=100).

> But still I think if my reasoning is good, similar bad things will
> happen again some time in the future,
> even if it was not your case here...
> 
> >
> > the patched kernel with disabled CONFIG_TRANSPARENT_HUGEPAGE (new thing in 2.6.38)
> > died at night, so now i'm going to disable also CONFIG_COMPACTION/MIGRATION in next
> > steps and stress this machine again...
> 
> OK, it's smart to narrow down the range first....

disabling hugepage/compaction didn't help, but disabling hugepage/compaction/migration keeps
the opterons stable for ~9h so far. userspace uses ~40GB (of 64GB) ram, caches reach 100% on the htop bar,
average load ~16. i wonder if it will survive the weekend...

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-21 17:41           ` Andrea Arcangeli
@ 2011-10-21 22:50             ` Andrea Arcangeli
  2011-10-22  5:52               ` Nai Xia
  2011-10-22  5:07             ` kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110 Nai Xia
  1 sibling, 1 reply; 72+ messages in thread
From: Andrea Arcangeli @ 2011-10-21 22:50 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Nai Xia, Hugh Dickins, Pawel Sikora, Andrew Morton, linux-mm,
	jpiszcz, arekm, linux-kernel

On Fri, Oct 21, 2011 at 07:41:20PM +0200, Andrea Arcangeli wrote:
> We have two options:
> 
> 1) we remove the vma_merge call from copy_vma and we do the vma_merge
> manually after mremap succeed (so then we're as safe as fork is and we
> relay on the ordering). No locks but we'll just do 1 more allocation
> for one addition temporary vma that will be removed after mremap
> completed.
> 
> 2) Hugh's original fix.

3) put the dst vma at the tail if vma_merge succeeds and the src vma
and dst vma aren't the same

I tried to implement this, but I'm still wondering about the safety of
it with concurrent processes all calling mremap at the same time on
the same anon_vma's same_anon_vma list; the reasoning for why I think it
may be safe is in the comment. I ran a few mremaps with my benchmark, where
the THP-aware mremap in -mm gets a x10 boost and moves 5G, and it didn't
crash, but that's about it and not conclusive; if you review, please
comment...

I've to pack luggage and prepare to fly to KS tomorrow so I may not be
responsive in the next few days.

===
>From f2898ff06b5a9a14b9d957c7696137f42a2438e9 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli <aarcange@redhat.com>
Date: Sat, 22 Oct 2011 00:11:49 +0200
Subject: [PATCH] mremap: enforce rmap src/dst vma ordering in case of
 vma_merge succeeding in copy_vma

migrate was doing a rmap_walk with speculative lock-less access on
pagetables. That could lead it to not serialize properly against
mremap PT locks. But a second problem remains in the order of vmas in
the same_anon_vma list used by the rmap_walk.

If vma_merge would succeed in copy_vma, the src vma could be placed
after the dst vma in the same_anon_vma list. That could still lead
migrate to miss some pte.

This patch adds an anon_vma_order_tail() function to force the dst vma
to the end of the list before mremap starts, to solve the problem.

If the mremap is very large and there are lots of parents or children
sharing the anon_vma root lock, this should still scale better than
taking the anon_vma root lock around every pte copy for practically
the whole duration of mremap.
---
 include/linux/rmap.h |    1 +
 mm/mmap.c            |    8 ++++++++
 mm/rmap.c            |   43 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 52 insertions(+), 0 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 2148b12..45eb098 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -120,6 +120,7 @@ void anon_vma_init(void);	/* create anon_vma_cachep */
 int  anon_vma_prepare(struct vm_area_struct *);
 void unlink_anon_vmas(struct vm_area_struct *);
 int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
+void anon_vma_order_tail(struct vm_area_struct *);
 int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
 void __anon_vma_link(struct vm_area_struct *);
 
diff --git a/mm/mmap.c b/mm/mmap.c
index a65efd4..a5858dc 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2339,7 +2339,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 		 */
 		if (vma_start >= new_vma->vm_start &&
 		    vma_start < new_vma->vm_end)
+			/*
+			 * No need to call anon_vma_order_tail() in
+			 * this case because the same PT lock will
+			 * serialize the rmap_walk against both src
+			 * and dst vmas.
+			 */
 			*vmap = new_vma;
+		else
+			anon_vma_order_tail(new_vma);
 	} else {
 		new_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
 		if (new_vma) {
diff --git a/mm/rmap.c b/mm/rmap.c
index 8005080..170cece 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -272,6 +272,49 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
 }
 
 /*
+ * Some rmap walk that needs to find all ptes/hugepmds without false
+ * negatives (like migrate and split_huge_page) running concurrent
+ * with operations that copy or move pagetables (like mremap() and
+ * fork()) to be safe depends the anon_vma "same_anon_vma" list to be
+ * in a certain order: the dst_vma must be placed after the src_vma in
+ * the list. This is always guaranteed by fork() but mremap() needs to
+ * call this function to enforce it in case the dst_vma isn't newly
+ * allocated and chained with the anon_vma_clone() function but just
+ * an extension of a pre-existing vma through vma_merge.
+ *
+ * NOTE: the same_anon_vma list can still changed by other processes
+ * while mremap runs because mremap doesn't hold the anon_vma mutex to
+ * prevent modifications to the list while it runs. All we need to
+ * enforce is that the relative order of this process vmas isn't
+ * changing (we don't care about other vmas order). Each vma
+ * corresponds to an anon_vma_chain structure so there's no risk that
+ * other processes calling anon_vma_order_tail() and changing the
+ * same_anon_vma list under mremap() will screw with the relative
+ * order of this process vmas in the list, because we won't alter the
+ * order of any vma that isn't belonging to this process. And there
+ * can't be another anon_vma_order_tail running concurrently with
+ * mremap() coming from this process because we hold the mmap_sem for
+ * the whole mremap(). fork() ordering dependency also shouldn't be
+ * affected because we only care that the parent vmas are placed in
+ * the list before the child vmas and anon_vma_order_tail won't reorder
+ * vmas from either the fork parent or child.
+ */
+void anon_vma_order_tail(struct vm_area_struct *dst)
+{
+	struct anon_vma_chain *pavc;
+	struct anon_vma *root = NULL;
+
+	list_for_each_entry_reverse(pavc, &dst->anon_vma_chain, same_vma) {
+		struct anon_vma *anon_vma = pavc->anon_vma;
+		VM_BUG_ON(pavc->vma != dst);
+		root = lock_anon_vma_root(root, anon_vma);
+		list_del(&pavc->same_anon_vma);
+		list_add_tail(&pavc->same_anon_vma, &anon_vma->head);
+	}
+	unlock_anon_vma_root(root);
+}
+
+/*
  * Attach vma to its own anon_vma, as well as to the anon_vmas that
  * the corresponding VMA in the parent process is attached to.
  * Returns 0 on success, non-zero on failure.

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-21 17:41           ` Andrea Arcangeli
  2011-10-21 22:50             ` Andrea Arcangeli
@ 2011-10-22  5:07             ` Nai Xia
  2011-10-31 16:34               ` Andrea Arcangeli
  1 sibling, 1 reply; 72+ messages in thread
From: Nai Xia @ 2011-10-22  5:07 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mel Gorman, Hugh Dickins, Pawel Sikora, Andrew Morton, linux-mm,
	jpiszcz, arekm, linux-kernel

On Saturday 22 October 2011 01:41:20 Andrea Arcangeli wrote:
> On Fri, Oct 21, 2011 at 05:56:32PM +0200, Mel Gorman wrote:
> > On Thu, Oct 20, 2011 at 05:11:28PM +0800, Nai Xia wrote:
> > > On Mon, Oct 17, 2011 at 7:54 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> > > > On Thu, Oct 13, 2011 at 04:30:09PM -0700, Hugh Dickins wrote:
> > > >> mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
> > > >> and pagetable locks, were good enough before page migration (with its
> > > >> requirement that every migration entry be found) came in; and enough
> > > >> while migration always held mmap_sem.  But not enough nowadays, when
> > > >> there's memory hotremove and compaction: anon_vma lock is also needed,
> > > >> to make sure a migration entry is not dodging around behind our back.
> > > >
> > > > For things like migrate and split_huge_page, the anon_vma layer must
> > > > guarantee the page is reachable by rmap walk at all times regardless
> > > > if it's at the old or new address.
> > > >
> > > > This shall be guaranteed by the copy_vma called by move_vma well
> > > > before move_page_tables/move_ptes can run.
> > > >
> > > > copy_vma obviously takes the anon_vma lock to insert the new "dst" vma
> > > > into the anon_vma chains structures (vma_link does that). That before
> > > > any pte can be moved.
> > > >
> > > > Because we keep two vmas mapped on both src and dst range, with
> > > > different vma->vm_pgoff that is valid for the page (the page doesn't
> > > > change its page->index) the page should always find _all_ its pte at
> > > > any given time.
> > > >
> > > > There may be other variables at play like the order of insertion in
> > > > the anon_vma chain matches our direction of copy and removal of the
> > > > old pte. But I think the double locking of the PT lock should make the
> > > > order in the anon_vma chain absolutely irrelevant (the rmap_walk
> > > > obviously takes the PT lock too), and furthermore likely the
> > > > anon_vma_chain insertion is favorable (the dst vma is inserted last
> > > > and checked last). But it shouldn't matter.
> > > 
> > > I happened to be reading these code last week.
> > > 
> > > And I do think this order matters, the reason is just quite similar why we
> > > need i_mmap_lock in move_ptes():
> > > If rmap_walk goes dst--->src, then when it first look into dst, ok, the
> > 
> > You might be right in that the ordering matters. We do link new VMAs at
> 
> Yes I also think ordering matters as I mentioned in the previous email
> that Nai answered to.
> 
> > the end of the list in anon_vma_chain_list so remove_migrate_ptes should
> > be walking from src->dst.
> 
> Correct. Like I mentioned in that previous email that Nai answered,
> that wouldn't be ok only if vma_merge succeeds and I didn't change my mind
> about that...
> 
> copy_vma is only called by mremap so supposedly that path can
> trigger. Looks like I was wrong about vma_merge being able to succeed
> in copy_vma, and if it does I still think it's a problem as we have no
> ordering guarantee.
> 
> The only other place that depends on the anon_vma_chain order is fork,
> and there, no vma_merge can happen, so that is safe.
> 
> > If remove_migrate_pte finds src first, it will remove the pte and the
> > correct version will get copied. If move_ptes runs between when
> > remove_migrate_ptes moves from src to dst, then the PTE at dst will
> > still be correct.
> 
> The problem is rmap_walk will search dst before src. So it will do
> nothing on dst. Then mremap moves the pte from src to dst. When rmap
> walk then checks "src" it finds nothing again.
> 
> > > pte is not there, and it happily skip it and release the PTL.
> > > Then just before it look into src, move_ptes() comes in, takes the locks
> > > and moves the pte from src to dst. And then when rmap_walk() look
> > > into src,  it will find an empty pte again. The pte is still there,
> > > but rmap_walk() missed it !
> > > 
> > 
> > I believe the ordering is correct though and protects us in this case.
> 
> Normally it is, the only problem is vma_merge succeeding I think.
> 
> > > IMO, this can really happen in case of vma_merge() succeeding.
> > > Imagine that src vma is lately faulted and in anon_vma_prepare()
> > > it got a same anon_vma with an existing vma ( named evil_vma )through
> > > find_mergeable_anon_vma().  This can potentially make the vma_merge() in
> > > copy_vma() return with evil_vma on some new relocation request. But src_vma
> > > is really linked _after_  evil_vma/new_vma/dst_vma.
> > > In this way, the ordering protocol  of anon_vma chain is broken.
> > > This should be a rare case because I think in most cases
> > > if two VMAs can reusable_anon_vma() they were already merged.
> > > 
> > > How do you think  ?
> > > 
> 
> I tried to understand the above scenario yesterday but with 12 hour
> of travel on me I just couldn't.

Oh, yes, the first hypothesis was actually a vague feeling that things
might go wrong in that direction. The details in it were somewhat
misleading. But following that direction, I found the second, clear
hypothesis that leads to this bug step by step.

> 
> Yesterday however I thought of another simpler case:
> 
> part of a vma is moved with mremap elsewhere. Then it is moved back to
> its original place. So then vma_merge will succeed, and the "src" of
> mremap is now queued last in anon_vma_chain, wrong ordering.

Oh, yes, partial mremapping will do the trick. I was too fixated on finding
a case where two VMAs missed a normal merge chance but would merge later
on. The only thing I can find so far is the ENOMEM case in vma_adjust().

Partial mremapping is a simpler case and definitely more likely to happen.

> 
> Today I read an email from Nai who showed apparently the same scenario
> I was thinking, without evil vmas or stuff.
> 
> I have an hard time to imagine a vma_merge succeeding on a vma that
> isn't going back to its original place. The vm_pgoff + vma->anon_vma
> checks should keep some linarity so going back to the original place
> sounds the only way vma_merge can succeed in copy_vma. But still it
> can happen in that case I think (so not sure how the above scenario
> with an evil_vma could ever happen if it has a different anon_vma and
> it's not a part of a vma that is going back to its original place like
> in the second scenario Nai also posted about).
> 
> That me and Nai had same scenario hypothesis indipendentely (second
> Nai hypoteisis not the first quoted above), plus copy_vma doing
> vma_merge and being only called by mremap, sounds like it can really
> happen.
> 
> > Despite the comments in anon_vma_compatible(), I would expect that VMAs
> > that can share an anon_vma from find_mergeable_anon_vma() will also get
> > merged. When the new VMA is created, it will be linked in the usual
> > manner and the oldest->newest ordering is what is required. That's not
> > that important though.
> > 
> > What is important is if mremap is moving src to a dst that is adjacent
> > to another anon_vma. If src has never been faulted, it's not an issue
> > because there are also no migration PTEs. If src has been faulted, then
> > is_mergeable_anon_vma() should fail as anon_vma1 != anon_vma2 and they
> > are not compatible. The ordering is preserved and we are still ok.
> 
> I was thinking along these lines, the only pitfall should be when
> something is moved and put back into its original place. When it is
> moved, a new vma is created and queued last. When it's put back to its
> original location, vma_merge will succeed, and "src" is now the
> previous "dst" so queued last and that breaks.
> 
> > All that said, while I don't think there is a problem, I can't convince
> > myself 100% of it. Andrea, can you spot a flaw?
> 
> I think Nai's correct, only second hypothesis though.
> 
> We have two options:
> 
> 1) we remove the vma_merge call from copy_vma and we do the vma_merge
> manually after mremap succeed (so then we're as safe as fork is and we
> relay on the ordering). No locks but we'll just do 1 more allocation
> for one addition temporary vma that will be removed after mremap
> completed.
> 
> 2) Hugh's original fix.
> 
> First option probably is faster and prefereable, the vma_merge there
> should only trigger when putting things back to origin I suspect, and
> never with random mremaps, not sure how common it is to put things
> back to origin. If we're in a hurry we can merge Hugh's patch and
> optimize it later. We can still retain the migrate fix if we intend to
> take way number 1 later. I didn't like too much migrate doing
> speculative access on ptes that it can't miss or it'll crash anyway.

Me too; I think it's error-prone, or at least we must be very careful
that it's not doing something evil. If the speculative access does not save
that much time, we need not bother wasting our mind power
over it.

> 
> Said that the fix merged upstream is 99% certain to fix things in
> practice already so I doubt we're in hurry. And if things go wrong
> these issues don't go unnoticed and they shouldn't corrupt memory even
> if they trigger. 100% certain it can't do damage (other than a BUG_ON)
> for split_huge_page as I count the pmds encountered in the rmap_walk
> when I set the splitting bit, and I compare that count with
> page_mapcount and BUG_ON if they don't match, and later I repeat the
> same comparsion in the second rmap_walk that establishes the pte and
> downgrades the hugepmd to pmd, and BUG_ON again if they don't match
> with the previous rmap_walk count. It may be possible to trigger the
> BUG_ON with some malicious activity but it won't be too easy either
> because it's not an instant thing, still a race had to trigger and
> it's hard to reproduce.
> 
> The anon_vma lock is quite a wide lock as it's shared by all parents
> anon_vma_chains too, slab allocation from local cpu may actually be
> faster in some condition (even when the slab allocation is
> superflous). But then I'm not sure. So I'm not against applying Hugh's
> fix even for the long run. I wouldn't git revert the migration change,
> but then if we go with Hugh's fix probably it'd be safe.
> 

Yeah, the anon_vma root lock is a big lock. And JFYI, I am actually doing
some very nasty hacking on anon_vma, and one of the side effects is
breaking the root lock into pieces. But this area is pretty
convoluted, with many race conditions. I hope some day I will finally make
my patch work and have your precious review of it. :-)


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-21 22:50             ` Andrea Arcangeli
@ 2011-10-22  5:52               ` Nai Xia
  2011-10-31 17:14                 ` Andrea Arcangeli
  0 siblings, 1 reply; 72+ messages in thread
From: Nai Xia @ 2011-10-22  5:52 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mel Gorman, Hugh Dickins, Pawel Sikora, Andrew Morton, linux-mm,
	jpiszcz, arekm, linux-kernel

On Saturday 22 October 2011 06:50:08 Andrea Arcangeli wrote:
> On Fri, Oct 21, 2011 at 07:41:20PM +0200, Andrea Arcangeli wrote:
> > We have two options:
> > 
> > 1) we remove the vma_merge call from copy_vma and we do the vma_merge
> > manually after mremap succeed (so then we're as safe as fork is and we
> > relay on the ordering). No locks but we'll just do 1 more allocation
> > for one addition temporary vma that will be removed after mremap
> > completed.
> > 
> > 2) Hugh's original fix.
> 
> 3) put the src vma at the tail if vma_merge succeeds and the src vma
> and dst vma aren't the same
> 
> I tried to implement this but I'm still wondering about the safety of
> this with concurrent processes all calling mremap at the same time on
> the same anon_vma same_anon_vma list, the reasoning I think it may be
> safe is in the comment. I run a few mremap with my benchmark where the
> THP aware mremap in -mm gets a x10 boost and moves 5G and it didn't

BTW, I am curious which benchmark you ran, and whether the "x10 boost"
means compared to Hugh's anon_vma locking fix?

> crash but that's about it and not conclusive, if you review please
> comment...

My comment is at the bottom of this post.

> 
> I've to pack luggage and prepare to fly to KS tomorrow so I may not be
> responsive in the next few days.
> 
> ===
> From f2898ff06b5a9a14b9d957c7696137f42a2438e9 Mon Sep 17 00:00:00 2001
> From: Andrea Arcangeli <aarcange@redhat.com>
> Date: Sat, 22 Oct 2011 00:11:49 +0200
> Subject: [PATCH] mremap: enforce rmap src/dst vma ordering in case of
>  vma_merge succeeding in copy_vma
> 
> migrate was doing a rmap_walk with speculative lock-less access on
> pagetables. That could lead it to not serialize properly against
> mremap PT locks. But a second problem remains in the order of vmas in
> the same_anon_vma list used by the rmap_walk.
> 
> If vma_merge would succeed in copy_vma, the src vma could be placed
> after the dst vma in the same_anon_vma list. That could still lead
> migrate to miss some pte.
> 
> This patch adds a anon_vma_order_tail() function to force the dst vma
> at the end of the list before mremap starts to solve the problem.
> 
> If the mremap is very large and there are a lots of parents or childs
> sharing the anon_vma root lock, this should still scale better than
> taking the anon_vma root lock around every pte copy practically for
> the whole duration of mremap.
> ---
>  include/linux/rmap.h |    1 +
>  mm/mmap.c            |    8 ++++++++
>  mm/rmap.c            |   43 +++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 52 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 2148b12..45eb098 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -120,6 +120,7 @@ void anon_vma_init(void);	/* create anon_vma_cachep */
>  int  anon_vma_prepare(struct vm_area_struct *);
>  void unlink_anon_vmas(struct vm_area_struct *);
>  int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
> +void anon_vma_order_tail(struct vm_area_struct *);
>  int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
>  void __anon_vma_link(struct vm_area_struct *);
>  
> diff --git a/mm/mmap.c b/mm/mmap.c
> index a65efd4..a5858dc 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2339,7 +2339,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
>  		 */
>  		if (vma_start >= new_vma->vm_start &&
>  		    vma_start < new_vma->vm_end)
> +			/*
> +			 * No need to call anon_vma_order_tail() in
> +			 * this case because the same PT lock will
> +			 * serialize the rmap_walk against both src
> +			 * and dst vmas.
> +			 */
>  			*vmap = new_vma;
> +		else
> +			anon_vma_order_tail(new_vma);
>  	} else {
>  		new_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
>  		if (new_vma) {
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 8005080..170cece 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -272,6 +272,49 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
>  }
>  
>  /*
> + * Some rmap walk that needs to find all ptes/hugepmds without false
> + * negatives (like migrate and split_huge_page) running concurrent
> + * with operations that copy or move pagetables (like mremap() and
> + * fork()) to be safe depends the anon_vma "same_anon_vma" list to be
> + * in a certain order: the dst_vma must be placed after the src_vma in
> + * the list. This is always guaranteed by fork() but mremap() needs to
> + * call this function to enforce it in case the dst_vma isn't newly
> + * allocated and chained with the anon_vma_clone() function but just
> + * an extension of a pre-existing vma through vma_merge.
> + *
> + * NOTE: the same_anon_vma list can still changed by other processes
> + * while mremap runs because mremap doesn't hold the anon_vma mutex to
> + * prevent modifications to the list while it runs. All we need to
> + * enforce is that the relative order of this process vmas isn't
> + * changing (we don't care about other vmas order). Each vma
> + * corresponds to an anon_vma_chain structure so there's no risk that
> + * other processes calling anon_vma_order_tail() and changing the
> + * same_anon_vma list under mremap() will screw with the relative
> + * order of this process vmas in the list, because we won't alter the
> + * order of any vma that isn't belonging to this process. And there
> + * can't be another anon_vma_order_tail running concurrently with
> + * mremap() coming from this process because we hold the mmap_sem for
> + * the whole mremap(). fork() ordering dependency also shouldn't be
> + * affected because we only care that the parent vmas are placed in
> + * the list before the child vmas and anon_vma_order_tail won't reorder
> + * vmas from either the fork parent or child.
> + */
> +void anon_vma_order_tail(struct vm_area_struct *dst)
> +{
> +	struct anon_vma_chain *pavc;
> +	struct anon_vma *root = NULL;
> +
> +	list_for_each_entry_reverse(pavc, &dst->anon_vma_chain, same_vma) {
> +		struct anon_vma *anon_vma = pavc->anon_vma;
> +		VM_BUG_ON(pavc->vma != dst);
> +		root = lock_anon_vma_root(root, anon_vma);
> +		list_del(&pavc->same_anon_vma);
> +		list_add_tail(&pavc->same_anon_vma, &anon_vma->head);
> +	}
> +	unlock_anon_vma_root(root);
> +}

This patch, together with the reasoning, looks good to me.
But I wonder whether this patch makes the anon_vma chain ordering game more
complex and harder to play in the future.
However, if it does bring a real performance benefit, I vote for this patch
because it balances all three requirements here: bug free, good performance, and
no two VMAs left unmerged for no good reason.

Our situation again gives me the strong feeling that we are really
in bad need of a computer-aided way to traverse all the possible state space.
There are some guys around me who do automatic software testing research,
but I am afraid our problem is too much "real world" for them... sigh...




^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-21 21:36                         ` Paweł Sikora
@ 2011-10-22  6:21                           ` Nai Xia
  2011-10-22 16:42                             ` Paweł Sikora
  0 siblings, 1 reply; 72+ messages in thread
From: Nai Xia @ 2011-10-22  6:21 UTC (permalink / raw)
  To: Paweł Sikora
  Cc: Hugh Dickins, arekm, Linus Torvalds, linux-mm, Mel Gorman,
	jpiszcz, linux-kernel, Andrew Morton, Andrea Arcangeli

On Saturday 22 October 2011 05:36:46 Paweł Sikora wrote:
> On Friday 21 of October 2011 11:07:56 Nai Xia wrote:
> > On Fri, Oct 21, 2011 at 4:07 PM, Pawel Sikora <pluto@agmk.net> wrote:
> > > On Friday 21 of October 2011 14:22:37 Nai Xia wrote:
> > >
> > >> And as a side note. Since I notice that Pawel's workload may include OOM,
> > >
> > > my last tests on patched (3.0.4 + migrate.c fix + vserver) kernel produce full cpu load
> > > on dual 8-cores opterons like on this htop screenshot -> http://pluto.agmk.net/kernel/screen1.png
> > > afaics all userspace applications usualy don't use more than half of physical memory
> > > and so called "cache" on htop bar doesn't reach the 100%.
> > 
> > OK,did you logged any OOM killing if there was some memory usage burst?
> > But, well my above OOM reasoning is a direct short cut to imagined
> > root cause of "adjacent VMAs which
> > should have been merged but in fact not merged" case.
> > Maybe there are other cases that can lead to this or maybe it's
> > totally another bug....
> 
> i don't see any OOM killing with my conservative settings
> (vm.overcommit_memory=2, vm.overcommit_ratio=100).

OK, that does not matter now. Andrea showed us a simpler way to get to
this bug.

> 
> > But still I think if my reasoning is good, similar bad things will
> > happen again some time in the future,
> > even if it was not your case here...
> > 
> > >
> > > the patched kernel with disabled CONFIG_TRANSPARENT_HUGEPAGE (new thing in 2.6.38)
> > > died at night, so now i'm going to disable also CONFIG_COMPACTION/MIGRATION in next
> > > steps and stress this machine again...
> > 
> > OK, it's smart to narrow down the range first....
> 
> disabling hugepage/compacting didn't help but disabling hugepage/compacting/migration keeps
> opterons stable for ~9h so far. userspace uses ~40GB (from 64) ram, caches reach 100% on htop bar,
> average load ~16. i wonder if it survive weekend...
> 

Maybe you should give Andrea's latest anon_vma_order_tail() patch another shot. :)


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-22  6:21                           ` Nai Xia
@ 2011-10-22 16:42                             ` Paweł Sikora
       [not found]                               ` <CAPQyPG5HJKTo8AEy_khdJeciTgtNQepK6XLcpzvPF8PYS0V-Lw@mail.gmail.com>
  0 siblings, 1 reply; 72+ messages in thread
From: Paweł Sikora @ 2011-10-22 16:42 UTC (permalink / raw)
  To: nai.xia
  Cc: Hugh Dickins, arekm, Linus Torvalds, linux-mm, Mel Gorman,
	jpiszcz, linux-kernel, Andrew Morton, Andrea Arcangeli

On Saturday 22 of October 2011 08:21:23 Nai Xia wrote:
> On Saturday 22 October 2011 05:36:46 Paweł Sikora wrote:
> > On Friday 21 of October 2011 11:07:56 Nai Xia wrote:
> > > On Fri, Oct 21, 2011 at 4:07 PM, Pawel Sikora <pluto@agmk.net> wrote:
> > > > On Friday 21 of October 2011 14:22:37 Nai Xia wrote:
> > > >
> > > >> And as a side note. Since I notice that Pawel's workload may include OOM,
> > > >
> > > > my last tests on patched (3.0.4 + migrate.c fix + vserver) kernel produce full cpu load
> > > > on dual 8-cores opterons like on this htop screenshot -> http://pluto.agmk.net/kernel/screen1.png
> > > > afaics all userspace applications usualy don't use more than half of physical memory
> > > > and so called "cache" on htop bar doesn't reach the 100%.
> > > 
> > > OK,did you logged any OOM killing if there was some memory usage burst?
> > > But, well my above OOM reasoning is a direct short cut to imagined
> > > root cause of "adjacent VMAs which
> > > should have been merged but in fact not merged" case.
> > > Maybe there are other cases that can lead to this or maybe it's
> > > totally another bug....
> > 
> > i don't see any OOM killing with my conservative settings
> > (vm.overcommit_memory=2, vm.overcommit_ratio=100).
> 
> OK, that does not matter now. Andrea showed us a simpler way to goto
> this bug. 
> 
> > 
> > > But still I think if my reasoning is good, similar bad things will
> > > happen again some time in the future,
> > > even if it was not your case here...
> > > 
> > > >
> > > > the patched kernel with disabled CONFIG_TRANSPARENT_HUGEPAGE (new thing in 2.6.38)
> > > > died at night, so now i'm going to disable also CONFIG_COMPACTION/MIGRATION in next
> > > > steps and stress this machine again...
> > > 
> > > OK, it's smart to narrow down the range first....
> > 
> > disabling hugepage/compacting didn't help but disabling hugepage/compacting/migration keeps
> > opterons stable for ~9h so far. userspace uses ~40GB (from 64) ram, caches reach 100% on htop bar,
> > average load ~16. i wonder if it survive weekend...
> > 
> 
> Maybe you should give another shot of Andrea's latest anon_vma_order_tail() patch. :)
> 

all my attempts at disabling thp/compaction/migration failed (machine locked).
now i'm testing 3.0.7 + vserver + Hugh's + Andrea's patches, with a few kernel debug options enabled.

so far it has logged only something unrelated to the memory management subsystem:

[  258.397014] =======================================================
[  258.397209] [ INFO: possible circular locking dependency detected ]
[  258.397311] 3.0.7-vs2.3.1-dirty #1
[  258.397402] -------------------------------------------------------
[  258.397503] slave_odra_g_00/19432 is trying to acquire lock:
[  258.397603]  (&(&sig->cputimer.lock)->rlock){-.....}, at: [<ffffffff8103adfc>] update_curr+0xfc/0x190
[  258.397912] 
[  258.397912] but task is already holding lock:
[  258.398090]  (&rq->lock){-.-.-.}, at: [<ffffffff81041a8e>] scheduler_tick+0x4e/0x280
[  258.398387] 
[  258.398388] which lock already depends on the new lock.
[  258.398389] 
[  258.398652] 
[  258.398653] the existing dependency chain (in reverse order) is:
[  258.398836] 
[  258.398837] -> #2 (&rq->lock){-.-.-.}:
[  258.399178]        [<ffffffff810959ee>] lock_acquire+0x8e/0x120
[  258.399336]        [<ffffffff81466e5c>] _raw_spin_lock+0x2c/0x40
[  258.399495]        [<ffffffff81040bd7>] wake_up_new_task+0x97/0x1c0
[  258.399652]        [<ffffffff81047db6>] do_fork+0x176/0x460
[  258.399807]        [<ffffffff8100999c>] kernel_thread+0x6c/0x70
[  258.399964]        [<ffffffff8144715d>] rest_init+0x21/0xc4
[  258.400120]        [<ffffffff818adbd2>] start_kernel+0x3d6/0x3e1
[  258.400280]        [<ffffffff818ad322>] x86_64_start_reservations+0x132/0x136
[  258.400336]        [<ffffffff818ad416>] x86_64_start_kernel+0xf0/0xf7
[  258.400336] 
[  258.400336] -> #1 (&p->pi_lock){-.-.-.}:
[  258.400336]        [<ffffffff810959ee>] lock_acquire+0x8e/0x120
[  258.400336]        [<ffffffff81466f5c>] _raw_spin_lock_irqsave+0x3c/0x60
[  258.400336]        [<ffffffff8106f328>] thread_group_cputimer+0x38/0x100
[  258.400336]        [<ffffffff8106f41d>] cpu_timer_sample_group+0x2d/0xa0
[  258.400336]        [<ffffffff8107080a>] set_process_cpu_timer+0x3a/0x110
[  258.400336]        [<ffffffff8107091a>] update_rlimit_cpu+0x3a/0x60
[  258.400336]        [<ffffffff81062c0e>] do_prlimit+0x19e/0x240
[  258.400336]        [<ffffffff81063008>] sys_setrlimit+0x48/0x60
[  258.400336]        [<ffffffff8146efbb>] system_call_fastpath+0x16/0x1b
[  258.400336] 
[  258.400336] -> #0 (&(&sig->cputimer.lock)->rlock){-.....}:
[  258.400336]        [<ffffffff810951e7>] __lock_acquire+0x1aa7/0x1cc0
[  258.400336]        [<ffffffff810959ee>] lock_acquire+0x8e/0x120
[  258.400336]        [<ffffffff81466e5c>] _raw_spin_lock+0x2c/0x40
[  258.400336]        [<ffffffff8103adfc>] update_curr+0xfc/0x190
[  258.400336]        [<ffffffff8103b22d>] task_tick_fair+0x2d/0x140
[  258.400336]        [<ffffffff81041b0f>] scheduler_tick+0xcf/0x280
[  258.400336]        [<ffffffff8105a439>] update_process_times+0x69/0x80
[  258.400336]        [<ffffffff8108e0cf>] tick_sched_timer+0x5f/0xc0
[  258.400336]        [<ffffffff81071339>] __run_hrtimer+0x79/0x1f0
[  258.400336]        [<ffffffff81071ce3>] hrtimer_interrupt+0xf3/0x220
[  258.400336]        [<ffffffff8101daa4>] smp_apic_timer_interrupt+0x64/0xa0
[  258.400336]        [<ffffffff8146f9d3>] apic_timer_interrupt+0x13/0x20
[  258.400336]        [<ffffffff8107092d>] update_rlimit_cpu+0x4d/0x60
[  258.400336]        [<ffffffff81062c0e>] do_prlimit+0x19e/0x240
[  258.400336]        [<ffffffff81063008>] sys_setrlimit+0x48/0x60
[  258.400336]        [<ffffffff8146efbb>] system_call_fastpath+0x16/0x1b
[  258.400336] 
[  258.400336] other info that might help us debug this:
[  258.400336] 
[  258.400336] Chain exists of:
[  258.400336]   &(&sig->cputimer.lock)->rlock --> &p->pi_lock --> &rq->lock
[  258.400336] 
[  258.400336]  Possible unsafe locking scenario:
[  258.400336] 
[  258.400336]        CPU0                    CPU1
[  258.400336]        ----                    ----
[  258.400336]   lock(&rq->lock);
[  258.400336]                                lock(&p->pi_lock);
[  258.400336]                                lock(&rq->lock);
[  258.400336]   lock(&(&sig->cputimer.lock)->rlock);
[  258.400336] 
[  258.400336]  *** DEADLOCK ***
[  258.400336] 
[  258.400336] 2 locks held by slave_odra_g_00/19432:
[  258.400336]  #0:  (tasklist_lock){.+.+..}, at: [<ffffffff81062acd>] do_prlimit+0x5d/0x240
[  258.400336]  #1:  (&rq->lock){-.-.-.}, at: [<ffffffff81041a8e>] scheduler_tick+0x4e/0x280
[  258.400336] 
[  258.400336] stack backtrace:
[  258.400336] Pid: 19432, comm: slave_odra_g_00 Not tainted 3.0.7-vs2.3.1-dirty #1
[  258.400336] Call Trace:
[  258.400336]  <IRQ>  [<ffffffff8145e204>] print_circular_bug+0x23d/0x24e
[  258.400336]  [<ffffffff810951e7>] __lock_acquire+0x1aa7/0x1cc0
[  258.400336]  [<ffffffff8109264d>] ? mark_lock+0x2dd/0x330
[  258.400336]  [<ffffffff81093bfd>] ? __lock_acquire+0x4bd/0x1cc0
[  258.400336]  [<ffffffff8103adfc>] ? update_curr+0xfc/0x190
[  258.400336]  [<ffffffff810959ee>] lock_acquire+0x8e/0x120
[  258.400336]  [<ffffffff8103adfc>] ? update_curr+0xfc/0x190
[  258.400336]  [<ffffffff81466e5c>] _raw_spin_lock+0x2c/0x40
[  258.400336]  [<ffffffff8103adfc>] ? update_curr+0xfc/0x190
[  258.400336]  [<ffffffff8103adfc>] update_curr+0xfc/0x190
[  258.400336]  [<ffffffff8103b22d>] task_tick_fair+0x2d/0x140
[  258.400336]  [<ffffffff81041b0f>] scheduler_tick+0xcf/0x280
[  258.400336]  [<ffffffff8105a439>] update_process_times+0x69/0x80
[  258.400336]  [<ffffffff8108e0cf>] tick_sched_timer+0x5f/0xc0
[  258.400336]  [<ffffffff81071339>] __run_hrtimer+0x79/0x1f0
[  258.400336]  [<ffffffff8108e070>] ? tick_nohz_handler+0x100/0x100
[  258.400336]  [<ffffffff81071ce3>] hrtimer_interrupt+0xf3/0x220
[  258.400336]  [<ffffffff8101daa4>] smp_apic_timer_interrupt+0x64/0xa0
[  258.400336]  [<ffffffff8146f9d3>] apic_timer_interrupt+0x13/0x20
[  258.400336]  <EOI>  [<ffffffff814674e0>] ? _raw_spin_unlock_irq+0x30/0x40
[  258.400336]  [<ffffffff8107092d>] update_rlimit_cpu+0x4d/0x60
[  258.400336]  [<ffffffff81062c0e>] do_prlimit+0x19e/0x240
[  258.400336]  [<ffffffff81063008>] sys_setrlimit+0x48/0x60
[  258.400336]  [<ffffffff8146efbb>] system_call_fastpath+0x16/0x1b

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
       [not found]                               ` <CAPQyPG5HJKTo8AEy_khdJeciTgtNQepK6XLcpzvPF8PYS0V-Lw@mail.gmail.com>
@ 2011-10-25  7:33                                 ` Pawel Sikora
  0 siblings, 0 replies; 72+ messages in thread
From: Pawel Sikora @ 2011-10-25  7:33 UTC (permalink / raw)
  To: Nai Xia; +Cc: linux-kernel, akpm, aarcange, mgorman, hughd, torvalds

On Tuesday 25 of October 2011 12:21:30 Nai Xia wrote:
> 2011/10/23 Paweł Sikora <pluto@agmk.net>:
> > On Saturday 22 of October 2011 08:21:23 Nai Xia wrote:
> >> On Saturday 22 October 2011 05:36:46 Paweł Sikora wrote:
> >> > On Friday 21 of October 2011 11:07:56 Nai Xia wrote:
> >> > > On Fri, Oct 21, 2011 at 4:07 PM, Pawel Sikora <pluto@agmk.net> wrote:
> >> > > > On Friday 21 of October 2011 14:22:37 Nai Xia wrote:
> >> > > >
> >> > > >> And as a side note. Since I notice that Pawel's workload may include OOM,
> >> > > >
> >> > > > my last tests on patched (3.0.4 + migrate.c fix + vserver) kernel produce full cpu load
> >> > > > on dual 8-cores opterons like on this htop screenshot -> http://pluto.agmk.net/kernel/screen1.png
> >> > > > afaics all userspace applications usualy don't use more than half of physical memory
> >> > > > and so called "cache" on htop bar doesn't reach the 100%.
> >> > >
> >> > > OK,did you logged any OOM killing if there was some memory usage burst?
> >> > > But, well my above OOM reasoning is a direct short cut to imagined
> >> > > root cause of "adjacent VMAs which
> >> > > should have been merged but in fact not merged" case.
> >> > > Maybe there are other cases that can lead to this or maybe it's
> >> > > totally another bug....
> >> >
> >> > i don't see any OOM killing with my conservative settings
> >> > (vm.overcommit_memory=2, vm.overcommit_ratio=100).
> >>
> >> OK, that does not matter now. Andrea showed us a simpler way to goto
> >> this bug.
> >>
> >> >
> >> > > But still I think if my reasoning is good, similar bad things will
> >> > > happen again some time in the future,
> >> > > even if it was not your case here...
> >> > >
> >> > > >
> >> > > > the patched kernel with disabled CONFIG_TRANSPARENT_HUGEPAGE (new thing in 2.6.38)
> >> > > > died at night, so now i'm going to disable also CONFIG_COMPACTION/MIGRATION in next
> >> > > > steps and stress this machine again...
> >> > >
> >> > > OK, it's smart to narrow down the range first....
> >> >
> >> > disabling hugepage/compacting didn't help but disabling hugepage/compacting/migration keeps
> >> > opterons stable for ~9h so far. userspace uses ~40GB (from 64) ram, caches reach 100% on htop bar,
> >> > average load ~16. i wonder if it survive weekend...
> >> >
> >>
> >> Maybe you should give another shot of Andrea's latest anon_vma_order_tail() patch. :)
> >>
> >
> > all my attempts to disabling thp/compaction/migration failed (machine locked).
> > now, i'm testing 3.0.7+vserver+Hugh's+Andrea's patches+enabled few kernel debug options.
> 
> Have you got the result of this patch combination by now?

yes, this combination has been running *stable* for ~2 days so far (under heavy stressing).

moreover, i've isolated/reported the faulty code in the vserver patch that causes cryptic
deadlocks on 2.6.38+ kernels: http://list.linux-vserver.org/archive?msp:5420:mdaibmimlbgoligkjdma

> I have no clues about the locking below, indeed, it seems like another bug......

this might be fixed in 3.0.8 (https://lkml.org/lkml/2011/10/23/26), i'll test it soon...

> >
> > so far it has logged only something unrelated to memory managment subsystem:
> >
> > [  258.397014] =======================================================
> > [  258.397209] [ INFO: possible circular locking dependency detected ]
> > [  258.397311] 3.0.7-vs2.3.1-dirty #1
> > [  258.397402] -------------------------------------------------------
> > [  258.397503] slave_odra_g_00/19432 is trying to acquire lock:
> > [  258.397603]  (&(&sig->cputimer.lock)->rlock){-.....}, at: [<ffffffff8103adfc>] update_curr+0xfc/0x190

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-22  5:07             ` kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110 Nai Xia
@ 2011-10-31 16:34               ` Andrea Arcangeli
  0 siblings, 0 replies; 72+ messages in thread
From: Andrea Arcangeli @ 2011-10-31 16:34 UTC (permalink / raw)
  To: Nai Xia
  Cc: Mel Gorman, Hugh Dickins, Pawel Sikora, Andrew Morton, linux-mm,
	jpiszcz, arekm, linux-kernel

Hi Nai,

On Sat, Oct 22, 2011 at 01:07:11PM +0800, Nai Xia wrote:
> Yeah, anon_vma root lock is a big lock. And JFYI, actually I am doing 
> some very nasty hacking on anon_vma and one of the side effects is 
> breaking the root lock into pieces. But this area is pretty 
> convolved by many racing conditions. I hope some day I will finally make
> my patch work and have your precious review of it. :-)

:) It's not going to be trivial: initially it was not a shared lock,
but it wasn't safe that way (especially with migrate requiring a
reliable rmap_walk), and using a shared lock across all
same_anon_vma/same_vma lists was the only way to be safe and solve the
races.
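
For reference, the shared root lock being discussed is the one taken by
lock_anon_vma_root()/unlock_anon_vma_root() in mm/rmap.c; a simplified
3.0-era sketch (not necessarily the exact code) looks like this:

/*
 * Simplified sketch: every anon_vma in a fork tree points at the same
 * root, and the root's mutex is the single lock that serializes all
 * same_anon_vma / same_vma list manipulation against rmap walks.
 */
static inline struct anon_vma *lock_anon_vma_root(struct anon_vma *root,
						  struct anon_vma *anon_vma)
{
	struct anon_vma *new_root = anon_vma->root;

	if (new_root != root) {
		if (root)
			mutex_unlock(&root->mutex);
		root = new_root;
		mutex_lock(&root->mutex);
	}
	return root;
}

static inline void unlock_anon_vma_root(struct anon_vma *root)
{
	if (root)
		mutex_unlock(&root->mutex);
}

That single root mutex is also why batching the list moves (as the
anon_vma_order_tail() patch does) is attractive compared to holding the
lock across the whole pte copy.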

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-22  5:52               ` Nai Xia
@ 2011-10-31 17:14                 ` Andrea Arcangeli
  2011-10-31 17:27                   ` [PATCH] mremap: enforce rmap src/dst vma ordering in case of vma_merge succeeding in copy_vma Andrea Arcangeli
  0 siblings, 1 reply; 72+ messages in thread
From: Andrea Arcangeli @ 2011-10-31 17:14 UTC (permalink / raw)
  To: Nai Xia
  Cc: Mel Gorman, Hugh Dickins, Pawel Sikora, Andrew Morton, linux-mm,
	jpiszcz, arekm, linux-kernel

On Sat, Oct 22, 2011 at 01:52:22PM +0800, Nai Xia wrote:
> BTW, I am curious about what benchmark did you run and " x10 boost"
> meaning compared to Hugh's anon_vma_locking fix?

I was referring to the mremap optimizations I pushed in -mm.

> This patch and together with the reasoning looks good to me. 
> But I wondering this patch can make the anon_vma chain ordering game more 
> complex and harder to play in the future.

Well, we don't know yet what the future will bring... at least this adds
some documentation of the fact that the order matters for
fork/mremap/migrate/split_huge_page. As far as I can tell, those are the
four pieces of the VM where the rmap_walk order matters. And
split_huge_page and migrate are the only two where, if the rmap_walk
fails, we can't safely continue and have to BUG_ON.
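
(For context, the BUG_ON referred to here is the migration-entry check in
include/linux/swapops.h; roughly, in a 3.0-era tree -- a simplified sketch,
not necessarily the exact code:)

/*
 * A migration entry may only be dereferenced while the page being
 * migrated is still locked by the migration path.  A pte that the
 * rmap_walk missed survives past unlock_page(), so the next fault on
 * it ends up here and trips the BUG_ON.
 */
static inline struct page *migration_entry_to_page(swp_entry_t entry)
{
	struct page *p = pfn_to_page(swp_offset(entry));

	/*
	 * Any use of migration entries may only occur while the
	 * corresponding page is locked
	 */
	BUG_ON(!PageLocked(p));
	return p;
}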

> However, if it does bring much perfomance benefit, I vote for this patch 
> because it balances all three requirements here: bug free, performance &
> no two VMAs stay not merged for no good reason.

I suppose it should bring an SMP performance benefit, as the critical
section is reduced, but we'll have to do some more list_del/list_add_tail
calls than if we took the global lock...

> Our situation again makes me have the strong feeling that we are really
> in bad need of a computer aided way to travel all possible state space.
> There are some guys around me who do automatic software testing research.
> But I am afraid our problem is too much "real world" for them... sigh...  

Also the code changes too fast for that...

I'll send the patch again with signoff.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH] mremap: enforce rmap src/dst vma ordering in case of vma_merge succeeding in copy_vma
  2011-10-31 17:14                 ` Andrea Arcangeli
@ 2011-10-31 17:27                   ` Andrea Arcangeli
  2011-11-01 12:07                     ` Mel Gorman
                                       ` (2 more replies)
  0 siblings, 3 replies; 72+ messages in thread
From: Andrea Arcangeli @ 2011-10-31 17:27 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Nai Xia, Hugh Dickins, Pawel Sikora, Andrew Morton, linux-mm,
	jpiszcz, arekm, linux-kernel

migrate was doing a rmap_walk with speculative lock-less access on
pagetables. That could lead it to not serialize properly against
mremap PT locks. But a second problem remains in the order of vmas in
the same_anon_vma list used by the rmap_walk.

If vma_merge would succeed in copy_vma, the src vma could be placed
after the dst vma in the same_anon_vma list. That could still lead
migrate to miss some pte.

This patch adds an anon_vma_order_tail() function to force the dst vma
to the end of the list before mremap starts, to solve the problem.

If the mremap is very large and there are a lot of parents or children
sharing the anon_vma root lock, this should still scale better than
taking the anon_vma root lock around every pte copy practically for
the whole duration of mremap.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/rmap.h |    1 +
 mm/mmap.c            |    8 ++++++++
 mm/rmap.c            |   44 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 53 insertions(+), 0 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 2148b12..45eb098 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -120,6 +120,7 @@ void anon_vma_init(void);	/* create anon_vma_cachep */
 int  anon_vma_prepare(struct vm_area_struct *);
 void unlink_anon_vmas(struct vm_area_struct *);
 int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
+void anon_vma_order_tail(struct vm_area_struct *);
 int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
 void __anon_vma_link(struct vm_area_struct *);
 
diff --git a/mm/mmap.c b/mm/mmap.c
index a65efd4..a5858dc 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2339,7 +2339,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 		 */
 		if (vma_start >= new_vma->vm_start &&
 		    vma_start < new_vma->vm_end)
+			/*
+			 * No need to call anon_vma_order_tail() in
+			 * this case because the same PT lock will
+			 * serialize the rmap_walk against both src
+			 * and dst vmas.
+			 */
 			*vmap = new_vma;
+		else
+			anon_vma_order_tail(new_vma);
 	} else {
 		new_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
 		if (new_vma) {
diff --git a/mm/rmap.c b/mm/rmap.c
index 8005080..6dbc165 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -272,6 +272,50 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
 }
 
 /*
+ * Some rmap walk that needs to find all ptes/hugepmds without false
+ * negatives (like migrate and split_huge_page) running concurrent
+ * with operations that copy or move pagetables (like mremap() and
+ * fork()) to be safe depends the anon_vma "same_anon_vma" list to be
+ * in a certain order: the dst_vma must be placed after the src_vma in
+ * the list. This is always guaranteed by fork() but mremap() needs to
+ * call this function to enforce it in case the dst_vma isn't newly
+ * allocated and chained with the anon_vma_clone() function but just
+ * an extension of a pre-existing vma through vma_merge.
+ *
+ * NOTE: the same_anon_vma list can still be changed by other
+ * processes while mremap runs because mremap doesn't hold the
+ * anon_vma mutex to prevent modifications to the list while it
+ * runs. All we need to enforce is that the relative order of this
+ * process vmas isn't changing (we don't care about other vmas
+ * order). Each vma corresponds to an anon_vma_chain structure so
+ * there's no risk that other processes calling anon_vma_order_tail()
+ * and changing the same_anon_vma list under mremap() will screw with
+ * the relative order of this process vmas in the list, because we
+ * won't alter the order of any vma that isn't belonging to this
+ * process. And there can't be another anon_vma_order_tail running
+ * concurrently with mremap() coming from this process because we hold
+ * the mmap_sem for the whole mremap(). fork() ordering dependency
+ * also shouldn't be affected because we only care that the parent
+ * vmas are placed in the list before the child vmas and
+ * anon_vma_order_tail won't reorder vmas from either the fork parent
+ * or child.
+ */
+void anon_vma_order_tail(struct vm_area_struct *dst)
+{
+	struct anon_vma_chain *pavc;
+	struct anon_vma *root = NULL;
+
+	list_for_each_entry_reverse(pavc, &dst->anon_vma_chain, same_vma) {
+		struct anon_vma *anon_vma = pavc->anon_vma;
+		VM_BUG_ON(pavc->vma != dst);
+		root = lock_anon_vma_root(root, anon_vma);
+		list_del(&pavc->same_anon_vma);
+		list_add_tail(&pavc->same_anon_vma, &anon_vma->head);
+	}
+	unlock_anon_vma_root(root);
+}
+
+/*
  * Attach vma to its own anon_vma, as well as to the anon_vmas that
  * the corresponding VMA in the parent process is attached to.
  * Returns 0 on success, non-zero on failure.

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* Re: [PATCH] mremap: enforce rmap src/dst vma ordering in case of vma_merge succeeding in copy_vma
  2011-10-31 17:27                   ` [PATCH] mremap: enforce rmap src/dst vma ordering in case of vma_merge succeeding in copy_vma Andrea Arcangeli
@ 2011-11-01 12:07                     ` Mel Gorman
  2011-11-01 14:35                     ` Nai Xia
  2011-11-04  7:31                     ` Hugh Dickins
  2 siblings, 0 replies; 72+ messages in thread
From: Mel Gorman @ 2011-11-01 12:07 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Nai Xia, Hugh Dickins, Pawel Sikora, Andrew Morton, linux-mm,
	jpiszcz, arekm, linux-kernel

On Mon, Oct 31, 2011 at 06:27:20PM +0100, Andrea Arcangeli wrote:
> migrate was doing a rmap_walk with speculative lock-less access on
> pagetables. That could lead it to not serialize properly against
> mremap PT locks. But a second problem remains in the order of vmas in
> the same_anon_vma list used by the rmap_walk.
> 
> If vma_merge would succeed in copy_vma, the src vma could be placed
> after the dst vma in the same_anon_vma list. That could still lead
> migrate to miss some pte.
> 

For future reference, why? How about this as an explanation?

If vma_merge would succeed in copy_vma, the src vma could be placed
after the dst vma in the same_anon_vma list. That leads to a race
between migration and mremap whereby a migration PTE is left behind.

mremap				migration
create dst VMA
				rmap_walk
				finds dst, no ptes, release PTL
move_ptes
copies src PTEs to dst
				finds src, ptes empty, releases PTL

The migration PTE is now left behind because the order of the VMAs matters.
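
For reference, remove_migration_ptes() does its walk via rmap_walk_anon(),
which simply iterates the same_anon_vma list in order; a simplified 3.0-era
sketch (not the exact code):

/*
 * vmas are visited in same_anon_vma list order, so if the mremap
 * destination vma sits before the source vma in the list, the walk can
 * visit dst while it is still empty, mremap can then move the ptes, and
 * the walk finally visits src after it has been emptied -- the migration
 * pte is never found.
 */
static int rmap_walk_anon(struct page *page,
			  int (*rmap_one)(struct page *, struct vm_area_struct *,
					  unsigned long, void *),
			  void *arg)
{
	struct anon_vma *anon_vma = page_anon_vma(page);
	struct anon_vma_chain *avc;
	int ret = SWAP_AGAIN;

	if (!anon_vma)
		return ret;
	anon_vma_lock(anon_vma);
	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
		struct vm_area_struct *vma = avc->vma;
		unsigned long address = vma_address(page, vma);

		if (address == -EFAULT)
			continue;
		ret = rmap_one(page, vma, address, arg);
		if (ret != SWAP_AGAIN)
			break;
	}
	anon_vma_unlock(anon_vma);
	return ret;
}

Whichever of src/dst sits earlier in the list is visited first, which is
why forcing dst to the tail closes the window.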

> This patch adds a anon_vma_order_tail() function to force the dst vma
> at the end of the list before mremap starts to solve the problem.
> 

Document the alternative just in case?

"One fix would be to have mremap take the anon_vma lock which would
serialise migration and mremap but this would hurt scalability. Instead,
this patch adds....."
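
(Purely to illustrate that alternative -- a hypothetical sketch, not code
from the patch; move_ptes_serialized() is a made-up wrapper around 3.0's
move_ptes(), and anon_vma_lock() here takes the root mutex:)

/*
 * Hypothetical alternative, for contrast only: serialize mremap's pte
 * copy against rmap walks by holding the shared anon_vma root lock
 * around move_ptes().  Correct, but it turns the whole mremap copy into
 * one long critical section on a process-wide lock.
 */
static void move_ptes_serialized(struct vm_area_struct *vma, pmd_t *old_pmd,
		unsigned long old_addr, unsigned long old_end,
		struct vm_area_struct *new_vma, pmd_t *new_pmd,
		unsigned long new_addr)
{
	struct anon_vma *anon_vma = vma->anon_vma;

	if (anon_vma)
		anon_vma_lock(anon_vma);
	move_ptes(vma, old_pmd, old_addr, old_end, new_vma, new_pmd, new_addr);
	if (anon_vma)
		anon_vma_unlock(anon_vma);
}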

I would also prefer something like anon_vma_moveto_tail() but maybe
it's just me that sees "order" and thinks "high-order allocation".

> If the mremap is very large and there are a lots of parents or childs
> sharing the anon_vma root lock, this should still scale better than
> taking the anon_vma root lock around every pte copy practically for
> the whole duration of mremap.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  include/linux/rmap.h |    1 +
>  mm/mmap.c            |    8 ++++++++
>  mm/rmap.c            |   44 ++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 53 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 2148b12..45eb098 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -120,6 +120,7 @@ void anon_vma_init(void);	/* create anon_vma_cachep */
>  int  anon_vma_prepare(struct vm_area_struct *);
>  void unlink_anon_vmas(struct vm_area_struct *);
>  int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
> +void anon_vma_order_tail(struct vm_area_struct *);
>  int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
>  void __anon_vma_link(struct vm_area_struct *);
>  
> diff --git a/mm/mmap.c b/mm/mmap.c
> index a65efd4..a5858dc 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2339,7 +2339,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
>  		 */
>  		if (vma_start >= new_vma->vm_start &&
>  		    vma_start < new_vma->vm_end)
> +			/*
> +			 * No need to call anon_vma_order_tail() in
> +			 * this case because the same PT lock will
> +			 * serialize the rmap_walk against both src
> +			 * and dst vmas.
> +			 */
>  			*vmap = new_vma;
> +		else
> +			anon_vma_order_tail(new_vma);
>  	} else {
>  		new_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
>  		if (new_vma) {
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 8005080..6dbc165 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -272,6 +272,50 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
>  }
>  
>  /*
> + * Some rmap walk that needs to find all ptes/hugepmds without false
> + * negatives (like migrate and split_huge_page) running concurrent
> + * with operations that copy or move pagetables (like mremap() and
> + * fork()) to be safe depends the anon_vma "same_anon_vma" list to be
> + * in a certain order: the dst_vma must be placed after the src_vma in
> + * the list. This is always guaranteed by fork() but mremap() needs to
> + * call this function to enforce it in case the dst_vma isn't newly
> + * allocated and chained with the anon_vma_clone() function but just
> + * an extension of a pre-existing vma through vma_merge.
> + *
> + * NOTE: the same_anon_vma list can still be changed by other
> + * processes while mremap runs because mremap doesn't hold the
> + * anon_vma mutex to prevent modifications to the list while it
> + * runs. All we need to enforce is that the relative order of this
> + * process vmas isn't changing (we don't care about other vmas
> + * order). Each vma corresponds to an anon_vma_chain structure so
> + * there's no risk that other processes calling anon_vma_order_tail()
> + * and changing the same_anon_vma list under mremap() will screw with
> + * the relative order of this process vmas in the list, because we
> + * won't alter the order of any vma that isn't belonging to this
> + * process. And there can't be another anon_vma_order_tail running
> + * concurrently with mremap() coming from this process because we hold
> + * the mmap_sem for the whole mremap(). fork() ordering dependency
> + * also shouldn't be affected because we only care that the parent
> + * vmas are placed in the list before the child vmas and
> + * anon_vma_order_tail won't reorder vmas from either the fork parent
> + * or child.
> + */
> +void anon_vma_order_tail(struct vm_area_struct *dst)
> +{
> +	struct anon_vma_chain *pavc;
> +	struct anon_vma *root = NULL;
> +
> +	list_for_each_entry_reverse(pavc, &dst->anon_vma_chain, same_vma) {
> +		struct anon_vma *anon_vma = pavc->anon_vma;
> +		VM_BUG_ON(pavc->vma != dst);
> +		root = lock_anon_vma_root(root, anon_vma);
> +		list_del(&pavc->same_anon_vma);
> +		list_add_tail(&pavc->same_anon_vma, &anon_vma->head);
> +	}
> +	unlock_anon_vma_root(root);
> +}
> +

This is following the same rules as anon_vma_clone() and I didn't see a
flaw in your explanation as to why it's safe.

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH] mremap: enforce rmap src/dst vma ordering in case of vma_merge succeeding in copy_vma
  2011-10-31 17:27                   ` [PATCH] mremap: enforce rmap src/dst vma ordering in case of vma_merge succeeding in copy_vma Andrea Arcangeli
  2011-11-01 12:07                     ` Mel Gorman
@ 2011-11-01 14:35                     ` Nai Xia
  2011-11-04  7:31                     ` Hugh Dickins
  2 siblings, 0 replies; 72+ messages in thread
From: Nai Xia @ 2011-11-01 14:35 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mel Gorman, Hugh Dickins, Pawel Sikora, Andrew Morton, linux-mm,
	jpiszcz, arekm, linux-kernel

On Tue, Nov 1, 2011 at 1:27 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> migrate was doing a rmap_walk with speculative lock-less access on
> pagetables. That could lead it to not serialize properly against
> mremap PT locks. But a second problem remains in the order of vmas in
> the same_anon_vma list used by the rmap_walk.
>
> If vma_merge would succeed in copy_vma, the src vma could be placed
> after the dst vma in the same_anon_vma list. That could still lead
> migrate to miss some pte.
>
> This patch adds a anon_vma_order_tail() function to force the dst vma
> at the end of the list before mremap starts to solve the problem.
>
> If the mremap is very large and there are a lots of parents or childs
> sharing the anon_vma root lock, this should still scale better than
> taking the anon_vma root lock around every pte copy practically for
> the whole duration of mremap.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  include/linux/rmap.h |    1 +
>  mm/mmap.c            |    8 ++++++++
>  mm/rmap.c            |   44 ++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 53 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 2148b12..45eb098 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -120,6 +120,7 @@ void anon_vma_init(void);   /* create anon_vma_cachep */
>  int  anon_vma_prepare(struct vm_area_struct *);
>  void unlink_anon_vmas(struct vm_area_struct *);
>  int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
> +void anon_vma_order_tail(struct vm_area_struct *);
>  int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
>  void __anon_vma_link(struct vm_area_struct *);
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index a65efd4..a5858dc 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2339,7 +2339,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
>                 */
>                if (vma_start >= new_vma->vm_start &&
>                    vma_start < new_vma->vm_end)
> +                       /*
> +                        * No need to call anon_vma_order_tail() in
> +                        * this case because the same PT lock will
> +                        * serialize the rmap_walk against both src
> +                        * and dst vmas.
> +                        */
>                        *vmap = new_vma;
> +               else
> +                       anon_vma_order_tail(new_vma);
>        } else {
>                new_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
>                if (new_vma) {
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 8005080..6dbc165 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -272,6 +272,50 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
>  }
>
>  /*
> + * Some rmap walk that needs to find all ptes/hugepmds without false
> + * negatives (like migrate and split_huge_page) running concurrent
> + * with operations that copy or move pagetables (like mremap() and
> + * fork()) to be safe depends the anon_vma "same_anon_vma" list to be
> + * in a certain order: the dst_vma must be placed after the src_vma in
> + * the list. This is always guaranteed by fork() but mremap() needs to
> + * call this function to enforce it in case the dst_vma isn't newly
> + * allocated and chained with the anon_vma_clone() function but just
> + * an extension of a pre-existing vma through vma_merge.
> + *
> + * NOTE: the same_anon_vma list can still be changed by other
> + * processes while mremap runs because mremap doesn't hold the
> + * anon_vma mutex to prevent modifications to the list while it
> + * runs. All we need to enforce is that the relative order of this
> + * process vmas isn't changing (we don't care about other vmas
> + * order). Each vma corresponds to an anon_vma_chain structure so
> + * there's no risk that other processes calling anon_vma_order_tail()
> + * and changing the same_anon_vma list under mremap() will screw with
> + * the relative order of this process vmas in the list, because we
> + * won't alter the order of any vma that isn't belonging to this
> + * process. And there can't be another anon_vma_order_tail running
> + * concurrently with mremap() coming from this process because we hold
> + * the mmap_sem for the whole mremap(). fork() ordering dependency
> + * also shouldn't be affected because we only care that the parent
> + * vmas are placed in the list before the child vmas and
> + * anon_vma_order_tail won't reorder vmas from either the fork parent
> + * or child.
> + */
> +void anon_vma_order_tail(struct vm_area_struct *dst)
> +{
> +       struct anon_vma_chain *pavc;
> +       struct anon_vma *root = NULL;
> +
> +       list_for_each_entry_reverse(pavc, &dst->anon_vma_chain, same_vma) {
> +               struct anon_vma *anon_vma = pavc->anon_vma;
> +               VM_BUG_ON(pavc->vma != dst);
> +               root = lock_anon_vma_root(root, anon_vma);
> +               list_del(&pavc->same_anon_vma);
> +               list_add_tail(&pavc->same_anon_vma, &anon_vma->head);
> +       }
> +       unlock_anon_vma_root(root);
> +}

I think Pawel might want to add a "Tested-by"; he may have been running this
patch safely for quite some days. :)

Reviewed-by: Nai Xia <nai.xia@gmail.com>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH] mremap: enforce rmap src/dst vma ordering in case of vma_merge succeeding in copy_vma
  2011-10-31 17:27                   ` [PATCH] mremap: enforce rmap src/dst vma ordering in case of vma_merge succeeding in copy_vma Andrea Arcangeli
  2011-11-01 12:07                     ` Mel Gorman
  2011-11-01 14:35                     ` Nai Xia
@ 2011-11-04  7:31                     ` Hugh Dickins
  2011-11-04 14:34                       ` Nai Xia
  2011-11-04 23:56                       ` Andrea Arcangeli
  2 siblings, 2 replies; 72+ messages in thread
From: Hugh Dickins @ 2011-11-04  7:31 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mel Gorman, Nai Xia, Pawel Sikora, Andrew Morton, linux-mm,
	jpiszcz, arekm, linux-kernel

On Mon, 31 Oct 2011, Andrea Arcangeli wrote:

> migrate was doing a rmap_walk with speculative lock-less access on
> pagetables. That could lead it to not serialize properly against
> mremap PT locks. But a second problem remains in the order of vmas in
> the same_anon_vma list used by the rmap_walk.

I do think that Nai Xia deserves special credit for thinking deeper
into this than the rest of us (before you came back): something like

Issue-conceived-by: Nai Xia <nai.xia@gmail.com>

> 
> If vma_merge would succeed in copy_vma, the src vma could be placed
> after the dst vma in the same_anon_vma list. That could still lead
> migrate to miss some pte.
> 
> This patch adds a anon_vma_order_tail() function to force the dst vma

I agree with Mel that anon_vma_moveto_tail() would be a better name;
or even anon_vma_move_to_tail().

> at the end of the list before mremap starts to solve the problem.
> 
> If the mremap is very large and there are a lots of parents or childs
> sharing the anon_vma root lock, this should still scale better than
> taking the anon_vma root lock around every pte copy practically for
> the whole duration of mremap.

But I'm sorry to say that I'm actually not persuaded by the patch,
on three counts.

> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  include/linux/rmap.h |    1 +
>  mm/mmap.c            |    8 ++++++++
>  mm/rmap.c            |   44 ++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 53 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 2148b12..45eb098 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -120,6 +120,7 @@ void anon_vma_init(void);	/* create anon_vma_cachep */
>  int  anon_vma_prepare(struct vm_area_struct *);
>  void unlink_anon_vmas(struct vm_area_struct *);
>  int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
> +void anon_vma_order_tail(struct vm_area_struct *);
>  int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
>  void __anon_vma_link(struct vm_area_struct *);
>  
> diff --git a/mm/mmap.c b/mm/mmap.c
> index a65efd4..a5858dc 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2339,7 +2339,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
>  		 */
>  		if (vma_start >= new_vma->vm_start &&
>  		    vma_start < new_vma->vm_end)
> +			/*
> +			 * No need to call anon_vma_order_tail() in
> +			 * this case because the same PT lock will
> +			 * serialize the rmap_walk against both src
> +			 * and dst vmas.
> +			 */

Really?  Please convince me: I just do not see what ensures that
the same pt lock covers both src and dst areas in this case.

>  			*vmap = new_vma;
> +		else
> +			anon_vma_order_tail(new_vma);

And if this puts new_vma in the right position for the normal
move_page_tables(), as anon_vma_clone() does in the block below,
aren't they both in exactly the wrong position for the abnormal
move_page_tables(), called to put ptes back where they were if
the original move_page_tables() fails?
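
(For context, the error path being referred to looks roughly like this in a
3.0-era mm/mremap.c move_vma() -- a simplified excerpt, not a complete
function:)

	moved_len = move_page_tables(vma, old_addr, new_vma, new_addr, old_len);
	if (moved_len < old_len) {
		/*
		 * On error, move entries back from new area to old,
		 * which will succeed since page tables still there,
		 * and then proceed to unmap new area instead of old.
		 * Here new_vma temporarily acts as the "source", i.e.
		 * the opposite of the ordering the patch establishes.
		 */
		move_page_tables(new_vma, new_addr, vma, old_addr, moved_len);
		vma = new_vma;
		old_len = new_len;
		old_addr = new_addr;
		new_addr = -ENOMEM;
	}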

It might be possible to argue that move_page_tables() can only
fail by failing to allocate memory for pud or pmd, and that (perhaps)
could only happen if the task was being OOM-killed and ran out of
reserves at this point, and if it's being OOM-killed then we don't
mind losing a migration entry for a moment... perhaps.

Certainly I'd agree that it's a very rare case.  But it feels wrong
to be attempting to fix the already unlikely issue, while ignoring
this aspect, or relying on such unrelated implementation details.

Perhaps some further anon_vma_ordering could fix it up,
but that would look increasingly desperate.

>  	} else {
>  		new_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
>  		if (new_vma) {
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 8005080..6dbc165 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -272,6 +272,50 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
>  }
>  
>  /*
> + * Some rmap walk that needs to find all ptes/hugepmds without false
> + * negatives (like migrate and split_huge_page) running concurrent
> + * with operations that copy or move pagetables (like mremap() and
> + * fork()) to be safe depends the anon_vma "same_anon_vma" list to be
> + * in a certain order: the dst_vma must be placed after the src_vma in
> + * the list. This is always guaranteed by fork() but mremap() needs to
> + * call this function to enforce it in case the dst_vma isn't newly
> + * allocated and chained with the anon_vma_clone() function but just
> + * an extension of a pre-existing vma through vma_merge.
> + *
> + * NOTE: the same_anon_vma list can still be changed by other
> + * processes while mremap runs because mremap doesn't hold the
> + * anon_vma mutex to prevent modifications to the list while it
> + * runs. All we need to enforce is that the relative order of this
> + * process vmas isn't changing (we don't care about other vmas
> + * order). Each vma corresponds to an anon_vma_chain structure so
> + * there's no risk that other processes calling anon_vma_order_tail()
> + * and changing the same_anon_vma list under mremap() will screw with
> + * the relative order of this process vmas in the list, because we
> + * won't alter the order of any vma that isn't belonging to this
> + * process. And there can't be another anon_vma_order_tail running
> + * concurrently with mremap() coming from this process because we hold
> + * the mmap_sem for the whole mremap(). fork() ordering dependency
> + * also shouldn't be affected because we only care that the parent
> + * vmas are placed in the list before the child vmas and
> + * anon_vma_order_tail won't reorder vmas from either the fork parent
> + * or child.
> + */
> +void anon_vma_order_tail(struct vm_area_struct *dst)
> +{
> +	struct anon_vma_chain *pavc;
> +	struct anon_vma *root = NULL;
> +
> +	list_for_each_entry_reverse(pavc, &dst->anon_vma_chain, same_vma) {
> +		struct anon_vma *anon_vma = pavc->anon_vma;
> +		VM_BUG_ON(pavc->vma != dst);
> +		root = lock_anon_vma_root(root, anon_vma);
> +		list_del(&pavc->same_anon_vma);
> +		list_add_tail(&pavc->same_anon_vma, &anon_vma->head);
> +	}
> +	unlock_anon_vma_root(root);
> +}

I thought this was correct, but now I'm not so sure.  You rightly
consider the question of interference between concurrent mremaps in
different mms in your comment above, but I'm still not convinced it
is safe.  Oh, probably just my persistent failure to picture these
avcs properly.

If we were back in the days of the simple anon_vma list, I'd probably
share your enthusiasm for the list ordering solution; but now it looks
like a fragile and contorted way of avoiding the obvious... we just
need to use the anon_vma_lock (but perhaps there are some common and
easily tested conditions under which we can skip it e.g. when a single
pt lock covers src and dst?).

Sorry to be so negative!  I may just be wrong on all counts.

Hugh

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH] mremap: enforce rmap src/dst vma ordering in case of vma_merge succeeding in copy_vma
  2011-11-04  7:31                     ` Hugh Dickins
@ 2011-11-04 14:34                       ` Nai Xia
  2011-11-04 15:59                         ` Pawel Sikora
  2011-11-04 19:16                         ` Hugh Dickins
  2011-11-04 23:56                       ` Andrea Arcangeli
  1 sibling, 2 replies; 72+ messages in thread
From: Nai Xia @ 2011-11-04 14:34 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Mel Gorman, Pawel Sikora, Andrew Morton,
	linux-mm, jpiszcz, arekm, linux-kernel

On Fri, Nov 4, 2011 at 3:31 PM, Hugh Dickins <hughd@google.com> wrote:
> On Mon, 31 Oct 2011, Andrea Arcangeli wrote:
>
>> migrate was doing a rmap_walk with speculative lock-less access on
>> pagetables. That could lead it to not serialize properly against
>> mremap PT locks. But a second problem remains in the order of vmas in
>> the same_anon_vma list used by the rmap_walk.
>
> I do think that Nai Xia deserves special credit for thinking deeper
> into this than the rest of us (before you came back): something like
>
> Issue-conceived-by: Nai Xia <nai.xia@gmail.com>

Thanks! ;-)

>
>>
>> If vma_merge would succeed in copy_vma, the src vma could be placed
>> after the dst vma in the same_anon_vma list. That could still lead
>> migrate to miss some pte.
>>
>> This patch adds a anon_vma_order_tail() function to force the dst vma
>
> I agree with Mel that anon_vma_moveto_tail() would be a better name;
> or even anon_vma_move_to_tail().
>
>> at the end of the list before mremap starts to solve the problem.
>>
>> If the mremap is very large and there are a lots of parents or childs
>> sharing the anon_vma root lock, this should still scale better than
>> taking the anon_vma root lock around every pte copy practically for
>> the whole duration of mremap.
>
> But I'm sorry to say that I'm actually not persuaded by the patch,
> on three counts.
>
>>
>> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
>> ---
>>  include/linux/rmap.h |    1 +
>>  mm/mmap.c            |    8 ++++++++
>>  mm/rmap.c            |   44 ++++++++++++++++++++++++++++++++++++++++++++
>>  3 files changed, 53 insertions(+), 0 deletions(-)
> B
>>
>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>> index 2148b12..45eb098 100644
>> --- a/include/linux/rmap.h
>> +++ b/include/linux/rmap.h
>> @@ -120,6 +120,7 @@ void anon_vma_init(void); /* create anon_vma_cachep */
>>  int  anon_vma_prepare(struct vm_area_struct *);
>>  void unlink_anon_vmas(struct vm_area_struct *);
>>  int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
>> +void anon_vma_order_tail(struct vm_area_struct *);
>>  int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
>>  void __anon_vma_link(struct vm_area_struct *);
>>
>> diff --git a/mm/mmap.c b/mm/mmap.c
>> index a65efd4..a5858dc 100644
>> --- a/mm/mmap.c
>> +++ b/mm/mmap.c
>> @@ -2339,7 +2339,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
>>                */
>>               if (vma_start >= new_vma->vm_start &&
>>                   vma_start < new_vma->vm_end)
>> +                     /*
>> +                      * No need to call anon_vma_order_tail() in
>> +                      * this case because the same PT lock will
>> +                      * serialize the rmap_walk against both src
>> +                      * and dst vmas.
>> +                      */
>
> Really?  Please convince me: I just do not see what ensures that
> the same pt lock covers both src and dst areas in this case.

At first glance, the rmap_walk does travel this merged VMA
once...
But now, wait... I am actually puzzled that this case can
happen at all: you see, vma_merge() does not break the validity of
the page->index <-> VMA relationship. So if this could really happen,
a page->index would have to be valid in both areas of the same VMA.
It's strange to imagine a PTE being copied within the _same_ VMA
with page->index valid at both the old and new places.

IMO, the only case in which the src VMA can be merged with the new one
is when the src VMA hasn't been faulted yet and the pgoff
is recalculated. And if my reasoning is right, this place
does not need to be worried about.

What do you think?

>
>>                       *vmap = new_vma;
>> +             else
>> +                     anon_vma_order_tail(new_vma);
>
> And if this puts new_vma in the right position for the normal
> move_page_tables(), as anon_vma_clone() does in the block below,
> aren't they both in exactly the wrong position for the abnormal
> move_page_tables(), called to put ptes back where they were if
> the original move_page_tables() fails?

OH MY, at least six eyeballs missed another apparent case...
Now you know why I said "Human brains are all weak in...." :P

>
> It might be possible to argue that move_page_tables() can only
> fail by failing to allocate memory for pud or pmd, and that (perhaps)
> could only happen if the task was being OOM-killed and ran out of
> reserves at this point, and if it's being OOM-killed then we don't
> mind losing a migration entry for a moment... perhaps.
>
> Certainly I'd agree that it's a very rare case.  But it feels wrong
> to be attempting to fix the already unlikely issue, while ignoring
> this aspect, or relying on such unrelated implementation details.
>
> Perhaps some further anon_vma_ordering could fix it up,
> but that would look increasingly desperate.
>
>>       } else {
>>               new_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
>>               if (new_vma) {
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 8005080..6dbc165 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -272,6 +272,50 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
>>  }
>>
>>  /*
>> + * Some rmap walk that needs to find all ptes/hugepmds without false
>> + * negatives (like migrate and split_huge_page) running concurrent
>> + * with operations that copy or move pagetables (like mremap() and
>> + * fork()) to be safe depends the anon_vma "same_anon_vma" list to be
>> + * in a certain order: the dst_vma must be placed after the src_vma in
>> + * the list. This is always guaranteed by fork() but mremap() needs to
>> + * call this function to enforce it in case the dst_vma isn't newly
>> + * allocated and chained with the anon_vma_clone() function but just
>> + * an extension of a pre-existing vma through vma_merge.
>> + *
>> + * NOTE: the same_anon_vma list can still be changed by other
>> + * processes while mremap runs because mremap doesn't hold the
>> + * anon_vma mutex to prevent modifications to the list while it
>> + * runs. All we need to enforce is that the relative order of this
>> + * process vmas isn't changing (we don't care about other vmas
>> + * order). Each vma corresponds to an anon_vma_chain structure so
>> + * there's no risk that other processes calling anon_vma_order_tail()
>> + * and changing the same_anon_vma list under mremap() will screw with
>> + * the relative order of this process vmas in the list, because we
>> + * won't alter the order of any vma that isn't belonging to this
>> + * process. And there can't be another anon_vma_order_tail running
>> + * concurrently with mremap() coming from this process because we hold
>> + * the mmap_sem for the whole mremap(). fork() ordering dependency
>> + * also shouldn't be affected because we only care that the parent
>> + * vmas are placed in the list before the child vmas and
>> + * anon_vma_order_tail won't reorder vmas from either the fork parent
>> + * or child.
>> + */
>> +void anon_vma_order_tail(struct vm_area_struct *dst)
>> +{
>> +     struct anon_vma_chain *pavc;
>> +     struct anon_vma *root = NULL;
>> +
>> +     list_for_each_entry_reverse(pavc, &dst->anon_vma_chain, same_vma) {
>> +             struct anon_vma *anon_vma = pavc->anon_vma;
>> +             VM_BUG_ON(pavc->vma != dst);
>> +             root = lock_anon_vma_root(root, anon_vma);
>> +             list_del(&pavc->same_anon_vma);
>> +             list_add_tail(&pavc->same_anon_vma, &anon_vma->head);
>> +     }
>> +     unlock_anon_vma_root(root);
>> +}
>
> I thought this was correct, but now I'm not so sure.  You rightly
> consider the question of interference between concurrent mremaps in
> different mms in your comment above, but I'm still not convinced it
> is safe.  Oh, probably just my persistent failure to picture these
> avcs properly.
>
> If we were back in the days of the simple anon_vma list, I'd probably
> share your enthusiasm for the list ordering solution; but now it looks
> like a fragile and contorted way of avoiding the obvious... we just
> need to use the anon_vma_lock (but perhaps there are some common and
> easily tested conditions under which we can skip it e.g. when a single
> pt lock covers src and dst?).
>
> Sorry to be so negative!  I may just be wrong on all counts.
>
> Hugh
>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH] mremap: enforce rmap src/dst vma ordering in case of vma_merge succeeding in copy_vma
  2011-11-04 14:34                       ` Nai Xia
@ 2011-11-04 15:59                         ` Pawel Sikora
  2011-11-05  2:21                           ` Nai Xia
  2011-11-04 19:16                         ` Hugh Dickins
  1 sibling, 1 reply; 72+ messages in thread
From: Pawel Sikora @ 2011-11-04 15:59 UTC (permalink / raw)
  To: Nai Xia
  Cc: Hugh Dickins, Andrea Arcangeli, Mel Gorman, Andrew Morton,
	linux-mm, jpiszcz, arekm, linux-kernel

On Friday 04 of November 2011 22:34:54 Nai Xia wrote:
> On Fri, Nov 4, 2011 at 3:31 PM, Hugh Dickins <hughd@google.com> wrote:
> > On Mon, 31 Oct 2011, Andrea Arcangeli wrote:
> >
> >> migrate was doing a rmap_walk with speculative lock-less access on
> >> pagetables. That could lead it to not serialize properly against
> >> mremap PT locks. But a second problem remains in the order of vmas in
> >> the same_anon_vma list used by the rmap_walk.
> >
> > I do think that Nai Xia deserves special credit for thinking deeper
> > into this than the rest of us (before you came back): something like
> >
> > Issue-conceived-by: Nai Xia <nai.xia@gmail.com>
> 
> Thanks! ;-)

hi all,

i'm still testing the anon_vma_order_tail() patch. 10 days of heavy processing
and the machine is still stable, but i've recorded something interesting:

$ uname -a
Linux hal 3.0.8-vs2.3.1-dirty #6 SMP Tue Oct 25 10:07:50 CEST 2011 x86_64 AMD_Opteron(tm)_Processor_6128 PLD Linux
$ uptime
 16:47:44 up 10 days,  4:21,  5 users,  load average: 19.55, 19.15, 18.76
$ ps aux|grep migration
root         6  0.0  0.0      0     0 ?        S    Oct25   0:00 [migration/0]
root         8 68.0  0.0      0     0 ?        S    Oct25 9974:01 [migration/1]
root        13 35.4  0.0      0     0 ?        S    Oct25 5202:15 [migration/2]
root        17 71.4  0.0      0     0 ?        S    Oct25 10479:10 [migration/3]
root        21 70.7  0.0      0     0 ?        S    Oct25 10370:14 [migration/4]
root        25 66.1  0.0      0     0 ?        S    Oct25 9698:11 [migration/5]
root        29 70.1  0.0      0     0 ?        S    Oct25 10283:22 [migration/6]
root        33 62.6  0.0      0     0 ?        S    Oct25 9190:28 [migration/7]
root        37  0.0  0.0      0     0 ?        S    Oct25   0:00 [migration/8]
root        41 97.7  0.0      0     0 ?        S    Oct25 14338:30 [migration/9]
root        45 29.2  0.0      0     0 ?        S    Oct25 4290:00 [migration/10]
root        49 68.7  0.0      0     0 ?        S    Oct25 10081:38 [migration/11]
root        53 98.7  0.0      0     0 ?        S    Oct25 14477:25 [migration/12]
root        57 70.0  0.0      0     0 ?        S    Oct25 10272:57 [migration/13]
root        61 69.7  0.0      0     0 ?        S    Oct25 10232:29 [migration/14]
root        65 70.9  0.0      0     0 ?        S    Oct25 10403:09 [migration/15]

wow, 71..241 hours spent in the migration kernel threads after 10 days of uptime?
the machine has 2 opteron nodes with 32GB of ram paired with each processor.
i suppose it spends a lot of time on migration (of both processes and memory pages).

BR,
Paweł.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH] mremap: enforce rmap src/dst vma ordering in case of vma_merge succeeding in copy_vma
  2011-11-04 14:34                       ` Nai Xia
  2011-11-04 15:59                         ` Pawel Sikora
@ 2011-11-04 19:16                         ` Hugh Dickins
  2011-11-04 20:54                           ` Andrea Arcangeli
  1 sibling, 1 reply; 72+ messages in thread
From: Hugh Dickins @ 2011-11-04 19:16 UTC (permalink / raw)
  To: Nai Xia
  Cc: Andrea Arcangeli, Mel Gorman, Pawel Sikora, Andrew Morton,
	linux-mm, jpiszcz, arekm, linux-kernel


On Fri, 4 Nov 2011, Nai Xia wrote:
> On Fri, Nov 4, 2011 at 3:31 PM, Hugh Dickins <hughd@google.com> wrote:
> > On Mon, 31 Oct 2011, Andrea Arcangeli wrote:
> >> @@ -2339,7 +2339,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
> >>                */
> >>               if (vma_start >= new_vma->vm_start &&
> >>                   vma_start < new_vma->vm_end)
> >> +                     /*
> >> +                      * No need to call anon_vma_order_tail() in
> >> +                      * this case because the same PT lock will
> >> +                      * serialize the rmap_walk against both src
> >> +                      * and dst vmas.
> >> +                      */
> >
> > Really?  Please convince me: I just do not see what ensures that
> > the same pt lock covers both src and dst areas in this case.
> 
> At the first glance that rmap_walk does travel this merged VMA
> once...
> But, Now, Wait...., I am actually really puzzled that this case can really
> happen at all, you see that vma_merge() does not break the validness
> between page->index and its VMA. So if this can really happen,
> a page->index should be valid in both areas in a same VMA.
> It's strange to imagine that a PTE is copy inside a _same_ VMA
> and page->index is valid at both old and new places.

Yes, I think you are right, thank you for elucidating it.

That was a real case when we wrote copy_vma(), when rmap was using
pte_chains; but once anon_vma came in, and imposed vm_pgoff matching
on anonymous mappings too, it became dead code.  With linear vm_pgoff
matching, you cannot fit a range in two places within the same vma.
(And even the non-linear case relies upon vm_pgoff defaults.)
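(Spelling out the linearity relied upon, roughly:

	page->index == vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT)

so within one vma a given page->index maps back to exactly one address,
which is what rules out fitting the range in two places.)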

So we could simplify the copy_vma() interface a little now (get rid of
that nasty **vmap): I'm not quite sure whether we ought to do that,
but certainly Andrea's comment there should be updated (if he also
agrees with your analysis).

> 
> IMO, the only case that src VMA can be merged by the new
> is that src VMA hasn't been faulted yet and the pgoff
> is recalculated. And if my reasoning is true, this place
> does not need to be worried about.

I don't see a place where "the pgoff is recalculated" (except in
the consistent way when expanding or splitting or merging vma), nor
where vma merge would allow for variable pgoff.  I agree that we
could avoid finalizing vm_pgoff for an anonymous area until its
anon_vma is assigned: were you imagining doing that in future,
or am I overlooking something already there?

Hugh

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH] mremap: enforce rmap src/dst vma ordering in case of vma_merge succeeding in copy_vma
  2011-11-04 19:16                         ` Hugh Dickins
@ 2011-11-04 20:54                           ` Andrea Arcangeli
  2011-11-05  0:09                             ` Nai Xia
  0 siblings, 1 reply; 72+ messages in thread
From: Andrea Arcangeli @ 2011-11-04 20:54 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Nai Xia, Mel Gorman, Pawel Sikora, Andrew Morton, linux-mm,
	jpiszcz, arekm, linux-kernel

On Fri, Nov 04, 2011 at 12:16:03PM -0700, Hugh Dickins wrote:
> On Fri, 4 Nov 2011, Nai Xia wrote:
> > On Fri, Nov 4, 2011 at 3:31 PM, Hugh Dickins <hughd@google.com> wrote:
> > > On Mon, 31 Oct 2011, Andrea Arcangeli wrote:
> > >> @@ -2339,7 +2339,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
> > >>                */
> > >>               if (vma_start >= new_vma->vm_start &&
> > >>                   vma_start < new_vma->vm_end)
> > >> +                     /*
> > >> +                      * No need to call anon_vma_order_tail() in
> > >> +                      * this case because the same PT lock will
> > >> +                      * serialize the rmap_walk against both src
> > >> +                      * and dst vmas.
> > >> +                      */
> > >
> > > Really?  Please convince me: I just do not see what ensures that
> > > the same pt lock covers both src and dst areas in this case.
> > 
> > At the first glance that rmap_walk does travel this merged VMA
> > once...
> > But, Now, Wait...., I am actually really puzzled that this case can really
> > happen at all, you see that vma_merge() does not break the validness
> > between page->index and its VMA. So if this can really happen,
> > a page->index should be valid in both areas in a same VMA.
> > It's strange to imagine that a PTE is copy inside a _same_ VMA
> > and page->index is valid at both old and new places.
> 
> Yes, I think you are right, thank you for elucidating it.
> 
> That was a real case when we wrote copy_vma(), when rmap was using
> pte_chains; but once anon_vma came in, and imposed vm_pgoff matching
> on anonymous mappings too, it became dead code.  With linear vm_pgoff
> matching, you cannot fit a range in two places within the same vma.
> (And even the non-linear case relies upon vm_pgoff defaults.)
> 
> So we could simplify the copy_vma() interface a little now (get rid of
> that nasty **vmap): I'm not quite sure whether we ought to do that,
> but certainly Andrea's comment there should be updated (if he also
> agrees with your analysis).

The vmap should only trigger when the prev vma (prev relative to src
vma) is extended at the end to make space for the dst range. And by
extending it, we filled the hole between the prev vma and "src"
vma. So then the prev vma becomes the "src vma" and also the "dst
vma". So we can't keep working with the old "vma" pointer after that.

I doubt it can be removed without crashing in the above case.

I thought some more about it, and what I missed, I think, is the
anon_vma_merge in vma_adjust. With that anon_vma_merge, rmap_walk will
have to complete before we can start moving the ptes. And so when
rmap_walk starts again from scratch (after anon_vma_merge runs in
vma_adjust) it will find all ptes even if vma_merge succeeded before.

In fact this may also work for fork. Fork will take the anon_vma root
lock somehow to queue the child vma in the same_anon_vma. Doing so it
will serialize against any running rmap_walk from all other cpus. The
ordering has never been an issue for fork anyway, but it would have
been an issue for mremap in case vma_merge succeeded and src_vma !=
dst_vma, if vma_merge didn't act as a serialization point against
rmap_walk (which I realize now).

What makes it safe is again taking both PT locks simultaneously. So it
doesn't matter what rmap_walk searches, as long as the anon_vma_chain
list cannot change once rmap_walk has started.

What I thought before was rmap_walk checking vma1, then vma_merge
succeeding (where the src vma is vma2 and the dst vma is vma1, but vma1
is not a new vma queued at the end of same_anon_vma), move_page_tables
moving the pte from vma2 to vma1, and then rmap_walk checking vma2. But
again, vma_merge won't be allowed to complete in the middle of
rmap_walk, so it cannot trigger and we can safely drop the patch. It
wasn't immediate to think of the locks taken within vma_adjust, sorry.
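
To make the serialization point explicit, the shape of the walk is
roughly this (a sketch from memory, not the exact code of any tree):

/*
 * rmap_walk over an anonymous page: the whole walk runs under the
 * anon_vma root lock, the same lock vma_adjust takes (and that
 * anon_vma_merge needs) when vma_merge succeeds, so a merge cannot
 * complete in the middle of a walk.
 */
anon_vma_lock(anon_vma);
list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
	struct vm_area_struct *vma = avc->vma;
	unsigned long address = vma_address(page, vma);

	if (address == -EFAULT)
		continue;
	/* rmap_one takes the PT lock for this vma/address */
	ret = rmap_one(page, vma, address, arg);
	if (ret != SWAP_AGAIN)
		break;
}
anon_vma_unlock(anon_vma);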

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH] mremap: enforce rmap src/dst vma ordering in case of vma_merge succeeding in copy_vma
  2011-11-04  7:31                     ` Hugh Dickins
  2011-11-04 14:34                       ` Nai Xia
@ 2011-11-04 23:56                       ` Andrea Arcangeli
  2011-11-05  0:21                         ` Nai Xia
  1 sibling, 1 reply; 72+ messages in thread
From: Andrea Arcangeli @ 2011-11-04 23:56 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Mel Gorman, Nai Xia, Pawel Sikora, Andrew Morton, linux-mm,
	jpiszcz, arekm, linux-kernel

On Fri, Nov 04, 2011 at 12:31:04AM -0700, Hugh Dickins wrote:
> On Mon, 31 Oct 2011, Andrea Arcangeli wrote:
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index a65efd4..a5858dc 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -2339,7 +2339,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
> >  		 */
> >  		if (vma_start >= new_vma->vm_start &&
> >  		    vma_start < new_vma->vm_end)
> > +			/*
> > +			 * No need to call anon_vma_order_tail() in
> > +			 * this case because the same PT lock will
> > +			 * serialize the rmap_walk against both src
> > +			 * and dst vmas.
> > +			 */
> 
> Really?  Please convince me: I just do not see what ensures that
> the same pt lock covers both src and dst areas in this case.

Right, the vma being the same for src/dst doesn't mean the PT lock is
the same: it might be, if the source and destination pte entries fit in
the same pagetable, but maybe not if the vma is >2M (the maximum a
single pagetable can map).
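
Concretely, with 4k pages on x86-64 one pagetable covers 2M, so (with
split pte locks) whether src and dst share the PT lock boils down to
something like the illustrative check below; the constants and values
are assumptions, not taken from the kernel:

#include <stdio.h>
#include <stdbool.h>

#define PMD_SHIFT 21	/* one pagetable maps 2M of virtual space */

static bool same_page_table(unsigned long a, unsigned long b)
{
	return (a >> PMD_SHIFT) == (b >> PMD_SHIFT);
}

int main(void)
{
	unsigned long src = 0x700000100000UL;
	unsigned long dst = 0x700000300000UL;	/* 2M away: other pagetable */

	printf("same PT lock: %d\n", same_page_table(src, dst));
	return 0;
}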

> >  			*vmap = new_vma;
> > +		else
> > +			anon_vma_order_tail(new_vma);
> 
> And if this puts new_vma in the right position for the normal
> move_page_tables(), as anon_vma_clone() does in the block below,
> aren't they both in exactly the wrong position for the abnormal
> move_page_tables(), called to put ptes back where they were if
> the original move_page_tables() fails?

Failure paths. Good point, they'd need to be reversed again in that
case.

> It might be possible to argue that move_page_tables() can only
> fail by failing to allocate memory for pud or pmd, and that (perhaps)
> could only happen if the task was being OOM-killed and ran out of
> reserves at this point, and if it's being OOM-killed then we don't
> mind losing a migration entry for a moment... perhaps.

Hmm no it wouldn't be ok, or I wouldn't want to risk that.

> Certainly I'd agree that it's a very rare case.  But it feels wrong
> to be attempting to fix the already unlikely issue, while ignoring
> this aspect, or relying on such unrelated implementation details.

Agreed.

> Perhaps some further anon_vma_ordering could fix it up,
> but that would look increasingly desperate.

I think what Nai didn't consider in explaining this theoretical race,
and what I only noticed now, is the anon_vma root lock taken by
vma_adjust.

If the merge succeeds, vma_adjust will take the lock and flush away
from all other CPUs any sign of rmap_walk before the move_page_tables
can start.

So it can't happen that you do rmap_walk, check vma1, mremap moves
stuff from vma2 to vma1 (wrong order), and then rmap_walk continues
checking vma2 where the pte won't be there anymore. It can't happen
because mremap would block in vma_merge, waiting for the rmap_walk to
complete, before proceeding to move any pte, thanks to the anon_vma
lock already taken by vma_adjust.

So the real fix for the real bug is the one already merged in kernel
v3.1 and we don't need to make any more changes because there is no
race left.

The only bug was the lack of the PT lock before checking the pte:
migrate could read the ptes while move_ptes transferred a pte from
src_ptep to the kernel stack, before writing it to dst_ptep. That is
closed by taking the PT lock in migrate before checking whether the
pte could be a migration pte (so flushing move_ptes away from all
other CPUs while migrate checks whether a migration pte is mapped
there).

I don't think the ordering matters anymore. Nai's theory sounded good;
there was just one small detail he missed in the vma_merge internal
locking that prevents the race from triggering.

> If we were back in the days of the simple anon_vma list, I'd probably
> share your enthusiasm for the list ordering solution; but now it looks
> like a fragile and contorted way of avoiding the obvious... we just
> need to use the anon_vma_lock (but perhaps there are some common and
> easily tested conditions under which we can skip it e.g. when a single
> pt lock covers src and dst?).

Actually I thought about this one before I noticed the vma_merge
internal locking that prevents Nai's remaining race from triggering.
My conclusion is that the anon_vma chains don't actually change
anything with regard to ordering. It becomes a bit multidimensional to
think about, which complicates things incredibly, but the ordering
issue could have happened before too, and the fix would have worked
for both.

The old anon_vma was like three dimensions (vma, anon_vma, page). Now
it's (vma, chain, anon_vma, page). But if you consider just a single
process execve'd without any child, it goes back to three dimensions.
And the moment you add children, you can think of the old "three
dimension" anon_vma logic as the parent's. And if the parent is safe
with all the child vmas in the same_anon_vma list, then the children
are surely safe to reorder that way too. But hey, it's not needed, so
we're faster, we don't have to do those list searches during mremap,
and it's simpler too :).

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH] mremap: enforce rmap src/dst vma ordering in case of vma_merge succeeding in copy_vma
  2011-11-04 20:54                           ` Andrea Arcangeli
@ 2011-11-05  0:09                             ` Nai Xia
  2011-11-05  2:21                               ` Hugh Dickins
  0 siblings, 1 reply; 72+ messages in thread
From: Nai Xia @ 2011-11-05  0:09 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hugh Dickins, Mel Gorman, Pawel Sikora, Andrew Morton, linux-mm,
	jpiszcz, arekm, linux-kernel

On Sat, Nov 5, 2011 at 4:54 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> On Fri, Nov 04, 2011 at 12:16:03PM -0700, Hugh Dickins wrote:
>> On Fri, 4 Nov 2011, Nai Xia wrote:
>> > On Fri, Nov 4, 2011 at 3:31 PM, Hugh Dickins <hughd@google.com> wrote:
>> > > On Mon, 31 Oct 2011, Andrea Arcangeli wrote:
>> > >> @@ -2339,7 +2339,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
>> > >>                */
>> > >>               if (vma_start >= new_vma->vm_start &&
>> > >>                   vma_start < new_vma->vm_end)
>> > >> +                     /*
>> > >> +                      * No need to call anon_vma_order_tail() in
>> > >> +                      * this case because the same PT lock will
>> > >> +                      * serialize the rmap_walk against both src
>> > >> +                      * and dst vmas.
>> > >> +                      */
>> > >
>> > > Really?  Please convince me: I just do not see what ensures that
>> > > the same pt lock covers both src and dst areas in this case.
>> >
>> > At the first glance that rmap_walk does travel this merged VMA
>> > once...
>> > But, Now, Wait...., I am actually really puzzled that this case can really
>> > happen at all, you see that vma_merge() does not break the validness
>> > between page->index and its VMA. So if this can really happen,
>> > a page->index should be valid in both areas in a same VMA.
>> > It's strange to imagine that a PTE is copy inside a _same_ VMA
>> > and page->index is valid at both old and new places.
>>
>> Yes, I think you are right, thank you for elucidating it.
>>
>> That was a real case when we wrote copy_vma(), when rmap was using
>> pte_chains; but once anon_vma came in, and imposed vm_pgoff matching
>> on anonymous mappings too, it became dead code.  With linear vm_pgoff
>> matching, you cannot fit a range in two places within the same vma.
>> (And even the non-linear case relies upon vm_pgoff defaults.)
>>
>> So we could simplify the copy_vma() interface a little now (get rid of
>> that nasty **vmap): I'm not quite sure whether we ought to do that,
>> but certainly Andrea's comment there should be updated (if he also
>> agrees with your analysis).
>
> The vmap should only trigger when the prev vma (prev relative to src
> vma) is extended at the end to make space for the dst range. And by
> extending it, we filled the hole between the prev vma and "src"
> vma. So then the prev vma becomes the "src vma" and also the "dst
> vma". So we can't keep working with the old "vma" pointer after that.
>
> I doubt it can be removed without crashing in the above case.

Yes, this line itself should not be removed. As I explained, the pgoff
adjustment at the top of copy_vma() for a not-yet-faulted vma will lead
to this case. But we do not need to worry about the move_page_tables()
that follows once this happens, so no lines need to be added here.
Maybe the documentation should be changed in your original patch to
clarify this, though; reasoning with PT locks for this case might be
somewhat misleading.

Furthermore, the move_page_tables() call following this case might
better be avoided entirely for code readability, and it's simple to
test with (vma == new_vma).

Do you agree? :)

>
> I thought some more about it and what I missed I think is the
> anon_vma_merge in vma_adjust. What that anon_vma_merge, rmap_walk will
> have to complete before we can start moving the ptes. And so rmap_walk
> when starts again from scratch (after anon_vma_merge run in
> vma_adjust) will find all ptes even if vma_merge succeeded before.
>
> In fact this may also work for fork. Fork will take the anon_vma root
> lock somehow to queue the child vma in the same_anon_vma. Doing so it
> will serialize against any running rmap_walk from all other cpus. The
> ordering has never been an issue for fork anyway, but it would have
> have been an issue for mremap in case vma_merge succeeded and src_vma
> != dst_vma, if vma_merge didn't act as a serialization point against
> rmap_walk (which I realized now).
>
> What makes it safe is again taking both PT locks simultanously. So it
> doesn't matter what rmap_walk searches, as long as the anon_vma_chain
> list cannot change by the time rmap_walk started.
>
> What I thought before was rmap_walk checking vma1 and then vma_merge
> succeed (where src vma is vma2 and dst vma is vma1, but vma1 is not a
> new vma queued at the end of same_anon_vma), move_page_tables moves
> the pte from vma2 to vma1, and then rmap_walk checks vma2. But again
> vma_merge won't be allowed to complete in the middle of rmap_walk, and
> so it cannot trigger and we can safely drop the patch. It wasn't
> immediate to think at the locks taken within vma_adjust sorry.
>

Oh, no, sorry. I think I was trying to clarify, in my first reply on
that thread, that we all agree the anon_vma chain is 100% stable while
rmap_walk() runs.
What is important, I think,  is the relative order of these three events:
1.  The time  rmap_walk() scans the src
2.  The time rmap_walk() scans the dst
3.  The time move_page_tables() move PTE from src vma to dst.

rmap_walk() scans dst (taking dst PTL) ---> move_page_tables() with
both PTLs ---> rmap_walk() scans src (taking src PTL)

will trigger this bug.  The race is there even if rmap_walk() scans
src ---> dst, but that race does no harm. I think Mel explained why the
good ordering is safe in his first reply to my post.

vma_merge() is only guilty of producing a wrong order of VMAs before
move_page_tables() and rmap_walk() begin to race; it does not itself
race with rmap_walk().

You see, this game can be really puzzling. Indeed, maybe it's time to
fall back on locks instead of playing with races. Just like in the good
old times, our classic OS textbooks told us that shared variables
deserve locks. :-)


Thanks,

Nai

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH] mremap: enforce rmap src/dst vma ordering in case of vma_merge succeeding in copy_vma
  2011-11-04 23:56                       ` Andrea Arcangeli
@ 2011-11-05  0:21                         ` Nai Xia
  2011-11-05  0:59                           ` Nai Xia
  2011-11-05  1:33                           ` Andrea Arcangeli
  0 siblings, 2 replies; 72+ messages in thread
From: Nai Xia @ 2011-11-05  0:21 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hugh Dickins, Mel Gorman, Pawel Sikora, Andrew Morton, linux-mm,
	jpiszcz, arekm, linux-kernel

On Sat, Nov 5, 2011 at 7:56 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> On Fri, Nov 04, 2011 at 12:31:04AM -0700, Hugh Dickins wrote:
>> On Mon, 31 Oct 2011, Andrea Arcangeli wrote:
>> > diff --git a/mm/mmap.c b/mm/mmap.c
>> > index a65efd4..a5858dc 100644
>> > --- a/mm/mmap.c
>> > +++ b/mm/mmap.c
>> > @@ -2339,7 +2339,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
>> >              */
>> >             if (vma_start >= new_vma->vm_start &&
>> >                 vma_start < new_vma->vm_end)
>> > +                   /*
>> > +                    * No need to call anon_vma_order_tail() in
>> > +                    * this case because the same PT lock will
>> > +                    * serialize the rmap_walk against both src
>> > +                    * and dst vmas.
>> > +                    */
>>
>> Really?  Please convince me: I just do not see what ensures that
>> the same pt lock covers both src and dst areas in this case.
>
> Right, vma being the same for src/dst doesn't mean the PT lock is the
> same, it might be if source pte entry fit in the same pagetable but
> maybe not if the vma is >2M (the max a single pagetable can point to).
>
>> >                     *vmap = new_vma;
>> > +           else
>> > +                   anon_vma_order_tail(new_vma);
>>
>> And if this puts new_vma in the right position for the normal
>> move_page_tables(), as anon_vma_clone() does in the block below,
>> aren't they both in exactly the wrong position for the abnormal
>> move_page_tables(), called to put ptes back where they were if
>> the original move_page_tables() fails?
>
> Failure paths. Good point, they'd need to be reversed again in that
> case.
>
>> It might be possible to argue that move_page_tables() can only
>> fail by failing to allocate memory for pud or pmd, and that (perhaps)
>> could only happen if the task was being OOM-killed and ran out of
>> reserves at this point, and if it's being OOM-killed then we don't
>> mind losing a migration entry for a moment... perhaps.
>
> Hmm no it wouldn't be ok, or I wouldn't want to risk that.
>
>> Certainly I'd agree that it's a very rare case.  But it feels wrong
>> to be attempting to fix the already unlikely issue, while ignoring
>> this aspect, or relying on such unrelated implementation details.
>
> Agreed.
>
>> Perhaps some further anon_vma_ordering could fix it up,
>> but that would look increasingly desperate.
>
> I think what Nai didn't consider in explaining this theoretical race
> that I noticed now is the anon_vma root lock taken by adjust_vma.
>
> If the merge succeeds adjust_vma will take the lock and flush away
> from all others CPUs any sign of rmap_walk before the move_page_tables
> can start.
>
> So it can't happen that you do rmap_walk, check vma1, mremap moves
> stuff from vma2 to vma1 (wrong order), and then rmap_walk continues
> checking vma2 where the pte won't be there anymore. It can't happen
> because mremap would block in vma_merge waiting the rmap_walk to
> complete. Before proceeding moving any pte. Thanks to the anon_vma
> lock already taken by adjust_vma.

Still, I think it's not rmap_walk() ---> mremap() ---> rmap_walk() that
triggers the bug, but this sequence of events would:

copy_vma() ---> rmap_walk() scans dst VMA ---> move_page_tables() moves
src to dst ---> rmap_walk() scans src VMA.  :D

I might be wrong. But thank you all for the time and patience in
playing this racing game with me. It's really an honor to exhaust my
mind on such a daunting thing with you. :)


Best Regards,

Nai

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH] mremap: enforce rmap src/dst vma ordering in case of vma_merge succeeding in copy_vma
  2011-11-05  0:21                         ` Nai Xia
@ 2011-11-05  0:59                           ` Nai Xia
  2011-11-05  1:33                           ` Andrea Arcangeli
  1 sibling, 0 replies; 72+ messages in thread
From: Nai Xia @ 2011-11-05  0:59 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hugh Dickins, Mel Gorman, Pawel Sikora, Andrew Morton, linux-mm,
	jpiszcz, arekm, linux-kernel, Linus Torvalds

On Sat, Nov 5, 2011 at 8:21 AM, Nai Xia <nai.xia@gmail.com> wrote:
> On Sat, Nov 5, 2011 at 7:56 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>> On Fri, Nov 04, 2011 at 12:31:04AM -0700, Hugh Dickins wrote:
>>> On Mon, 31 Oct 2011, Andrea Arcangeli wrote:
>>> > diff --git a/mm/mmap.c b/mm/mmap.c
>>> > index a65efd4..a5858dc 100644
>>> > --- a/mm/mmap.c
>>> > +++ b/mm/mmap.c
>>> > @@ -2339,7 +2339,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
>>> >              */
>>> >             if (vma_start >= new_vma->vm_start &&
>>> >                 vma_start < new_vma->vm_end)
>>> > +                   /*
>>> > +                    * No need to call anon_vma_order_tail() in
>>> > +                    * this case because the same PT lock will
>>> > +                    * serialize the rmap_walk against both src
>>> > +                    * and dst vmas.
>>> > +                    */
>>>
>>> Really?  Please convince me: I just do not see what ensures that
>>> the same pt lock covers both src and dst areas in this case.
>>
>> Right, vma being the same for src/dst doesn't mean the PT lock is the
>> same, it might be if source pte entry fit in the same pagetable but
>> maybe not if the vma is >2M (the max a single pagetable can point to).
>>
>>> >                     *vmap = new_vma;
>>> > +           else
>>> > +                   anon_vma_order_tail(new_vma);
>>>
>>> And if this puts new_vma in the right position for the normal
>>> move_page_tables(), as anon_vma_clone() does in the block below,
>>> aren't they both in exactly the wrong position for the abnormal
>>> move_page_tables(), called to put ptes back where they were if
>>> the original move_page_tables() fails?
>>
>> Failure paths. Good point, they'd need to be reversed again in that
>> case.
>>
>>> It might be possible to argue that move_page_tables() can only
>>> fail by failing to allocate memory for pud or pmd, and that (perhaps)
>>> could only happen if the task was being OOM-killed and ran out of
>>> reserves at this point, and if it's being OOM-killed then we don't
>>> mind losing a migration entry for a moment... perhaps.
>>
>> Hmm no it wouldn't be ok, or I wouldn't want to risk that.
>>
>>> Certainly I'd agree that it's a very rare case.  But it feels wrong
>>> to be attempting to fix the already unlikely issue, while ignoring
>>> this aspect, or relying on such unrelated implementation details.
>>
>> Agreed.
>>
>>> Perhaps some further anon_vma_ordering could fix it up,
>>> but that would look increasingly desperate.
>>
>> I think what Nai didn't consider in explaining this theoretical race
>> that I noticed now is the anon_vma root lock taken by adjust_vma.
>>
>> If the merge succeeds adjust_vma will take the lock and flush away
>> from all others CPUs any sign of rmap_walk before the move_page_tables
>> can start.
>>
>> So it can't happen that you do rmap_walk, check vma1, mremap moves
>> stuff from vma2 to vma1 (wrong order), and then rmap_walk continues
>> checking vma2 where the pte won't be there anymore. It can't happen
>> because mremap would block in vma_merge waiting the rmap_walk to
>> complete. Before proceeding moving any pte. Thanks to the anon_vma
>> lock already taken by adjust_vma.
>
> Still,  I think it's not rmap_walk() ---> mremap() --> rmap_walk() that trigger
> the bug,  but this events would:
>
> copy_vma() ---> rmap_walk() scan dst VMA --> move_page_tables() moves src to dst
> --->  rmap_walk() scan src VMA.  :D

OK, I think I need to be more precise: your last reasoning only ensures
that mremap as a whole cannot interleave with rmap_walk(). But I think
nothing prevents move_page_tables() from doing so. As long as
copy_vma() produces a wrong ordering, the subsequent race between
rmap_walk() and move_page_tables() may trigger the bug.

Do you agree?



>
> I might be wrong. But thank you all for the time and patience for
> playing this racing game
> with me. It's really an honor to exhaust my mind on a daunting thing
> with you. :)
>
>
> Best Regards,
>
> Nai
>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH] mremap: enforce rmap src/dst vma ordering in case of vma_merge succeeding in copy_vma
  2011-11-05  0:21                         ` Nai Xia
  2011-11-05  0:59                           ` Nai Xia
@ 2011-11-05  1:33                           ` Andrea Arcangeli
  2011-11-05  2:00                             ` Nai Xia
  1 sibling, 1 reply; 72+ messages in thread
From: Andrea Arcangeli @ 2011-11-05  1:33 UTC (permalink / raw)
  To: Nai Xia
  Cc: Hugh Dickins, Mel Gorman, Pawel Sikora, Andrew Morton, linux-mm,
	jpiszcz, arekm, linux-kernel

On Sat, Nov 05, 2011 at 08:21:03AM +0800, Nai Xia wrote:
> copy_vma() ---> rmap_walk() scan dst VMA --> move_page_tables() moves src to dst
> --->  rmap_walk() scan src VMA.  :D

Hmm yes. I think I got on the wrong track because I focused too much
on that line you started talking about, the *vmap = new_vma; you said
I had to reorder stuff there too, and that didn't make sense.

The reason it doesn't make sense is that it can't be ok to reorder
stuff when *vmap = new_vma (i.e. new_vma == old_vma). So, since I
didn't need to reorder in that case, I thought I could extrapolate that
it was always ok.

But the opposite is true: that case can't be solved.

Can it really happen that vma_merge will pack (prev_vma, new_range,
old_vma) together in a single vma? (i.e. prev_vma extended to
old_vma->vm_end)

Even if there's no prev_vma in the picture (but that's the extreme
case) it cannot be safe: i.e. a (new_range, old_vma) or (old_vma,
new_range).

1 single "vma" for src and dst virtual ranges, means 1 single
vma->vm_pgoff. But we've two virtual addresses and two ptes. So the
same page->index can't work for both if the vma->vm_pgoff is the
same.
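
To put numbers on it, here is a userspace sketch of the index
arithmetic (illustrative values and stand-in types, not kernel code):

#include <stdio.h>

#define PAGE_SHIFT 12

struct vma_sketch {		/* stand-in for vm_area_struct */
	unsigned long vm_start;
	unsigned long vm_pgoff;	/* in pages */
};

/* index assigned at fault time (in the spirit of linear_page_index) */
static unsigned long page_index(struct vma_sketch *v, unsigned long addr)
{
	return v->vm_pgoff + ((addr - v->vm_start) >> PAGE_SHIFT);
}

/* address rmap derives back from the index (in the spirit of vma_address) */
static unsigned long vma_addr(struct vma_sketch *v, unsigned long index)
{
	return v->vm_start + ((index - v->vm_pgoff) << PAGE_SHIFT);
}

int main(void)
{
	struct vma_sketch v = { 0x100000, 0x100 };
	unsigned long src = 0x180000, dst = 0x300000;
	unsigned long idx = page_index(&v, src);

	/* the index maps back to the src address only ... */
	printf("idx %#lx -> addr %#lx\n", idx, vma_addr(&v, idx));
	/* ... so a pte moved to dst inside the same vma is invisible to
	   rmap: page->index would have to change, and mremap doesn't
	   change it. */
	printf("pte now at %#lx, rmap still looks at %#lx\n",
	       dst, vma_addr(&v, idx));
	return 0;
}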

So regardless of the ordering here we're dealing with something more
fundamental.

If rmap_walk runs immediately after vma_merge completes and releases
the anon_vma_lock, it won't find any pte in the vma anymore. No matter
the order.

I thought about this before and didn't mention it, but in light of the
above issue I'm starting to think this is the only possible correct
solution to the problem: we should just never call vma_merge before
move_page_tables, and do the merge by hand later, after mremap is
complete.

The only safe way to do it is to have _two_ different vmas, with two
different ->vm_pgoff. Then it will work. And by always creating a new
vma we'll always have it queued at the end, and it'll be safe for the
same reasons fork is safe.

Always allocate a new vma, and then after the whole vma copy is
complete, look whether we can merge and free some vma. After the fact,
which means we can't use vma_merge anymore; vma_merge assumes the
new_range is "virtual" and no vma is mapped there, I think. Anyway
that's an implementation issue. In some unlikely case we'll allocate
one more vma than before, and we'll free it once mremap is finished,
but that's a small problem compared to solving this once and for all.
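
In pseudo-code the idea is roughly this; the helper names are made up
just to show the flow, they don't exist anywhere:

/*
 * Sketch only, not a patch: always create a fresh dst vma so that
 * anon_vma_clone() queues it after the src vma (the same reason fork
 * is safe), move the ptes, and only afterwards try to merge.
 */
static unsigned long move_vma_sketch(struct vm_area_struct *vma,
		unsigned long old_addr, unsigned long old_len,
		unsigned long new_len, unsigned long new_addr)
{
	struct vm_area_struct *new_vma;

	/* hypothetical: like copy_vma() but never merging with neighbours */
	new_vma = copy_vma_no_merge(vma, new_addr, new_len);
	if (!new_vma)
		return -ENOMEM;

	move_page_tables(vma, old_addr, new_vma, new_addr, old_len);

	/* hypothetical: look for a merge only after the ptes have moved */
	merge_vma_after_mremap(new_vma);

	return new_addr;
}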

And that will fix it without ordering games, and it'll fix the
*vmap = new_vma case too. That case really tripped me up, as I was
assuming *that* one was correct.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH] mremap: enforce rmap src/dst vma ordering in case of vma_merge succeeding in copy_vma
  2011-11-05  1:33                           ` Andrea Arcangeli
@ 2011-11-05  2:00                             ` Nai Xia
  2011-11-07 13:14                               ` Mel Gorman
  0 siblings, 1 reply; 72+ messages in thread
From: Nai Xia @ 2011-11-05  2:00 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hugh Dickins, Mel Gorman, Pawel Sikora, Andrew Morton, linux-mm,
	jpiszcz, arekm, linux-kernel

On Sat, Nov 5, 2011 at 9:33 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> On Sat, Nov 05, 2011 at 08:21:03AM +0800, Nai Xia wrote:
>> copy_vma() ---> rmap_walk() scan dst VMA --> move_page_tables() moves src to dst
>> --->  rmap_walk() scan src VMA.  :D
>
> Hmm yes. I think I got in the wrong track because I focused too much
> on that line you started talking about, the *vmap = new_vma, you said
> I had to reorder stuff there too, and that didn't make sense.

Oh, I think you misunderstood me there. I was just saying:

if (*vmap = new_vma), then _NO_ PTEs need to be moved afterwards,
because the vma has not been faulted at all yet. Otherwise, it breaks
the page->index semantics in the way I explained in my reply to Hugh.

So nothing needs to be added there, but the reason is the reasoning
above, not same-PTL locking...

And for this case alone, I think the proper place to solve it is
outside move_vma() and inside do_mremap(), with only vma_adjust() and
vma_merge() like stuff, because it really does not involve
move_page_tables().

>
> The reason it doesn't make sense is that it can't be ok to reorder
> stuff when *vmap = new_vma (i.e. new_vma = old_vma). So if I didn't
> need to reorder in that case I thought I could extrapolate it was
> always ok.
>
> But the opposite is true: that case can't be solved.
>
> Can it really happen that vma_merge will pack (prev_vma, new_range,
> old_vma) together in a single vma? (i.e. prev_vma extended to
> old_vma->vm_end)
>
> Even if there's no prev_vma in the picture (but that's the extreme
> case) it cannot be safe: i.e. a (new_range, old_vma) or (old_vma,
> new_range).
>
> 1 single "vma" for src and dst virtual ranges, means 1 single
> vma->vm_pgoff. But we've two virtual addresses and two ptes. So the
> same page->index can't work for both if the vma->vm_pgoff is the
> same.
>
> So regardless of the ordering here we're dealing with something more
> fundamental.
>
> If rmap_walk runs immediately after vma_merge completes and releases
> the anon_vma_lock, it won't find any pte in the vma anymore. No matter
> the order.
>
> I thought at this before and I didn't mention it but at the light of
> the above issue I start to think this is the only possible correct
> solution to the problem. We should just never call vma_merge before
> move_page_tables. And do the merge by hand later after mremap is
> complete.
>
> The only safe way to do it is to have _two_ different vmas, with two
> different ->vm_pgoff. Then it will work. And by always creating a new
> vma we'll always have it queued at the end, and it'll be safe for the
> same reasons fork is safe.
>
> Always allocate a new vma, and then after the whole vma copy is
> complete, look if we can merge and free some vma. After the fact, so
> it means we can't use vma_merge anymore. vma_merge assumes the
> new_range is "virtual" and no vma is mapped there I think. Anyway
> that's an implementation issue. In some unlikely case we'll allocate 1
> more vma than before, and we'll free it once mremap is finished, but
> that's small problem compared to solving this once and for all.
>
> And that will fix it without ordering games and it'll fix the *vmap=
> new_vma case too. That case really tripped on me as I was assuming
> *that* was correct.

Yes. "Allocating a new vma, copy first and merge later " seems
another solution without the tricky reordering. But you know,
I now share some of Hugh's feeling that maybe we are too
desperate using racing in places where locks are simpler
and guaranteed to be safe.

But I think Mel indicated that anon_vma_locking might be
harmful to JVM SMP performance.
How severe you expect this to be, Mel ?


Thanks

Nai

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH] mremap: enforce rmap src/dst vma ordering in case of vma_merge succeeding in copy_vma
  2011-11-04 15:59                         ` Pawel Sikora
@ 2011-11-05  2:21                           ` Nai Xia
  0 siblings, 0 replies; 72+ messages in thread
From: Nai Xia @ 2011-11-05  2:21 UTC (permalink / raw)
  To: Pawel Sikora
  Cc: Hugh Dickins, Andrea Arcangeli, Mel Gorman, Andrew Morton,
	linux-mm, jpiszcz, arekm, linux-kernel

On Fri, Nov 4, 2011 at 11:59 PM, Pawel Sikora <pluto@agmk.net> wrote:
> On Friday 04 of November 2011 22:34:54 Nai Xia wrote:
>> On Fri, Nov 4, 2011 at 3:31 PM, Hugh Dickins <hughd@google.com> wrote:
>> > On Mon, 31 Oct 2011, Andrea Arcangeli wrote:
>> >
>> >> migrate was doing a rmap_walk with speculative lock-less access on
>> >> pagetables. That could lead it to not serialize properly against
>> >> mremap PT locks. But a second problem remains in the order of vmas in
>> >> the same_anon_vma list used by the rmap_walk.
>> >
>> > I do think that Nai Xia deserves special credit for thinking deeper
>> > into this than the rest of us (before you came back): something like
>> >
>> > Issue-conceived-by: Nai Xia <nai.xia@gmail.com>
>>
>> Thanks! ;-)
>
> hi all,
>
> i'm still testing anon_vma_order_tail() patch. 10 days of heavy processing
> and machine is still stable but i've recorded some interesting thing:
>
> $ uname -a
> Linux hal 3.0.8-vs2.3.1-dirty #6 SMP Tue Oct 25 10:07:50 CEST 2011 x86_64 AMD_Opteron(tm)_Processor_6128 PLD Linux
> $ uptime
>  16:47:44 up 10 days,  4:21,  5 users,  load average: 19.55, 19.15, 18.76
> $ ps aux|grep migration
> root         6  0.0  0.0      0     0 ?        S    Oct25   0:00 [migration/0]
> root         8 68.0  0.0      0     0 ?        S    Oct25 9974:01 [migration/1]
> root        13 35.4  0.0      0     0 ?        S    Oct25 5202:15 [migration/2]
> root        17 71.4  0.0      0     0 ?        S    Oct25 10479:10 [migration/3]
> root        21 70.7  0.0      0     0 ?        S    Oct25 10370:14 [migration/4]
> root        25 66.1  0.0      0     0 ?        S    Oct25 9698:11 [migration/5]
> root        29 70.1  0.0      0     0 ?        S    Oct25 10283:22 [migration/6]
> root        33 62.6  0.0      0     0 ?        S    Oct25 9190:28 [migration/7]
> root        37  0.0  0.0      0     0 ?        S    Oct25   0:00 [migration/8]
> root        41 97.7  0.0      0     0 ?        S    Oct25 14338:30 [migration/9]
> root        45 29.2  0.0      0     0 ?        S    Oct25 4290:00 [migration/10]
> root        49 68.7  0.0      0     0 ?        S    Oct25 10081:38 [migration/11]
> root        53 98.7  0.0      0     0 ?        S    Oct25 14477:25 [migration/12]
> root        57 70.0  0.0      0     0 ?        S    Oct25 10272:57 [migration/13]
> root        61 69.7  0.0      0     0 ?        S    Oct25 10232:29 [migration/14]
> root        65 70.9  0.0      0     0 ?        S    Oct25 10403:09 [migration/15]
>
> wow, 71..241 hours in migration processes after 10 days of uptime?
> machine has 2 opteron nodes with 32GB ram paired with each processor.
> i suppose that it spends a lot of time on migration (processes + memory pages).

Hi Paweł, it seems to me like an issue related to load balancing, but
it might not be directly related to this bug, or even to abnormal page
migration at all. Could this be a scheduler & interrupts issue?

But oh well, actually I have never touched a 16-core machine doing
heavy processing, so I cannot tell whether this result is normal or
not.

Maybe you should ask a broader range of people?

BR,
Nai

>
> BR,
> Paweł.
>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH] mremap: enforce rmap src/dst vma ordering in case of vma_merge succeeding in copy_vma
  2011-11-05  0:09                             ` Nai Xia
@ 2011-11-05  2:21                               ` Hugh Dickins
  2011-11-05  3:07                                 ` Andrea Arcangeli
  2011-11-05 17:06                                 ` Andrea Arcangeli
  0 siblings, 2 replies; 72+ messages in thread
From: Hugh Dickins @ 2011-11-05  2:21 UTC (permalink / raw)
  To: Nai Xia
  Cc: Andrea Arcangeli, Mel Gorman, Pawel Sikora, Andrew Morton,
	linux-mm, jpiszcz, arekm, linux-kernel


On Sat, 5 Nov 2011, Nai Xia wrote:
> On Sat, Nov 5, 2011 at 4:54 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> > On Fri, Nov 04, 2011 at 12:16:03PM -0700, Hugh Dickins wrote:
> >> On Fri, 4 Nov 2011, Nai Xia wrote:
> >> > On Fri, Nov 4, 2011 at 3:31 PM, Hugh Dickins <hughd@google.com> wrote:
> >> > > On Mon, 31 Oct 2011, Andrea Arcangeli wrote:
> >> > >> @@ -2339,7 +2339,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
> >> > >>                */
> >> > >>               if (vma_start >= new_vma->vm_start &&
> >> > >>                   vma_start < new_vma->vm_end)
> >> > >> +                     /*
> >> > >> +                      * No need to call anon_vma_order_tail() in
> >> > >> +                      * this case because the same PT lock will
> >> > >> +                      * serialize the rmap_walk against both src
> >> > >> +                      * and dst vmas.
> >> > >> +                      */
> >> > >
> >> > > Really?  Please convince me: I just do not see what ensures that
> >> > > the same pt lock covers both src and dst areas in this case.
> >> >
> >> > At the first glance that rmap_walk does travel this merged VMA
> >> > once...
> >> > But, Now, Wait...., I am actually really puzzled that this case can really
> >> > happen at all, you see that vma_merge() does not break the validness
> >> > between page->index and its VMA. So if this can really happen,
> >> > a page->index should be valid in both areas in a same VMA.
> >> > It's strange to imagine that a PTE is copy inside a _same_ VMA
> >> > and page->index is valid at both old and new places.
> >>
> >> Yes, I think you are right, thank you for elucidating it.
> >>
> >> That was a real case when we wrote copy_vma(), when rmap was using
> >> pte_chains; but once anon_vma came in, and imposed vm_pgoff matching
> >> on anonymous mappings too, it became dead code.  With linear vm_pgoff
> >> matching, you cannot fit a range in two places within the same vma.
> >> (And even the non-linear case relies upon vm_pgoff defaults.)
> >>
> >> So we could simplify the copy_vma() interface a little now (get rid of
> >> that nasty **vmap): I'm not quite sure whether we ought to do that,
> >> but certainly Andrea's comment there should be updated (if he also
> >> agrees with your analysis).
> >
> > The vmap should only trigger when the prev vma (prev relative to src
> > vma) is extended at the end to make space for the dst range. And by
> > extending it, we filled the hole between the prev vma and "src"
> > vma. So then the prev vma becomes the "src vma" and also the "dst
> > vma". So we can't keep working with the old "vma" pointer after that.
> >
> > I doubt it can be removed without crashing in the above case.
> 
> Yes, this line itself should not be removed. As I explained,
> pgoff adjustment at the top of the copy_vma() for non-faulted
> vma will lead to this case.

Ah, thank you, that's what I was asking you to point me to, the place
I was missing that recalculates pgoff: at the head of copy_vma() itself.

Yes, if that adjustment remains (no reason why not), then we cannot
remove the *vmap = new_vma; but that is the only case that nowadays
can need the *vmap = new_vma (agreed?), which does deserve a comment.


> But we do not need to worry
> about the move_page_tables() should after this happens.
> And so no lines need to be added here. But maybe the
> documentation should be changed in your original patch
> to clarify this. Reasoning with PTL locks for this case might
> be somewhat misleading.

Right, there are no ptes there yet, so we cannot miss any.

> 
>  Furthermore, the move_page_tables() call following this case
> might better be totally avoided for code readability and it's
> simple to judge with (vma == new_vma)
> 
> Do you agree? :)

Well, it's true that looking at pagetables in this case is just
a waste of time; but personally I'd prefer to add more comment
than special case handling for this.

> 
> >
> > I thought some more about it and what I missed I think is the
> > anon_vma_merge in vma_adjust. What that anon_vma_merge, rmap_walk will
> > have to complete before we can start moving the ptes. And so rmap_walk
> > when starts again from scratch (after anon_vma_merge run in
> > vma_adjust) will find all ptes even if vma_merge succeeded before.
> >
> > In fact this may also work for fork. Fork will take the anon_vma root
> > lock somehow to queue the child vma in the same_anon_vma. Doing so it
> > will serialize against any running rmap_walk from all other cpus. The
> > ordering has never been an issue for fork anyway, but it would have
> > have been an issue for mremap in case vma_merge succeeded and src_vma
> > != dst_vma, if vma_merge didn't act as a serialization point against
> > rmap_walk (which I realized now).
> >
> > What makes it safe is again taking both PT locks simultanously. So it
> > doesn't matter what rmap_walk searches, as long as the anon_vma_chain
> > list cannot change by the time rmap_walk started.
> >
> > What I thought before was rmap_walk checking vma1 and then vma_merge
> > succeed (where src vma is vma2 and dst vma is vma1, but vma1 is not a
> > new vma queued at the end of same_anon_vma), move_page_tables moves
> > the pte from vma2 to vma1, and then rmap_walk checks vma2. But again
> > vma_merge won't be allowed to complete in the middle of rmap_walk, and
> > so it cannot trigger and we can safely drop the patch. It wasn't
> > immediate to think at the locks taken within vma_adjust sorry.
> >

I found Andrea's "anon_vma_merge" reply very hard to understand; but
it looks like he now accepts that it was mistaken, or on the wrong
track at least...

> 
> Oh, no, sorry. I think I was trying to clarify in the first reply on
> that thread that
> we all agree that anon_vma chain is 100% stable when doing rmap_walk().
> What is important, I think,  is the relative order of these three events:
> 1.  The time  rmap_walk() scans the src
> 2.  The time rmap_walk() scans the dst
> 3.  The time move_page_tables() move PTE from src vma to dst.

... after you set us straight again with this.

> 
> rmap_walk() scans dst( taking dst PTL) ---> move_page_tables() with
> both PTLs ---> rmap_walk() scans src(taking src PTL)
> 
> will trigger this bug.  The racing is there even if rmap_walk() scans src--->dst
> but that racing does not harm. I think Mel explained why it's safe for good
> ordering in his first reply to my post.
> 
> vma_merge() is only guilty for giving a wrong order of VMAs before
> move_page_tables() and rmap_walk() begin to race, itself does not race
> with rmap_walk().
> 
> You see, it seems this game might be really puzzling. Indeed, maybe it's time
> to fall back on locks instead of playing with racing. Just like the
> good old time,
> our classic OS text book told us that shared variables deserve locks. :-)

That's my preference, yes: this mail thread seems to cry out for that!

Hugh

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH] mremap: enforce rmap src/dst vma ordering in case of vma_merge succeeding in copy_vma
  2011-11-05  2:21                               ` Hugh Dickins
@ 2011-11-05  3:07                                 ` Andrea Arcangeli
  2011-11-05 17:06                                 ` Andrea Arcangeli
  1 sibling, 0 replies; 72+ messages in thread
From: Andrea Arcangeli @ 2011-11-05  3:07 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Nai Xia, Mel Gorman, Pawel Sikora, Andrew Morton, linux-mm,
	jpiszcz, arekm, linux-kernel

On Fri, Nov 04, 2011 at 07:21:28PM -0700, Hugh Dickins wrote:
> I found Andrea's "anon_vma_merge" reply very hard to understand; but
> it looks like he now accepts that it was mistaken, or on the wrong
> track at least...

No matter how we get the order right, we still need to reverse the
order in case of error without taking the lock. So even allocating a
new vma every time wouldn't be enough to get out of the ordering
games (it would be enough in the non-error path of course...).

So there are a couple of ways:

1) Keep my patch (adjust comment) and add a second ordering call in
   the error path. Cleanup the *vmap case.

2) Always allocate a new vma, merge later, and still keep my patch for
   reversing the order in the error path only (not a huge improvement
   if we still have to reverse the order). So this now looks like the
   worst option in light of the error path, which would give trouble
   by going the opposite way... again.

3) Return to your fix that takes the anon_vma lock during the pte
   moves

Fixing my patch requires just a one-liner for the error path; it's not
that the patch was wrong, in fact it reduced the window even more, it
just missed a one-liner in the error path.

But it's still doing reordering. Which I think is safe, and not
fundamentally different in ordering terms from the old anon_vma logic
before _chain (which is why this bug could have triggered before too).
But it's certainly more complex than taking the anon_vma lock around
every pagetable move, that's for sure. fork will still rely on the
ordering, but fork has a super easy life compared to mremap, which
goes both ways and has vma_merge in it too, which makes the vma order
non-deterministic.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH] mremap: enforce rmap src/dst vma ordering in case of vma_merge succeeding in copy_vma
  2011-11-05  2:21                               ` Hugh Dickins
  2011-11-05  3:07                                 ` Andrea Arcangeli
@ 2011-11-05 17:06                                 ` Andrea Arcangeli
  2011-12-08  3:24                                   ` David Rientjes
  2011-12-09  0:08                                   ` Andrew Morton
  1 sibling, 2 replies; 72+ messages in thread
From: Andrea Arcangeli @ 2011-11-05 17:06 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Mel Gorman, Pawel Sikora, Andrew Morton, linux-mm, jpiszcz,
	arekm, linux-kernel

migrate was doing a rmap_walk with speculative lock-less access on
pagetables. That could lead it to not serialize properly against
mremap PT locks. But a second problem remains in the order of vmas in
the same_anon_vma list used by the rmap_walk.

If vma_merge succeeds in copy_vma, the src vma could be placed after
the dst vma in the same_anon_vma list. That could still lead migrate to
miss some pte.

This patch adds an anon_vma_moveto_tail() function to force the dst vma
to the end of the list before mremap starts, to solve the problem.

If the mremap is very large and there are lots of parents or children
sharing the anon_vma root lock, this should still scale better than
taking the anon_vma root lock around every pte copy, practically for
the whole duration of mremap.

Update: Hugh noticed special care is needed in the error path, where
move_page_tables goes in the reverse direction; a second
anon_vma_moveto_tail() call is needed there.

This program exercises the anon_vma_moveto_tail:

===

#define _GNU_SOURCE		/* for the 5-argument mremap()/MREMAP_FIXED */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/mman.h>

#ifndef SIZE			/* SIZE was not defined in the posting */
#define SIZE (128UL*1024*1024)
#endif

int main()
{
	static struct timeval oldstamp, newstamp;
	long diffsec;
	char *p, *p2, *p3, *p4;
	if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
		perror("memalign"), exit(1);
	if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
		perror("memalign"), exit(1);
	if (posix_memalign((void **)&p3, 2*1024*1024, SIZE))
		perror("memalign"), exit(1);

	memset(p, 0xff, SIZE);
	printf("%p\n", p);
	memset(p2, 0xff, SIZE);
	memset(p3, 0x77, 4096);
	if (memcmp(p, p2, SIZE))
		printf("error\n");
	p4 = mremap(p+SIZE/2, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
	if (p4 != p3)
		perror("mremap"), exit(1);
	p4 = mremap(p4, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p+SIZE/2);
	if (p4 != p+SIZE/2)
		perror("mremap"), exit(1);
	if (memcmp(p, p2, SIZE))
		printf("error\n");
	printf("ok\n");

	return 0;
}
===

$ perf probe -a anon_vma_moveto_tail
Add new event:
  probe:anon_vma_moveto_tail (on anon_vma_moveto_tail)

You can now use it on all perf tools, such as:

        perf record -e probe:anon_vma_moveto_tail -aR sleep 1

$ perf record -e probe:anon_vma_moveto_tail -aR ./anon_vma_moveto_tail
0x7f2ca2800000
ok
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.043 MB perf.data (~1860 samples) ]
$ perf report --stdio
   100.00%  anon_vma_moveto  [kernel.kallsyms]  [k] anon_vma_moveto_tail

Reported-by: Nai Xia <nai.xia@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/rmap.h |    1 +
 mm/mmap.c            |   22 ++++++++++++++++++++--
 mm/mremap.c          |    1 +
 mm/rmap.c            |   45 +++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 67 insertions(+), 2 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 2148b12..1afb995 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -120,6 +120,7 @@ void anon_vma_init(void);	/* create anon_vma_cachep */
 int  anon_vma_prepare(struct vm_area_struct *);
 void unlink_anon_vmas(struct vm_area_struct *);
 int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
+void anon_vma_moveto_tail(struct vm_area_struct *);
 int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
 void __anon_vma_link(struct vm_area_struct *);
 
diff --git a/mm/mmap.c b/mm/mmap.c
index 3c0061f..948513d 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2322,13 +2322,16 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 	struct vm_area_struct *new_vma, *prev;
 	struct rb_node **rb_link, *rb_parent;
 	struct mempolicy *pol;
+	bool faulted_in_anon_vma = true;
 
 	/*
 	 * If anonymous vma has not yet been faulted, update new pgoff
 	 * to match new location, to increase its chance of merging.
 	 */
-	if (!vma->vm_file && !vma->anon_vma)
+	if (!vma->vm_file && !vma->anon_vma) {
 		pgoff = addr >> PAGE_SHIFT;
+		faulted_in_anon_vma = false;
+	}
 
 	find_vma_prepare(mm, addr, &prev, &rb_link, &rb_parent);
 	new_vma = vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
@@ -2338,8 +2341,23 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 		 * Source vma may have been merged into new_vma
 		 */
 		if (vma_start >= new_vma->vm_start &&
-		    vma_start < new_vma->vm_end)
+		    vma_start < new_vma->vm_end) {
+			/*
+			 * The only way we can get a vma_merge with
+			 * self during an mremap is if the vma hasn't
+			 * been faulted in yet and we were allowed to
+			 * reset the dst vma->vm_pgoff to the
+			 * destination address of the mremap to allow
+			 * the merge to happen. mremap must change the
+			 * vm_pgoff linearity between src and dst vmas
+			 * (in turn preventing a vma_merge) to be
+			 * safe. It is only safe to keep the vm_pgoff
+			 * linear if there are no pages mapped yet.
+			 */
+			VM_BUG_ON(faulted_in_anon_vma);
 			*vmap = new_vma;
+		} else
+			anon_vma_moveto_tail(new_vma);
 	} else {
 		new_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
 		if (new_vma) {
diff --git a/mm/mremap.c b/mm/mremap.c
index d6959cb..d845537 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -225,6 +225,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 		 * which will succeed since page tables still there,
 		 * and then proceed to unmap new area instead of old.
 		 */
+		anon_vma_moveto_tail(vma);
 		move_page_tables(new_vma, new_addr, vma, old_addr, moved_len);
 		vma = new_vma;
 		old_len = new_len;
diff --git a/mm/rmap.c b/mm/rmap.c
index 6541cf7..9832f03 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -272,6 +272,51 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
 }
 
 /*
+ * Some rmap walks need to find all ptes/hugepmds without false
+ * negatives (like migrate and split_huge_page) while running
+ * concurrently with operations that copy or move pagetables (like
+ * mremap() and fork()). Those walks depend on the anon_vma "same_anon_vma"
+ * list to be in a certain order: the dst_vma must be placed after the
+ * src_vma in the list. This is always guaranteed by fork() but
+ * mremap() needs to call this function to enforce it in case the
+ * dst_vma isn't newly allocated and chained with the anon_vma_clone()
+ * function but just an extension of a pre-existing vma through
+ * vma_merge.
+ *
+ * NOTE: the same_anon_vma list can still be changed by other
+ * processes while mremap runs because mremap doesn't hold the
+ * anon_vma mutex to prevent modifications to the list while it
+ * runs. All we need to enforce is that the relative order of this
+ * process vmas isn't changing (we don't care about other vmas
+ * order). Each vma corresponds to an anon_vma_chain structure so
+ * there's no risk that other processes calling anon_vma_moveto_tail()
+ * and changing the same_anon_vma list under mremap() will screw with
+ * the relative order of this process's vmas in the list, because
+ * they can't alter the order of any vma that belongs to this
+ * process. And there can't be another anon_vma_moveto_tail() running
+ * concurrently with mremap() coming from this process because we hold
+ * the mmap_sem for the whole mremap(). fork() ordering dependency
+ * also shouldn't be affected because fork() only cares that the
+ * parent vmas are placed in the list before the child vmas and
+ * anon_vma_moveto_tail() won't reorder vmas from either the fork()
+ * parent or child.
+ */
+void anon_vma_moveto_tail(struct vm_area_struct *dst)
+{
+	struct anon_vma_chain *pavc;
+	struct anon_vma *root = NULL;
+
+	list_for_each_entry_reverse(pavc, &dst->anon_vma_chain, same_vma) {
+		struct anon_vma *anon_vma = pavc->anon_vma;
+		VM_BUG_ON(pavc->vma != dst);
+		root = lock_anon_vma_root(root, anon_vma);
+		list_del(&pavc->same_anon_vma);
+		list_add_tail(&pavc->same_anon_vma, &anon_vma->head);
+	}
+	unlock_anon_vma_root(root);
+}
+
+/*
  * Attach vma to its own anon_vma, as well as to the anon_vmas that
  * the corresponding VMA in the parent process is attached to.
  * Returns 0 on success, non-zero on failure.
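
For context, the kind of rmap walk whose ordering this patch protects looks
roughly like rmap_walk_anon() in mm/rmap.c; a simplified sketch (not the
actual code), with the migrate callback shown inline:

	struct anon_vma_chain *avc;

	anon_vma_lock(anon_vma);
	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
		struct vm_area_struct *vma = avc->vma;
		unsigned long address = vma_address(page, vma);

		if (address == -EFAULT)
			continue;	/* page not mapped in this vma */
		/*
		 * remove_migration_pte() fixes up the migration entry
		 * here.  If mremap moves the pte from src_vma to dst_vma
		 * while this walk runs, the pte is only guaranteed to be
		 * seen when dst_vma sits after src_vma in this list,
		 * which is what anon_vma_moveto_tail() enforces.
		 */
		remove_migration_pte(page, vma, address, old);
	}
	anon_vma_unlock(anon_vma);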

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* Re: [PATCH] mremap: enforce rmap src/dst vma ordering in case of vma_merge succeeding in copy_vma
  2011-11-05  2:00                             ` Nai Xia
@ 2011-11-07 13:14                               ` Mel Gorman
  2011-11-07 15:42                                 ` Andrea Arcangeli
  0 siblings, 1 reply; 72+ messages in thread
From: Mel Gorman @ 2011-11-07 13:14 UTC (permalink / raw)
  To: Nai Xia
  Cc: Andrea Arcangeli, Hugh Dickins, Pawel Sikora, Andrew Morton,
	linux-mm, jpiszcz, arekm, linux-kernel

On Sat, Nov 05, 2011 at 10:00:52AM +0800, Nai Xia wrote:
> > <SNIP>
> > The only safe way to do it is to have _two_ different vmas, with two
> > different ->vm_pgoff. Then it will work. And by always creating a new
> > vma we'll always have it queued at the end, and it'll be safe for the
> > same reasons fork is safe.
> >
> > Always allocate a new vma, and then after the whole vma copy is
> > complete, look if we can merge and free some vma. After the fact, so
> > it means we can't use vma_merge anymore. vma_merge assumes the
> > new_range is "virtual" and no vma is mapped there I think. Anyway
> > that's an implementation issue. In some unlikely case we'll allocate 1
> > more vma than before, and we'll free it once mremap is finished, but
> > that's small problem compared to solving this once and for all.
> >
> > And that will fix it without ordering games and it'll fix the *vmap=
> > new_vma case too. That case really tripped on me as I was assuming
> > *that* was correct.
> 
> Yes. "Allocating a new vma, copy first and merge later " seems
> another solution without the tricky reordering. But you know,
> I now share some of Hugh's feeling that maybe we are too
> desperate using racing in places where locks are simpler
> and guaranteed to be safe.
> 

I'm tending to agree. The number of cases that must be kept in mind
is getting too tricky. Taking the anon_vma lock may be slower but at
the risk of sounding chicken, it's easier to understand.

> But I think Mel indicated that anon_vma_locking might be
> harmful to JVM SMP performance.
> How severe you expect this to be, Mel ?
> 

I would only expect it to be a problem during garbage collection when
there is a greater likelihood that mremap is heavily used. While it
would have been nice to avoid additional overhead in mremap, I don't
think the JVM GC case on its own is sufficient justification to avoid
taking the anon_vma lock.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH] mremap: enforce rmap src/dst vma ordering in case of vma_merge succeeding in copy_vma
  2011-11-07 13:14                               ` Mel Gorman
@ 2011-11-07 15:42                                 ` Andrea Arcangeli
  2011-11-07 16:28                                   ` Mel Gorman
  0 siblings, 1 reply; 72+ messages in thread
From: Andrea Arcangeli @ 2011-11-07 15:42 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Nai Xia, Hugh Dickins, Pawel Sikora, Andrew Morton, linux-mm,
	jpiszcz, arekm, linux-kernel

On Mon, Nov 07, 2011 at 01:14:13PM +0000, Mel Gorman wrote:
> I'm tending to agree. The number of cases that must be kept in mind
> is getting too tricky. Taking the anon_vma lock may be slower but at
> the risk of sounding chicken, it's easier to understand.
> 
> > But I think Mel indicated that anon_vma_locking might be
> > harmful to JVM SMP performance.
> > How severe you expect this to be, Mel ?
> > 
> 
> I would only expect it to be a problem during garbage collection when
> there is a greater likelihood that mremap is heavily used. While it
> would have been nice to avoid additional overhead in mremap, I don't
> think the JVM GC case on its own is sufficient justification to avoid
> taking the anon_vma lock.

Adding a one-liner in the error path and a bugcheck in the *vmap case
doesn't seem the end of the world compared to my previous fix that you
acked. I suspect last Friday I was probably confused for a little
while because I was recovering from some flu I picked up with the cold
weather, and the confusion around the vmap case, which I assumed was
safe (not only when no page had been faulted in yet), also didn't help.

BTW, with regard to those comments about the human brain being all weak,
well, I doubt a monkey brain would work better, so in the absence of some
alien brain which may work better than ours, we should concentrate and
handle it :). The ordering constraints aren't going away no matter what
we do in mremap; fork has the exact same issue, except it doesn't
require reordering, but my patch documents that.

NOTE: if we could remove _all_ the ordering dependencies between the
vmas pointed to by the anon_vma_chains queued in the same_anon_vma list
and all the rmap_walks, then I would be more inclined to agree on
keeping the simpler way, because then we would stop playing the
ordering games altogether. But regardless of mremap, we'll still be
playing ordering games with fork vs rmap_walk, so we can exploit that
to run a bit faster in mremap too and play the same ordering game
(though I admit it's more complex to play the ordering games in mremap,
as it requires 2 more function calls for the vma_merge case), which is
not fundamentally different.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH] mremap: enforce rmap src/dst vma ordering in case of vma_merge succeeding in copy_vma
  2011-11-07 15:42                                 ` Andrea Arcangeli
@ 2011-11-07 16:28                                   ` Mel Gorman
  2011-11-09  1:25                                     ` Andrea Arcangeli
  0 siblings, 1 reply; 72+ messages in thread
From: Mel Gorman @ 2011-11-07 16:28 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Nai Xia, Hugh Dickins, Pawel Sikora, Andrew Morton, linux-mm,
	jpiszcz, arekm, linux-kernel

On Mon, Nov 07, 2011 at 04:42:35PM +0100, Andrea Arcangeli wrote:
> On Mon, Nov 07, 2011 at 01:14:13PM +0000, Mel Gorman wrote:
> > I'm tending to agree. The number of cases that must be kept in mind
> > is getting too tricky. Taking the anon_vma lock may be slower but at
> > the risk of sounding chicken, it's easier to understand.
> > 
> > > But I think Mel indicated that anon_vma_locking might be
> > > harmful to JVM SMP performance.
> > > How severe you expect this to be, Mel ?
> > > 
> > 
> > I would only expect it to be a problem during garbage collection when
> > there is a greater likelihood that mremap is heavily used. While it
> > would have been nice to avoid additional overhead in mremap, I don't
> > think the JVM GC case on its own is sufficient justification to avoid
> > taking the anon_vma lock.
> 
> Adding one liner in the error path and a bugcheck in the *vmap case,
> doesn't seem the end of the world compared to my previous fix that you
> acked.

Note that I didn't suddenly turn that ack into a nack although

  1) A small comment on why we need to call anon_vma_moveto_tail in the
     error path would be nice

  2) It is unfortunate that we need the faulted_in_anon_vma just
     for a VM_BUG_ON check that only exists for CONFIG_DEBUG_VM
     but not earth shattering

What I said was that taking the anon_vma lock may be slower but it was
generally easier to understand. I'm happy with the new patch too,
particularly as it keeps the "ordering game" consistent for fork
and mremap, but I previously missed move_page_tables in the error
path, so I was worried there was something else I managed to miss,
particularly in light of the "Allocating a new vma, copy first and
merge later" direction.

I'm also perfectly happy with my human meat brain and do not expect
to replace it with an alien's.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH] mremap: enforce rmap src/dst vma ordering in case of vma_merge succeeding in copy_vma
  2011-11-07 16:28                                   ` Mel Gorman
@ 2011-11-09  1:25                                     ` Andrea Arcangeli
  2011-11-11  9:14                                       ` Nai Xia
  2011-11-16 14:00                                       ` Andrea Arcangeli
  0 siblings, 2 replies; 72+ messages in thread
From: Andrea Arcangeli @ 2011-11-09  1:25 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Nai Xia, Hugh Dickins, Pawel Sikora, Andrew Morton, linux-mm,
	jpiszcz, arekm, linux-kernel

On Mon, Nov 07, 2011 at 04:28:08PM +0000, Mel Gorman wrote:
> Note that I didn't suddenly turn that ack into a nack although

:)

>   1) A small comment on why we need to call anon_vma_moveto_tail in the
>      error path would be nice

I can add that.

>   2) It is unfortunate that we need the faulted_in_anon_vma just
>      for a VM_BUG_ON check that only exists for CONFIG_DEBUG_VM
>      but not earth shatting

It should be optimized away at build time. I thought it was better
not to leave that path without a VM_BUG_ON. It should be a slow path
in the first place (probably we should even mark it unlikely). And
it's obscure enough that I think a check will clarify things. In the
common case (i.e. some pte faulted in), that vma_merge on self, if it
succeeds, couldn't possibly be safe, because the vma->vm_pgoff vs
page->index linearity couldn't be valid for the same vma and the same
page on two different virtual addresses. So checking for it I think is
sane. Especially given that at some point it was mentioned we could
optimize away the check altogether, it's a bit of an obscure path
that the VM_BUG_ON I think will help document (and verify).
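
For reference, the linearity in question is essentially what
linear_page_index() computes; a minimal sketch of the invariant
(simplified, ignoring hugetlb):

	/*
	 * For a linearly mapped vma, a page faulted in at "address"
	 * satisfies:
	 *
	 *	page->index == linear_page_index(vma, address)
	 *		    == ((address - vma->vm_start) >> PAGE_SHIFT)
	 *		       + vma->vm_pgoff
	 *
	 * The same page cannot satisfy this for two different virtual
	 * addresses within one vma, so a successful vma_merge() of a vma
	 * with itself can only be legitimate while no pages have been
	 * faulted in yet, which is what VM_BUG_ON(faulted_in_anon_vma)
	 * verifies.
	 */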

> What I said was taking the anon_vma lock may be slower but it was
> generally easier to understand. I'm happy with the new patch too
> particularly as it keeps the "ordering game" consistent for fork
> and mremap but I previously missed move_page_tables in the error
> path so was worried if there was something else I managed to miss
> particularly in light of the "Allocating a new vma, copy first and
> merge later" direction.

I liked that direction a lot. I thought with that we could stick to
the exact same behavior as fork and not need to reorder stuff. But the
error path is still in the way, and we have to undo the move in place
without tearing down the vmas. Plus it would have required writing
more code, and the allocation path wouldn't necessarily have been
faster than a reordering if the list is not huge.

> I'm also prefectly happy with my human meat brain and do not expect
> to replace it with an aliens.

8-)

On a totally different but related topic, unmap_mapping_range_tree
walks the prio tree the same way try_to_unmap_file walks it, and if
truncate can truncate "dst" before "src" then supposedly
try_to_unmap_file could miss a migration entry copied into the "child"
ptep while fork runs too... But I think there is no risk there because
we don't establish migration ptes there, and we just unmap the
pagecache, so worst case we'll abort migration if the race triggers and
we'll retry later. But I wonder what happens if truncate runs against
fork: if truncate can drop ptes from dst before src (like the mremap
comment says), we could still end up with some pte mapped to the file
in the ptes of the child, even if the pte was correctly truncated in
the parent...

Overall I think fork/mremap vs fully_reliable_rmap_walk/truncate
aren't fundamentally different in relation. If we rely on ordering
for anon pages in fork, it's not adding too much mess to also rely on
ordering for mremap. If we take the i_mmap_mutex in mremap because we
can't enforce an order in the prio tree, then we need the i_mmap_mutex
in fork too (and that's missing). But nothing prevents us from using a
lock in mremap and ordering in fork. I think the decision should be
based more on performance expectations.

So we could add the ordering to mremap (patch posted), and add the
i_mmap_mutex to fork, or we add the anon_vma lock in both mremap and
fork, and the i_mmap_lock to fork.

Also note, if we find a way to enforce orderings in the prio tree (not
sure if it's possible, apparently it's already using list_add_tail
so..), then we could also remove the i_mmap_lock from mremap and fork.

Keeping the anon and file cases separated is better though; I think
the patch I posted should close the race and be ok, pending your
requested changes. If you think taking the lock is faster it's fine
with me, but I think taking the anon_vma lock once per VMA (plus the
anon_vma_chain list walk) and reducing the per-pagetable locking
overhead is better. Ideally the anon_vma_chain lists won't be long
anyway. And if they are long and lots of processes do mremap at the
same time it should still work better. The anon_vma root lock is not
such a small lock to take, and it's better not to take it repeatedly. I
also recall Andi's patches trying to avoid doing lock/unlock in a tight
loop: if we take it and do some work with it held, that's likely better
than bouncing it at high frequency across CPUs for each pmd.
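
Roughly, the two approaches being compared look like this (illustrative
fragments only, neither is a posted patch):

	/* (a) per-pagetable locking: take the root anon_vma lock inside
	 * move_ptes() for every extent, the same way move_ptes() already
	 * takes mapping->i_mmap_mutex for file mappings.
	 */
	anon_vma_lock(vma->anon_vma);
	/* ... move one extent of ptes ... */
	anon_vma_unlock(vma->anon_vma);

	/* (b) the posted patch: reorder the same_anon_vma lists once,
	 * taking each root lock once per anon_vma_chain entry, then run
	 * the whole pagetable move without the anon_vma lock held.
	 */
	anon_vma_moveto_tail(new_vma);
	moved_len = move_page_tables(vma, old_addr, new_vma, new_addr, old_len);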

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH] mremap: enforce rmap src/dst vma ordering in case of vma_merge succeeding in copy_vma
  2011-11-09  1:25                                     ` Andrea Arcangeli
@ 2011-11-11  9:14                                       ` Nai Xia
  2011-11-16 14:00                                       ` Andrea Arcangeli
  1 sibling, 0 replies; 72+ messages in thread
From: Nai Xia @ 2011-11-11  9:14 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mel Gorman, Hugh Dickins, Pawel Sikora, Andrew Morton, linux-mm,
	jpiszcz, arekm, linux-kernel

On Wed, Nov 9, 2011 at 9:25 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> On Mon, Nov 07, 2011 at 04:28:08PM +0000, Mel Gorman wrote:
>> Note that I didn't suddenly turn that ack into a nack although
>
> :)
>
>>   1) A small comment on why we need to call anon_vma_moveto_tail in the
>>      error path would be nice
>
> I can add that.
>
>>   2) It is unfortunate that we need the faulted_in_anon_vma just
>>      for a VM_BUG_ON check that only exists for CONFIG_DEBUG_VM
>>      but not earth shatting
>
> It should be optimized away at build time. It thought it was better
> not to leave that path without a VM_BUG_ON. It should be a slow path
> in the first place (probably we should even mark it unlikely). And
> it's obscure enough that I think a check will clarify things. In the
> common case (i.e. some pte faulted in) that vma_merge on self if it
> succeeds, it couldn't possibly be safe because the vma->vm_pgoff vs
> page->index linearity couldn't be valid for the same vma and the same
> page on two different virtual addresses. So checking for it I think is
> sane. Especially given at some point it was mentioned we could
> optimize away the check all together, so it's a bit of an obscure path
> that the VM_BUG_ON I think will help document (and verify).
>
>> What I said was taking the anon_vma lock may be slower but it was
>> generally easier to understand. I'm happy with the new patch too
>> particularly as it keeps the "ordering game" consistent for fork
>> and mremap but I previously missed move_page_tables in the error
>> path so was worried if there was something else I managed to miss
>> particularly in light of the "Allocating a new vma, copy first and
>> merge later" direction.
>
> I liked that direction a lot. I thought with that we could stick to
> the exact same behavior of fork and not need to reorder stuff. But the
> error path is still in the way, and we've to undo the move in place
> without tearing down the vmas. Plus it would have required to write
> mode code, and the allocation path wouldn't have necessarily been
> faster than a reordering if the list is not huge.
>
>> I'm also prefectly happy with my human meat brain and do not expect
>> to replace it with an aliens.
>
> 8-)
>
> On a totally different but related topic, unmap_mapping_range_tree
> walks the prio tree the same way try_to_unmap_file walks it and if
> truncate can truncate "dst" before "src" then supposedly the
> try_to_unmap_file could miss a migration entry copied into the "child"
> ptep while fork runs too... But I think there is no risk there because
> we don't establish migration ptes there, and we just unmap the
> pagecache, so worst case we'll abort migration if the race trigger and
> we'll retry later. But I wonder what happens if truncate runs against
> fork, if truncate can drop ptes from dst before src (like mremap
> comment says), we could still end up with some pte mapped to the file
> in the ptes of the child, even if the pte was correctly truncated in
> the parent...
>
> Overall I think fork/mremap vs fully_reliable_rmap_walk/truncate
> aren't fundamentally different in relation. If we relay on ordering
> for anon pages in fork it's not adding too much mess to also relay on
> ordering for mremap. If we take the i_mmap_mutex in mremap because we
> can't enforce a order in the prio tree, then we need the i_mmap_mutex
> in fork too (and that's missing). But nothing prevents us to use a
> lock in mreamp and ordering in fork. I think the decision should be
> based more on performance expectations.
>
> So we could add the ordering to mremap (patch posted), and add the
> i_mmap_mutex to fork, or we add the anon_vma lock in both mremap and
> fork, and the i_mmap_lock to fork.
>
> Also note, if we find a way to enforce orderings in the prio tree (not
> sure if it's possible, apparently it's already using list_add_tail
> so..), then we could also remove the i_mmap_lock from mremap and fork.
>

Oh, well, I had thought that for a partial mremap the src and dst VMAs
are inserted as different prio tree nodes, instead of being
list_add_tail linked, which means they cannot be reordered back and
forth at all...

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH] mremap: enforce rmap src/dst vma ordering in case of vma_merge succeeding in copy_vma
  2011-11-09  1:25                                     ` Andrea Arcangeli
  2011-11-11  9:14                                       ` Nai Xia
@ 2011-11-16 14:00                                       ` Andrea Arcangeli
  2011-11-17  0:16                                         ` Hugh Dickins
  1 sibling, 1 reply; 72+ messages in thread
From: Andrea Arcangeli @ 2011-11-16 14:00 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Nai Xia, Hugh Dickins, Pawel Sikora, Andrew Morton, linux-mm,
	jpiszcz, arekm, linux-kernel

On Wed, Nov 09, 2011 at 02:25:42AM +0100, Andrea Arcangeli wrote:
> Also note, if we find a way to enforce orderings in the prio tree (not
> sure if it's possible, apparently it's already using list_add_tail
> so..), then we could also remove the i_mmap_lock from mremap and fork.

I'm not optimistic we can enforce ordering there. Being a tree it's
walked in range order.

I thought of another solution that would avoid having to reorder the
list in mremap and avoid the i_mmap_mutex being added to fork (and
then we can remove it from mremap too). The solution is to rmap_walk
twice. I mean two loops over the same_anon_vma for those rmap walks
that must be reliable (that includes two calls of
unmap_mapping_range). For both same_anon_vma and prio tree.

Reading truncate_pagecache I see two loops already and a comment
saying it's for fork(), to avoid leaking ptes in the child. So fork is
probably ok already without having to take the i_mmap_mutex, but then
I wonder why that also doesn't fix mremap if we do two loops there and
why that i_mmap_mutex is really needed in mremap, considering those two
calls are already present in truncate_pagecache. I wonder if that was a
"theoretical" fix that missed the fact that truncate already walks the
prio tree twice, so it doesn't matter if the rmap_walk goes in the
opposite direction of move_page_tables? That i_mmap_lock in mremap (now
i_mmap_mutex) has been there since the start of git history. The double
loop was introduced in d00806b183152af6d24f46f0c33f14162ca1262a. So it's
very possible that i_mmap_mutex is now useless (after
d00806b183152af6d24f46f0c33f14162ca1262a), that the fix for fork was
already taking care of mremap too, and that i_mmap_mutex can now be
removed.
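
For reference, the double call in question is this pattern in
truncate_pagecache() (roughly the 3.0-era code, slightly abbreviated):

	void truncate_pagecache(struct inode *inode, loff_t old, loff_t new)
	{
		struct address_space *mapping = inode->i_mapping;

		/*
		 * First call: batch unmap so truncate_inode_pages() does
		 * fewer single-page unmaps.
		 */
		unmap_mapping_range(mapping, new + PAGE_SIZE - 1, 0, 1);
		truncate_inode_pages(mapping, new);
		/*
		 * Second call: catch ptes (e.g. COWed private pages) that
		 * were re-established after the first call.
		 */
		unmap_mapping_range(mapping, new + PAGE_SIZE - 1, 0, 1);
	}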

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH] mremap: enforce rmap src/dst vma ordering in case of vma_merge succeeding in copy_vma
  2011-11-16 14:00                                       ` Andrea Arcangeli
@ 2011-11-17  0:16                                         ` Hugh Dickins
  2011-11-17  2:49                                           ` Nai Xia
                                                             ` (2 more replies)
  0 siblings, 3 replies; 72+ messages in thread
From: Hugh Dickins @ 2011-11-17  0:16 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mel Gorman, Nai Xia, Pawel Sikora, Andrew Morton, linux-mm,
	jpiszcz, arekm, linux-kernel

On Wed, 16 Nov 2011, Andrea Arcangeli wrote:
> On Wed, Nov 09, 2011 at 02:25:42AM +0100, Andrea Arcangeli wrote:
> > Also note, if we find a way to enforce orderings in the prio tree (not
> > sure if it's possible, apparently it's already using list_add_tail
> > so..), then we could also remove the i_mmap_lock from mremap and fork.
> 
> I'm not optimistic we can enforce ordering there. Being a tree it's
> walked in range order.
> 
> I thought of another solution that would avoid having to reorder the
> list in mremap and avoid the i_mmap_mutex to be added to fork (and
> then we can remove it from mremap too). The solution is to rmap_walk
> twice. I mean two loops over the same_anon_vma for those rmap walks
> that must be reliable (that includes two calls of
> unmap_mapping_range). For both same_anon_vma and prio tree.
> 
> Reading truncate_pagecache I see two loops already and a comment
> saying it's for fork(), to avoid leaking ptes in the child. So fork is
> probably ok already without having to take the i_mmap_mutex, but then
> I wonder why that also doesn't fix mremap if we do two loops there and
> why that i_mmap_mutex is really needed in mremap considering those two
> calls already present in truncate_pagecache. I wonder if that was a
> "theoretical" fix that missed the fact truncate already walks the prio
> tree twice, so it doesn't matter if the rmap_walk goes in the opposite
> direction of move_page_tables? That i_mmap_lock in mremap (now
> i_mmap_mutex) is there since start of git history. The double loop was
> introduced in d00806b183152af6d24f46f0c33f14162ca1262a. So it's very
> possible that i_mmap_mutex is now useless (after
> d00806b183152af6d24f46f0c33f14162ca1262a) and the fix for fork, was
> already taking care of mremap too and that i_mmap_mutex can now be
> removed.

As you found, the mremap locking long predates truncation's double unmap.

That's an interesting point, and you may be right - though, what about
the *very* unlikely case where unmap_mapping_range looks at new vma
when pte is in old, then at old vma when pte is in new, then
move_page_tables runs out of memory and cannot complete, then the
second unmap_mapping_range looks at old vma while pte is still in new
(I guess this needs some other activity to have jumbled the prio_tree,
and may just be impossible), then at new (to be abandoned) vma after
pte has moved back to old.

Probably not an everyday occurrence :)

But, setting that aside, I've always thought of that second call to
unmap_mapping_range() as a regrettable expedient that we should try
to eliminate e.g. by checking for private mappings in the first pass,
and skipping the second call if there were none.

But since nobody ever complained about that added overhead, I never
got around to bothering; and you may consider the i_mmap_mutex in
move_ptes a more serious unnecessary overhead.

By the way, you mention "a comment saying it's for fork()": I don't
find "fork" anywhere in mm/truncate.c, my understanding is in this
comment (probably mine) from truncate_pagecache():

	/*
	 * unmap_mapping_range is called twice, first simply for
	 * efficiency so that truncate_inode_pages does fewer
	 * single-page unmaps.  However after this first call, and
	 * before truncate_inode_pages finishes, it is possible for
	 * private pages to be COWed, which remain after
	 * truncate_inode_pages finishes, hence the second
	 * unmap_mapping_range call must be made for correctness.
	 */

The second call was not (I think) necessary when we relied upon
truncate_count, but became necessary once Nick relied upon page lock
(the page lock on the file page providing no guarantee for the COWed
page).

Hugh

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH] mremap: enforce rmap src/dst vma ordering in case of vma_merge succeeding in copy_vma
  2011-11-17  0:16                                         ` Hugh Dickins
@ 2011-11-17  2:49                                           ` Nai Xia
  2011-11-17  6:21                                           ` Nai Xia
  2011-11-17 18:42                                           ` Andrea Arcangeli
  2 siblings, 0 replies; 72+ messages in thread
From: Nai Xia @ 2011-11-17  2:49 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Mel Gorman, Pawel Sikora, Andrew Morton,
	linux-mm, jpiszcz, arekm, linux-kernel

On Thursday 17 November 2011 08:16:57 Hugh Dickins wrote:
> On Wed, 16 Nov 2011, Andrea Arcangeli wrote:
> > On Wed, Nov 09, 2011 at 02:25:42AM +0100, Andrea Arcangeli wrote:
> > > Also note, if we find a way to enforce orderings in the prio tree (not
> > > sure if it's possible, apparently it's already using list_add_tail
> > > so..), then we could also remove the i_mmap_lock from mremap and fork.
> > 
> > I'm not optimistic we can enforce ordering there. Being a tree it's
> > walked in range order.
> > 
> > I thought of another solution that would avoid having to reorder the
> > list in mremap and avoid the i_mmap_mutex to be added to fork (and
> > then we can remove it from mremap too). The solution is to rmap_walk
> > twice. I mean two loops over the same_anon_vma for those rmap walks
> > that must be reliable (that includes two calls of
> > unmap_mapping_range). For both same_anon_vma and prio tree.
> > 
> > Reading truncate_pagecache I see two loops already and a comment
> > saying it's for fork(), to avoid leaking ptes in the child. So fork is
> > probably ok already without having to take the i_mmap_mutex, but then
> > I wonder why that also doesn't fix mremap if we do two loops there and
> > why that i_mmap_mutex is really needed in mremap considering those two
> > calls already present in truncate_pagecache. I wonder if that was a
> > "theoretical" fix that missed the fact truncate already walks the prio
> > tree twice, so it doesn't matter if the rmap_walk goes in the opposite
> > direction of move_page_tables? That i_mmap_lock in mremap (now
> > i_mmap_mutex) is there since start of git history. The double loop was
> > introduced in d00806b183152af6d24f46f0c33f14162ca1262a. So it's very
> > possible that i_mmap_mutex is now useless (after
> > d00806b183152af6d24f46f0c33f14162ca1262a) and the fix for fork, was
> > already taking care of mremap too and that i_mmap_mutex can now be
> > removed.
> 
> As you found, the mremap locking long predates truncation's double unmap.
> 
> That's an interesting point, and you may be right - though, what about
> the *very* unlikely case where unmap_mapping_range looks at new vma
> when pte is in old, then at old vma when pte is in new, then
> move_page_tables runs out of memory and cannot complete, then the
> second unmap_mapping_range looks at old vma while pte is still in new
> (I guess this needs some other activity to have jumbled the prio_tree,
> and may just be impossible), then at new (to be abandoned) vma after
> pte has moved back to old.
> 
> Probably not an everyday occurrence :)
> 
> But, setting that aside, I've always thought of that second call to
> unmap_mapping_range() as a regrettable expedient that we should try
> to eliminate e.g. by checking for private mappings in the first pass,
> and skipping the second call if there were none.
> 
> But since nobody ever complained about that added overhead, I never
> got around to bothering; and you may consider the i_mmap_mutex in
> move_ptes a more serious unnecessary overhead.
> 
> By the way, you mention "a comment saying it's for fork()": I don't
> find "fork" anywhere in mm/truncate.c, my understanding is in this
> comment (probably mine) from truncate_pagecache():

I think you guys are talking about two different COWs:

Andrea's question is about a new VMA created by fork() between
the two loops while the PTEs are being copied.

And you are referring to new PTEs getting COWed by __do_fault() in
the same VMA before the cache pages are really dropped.

From my point of view, the two loops there are really irrelevant to
fork(); as you said, they are only for COWed ptes missed in the
same VMA before a cache page becomes invisible to find_get_page().




As for Andrea's reasoning, I see this race as follows:

1. fork() is safe without the tree lock/mutex after the second loop, for
the same reason it's safe for try_to_unmap_file: the new VMA is
linked at the list tail in the *same* tree node as the old VMA in
the vma prio_tree. The old and new are traversed by vma_prio_tree_foreach()
in the proper order. And fork() does not include an error path requiring
a backward page table copy operation, which would need the reverse order.

2. A partial mremap is not safe for this without the tree lock/mutex,
because the src and dst VMA are different prio_tree nodes, and their
relative order is not guaranteed.



Nai

> 
> 	/*
> 	 * unmap_mapping_range is called twice, first simply for
> 	 * efficiency so that truncate_inode_pages does fewer
> 	 * single-page unmaps.  However after this first call, and
> 	 * before truncate_inode_pages finishes, it is possible for
> 	 * private pages to be COWed, which remain after
> 	 * truncate_inode_pages finishes, hence the second
> 	 * unmap_mapping_range call must be made for correctness.
> 	 */
> 
> The second call was not (I think) necessary when we relied upon
> truncate_count, but became necessary once Nick relied upon page lock
> (the page lock on the file page providing no guarantee for the COWed
> page).
> 
> Hugh
> 

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH] mremap: enforce rmap src/dst vma ordering in case of vma_merge succeeding in copy_vma
  2011-11-17  0:16                                         ` Hugh Dickins
  2011-11-17  2:49                                           ` Nai Xia
@ 2011-11-17  6:21                                           ` Nai Xia
  2011-11-17 18:42                                           ` Andrea Arcangeli
  2 siblings, 0 replies; 72+ messages in thread
From: Nai Xia @ 2011-11-17  6:21 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Mel Gorman, Pawel Sikora, Andrew Morton,
	linux-mm, jpiszcz, arekm, linux-kernel

On Thu, Nov 17, 2011 at 8:16 AM, Hugh Dickins <hughd@google.com> wrote:
> On Wed, 16 Nov 2011, Andrea Arcangeli wrote:
>> On Wed, Nov 09, 2011 at 02:25:42AM +0100, Andrea Arcangeli wrote:
>> > Also note, if we find a way to enforce orderings in the prio tree (not
>> > sure if it's possible, apparently it's already using list_add_tail
>> > so..), then we could also remove the i_mmap_lock from mremap and fork.
>>
>> I'm not optimistic we can enforce ordering there. Being a tree it's
>> walked in range order.
>>
>> I thought of another solution that would avoid having to reorder the
>> list in mremap and avoid the i_mmap_mutex to be added to fork (and
>> then we can remove it from mremap too). The solution is to rmap_walk
>> twice. I mean two loops over the same_anon_vma for those rmap walks
>> that must be reliable (that includes two calls of
>> unmap_mapping_range). For both same_anon_vma and prio tree.
>>
>> Reading truncate_pagecache I see two loops already and a comment
>> saying it's for fork(), to avoid leaking ptes in the child. So fork is
>> probably ok already without having to take the i_mmap_mutex, but then
>> I wonder why that also doesn't fix mremap if we do two loops there and
>> why that i_mmap_mutex is really needed in mremap considering those two
>> calls already present in truncate_pagecache. I wonder if that was a
>> "theoretical" fix that missed the fact truncate already walks the prio
>> tree twice, so it doesn't matter if the rmap_walk goes in the opposite
>> direction of move_page_tables? That i_mmap_lock in mremap (now
>> i_mmap_mutex) is there since start of git history. The double loop was
>> introduced in d00806b183152af6d24f46f0c33f14162ca1262a. So it's very
>> possible that i_mmap_mutex is now useless (after
>> d00806b183152af6d24f46f0c33f14162ca1262a) and the fix for fork, was
>> already taking care of mremap too and that i_mmap_mutex can now be
>> removed.
>
> As you found, the mremap locking long predates truncation's double unmap.
>
> That's an interesting point, and you may be right - though, what about
> the *very* unlikely case where unmap_mapping_range looks at new vma
> when pte is in old, then at old vma when pte is in new, then
> move_page_tables runs out of memory and cannot complete, then the
> second unmap_mapping_range looks at old vma while pte is still in new
> (I guess this needs some other activity to have jumbled the prio_tree,
> and may just be impossible), then at new (to be abandoned) vma after
> pte has moved back to old.

I think this cannot happen either with proper ordering or with the tree lock,
and Andrea was asking whether the two-loop setup can avoid taking the
tree lock in mremap().

So, a simple answer would be: no, the two-loop setup does not aim at
solving the PTE copy race in fork() (it's lucky that it does, though), so
it cannot solve the problem for mremap either.

>
> Probably not an everyday occurrence :)
>
> But, setting that aside, I've always thought of that second call to
> unmap_mapping_range() as a regrettable expedient that we should try
> to eliminate e.g. by checking for private mappings in the first pass,
> and skipping the second call if there were none.

Don't you think this is only a partial solution? Given that
truncate_inode_page() does not shoot down COWed ptes, the zap of the
ptes and of the cache pages is not atomic anyway,
so the second pass seems unavoidable in the general case....

Of course, if you give truncate_inode_page() an option to unmap
COWed ptes, the second pass may not be needed, but then you may worry
about the performance.... a real dilemma, isn't it?  :)

>
> But since nobody ever complained about that added overhead, I never
> got around to bothering; and you may consider the i_mmap_mutex in
> move_ptes a more serious unnecessary overhead.
>
> By the way, you mention "a comment saying it's for fork()": I don't
> find "fork" anywhere in mm/truncate.c, my understanding is in this
> comment (probably mine) from truncate_pagecache():
>
>        /*
>         * unmap_mapping_range is called twice, first simply for
>         * efficiency so that truncate_inode_pages does fewer
>         * single-page unmaps.  However after this first call, and
>         * before truncate_inode_pages finishes, it is possible for
>         * private pages to be COWed, which remain after
>         * truncate_inode_pages finishes, hence the second
>         * unmap_mapping_range call must be made for correctness.
>         */
>
> The second call was not (I think) necessary when we relied upon
> truncate_count, but became necessary once Nick relied upon page lock
> (the page lock on the file page providing no guarantee for the COWed
> page).

Hmm, yes, do_wp_page() does not take the page lock when doing COW
(only the PTE lock), but I think another critical reason for the second pass
is that nothing can prevent a just-zapped pte from launching a write fault
again and getting COWed in __do_fault(), just *before* truncate_inode_pages()
can take its page lock..., so even if we bring do_wp_page() under control
of the page lock, the second pass is still needed, right?

Nai
>
> Hugh
>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH] mremap: enforce rmap src/dst vma ordering in case of vma_merge succeeding in copy_vma
  2011-11-17  0:16                                         ` Hugh Dickins
  2011-11-17  2:49                                           ` Nai Xia
  2011-11-17  6:21                                           ` Nai Xia
@ 2011-11-17 18:42                                           ` Andrea Arcangeli
  2011-11-18  1:42                                             ` Nai Xia
  2 siblings, 1 reply; 72+ messages in thread
From: Andrea Arcangeli @ 2011-11-17 18:42 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Mel Gorman, Nai Xia, Pawel Sikora, Andrew Morton, linux-mm,
	jpiszcz, arekm, linux-kernel

Hi Hugh,

On Wed, Nov 16, 2011 at 04:16:57PM -0800, Hugh Dickins wrote:
> As you found, the mremap locking long predates truncation's double unmap.
> 
> That's an interesting point, and you may be right - though, what about
> the *very* unlikely case where unmap_mapping_range looks at new vma
> when pte is in old, then at old vma when pte is in new, then
> move_page_tables runs out of memory and cannot complete, then the
> second unmap_mapping_range looks at old vma while pte is still in new
> (I guess this needs some other activity to have jumbled the prio_tree,
> and may just be impossible), then at new (to be abandoned) vma after
> pte has moved back to old.

I tend to think it should still work fine. The second loop is needed
to take care of the "reverse" order. If the first move_page_tables is
not in order the second move_page_tables will be in order. So it
should catch it. If the first move_page_tables is in order, the double
loop will catch any skip in the second move_page_tables.

Well, if I'm missing something, worst case we'd need a dummy
mutex_lock/unlock of the i_mmap_mutex before running the rolling-back
move_page_tables; no big deal, and still out of the fast path.

> But since nobody ever complained about that added overhead, I never
> got around to bothering; and you may consider the i_mmap_mutex in
> move_ptes a more serious unnecessary overhead.

The point is that if the double loop in truncate can't be removed
anyway, for the other reasons, then we could take advantage of the
double loop in mremap too (adding a proper comment to truncate.c of
course).

> By the way, you mention "a comment saying it's for fork()": I don't
> find "fork" anywhere in mm/truncate.c, my understanding is in this
> comment (probably mine) from truncate_pagecache():
> 
> 	/*
> 	 * unmap_mapping_range is called twice, first simply for
> 	 * efficiency so that truncate_inode_pages does fewer
> 	 * single-page unmaps.  However after this first call, and
> 	 * before truncate_inode_pages finishes, it is possible for
> 	 * private pages to be COWed, which remain after
> 	 * truncate_inode_pages finishes, hence the second
> 	 * unmap_mapping_range call must be made for correctness.
> 	 */
> 
> The second call was not (I think) necessary when we relied upon
> truncate_count, but became necessary once Nick relied upon page lock
> (the page lock on the file page providing no guarantee for the COWed
> page).

I see. Truncate locks down the page while it shoots down the pte so no
new mapping can be established, while the COWs can still happen because
they don't take the lock on the old page. But do_wp_page takes the
lock for anon pages and MAP_SHARED. It's a little weird it doesn't
take it for MAP_PRIVATE (i.e. VM_SHARED not set). MAP_SHARED already
does the check for page->mapping being null after the lock is obtained.

The double loop happens to make fork safe too, or the inverse ordering
between truncate and fork would lead to the same issue, and that would
also map pagecache (not just anon COWs). I don't see lock_page in fork;
it just copies the pte and doesn't touch the page lock.

Note however that for a tiny window, with the current truncate code
that does unmap+truncate+unmap, there can still be a pte in the fork
child that points to an orphaned pagecache (before the second call of
unmap_mapping_range starts). It'd be a transient pte, it'll be dropped
as soon as the second unmap_mapping_range runs. Not sure how bad that
thing is. To avoid it we'd need to run unmap+unmap+truncate. That way
no pte in the fork child could still map an orphaned pagecache page. But
then the second unmap wouldn't take down the COWs generated by do_wp_page
in MAP_PRIVATE areas anymore.

So it boils down to whether we are ok with a transient pte mapping an
orphaned pagecache page for a little while. The only problem I can see is
that writes beyond the end of i_size on MAP_SHARED would then be discarded
without triggering SIGBUS. But if the write from the other process (or
thread) had happened a millisecond earlier it would be discarded anyway. So
I guess it's not a problem, and it's mostly an implementation issue whether
there could be any code that won't like a pte pointing to an orphaned
pagecache page for a little while. I'm optimistic it can work safely and we
can just drop the i_mmap_mutex completely from mremap after checking
that those transient ptes mapping orphaned pagecache won't trigger
asserts.

As for the anon_vma case, my ordering patch (last version I posted) fixes
it already. The other way is to add double loops. Or the anon_vma->lock
of course!

If we go with double loops for the anon_vma, with split_huge_page I could
unlink any anon_vma_chain where the address range matches but the
pte/pmd is not found, and re-check in the second loop _only_ those
anon_vma_chains where we failed to find a mapping. I've only thought about
it, not actually attempted to implement it. Even rmap_walk could do
that, but it requires changes to the caller (i.e. migrate.c), while for
split_huge_page it'd be a simpler local change. Then I would relink the
re-checked anon_vma_chains with list_splice. The whole list is
protected by the root anon_vma lock, which is held for the whole
duration of split_huge_page, so I guess it should be doable.
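
A very rough sketch of that two-pass idea for split_huge_page()
(hypothetical; it assumes __split_huge_page_map() keeps returning whether
the pmd was found, and glosses over where exactly the missed chains get
spliced back):

	struct anon_vma_chain *avc, *next;
	LIST_HEAD(missed);

	/* first pass: set aside the chains where no pmd was found */
	list_for_each_entry_safe(avc, next, &anon_vma->head, same_anon_vma) {
		struct vm_area_struct *vma = avc->vma;

		if (!__split_huge_page_map(page, vma, vma_address(page, vma)))
			list_move(&avc->same_anon_vma, &missed);
	}
	/* second pass: re-check only the chains that missed */
	list_for_each_entry(avc, &missed, same_anon_vma)
		__split_huge_page_map(page, avc->vma,
				      vma_address(page, avc->vma));
	/* relink them; the root anon_vma lock is held across both passes */
	list_splice(&missed, &anon_vma->head);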

The rmap_walks of filebacked mappings won't need any double loop (only
migrate and split_huge_page will need it) because neither
remove_migration_ptes nor split_huge_page runs on filebacked mappings,
as migration ptes and hugepage splits only happen for anon memory. And
nothing would prevent adding double loops there too if we extend
split_huge_page to pagecache (we already double loop in truncate).

Nai, if the prio tree could guarantee ordering, 1) there would be no
i_mmap_lock I guess, or there would be a comment that it's only for
the vma_merge case and the error path that goes in reverse order, 2)
if you were right that list_add_tail in the prio tree, with both src and
dst vmas being in the same node, guarantees ordering, it would imply the
prio tree works in O(N), and that can't be or we'd use a list instead
of a prio tree. The whole idea of any structure smarter than a list is
to insert things in some "order" that depends on the index (the index
is the vm_start,vm_end range in the prio tree case) and do some "work"
at insert time so the walk can be faster, but that practically guarantees
the walk won't be in the same order as the way it was inserted.

If the prio tree could guarantee ordering then I could also reorder the
prio tree, extending my patch that already fixes the anon_vma case,
and still avoid the i_mmap_mutex without requiring double loops.

So in short.

1) for anon, I'm not sure whether it's better to use my current patch,
which fixes the anon case just fine, or to go with double loops in
split_huge_page/migrate, or to add the anon_vma lock around
move_page_tables.

2) for filebacked, if we can deal with the transient pte on orphaned
pagecache we can just add a comment to truncate.c and drop the
i_mmap_mutex.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH] mremap: enforce rmap src/dst vma ordering in case of vma_merge succeeding in copy_vma
  2011-11-17 18:42                                           ` Andrea Arcangeli
@ 2011-11-18  1:42                                             ` Nai Xia
  2011-11-18  2:17                                               ` Andrea Arcangeli
  0 siblings, 1 reply; 72+ messages in thread
From: Nai Xia @ 2011-11-18  1:42 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hugh Dickins, Mel Gorman, Pawel Sikora, Andrew Morton, linux-mm,
	jpiszcz, arekm, linux-kernel

On Fri, Nov 18, 2011 at 2:42 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> Hi Hugh,
>
> On Wed, Nov 16, 2011 at 04:16:57PM -0800, Hugh Dickins wrote:
>> As you found, the mremap locking long predates truncation's double unmap.
>>
>> That's an interesting point, and you may be right - though, what about
>> the *very* unlikely case where unmap_mapping_range looks at new vma
>> when pte is in old, then at old vma when pte is in new, then
>> move_page_tables runs out of memory and cannot complete, then the
>> second unmap_mapping_range looks at old vma while pte is still in new
>> (I guess this needs some other activity to have jumbled the prio_tree,
>> and may just be impossible), then at new (to be abandoned) vma after
>> pte has moved back to old.
>
> I tend to think it should still work fine. The second loop is needed
> to take care of the "reverse" order. If the first move_page_tables is
> not in order the second move_page_tables will be in order. So it
> should catch it. If the first move_page_tables is in order, the double
> loop will catch any skip in the second move_page_tables.


First of all, I believe that at the POSIX level it's ok for
truncate_inode_page() not to scan COWed pages, since basically we do not
provide any guarantee for privately mapped file pages for this behavior.
But missing a file-mapped pte after its cache page has already been removed
from the page cache is a fundamental malfunction for a shared mapping, when
some threads see the file cache page is gone while some thread is still
reading/writing from/to it! No matter how short the gap between
truncate_inode_page() and the second loop, this is wrong.

Second, even if we don't care about the POSIX flaw this may introduce, a
pte can still be missed by the second loop. mremap can happen several times
during these non-atomic firstpass-truncate-secondpass operations, and the
right sequence of events can happily produce the wrong order for every scan
and miss them all -- that's just what Hugh had in mind in the post you just
replied to. Without a lock and proper ordering (which a partial mremap
cannot provide), this *will* happen.

You may disagree with me and have that locking removed, and I already
have that one-line patch prepared, waiting for a bug to pop up again;
what a cheap patch submission!

:P


Thanks,

Nai

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH] mremap: enforce rmap src/dst vma ordering in case of vma_merge succeeding in copy_vma
  2011-11-18  1:42                                             ` Nai Xia
@ 2011-11-18  2:17                                               ` Andrea Arcangeli
  2011-11-19  9:15                                                 ` Nai Xia
  0 siblings, 1 reply; 72+ messages in thread
From: Andrea Arcangeli @ 2011-11-18  2:17 UTC (permalink / raw)
  To: Nai Xia
  Cc: Hugh Dickins, Mel Gorman, Pawel Sikora, Andrew Morton, linux-mm,
	jpiszcz, arekm, linux-kernel

On Fri, Nov 18, 2011 at 09:42:05AM +0800, Nai Xia wrote:
> First of all, I believe that at the POSIX level, it's ok for
> truncate_inode_page()
> not scanning  COWed pages, since basically we does not provide any guarantee
> for privately mapped file pages for this behavior. But missing a file
> mapped pte after its
> cache page is already removed from the the page cache is a

I also exclude there is a case that would break, but it's safer to
keep things as is, in case somebody depends on segfault trapping.

> fundermental malfuntion for
> a shared mapping when some threads see the file cache page is gone
> while some thread
> is still r/w from/to it! No matter how short the gap between
> truncate_inode_page() and
> the second loop, this is wrong.

Truncate will destroy the info on disk too... so if somebody is
writing to a mapping which points beyond the end of the i_size
concurrently with truncate, the result is undefined. The write may
well reach the page but then the page is discared. Or you may get
SIGBUS before the write.

> Second, even if the we don't care about this POSIX flaw that may
> introduce, a pte can still
> missed by the second loop. mremap can happen serveral times during
> these non-atomic
> firstpass-trunc-secondpass operations, a proper events can happily
> make the wrong order
> for every scan, and miss them all -- That's just what in Hugh's mind
> in the post you just
> replied. Without lock and proper ordering( which patial mremap cannot provide),
> this *will* happen.

There won't be more than one mremap running concurrently from the same
process (we must enforce it by making sure anon_vma lock and
i_mmap_lock are both taken at least once in copy_vma, they're already
both taken in fork, they should already be taken in all common cases
in copy_vma, so for all cases it's going to be an L1 exclusive cacheline
already). I don't exclude there may be some case that won't take the
locks in vma_adjust though; we should check it if we decide to rely
on the double loop, but it'd be a simple addition if needed.

I'm more concerned about the pte pointing to the orphaned pagecache
that would materialize for a little while because of
unmap+truncate+unmap instead of unmap+unmap+truncate (but the latter
order is needed for the COWs).

> You may disagree with me and have that locking removed, and I am
> already have that
> one line patch prepared waiting fora bug bumpping up again, what a
> cheap patch submission!

Well, I'm not yet sure it's a good idea to remove the i_mmap_mutex, or if
we should just add the anon_vma lock in mremap and add the i_mmap_lock
in fork (to avoid the orphaned pagecache left mapped in the child,
which already may happen unless some i_mmap_lock belonging to
the same inode is taken after copy_page_range returns and before we return
to userland and the child can run), and I don't think we can rely on the
order of the prio tree in fork. Fork is safe for anon pages because
there we can rely on the order of the same_anon_vma list.

I think clearing up whether this orphaned pagecache is dangerous would be a
good start. If that's too complex we just add the i_mmap_lock around
copy_page_range in fork if vma->vm_file is set. If you instead think
we can deal with the orphaned pagecache, we can add a dummy lock/unlock
of the i_mmap_mutex in the copy_vma vma_merge-succeeding case (short
critical section and not the common case) and remove the i_mmap_mutex
around move_page_tables (the common case), overall speeding up mremap and
not degrading fork.
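
The fork() side of that would be a small change in dup_mmap() (a
hypothetical sketch, not a posted patch):

		if (mpnt->vm_file) {
			struct address_space *mapping = mpnt->vm_file->f_mapping;

			/*
			 * Serialize the pte copy against truncate/rmap
			 * walks on the same file.
			 */
			mutex_lock(&mapping->i_mmap_mutex);
			retval = copy_page_range(mm, oldmm, mpnt);
			mutex_unlock(&mapping->i_mmap_mutex);
		} else
			retval = copy_page_range(mm, oldmm, mpnt);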

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH] mremap: enforce rmap src/dst vma ordering in case of vma_merge succeeding in copy_vma
  2011-11-18  2:17                                               ` Andrea Arcangeli
@ 2011-11-19  9:15                                                 ` Nai Xia
  0 siblings, 0 replies; 72+ messages in thread
From: Nai Xia @ 2011-11-19  9:15 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hugh Dickins, Mel Gorman, Pawel Sikora, Andrew Morton, linux-mm,
	jpiszcz, arekm, linux-kernel

On Fri, Nov 18, 2011 at 10:17 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> On Fri, Nov 18, 2011 at 09:42:05AM +0800, Nai Xia wrote:
>> First of all, I believe that at the POSIX level, it's ok for
>> truncate_inode_page()
>> not scanning  COWed pages, since basically we does not provide any guarantee
>> for privately mapped file pages for this behavior. But missing a file
>> mapped pte after its
>> cache page is already removed from the the page cache is a
>
> I also exclude there is a case that would break, but it's safer to
> keep things as is, in case somebody depends on segfault trapping.
>
>> fundermental malfuntion for
>> a shared mapping when some threads see the file cache page is gone
>> while some thread
>> is still r/w from/to it! No matter how short the gap between
>> truncate_inode_page() and
>> the second loop, this is wrong.
>
> Truncate will destroy the info on disk too... so if somebody is
> writing to a mapping which points beyond the end of the i_size
> concurrently with truncate, the result is undefined. The write may
> well reach the page but then the page is discared. Or you may get
> SIGBUS before the write.
>
>> Second, even if the we don't care about this POSIX flaw that may
>> introduce, a pte can still
>> missed by the second loop. mremap can happen serveral times during
>> these non-atomic
>> firstpass-trunc-secondpass operations, a proper events can happily
>> make the wrong order
>> for every scan, and miss them all -- That's just what in Hugh's mind
>> in the post you just
>> replied. Without lock and proper ordering( which patial mremap cannot provide),
>> this *will* happen.
>
> There won't be more than one mremap running concurrently from the same
> process (we must enforce it by making sure anon_vma lock and
> i_mmap_lock are both taken at least once in copy_vma, they're already
> both taken in fork, they should already be taken in all common cases
> in copy_vma so for all cases it's going to be a L1 exclusive cacheline
> already). I don't exclude there may be some case that won't take the
> locks in vma_adjust though, we should check it, if we decide to relay
> on the double loop, but it'd be a simple addition if needed.

I mean it's not concurrent mremaps; it's that mremap() can be done several
times between these 3-stage scans, since we don't take the mmap_sem
of the scanned VMAs, so they are free to do so. And without proper ordering
and locks/mutexes it's possible for these 3-stage scans to race with these
mremap()s, and a ghost PTE just jumps back and forth and misses all
these scans.

>
> I'm more concerned about the pte pointing to the orphaned pagecache
> that would materialize for a little while because of
> unmap+truncate+unmap instead of unmap+unmap+truncate (but the latter
> order is needed for the COWs).
>
>> You may disagree with me and have that locking removed, and I am
>> already have that
>> one line patch prepared waiting fora bug bumpping up again, what a
>> cheap patch submission!
>
> Well I'm not yet sure it's a good idea to remove the i_mmap_mutex, or
> whether we should just add the anon_vma lock in mremap and add the
> i_mmap_lock in fork (to avoid the orphaned pagecache left mapped in the
> child, which may already happen unless some i_mmap_lock belonging to
> the same inode is taken after copy_page_range returns and before we
> return to userland and the child can run; and I don't think we can
> rely on the order of the prio tree in fork). Fork is safe for anon
> pages because there we can rely on the order of the same_anon_vma list.
>
> I think clearing up whether this orphaned pagecache is dangerous would
> be a good start. If too complex, we just add the i_mmap_lock around
> copy_page_range in fork if vma->vm_file is set. If you instead think
> we can deal with the orphaned pagecache, we can add a dummy lock/unlock
> of i_mmap_mutex in copy_vma's vma_merge-succeeding case (short critical
> section and not the common case) and remove the i_mmap_mutex around
> move_page_tables (the common case), overall speeding up mremap and not
> degrading fork.
>

I actually feel comfortable with either direction you take :)

But I do think orphaned pagecache is not a good idea; don't you see
there is a "BUG_ON(page_mapped(page))" in __delete_from_page_cache()?
Do you really plan to remove that line?
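
(For reference, the check in question lives in __delete_from_page_cache()
in mm/filemap.c of that era; roughly, as a heavily elided paraphrase and
not a verbatim copy of the kernel source:

void __delete_from_page_cache(struct page *page)
{
	struct address_space *mapping = page->mapping;

	/* cleancache handling elided */
	radix_tree_delete(&mapping->page_tree, page->index);
	page->mapping = NULL;
	mapping->nrpages--;
	/* zone/state accounting elided */
	BUG_ON(page_mapped(page));	/* fires if a pte still maps the page */
	/* dirty accounting elided */
}
)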

Nai

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH] mremap: enforce rmap src/dst vma ordering in case of vma_merge succeeding in copy_vma
  2011-11-05 17:06                                 ` Andrea Arcangeli
@ 2011-12-08  3:24                                   ` David Rientjes
  2011-12-08 12:42                                     ` Andrea Arcangeli
  2011-12-09  0:08                                   ` Andrew Morton
  1 sibling, 1 reply; 72+ messages in thread
From: David Rientjes @ 2011-12-08  3:24 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hugh Dickins, Mel Gorman, Pawel Sikora, Andrew Morton, linux-mm,
	jpiszcz, arekm, linux-kernel

On Sat, 5 Nov 2011, Andrea Arcangeli wrote:

> migrate was doing a rmap_walk with speculative lock-less access on
> pagetables. That could lead it to not serialize properly against
> mremap PT locks. But a second problem remains in the order of vmas in
> the same_anon_vma list used by the rmap_walk.
> 
> If vma_merge would succeed in copy_vma, the src vma could be placed
> after the dst vma in the same_anon_vma list. That could still lead
> migrate to miss some pte.
> 
> This patch adds an anon_vma_moveto_tail() function to force the dst vma
> at the end of the list before mremap starts to solve the problem.
> 
> If the mremap is very large and there are lots of parents or children
> sharing the anon_vma root lock, this should still scale better than
> taking the anon_vma root lock around every pte copy practically for
> the whole duration of mremap.
> 
> Update: Hugh noticed special care is needed in the error path where
> move_page_tables goes in the reverse direction, a second
> anon_vma_moveto_tail() call is needed in the error path.
> 

Is this still needed?  It's missing in linux-next.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH] mremap: enforce rmap src/dst vma ordering in case of vma_merge succeeding in copy_vma
  2011-12-08  3:24                                   ` David Rientjes
@ 2011-12-08 12:42                                     ` Andrea Arcangeli
  0 siblings, 0 replies; 72+ messages in thread
From: Andrea Arcangeli @ 2011-12-08 12:42 UTC (permalink / raw)
  To: David Rientjes
  Cc: Hugh Dickins, Mel Gorman, Pawel Sikora, Andrew Morton, linux-mm,
	jpiszcz, arekm, linux-kernel

On Wed, Dec 07, 2011 at 07:24:59PM -0800, David Rientjes wrote:
> On Sat, 5 Nov 2011, Andrea Arcangeli wrote:
> 
> > migrate was doing a rmap_walk with speculative lock-less access on
> > pagetables. That could lead it to not serialize properly against
> > mremap PT locks. But a second problem remains in the order of vmas in
> > the same_anon_vma list used by the rmap_walk.
> > 
> > If vma_merge would succeed in copy_vma, the src vma could be placed
> > after the dst vma in the same_anon_vma list. That could still lead
> > migrate to miss some pte.
> > 
> > This patch adds an anon_vma_moveto_tail() function to force the dst vma
> > at the end of the list before mremap starts to solve the problem.
> > 
> > If the mremap is very large and there are lots of parents or children
> > sharing the anon_vma root lock, this should still scale better than
> > taking the anon_vma root lock around every pte copy practically for
> > the whole duration of mremap.
> > 
> > Update: Hugh noticed special care is needed in the error path where
> > move_page_tables goes in the reverse direction, a second
> > anon_vma_moveto_tail() call is needed in the error path.
> > 
> 
> Is this still needed?  It's missing in linux-next.

Yes, it's needed: either this or the anon_vma lock around
move_page_tables. Then we also need the i_mmap_mutex around fork or a
triple loop in vmtruncate (then we could remove i_mmap_mutex in
mremap).
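
A rough sketch (editor's illustration, not a patch from this thread) of
that alternative, holding the anon_vma root lock across the whole
pagetable move in move_vma(), using the locking helpers of the ~3.2
sources:

	/* vma_lock_anon_vma() is a no-op when the vma has no anon_vma */
	vma_lock_anon_vma(vma);
	moved_len = move_page_tables(vma, old_addr, new_vma, new_addr, old_len);
	vma_unlock_anon_vma(vma);

This serializes the entire copy against rmap walkers instead of fixing
up the same_anon_vma ordering, which is why the patch above argues that
anon_vma_moveto_tail() should scale better for large mremaps.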

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH] mremap: enforce rmap src/dst vma ordering in case of vma_merge succeeding in copy_vma
  2011-11-05 17:06                                 ` Andrea Arcangeli
  2011-12-08  3:24                                   ` David Rientjes
@ 2011-12-09  0:08                                   ` Andrew Morton
  2011-12-09  1:55                                     ` Andrea Arcangeli
  1 sibling, 1 reply; 72+ messages in thread
From: Andrew Morton @ 2011-12-09  0:08 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hugh Dickins, Mel Gorman, Pawel Sikora, linux-mm, jpiszcz, arekm,
	linux-kernel, Nai Xia

On Sat,  5 Nov 2011 18:06:22 +0100
Andrea Arcangeli <aarcange@redhat.com> wrote:

> This patch adds an anon_vma_moveto_tail() function to force the dst vma
> at the end of the list before mremap starts to solve the problem.

It's not obvious to me that the patch which I merged is the one which
we want to merge, given the amount of subsequent discussion.  Please
check this.

I'm thinking we merge this into 3.3-rc1, tagged for backporting into
3.2.x.  To give us additional time to think about it and test it.

Or perhaps the bug just isn't serious enough to bother fixing it in 3.2
or earlier?



From: Andrea Arcangeli <aarcange@redhat.com>
Subject: mremap: enforce rmap src/dst vma ordering in case of vma_merge() succeeding in copy_vma()

migrate was doing an rmap_walk with speculative lock-less access on
pagetables.  That could lead it to fail to serialize properly against
mremap PT locks.  But a second problem remains in the order of vmas in the
same_anon_vma list used by the rmap_walk.

If vma_merge succeeds in copy_vma, the src vma could be placed after the
dst vma in the same_anon_vma list.  That could still lead to migrate
missing some pte.

This patch adds an anon_vma_moveto_tail() function to force the dst vma at
the end of the list before mremap starts to solve the problem.

If the mremap is very large and there are lots of parents or children
sharing the anon_vma root lock, this should still scale better than taking
the anon_vma root lock around every pte copy practically for the whole
duration of mremap.

Update: Hugh noticed that special care is needed in the error path where
move_page_tables goes in the reverse direction: a second
anon_vma_moveto_tail() call is needed there.

This program exercises the anon_vma_moveto_tail:

===

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/time.h>

/* SIZE is not defined in the original test program; any 2MB-aligned
 * size large enough to be interesting will do, e.g.: */
#define SIZE (32*1024*1024)

int main(void)
{
	static struct timeval oldstamp, newstamp;
	long diffsec;
	char *p, *p2, *p3, *p4;
	if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
		perror("memalign"), exit(1);
	if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
		perror("memalign"), exit(1);
	if (posix_memalign((void **)&p3, 2*1024*1024, SIZE))
		perror("memalign"), exit(1);

	memset(p, 0xff, SIZE);
	printf("%p\n", p);
	memset(p2, 0xff, SIZE);
	memset(p3, 0x77, 4096);
	if (memcmp(p, p2, SIZE))
		printf("error\n");
	p4 = mremap(p+SIZE/2, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
	if (p4 != p3)
		perror("mremap"), exit(1);
	p4 = mremap(p4, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p+SIZE/2);
	if (p4 != p+SIZE/2)
		perror("mremap"), exit(1);
	if (memcmp(p, p2, SIZE))
		printf("error\n");
	printf("ok\n");

	return 0;
}
===

$ perf probe -a anon_vma_moveto_tail
Add new event:
  probe:anon_vma_moveto_tail (on anon_vma_moveto_tail)

You can now use it on all perf tools, such as:

        perf record -e probe:anon_vma_moveto_tail -aR sleep 1

$ perf record -e probe:anon_vma_moveto_tail -aR ./anon_vma_moveto_tail
0x7f2ca2800000
ok
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.043 MB perf.data (~1860 samples) ]
$ perf report --stdio
   100.00%  anon_vma_moveto  [kernel.kallsyms]  [k] anon_vma_moveto_tail

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Nai Xia <nai.xia@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Pawel Sikora <pluto@agmk.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/rmap.h |    1 
 mm/mmap.c            |   22 ++++++++++++++++++--
 mm/mremap.c          |    1 
 mm/rmap.c            |   45 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 67 insertions(+), 2 deletions(-)

diff -puN include/linux/rmap.h~mremap-enforce-rmap-src-dst-vma-ordering-in-case-of-vma_merge-succeeding-in-copy_vma include/linux/rmap.h
--- a/include/linux/rmap.h~mremap-enforce-rmap-src-dst-vma-ordering-in-case-of-vma_merge-succeeding-in-copy_vma
+++ a/include/linux/rmap.h
@@ -120,6 +120,7 @@ void anon_vma_init(void);	/* create anon
 int  anon_vma_prepare(struct vm_area_struct *);
 void unlink_anon_vmas(struct vm_area_struct *);
 int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
+void anon_vma_moveto_tail(struct vm_area_struct *);
 int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
 void __anon_vma_link(struct vm_area_struct *);
 
diff -puN mm/mmap.c~mremap-enforce-rmap-src-dst-vma-ordering-in-case-of-vma_merge-succeeding-in-copy_vma mm/mmap.c
--- a/mm/mmap.c~mremap-enforce-rmap-src-dst-vma-ordering-in-case-of-vma_merge-succeeding-in-copy_vma
+++ a/mm/mmap.c
@@ -2349,13 +2349,16 @@ struct vm_area_struct *copy_vma(struct v
 	struct vm_area_struct *new_vma, *prev;
 	struct rb_node **rb_link, *rb_parent;
 	struct mempolicy *pol;
+	bool faulted_in_anon_vma = true;
 
 	/*
 	 * If anonymous vma has not yet been faulted, update new pgoff
 	 * to match new location, to increase its chance of merging.
 	 */
-	if (!vma->vm_file && !vma->anon_vma)
+	if (!vma->vm_file && !vma->anon_vma) {
 		pgoff = addr >> PAGE_SHIFT;
+		faulted_in_anon_vma = false;
+	}
 
 	find_vma_prepare(mm, addr, &prev, &rb_link, &rb_parent);
 	new_vma = vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
@@ -2365,8 +2368,23 @@ struct vm_area_struct *copy_vma(struct v
 		 * Source vma may have been merged into new_vma
 		 */
 		if (vma_start >= new_vma->vm_start &&
-		    vma_start < new_vma->vm_end)
+		    vma_start < new_vma->vm_end) {
+			/*
+			 * The only way we can get a vma_merge with
+			 * self during an mremap is if the vma hasn't
+			 * been faulted in yet and we were allowed to
+			 * reset the dst vma->vm_pgoff to the
+			 * destination address of the mremap to allow
+			 * the merge to happen. mremap must change the
+			 * vm_pgoff linearity between src and dst vmas
+			 * (in turn preventing a vma_merge) to be
+			 * safe. It is only safe to keep the vm_pgoff
+			 * linear if there are no pages mapped yet.
+			 */
+			VM_BUG_ON(faulted_in_anon_vma);
 			*vmap = new_vma;
+		} else
+			anon_vma_moveto_tail(new_vma);
 	} else {
 		new_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
 		if (new_vma) {
diff -puN mm/mremap.c~mremap-enforce-rmap-src-dst-vma-ordering-in-case-of-vma_merge-succeeding-in-copy_vma mm/mremap.c
--- a/mm/mremap.c~mremap-enforce-rmap-src-dst-vma-ordering-in-case-of-vma_merge-succeeding-in-copy_vma
+++ a/mm/mremap.c
@@ -225,6 +225,7 @@ static unsigned long move_vma(struct vm_
 		 * which will succeed since page tables still there,
 		 * and then proceed to unmap new area instead of old.
 		 */
+		anon_vma_moveto_tail(vma);
 		move_page_tables(new_vma, new_addr, vma, old_addr, moved_len);
 		vma = new_vma;
 		old_len = new_len;
diff -puN mm/rmap.c~mremap-enforce-rmap-src-dst-vma-ordering-in-case-of-vma_merge-succeeding-in-copy_vma mm/rmap.c
--- a/mm/rmap.c~mremap-enforce-rmap-src-dst-vma-ordering-in-case-of-vma_merge-succeeding-in-copy_vma
+++ a/mm/rmap.c
@@ -272,6 +272,51 @@ int anon_vma_clone(struct vm_area_struct
 }
 
 /*
+ * Some rmap walk that needs to find all ptes/hugepmds without false
+ * negatives (like migrate and split_huge_page) running concurrent
+ * with operations that copy or move pagetables (like mremap() and
+ * fork()) to be safe. They depend on the anon_vma "same_anon_vma"
+ * list to be in a certain order: the dst_vma must be placed after the
+ * src_vma in the list. This is always guaranteed by fork() but
+ * mremap() needs to call this function to enforce it in case the
+ * dst_vma isn't newly allocated and chained with the anon_vma_clone()
+ * function but just an extension of a pre-existing vma through
+ * vma_merge.
+ *
+ * NOTE: the same_anon_vma list can still be changed by other
+ * processes while mremap runs because mremap doesn't hold the
+ * anon_vma mutex to prevent modifications to the list while it
+ * runs. All we need to enforce is that the relative order of this
+ * process vmas isn't changing (we don't care about other vmas
+ * order). Each vma corresponds to an anon_vma_chain structure so
+ * there's no risk that other processes calling anon_vma_moveto_tail()
+ * and changing the same_anon_vma list under mremap() will screw with
+ * the relative order of this process vmas in the list, because
+ * they can't alter the order of any vma that belongs to this
+ * process. And there can't be another anon_vma_moveto_tail() running
+ * concurrently with mremap() coming from this process because we hold
+ * the mmap_sem for the whole mremap(). fork() ordering dependency
+ * also shouldn't be affected because fork() only cares that the
+ * parent vmas are placed in the list before the child vmas and
+ * anon_vma_moveto_tail() won't reorder vmas from either the fork()
+ * parent or child.
+ */
+void anon_vma_moveto_tail(struct vm_area_struct *dst)
+{
+	struct anon_vma_chain *pavc;
+	struct anon_vma *root = NULL;
+
+	list_for_each_entry_reverse(pavc, &dst->anon_vma_chain, same_vma) {
+		struct anon_vma *anon_vma = pavc->anon_vma;
+		VM_BUG_ON(pavc->vma != dst);
+		root = lock_anon_vma_root(root, anon_vma);
+		list_del(&pavc->same_anon_vma);
+		list_add_tail(&pavc->same_anon_vma, &anon_vma->head);
+	}
+	unlock_anon_vma_root(root);
+}
+
+/*
  * Attach vma to its own anon_vma, as well as to the anon_vmas that
  * the corresponding VMA in the parent process is attached to.
  * Returns 0 on success, non-zero on failure.
_


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH] mremap: enforce rmap src/dst vma ordering in case of vma_merge succeeding in copy_vma
  2011-12-09  0:08                                   ` Andrew Morton
@ 2011-12-09  1:55                                     ` Andrea Arcangeli
  0 siblings, 0 replies; 72+ messages in thread
From: Andrea Arcangeli @ 2011-12-09  1:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Mel Gorman, Pawel Sikora, linux-mm, jpiszcz, arekm,
	linux-kernel, Nai Xia

On Thu, Dec 08, 2011 at 04:08:56PM -0800, Andrew Morton wrote:
> It's not obvious to me that the patch which I merged is the one which
> we want to merge, given the amount of subsequent discussion.  Please
> check this.

That's not the last version.

> I'm thinking we merge this into 3.3-rc1, tagged for backporting into
> 3.2.x.  To give us additional time to think about it and test it.
> 
> Or perhaps the bug just isn't serious enough to bother fixing it in 3.2
> or earlier?

Probably not serious enough, I'm not aware of anybody reproducing it.

Then we also have to think about what to do with the i_mmap_mutex:
whether to remove it from mremap too, or to add it to fork as well.

The problem with the i_mmap_mutex is that the prio tree, being a tree,
gives us no way to ensure that the order of the range "walk" is related
to the order of "insertion". So a solution like the one below can't
work for the prio tree (it only works for the anon_vma_chain _list_).

Either we loop twice in the rmap_walk (adding a third loop to
vmtruncate) or we add the i_mmap_mutex to fork (where it looks missing,
and the page_mapped check in __delete_from_page_cache can probably fire
if such a race triggers; otherwise it looks like a fairly innocent race,
but clearly the implications aren't obvious or there would be no BUG_ON
in __delete_from_page_cache).
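
A rough sketch (editor's illustration, not a patch from this thread) of
the fork-side option, taking i_mmap_mutex across the pagetable copy in
dup_mmap() for file-backed vmas (names as in the ~3.2 sources):

		/* in kernel/fork.c:dup_mmap(), around the existing call */
		struct address_space *mapping =
			tmp->vm_file ? tmp->vm_file->f_mapping : NULL;

		if (mapping)
			mutex_lock(&mapping->i_mmap_mutex);
		retval = copy_page_range(mm, oldmm, mpnt);
		if (mapping)
			mutex_unlock(&mapping->i_mmap_mutex);

This would prevent a truncate rmap walk from running while the child's
ptes are being copied, at the cost of serializing fork against truncate
on that file.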

For file mappings the only rmap walk that has to be exact and must not
miss any pte is the vmtruncate path. That's why only vmtruncate would
need a third loop (third because we need a first loop before the
pagecache truncation, and two more loops to catch all ptes; otherwise a
temporary, but only temporary, pte can still be mapped and fire the
BUG_ON in __delete_from_page_cache).
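
A sketch of the doubled-walk part of that idea (editor's illustration,
again not a patch from the thread), in terms of the ~3.2 truncate path:

	/* conceptually, in mm/memory.c:unmap_mapping_range(),
	 * with mapping->i_mmap_mutex held: */
	if (unlikely(!prio_tree_empty(&mapping->i_mmap))) {
		unmap_mapping_range_tree(&mapping->i_mmap, &details);
		/*
		 * Second pass: catch any pte that a concurrent fork()
		 * or mremap() copied into a vma the first pass had
		 * already visited.
		 */
		unmap_mapping_range_tree(&mapping->i_mmap, &details);
	}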

For anon pages it's only split_huge_page and remove_migration_ptes
that shouldn't miss ptes/hugepmds.

===
From: Andrea Arcangeli <aarcange@redhat.com>
Subject: [PATCH] mremap: enforce rmap src/dst vma ordering in case of vma_merge succeeding in copy_vma

migrate was doing an rmap_walk with speculative lock-less access on
pagetables. That could lead it to fail to serialize properly against
mremap PT locks. But a second problem remains in the order of vmas in
the same_anon_vma list used by the rmap_walk.

If vma_merge would succeed in copy_vma, the src vma could be placed
after the dst vma in the same_anon_vma list. That could still lead
migrate to miss some pte.

This patch adds an anon_vma_moveto_tail() function to force the dst vma
at the end of the list before mremap starts to solve the problem.

If the mremap is very large and there are lots of parents or children
sharing the anon_vma root lock, this should still scale better than
taking the anon_vma root lock around every pte copy practically for
the whole duration of mremap.

Update: Hugh noticed that special care is needed in the error path where
move_page_tables goes in the reverse direction: a second
anon_vma_moveto_tail() call is needed there.

This program exercises the anon_vma_moveto_tail:

===

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/time.h>

/* SIZE is not defined in the original test program; any 2MB-aligned
 * size large enough to be interesting will do, e.g.: */
#define SIZE (32*1024*1024)

int main(void)
{
	static struct timeval oldstamp, newstamp;
	long diffsec;
	char *p, *p2, *p3, *p4;
	if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
		perror("memalign"), exit(1);
	if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
		perror("memalign"), exit(1);
	if (posix_memalign((void **)&p3, 2*1024*1024, SIZE))
		perror("memalign"), exit(1);

	memset(p, 0xff, SIZE);
	printf("%p\n", p);
	memset(p2, 0xff, SIZE);
	memset(p3, 0x77, 4096);
	if (memcmp(p, p2, SIZE))
		printf("error\n");
	p4 = mremap(p+SIZE/2, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
	if (p4 != p3)
		perror("mremap"), exit(1);
	p4 = mremap(p4, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p+SIZE/2);
	if (p4 != p+SIZE/2)
		perror("mremap"), exit(1);
	if (memcmp(p, p2, SIZE))
		printf("error\n");
	printf("ok\n");

	return 0;
}
===

$ perf probe -a anon_vma_moveto_tail
Add new event:
  probe:anon_vma_moveto_tail (on anon_vma_moveto_tail)

You can now use it on all perf tools, such as:

        perf record -e probe:anon_vma_moveto_tail -aR sleep 1

$ perf record -e probe:anon_vma_moveto_tail -aR ./anon_vma_moveto_tail
0x7f2ca2800000
ok
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.043 MB perf.data (~1860 samples) ]
$ perf report --stdio
   100.00%  anon_vma_moveto  [kernel.kallsyms]  [k] anon_vma_moveto_tail

Reported-by: Nai Xia <nai.xia@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/rmap.h |    1 +
 mm/mmap.c            |   24 +++++++++++++++++++++---
 mm/mremap.c          |    9 +++++++++
 mm/rmap.c            |   45 +++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 76 insertions(+), 3 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 2148b12..1afb995 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -120,6 +120,7 @@ void anon_vma_init(void);	/* create anon_vma_cachep */
 int  anon_vma_prepare(struct vm_area_struct *);
 void unlink_anon_vmas(struct vm_area_struct *);
 int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
+void anon_vma_moveto_tail(struct vm_area_struct *);
 int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
 void __anon_vma_link(struct vm_area_struct *);
 
diff --git a/mm/mmap.c b/mm/mmap.c
index eae90af..adea3b8 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2322,13 +2322,16 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 	struct vm_area_struct *new_vma, *prev;
 	struct rb_node **rb_link, *rb_parent;
 	struct mempolicy *pol;
+	bool faulted_in_anon_vma = true;
 
 	/*
 	 * If anonymous vma has not yet been faulted, update new pgoff
 	 * to match new location, to increase its chance of merging.
 	 */
-	if (!vma->vm_file && !vma->anon_vma)
+	if (unlikely(!vma->vm_file && !vma->anon_vma)) {
 		pgoff = addr >> PAGE_SHIFT;
+		faulted_in_anon_vma = false;
+	}
 
 	find_vma_prepare(mm, addr, &prev, &rb_link, &rb_parent);
 	new_vma = vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
@@ -2337,9 +2340,24 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 		/*
 		 * Source vma may have been merged into new_vma
 		 */
-		if (vma_start >= new_vma->vm_start &&
-		    vma_start < new_vma->vm_end)
+		if (unlikely(vma_start >= new_vma->vm_start &&
+			     vma_start < new_vma->vm_end)) {
+			/*
+			 * The only way we can get a vma_merge with
+			 * self during an mremap is if the vma hasn't
+			 * been faulted in yet and we were allowed to
+			 * reset the dst vma->vm_pgoff to the
+			 * destination address of the mremap to allow
+			 * the merge to happen. mremap must change the
+			 * vm_pgoff linearity between src and dst vmas
+			 * (in turn preventing a vma_merge) to be
+			 * safe. It is only safe to keep the vm_pgoff
+			 * linear if there are no pages mapped yet.
+			 */
+			VM_BUG_ON(faulted_in_anon_vma);
 			*vmap = new_vma;
+		} else
+			anon_vma_moveto_tail(new_vma);
 	} else {
 		new_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
 		if (new_vma) {
diff --git a/mm/mremap.c b/mm/mremap.c
index d6959cb..87bb839 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -221,6 +221,15 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 	moved_len = move_page_tables(vma, old_addr, new_vma, new_addr, old_len);
 	if (moved_len < old_len) {
 		/*
+		 * Before moving the page tables from the new vma to
+		 * the old vma, we need to be sure the old vma is
+		 * queued after new vma in the same_anon_vma list to
+		 * prevent SMP races with rmap_walk (that could lead
+		 * rmap_walk to miss some page table).
+		 */
+		anon_vma_moveto_tail(vma);
+
+		/*
 		 * On error, move entries back from new area to old,
 		 * which will succeed since page tables still there,
 		 * and then proceed to unmap new area instead of old.
diff --git a/mm/rmap.c b/mm/rmap.c
index a4fd368..a2e5ce1 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -272,6 +272,51 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
 }
 
 /*
+ * Some rmap walk that needs to find all ptes/hugepmds without false
+ * negatives (like migrate and split_huge_page) running concurrent
+ * with operations that copy or move pagetables (like mremap() and
+ * fork()) to be safe. They depend on the anon_vma "same_anon_vma"
+ * list to be in a certain order: the dst_vma must be placed after the
+ * src_vma in the list. This is always guaranteed by fork() but
+ * mremap() needs to call this function to enforce it in case the
+ * dst_vma isn't newly allocated and chained with the anon_vma_clone()
+ * function but just an extension of a pre-existing vma through
+ * vma_merge.
+ *
+ * NOTE: the same_anon_vma list can still be changed by other
+ * processes while mremap runs because mremap doesn't hold the
+ * anon_vma mutex to prevent modifications to the list while it
+ * runs. All we need to enforce is that the relative order of this
+ * process vmas isn't changing (we don't care about other vmas
+ * order). Each vma corresponds to an anon_vma_chain structure so
+ * there's no risk that other processes calling anon_vma_moveto_tail()
+ * and changing the same_anon_vma list under mremap() will screw with
+ * the relative order of this process vmas in the list, because
+ * they can't alter the order of any vma that belongs to this
+ * process. And there can't be another anon_vma_moveto_tail() running
+ * concurrently with mremap() coming from this process because we hold
+ * the mmap_sem for the whole mremap(). fork() ordering dependency
+ * also shouldn't be affected because fork() only cares that the
+ * parent vmas are placed in the list before the child vmas and
+ * anon_vma_moveto_tail() won't reorder vmas from either the fork()
+ * parent or child.
+ */
+void anon_vma_moveto_tail(struct vm_area_struct *dst)
+{
+	struct anon_vma_chain *pavc;
+	struct anon_vma *root = NULL;
+
+	list_for_each_entry_reverse(pavc, &dst->anon_vma_chain, same_vma) {
+		struct anon_vma *anon_vma = pavc->anon_vma;
+		VM_BUG_ON(pavc->vma != dst);
+		root = lock_anon_vma_root(root, anon_vma);
+		list_del(&pavc->same_anon_vma);
+		list_add_tail(&pavc->same_anon_vma, &anon_vma->head);
+	}
+	unlock_anon_vma_root(root);
+}
+
+/*
  * Attach vma to its own anon_vma, as well as to the anon_vmas that
  * the corresponding VMA in the parent process is attached to.
  * Returns 0 on success, non-zero on failure.

^ permalink raw reply related	[flat|nested] 72+ messages in thread

end of thread, other threads:[~2011-12-09  1:55 UTC | newest]

Thread overview: 72+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-10-12 18:12 kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110 Paweł Sikora
2011-10-13 23:16 ` Hugh Dickins
2011-10-13 23:30   ` Hugh Dickins
2011-10-16 16:11     ` Christoph Hellwig
2011-10-16 23:54     ` Andrea Arcangeli
2011-10-17 18:51       ` Hugh Dickins
2011-10-17 22:05         ` Andrea Arcangeli
2011-10-19  7:43         ` Mel Gorman
2011-10-19 13:39           ` Linus Torvalds
2011-10-19 19:42             ` Hugh Dickins
2011-10-20  6:30               ` Paweł Sikora
2011-10-20  6:51                 ` Linus Torvalds
2011-10-21  6:54                 ` Nai Xia
2011-10-21  7:35                   ` Pawel Sikora
2011-10-20 12:51               ` Nai Xia
     [not found]                 ` <CANsGZ6a6_q8+88FRV2froBsVEq7GhtKd9fRnB-0M2MD3a7tnSw@mail.gmail.com>
2011-10-21  6:22                   ` Nai Xia
2011-10-21  8:07                     ` Pawel Sikora
2011-10-21  9:07                       ` Nai Xia
2011-10-21 21:36                         ` Paweł Sikora
2011-10-22  6:21                           ` Nai Xia
2011-10-22 16:42                             ` Paweł Sikora
     [not found]                               ` <CAPQyPG5HJKTo8AEy_khdJeciTgtNQepK6XLcpzvPF8PYS0V-Lw@mail.gmail.com>
2011-10-25  7:33                                 ` Pawel Sikora
2011-10-20  9:11       ` Nai Xia
2011-10-21 15:56         ` Mel Gorman
2011-10-21 17:21           ` Nai Xia
2011-10-21 17:41           ` Andrea Arcangeli
2011-10-21 22:50             ` Andrea Arcangeli
2011-10-22  5:52               ` Nai Xia
2011-10-31 17:14                 ` Andrea Arcangeli
2011-10-31 17:27                   ` [PATCH] mremap: enforce rmap src/dst vma ordering in case of vma_merge succeeding in copy_vma Andrea Arcangeli
2011-11-01 12:07                     ` Mel Gorman
2011-11-01 14:35                     ` Nai Xia
2011-11-04  7:31                     ` Hugh Dickins
2011-11-04 14:34                       ` Nai Xia
2011-11-04 15:59                         ` Pawel Sikora
2011-11-05  2:21                           ` Nai Xia
2011-11-04 19:16                         ` Hugh Dickins
2011-11-04 20:54                           ` Andrea Arcangeli
2011-11-05  0:09                             ` Nai Xia
2011-11-05  2:21                               ` Hugh Dickins
2011-11-05  3:07                                 ` Andrea Arcangeli
2011-11-05 17:06                                 ` Andrea Arcangeli
2011-12-08  3:24                                   ` David Rientjes
2011-12-08 12:42                                     ` Andrea Arcangeli
2011-12-09  0:08                                   ` Andrew Morton
2011-12-09  1:55                                     ` Andrea Arcangeli
2011-11-04 23:56                       ` Andrea Arcangeli
2011-11-05  0:21                         ` Nai Xia
2011-11-05  0:59                           ` Nai Xia
2011-11-05  1:33                           ` Andrea Arcangeli
2011-11-05  2:00                             ` Nai Xia
2011-11-07 13:14                               ` Mel Gorman
2011-11-07 15:42                                 ` Andrea Arcangeli
2011-11-07 16:28                                   ` Mel Gorman
2011-11-09  1:25                                     ` Andrea Arcangeli
2011-11-11  9:14                                       ` Nai Xia
2011-11-16 14:00                                       ` Andrea Arcangeli
2011-11-17  0:16                                         ` Hugh Dickins
2011-11-17  2:49                                           ` Nai Xia
2011-11-17  6:21                                           ` Nai Xia
2011-11-17 18:42                                           ` Andrea Arcangeli
2011-11-18  1:42                                             ` Nai Xia
2011-11-18  2:17                                               ` Andrea Arcangeli
2011-11-19  9:15                                                 ` Nai Xia
2011-10-22  5:07             ` kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110 Nai Xia
2011-10-31 16:34               ` Andrea Arcangeli
2011-10-16 22:37   ` Linus Torvalds
2011-10-17  3:02     ` Hugh Dickins
2011-10-17  3:09       ` Linus Torvalds
2011-10-18 19:17   ` Paweł Sikora
2011-10-19  7:30   ` Mel Gorman
2011-10-21 12:44     ` Mel Gorman
