* [RFC PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables
From: Mel Gorman @ 2012-07-18 10:43 UTC
  To: Linux-MM
  Cc: Michal Hocko, Hugh Dickins, David Gibson, Kenneth W Chen, LKML,
	Mel Gorman

(Sending as RFC as this one is tricky and, as it is timing-dependent, the
patch may accidentally be papering over a more fundamental problem. Even
if it is not, it may be more heavy-handed than necessary, but I am
suffering from tunnel vision from looking at this. I wanted to get
comments on this version before trying to be clever.)

If a process creates a large hugetlbfs mapping that is eligible for page
table sharing and forks heavily, with some children faulting and others
destroying the mapping, then it is possible for page tables to get
corrupted. Some teardowns of the mapping encounter a "bad pmd" and
output a message to the kernel log.  The final teardown will trigger a
BUG_ON in mm/filemap.c.

This was reproduced in 3.4 but is known to have existed for a long time
and goes back at least as far as 2.6.37. It was probably introduced in
2.6.20 by [39dde65c: shared page table for hugetlb page]. The messages
look like this:

[  ..........] Lots of bad pmd messages followed by this
[  127.164256] mm/memory.c:391: bad pmd ffff880412e04fe8(80000003de4000e7).
[  127.164257] mm/memory.c:391: bad pmd ffff880412e04ff0(80000003de6000e7).
[  127.164258] mm/memory.c:391: bad pmd ffff880412e04ff8(80000003de0000e7).
[  127.186778] ------------[ cut here ]------------
[  127.186781] kernel BUG at mm/filemap.c:134!
[  127.186782] invalid opcode: 0000 [#1] SMP
[  127.186783] CPU 7
[  127.186784] Modules linked in: af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf ext3 jbd dm_mod coretemp crc32c_intel usb_storage ghash_clmulni_intel aesni_intel i2c_i801 r8169 mii uas sr_mod cdrom sg iTCO_wdt iTCO_vendor_support shpchp serio_raw cryptd aes_x86_64 e1000e pci_hotplug dcdbas aes_generic container microcode ext4 mbcache jbd2 crc16 sd_mod crc_t10dif i915 drm_kms_helper drm i2c_algo_bit ehci_hcd ahci libahci usbcore rtc_cmos usb_common button i2c_core intel_agp video intel_gtt fan processor thermal thermal_sys hwmon ata_generic pata_atiixp libata scsi_mod
[  127.186801]
[  127.186802] Pid: 9017, comm: hugetlbfs-test Not tainted 3.4.0-autobuild #53 Dell Inc. OptiPlex 990/06D7TR
[  127.186804] RIP: 0010:[<ffffffff810ed6ce>]  [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
[  127.186809] RSP: 0000:ffff8804144b5c08  EFLAGS: 00010002
[  127.186810] RAX: 0000000000000001 RBX: ffffea000a5c9000 RCX: 00000000ffffffc0
[  127.186811] RDX: 0000000000000000 RSI: 0000000000000009 RDI: ffff88042dfdad00
[  127.186812] RBP: ffff8804144b5c18 R08: 0000000000000009 R09: 0000000000000003
[  127.186813] R10: 0000000000000000 R11: 000000000000002d R12: ffff880412ff83d8
[  127.186814] R13: ffff880412ff83d8 R14: 0000000000000000 R15: ffff880412ff83d8
[  127.186815] FS:  00007fe18ed2c700(0000) GS:ffff88042dce0000(0000) knlGS:0000000000000000
[  127.186816] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  127.186817] CR2: 00007fe340000503 CR3: 0000000417a14000 CR4: 00000000000407e0
[  127.186818] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  127.186819] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  127.186820] Process hugetlbfs-test (pid: 9017, threadinfo ffff8804144b4000, task ffff880417f803c0)
[  127.186821] Stack:
[  127.186822]  ffffea000a5c9000 0000000000000000 ffff8804144b5c48 ffffffff810ed83b
[  127.186824]  ffff8804144b5c48 000000000000138a 0000000000001387 ffff8804144b5c98
[  127.186825]  ffff8804144b5d48 ffffffff811bc925 ffff8804144b5cb8 0000000000000000
[  127.186827] Call Trace:
[  127.186829]  [<ffffffff810ed83b>] delete_from_page_cache+0x3b/0x80
[  127.186832]  [<ffffffff811bc925>] truncate_hugepages+0x115/0x220
[  127.186834]  [<ffffffff811bca43>] hugetlbfs_evict_inode+0x13/0x30
[  127.186837]  [<ffffffff811655c7>] evict+0xa7/0x1b0
[  127.186839]  [<ffffffff811657a3>] iput_final+0xd3/0x1f0
[  127.186840]  [<ffffffff811658f9>] iput+0x39/0x50
[  127.186842]  [<ffffffff81162708>] d_kill+0xf8/0x130
[  127.186843]  [<ffffffff81162812>] dput+0xd2/0x1a0
[  127.186845]  [<ffffffff8114e2d0>] __fput+0x170/0x230
[  127.186848]  [<ffffffff81236e0e>] ? rb_erase+0xce/0x150
[  127.186849]  [<ffffffff8114e3ad>] fput+0x1d/0x30
[  127.186851]  [<ffffffff81117db7>] remove_vma+0x37/0x80
[  127.186853]  [<ffffffff81119182>] do_munmap+0x2d2/0x360
[  127.186855]  [<ffffffff811cc639>] sys_shmdt+0xc9/0x170
[  127.186857]  [<ffffffff81410a39>] system_call_fastpath+0x16/0x1b
[  127.186858] Code: 0f 1f 44 00 00 48 8b 43 08 48 8b 00 48 8b 40 28 8b b0 40 03 00 00 85 f6 0f 88 df fe ff ff 48 89 df e8 e7 cb 05 00 e9 d2 fe ff ff <0f> 0b 55 83 e2 fd 48 89 e5 48 83 ec 30 48 89 5d d8 4c 89 65 e0
[  127.186868] RIP  [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
[  127.186870]  RSP <ffff8804144b5c08>
[  127.186871] ---[ end trace 7cbac5d1db69f426 ]---

The bug is a race and not always easy to reproduce. To reproduce it I was
doing the following on a single-socket i7-based machine with 16G of RAM.

$ hugeadm --pool-pages-max DEFAULT:13G
$ echo $((18*1048576*1024)) > /proc/sys/kernel/shmmax
$ echo $((18*1048576*1024)) > /proc/sys/kernel/shmall
$ for i in `seq 1 9000`; do ./hugetlbfs-test; done

On my particular machine, it usually triggers within 10 minutes but enabling
debug options can change the timing such that it never hits. Once the bug is
triggered, the machine is in trouble and needs to be rebooted. The machine
will respond but processes accessing /proc, like "ps aux", will hang due to
the BUG_ON. shutdown will also hang and needs a hard reset or a sysrq-b.

The test case was mostly written by Michal Hocko with a few minor changes
by me to reproduce this bug. Michal did a lot of heavy lifting eliminating
possible sources of the race and saved me the embarrassment of posting a
completely broken patch yesterday. He did not see this patch before
going to the lists so any flaws are mine!

The basic problem is a race between page table sharing and teardown. For
the most part page table sharing depends on i_mmap_mutex. In some cases,
it is also taking the mm->page_table_lock for the PTE updates but with
shared page tables, it is the i_mmap_mutex that is more important.

Unfortunately, it also appears to be insufficient. Consider the following
situation:

Process A					Process B
---------					---------
hugetlb_fault					shmdt
  huge_pte_alloc				LockWrite(mmap_sem)
    Lock(i_mmap_mutex)				  do_munmap
						    unmap_region
						      unmap_vmas
						        unmap_single_vma
						          unmap_hugepage_range
      						            Lock(i_mmap_mutex)
							    Lock(mm->page_table_lock)
							    huge_pmd_unshare/unmap tables <--- (1)
							    Unlock(mm->page_table_lock)
      						            Unlock(i_mmap_mutex)
    huge_pte_alloc				      ...
      Lock(i_mmap_mutex)			      ...
      vma_prio_walk, find svma, spte		      ...
      Lock(mm->page_table_lock)			      ...
      share spte				      ...
      Unlock(mm->page_table_lock)		      ...
      Unlock(i_mmap_mutex)			      ...
    hugetlb_no_page									  <--- (2)
						      free_pgtables
						        unlink_file_vma
							hugetlb_free_pgd_range
						    remove_vma_list

In this scenario, it is possible for Process A to share page tables with
Process B that is trying to tear them down.  The i_mmap_mutex on its own
does not prevent Process A walking Process B's page tables. At (1) above,
the page tables are not shared yet so it unmaps the PMDs. Process A sets
up page table sharing and at (2) faults a new entry. Process B then trips
up on it in free_pgtables.

This patch is heavy-handed but, to protect against races between teardown
and unmapping, it takes the mmap_sem for read and then the page_table_lock
of processes it is considering sharing with. This prevents the VMA or page
tables being updated during sharing. I verified that page table sharing
still occurs using the awesome technology of printk to spit out a message
when huge_pmd_share is successful. The libhugetlbfs regression test suite passes.
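
(For illustration, the debug check need be no more than something like the
following after the prio tree walk; the exact placement and message text
are assumptions:

	if (spte)
		printk(KERN_DEBUG "huge_pmd_share: shared spte at %lx\n", addr);
)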

I strongly suggest this be treated as a -stable candidate if it is merged.

Test program is as follows.

==== CUT HERE ====

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>

static size_t huge_page_size = (2UL << 20);
static size_t nr_huge_page_A = 512;
static size_t nr_huge_page_B = 5632;

unsigned int get_random(unsigned int max)
{
	struct timeval tv;

	gettimeofday(&tv, NULL);
	srandom(tv.tv_usec);
	return random() % max;
}

static void play(void *addr, size_t size)
{
	unsigned char *start = addr,
		      *end = start + size,
		      *a;
	start += get_random(size/2);

	/* we could iterate on huge pages but let's give it more time. */
	for (a = start; a < end; a += 4096)
		*a = 0;
}

int main(int argc, char **argv)
{
	key_t key = IPC_PRIVATE;
	size_t sizeA = nr_huge_page_A * huge_page_size;
	size_t sizeB = nr_huge_page_B * huge_page_size;
	int shmidA, shmidB;
	void *addrA = NULL, *addrB = NULL;
	int nr_children = 300, n = 0;

	if ((shmidA = shmget(key, sizeA, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
		perror("shmget:");
		return 1;
	}

	if ((addrA = shmat(shmidA, addrA, SHM_R|SHM_W)) == (void *)-1UL) {
		perror("shmat");
		return 1;
	}
	if ((shmidB = shmget(key, sizeB, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
		perror("shmget:");
		return 1;
	}

	if ((addrB = shmat(shmidB, addrB, SHM_R|SHM_W)) == (void *)-1UL) {
		perror("shmat");
		return 1;
	}

fork_child:
	switch(fork()) {
		case 0:
			switch (n%3) {
			case 0:
				play(addrA, sizeA);
				break;
			case 1:
				play(addrB, sizeB);
				break;
			case 2:
				break;
			}
			break;
		case -1:
			perror("fork:");
			break;
		default:
			if (++n < nr_children)
				goto fork_child;
			play(addrA, sizeA);
			break;
	}
	shmdt(addrA);
	shmdt(addrB);
	do {
		wait(NULL);
	} while (--n > 0);
	shmctl(shmidA, IPC_RMID, NULL);
	shmctl(shmidB, IPC_RMID, NULL);
	return 0;
}
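
(Build note: the program is presumably compiled with nothing more than
the following; the exact flags are an assumption.

$ gcc -O2 -o hugetlbfs-test hugetlbfs-test.c
)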

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 arch/x86/mm/hugetlbpage.c |   29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index f6679a7..0524556 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -68,14 +68,37 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 	struct vm_area_struct *svma;
 	unsigned long saddr;
 	pte_t *spte = NULL;
+	spinlock_t *spage_table_lock = NULL;
+	struct rw_semaphore *smmap_sem = NULL;
 
 	if (!vma_shareable(vma, addr))
 		return;
 
+retry:
 	mutex_lock(&mapping->i_mmap_mutex);
 	vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) {
 		if (svma == vma)
 			continue;
+		if (svma->vm_mm == vma->vm_mm)
+			continue;
+
+		/*
+		 * The target mm could be in the process of tearing down
+		 * its page tables and the i_mmap_mutex on its own is
+		 * not sufficient. To prevent races against teardown and
+		 * pagetable updates, we acquire the mmap_sem and pagetable
+		 * lock of the remote address space. down_read_trylock()
+		 * is necessary as the other process could also be trying
+		 * to share pagetables with the current mm.
+		 */
+		if (!down_read_trylock(&svma->vm_mm->mmap_sem)) {
+			mutex_unlock(&mapping->i_mmap_mutex);
+			goto retry;
+		}
+
+		smmap_sem = &svma->vm_mm->mmap_sem;
+		spage_table_lock = &svma->vm_mm->page_table_lock;
+		spin_lock_nested(spage_table_lock, SINGLE_DEPTH_NESTING);
 
 		saddr = page_table_shareable(svma, vma, addr, idx);
 		if (saddr) {
@@ -85,6 +108,10 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 				break;
 			}
 		}
+		up_read(smmap_sem);
+		spin_unlock(spage_table_lock);
+		spage_table_lock = NULL;
+		smmap_sem = NULL;
 	}
 
 	if (!spte)
@@ -96,6 +123,8 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 	else
 		put_page(virt_to_page(spte));
 	spin_unlock(&mm->page_table_lock);
+	spin_unlock(spage_table_lock);
+	up_read(smmap_sem);
 out:
 	mutex_unlock(&mapping->i_mmap_mutex);
 }


* Re: [RFC PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables
From: Cong Wang @ 2012-07-19  9:08 UTC
  To: linux-mm; +Cc: linux-kernel

On Wed, 18 Jul 2012 at 10:43 GMT, Mel Gorman <mgorman@suse.de> wrote:
> +		if (!down_read_trylock(&svma->vm_mm->mmap_sem)) {
> +			mutex_unlock(&mapping->i_mmap_mutex);
> +			goto retry;
> +		}
> +
> +		smmap_sem = &svma->vm_mm->mmap_sem;
> +		spage_table_lock = &svma->vm_mm->page_table_lock;
> +		spin_lock_nested(spage_table_lock, SINGLE_DEPTH_NESTING);
>  
>  		saddr = page_table_shareable(svma, vma, addr, idx);
>  		if (saddr) {
> @@ -85,6 +108,10 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>  				break;
>  			}
>  		}
> +		up_read(smmap_sem);
> +		spin_unlock(spage_table_lock);

Looks like we should do spin_unlock() before up_read(),
in the reverse order of how they get acquired.
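
That is, the exit path in the hunk quoted above would become (a sketch of
the reordering only, not a tested change):

		spin_unlock(spage_table_lock);	/* release the spinlock first */
		up_read(smmap_sem);		/* then the remote mmap_sem */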


* Re: [RFC PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables
From: Hugh Dickins @ 2012-07-19 12:38 UTC
  To: Mel Gorman; +Cc: Linux-MM, Michal Hocko, David Gibson, Ken Chen, LKML

On Wed, 18 Jul 2012, Mel Gorman wrote:

> (Sending as RFC as this one is tricky and, as it is timing-dependent, the
> patch may accidentally be papering over a more fundamental problem. Even
> if it is not, it may be more heavy-handed than necessary, but I am
> suffering from tunnel vision from looking at this. I wanted to get
> comments on this version before trying to be clever.)

I haven't been able to spend as long on this as I'd like, and can't
look more today, nor probably tomorrow: premature reply and hope I'm
getting it right.

The bug looks genuine to me: good job you both tracking it down.

I am a bit surprised we didn't notice back when reviewing, it does
seem exactly the kind of issue one would be on the lookout for;
but I've not read through those old mails.

Your interim trylock fix looks just too worrying to me to spend any
time considering it - and does it even work? given that an exiting mm
does not take mmap_sem, whereas munmap would have down_write mmap_sem.

As I understand it, there would be no problem at all if the entry (and
here I'm rushed and avoid making a fool of myself by not saying whether
I mean pmd entry or pud entry! I get so confused amongst the levels)
were cleared under lock in or near the huge_pmd_unshare(), but it's
left set so that the subsequent free_pgtables() gets the freeing via
TLB mmu_gather right.

I'd feel much happier with a fix approaching it from that direction:
which is probably just what you have in mind when you're thinking
of trying to be clever - please advance to that.

If nothing but x86 uses sharing at this level (again I've not checked)
then you may be able to clear the entry and put the page directly
into the mmu_gather like an ordinary page, leaving nothing for
free_pgtables() to do there.  Ah, that might leave a danger that the
table above never gets freed.  Well, another possibility is to use a
special pte bit (perhaps the one we play with for PROT_NONE, or a THP
bit) to flag this: to prevent more sharing but get the freeing right.

Anyway, over to you: just a couple of further comments below.

> 
> If a process creates a large hugetlbfs mapping that is eligible for page
> table sharing and forks heavily, with some children faulting and others
> destroying the mapping, then it is possible for page tables to get
> corrupted. Some teardowns of the mapping encounter a "bad pmd" and
> output a message to the kernel log.  The final teardown will trigger a
> BUG_ON in mm/filemap.c.
> 
> This was reproduced in 3.4 but is known to have existed for a long time
> and goes back at least as far as 2.6.37. It was probably introduced in
> 2.6.20 by [39dde65c: shared page table for hugetlb page]. The messages
> look like this:
> 
> [  ..........] Lots of bad pmd messages followed by this
> [  127.164256] mm/memory.c:391: bad pmd ffff880412e04fe8(80000003de4000e7).
> [  127.164257] mm/memory.c:391: bad pmd ffff880412e04ff0(80000003de6000e7).
> [  127.164258] mm/memory.c:391: bad pmd ffff880412e04ff8(80000003de0000e7).
> [  127.186778] ------------[ cut here ]------------
> [  127.186781] kernel BUG at mm/filemap.c:134!
> [  127.186782] invalid opcode: 0000 [#1] SMP
> [  127.186783] CPU 7
> [  127.186784] Modules linked in: af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf ext3 jbd dm_mod coretemp crc32c_intel usb_storage ghash_clmulni_intel aesni_intel i2c_i801 r8169 mii uas sr_mod cdrom sg iTCO_wdt iTCO_vendor_support shpchp serio_raw cryptd aes_x86_64 e1000e pci_hotplug dcdbas aes_generic container microcode ext4 mbcache jbd2 crc16 sd_mod crc_t10dif i915 drm_kms_helper drm i2c_algo_bit ehci_hcd ahci libahci usbcore rtc_cmos usb_common button i2c_core intel_agp video intel_gtt fan processor thermal thermal_sys hwmon ata_generic pata_atiixp libata scsi_mod
> [  127.186801]
> [  127.186802] Pid: 9017, comm: hugetlbfs-test Not tainted 3.4.0-autobuild #53 Dell Inc. OptiPlex 990/06D7TR
> [  127.186804] RIP: 0010:[<ffffffff810ed6ce>]  [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
> [  127.186809] RSP: 0000:ffff8804144b5c08  EFLAGS: 00010002
> [  127.186810] RAX: 0000000000000001 RBX: ffffea000a5c9000 RCX: 00000000ffffffc0
> [  127.186811] RDX: 0000000000000000 RSI: 0000000000000009 RDI: ffff88042dfdad00
> [  127.186812] RBP: ffff8804144b5c18 R08: 0000000000000009 R09: 0000000000000003
> [  127.186813] R10: 0000000000000000 R11: 000000000000002d R12: ffff880412ff83d8
> [  127.186814] R13: ffff880412ff83d8 R14: 0000000000000000 R15: ffff880412ff83d8
> [  127.186815] FS:  00007fe18ed2c700(0000) GS:ffff88042dce0000(0000) knlGS:0000000000000000
> [  127.186816] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [  127.186817] CR2: 00007fe340000503 CR3: 0000000417a14000 CR4: 00000000000407e0
> [  127.186818] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  127.186819] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [  127.186820] Process hugetlbfs-test (pid: 9017, threadinfo ffff8804144b4000, task ffff880417f803c0)
> [  127.186821] Stack:
> [  127.186822]  ffffea000a5c9000 0000000000000000 ffff8804144b5c48 ffffffff810ed83b
> [  127.186824]  ffff8804144b5c48 000000000000138a 0000000000001387 ffff8804144b5c98
> [  127.186825]  ffff8804144b5d48 ffffffff811bc925 ffff8804144b5cb8 0000000000000000
> [  127.186827] Call Trace:
> [  127.186829]  [<ffffffff810ed83b>] delete_from_page_cache+0x3b/0x80
> [  127.186832]  [<ffffffff811bc925>] truncate_hugepages+0x115/0x220
> [  127.186834]  [<ffffffff811bca43>] hugetlbfs_evict_inode+0x13/0x30
> [  127.186837]  [<ffffffff811655c7>] evict+0xa7/0x1b0
> [  127.186839]  [<ffffffff811657a3>] iput_final+0xd3/0x1f0
> [  127.186840]  [<ffffffff811658f9>] iput+0x39/0x50
> [  127.186842]  [<ffffffff81162708>] d_kill+0xf8/0x130
> [  127.186843]  [<ffffffff81162812>] dput+0xd2/0x1a0
> [  127.186845]  [<ffffffff8114e2d0>] __fput+0x170/0x230
> [  127.186848]  [<ffffffff81236e0e>] ? rb_erase+0xce/0x150
> [  127.186849]  [<ffffffff8114e3ad>] fput+0x1d/0x30
> [  127.186851]  [<ffffffff81117db7>] remove_vma+0x37/0x80
> [  127.186853]  [<ffffffff81119182>] do_munmap+0x2d2/0x360
> [  127.186855]  [<ffffffff811cc639>] sys_shmdt+0xc9/0x170
> [  127.186857]  [<ffffffff81410a39>] system_call_fastpath+0x16/0x1b
> [  127.186858] Code: 0f 1f 44 00 00 48 8b 43 08 48 8b 00 48 8b 40 28 8b b0 40 03 00 00 85 f6 0f 88 df fe ff ff 48 89 df e8 e7 cb 05 00 e9 d2 fe ff ff <0f> 0b 55 83 e2 fd 48 89 e5 48 83 ec 30 48 89 5d d8 4c 89 65 e0
> [  127.186868] RIP  [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
> [  127.186870]  RSP <ffff8804144b5c08>
> [  127.186871] ---[ end trace 7cbac5d1db69f426 ]---
> 
> The bug is a race and not always easy to reproduce. To reproduce it I was
> doing the following on a single-socket i7-based machine with 16G of RAM.
> 
> $ hugeadm --pool-pages-max DEFAULT:13G
> $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmmax
> $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmall
> $ for i in `seq 1 9000`; do ./hugetlbfs-test; done
> 
> On my particular machine, it usually triggers within 10 minutes but enabling
> debug options can change the timing such that it never hits. Once the bug is
> triggered, the machine is in trouble and needs to be rebooted. The machine
> will respond but processes accessing /proc, like "ps aux", will hang due to
> the BUG_ON. shutdown will also hang and needs a hard reset or a sysrq-b.
> 
> The test case was mostly written by Michal Hocko with a few minor changes
> by me to reproduce this bug. Michal did a lot of heavy lifting eliminating
> possible sources of the race and saved me the embarrassment of posting a
> completely broken patch yesterday. He did not see this patch before
> going to the lists so any flaws are mine!

Well, I think you're being grossly unfair to Michal there:
you're saying that he only has to look at a patch to put flaws in it ?-)

> 
> The basic problem is a race between page table sharing and teardown. For
> the most part page table sharing depends on i_mmap_mutex. In some cases,
> it is also taking the mm->page_table_lock for the PTE updates but with
> shared page tables, it is the i_mmap_mutex that is more important.
> 
> Unfortunately, it also appears to be insufficient. Consider the following
> situation:
> 
> Process A					Process B
> ---------					---------
> hugetlb_fault					shmdt
>   huge_pte_alloc				LockWrite(mmap_sem)
>     Lock(i_mmap_mutex)				  do_munmap

You meant to erase the huge_pte_alloc and Lock(i_mmap_mutex)
above, didn't you?  They're repeated in the right place below.

> 						    unmap_region
> 						      unmap_vmas
> 						        unmap_single_vma
> 						          unmap_hugepage_range
>       						            Lock(i_mmap_mutex)
> 							    Lock(mm->page_table_lock)
> 							    huge_pmd_unshare/unmap tables <--- (1)
> 							    Unlock(mm->page_table_lock)
>       						            Unlock(i_mmap_mutex)
>     huge_pte_alloc				      ...
>       Lock(i_mmap_mutex)			      ...
>       vma_prio_walk, find svma, spte		      ...
>       Lock(mm->page_table_lock)			      ...
>       share spte				      ...
>       Unlock(mm->page_table_lock)		      ...
>       Unlock(i_mmap_mutex)			      ...
>     hugetlb_no_page									  <--- (2)
> 						      free_pgtables
> 						        unlink_file_vma
> 							hugetlb_free_pgd_range
> 						    remove_vma_list
> 
> In this scenario, it is possible for Process A to share page tables with
> Process B that is trying to tear them down.  The i_mmap_mutex on its own
> does not prevent Process A walking Process B's page tables. At (1) above,
> the page tables are not shared yet so it unmaps the PMDs. Process A sets
> up page table sharing and at (2) faults a new entry. Process B then trips
> up on it in free_pgtables.
> 
> This patch is heavy-handed but, to protect against races between teardown
> and unmapping, it takes the mmap_sem for read and then the page_table_lock
> of processes it is considering sharing with. This prevents the VMA or page
> tables being updated during sharing. I verified that page table sharing
> still occurs using the awesome technology of printk to spit out a message
> when huge_pmd_share is successful. The libhugetlbfs regression test suite passes.
> 
> I strongly suggest this be treated as a -stable candidate if it is merged.
> 
> Test program is as follows.
> 
> ==== CUT HERE ====
> 
> static size_t huge_page_size = (2UL << 20);
> static size_t nr_huge_page_A = 512;
> static size_t nr_huge_page_B = 5632;
> 
> unsigned int get_random(unsigned int max)
> {
> 	struct timeval tv;
> 
> 	gettimeofday(&tv, NULL);
> 	srandom(tv.tv_usec);
> 	return random() % max;
> }
> 
> static void play(void *addr, size_t size)
> {
> 	unsigned char *start = addr,
> 		      *end = start + size,
> 		      *a;
> 	start += get_random(size/2);
> 
> 	/* we could iterate on huge pages but let's give it more time. */
> 	for (a = start; a < end; a += 4096)
> 		*a = 0;
> }
> 
> int main(int argc, char **argv)
> {
> 	key_t key = IPC_PRIVATE;
> 	size_t sizeA = nr_huge_page_A * huge_page_size;
> 	size_t sizeB = nr_huge_page_B * huge_page_size;
> 	int shmidA, shmidB;
> 	void *addrA = NULL, *addrB = NULL;
> 	int nr_children = 300, n = 0;
> 
> 	if ((shmidA = shmget(key, sizeA, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
> 		perror("shmget:");
> 		return 1;
> 	}
> 
> 	if ((addrA = shmat(shmidA, addrA, SHM_R|SHM_W)) == (void *)-1UL) {
> 		perror("shmat");
> 		return 1;
> 	}
> 	if ((shmidB = shmget(key, sizeB, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
> 		perror("shmget:");
> 		return 1;
> 	}
> 
> 	if ((addrB = shmat(shmidB, addrB, SHM_R|SHM_W)) == (void *)-1UL) {
> 		perror("shmat");
> 		return 1;
> 	}
> 
> fork_child:
> 	switch(fork()) {
> 		case 0:
> 			switch (n%3) {
> 			case 0:
> 				play(addrA, sizeA);
> 				break;
> 			case 1:
> 				play(addrB, sizeB);
> 				break;
> 			case 2:
> 				break;
> 			}
> 			break;
> 		case -1:
> 			perror("fork:");
> 			break;
> 		default:
> 			if (++n < nr_children)
> 				goto fork_child;
> 			play(addrA, sizeA);
> 			break;
> 	}
> 	shmdt(addrA);
> 	shmdt(addrB);
> 	do {
> 		wait(NULL);
> 	} while (--n > 0);
> 	shmctl(shmidA, IPC_RMID, NULL);
> 	shmctl(shmidB, IPC_RMID, NULL);
> 	return 0;
> }
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  arch/x86/mm/hugetlbpage.c |   29 +++++++++++++++++++++++++++++
>  1 file changed, 29 insertions(+)
> 
> diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> index f6679a7..0524556 100644
> --- a/arch/x86/mm/hugetlbpage.c
> +++ b/arch/x86/mm/hugetlbpage.c
> @@ -68,14 +68,37 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>  	struct vm_area_struct *svma;
>  	unsigned long saddr;
>  	pte_t *spte = NULL;
> +	spinlock_t *spage_table_lock = NULL;
> +	struct rw_semaphore *smmap_sem = NULL;
>  
>  	if (!vma_shareable(vma, addr))
>  		return;
>  
> +retry:
>  	mutex_lock(&mapping->i_mmap_mutex);
>  	vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) {
>  		if (svma == vma)
>  			continue;
> +		if (svma->vm_mm == vma->vm_mm)
> +			continue;
> +
> +		/*
> +		 * The target mm could be in the process of tearing down
> +		 * its page tables and the i_mmap_mutex on its own is
> +		 * not sufficient. To prevent races against teardown and
> +		 * pagetable updates, we acquire the mmap_sem and pagetable
> +		 * lock of the remote address space. down_read_trylock()
> +		 * is necessary as the other process could also be trying
> +		 * to share pagetables with the current mm.
> +		 */
> +		if (!down_read_trylock(&svma->vm_mm->mmap_sem)) {
> +			mutex_unlock(&mapping->i_mmap_mutex);
> +			goto retry;
> +		}
> +
> +		smmap_sem = &svma->vm_mm->mmap_sem;
> +		spage_table_lock = &svma->vm_mm->page_table_lock;
> +		spin_lock_nested(spage_table_lock, SINGLE_DEPTH_NESTING);
>  
>  		saddr = page_table_shareable(svma, vma, addr, idx);
>  		if (saddr) {
> @@ -85,6 +108,10 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>  				break;
>  			}
>  		}
> +		up_read(smmap_sem);
> +		spin_unlock(spage_table_lock);
> +		spage_table_lock = NULL;
> +		smmap_sem = NULL;
>  	}
>  
>  	if (!spte)
> @@ -96,6 +123,8 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>  	else
>  		put_page(virt_to_page(spte));
>  	spin_unlock(&mm->page_table_lock);
> +	spin_unlock(spage_table_lock);
> +	up_read(smmap_sem);
>  out:
>  	mutex_unlock(&mapping->i_mmap_mutex);
>  }
> 


* Re: [RFC PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables
From: Mel Gorman @ 2012-07-19 14:16 UTC
  To: Hugh Dickins; +Cc: Linux-MM, Michal Hocko, David Gibson, Ken Chen, LKML

On Thu, Jul 19, 2012 at 05:38:03AM -0700, Hugh Dickins wrote:
> On Wed, 18 Jul 2012, Mel Gorman wrote:
> 
> > (Sending as RFC as this one is tricky and, as it is timing-dependent, the
> > patch may accidentally be papering over a more fundamental problem. Even
> > if it is not, it may be more heavy-handed than necessary, but I am
> > suffering from tunnel vision from looking at this. I wanted to get
> > comments on this version before trying to be clever.)
> 
> I haven't been able to spend as long on this as I'd like, and can't
> look more today, nor probably tomorrow: premature reply and hope I'm
> getting it right.
> 

Any reply is good. Thanks for taking the time. Thanks as well for
correcting Ken Chen's email address as I had no idea where he was any
more :)

> The bug looks genuine to me: good job you both tracking it down.
> 

Cheers

> I am a bit surprised we didn't notice back when reviewing, it does
> seem exactly the kind of issue one would be on the lookout for;
> but I've not read through those old mails.
> 

I read through some of the old mails and there were a lot of revisions that
caught a wide variety of problems. It looked exhausting and I can see how
this one might have been missed.  Even when I had a test case demonstrating
the problem, it took me a long time to spot exactly where we might be racing.

> Your interim trylock fix looks just too worrying to me to spend any
> time considering it - and does it even work?

Yes - or at least the test case was able to run for 2 hours without
triggering the bug when normally it would trigger in under 3 minutes.

My concern with the trylock was that, in theory, it could retry forever.
If Peter happens to read this patch, his "unbounded latency" detector is
going to tut at me. I wondered if it would be better to just fail sharing
in the contended case, but I expected that it would only need to retry for
a very short period of time for any real workload.
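
Failing instead of retrying might look something like this (an untested,
illustrative variant, not what the posted patch does):

		/*
		 * Hypothetical: bound the latency by giving up on sharing
		 * with this svma when its mmap_sem is contended. The
		 * caller then falls back to allocating an unshared pmd.
		 */
		if (!down_read_trylock(&svma->vm_mm->mmap_sem))
			continue;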

My main concern is that I might be papering over the problem and have the
wrong race. One impact of this patch is that holding the mmap_sem of the
remote process prevents shmdt running in parallel, because shmdt takes
mmap_sem for write, so the fix may be coincidental even if it is effective.


> given that an exiting mm
> does not take mmap_sem, where munmap would have down_write mmap_sem.
> 
> As I understand it, there would be no problem at all if the entry (and
> here I'm rushed and avoid making a fool of myself by not saying whether
> I mean pmd entry or pud entry! I get so confused amongst the levels)

I'll take the risk for you: you almost certainly mean clearing the PUD entry.

> were cleared under lock in or near the huge_pmd_unshare(), but it's
> left set so that the subsequent free_pgtables() gets the freeing via
> TLB mmu_gather right.
> 
> I'd feel much happier with a fix approaching it from that direction:
> which is probably just what you have in mind when you're thinking
> of trying to be clever - please advance to that.
> 

I'm relieved that you thought along these lines as well. I also considered
the possibility of clearing the PUD entry on unsharing and doing the freeing
there. I abandoned the approach early on  because I concluded there would
be a few problems with it.

1. When sharing, it is obvious if there are other users of the shared
   page table - elevated page count on the PTE (see the sketch after this
   list). For the last user of the shared page table, it is not known if
   we shared in the past. In itself this is not a major problem but it
   does mean that huge_pmd_unshare would always have to tear down page
   tables below the PUD level

2. It means duplicating some logic of free_pgtables() and moving it into
   arch-specific code or into mm/hugetlb.c. I did not want to move any
   details related to mmu_gather into a rarely used path

3. Pagetable teardown and free of hugetlbfs pages on x86 would be
   different to every other arch. x86 and hugetlbfs is already different
   to other architectures but I did not want to compound this problem more
   than necessary

4. If I'm right about the race conditions then a parallel process running
   free_pgtables() could conceivably be operating on the same area at the
   same time because it is holding the wrong mm->page_table_lock. They
   are both freeing, so it may not be a problem, but it felt wrong.

5. In general, I felt this would be significantly more complex than my
   heavy-handed approach unless I missed an obvious way of implementing
   it which is always possible :)
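
The detection point 1 refers to is the refcount check in huge_pmd_unshare();
roughly the following, paraphrasing the x86 code of this era rather than
quoting it exactly:

	/* the pmd page is shared iff its page count is elevated */
	if (page_count(virt_to_page(ptep)) == 1)
		return 0;		/* never shared, or we are the last user */

	pud_clear(pud);			/* detach this mm's PUD entry */
	put_page(virt_to_page(ptep));	/* drop our reference on the pmd page */
	return 1;			/* caller skips the per-pte teardown */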

> If nothing but x86 uses sharing at this level (again I've not checked)

AFAIK, only x86 shares like this.

> then you may be able to clear the entry and put the page directly
> into the mmu_gather like an ordinary page, leaving nothing for
> free_pgtables() to do there.  Ah, that might leave a danger that the
> table above never gets freed. 

It's a risk, albeit a small one.

> Well, another possibility is to use a
> special pte bit (perhaps the one we play with for PROT_NONE, or a THP
> bit) to flag this: to prevent more sharing but get the freeing right.
> 

I think it would have to be a VMA flag because that's what
page_table_shareable() is using. I don't think I can use a new VMA flag
for a use case like pagetable sharing but I could use an "impossible"
flag.
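
To illustrate, with an entirely made-up flag name, the veto in
page_table_shareable() could be as simple as:

	/* sketch only: refuse to share with a mapping being torn down */
	if ((svma->vm_flags | vma->vm_flags) & VM_PGTABLE_NOSHARE)
		return 0;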

> Anyway, over to you: just a couple of further comments below.
> 

I'm not keen on the idea of doing the page table teardown and free in or
near huge_pmd_unshare because I think it'll be more complex. Abusing VMA
flags might work but there is a problem that the VMA flag might "leak".
If we are truncating, for example, we call unmap_hugepage_range, and if
that sets a VMA flag it might persist and trip some BUG check later, or
introduce weird race conditions depending on which flag was abused
(e.g. abusing VM_GROWSDOWN might introduce a weird race with fault).

I could set the impossible flag in unmap_vmas() but then it is moving a
hack necessary for arch-specific code to the core MM. Would that be ok
or is it just ugly?


> > <SNIP>
> > The test case was mostly written by Michal Hocko with a few minor changes
> > by me to reproduce this bug. Michal did a lot of heavy lifting eliminating
> > possible sources of the race and saved me the embarrassment of posting a
> > completely broken patch yesterday. He did not see this patch before
> > going to the lists so any flaws are mine!
> 
> Well, I think you're being grossly unfair to Michal there:
> you're saying that he only has to look at a patch to put flaws in it ?-)
> 

See, this is exactly the sort of mistake I am responsible for :)

> > 
> > The basic problem is a race between page table sharing and teardown. For
> > the most part page table sharing depends on i_mmap_mutex. In some cases,
> > it is also taking the mm->page_table_lock for the PTE updates but with
> > shared page tables, it is the i_mmap_mutex that is more important.
> > 
> > Unfortunately it appears to be also insufficient. Consider the following
> > situation
> > 
> > Process A					Process B
> > ---------					---------
> > hugetlb_fault					shmdt
> >   huge_pte_alloc				LockWrite(mmap_sem)
> >     Lock(i_mmap_mutex)				  do_munmap
> 
> You meant to erase the huge_pte_alloc and Lock(i_mmap_mutex)
> above, didn't you?  They're repeated in the right place below.
> 

Yes, I was cut&pasting a bit here. 

-- 
Mel Gorman
SUSE Labs


* Re: [RFC PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables
  2012-07-18 10:43 ` Mel Gorman
@ 2012-07-19 14:42   ` Michal Hocko
  -1 siblings, 0 replies; 13+ messages in thread
From: Michal Hocko @ 2012-07-19 14:42 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux-MM, Hugh Dickins, David Gibson, Kenneth W Chen, LKML

[/me puts the patch destroyer glasses on]

On Wed 18-07-12 11:43:09, Mel Gorman wrote:
[...]
> diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> index f6679a7..0524556 100644
> --- a/arch/x86/mm/hugetlbpage.c
> +++ b/arch/x86/mm/hugetlbpage.c
> @@ -68,14 +68,37 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>  	struct vm_area_struct *svma;
>  	unsigned long saddr;
>  	pte_t *spte = NULL;
> +	spinlock_t *spage_table_lock = NULL;
> +	struct rw_semaphore *smmap_sem = NULL;
>  
>  	if (!vma_shareable(vma, addr))
>  		return;
>  
> +retry:
>  	mutex_lock(&mapping->i_mmap_mutex);
>  	vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) {
>  		if (svma == vma)
>  			continue;
> +		if (svma->vm_mm == vma->vm_mm)
> +			continue;
> +
> +		/*
> +		 * The target mm could be in the process of tearing down
> +		 * its page tables and the i_mmap_mutex on its own is
> +		 * not sufficient. To prevent races against teardown and
> +		 * pagetable updates, we acquire the mmap_sem and pagetable
> +		 * lock of the remote address space. down_read_trylock()
> +		 * is necessary as the other process could also be trying
> +		 * to share pagetables with the current mm.
> +		 */
> +		if (!down_read_trylock(&svma->vm_mm->mmap_sem)) {
> +			mutex_unlock(&mapping->i_mmap_mutex);
> +			goto retry;
> +		}
> +

I am afraid this can easily cause a deadlock. Consider
fork
  dup_mmap
    down_write(&oldmm->mmap_sem)
    copy_page_range
      copy_hugetlb_page_range
        huge_pte_alloc

svma could belong to oldmm and then we would loop forever.
The svma->vm_mm == vma->vm_mm check doesn't help because vma is the
child's and the mms differ in that case. I am wondering why you didn't
hit this while testing. It would suggest that the ptes are not populated
yet because we didn't let the parent run, and then other children could
place their vmas in the list before the parent?
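
Expanded as a call chain, the forking task livelocks against itself
(abbreviated, for illustration only):

	fork()
	  dup_mmap()
	    down_write(&oldmm->mmap_sem)	/* held for write from here on */
	    copy_page_range()
	      copy_hugetlb_page_range()
	        huge_pte_alloc()		/* vma belongs to the child mm */
	          huge_pmd_share()
	            /* the i_mmap walk finds an svma with svma->vm_mm == oldmm */
	            down_read_trylock(&oldmm->mmap_sem)	/* always fails */
	            goto retry			/* loops forever */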

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic


* Re: [RFC PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables
  2012-07-19 14:42   ` Michal Hocko
@ 2012-07-19 14:49     ` Mel Gorman
  -1 siblings, 0 replies; 13+ messages in thread
From: Mel Gorman @ 2012-07-19 14:49 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Linux-MM, Hugh Dickins, David Gibson, Kenneth W Chen, LKML

On Thu, Jul 19, 2012 at 04:42:13PM +0200, Michal Hocko wrote:
> [/me puts the patch destroyer glasses on]
> 

It's a super power now.

> On Wed 18-07-12 11:43:09, Mel Gorman wrote:
> [...]
> > diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> > index f6679a7..0524556 100644
> > --- a/arch/x86/mm/hugetlbpage.c
> > +++ b/arch/x86/mm/hugetlbpage.c
> > @@ -68,14 +68,37 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> >  	struct vm_area_struct *svma;
> >  	unsigned long saddr;
> >  	pte_t *spte = NULL;
> > +	spinlock_t *spage_table_lock = NULL;
> > +	struct rw_semaphore *smmap_sem = NULL;
> >  
> >  	if (!vma_shareable(vma, addr))
> >  		return;
> >  
> > +retry:
> >  	mutex_lock(&mapping->i_mmap_mutex);
> >  	vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) {
> >  		if (svma == vma)
> >  			continue;
> > +		if (svma->vm_mm == vma->vm_mm)
> > +			continue;
> > +
> > +		/*
> > +		 * The target mm could be in the process of tearing down
> > +		 * its page tables and the i_mmap_mutex on its own is
> > +		 * not sufficient. To prevent races against teardown and
> > +		 * pagetable updates, we acquire the mmap_sem and pagetable
> > +		 * lock of the remote address space. down_read_trylock()
> > +		 * is necessary as the other process could also be trying
> > +		 * to share pagetables with the current mm.
> > +		 */
> > +		if (!down_read_trylock(&svma->vm_mm->mmap_sem)) {
> > +			mutex_unlock(&mapping->i_mmap_mutex);
> > +			goto retry;
> > +		}
> > +
> 
> I am afraid this can easily cause a deadlock. Consider
> fork
>   dup_mmap
>     down_write(&oldmm->mmap_sem)
>     copy_page_range
>       copy_hugetlb_page_range
>         huge_pte_alloc
> 
> svma could belong to oldmm and then we would loop forever.
> The svma->vm_mm == vma->vm_mm check doesn't help because vma is the
> child's and the mms differ in that case. I am wondering why you didn't
> hit this while testing. It would suggest that the ptes are not populated
> yet because we didn't let the parent run, and then other children could
> place their vmas in the list before the parent?
> 

Yes, I think you're right - both about the race and why I didn't hit it.
The libhugetlbfs tests probably avoided the bug for the same reason.
Thanks for pointing this out.

-- 
Mel Gorman
SUSE Labs


* Re: [RFC PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables
  2012-07-19  9:08 ` Cong Wang
@ 2012-07-20 13:43     ` Mel Gorman
  0 siblings, 0 replies; 13+ messages in thread
From: Mel Gorman @ 2012-07-20 13:43 UTC (permalink / raw)
  To: Cong Wang; +Cc: linux-mm, linux-kernel

On Thu, Jul 19, 2012 at 09:08:34AM +0000, Cong Wang wrote:
> On Wed, 18 Jul 2012 at 10:43 GMT, Mel Gorman <mgorman@suse.de> wrote:
> > +		if (!down_read_trylock(&svma->vm_mm->mmap_sem)) {
> > +			mutex_unlock(&mapping->i_mmap_mutex);
> > +			goto retry;
> > +		}
> > +
> > +		smmap_sem = &svma->vm_mm->mmap_sem;
> > +		spage_table_lock = &svma->vm_mm->page_table_lock;
> > +		spin_lock_nested(spage_table_lock, SINGLE_DEPTH_NESTING);
> >  
> >  		saddr = page_table_shareable(svma, vma, addr, idx);
> >  		if (saddr) {
> > @@ -85,6 +108,10 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> >  				break;
> >  			}
> >  		}
> > +		up_read(smmap_sem);
> > +		spin_unlock(spage_table_lock);
> 
> Looks like we should do spin_unlock() before up_read(),
> in the reverse order of how they get acquired.
> 

Will fix, thanks for pointing this out.
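
For reference, the exit paths would then release in the reverse order of
acquisition; a sketch against the diff above, untested:

		spin_unlock(spage_table_lock);	/* innermost lock first */
		up_read(smmap_sem);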

As an aside, I would prefer if you did not drop people from the CC list. I
would have missed this mail for a long time if it hadn't been pointed out
to me privately.

-- 
Mel Gorman
SUSE Labs


end of thread

Thread overview: 13+ messages
2012-07-18 10:43 [RFC PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables Mel Gorman
2012-07-19  9:08 ` Cong Wang
2012-07-20 13:43   ` Mel Gorman
2012-07-19 12:38 ` Hugh Dickins
2012-07-19 14:16   ` Mel Gorman
2012-07-19 14:42 ` Michal Hocko
2012-07-19 14:49   ` Mel Gorman
