* [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables v2
@ 2012-07-20 13:49 Mel Gorman
  2012-07-20 14:11 ` [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend) Mel Gorman
                   ` (2 more replies)
  0 siblings, 3 replies; 50+ messages in thread
From: Mel Gorman @ 2012-07-20 13:49 UTC (permalink / raw)
  To: Linux-MM
  Cc: Michal Hocko, Hugh Dickins, David Gibson, Ken Chen, Cong Wang,
	LKML, Mel Gorman

This V2 is still the mmap_sem approach that fixes a potential deadlock
problem pointed out by Michal.

Changelog since V1
 o Correct cut&paste error in race description			(hugh)
 o Handle potential deadlock during fork			(mhocko)
 o Reorder unlocking						(wangcong)

If a process creates a large hugetlbfs mapping that is eligible for page
table sharing and forks heavily with children some of whom fault and
others which destroy the mapping then it is possible for page tables to
get corrupted. Some teardowns of the mapping encounter a "bad pmd" and
output a message to the kernel log. The final teardown will trigger a
BUG_ON in mm/filemap.c.

This was reproduced in 3.4 but is known to have existed for a long time
and goes back at least as far as 2.6.37. It was probably introduced in
2.6.20 by [39dde65c: shared page table for hugetlb page]. The messages
look like this:

[  ..........] Lots of bad pmd messages followed by this
[  127.164256] mm/memory.c:391: bad pmd ffff880412e04fe8(80000003de4000e7).
[  127.164257] mm/memory.c:391: bad pmd ffff880412e04ff0(80000003de6000e7).
[  127.164258] mm/memory.c:391: bad pmd ffff880412e04ff8(80000003de0000e7).
[  127.186778] ------------[ cut here ]------------
[  127.186781] kernel BUG at mm/filemap.c:134!
[  127.186782] invalid opcode: 0000 [#1] SMP
[  127.186783] CPU 7
[  127.186784] Modules linked in: af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf ext3 jbd dm_mod coretemp crc32c_intel usb_storage ghash_clmulni_intel aesni_intel i2c_i801 r8169 mii uas sr_mod cdrom sg iTCO_wdt iTCO_vendor_support shpchp serio_raw cryptd aes_x86_64 e1000e pci_hotplug dcdbas aes_generic container microcode ext4 mbcache jbd2 crc16 sd_mod crc_t10dif i915 drm_kms_helper drm i2c_algo_bit ehci_hcd ahci libahci usbcore rtc_cmos usb_common button i2c_core intel_agp video intel_gtt fan processor thermal thermal_sys hwmon ata_generic pata_atiixp libata scsi_mod
[  127.186801]
[  127.186802] Pid: 9017, comm: hugetlbfs-test Not tainted 3.4.0-autobuild #53 Dell Inc. OptiPlex 990/06D7TR
[  127.186804] RIP: 0010:[<ffffffff810ed6ce>]  [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
[  127.186809] RSP: 0000:ffff8804144b5c08  EFLAGS: 00010002
[  127.186810] RAX: 0000000000000001 RBX: ffffea000a5c9000 RCX: 00000000ffffffc0
[  127.186811] RDX: 0000000000000000 RSI: 0000000000000009 RDI: ffff88042dfdad00
[  127.186812] RBP: ffff8804144b5c18 R08: 0000000000000009 R09: 0000000000000003
[  127.186813] R10: 0000000000000000 R11: 000000000000002d R12: ffff880412ff83d8
[  127.186814] R13: ffff880412ff83d8 R14: 0000000000000000 R15: ffff880412ff83d8
[  127.186815] FS:  00007fe18ed2c700(0000) GS:ffff88042dce0000(0000) knlGS:0000000000000000
[  127.186816] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  127.186817] CR2: 00007fe340000503 CR3: 0000000417a14000 CR4: 00000000000407e0
[  127.186818] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  127.186819] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  127.186820] Process hugetlbfs-test (pid: 9017, threadinfo ffff8804144b4000, task ffff880417f803c0)
[  127.186821] Stack:
[  127.186822]  ffffea000a5c9000 0000000000000000 ffff8804144b5c48 ffffffff810ed83b
[  127.186824]  ffff8804144b5c48 000000000000138a 0000000000001387 ffff8804144b5c98
[  127.186825]  ffff8804144b5d48 ffffffff811bc925 ffff8804144b5cb8 0000000000000000
[  127.186827] Call Trace:
[  127.186829]  [<ffffffff810ed83b>] delete_from_page_cache+0x3b/0x80
[  127.186832]  [<ffffffff811bc925>] truncate_hugepages+0x115/0x220
[  127.186834]  [<ffffffff811bca43>] hugetlbfs_evict_inode+0x13/0x30
[  127.186837]  [<ffffffff811655c7>] evict+0xa7/0x1b0
[  127.186839]  [<ffffffff811657a3>] iput_final+0xd3/0x1f0
[  127.186840]  [<ffffffff811658f9>] iput+0x39/0x50
[  127.186842]  [<ffffffff81162708>] d_kill+0xf8/0x130
[  127.186843]  [<ffffffff81162812>] dput+0xd2/0x1a0
[  127.186845]  [<ffffffff8114e2d0>] __fput+0x170/0x230
[  127.186848]  [<ffffffff81236e0e>] ? rb_erase+0xce/0x150
[  127.186849]  [<ffffffff8114e3ad>] fput+0x1d/0x30
[  127.186851]  [<ffffffff81117db7>] remove_vma+0x37/0x80
[  127.186853]  [<ffffffff81119182>] do_munmap+0x2d2/0x360
[  127.186855]  [<ffffffff811cc639>] sys_shmdt+0xc9/0x170
[  127.186857]  [<ffffffff81410a39>] system_call_fastpath+0x16/0x1b
[  127.186858] Code: 0f 1f 44 00 00 48 8b 43 08 48 8b 00 48 8b 40 28 8b b0 40 03 00 00 85 f6 0f 88 df fe ff ff 48 89 df e8 e7 cb 05 00 e9 d2 fe ff ff <0f> 0b 55 83 e2 fd 48 89 e5 48 83 ec 30 48 89 5d d8 4c 89 65 e0
[  127.186868] RIP  [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
[  127.186870]  RSP <ffff8804144b5c08>
[  127.186871] ---[ end trace 7cbac5d1db69f426 ]---

The bug is a race and not always easy to reproduce. To reproduce it I was
doing the following on a single-socket i7-based machine with 16G of RAM:

$ hugeadm --pool-pages-max DEFAULT:13G
$ echo $((18*1048576*1024)) > /proc/sys/kernel/shmmax
$ echo $((18*1048576*1024)) > /proc/sys/kernel/shmall
$ for i in `seq 1 9000`; do ./hugetlbfs-test; done

On my particular machine it usually triggers within 10 minutes, but
enabling debug options can change the timing such that it never hits. Once
the bug is triggered, the machine is in trouble and needs to be rebooted.
The machine will respond, but processes accessing /proc, such as "ps aux",
will hang due to the BUG_ON. shutdown will also hang and needs a hard
reset or a sysrq-b.

The test case was mostly written by Michal Hocko with a few minor changes
by me to reproduce this bug. Michal did a lot of heavy lifting eliminating
possible sources of the race and saved me the embarrassment of posting a
completely broken patch.

The basic problem is a race between page table sharing and teardown. For
the most part, page table sharing depends on i_mmap_mutex. In some cases
the mm->page_table_lock is also taken for the PTE updates, but with
shared page tables it is the i_mmap_mutex that is more important.
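
For reference, this is roughly what the x86 sharing path looks like
before this patch (a condensed sketch based on the hunks below, not
the verbatim kernel code). Note that only i_mmap_mutex is held while
the remote mm's page tables are walked and the shared pmd page is
pinned:

static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
{
	struct vm_area_struct *vma = find_vma(mm, addr);
	struct address_space *mapping = vma->vm_file->f_mapping;
	pgoff_t idx = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
	struct prio_tree_iter iter;
	struct vm_area_struct *svma;
	unsigned long saddr;
	pte_t *spte = NULL;

	if (!vma_shareable(vma, addr))
		return;

	mutex_lock(&mapping->i_mmap_mutex);
	vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) {
		if (svma == vma)
			continue;

		saddr = page_table_shareable(svma, vma, addr, idx);
		if (saddr) {
			/* walks svma->vm_mm with no lock on that mm held */
			spte = huge_pte_offset(svma->vm_mm, saddr);
			if (spte) {
				/* pin the shared pmd page */
				get_page(virt_to_page(spte));
				break;
			}
		}
	}

	if (!spte)
		goto out;

	/* only the *local* mm's page_table_lock is taken here */
	spin_lock(&mm->page_table_lock);
	if (pud_none(*pud))
		pud_populate(mm, pud,
			(pmd_t *)((unsigned long)spte & PAGE_MASK));
	else
		put_page(virt_to_page(spte));
	spin_unlock(&mm->page_table_lock);
out:
	mutex_unlock(&mapping->i_mmap_mutex);
}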

Unfortunately, it appears to be insufficient on its own. Consider the
following situation:

Process A					Process B
---------					---------
hugetlb_fault					shmdt
  						LockWrite(mmap_sem)
    						  do_munmap
						    unmap_region
						      unmap_vmas
						        unmap_single_vma
						          unmap_hugepage_range
      						            Lock(i_mmap_mutex)
							    Lock(mm->page_table_lock)
							    huge_pmd_unshare/unmap tables <--- (1)
							    Unlock(mm->page_table_lock)
      						            Unlock(i_mmap_mutex)
  huge_pte_alloc				      ...
    Lock(i_mmap_mutex)				      ...
    vma_prio_walk, find svma, spte		      ...
    Lock(mm->page_table_lock)			      ...
    share spte					      ...
    Unlock(mm->page_table_lock)			      ...
    Unlock(i_mmap_mutex)			      ...
  hugetlb_no_page									  <--- (2)
						      free_pgtables
						        unlink_file_vma
							hugetlb_free_pgd_range
						    remove_vma_list

In this scenario, it is possible for Process A to share page tables with
Process B that is trying to tear them down.  The i_mmap_mutex on its own
does not prevent Process A from walking Process B's page tables. At (1)
above, the page tables are not shared yet so it unmaps the PMDs. Process A
sets up page table sharing and at (2) faults in a new entry. Process B
then trips up on it in free_pgtables.

This patch takes the mmap_sem for read and then the page_table_lock of
the address spaces being considered for page table sharing. I verified
that page table sharing still occurs using the awesome technology of
printk to spit out a message when huge_pmd_share is successful. The
libhugetlbfs regression test suite passed.
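
The heart of the change is the trylock-and-retry pattern below
(condensed from the x86 hunk that follows). Backing off and retrying
instead of blocking on the remote mmap_sem avoids deadlocking against
a process that is simultaneously trying to share page tables with the
current mm, and the locked_mm check covers fork, where the mmap_sem
of both mm's is already held:

retry:
	mutex_lock(&mapping->i_mmap_mutex);
	vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) {
		...
		if (locked_mm != svma->vm_mm) {
			if (!down_read_trylock(&svma->vm_mm->mmap_sem)) {
				/* back off and restart the walk */
				mutex_unlock(&mapping->i_mmap_mutex);
				goto retry;
			}
			smmap_sem = &svma->vm_mm->mmap_sem;
		}

		spage_table_lock = &svma->vm_mm->page_table_lock;
		spin_lock_nested(spage_table_lock, SINGLE_DEPTH_NESTING);
		...
	}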

I strongly suggest this be treated as a -stable candidate if it is merged.

Test program is as follows.

==== CUT HERE ====

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/time.h>
#include <sys/wait.h>

static size_t huge_page_size = (2UL << 20);
static size_t nr_huge_page_A = 512;
static size_t nr_huge_page_B = 5632;

unsigned int get_random(unsigned int max)
{
	struct timeval tv;

	gettimeofday(&tv, NULL);
	srandom(tv.tv_usec);
	return random() % max;
}

static void play(void *addr, size_t size)
{
	unsigned char *start = addr,
		      *end = start + size,
		      *a;
	start += get_random(size/2);

	/* we could iterate on huge pages but let's give it more time. */
	for (a = start; a < end; a += 4096)
		*a = 0;
}

int main(int argc, char **argv)
{
	key_t key = IPC_PRIVATE;
	size_t sizeA = nr_huge_page_A * huge_page_size;
	size_t sizeB = nr_huge_page_B * huge_page_size;
	int shmidA, shmidB;
	void *addrA = NULL, *addrB = NULL;
	int nr_children = 300, n = 0;

	if ((shmidA = shmget(key, sizeA, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
		perror("shmget:");
		return 1;
	}

	if ((addrA = shmat(shmidA, addrA, SHM_R|SHM_W)) == (void *)-1UL) {
		perror("shmat");
		return 1;
	}
	if ((shmidB = shmget(key, sizeB, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
		perror("shmget:");
		return 1;
	}

	if ((addrB = shmat(shmidB, addrB, SHM_R|SHM_W)) == (void *)-1UL) {
		perror("shmat");
		return 1;
	}

fork_child:
	switch(fork()) {
		case 0:
			switch (n%3) {
			case 0:
				play(addrA, sizeA);
				break;
			case 1:
				play(addrB, sizeB);
				break;
			case 2:
				break;
			}
			break;
		case -1:
			perror("fork:");
			break;
		default:
			if (++n < nr_children)
				goto fork_child;
			play(addrA, sizeA);
			break;
	}
	shmdt(addrA);
	shmdt(addrB);
	do {
		wait(NULL);
	} while (--n > 0);
	shmctl(shmidA, IPC_RMID, NULL);
	shmctl(shmidB, IPC_RMID, NULL);
	return 0;
}

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 arch/ia64/mm/hugetlbpage.c    |    3 ++-
 arch/mips/mm/hugetlbpage.c    |    4 ++--
 arch/powerpc/mm/hugetlbpage.c |    3 ++-
 arch/s390/mm/hugetlbpage.c    |    2 +-
 arch/sh/mm/hugetlbpage.c      |    2 +-
 arch/sparc/mm/hugetlbpage.c   |    2 +-
 arch/tile/mm/hugetlbpage.c    |    2 +-
 arch/x86/mm/hugetlbpage.c     |   42 ++++++++++++++++++++++++++++++++++++++---
 include/linux/hugetlb.h       |    2 +-
 mm/hugetlb.c                  |    4 ++--
 10 files changed, 52 insertions(+), 14 deletions(-)

diff --git a/arch/ia64/mm/hugetlbpage.c b/arch/ia64/mm/hugetlbpage.c
index 5ca674b..2d4f574 100644
--- a/arch/ia64/mm/hugetlbpage.c
+++ b/arch/ia64/mm/hugetlbpage.c
@@ -25,7 +25,8 @@ unsigned int hpage_shift = HPAGE_SHIFT_DEFAULT;
 EXPORT_SYMBOL(hpage_shift);
 
 pte_t *
-huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz)
+huge_pte_alloc(struct mm_struct *mm, struct mm_struct *locked_mm,
+	       unsigned long addr, unsigned long sz)
 {
 	unsigned long taddr = htlbpage_to_page(addr);
 	pgd_t *pgd;
diff --git a/arch/mips/mm/hugetlbpage.c b/arch/mips/mm/hugetlbpage.c
index a7fee0d..8cfec7e 100644
--- a/arch/mips/mm/hugetlbpage.c
+++ b/arch/mips/mm/hugetlbpage.c
@@ -22,8 +22,8 @@
 #include <asm/tlb.h>
 #include <asm/tlbflush.h>
 
-pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr,
-		      unsigned long sz)
+pte_t *huge_pte_alloc(struct mm_struct *mm, struct mm_struct *locked_mm,
+		      unsigned long addr, unsigned long sz)
 {
 	pgd_t *pgd;
 	pud_t *pud;
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index fb05b12..4884b97 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -175,7 +175,8 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
 #define HUGEPD_PUD_SHIFT PMD_SHIFT
 #endif
 
-pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz)
+pte_t *huge_pte_alloc(struct mm_struct *mm, struct mm_struct *locked_mm,
+		      unsigned long addr, unsigned long sz)
 {
 	pgd_t *pg;
 	pud_t *pu;
diff --git a/arch/s390/mm/hugetlbpage.c b/arch/s390/mm/hugetlbpage.c
index 597bb2d..912647b 100644
--- a/arch/s390/mm/hugetlbpage.c
+++ b/arch/s390/mm/hugetlbpage.c
@@ -62,7 +62,7 @@ void arch_release_hugepage(struct page *page)
 	page[1].index = 0;
 }
 
-pte_t *huge_pte_alloc(struct mm_struct *mm,
+pte_t *huge_pte_alloc(struct mm_struct *mm, struct mm_struct *locked_mm,
 			unsigned long addr, unsigned long sz)
 {
 	pgd_t *pgdp;
diff --git a/arch/sh/mm/hugetlbpage.c b/arch/sh/mm/hugetlbpage.c
index d776234..fb0a427 100644
--- a/arch/sh/mm/hugetlbpage.c
+++ b/arch/sh/mm/hugetlbpage.c
@@ -21,7 +21,7 @@
 #include <asm/tlbflush.h>
 #include <asm/cacheflush.h>
 
-pte_t *huge_pte_alloc(struct mm_struct *mm,
+pte_t *huge_pte_alloc(struct mm_struct *mm, struct mm_struct *locked_mm,
 			unsigned long addr, unsigned long sz)
 {
 	pgd_t *pgd;
diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index 07e1453..04c063f 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -193,7 +193,7 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 				pgoff, flags);
 }
 
-pte_t *huge_pte_alloc(struct mm_struct *mm,
+pte_t *huge_pte_alloc(struct mm_struct *mm, struct mm_struct *locked_mm,
 			unsigned long addr, unsigned long sz)
 {
 	pgd_t *pgd;
diff --git a/arch/tile/mm/hugetlbpage.c b/arch/tile/mm/hugetlbpage.c
index 42cfcba..490d29b 100644
--- a/arch/tile/mm/hugetlbpage.c
+++ b/arch/tile/mm/hugetlbpage.c
@@ -28,7 +28,7 @@
 #include <asm/tlb.h>
 #include <asm/tlbflush.h>
 
-pte_t *huge_pte_alloc(struct mm_struct *mm,
+pte_t *huge_pte_alloc(struct mm_struct *mm, struct mm_struct *locked_mm,
 		      unsigned long addr, unsigned long sz)
 {
 	pgd_t *pgd;
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index f6679a7..f9ec915 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -58,7 +58,8 @@ static int vma_shareable(struct vm_area_struct *vma, unsigned long addr)
 /*
  * search for a shareable pmd page for hugetlb.
  */
-static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
+static void huge_pmd_share(struct mm_struct *mm, struct mm_struct *locked_mm,
+			   unsigned long addr, pud_t *pud)
 {
 	struct vm_area_struct *vma = find_vma(mm, addr);
 	struct address_space *mapping = vma->vm_file->f_mapping;
@@ -68,14 +69,40 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 	struct vm_area_struct *svma;
 	unsigned long saddr;
 	pte_t *spte = NULL;
+	spinlock_t *spage_table_lock = NULL;
+	struct rw_semaphore *smmap_sem = NULL;
 
 	if (!vma_shareable(vma, addr))
 		return;
 
+retry:
 	mutex_lock(&mapping->i_mmap_mutex);
 	vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) {
 		if (svma == vma)
 			continue;
+		if (svma->vm_mm == vma->vm_mm)
+			continue;
+
+		/*
+		 * The target mm could be in the process of tearing down
+		 * its page tables and the i_mmap_mutex on its own is
+		 * not sufficient. To prevent races against teardown and
+		 * pagetable updates, we acquire the mmap_sem and pagetable
+		 * lock of the remote address space. down_read_trylock()
+		 * is necessary as the other process could also be trying
+		 * to share pagetables with the current mm. In the fork
+		 * case, we already hold both mm's mmap_sem, so check for that
+		 */
+		if (locked_mm != svma->vm_mm) {
+			if (!down_read_trylock(&svma->vm_mm->mmap_sem)) {
+				mutex_unlock(&mapping->i_mmap_mutex);
+				goto retry;
+			}
+			smmap_sem = &svma->vm_mm->mmap_sem;
+		}
+
+		spage_table_lock = &svma->vm_mm->page_table_lock;
+		spin_lock_nested(spage_table_lock, SINGLE_DEPTH_NESTING);
 
 		saddr = page_table_shareable(svma, vma, addr, idx);
 		if (saddr) {
@@ -85,6 +112,12 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 				break;
 			}
 		}
+		if (smmap_sem) {
+			up_read(smmap_sem);
+			smmap_sem = NULL;
+		}
+		spin_unlock(spage_table_lock);
+		spage_table_lock = NULL;
 	}
 
 	if (!spte)
@@ -96,6 +129,9 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 	else
 		put_page(virt_to_page(spte));
 	spin_unlock(&mm->page_table_lock);
+	spin_unlock(spage_table_lock);
+	if (smmap_sem)
+		up_read(smmap_sem);
 out:
 	mutex_unlock(&mapping->i_mmap_mutex);
 }
@@ -127,7 +163,7 @@ int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
 	return 1;
 }
 
-pte_t *huge_pte_alloc(struct mm_struct *mm,
+pte_t *huge_pte_alloc(struct mm_struct *mm, struct mm_struct *locked_mm,
 			unsigned long addr, unsigned long sz)
 {
 	pgd_t *pgd;
@@ -142,7 +178,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 		} else {
 			BUG_ON(sz != PMD_SIZE);
 			if (pud_none(*pud))
-				huge_pmd_share(mm, addr, pud);
+				huge_pmd_share(mm, locked_mm, addr, pud);
 			pte = (pte_t *) pmd_alloc(mm, pud, addr);
 		}
 	}
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 000837e..bae0f7b 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -63,7 +63,7 @@ extern struct list_head huge_boot_pages;
 
 /* arch callbacks */
 
-pte_t *huge_pte_alloc(struct mm_struct *mm,
+pte_t *huge_pte_alloc(struct mm_struct *mm, struct mm_struct *locked_mm,
 			unsigned long addr, unsigned long sz);
 pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr);
 int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ae8f708..4832277 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2244,7 +2244,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		src_pte = huge_pte_offset(src, addr);
 		if (!src_pte)
 			continue;
-		dst_pte = huge_pte_alloc(dst, addr, sz);
+		dst_pte = huge_pte_alloc(dst, src, addr, sz);
 		if (!dst_pte)
 			goto nomem;
 
@@ -2745,7 +2745,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			       VM_FAULT_SET_HINDEX(h - hstates);
 	}
 
-	ptep = huge_pte_alloc(mm, address, huge_page_size(h));
+	ptep = huge_pte_alloc(mm, NULL, address, huge_page_size(h));
 	if (!ptep)
 		return VM_FAULT_OOM;
 


* [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
  2012-07-20 13:49 [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables v2 Mel Gorman
@ 2012-07-20 14:11 ` Mel Gorman
  2012-07-20 14:29   ` Michal Hocko
  2012-07-20 14:36   ` [PATCH -alternative] " Michal Hocko
  2012-07-26 16:01 ` [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables v2 Larry Woodman
  2012-07-26 21:00 ` Rik van Riel
  2 siblings, 2 replies; 50+ messages in thread
From: Mel Gorman @ 2012-07-20 14:11 UTC (permalink / raw)
  To: Linux-MM
  Cc: Michal Hocko, Hugh Dickins, David Gibson, Ken Chen, Cong Wang, LKML

Sorry for the resend, I did not properly refresh Cong Wang's suggested
fix. This V2 is still the mmap_sem approach that fixes a potential deadlock
problem pointed out by Michal.

Changelog since V1
 o Correct cut&paste error in race description			(hugh)
 o Handle potential deadlock during fork			(mhocko)
 o Reorder unlocking						(wangcong)

If a process creates a large hugetlbfs mapping that is eligible for page
table sharing and forks heavily with children some of whom fault and
others which destroy the mapping then it is possible for page tables to
get corrupted. Some teardowns of the mapping encounter a "bad pmd" and
output a message to the kernel log. The final teardown will trigger a
BUG_ON in mm/filemap.c.

This was reproduced in 3.4 but is known to have existed for a long time
and goes back at least as far as 2.6.37. It was probably introduced in
2.6.20 by [39dde65c: shared page table for hugetlb page]. The messages
look like this:

[  ..........] Lots of bad pmd messages followed by this
[  127.164256] mm/memory.c:391: bad pmd ffff880412e04fe8(80000003de4000e7).
[  127.164257] mm/memory.c:391: bad pmd ffff880412e04ff0(80000003de6000e7).
[  127.164258] mm/memory.c:391: bad pmd ffff880412e04ff8(80000003de0000e7).
[  127.186778] ------------[ cut here ]------------
[  127.186781] kernel BUG at mm/filemap.c:134!
[  127.186782] invalid opcode: 0000 [#1] SMP
[  127.186783] CPU 7
[  127.186784] Modules linked in: af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf ext3 jbd dm_mod coretemp crc32c_intel usb_storage ghash_clmulni_intel aesni_intel i2c_i801 r8169 mii uas sr_mod cdrom sg iTCO_wdt iTCO_vendor_support shpchp serio_raw cryptd aes_x86_64 e1000e pci_hotplug dcdbas aes_generic container microcode ext4 mbcache jbd2 crc16 sd_mod crc_t10dif i915 drm_kms_helper drm i2c_algo_bit ehci_hcd ahci libahci usbcore rtc_cmos usb_common button i2c_core intel_agp video intel_gtt fan processor thermal thermal_sys hwmon ata_generic pata_atiixp libata scsi_mod
[  127.186801]
[  127.186802] Pid: 9017, comm: hugetlbfs-test Not tainted 3.4.0-autobuild #53 Dell Inc. OptiPlex 990/06D7TR
[  127.186804] RIP: 0010:[<ffffffff810ed6ce>]  [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
[  127.186809] RSP: 0000:ffff8804144b5c08  EFLAGS: 00010002
[  127.186810] RAX: 0000000000000001 RBX: ffffea000a5c9000 RCX: 00000000ffffffc0
[  127.186811] RDX: 0000000000000000 RSI: 0000000000000009 RDI: ffff88042dfdad00
[  127.186812] RBP: ffff8804144b5c18 R08: 0000000000000009 R09: 0000000000000003
[  127.186813] R10: 0000000000000000 R11: 000000000000002d R12: ffff880412ff83d8
[  127.186814] R13: ffff880412ff83d8 R14: 0000000000000000 R15: ffff880412ff83d8
[  127.186815] FS:  00007fe18ed2c700(0000) GS:ffff88042dce0000(0000) knlGS:0000000000000000
[  127.186816] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  127.186817] CR2: 00007fe340000503 CR3: 0000000417a14000 CR4: 00000000000407e0
[  127.186818] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  127.186819] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  127.186820] Process hugetlbfs-test (pid: 9017, threadinfo ffff8804144b4000, task ffff880417f803c0)
[  127.186821] Stack:
[  127.186822]  ffffea000a5c9000 0000000000000000 ffff8804144b5c48 ffffffff810ed83b
[  127.186824]  ffff8804144b5c48 000000000000138a 0000000000001387 ffff8804144b5c98
[  127.186825]  ffff8804144b5d48 ffffffff811bc925 ffff8804144b5cb8 0000000000000000
[  127.186827] Call Trace:
[  127.186829]  [<ffffffff810ed83b>] delete_from_page_cache+0x3b/0x80
[  127.186832]  [<ffffffff811bc925>] truncate_hugepages+0x115/0x220
[  127.186834]  [<ffffffff811bca43>] hugetlbfs_evict_inode+0x13/0x30
[  127.186837]  [<ffffffff811655c7>] evict+0xa7/0x1b0
[  127.186839]  [<ffffffff811657a3>] iput_final+0xd3/0x1f0
[  127.186840]  [<ffffffff811658f9>] iput+0x39/0x50
[  127.186842]  [<ffffffff81162708>] d_kill+0xf8/0x130
[  127.186843]  [<ffffffff81162812>] dput+0xd2/0x1a0
[  127.186845]  [<ffffffff8114e2d0>] __fput+0x170/0x230
[  127.186848]  [<ffffffff81236e0e>] ? rb_erase+0xce/0x150
[  127.186849]  [<ffffffff8114e3ad>] fput+0x1d/0x30
[  127.186851]  [<ffffffff81117db7>] remove_vma+0x37/0x80
[  127.186853]  [<ffffffff81119182>] do_munmap+0x2d2/0x360
[  127.186855]  [<ffffffff811cc639>] sys_shmdt+0xc9/0x170
[  127.186857]  [<ffffffff81410a39>] system_call_fastpath+0x16/0x1b
[  127.186858] Code: 0f 1f 44 00 00 48 8b 43 08 48 8b 00 48 8b 40 28 8b b0 40 03 00 00 85 f6 0f 88 df fe ff ff 48 89 df e8 e7 cb 05 00 e9 d2 fe ff ff <0f> 0b 55 83 e2 fd 48 89 e5 48 83 ec 30 48 89 5d d8 4c 89 65 e0
[  127.186868] RIP  [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
[  127.186870]  RSP <ffff8804144b5c08>
[  127.186871] ---[ end trace 7cbac5d1db69f426 ]---

The bug is a race and not always easy to reproduce. To reproduce it I was
doing the following on a single-socket i7-based machine with 16G of RAM:

$ hugeadm --pool-pages-max DEFAULT:13G
$ echo $((18*1048576*1024)) > /proc/sys/kernel/shmmax
$ echo $((18*1048576*1024)) > /proc/sys/kernel/shmall
$ for i in `seq 1 9000`; do ./hugetlbfs-test; done

On my particular machine it usually triggers within 10 minutes, but
enabling debug options can change the timing such that it never hits. Once
the bug is triggered, the machine is in trouble and needs to be rebooted.
The machine will respond, but processes accessing /proc, such as "ps aux",
will hang due to the BUG_ON. shutdown will also hang and needs a hard
reset or a sysrq-b.

The test case was mostly written by Michal Hocko with a few minor changes
by me to reproduce this bug. Michal did a lot of heavy lifting eliminating
possible sources of the race and saved me the embarrassment of posting a
completely broken patch yesterday. He did not see this patch before it
went to the lists so any critical flaws are all mine!

The basic problem is a race between page table sharing and teardown. For
the most part, page table sharing depends on i_mmap_mutex. In some cases
the mm->page_table_lock is also taken for the PTE updates, but with
shared page tables it is the i_mmap_mutex that is more important.

Unfortunately, it appears to be insufficient on its own. Consider the
following situation:

Process A					Process B
---------					---------
hugetlb_fault					shmdt
  						LockWrite(mmap_sem)
    						  do_munmap
						    unmap_region
						      unmap_vmas
						        unmap_single_vma
						          unmap_hugepage_range
      						            Lock(i_mmap_mutex)
							    Lock(mm->page_table_lock)
							    huge_pmd_unshare/unmap tables <--- (1)
							    Unlock(mm->page_table_lock)
      						            Unlock(i_mmap_mutex)
  huge_pte_alloc				      ...
    Lock(i_mmap_mutex)				      ...
    vma_prio_walk, find svma, spte		      ...
    Lock(mm->page_table_lock)			      ...
    share spte					      ...
    Unlock(mm->page_table_lock)			      ...
    Unlock(i_mmap_mutex)			      ...
  hugetlb_no_page									  <--- (2)
						      free_pgtables
						        unlink_file_vma
							hugetlb_free_pgd_range
						    remove_vma_list

In this scenario, it is possible for Process A to share page tables with
Process B that is trying to tear them down.  The i_mmap_mutex on its own
does not prevent Process A from walking Process B's page tables. At (1)
above, the page tables are not shared yet so it unmaps the PMDs. Process A
sets up page table sharing and at (2) faults in a new entry. Process B
then trips up on it in free_pgtables.

This patch takes the mmap_sem for read and then the page_table_lock of
the address spaces being considered for page table sharing. I verified
that page table sharing still occurs using the awesome technology of
printk to spit out a message when huge_pmd_share is successful. The
libhugetlbfs regression test suite passed.

I strongly suggest this be treated as a -stable candidate if it is merged.

Test program is as follows.

==== CUT HERE ====

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/time.h>
#include <sys/wait.h>

static size_t huge_page_size = (2UL << 20);
static size_t nr_huge_page_A = 512;
static size_t nr_huge_page_B = 5632;

unsigned int get_random(unsigned int max)
{
	struct timeval tv;

	gettimeofday(&tv, NULL);
	srandom(tv.tv_usec);
	return random() % max;
}

static void play(void *addr, size_t size)
{
	unsigned char *start = addr,
		      *end = start + size,
		      *a;
	start += get_random(size/2);

	/* we could iterate on huge pages but let's give it more time. */
	for (a = start; a < end; a += 4096)
		*a = 0;
}

int main(int argc, char **argv)
{
	key_t key = IPC_PRIVATE;
	size_t sizeA = nr_huge_page_A * huge_page_size;
	size_t sizeB = nr_huge_page_B * huge_page_size;
	int shmidA, shmidB;
	void *addrA = NULL, *addrB = NULL;
	int nr_children = 300, n = 0;

	if ((shmidA = shmget(key, sizeA, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
		perror("shmget:");
		return 1;
	}

	if ((addrA = shmat(shmidA, addrA, SHM_R|SHM_W)) == (void *)-1UL) {
		perror("shmat");
		return 1;
	}
	if ((shmidB = shmget(key, sizeB, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
		perror("shmget:");
		return 1;
	}

	if ((addrB = shmat(shmidB, addrB, SHM_R|SHM_W)) == (void *)-1UL) {
		perror("shmat");
		return 1;
	}

fork_child:
	switch(fork()) {
		case 0:
			switch (n%3) {
			case 0:
				play(addrA, sizeA);
				break;
			case 1:
				play(addrB, sizeB);
				break;
			case 2:
				break;
			}
			break;
		case -1:
			perror("fork:");
			break;
		default:
			if (++n < nr_children)
				goto fork_child;
			play(addrA, sizeA);
			break;
	}
	shmdt(addrA);
	shmdt(addrB);
	do {
		wait(NULL);
	} while (--n > 0);
	shmctl(shmidA, IPC_RMID, NULL);
	shmctl(shmidB, IPC_RMID, NULL);
	return 0;
}

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 arch/ia64/mm/hugetlbpage.c    |    3 ++-
 arch/mips/mm/hugetlbpage.c    |    4 ++--
 arch/powerpc/mm/hugetlbpage.c |    3 ++-
 arch/s390/mm/hugetlbpage.c    |    2 +-
 arch/sh/mm/hugetlbpage.c      |    2 +-
 arch/sparc/mm/hugetlbpage.c   |    2 +-
 arch/tile/mm/hugetlbpage.c    |    2 +-
 arch/x86/mm/hugetlbpage.c     |   42 ++++++++++++++++++++++++++++++++++++++---
 include/linux/hugetlb.h       |    2 +-
 mm/hugetlb.c                  |    4 ++--
 10 files changed, 52 insertions(+), 14 deletions(-)

diff --git a/arch/ia64/mm/hugetlbpage.c b/arch/ia64/mm/hugetlbpage.c
index 5ca674b..2d4f574 100644
--- a/arch/ia64/mm/hugetlbpage.c
+++ b/arch/ia64/mm/hugetlbpage.c
@@ -25,7 +25,8 @@ unsigned int hpage_shift = HPAGE_SHIFT_DEFAULT;
 EXPORT_SYMBOL(hpage_shift);
 
 pte_t *
-huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz)
+huge_pte_alloc(struct mm_struct *mm, struct mm_struct *locked_mm,
+	       unsigned long addr, unsigned long sz)
 {
 	unsigned long taddr = htlbpage_to_page(addr);
 	pgd_t *pgd;
diff --git a/arch/mips/mm/hugetlbpage.c b/arch/mips/mm/hugetlbpage.c
index a7fee0d..8cfec7e 100644
--- a/arch/mips/mm/hugetlbpage.c
+++ b/arch/mips/mm/hugetlbpage.c
@@ -22,8 +22,8 @@
 #include <asm/tlb.h>
 #include <asm/tlbflush.h>
 
-pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr,
-		      unsigned long sz)
+pte_t *huge_pte_alloc(struct mm_struct *mm, struct mm_struct *locked_mm,
+		      unsigned long addr, unsigned long sz)
 {
 	pgd_t *pgd;
 	pud_t *pud;
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index fb05b12..4884b97 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -175,7 +175,8 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
 #define HUGEPD_PUD_SHIFT PMD_SHIFT
 #endif
 
-pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz)
+pte_t *huge_pte_alloc(struct mm_struct *mm, struct mm_struct *locked_mm,
+		      unsigned long addr, unsigned long sz)
 {
 	pgd_t *pg;
 	pud_t *pu;
diff --git a/arch/s390/mm/hugetlbpage.c b/arch/s390/mm/hugetlbpage.c
index 597bb2d..912647b 100644
--- a/arch/s390/mm/hugetlbpage.c
+++ b/arch/s390/mm/hugetlbpage.c
@@ -62,7 +62,7 @@ void arch_release_hugepage(struct page *page)
 	page[1].index = 0;
 }
 
-pte_t *huge_pte_alloc(struct mm_struct *mm,
+pte_t *huge_pte_alloc(struct mm_struct *mm, struct mm_struct *locked_mm,
 			unsigned long addr, unsigned long sz)
 {
 	pgd_t *pgdp;
diff --git a/arch/sh/mm/hugetlbpage.c b/arch/sh/mm/hugetlbpage.c
index d776234..fb0a427 100644
--- a/arch/sh/mm/hugetlbpage.c
+++ b/arch/sh/mm/hugetlbpage.c
@@ -21,7 +21,7 @@
 #include <asm/tlbflush.h>
 #include <asm/cacheflush.h>
 
-pte_t *huge_pte_alloc(struct mm_struct *mm,
+pte_t *huge_pte_alloc(struct mm_struct *mm, struct mm_struct *locked_mm,
 			unsigned long addr, unsigned long sz)
 {
 	pgd_t *pgd;
diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index 07e1453..04c063f 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -193,7 +193,7 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 				pgoff, flags);
 }
 
-pte_t *huge_pte_alloc(struct mm_struct *mm,
+pte_t *huge_pte_alloc(struct mm_struct *mm, struct mm_struct *locked_mm,
 			unsigned long addr, unsigned long sz)
 {
 	pgd_t *pgd;
diff --git a/arch/tile/mm/hugetlbpage.c b/arch/tile/mm/hugetlbpage.c
index 42cfcba..490d29b 100644
--- a/arch/tile/mm/hugetlbpage.c
+++ b/arch/tile/mm/hugetlbpage.c
@@ -28,7 +28,7 @@
 #include <asm/tlb.h>
 #include <asm/tlbflush.h>
 
-pte_t *huge_pte_alloc(struct mm_struct *mm,
+pte_t *huge_pte_alloc(struct mm_struct *mm, struct mm_struct *locked_mm,
 		      unsigned long addr, unsigned long sz)
 {
 	pgd_t *pgd;
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index f6679a7..944b2df 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -58,7 +58,8 @@ static int vma_shareable(struct vm_area_struct *vma, unsigned long addr)
 /*
  * search for a shareable pmd page for hugetlb.
  */
-static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
+static void huge_pmd_share(struct mm_struct *mm, struct mm_struct *locked_mm,
+			   unsigned long addr, pud_t *pud)
 {
 	struct vm_area_struct *vma = find_vma(mm, addr);
 	struct address_space *mapping = vma->vm_file->f_mapping;
@@ -68,14 +69,40 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 	struct vm_area_struct *svma;
 	unsigned long saddr;
 	pte_t *spte = NULL;
+	spinlock_t *spage_table_lock = NULL;
+	struct rw_semaphore *smmap_sem = NULL;
 
 	if (!vma_shareable(vma, addr))
 		return;
 
+retry:
 	mutex_lock(&mapping->i_mmap_mutex);
 	vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) {
 		if (svma == vma)
 			continue;
+		if (svma->vm_mm == vma->vm_mm)
+			continue;
+
+		/*
+		 * The target mm could be in the process of tearing down
+		 * its page tables and the i_mmap_mutex on its own is
+		 * not sufficient. To prevent races against teardown and
+		 * pagetable updates, we acquire the mmap_sem and pagetable
+		 * lock of the remote address space. down_read_trylock()
+		 * is necessary as the other process could also be trying
+		 * to share pagetables with the current mm. In the fork
+		 * case, we already hold both mm's mmap_sem, so check for that
+		 */
+		if (locked_mm != svma->vm_mm) {
+			if (!down_read_trylock(&svma->vm_mm->mmap_sem)) {
+				mutex_unlock(&mapping->i_mmap_mutex);
+				goto retry;
+			}
+			smmap_sem = &svma->vm_mm->mmap_sem;
+		}
+
+		spage_table_lock = &svma->vm_mm->page_table_lock;
+		spin_lock_nested(spage_table_lock, SINGLE_DEPTH_NESTING);
 
 		saddr = page_table_shareable(svma, vma, addr, idx);
 		if (saddr) {
@@ -85,6 +112,12 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 				break;
 			}
 		}
+		spin_unlock(spage_table_lock);
+		spage_table_lock = NULL;
+		if (smmap_sem) {
+			up_read(smmap_sem);
+			smmap_sem = NULL;
+		}
 	}
 
 	if (!spte)
@@ -96,6 +129,9 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 	else
 		put_page(virt_to_page(spte));
 	spin_unlock(&mm->page_table_lock);
+	spin_unlock(spage_table_lock);
+	if (smmap_sem)
+		up_read(smmap_sem);
 out:
 	mutex_unlock(&mapping->i_mmap_mutex);
 }
@@ -127,7 +163,7 @@ int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
 	return 1;
 }
 
-pte_t *huge_pte_alloc(struct mm_struct *mm,
+pte_t *huge_pte_alloc(struct mm_struct *mm, struct mm_struct *locked_mm,
 			unsigned long addr, unsigned long sz)
 {
 	pgd_t *pgd;
@@ -142,7 +178,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 		} else {
 			BUG_ON(sz != PMD_SIZE);
 			if (pud_none(*pud))
-				huge_pmd_share(mm, addr, pud);
+				huge_pmd_share(mm, locked_mm, addr, pud);
 			pte = (pte_t *) pmd_alloc(mm, pud, addr);
 		}
 	}
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 000837e..bae0f7b 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -63,7 +63,7 @@ extern struct list_head huge_boot_pages;
 
 /* arch callbacks */
 
-pte_t *huge_pte_alloc(struct mm_struct *mm,
+pte_t *huge_pte_alloc(struct mm_struct *mm, struct mm_struct *locked_mm,
 			unsigned long addr, unsigned long sz);
 pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr);
 int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ae8f708..4832277 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2244,7 +2244,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		src_pte = huge_pte_offset(src, addr);
 		if (!src_pte)
 			continue;
-		dst_pte = huge_pte_alloc(dst, addr, sz);
+		dst_pte = huge_pte_alloc(dst, src, addr, sz);
 		if (!dst_pte)
 			goto nomem;
 
@@ -2745,7 +2745,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			       VM_FAULT_SET_HINDEX(h - hstates);
 	}
 
-	ptep = huge_pte_alloc(mm, address, huge_page_size(h));
+	ptep = huge_pte_alloc(mm, NULL, address, huge_page_size(h));
 	if (!ptep)
 		return VM_FAULT_OOM;
 

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
  2012-07-20 14:11 ` [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend) Mel Gorman
@ 2012-07-20 14:29   ` Michal Hocko
  2012-07-20 14:37     ` Mel Gorman
  2012-07-20 14:36   ` [PATCH -alternative] " Michal Hocko
  1 sibling, 1 reply; 50+ messages in thread
From: Michal Hocko @ 2012-07-20 14:29 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Hugh Dickins, David Gibson, Ken Chen, Cong Wang, LKML

On Fri 20-07-12 15:11:08, Mel Gorman wrote:
> Sorry for the resend, I did not properly refresh Cong Wang's suggested
> fix. This V2 is still the mmap_sem approach that fixes a potential deadlock
> problem pointed out by Michal.
> 
> Changelog since V1
>  o Correct cut&paste error in race description			(hugh)
>  o Handle potential deadlock during fork			(mhocko)
>  o Reorder unlocking						(wangcong)
> 
> If a process creates a large hugetlbfs mapping that is eligible for page
> table sharing and forks heavily with children some of whom fault and
> others which destroy the mapping then it is possible for page tables to
> get corrupted. Some teardowns of the mapping encounter a "bad pmd" and
> output a message to the kernel log. The final teardown will trigger a
> BUG_ON in mm/filemap.c.
> 
> This was reproduced in 3.4 but is known to have existed for a long time
> and goes back at least as far as 2.6.37. It was probably introduced in
> 2.6.20 by [39dde65c: shared page table for hugetlb page]. The messages
> look like this:
> 
> [  ..........] Lots of bad pmd messages followed by this
> [  127.164256] mm/memory.c:391: bad pmd ffff880412e04fe8(80000003de4000e7).
> [  127.164257] mm/memory.c:391: bad pmd ffff880412e04ff0(80000003de6000e7).
> [  127.164258] mm/memory.c:391: bad pmd ffff880412e04ff8(80000003de0000e7).
> [  127.186778] ------------[ cut here ]------------
> [  127.186781] kernel BUG at mm/filemap.c:134!
> [  127.186782] invalid opcode: 0000 [#1] SMP
> [  127.186783] CPU 7
> [  127.186784] Modules linked in: af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf ext3 jbd dm_mod coretemp crc32c_intel usb_storage ghash_clmulni_intel aesni_intel i2c_i801 r8169 mii uas sr_mod cdrom sg iTCO_wdt iTCO_vendor_support shpchp serio_raw cryptd aes_x86_64 e1000e pci_hotplug dcdbas aes_generic container microcode ext4 mbcache jbd2 crc16 sd_mod crc_t10dif i915 drm_kms_helper drm i2c_algo_bit ehci_hcd ahci libahci usbcore rtc_cmos usb_common button i2c_core intel_agp video intel_gtt fan processor thermal thermal_sys hwmon ata_generic pata_atiixp libata scsi_mod
> [  127.186801]
> [  127.186802] Pid: 9017, comm: hugetlbfs-test Not tainted 3.4.0-autobuild #53 Dell Inc. OptiPlex 990/06D7TR
> [  127.186804] RIP: 0010:[<ffffffff810ed6ce>]  [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
> [  127.186809] RSP: 0000:ffff8804144b5c08  EFLAGS: 00010002
> [  127.186810] RAX: 0000000000000001 RBX: ffffea000a5c9000 RCX: 00000000ffffffc0
> [  127.186811] RDX: 0000000000000000 RSI: 0000000000000009 RDI: ffff88042dfdad00
> [  127.186812] RBP: ffff8804144b5c18 R08: 0000000000000009 R09: 0000000000000003
> [  127.186813] R10: 0000000000000000 R11: 000000000000002d R12: ffff880412ff83d8
> [  127.186814] R13: ffff880412ff83d8 R14: 0000000000000000 R15: ffff880412ff83d8
> [  127.186815] FS:  00007fe18ed2c700(0000) GS:ffff88042dce0000(0000) knlGS:0000000000000000
> [  127.186816] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [  127.186817] CR2: 00007fe340000503 CR3: 0000000417a14000 CR4: 00000000000407e0
> [  127.186818] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  127.186819] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [  127.186820] Process hugetlbfs-test (pid: 9017, threadinfo ffff8804144b4000, task ffff880417f803c0)
> [  127.186821] Stack:
> [  127.186822]  ffffea000a5c9000 0000000000000000 ffff8804144b5c48 ffffffff810ed83b
> [  127.186824]  ffff8804144b5c48 000000000000138a 0000000000001387 ffff8804144b5c98
> [  127.186825]  ffff8804144b5d48 ffffffff811bc925 ffff8804144b5cb8 0000000000000000
> [  127.186827] Call Trace:
> [  127.186829]  [<ffffffff810ed83b>] delete_from_page_cache+0x3b/0x80
> [  127.186832]  [<ffffffff811bc925>] truncate_hugepages+0x115/0x220
> [  127.186834]  [<ffffffff811bca43>] hugetlbfs_evict_inode+0x13/0x30
> [  127.186837]  [<ffffffff811655c7>] evict+0xa7/0x1b0
> [  127.186839]  [<ffffffff811657a3>] iput_final+0xd3/0x1f0
> [  127.186840]  [<ffffffff811658f9>] iput+0x39/0x50
> [  127.186842]  [<ffffffff81162708>] d_kill+0xf8/0x130
> [  127.186843]  [<ffffffff81162812>] dput+0xd2/0x1a0
> [  127.186845]  [<ffffffff8114e2d0>] __fput+0x170/0x230
> [  127.186848]  [<ffffffff81236e0e>] ? rb_erase+0xce/0x150
> [  127.186849]  [<ffffffff8114e3ad>] fput+0x1d/0x30
> [  127.186851]  [<ffffffff81117db7>] remove_vma+0x37/0x80
> [  127.186853]  [<ffffffff81119182>] do_munmap+0x2d2/0x360
> [  127.186855]  [<ffffffff811cc639>] sys_shmdt+0xc9/0x170
> [  127.186857]  [<ffffffff81410a39>] system_call_fastpath+0x16/0x1b
> [  127.186858] Code: 0f 1f 44 00 00 48 8b 43 08 48 8b 00 48 8b 40 28 8b b0 40 03 00 00 85 f6 0f 88 df fe ff ff 48 89 df e8 e7 cb 05 00 e9 d2 fe ff ff <0f> 0b 55 83 e2 fd 48 89 e5 48 83 ec 30 48 89 5d d8 4c 89 65 e0
> [  127.186868] RIP  [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
> [  127.186870]  RSP <ffff8804144b5c08>
> [  127.186871] ---[ end trace 7cbac5d1db69f426 ]---
> 
> The bug is a race and not always easy to reproduce. To reproduce it I was
> doing the following on a single socket I7-based machine with 16G of RAM.
> 
> $ hugeadm --pool-pages-max DEFAULT:13G
> $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmmax
> $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmall
> $ for i in `seq 1 9000`; do ./hugetlbfs-test; done
> 
> On my particular machine, it usually triggers within 10 minutes but enabling
> debug options can change the timing such that it never hits. Once the bug is
> triggered, the machine is in trouble and needs to be rebooted. The machine
> will respond but processes accessing proc like "ps aux" will hang due to
> the BUG_ON. shutdown will also hang and needs a hard reset or a sysrq-b.
> 
> The test case was mostly written by Michal Hocko with a few minor changes
> by me to reproduce this bug. Michal did a lot of heavy lifting eliminating
> possible sources of the race and saved me the embarrassment of posting a
> completely broken patch yesterday. He did not see this patch before
> going to the lists so any critical flaws are all mine!
> 
> The basic problem is a race between page table sharing and teardown. For
> the most part page table sharing depends on i_mmap_mutex. In some cases,
> it is also taking the mm->page_table_lock for the PTE updates but with
> shared page tables, it is the i_mmap_mutex that is more important.
> 
> Unfortunately it appears to be also insufficient. Consider the following
> situation
> 
> Process A					Process B
> ---------					---------
> hugetlb_fault					shmdt
>   						LockWrite(mmap_sem)
>     						  do_munmap
> 						    unmap_region
> 						      unmap_vmas
> 						        unmap_single_vma
> 						          unmap_hugepage_range
>       						            Lock(i_mmap_mutex)
> 							    Lock(mm->page_table_lock)
> 							    huge_pmd_unshare/unmap tables <--- (1)
> 							    Unlock(mm->page_table_lock)
>       						            Unlock(i_mmap_mutex)
>   huge_pte_alloc				      ...
>     Lock(i_mmap_mutex)				      ...
>     vma_prio_walk, find svma, spte		      ...
>     Lock(mm->page_table_lock)			      ...
>     share spte					      ...
>     Unlock(mm->page_table_lock)			      ...
>     Unlock(i_mmap_mutex)			      ...
>   hugetlb_no_page									  <--- (2)
> 						      free_pgtables
> 						        unlink_file_vma
> 							hugetlb_free_pgd_range
> 						    remove_vma_list
> 
> In this scenario, it is possible for Process A to share page tables with
> Process B that is trying to tear them down.  The i_mmap_mutex on its own
> does not prevent Process A walking Process B's page tables. At (1) above,
> the page tables are not shared yet so it unmaps the PMDs. Process A sets
> up page table sharing and at (2) faults a new entry. Process B then trips
> up on it in free_pgtables.
> 
> This patch takes the mmap_sem for read and then the page_table_lock of
> address spaces being considered for page table sharing. I verified that
> page table sharing still occurs using the awesome technology of printk
> to spit out a message when huge_pmd_share is successful. libhugetlbfs
> regression test suite passed.
> 
> I strongly suggest this be treated as a -stable candidate if it is merged.
> 
> Test program is as follows.
> 
> ==== CUT HERE ====
> 
> static size_t huge_page_size = (2UL << 20);
> static size_t nr_huge_page_A = 512;
> static size_t nr_huge_page_B = 5632;
> 
> unsigned int get_random(unsigned int max)
> {
> 	struct timeval tv;
> 
> 	gettimeofday(&tv, NULL);
> 	srandom(tv.tv_usec);
> 	return random() % max;
> }
> 
> static void play(void *addr, size_t size)
> {
> 	unsigned char *start = addr,
> 		      *end = start + size,
> 		      *a;
> 	start += get_random(size/2);
> 
> 	/* we could iterate on huge pages but let's give it more time. */
> 	for (a = start; a < end; a += 4096)
> 		*a = 0;
> }
> 
> int main(int argc, char **argv)
> {
> 	key_t key = IPC_PRIVATE;
> 	size_t sizeA = nr_huge_page_A * huge_page_size;
> 	size_t sizeB = nr_huge_page_B * huge_page_size;
> 	int shmidA, shmidB;
> 	void *addrA = NULL, *addrB = NULL;
> 	int nr_children = 300, n = 0;
> 
> 	if ((shmidA = shmget(key, sizeA, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
> 		perror("shmget:");
> 		return 1;
> 	}
> 
> 	if ((addrA = shmat(shmidA, addrA, SHM_R|SHM_W)) == (void *)-1UL) {
> 		perror("shmat");
> 		return 1;
> 	}
> 	if ((shmidB = shmget(key, sizeB, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
> 		perror("shmget:");
> 		return 1;
> 	}
> 
> 	if ((addrB = shmat(shmidB, addrB, SHM_R|SHM_W)) == (void *)-1UL) {
> 		perror("shmat");
> 		return 1;
> 	}
> 
> fork_child:
> 	switch(fork()) {
> 		case 0:
> 			switch (n%3) {
> 			case 0:
> 				play(addrA, sizeA);
> 				break;
> 			case 1:
> 				play(addrB, sizeB);
> 				break;
> 			case 2:
> 				break;
> 			}
> 			break;
> 		case -1:
> 			perror("fork:");
> 			break;
> 		default:
> 			if (++n < nr_children)
> 				goto fork_child;
> 			play(addrA, sizeA);
> 			break;
> 	}
> 	shmdt(addrA);
> 	shmdt(addrB);
> 	do {
> 		wait(NULL);
> 	} while (--n > 0);
> 	shmctl(shmidA, IPC_RMID, NULL);
> 	shmctl(shmidB, IPC_RMID, NULL);
> 	return 0;
> }
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Yes, this looks correct. mmap_sem will make sure that unmap_vmas and
free_pgtables are executed atomically wrt. huge_pmd_share so it doesn't
see a non-NULL spte on the way out. I am just wondering whether we need
the page_table_lock as well. It is not harmful, but I guess we can drop
it because both exit_mmap and shmdt are not taking it and mmap_sem is
sufficient for them.
One more nit below.

I will send my version of the fix, which takes another path, as a reply
to this email so there is something to compare with.

[...]
> diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> index f6679a7..944b2df 100644
> --- a/arch/x86/mm/hugetlbpage.c
> +++ b/arch/x86/mm/hugetlbpage.c
> @@ -58,7 +58,8 @@ static int vma_shareable(struct vm_area_struct *vma, unsigned long addr)
>  /*
>   * search for a shareable pmd page for hugetlb.
>   */
> -static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> +static void huge_pmd_share(struct mm_struct *mm, struct mm_struct *locked_mm,
> +			   unsigned long addr, pud_t *pud)
>  {
>  	struct vm_area_struct *vma = find_vma(mm, addr);
>  	struct address_space *mapping = vma->vm_file->f_mapping;
> @@ -68,14 +69,40 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>  	struct vm_area_struct *svma;
>  	unsigned long saddr;
>  	pte_t *spte = NULL;
> +	spinlock_t *spage_table_lock = NULL;
> +	struct rw_semaphore *smmap_sem = NULL;
>  
>  	if (!vma_shareable(vma, addr))
>  		return;
>  
> +retry:
>  	mutex_lock(&mapping->i_mmap_mutex);
>  	vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) {
>  		if (svma == vma)
>  			continue;
> +		if (svma->vm_mm == vma->vm_mm)
> +			continue;
> +
> +		/*
> +		 * The target mm could be in the process of tearing down
> +		 * its page tables and the i_mmap_mutex on its own is
> +		 * not sufficient. To prevent races against teardown and
> +		 * pagetable updates, we acquire the mmap_sem and pagetable
> +		 * lock of the remote address space. down_read_trylock()
> +		 * is necessary as the other process could also be trying
> +		 * to share pagetables with the current mm. In the fork
> +		 * case, we already hold both mm's mmap_sem, so check for that
> +		 */
> +		if (locked_mm != svma->vm_mm) {
> +			if (!down_read_trylock(&svma->vm_mm->mmap_sem)) {
> +				mutex_unlock(&mapping->i_mmap_mutex);
> +				goto retry;
> +			}
> +			smmap_sem = &svma->vm_mm->mmap_sem;
> +		}
> +
> +		spage_table_lock = &svma->vm_mm->page_table_lock;
> +		spin_lock_nested(spage_table_lock, SINGLE_DEPTH_NESTING);
>  
>  		saddr = page_table_shareable(svma, vma, addr, idx);
>  		if (saddr) {
[...]
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index ae8f708..4832277 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2244,7 +2244,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>  		src_pte = huge_pte_offset(src, addr);
>  		if (!src_pte)
>  			continue;
> -		dst_pte = huge_pte_alloc(dst, addr, sz);
> +		dst_pte = huge_pte_alloc(dst, src, addr, sz);
>  		if (!dst_pte)
>  			goto nomem;
>  
> @@ -2745,7 +2745,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  			       VM_FAULT_SET_HINDEX(h - hstates);
>  	}
>  
> -	ptep = huge_pte_alloc(mm, address, huge_page_size(h));
> +	ptep = huge_pte_alloc(mm, NULL, address, huge_page_size(h));

Strictly speaking we should provide current->mm here because we are in
the page fault path and mmap_sem is held for reading. This doesn't
matter here though because huge_pmd_share will take it for reading so
nesting wouldn't hurt. Maybe a small comment that this is intentional
and correct would be nice.
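
Concretely, something like this is what I have in mind (illustrative
only, not taken from the posted patch):

	/* fault path: mm->mmap_sem is already held for reading */
	ptep = huge_pte_alloc(mm, mm, address, huge_page_size(h));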

>  	if (!ptep)
>  		return VM_FAULT_OOM;
>  
> 
> -- 
> Mel Gorman
> SUSE Labs

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic


* [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
  2012-07-20 14:11 ` [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend) Mel Gorman
  2012-07-20 14:29   ` Michal Hocko
@ 2012-07-20 14:36   ` Michal Hocko
  2012-07-20 14:51     ` Mel Gorman
  2012-07-26 18:31     ` Rik van Riel
  1 sibling, 2 replies; 50+ messages in thread
From: Michal Hocko @ 2012-07-20 14:36 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Hugh Dickins, David Gibson, Ken Chen, Cong Wang, LKML

And here is my attempt at a fix (Hugh mentioned something similar
earlier but he suggested using special flags in ptes or VMAs). I still
owe a documentation update, it hasn't been tested with many configs, and
I could have missed some definition updates.
I also think the changelog could be much better; I will add (steal) the
full bug description if people think this approach is worth pursuing
rather than the one suggested by Mel.
To be honest I am not quite happy with how I had to pollute generic mm
code with something that is specific to a single architecture.
Mel hammered it with the test case and it survived.
---
From d71cd88da83c669beced2aa752847265e896b89b Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Fri, 20 Jul 2012 15:10:40 +0200
Subject: [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs

The primary problem is that huge_pte_offset called from huge_pmd_share might
return a spte which is about to be deallocated, and the caller doesn't have
any way to find this out. This means that huge_pmd_share might reuse the spte
and increase the reference count on its page. i_mmap_mutex is not sufficient
because the spte is still visible after unmap_hugepage_range, and
free_pgtables, which takes care of spte teardown, doesn't use any locking.

This patch addresses the issue by marking the spte's page as being due for
removal (misusing page->mapping for that purpose, because that field is not
used for a spte page during its whole life cycle).
huge_pmd_share then checks is_hugetlb_pmd_page_valid and ignores such sptes.
The spte page is invalidated after the last pte is unshared, and only if
unmap_vmas is followed by free_pgtables. This is all done under i_mmap_mutex
so we cannot race.

The spte page is cleaned up later in free_pmd_range, after the pud is cleared
and before the page is handed over to pmd_free_tlb, which frees it. A write
memory barrier makes sure that other CPUs always see either the cleared pud
(and thus a NULL pmd) or the invalidated spte page.

[motivated by Hugh's idea]
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 arch/x86/include/asm/hugetlb.h |   30 ++++++++++++++++++++++++++++++
 arch/x86/mm/hugetlbpage.c      |    7 ++++++-
 fs/hugetlbfs/inode.c           |    2 +-
 include/linux/hugetlb.h        |    7 ++++---
 include/linux/mm.h             |    2 +-
 mm/hugetlb.c                   |   10 ++++++----
 mm/memory.c                    |   17 +++++++++++------
 mm/mmap.c                      |    4 ++--
 8 files changed, 61 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/hugetlb.h b/arch/x86/include/asm/hugetlb.h
index 439a9ac..eaff713 100644
--- a/arch/x86/include/asm/hugetlb.h
+++ b/arch/x86/include/asm/hugetlb.h
@@ -48,6 +48,36 @@ static inline pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
 	return ptep_get_and_clear(mm, addr, ptep);
 }
 
+static inline bool is_hugetlb_pmd_page_valid(struct page *page)
+{
+	smp_rmb();
+	return page->mapping == NULL;
+}
+
+static inline void invalidate_hugetlb_pmd_page(pmd_t *pmd)
+{
+	struct page *pmd_page = virt_to_page(pmd);
+
+	/* TODO add comment about the race */
+	pmd_page->mapping = (struct address_space *)(-1UL);
+	smp_wmb();
+}
+
+#define ARCH_HAVE_CLEAR_HUGETLB_PMD_PAGE 1
+static inline void clear_hugetlb_pmd_page(pmd_t *pmd)
+{
+	struct page *pmd_page = virt_to_page(pmd);
+	if (!is_hugetlb_pmd_page_valid(pmd_page)) {
+		/*
+		 * Make sure that pud_clear is visible before we remove the
+		 * invalidate flag here so huge_pte_offset either returns NULL
+		 * or invalidated pmd page during the tear down.
+		 */
+		smp_wmb();
+		pmd_page->mapping = NULL;
+	}
+}
+
 static inline void huge_ptep_clear_flush(struct vm_area_struct *vma,
 					 unsigned long addr, pte_t *ptep)
 {
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index f6679a7..c040089 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -81,7 +81,12 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 		if (saddr) {
 			spte = huge_pte_offset(svma->vm_mm, saddr);
 			if (spte) {
-				get_page(virt_to_page(spte));
+				struct page *spte_page = virt_to_page(spte);
+				if (!is_hugetlb_pmd_page_valid(spte_page)) {
+					spte = NULL;
+					continue;
+				}
+				get_page(spte_page);
 				break;
 			}
 		}
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 001ef01..0d0c235 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -417,7 +417,7 @@ hugetlb_vmtruncate_list(struct prio_tree_root *root, pgoff_t pgoff)
 			v_offset = 0;
 
 		__unmap_hugepage_range(vma,
-				vma->vm_start + v_offset, vma->vm_end, NULL);
+				vma->vm_start + v_offset, vma->vm_end, NULL, false);
 	}
 }
 
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 000837e..208b662 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -40,9 +40,9 @@ int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
 			struct page **, struct vm_area_struct **,
 			unsigned long *, int *, int, unsigned int flags);
 void unmap_hugepage_range(struct vm_area_struct *,
-			unsigned long, unsigned long, struct page *);
+			unsigned long, unsigned long, struct page *, bool);
 void __unmap_hugepage_range(struct vm_area_struct *,
-			unsigned long, unsigned long, struct page *);
+			unsigned long, unsigned long, struct page *, bool);
 int hugetlb_prefault(struct address_space *, struct vm_area_struct *);
 void hugetlb_report_meminfo(struct seq_file *);
 int hugetlb_report_node_meminfo(int, char *);
@@ -98,7 +98,7 @@ static inline unsigned long hugetlb_total_pages(void)
 #define follow_huge_addr(mm, addr, write)	ERR_PTR(-EINVAL)
 #define copy_hugetlb_page_range(src, dst, vma)	({ BUG(); 0; })
 #define hugetlb_prefault(mapping, vma)		({ BUG(); 0; })
-#define unmap_hugepage_range(vma, start, end, page)	BUG()
+#define unmap_hugepage_range(vma, start, end, page, last)	BUG()
 static inline void hugetlb_report_meminfo(struct seq_file *m)
 {
 }
@@ -112,6 +112,7 @@ static inline void hugetlb_report_meminfo(struct seq_file *m)
 #define hugetlb_free_pgd_range(tlb, addr, end, floor, ceiling) ({BUG(); 0; })
 #define hugetlb_fault(mm, vma, addr, flags)	({ BUG(); 0; })
 #define huge_pte_offset(mm, address)	0
+#define clear_hugetlb_pmd_page(pmd)	0
 #define dequeue_hwpoisoned_huge_page(page)	0
 static inline void copy_huge_page(struct page *dst, struct page *src)
 {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 74aa71b..ba891bb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -899,7 +899,7 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long address,
 void unmap_vmas(struct mmu_gather *tlb,
 		struct vm_area_struct *start_vma, unsigned long start_addr,
 		unsigned long end_addr, unsigned long *nr_accounted,
-		struct zap_details *);
+		struct zap_details *, bool last);
 
 /**
  * mm_walk - callbacks for walk_page_range
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ae8f708..9952d19 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2299,7 +2299,7 @@ static int is_hugetlb_entry_hwpoisoned(pte_t pte)
 }
 
 void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
-			    unsigned long end, struct page *ref_page)
+			    unsigned long end, struct page *ref_page, bool last)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long address;
@@ -2332,6 +2332,8 @@ void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
 			continue;
 
 		pte = huge_ptep_get(ptep);
+		if (last)
+			invalidate_hugetlb_pmd_page((pmd_t*)ptep);
 		if (huge_pte_none(pte))
 			continue;
 
@@ -2379,10 +2381,10 @@ void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
 }
 
 void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
-			  unsigned long end, struct page *ref_page)
+			  unsigned long end, struct page *ref_page, bool last)
 {
 	mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
-	__unmap_hugepage_range(vma, start, end, ref_page);
+	__unmap_hugepage_range(vma, start, end, ref_page, last);
 	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
 }
 
@@ -2430,7 +2432,7 @@ static int unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (!is_vma_resv_set(iter_vma, HPAGE_RESV_OWNER))
 			__unmap_hugepage_range(iter_vma,
 				address, address + huge_page_size(h),
-				page);
+				page, false);
 	}
 	mutex_unlock(&mapping->i_mmap_mutex);
 
diff --git a/mm/memory.c b/mm/memory.c
index 6105f47..c6c6e83 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -405,6 +405,10 @@ static void free_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
 	tlb->mm->nr_ptes--;
 }
 
+#ifndef ARCH_HAVE_CLEAR_HUGETLB_PMD_PAGE
+#define clear_hugetlb_pmd_page(pmd)	0
+#endif
+
 static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 				unsigned long addr, unsigned long end,
 				unsigned long floor, unsigned long ceiling)
@@ -435,6 +439,7 @@ static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 
 	pmd = pmd_offset(pud, start);
 	pud_clear(pud);
+	clear_hugetlb_pmd_page(pmd);
 	pmd_free_tlb(tlb, pmd, start);
 }
 
@@ -1296,7 +1301,7 @@ static void unmap_page_range(struct mmu_gather *tlb,
 static void unmap_single_vma(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, unsigned long start_addr,
 		unsigned long end_addr, unsigned long *nr_accounted,
-		struct zap_details *details)
+		struct zap_details *details, bool last)
 {
 	unsigned long start = max(vma->vm_start, start_addr);
 	unsigned long end;
@@ -1327,7 +1332,7 @@ static void unmap_single_vma(struct mmu_gather *tlb,
 			 * safe to do nothing in this case.
 			 */
 			if (vma->vm_file)
-				unmap_hugepage_range(vma, start, end, NULL);
+				unmap_hugepage_range(vma, start, end, NULL, last);
 		} else
 			unmap_page_range(tlb, vma, start, end, details);
 	}
@@ -1356,14 +1361,14 @@ static void unmap_single_vma(struct mmu_gather *tlb,
 void unmap_vmas(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, unsigned long start_addr,
 		unsigned long end_addr, unsigned long *nr_accounted,
-		struct zap_details *details)
+		struct zap_details *details, bool last)
 {
 	struct mm_struct *mm = vma->vm_mm;
 
 	mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
 	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
 		unmap_single_vma(tlb, vma, start_addr, end_addr, nr_accounted,
-				 details);
+				 details, last);
 	mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
 }
 
@@ -1387,7 +1392,7 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long address,
 	lru_add_drain();
 	tlb_gather_mmu(&tlb, mm, 0);
 	update_hiwater_rss(mm);
-	unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
+	unmap_vmas(&tlb, vma, address, end, &nr_accounted, details, false);
 	tlb_finish_mmu(&tlb, address, end);
 }
 
@@ -1412,7 +1417,7 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
 	tlb_gather_mmu(&tlb, mm, 0);
 	update_hiwater_rss(mm);
 	mmu_notifier_invalidate_range_start(mm, address, end);
-	unmap_single_vma(&tlb, vma, address, end, &nr_accounted, details);
+	unmap_single_vma(&tlb, vma, address, end, &nr_accounted, details, false);
 	mmu_notifier_invalidate_range_end(mm, address, end);
 	tlb_finish_mmu(&tlb, address, end);
 }
diff --git a/mm/mmap.c b/mm/mmap.c
index 848ef52..f4566e4 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1917,7 +1917,7 @@ static void unmap_region(struct mm_struct *mm,
 	lru_add_drain();
 	tlb_gather_mmu(&tlb, mm, 0);
 	update_hiwater_rss(mm);
-	unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL);
+	unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL, true);
 	vm_unacct_memory(nr_accounted);
 	free_pgtables(&tlb, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
 				 next ? next->vm_start : 0);
@@ -2305,7 +2305,7 @@ void exit_mmap(struct mm_struct *mm)
 	tlb_gather_mmu(&tlb, mm, 1);
 	/* update_hiwater_rss(mm) here? but nobody should be looking */
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
-	unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
+	unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL, true);
 	vm_unacct_memory(nr_accounted);
 
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
-- 
1.7.10.4

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
  2012-07-20 14:29   ` Michal Hocko
@ 2012-07-20 14:37     ` Mel Gorman
  2012-07-20 14:40       ` Michal Hocko
  0 siblings, 1 reply; 50+ messages in thread
From: Mel Gorman @ 2012-07-20 14:37 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Linux-MM, Hugh Dickins, David Gibson, Ken Chen, Cong Wang, LKML

On Fri, Jul 20, 2012 at 04:29:20PM +0200, Michal Hocko wrote:
> > <SNIP>
> > 
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> 
> Yes this looks correct. mmap_sem will make sure that unmap_vmas and
> free_pgtables are executed atomically wrt. huge_pmd_share so it doesn't
> see non-NULL spte on the way out.

Yes.

> I am just wondering whether we need
> the page_table_lock as well. It is not harmful but I guess we can drop
> it because both exit_mmap and shmdt are not taking it and mmap_sem is
> sufficient for them.

While it is true that we don't *really* need page_table_lock here, we are
still updating page tables and it's in line with the ordinary locking
rules.  There are other cases in hugetlb.c where we do pte_same() checks even
though we are protected from the related races by the instantiation_mutex.

page_table_lock is actually a bit useless for shared page tables. If shared
page tables were ever to be a general thing then I think we'd have to
revisit how PTE update locking is done but I doubt anyone wants to dive
down that rat-hole.

For now, I'm going to keep taking it even if strictly speaking it's not
necessary.

> One more nit below.
> 
> I will send my version of the fix which took another path as a reply to
> this email to have something to compare with.
> 

Thanks.

> [...]
> > diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> > index f6679a7..944b2df 100644
> > --- a/arch/x86/mm/hugetlbpage.c
> > +++ b/arch/x86/mm/hugetlbpage.c
> > @@ -58,7 +58,8 @@ static int vma_shareable(struct vm_area_struct *vma, unsigned long addr)
> >  /*
> >   * search for a shareable pmd page for hugetlb.
> >   */
> > -static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> > +static void huge_pmd_share(struct mm_struct *mm, struct mm_struct *locked_mm,
> > +			   unsigned long addr, pud_t *pud)
> >  {
> >  	struct vm_area_struct *vma = find_vma(mm, addr);
> >  	struct address_space *mapping = vma->vm_file->f_mapping;
> > @@ -68,14 +69,40 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> >  	struct vm_area_struct *svma;
> >  	unsigned long saddr;
> >  	pte_t *spte = NULL;
> > +	spinlock_t *spage_table_lock = NULL;
> > +	struct rw_semaphore *smmap_sem = NULL;
> >  
> >  	if (!vma_shareable(vma, addr))
> >  		return;
> >  
> > +retry:
> >  	mutex_lock(&mapping->i_mmap_mutex);
> >  	vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) {
> >  		if (svma == vma)
> >  			continue;
> > +		if (svma->vm_mm == vma->vm_mm)
> > +			continue;
> > +
> > +		/*
> > +		 * The target mm could be in the process of tearing down
> > +		 * its page tables and the i_mmap_mutex on its own is
> > +		 * not sufficient. To prevent races against teardown and
> > +		 * pagetable updates, we acquire the mmap_sem and pagetable
> > +		 * lock of the remote address space. down_read_trylock()
> > +		 * is necessary as the other process could also be trying
> > +		 * to share pagetables with the current mm. In the fork
> > +		 * case, we are already both mm's so check for that
> > +		 */
> > +		if (locked_mm != svma->vm_mm) {
> > +			if (!down_read_trylock(&svma->vm_mm->mmap_sem)) {
> > +				mutex_unlock(&mapping->i_mmap_mutex);
> > +				goto retry;
> > +			}
> > +			smmap_sem = &svma->vm_mm->mmap_sem;
> > +		}
> > +
> > +		spage_table_lock = &svma->vm_mm->page_table_lock;
> > +		spin_lock_nested(spage_table_lock, SINGLE_DEPTH_NESTING);
> >  
> >  		saddr = page_table_shareable(svma, vma, addr, idx);
> >  		if (saddr) {
> [...]
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index ae8f708..4832277 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -2244,7 +2244,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
> >  		src_pte = huge_pte_offset(src, addr);
> >  		if (!src_pte)
> >  			continue;
> > -		dst_pte = huge_pte_alloc(dst, addr, sz);
> > +		dst_pte = huge_pte_alloc(dst, src, addr, sz);
> >  		if (!dst_pte)
> >  			goto nomem;
> >  
> > @@ -2745,7 +2745,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >  			       VM_FAULT_SET_HINDEX(h - hstates);
> >  	}
> >  
> > -	ptep = huge_pte_alloc(mm, address, huge_page_size(h));
> > +	ptep = huge_pte_alloc(mm, NULL, address, huge_page_size(h));
> 
> strictly speaking we should provide current->mm here because we are in
> the page fault path and mmap_sem is held for reading. This doesn't
> matter here though because huge_pmd_share will take it for reading so
> nesting wouldn't hurt. Maybe a small comment that this is intentional
> and correct would be nice.
> 

Fair point. If we go with this version of the fix, I'll improve the
documentation a bit.

Thanks!

> >  	if (!ptep)
> >  		return VM_FAULT_OOM;
> >  
> > 

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
  2012-07-20 14:37     ` Mel Gorman
@ 2012-07-20 14:40       ` Michal Hocko
  0 siblings, 0 replies; 50+ messages in thread
From: Michal Hocko @ 2012-07-20 14:40 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Hugh Dickins, David Gibson, Ken Chen, Cong Wang, LKML

On Fri 20-07-12 15:37:53, Mel Gorman wrote:
> On Fri, Jul 20, 2012 at 04:29:20PM +0200, Michal Hocko wrote:
> > > <SNIP>
> > > 
> > > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > 
> > Yes this looks correct. mmap_sem will make sure that unmap_vmas and
> > free_pgtables are executed atomically wrt. huge_pmd_share so it doesn't
> > see non-NULL spte on the way out.
> 
> Yes.
> 
> > I am just wondering whether we need
> > the page_table_lock as well. It is not harmful but I guess we can drop
> > it because both exit_mmap and shmdt are not taking it and mmap_sem is
> > sufficient for them.
> 
> While it is true that we don't *really* need page_table_lock here, we are
> still updating page tables and it's in line with the ordinary locking
> rules.  There are other cases in hugetlb.c where we do pte_same() checks even
> though we are protected from the related races by the instantiation_mutex.
> 
> page_table_lock is actually a bit useless for shared page tables. If shared
> page tables were ever to be a general thing then I think we'd have to
> revisit how PTE update locking is done but I doubt anyone wants to dive
> down that rat-hole.
> 
> For now, I'm going to keep taking it even if strictly speaking it's not
> necessary.

Fair enough

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
  2012-07-20 14:36   ` [PATCH -alternative] " Michal Hocko
@ 2012-07-20 14:51     ` Mel Gorman
  2012-07-23  4:04       ` Hugh Dickins
  2012-07-26 18:31     ` Rik van Riel
  1 sibling, 1 reply; 50+ messages in thread
From: Mel Gorman @ 2012-07-20 14:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Linux-MM, Hugh Dickins, David Gibson, Ken Chen, Cong Wang, LKML

On Fri, Jul 20, 2012 at 04:36:35PM +0200, Michal Hocko wrote:
> And here is my attempt for the fix (Hugh mentioned something similar
> earlier but he suggested using special flags in ptes or VMAs). I still
> owe a doc update and it hasn't been tested with too many configs, so I
> could have missed some definition updates.
> I also think that the changelog could be much better; I will add (steal) the
> full bug description if people think this way is worth going down rather
> than the one suggested by Mel.
> To be honest I am not quite happy with how I had to pollute generic mm code with
> something that is specific to a single architecture.
> Mel hammered it with the test case and it survived.

Tested-by: Mel Gorman <mgorman@suse.de>

This approach looks more or less like what I was expecting. I like that
the trick was applied to the page table page instead of using PTE tricks
or by bodging it with a VMA flag like I was thinking, so kudos for that. I
also prefer this approach to trying to free the page tables on or near
huge_pmd_unshare().

In general I think this patch would execute better than mine because it is
far less heavy-handed but I share your concern that it changes the core MM
quite a bit for a corner case that only one architecture cares about. I am
completely biased of course, but I still prefer my patch because other than
an API change it keeps the bulk of the madness in arch/x86/mm/hugetlbpage.c.
I am also not concerned with the scalability of how quickly we can set up
page table sharing.

Hugh, I'm afraid you get to choose :)

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
  2012-07-20 14:51     ` Mel Gorman
@ 2012-07-23  4:04       ` Hugh Dickins
  2012-07-23 11:40         ` Mel Gorman
                           ` (2 more replies)
  0 siblings, 3 replies; 50+ messages in thread
From: Hugh Dickins @ 2012-07-23  4:04 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Michal Hocko, Linux-MM, David Gibson, Ken Chen, Cong Wang, LKML

On Fri, 20 Jul 2012, Mel Gorman wrote:
> On Fri, Jul 20, 2012 at 04:36:35PM +0200, Michal Hocko wrote:
> > And here is my attempt for the fix (Hugh mentioned something similar
> > earlier but he suggested using special flags in ptes or VMAs). I still
> > owe a doc update and it hasn't been tested with too many configs, so I
> > could have missed some definition updates.
> > I also think that the changelog could be much better; I will add (steal) the
> > full bug description if people think this way is worth going down rather
> > than the one suggested by Mel.
> > To be honest I am not quite happy with how I had to pollute generic mm code with
> > something that is specific to a single architecture.
> > Mel hammered it with the test case and it survived.
> 
> Tested-by: Mel Gorman <mgorman@suse.de>
> 
> This approach looks more or less like what I was expecting. I like that
> the trick was applied to the page table page instead of using PTE tricks
> or by bodging it with a VMA flag like I was thinking, so kudos for that. I
> also prefer this approach to trying to free the page tables on or near
> huge_pmd_unshare().
> 
> In general I think this patch would execute better than mine because it is
> far less heavy-handed but I share your concern that it changes the core MM
> quite a bit for a corner case that only one architecture cares about. I am
> completely biased of course, but I still prefer my patch because other than
> an API change it keeps the bulk of the madness in arch/x86/mm/hugetlbpage.c.
> I am also not concerned with the scalability of how quickly we can set up
> page table sharing.
> 
> Hugh, I'm afraid you get to choose :)

Thank you for bestowing that honour upon me :)  Seriously, though, you
were quite right to Cc me on this, it is one of those areas I ought
to know something about (unlike hugetlb reservations, for example).

Please don't be upset if I say that I don't like either of your patches.
Mainly for obvious reasons - I don't like Mel's because anything with
trylock retries and nested spinlocks worries me before I can even start
to think about it; and I don't like Michal's for the same reason as Mel,
that it spreads more change around in common paths than we would like.

But I didn't spend much time thinking through either of them, they just
seemed more complicated than should be needed.  I cannot confirm or deny
whether they're correct - though I still do not understand how mmap_sem
can help you, Mel.  I can see that it will help in your shmdt()ing test,
but if you leave the area mapped on exit, then mmap_sem is not taken in
the exit_mmap() path, so how does it help?

I spent hours trying to dream up a better patch, trying various
approaches.  I think I have a nice one now, what do you think?  And
more importantly, does it work?  I have not tried to test it at all,
that I'm hoping to leave to you, I'm sure you'll attack it with gusto!

If you like it, please take it over and add your comments and signoff
and send it in.  The second part won't come up in your testing, and could
be made a separate patch if you prefer: it's a related point that struck
me while I was playing with a different approach.

I'm sorely tempted to leave a dangerous pair of eyes off the Cc,
but that too would be unfair.

Subject-to-your-testing-
Signed-off-by: Hugh Dickins <hughd@google.com>
---

 mm/hugetlb.c |   18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

--- v3.5/mm/hugetlb.c	2012-07-21 13:58:29.000000000 -0700
+++ linux/mm/hugetlb.c	2012-07-22 20:28:59.858077817 -0700
@@ -2393,6 +2393,15 @@ void unmap_hugepage_range(struct vm_area
 {
 	mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
 	__unmap_hugepage_range(vma, start, end, ref_page);
+	/*
+	 * Clear this flag so that x86's huge_pmd_share page_table_shareable
+	 * test will fail on a vma being torn down, and not grab a page table
+	 * on its way out.  We're lucky that the flag has such an appropriate
+	 * name, and can in fact be safely cleared here.  We could clear it
+	 * before the __unmap_hugepage_range above, but all that's necessary
+	 * is to clear it before releasing the i_mmap_mutex below.
+	 */
+	vma->vm_flags &= ~VM_MAYSHARE;
 	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
 }
 
@@ -2959,9 +2968,14 @@ void hugetlb_change_protection(struct vm
 		}
 	}
 	spin_unlock(&mm->page_table_lock);
-	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
-
+	/*
+	 * Must flush TLB before releasing i_mmap_mutex: x86's huge_pmd_unshare
+	 * may have cleared our pud entry and done put_page on the page table:
+	 * once we release i_mmap_mutex, another task can do the final put_page
+	 * and that page table be reused and filled with junk.
+	 */
 	flush_tlb_range(vma, start, end);
+	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
 }
 
 int hugetlb_reserve_pages(struct inode *inode,

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
  2012-07-23  4:04       ` Hugh Dickins
@ 2012-07-23 11:40         ` Mel Gorman
  2012-07-24  1:08           ` Hugh Dickins
  2012-07-26 17:42         ` Rik van Riel
  2012-07-26 18:37         ` Rik van Riel
  2 siblings, 1 reply; 50+ messages in thread
From: Mel Gorman @ 2012-07-23 11:40 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Michal Hocko, Linux-MM, David Gibson, Ken Chen, Cong Wang, LKML

On Sun, Jul 22, 2012 at 09:04:33PM -0700, Hugh Dickins wrote:
> On Fri, 20 Jul 2012, Mel Gorman wrote:
> > On Fri, Jul 20, 2012 at 04:36:35PM +0200, Michal Hocko wrote:
> > > And here is my attempt for the fix (Hugh mentioned something similar
> > > earlier but he suggested using special flags in ptes or VMAs). I still
> > > owe a doc update and it hasn't been tested with too many configs, so I
> > > could have missed some definition updates.
> > > I also think that the changelog could be much better; I will add (steal) the
> > > full bug description if people think this way is worth going down rather
> > > than the one suggested by Mel.
> > > To be honest I am not quite happy with how I had to pollute generic mm code with
> > > something that is specific to a single architecture.
> > > Mel hammered it with the test case and it survived.
> > 
> > Tested-by: Mel Gorman <mgorman@suse.de>
> > 
> > This approach looks more or less like what I was expecting. I like that
> > the trick was applied to the page table page instead of using PTE tricks
> > or by bodging it with a VMA flag like I was thinking, so kudos for that. I
> > also prefer this approach to trying to free the page tables on or near
> > huge_pmd_unshare().
> > 
> > In general I think this patch would execute better than mine because it is
> > far less heavy-handed but I share your concern that it changes the core MM
> > quite a bit for a corner case that only one architecture cares about. I am
> > completely biased of course, but I still prefer my patch because other than
> > an API change it keeps the bulk of the madness in arch/x86/mm/hugetlbpage.c.
> > I am also not concerned with the scalability of how quickly we can set up
> > page table sharing.
> > 
> > Hugh, I'm afraid you get to choose :)
> 
> Thank you for bestowing that honour upon me :) 

Just so you know, there was a ceremonial gong when it happened.

> Seriously, though, you
> were quite right to Cc me on this, it is one of those areas I ought
> to know something about (unlike hugetlb reservations, for example).
> 
> Please don't be upset if I say that I don't like either of your patches.

I can live with that :) It would not be the first time we found the right
patch out of dislike for the first one proposed, and getting the fix in is
what's important.

> Mainly for obvious reasons - I don't like Mel's because anything with
> trylock retries and nested spinlocks worries me before I can even start
> to think about it;

That's a reasonable objection. The trylock could be avoided by always
falling through, at the cost of reducing the number of sharing
opportunities, but the nested locking is unavoidable. I agree with you
that nested locking like this should always be a cause for concern.
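
To illustrate (untested sketch), "always falling through" would replace the
unlock-and-retry in the hunk quoted earlier with something like:

		if (!down_read_trylock(&svma->vm_mm->mmap_sem))
			continue;	/* give up on this candidate svma */

losing a sharing opportunity whenever the other mm's mmap_sem happens to be
held, rather than looping until we can examine it.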

> and I don't like Michal's for the same reason as Mel,
> that it spreads more change around in common paths than we would like.
> 
> But I didn't spend much time thinking through either of them, they just
> seemed more complicated than should be needed.  I cannot confirm or deny
> whether they're correct - though I still do not understand how mmap_sem
> can help you, Mel.  I can see that it will help in your shmdt()ing test,
> but if you leave the area mapped on exit, then mmap_sem is not taken in
> the exit_mmap() path, so how does it help?
> 

It certainly helps in the shmdt case which is what the test case focused
on because that is what the application that triggered this bug was
doing. However, you're right in that the exit_mmap() path is still vulnerable
because it does not take mmap_sem. I'll think about that a bit more.
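
For reference, the teardown path in question (3.4's mm/mmap.c) runs without
taking mmap_sem at all:

exit_mmap
  tlb_gather_mmu(&tlb, mm, 1)
  unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL)
  free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0)

so the write-lock taken on the shmdt/do_munmap side never comes into it.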

> I spent hours trying to dream up a better patch, trying various
> approaches.  I think I have a nice one now, what do you think?  And
> more importantly, does it work?  I have not tried to test it at all,
> that I'm hoping to leave to you, I'm sure you'll attack it with gusto!
> 
> If you like it, please take it over and add your comments and signoff
> and send it in. 
> 

I like it in that it's simple and I can confirm it works for the test case
of interest.

However, is your patch not vulnerable to truncate issues?
madvise()/truncate() issues were the main reason why I was wary of VMA tricks
as a solution. As it turns out, madvise(DONTNEED) is not a problem as it is
ignored for hugetlbfs, but I think truncate is still problematic. Let's say
we mmap(MAP_SHARED) a hugetlbfs file and then truncate it for whatever reason.

invalidate_inode_pages2
  invalidate_inode_pages2_range
    unmap_mapping_range_vma
      zap_page_range_single
        unmap_single_vma
	  __unmap_hugepage_range (removes VM_MAYSHARE)

The VMA still exists so the consequences would be varied, but at a minimum
faulting is going to be "interesting".

I think that potentially we could work around this but it may end up stomping
on the core MM and not necessarily be any better than Michal's patch.

> The second part won't come up in your testing, and could
> be made a separate patch if you prefer: it's a related point that struck
> me while I was playing with a different approach.

I'm fine with the second part.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
  2012-07-23 11:40         ` Mel Gorman
@ 2012-07-24  1:08           ` Hugh Dickins
  2012-07-24  8:32             ` Michal Hocko
  2012-07-24  9:34             ` Mel Gorman
  0 siblings, 2 replies; 50+ messages in thread
From: Hugh Dickins @ 2012-07-24  1:08 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Michal Hocko, Linux-MM, David Gibson, Ken Chen, Cong Wang, LKML

On Mon, 23 Jul 2012, Mel Gorman wrote:
> On Sun, Jul 22, 2012 at 09:04:33PM -0700, Hugh Dickins wrote:
> > On Fri, 20 Jul 2012, Mel Gorman wrote:
> > > On Fri, Jul 20, 2012 at 04:36:35PM +0200, Michal Hocko wrote:
> 
> I like it in that it's simple and I can confirm it works for the test case
> of interest.

Phew, I'm glad to hear that, thanks.

> 
> However, is your patch not vulnerable to truncate issues?
> madvise()/truncate() issues were the main reason why I was wary of VMA tricks
> as a solution. As it turns out, madvise(DONTNEED) is not a problem as it is
> ignored for hugetlbfs, but I think truncate is still problematic. Let's say
> we mmap(MAP_SHARED) a hugetlbfs file and then truncate it for whatever reason.
> 
> invalidate_inode_pages2
>   invalidate_inode_pages2_range
>     unmap_mapping_range_vma
>       zap_page_range_single
>         unmap_single_vma
> 	  __unmap_hugepage_range (removes VM_MAYSHARE)
> 
> The VMA still exists so the consequences would be varied, but at a minimum
> faulting is going to be "interesting".

You had me worried there, I hadn't considered truncation or invalidation2
at all.

But actually, I don't think they do pose any problem for my patch.  They
would indeed if I were removing VM_MAYSHARE in __unmap_hugepage_range()
as you show above; but no, I'm removing it in unmap_hugepage_range().

That's only called by unmap_single_vma(): which is called via
unmap_vmas() by unmap_region() or exit_mmap() just before free_pgtables()
(the problem cases); or by madvise_dontneed() via zap_page_range(), which
as you note is disallowed on VM_HUGETLB; or by zap_page_range_single().

zap_page_range_single() is called by zap_vma_ptes(), which is only
allowed on VM_PFNMAP; or by unmap_mapping_range_vma(), which looked
like it was going to deadlock on i_mmap_mutex (with or without my
patch) until I realized that hugetlbfs has its own hugetlbfs_setattr()
and hugetlb_vmtruncate() which don't use unmap_mapping_range() at all.

invalidate_inode_pages2() (and _range()) do use unmap_mapping_range(),
but hugetlbfs doesn't support direct_IO, and otherwise I think they're
called by a filesystem directly on its own inodes, which hugetlbfs
does not.  Anyway, if there's a deadlock on i_mmap_mutex somewhere
in there, it's not introduced by the proposed patch.

So, unmap_hugepage_range() is only being called in the problem cases,
just before free_pgtables(), when unmapping a vma (with mmap_sem held),
or when exiting (when we have the last reference to mm): in each case,
the vma is on its way out, and VM_MAYSHARE no longer of interest to others.
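
For anyone following along, a condensed sketch of the test that now fails
(abridged from 3.4's arch/x86/mm/hugetlbpage.c): once VM_MAYSHARE has been
cleared on the dying vma, the flags comparison rejects it as a sharing
candidate:

	static unsigned long page_table_shareable(struct vm_area_struct *svma,
						  struct vm_area_struct *vma,
						  unsigned long addr, pgoff_t idx)
	{
		...
		unsigned long vm_flags = vma->vm_flags & ~VM_LOCKED;
		unsigned long svm_flags = svma->vm_flags & ~VM_LOCKED;

		if (pmd_index(addr) != pmd_index(saddr) ||
		    vm_flags != svm_flags ||	/* VM_MAYSHARE cleared -> mismatch */
		    sbase < svma->vm_start || svma->vm_end < s_end)
			return 0;

		return saddr;
	}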

I spent a while concerned that I'd overlooked the truncation case, before
realizing that it's not a problem: the issue comes when we free_pgtables(),
which truncation makes no attempt to do.  (Concretely, hugetlbfs truncation
goes hugetlb_vmtruncate -> hugetlb_vmtruncate_list -> __unmap_hugepage_range
and never gets near free_pgtables().)

So, after a bout of anxiety, I think my &= ~VM_MAYSHARE remains good.

Hugh

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
  2012-07-24  1:08           ` Hugh Dickins
@ 2012-07-24  8:32             ` Michal Hocko
  2012-07-24  9:34             ` Mel Gorman
  1 sibling, 0 replies; 50+ messages in thread
From: Michal Hocko @ 2012-07-24  8:32 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Mel Gorman, Linux-MM, David Gibson, Ken Chen, Cong Wang, LKML

On Mon 23-07-12 18:08:05, Hugh Dickins wrote:
> On Mon, 23 Jul 2012, Mel Gorman wrote:
> > On Sun, Jul 22, 2012 at 09:04:33PM -0700, Hugh Dickins wrote:
> > > On Fri, 20 Jul 2012, Mel Gorman wrote:
> > > > On Fri, Jul 20, 2012 at 04:36:35PM +0200, Michal Hocko wrote:
> > 
> > I like it in that it's simple and I can confirm it works for the test case
> > of interest.
> 
> Phew, I'm glad to hear that, thanks.
> 
> > 
> > However, is your patch not vulnerable to truncate issues?
> > madvise()/truncate() issues were the main reason why I was wary of VMA tricks
> > as a solution. As it turns out, madvise(DONTNEED) is not a problem as it is
> > ignored for hugetlbfs, but I think truncate is still problematic. Let's say
> > we mmap(MAP_SHARED) a hugetlbfs file and then truncate it for whatever reason.
> > 
> > invalidate_inode_pages2
> >   invalidate_inode_pages2_range
> >     unmap_mapping_range_vma
> >       zap_page_range_single
> >         unmap_single_vma
> > 	  __unmap_hugepage_range (removes VM_MAYSHARE)
> > 
> > The VMA still exists so the consequences would be varied, but at a minimum
> > faulting is going to be "interesting".
> 
> You had me worried there, I hadn't considered truncation or invalidation2
> at all.
> 
> But actually, I don't think they do pose any problem for my patch.  They
> would indeed if I were removing VM_MAYSHARE in __unmap_hugepage_range()
> as you show above; but no, I'm removing it in unmap_hugepage_range().
> 
> That's only called by unmap_single_vma(): which is called via
> unmap_vmas() by unmap_region() or exit_mmap() just before free_pgtables()
> (the problem cases); or by madvise_dontneed() via zap_page_range(), which
> as you note is disallowed on VM_HUGETLB; or by zap_page_range_single().
> 
> zap_page_range_single() is called by zap_vma_ptes(), which is only
> allowed on VM_PFNMAP; or by unmap_mapping_range_vma(), which looked
> like it was going to deadlock on i_mmap_mutex (with or without my
> patch) until I realized that hugetlbfs has its own hugetlbfs_setattr()
> and hugetlb_vmtruncate() which don't use unmap_mapping_range() at all.
> 
> invalidate_inode_pages2() (and _range()) do use unmap_mapping_range(),
> but hugetlbfs doesn't support direct_IO, and otherwise I think they're
> called by a filesystem directly on its own inodes, which hugetlbfs
> does not.  

Good point, I didn't get this while looking into the code, so I introduced
the `last' parameter, which tells me that I am on the correct path.
Thanks for the clarification.

> Anyway, if there's a deadlock on i_mmap_mutex somewhere in there, it's
> not introduced by the proposed patch.

> So, unmap_hugepage_range() is only being called in the problem cases,
> just before free_pgtables(), when unmapping a vma (with mmap_sem held),
> or when exiting (when we have the last reference to mm): in each case,
> the vma is on its way out, and VM_MAYSHARE no longer of interest to others.
> 
> I spent a while concerned that I'd overlooked the truncation case, before
> realizing that it's not a problem: the issue comes when we free_pgtables(),
> which truncation makes no attempt to do.
> 
> So, after a bout of anxiety, I think my &= ~VM_MAYSHARE remains good.

Yes, this is convincing (and subtle ;)) and much less polluting.
You can add my Reviewed-by (with the above reasoning in the patch
description).

Anyway, the patch for mmotm needs an update because there was a
reorganization in the area. First, we need to revert "hugetlb: avoid
taking i_mmap_mutex in unmap_single_vma() for hugetlb" (80f408f5 in
memcg-devel) and then push your code into unmap_single_vma. All the
above is still valid AFAICS.

> 
> Hugh

Thanks a lot Hugh!
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
  2012-07-24  1:08           ` Hugh Dickins
  2012-07-24  8:32             ` Michal Hocko
@ 2012-07-24  9:34             ` Mel Gorman
  2012-07-24 10:04               ` Michal Hocko
  2012-07-24 19:23               ` Hugh Dickins
  1 sibling, 2 replies; 50+ messages in thread
From: Mel Gorman @ 2012-07-24  9:34 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Michal Hocko, Linux-MM, David Gibson, Ken Chen, Cong Wang, LKML

On Mon, Jul 23, 2012 at 06:08:05PM -0700, Hugh Dickins wrote:
> On Mon, 23 Jul 2012, Mel Gorman wrote:
> > On Sun, Jul 22, 2012 at 09:04:33PM -0700, Hugh Dickins wrote:
> > > On Fri, 20 Jul 2012, Mel Gorman wrote:
> > > > On Fri, Jul 20, 2012 at 04:36:35PM +0200, Michal Hocko wrote:
> > 
> > I like it in that it's simple and I can confirm it works for the test case
> > of interest.
> 
> Phew, I'm glad to hear that, thanks.
> 
> > 
> > However, is your patch not vulnerable to truncate issues?
> > madvise()/truncate() issues were the main reason why I was wary of VMA tricks
> > as a solution. As it turns out, madvise(DONTNEED) is not a problem as it is
> > ignored for hugetlbfs, but I think truncate is still problematic. Let's say
> > we mmap(MAP_SHARED) a hugetlbfs file and then truncate it for whatever reason.
> > 
> > invalidate_inode_pages2
> >   invalidate_inode_pages2_range
> >     unmap_mapping_range_vma
> >       zap_page_range_single
> >         unmap_single_vma
> > 	  __unmap_hugepage_range (removes VM_MAYSHARE)
> > 
> > The VMA still exists so the consequences would be varied, but at a minimum
> > faulting is going to be "interesting".
> 
> You had me worried there, I hadn't considered truncation or invalidation2
> at all.
> 
> But actually, I don't think they do pose any problem for my patch.  They
> would indeed if I were removing VM_MAYSHARE in __unmap_hugepage_range()
> as you show above; but no, I'm removing it in unmap_hugepage_range().
> 

True, I hadn't considered the distinction to be relevant. My train of
thought was "we potentially end up calling these functions and affect
VMAs that are not being torn down". This set off an alarm but I didn't
follow through properly.

> That's only called by unmap_single_vma(): which is called via
> unmap_vmas() by unmap_region() or exit_mmap() just before free_pgtables()
> (the problem cases); or by madvise_dontneed() via zap_page_range(), which
> as you note is disallowed on VM_HUGETLB; or by zap_page_range_single().
> 

The madvise(DONTNEED) case also made me worry that if this fix worked, it
worked by coincidence. If someone "fixes" hugetlbfs to support DONTNEED
(which would be nonsensical) or direct IO (which would be weird) then
this fix would potentially regress.

> zap_page_range_single() is called by zap_vma_ptes(), which is only
> allowed on VM_PFNMAP; or by unmap_mapping_range_vma(), which looked
> like it was going to deadlock on i_mmap_mutex (with or without my
> patch) until I realized that hugetlbfs has its own hugetlbfs_setattr()
> and hugetlb_vmtruncate() which don't use unmap_mapping_range() at all.
> 

Yep, it happens to work.

> invalidate_inode_pages2() (and _range()) do use unmap_mapping_range(),
> but hugetlbfs doesn't support direct_IO, and otherwise I think they're
> called by a filesystem directly on its own inodes, which hugetlbfs
> does not. Anyway, if there's a deadlock on i_mmap_mutex somewhere
> in there, it's not introduced by the proposed patch.
> 
> So, unmap_hugepage_range() is only being called in the problem cases,
> just before free_pgtables(), when unmapping a vma (with mmap_sem held),
> or when exiting (when we have the last reference to mm): in each case,
> the vma is on its way out, and VM_MAYSHARE no longer of interest to others.
> 
> I spent a while concerned that I'd overlooked the truncation case, before
> realizing that it's not a problem: the issue comes when we free_pgtables(),
> which truncation makes no attempt to do.
> 
> So, after a bout of anxiety, I think my &= ~VM_MAYSHARE remains good.
> 

I agree with you. When I was thinking about the potential problems, I was
thinking of them in the general context of the core VM and what we normally
take into account.

I confess that I really find this working-by-coincidence very icky and am
uncomfortable with it, but your patch is the only one that confines the
mess to hugetlbfs. I fixed exit_mmap() for my version but only by changing
the core to introduce exit_vmas() to take mmap_sem for write if a hugetlb
VMA is found, so I also affected the core.

So, let's go with your patch but with all this documented! I stuck a
changelog and an additional comment onto your patch and this is the end
result.

Do you want to pick this up and send it to Andrew or will I?

Thanks Hugh!

---8<---
mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables

If a process creates a large hugetlbfs mapping that is eligible for page
table sharing and forks heavily with children some of whom fault and
others which destroy the mapping then it is possible for page tables to
get corrupted. Some teardowns of the mapping encounter a "bad pmd" and
output a message to the kernel log. The final teardown will trigger a
BUG_ON in mm/filemap.c.

This was reproduced in 3.4 but is known to have existed for a long time
and goes back at least as far as 2.6.37. It was probably introduced in
2.6.20 by [39dde65c: shared page table for hugetlb page]. The messages
look like this:

[  ..........] Lots of bad pmd messages followed by this
[  127.164256] mm/memory.c:391: bad pmd ffff880412e04fe8(80000003de4000e7).
[  127.164257] mm/memory.c:391: bad pmd ffff880412e04ff0(80000003de6000e7).
[  127.164258] mm/memory.c:391: bad pmd ffff880412e04ff8(80000003de0000e7).
[  127.186778] ------------[ cut here ]------------
[  127.186781] kernel BUG at mm/filemap.c:134!
[  127.186782] invalid opcode: 0000 [#1] SMP
[  127.186783] CPU 7
[  127.186784] Modules linked in: af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf ext3 jbd dm_mod coretemp crc32c_intel usb_storage ghash_clmulni_intel aesni_intel i2c_i801 r8169 mii uas sr_mod cdrom sg iTCO_wdt iTCO_vendor_support shpchp serio_raw cryptd aes_x86_64 e1000e pci_hotplug dcdbas aes_generic container microcode ext4 mbcache jbd2 crc16 sd_mod crc_t10dif i915 drm_kms_helper drm i2c_algo_bit ehci_hcd ahci libahci usbcore rtc_cmos usb_common button i2c_core intel_agp video intel_gtt fan processor thermal thermal_sys hwmon ata_generic pata_atiixp libata scsi_mod
[  127.186801]
[  127.186802] Pid: 9017, comm: hugetlbfs-test Not tainted 3.4.0-autobuild #53 Dell Inc. OptiPlex 990/06D7TR
[  127.186804] RIP: 0010:[<ffffffff810ed6ce>]  [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
[  127.186809] RSP: 0000:ffff8804144b5c08  EFLAGS: 00010002
[  127.186810] RAX: 0000000000000001 RBX: ffffea000a5c9000 RCX: 00000000ffffffc0
[  127.186811] RDX: 0000000000000000 RSI: 0000000000000009 RDI: ffff88042dfdad00
[  127.186812] RBP: ffff8804144b5c18 R08: 0000000000000009 R09: 0000000000000003
[  127.186813] R10: 0000000000000000 R11: 000000000000002d R12: ffff880412ff83d8
[  127.186814] R13: ffff880412ff83d8 R14: 0000000000000000 R15: ffff880412ff83d8
[  127.186815] FS:  00007fe18ed2c700(0000) GS:ffff88042dce0000(0000) knlGS:0000000000000000
[  127.186816] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  127.186817] CR2: 00007fe340000503 CR3: 0000000417a14000 CR4: 00000000000407e0
[  127.186818] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  127.186819] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  127.186820] Process hugetlbfs-test (pid: 9017, threadinfo ffff8804144b4000, task ffff880417f803c0)
[  127.186821] Stack:
[  127.186822]  ffffea000a5c9000 0000000000000000 ffff8804144b5c48 ffffffff810ed83b
[  127.186824]  ffff8804144b5c48 000000000000138a 0000000000001387 ffff8804144b5c98
[  127.186825]  ffff8804144b5d48 ffffffff811bc925 ffff8804144b5cb8 0000000000000000
[  127.186827] Call Trace:
[  127.186829]  [<ffffffff810ed83b>] delete_from_page_cache+0x3b/0x80
[  127.186832]  [<ffffffff811bc925>] truncate_hugepages+0x115/0x220
[  127.186834]  [<ffffffff811bca43>] hugetlbfs_evict_inode+0x13/0x30
[  127.186837]  [<ffffffff811655c7>] evict+0xa7/0x1b0
[  127.186839]  [<ffffffff811657a3>] iput_final+0xd3/0x1f0
[  127.186840]  [<ffffffff811658f9>] iput+0x39/0x50
[  127.186842]  [<ffffffff81162708>] d_kill+0xf8/0x130
[  127.186843]  [<ffffffff81162812>] dput+0xd2/0x1a0
[  127.186845]  [<ffffffff8114e2d0>] __fput+0x170/0x230
[  127.186848]  [<ffffffff81236e0e>] ? rb_erase+0xce/0x150
[  127.186849]  [<ffffffff8114e3ad>] fput+0x1d/0x30
[  127.186851]  [<ffffffff81117db7>] remove_vma+0x37/0x80
[  127.186853]  [<ffffffff81119182>] do_munmap+0x2d2/0x360
[  127.186855]  [<ffffffff811cc639>] sys_shmdt+0xc9/0x170
[  127.186857]  [<ffffffff81410a39>] system_call_fastpath+0x16/0x1b
[  127.186858] Code: 0f 1f 44 00 00 48 8b 43 08 48 8b 00 48 8b 40 28 8b b0 40 03 00 00 85 f6 0f 88 df fe ff ff 48 89 df e8 e7 cb 05 00 e9 d2 fe ff ff <0f> 0b 55 83 e2 fd 48 89 e5 48 83 ec 30 48 89 5d d8 4c 89 65 e0
[  127.186868] RIP  [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
[  127.186870]  RSP <ffff8804144b5c08>
[  127.186871] ---[ end trace 7cbac5d1db69f426 ]---

The bug is a race and not always easy to reproduce. To reproduce it I was
doing the following on a single socket I7-based machine with 16G of RAM.

$ hugeadm --pool-pages-max DEFAULT:13G
$ echo $((18*1048576*1024)) > /proc/sys/kernel/shmmax
$ echo $((18*1048576*1024)) > /proc/sys/kernel/shmall
$ for i in `seq 1 9000`; do ./hugetlbfs-test; done

On my particular machine, it usually triggers within 10 minutes but enabling
debug options can change the timing such that it never hits. Once the bug is
triggered, the machine is in trouble and needs to be rebooted. The machine
will respond but processes accessing proc like "ps aux" will hang due to
the BUG_ON. shutdown will also hang and needs a hard reset or a sysrq-b.

The basic problem is a race between page table sharing and teardown. For
the most part page table sharing depends on i_mmap_mutex. In some cases,
the code also takes the mm->page_table_lock for the PTE updates, but with
shared page tables it is the i_mmap_mutex that is more important.

Unfortunately, it too appears to be insufficient. Consider the following
situation

Process A					Process B
---------					---------
hugetlb_fault					shmdt
  						LockWrite(mmap_sem)
    						  do_munmap
						    unmap_region
						      unmap_vmas
						        unmap_single_vma
						          unmap_hugepage_range
      						            Lock(i_mmap_mutex)
							    Lock(mm->page_table_lock)
							    huge_pmd_unshare/unmap tables <--- (1)
							    Unlock(mm->page_table_lock)
      						            Unlock(i_mmap_mutex)
  huge_pte_alloc				      ...
    Lock(i_mmap_mutex)				      ...
    vma_prio_walk, find svma, spte		      ...
    Lock(mm->page_table_lock)			      ...
    share spte					      ...
    Unlock(mm->page_table_lock)			      ...
    Unlock(i_mmap_mutex)			      ...
  hugetlb_no_page									  <--- (2)
						      free_pgtables
						        unlink_file_vma
							hugetlb_free_pgd_range
						    remove_vma_list

In this scenario, it is possible for Process A to share page tables with
Process B that is trying to tear them down.  The i_mmap_mutex on its own
does not prevent Process A walking Process B's page tables. At (1) above,
the page tables are not shared yet so it unmaps the PMDs. Process A sets
up page table sharing and at (2) faults a new entry. Process B then trips
up on it in free_pgtables.

This patch fixes the problem by clearing VM_MAYSHARE during
unmap_hugepage_range() under the i_mmap_mutex. This makes the VMA
ineligible for sharing and avoids the race. Superficially this looks
like it would then be vulnerable to truncate and madvise problems but
this is avoided by the limitations of hugetlbfs.

madvise and truncate would be problems if VM_MAYSHARE were removed in
__unmap_hugepage_range(), but it is removed in unmap_hugepage_range().
This is only called by unmap_single_vma(): which is called via unmap_vmas()
by unmap_region() or exit_mmap() just before free_pgtables() (the problem
cases); or by madvise_dontneed() via zap_page_range(), which is disallowed
on VM_HUGETLB; or by zap_page_range_single().

zap_page_range_single() is called by zap_vma_ptes(), which is only allowed
on VM_PFNMAP; or by unmap_mapping_range_vma(), which looked like it was
going to deadlock on i_mmap_mutex (with or without my patch) but does
not as hugetlbfs has its own hugetlbfs_setattr() and hugetlb_vmtruncate()
which don't use unmap_mapping_range() at all.

invalidate_inode_pages2() (and _range()) do use unmap_mapping_range(),
but hugetlbfs doesn't support direct_IO, and otherwise they're called by a
filesystem directly on its own inodes, which hugetlbfs does not.  If there's
a deadlock on i_mmap_mutex somewhere in there, it's not introduced by the
proposed patch.

This should be treated as a -stable candidate if it is merged.

Test program is as follows. The test case was mostly written by Michal
Hocko with a few minor changes to reproduce this bug.

==== CUT HERE ====

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/time.h>
#include <sys/wait.h>

static size_t huge_page_size = (2UL << 20);
static size_t nr_huge_page_A = 512;
static size_t nr_huge_page_B = 5632;

unsigned int get_random(unsigned int max)
{
	struct timeval tv;

	gettimeofday(&tv, NULL);
	srandom(tv.tv_usec);
	return random() % max;
}

static void play(void *addr, size_t size)
{
	unsigned char *start = addr,
		      *end = start + size,
		      *a;
	start += get_random(size/2);

	/* we could iterate on huge pages but let's give it more time. */
	for (a = start; a < end; a += 4096)
		*a = 0;
}

int main(int argc, char **argv)
{
	key_t key = IPC_PRIVATE;
	size_t sizeA = nr_huge_page_A * huge_page_size;
	size_t sizeB = nr_huge_page_B * huge_page_size;
	int shmidA, shmidB;
	void *addrA = NULL, *addrB = NULL;
	int nr_children = 300, n = 0;

	if ((shmidA = shmget(key, sizeA, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
		perror("shmget:");
		return 1;
	}

	if ((addrA = shmat(shmidA, addrA, SHM_R|SHM_W)) == (void *)-1UL) {
		perror("shmat");
		return 1;
	}
	if ((shmidB = shmget(key, sizeB, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
		perror("shmget:");
		return 1;
	}

	if ((addrB = shmat(shmidB, addrB, SHM_R|SHM_W)) == (void *)-1UL) {
		perror("shmat");
		return 1;
	}

fork_child:
	switch(fork()) {
		case 0:
			switch (n%3) {
			case 0:
				play(addrA, sizeA);
				break;
			case 1:
				play(addrB, sizeB);
				break;
			case 2:
				break;
			}
			break;
		case -1:
			perror("fork:");
			break;
		default:
			if (++n < nr_children)
				goto fork_child;
			play(addrA, sizeA);
			break;
	}
	shmdt(addrA);
	shmdt(addrB);
	do {
		wait(NULL);
	} while (--n > 0);
	shmctl(shmidA, IPC_RMID, NULL);
	shmctl(shmidB, IPC_RMID, NULL);
	return 0;
}

Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Mel Gorman <mgorman@suse.de>
---
 mm/hugetlb.c |   25 +++++++++++++++++++++++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ae8f708..d488476 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2383,6 +2383,22 @@ void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
 {
 	mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
 	__unmap_hugepage_range(vma, start, end, ref_page);
+	/*
+	 * Clear this flag so that x86's huge_pmd_share page_table_shareable
+	 * test will fail on a vma being torn down, and not grab a page table
+	 * on its way out.  We're lucky that the flag has such an appropriate
+	 * name, and can in fact be safely cleared here. We could clear it
+	 * before the __unmap_hugepage_range above, but all that's necessary
+	 * is to clear it before releasing the i_mmap_mutex below.
+	 *
+	 * This works because in the contexts this is called, the VMA is
+	 * going to be destroyed. It is not vulnerable to madvise(DONTNEED)
+	 * because madvise is not supported on hugetlbfs. The same applies
+	 * for direct IO. unmap_hugepage_range() is only being called just
+	 * before free_pgtables() so clearing VM_MAYSHARE will not cause
+	 * surprises later.
+	 */
+	vma->vm_flags &= ~VM_MAYSHARE;
 	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
 }
 
@@ -2949,9 +2965,14 @@ void hugetlb_change_protection(struct vm_area_struct *vma,
 		}
 	}
 	spin_unlock(&mm->page_table_lock);
-	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
-
+	/*
+	 * Must flush TLB before releasing i_mmap_mutex: x86's huge_pmd_unshare
+	 * may have cleared our pud entry and done put_page on the page table:
+	 * once we release i_mmap_mutex, another task can do the final put_page
+	 * and that page table be reused and filled with junk.
+	 */
 	flush_tlb_range(vma, start, end);
+	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
 }
 
 int hugetlb_reserve_pages(struct inode *inode,

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
  2012-07-24  9:34             ` Mel Gorman
@ 2012-07-24 10:04               ` Michal Hocko
  2012-07-24 19:23               ` Hugh Dickins
  1 sibling, 0 replies; 50+ messages in thread
From: Michal Hocko @ 2012-07-24 10:04 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Hugh Dickins, Linux-MM, David Gibson, Ken Chen, Cong Wang, LKML

On Tue 24-07-12 10:34:06, Mel Gorman wrote:
[...]
> ---8<---
> mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables
> 
> If a process creates a large hugetlbfs mapping that is eligible for page
> table sharing and forks heavily with children some of whom fault and
> others which destroy the mapping then it is possible for page tables to
> get corrupted. Some teardowns of the mapping encounter a "bad pmd" and
> output a message to the kernel log. The final teardown will trigger a
> BUG_ON in mm/filemap.c.
> 
> This was reproduced in 3.4 but is known to have existed for a long time
> and goes back at least as far as 2.6.37. It was probably introduced in
> 2.6.20 by [39dde65c: shared page table for hugetlb page]. The messages
> look like this:
> 
> [  ..........] Lots of bad pmd messages followed by this
> [  127.164256] mm/memory.c:391: bad pmd ffff880412e04fe8(80000003de4000e7).
> [  127.164257] mm/memory.c:391: bad pmd ffff880412e04ff0(80000003de6000e7).
> [  127.164258] mm/memory.c:391: bad pmd ffff880412e04ff8(80000003de0000e7).
> [  127.186778] ------------[ cut here ]------------
> [  127.186781] kernel BUG at mm/filemap.c:134!
> [  127.186782] invalid opcode: 0000 [#1] SMP
> [  127.186783] CPU 7
> [  127.186784] Modules linked in: af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf ext3 jbd dm_mod coretemp crc32c_intel usb_storage ghash_clmulni_intel aesni_intel i2c_i801 r8169 mii uas sr_mod cdrom sg iTCO_wdt iTCO_vendor_support shpchp serio_raw cryptd aes_x86_64 e1000e pci_hotplug dcdbas aes_generic container microcode ext4 mbcache jbd2 crc16 sd_mod crc_t10dif i915 drm_kms_helper drm i2c_algo_bit ehci_hcd ahci libahci usbcore rtc_cmos usb_common button i2c_core intel_agp video intel_gtt fan processor thermal thermal_sys hwmon ata_generic pata_atiixp libata scsi_mod
> [  127.186801]
> [  127.186802] Pid: 9017, comm: hugetlbfs-test Not tainted 3.4.0-autobuild #53 Dell Inc. OptiPlex 990/06D7TR
> [  127.186804] RIP: 0010:[<ffffffff810ed6ce>]  [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
> [  127.186809] RSP: 0000:ffff8804144b5c08  EFLAGS: 00010002
> [  127.186810] RAX: 0000000000000001 RBX: ffffea000a5c9000 RCX: 00000000ffffffc0
> [  127.186811] RDX: 0000000000000000 RSI: 0000000000000009 RDI: ffff88042dfdad00
> [  127.186812] RBP: ffff8804144b5c18 R08: 0000000000000009 R09: 0000000000000003
> [  127.186813] R10: 0000000000000000 R11: 000000000000002d R12: ffff880412ff83d8
> [  127.186814] R13: ffff880412ff83d8 R14: 0000000000000000 R15: ffff880412ff83d8
> [  127.186815] FS:  00007fe18ed2c700(0000) GS:ffff88042dce0000(0000) knlGS:0000000000000000
> [  127.186816] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [  127.186817] CR2: 00007fe340000503 CR3: 0000000417a14000 CR4: 00000000000407e0
> [  127.186818] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  127.186819] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [  127.186820] Process hugetlbfs-test (pid: 9017, threadinfo ffff8804144b4000, task ffff880417f803c0)
> [  127.186821] Stack:
> [  127.186822]  ffffea000a5c9000 0000000000000000 ffff8804144b5c48 ffffffff810ed83b
> [  127.186824]  ffff8804144b5c48 000000000000138a 0000000000001387 ffff8804144b5c98
> [  127.186825]  ffff8804144b5d48 ffffffff811bc925 ffff8804144b5cb8 0000000000000000
> [  127.186827] Call Trace:
> [  127.186829]  [<ffffffff810ed83b>] delete_from_page_cache+0x3b/0x80
> [  127.186832]  [<ffffffff811bc925>] truncate_hugepages+0x115/0x220
> [  127.186834]  [<ffffffff811bca43>] hugetlbfs_evict_inode+0x13/0x30
> [  127.186837]  [<ffffffff811655c7>] evict+0xa7/0x1b0
> [  127.186839]  [<ffffffff811657a3>] iput_final+0xd3/0x1f0
> [  127.186840]  [<ffffffff811658f9>] iput+0x39/0x50
> [  127.186842]  [<ffffffff81162708>] d_kill+0xf8/0x130
> [  127.186843]  [<ffffffff81162812>] dput+0xd2/0x1a0
> [  127.186845]  [<ffffffff8114e2d0>] __fput+0x170/0x230
> [  127.186848]  [<ffffffff81236e0e>] ? rb_erase+0xce/0x150
> [  127.186849]  [<ffffffff8114e3ad>] fput+0x1d/0x30
> [  127.186851]  [<ffffffff81117db7>] remove_vma+0x37/0x80
> [  127.186853]  [<ffffffff81119182>] do_munmap+0x2d2/0x360
> [  127.186855]  [<ffffffff811cc639>] sys_shmdt+0xc9/0x170
> [  127.186857]  [<ffffffff81410a39>] system_call_fastpath+0x16/0x1b
> [  127.186858] Code: 0f 1f 44 00 00 48 8b 43 08 48 8b 00 48 8b 40 28 8b b0 40 03 00 00 85 f6 0f 88 df fe ff ff 48 89 df e8 e7 cb 05 00 e9 d2 fe ff ff <0f> 0b 55 83 e2 fd 48 89 e5 48 83 ec 30 48 89 5d d8 4c 89 65 e0
> [  127.186868] RIP  [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
> [  127.186870]  RSP <ffff8804144b5c08>
> [  127.186871] ---[ end trace 7cbac5d1db69f426 ]---
> 
> The bug is a race and not always easy to reproduce. To reproduce it I was
> doing the following on a single socket i7-based machine with 16G of RAM.
> 
> $ hugeadm --pool-pages-max DEFAULT:13G
> $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmmax
> $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmall
> $ for i in `seq 1 9000`; do ./hugetlbfs-test; done
> 
> On my particular machine, it usually triggers within 10 minutes but enabling
> debug options can change the timing such that it never hits. Once the bug is
> triggered, the machine is in trouble and needs to be rebooted. The machine
> will respond but processes accessing proc like "ps aux" will hang due to
> the BUG_ON. shutdown will also hang and needs a hard reset or a sysrq-b.
> 
> The basic problem is a race between page table sharing and teardown. For
> the most part page table sharing depends on i_mmap_mutex. In some cases,
> it is also taking the mm->page_table_lock for the PTE updates but with
> shared page tables, it is the i_mmap_mutex that is more important.
> 
> Unfortunately it appears to also be insufficient. Consider the following
> situation:
> 
> Process A					Process B
> ---------					---------
> hugetlb_fault					shmdt
>   						LockWrite(mmap_sem)
>     						  do_munmap
> 						    unmap_region
> 						      unmap_vmas
> 						        unmap_single_vma
> 						          unmap_hugepage_range
>       						            Lock(i_mmap_mutex)
> 							    Lock(mm->page_table_lock)
> 							    huge_pmd_unshare/unmap tables <--- (1)
> 							    Unlock(mm->page_table_lock)
>       						            Unlock(i_mmap_mutex)
>   huge_pte_alloc				      ...
>     Lock(i_mmap_mutex)				      ...
>     vma_prio_walk, find svma, spte		      ...
>     Lock(mm->page_table_lock)			      ...
>     share spte					      ...
>     Unlock(mm->page_table_lock)			      ...
>     Unlock(i_mmap_mutex)			      ...
>   hugetlb_no_page									  <--- (2)
> 						      free_pgtables
> 						        unlink_file_vma
> 							hugetlb_free_pgd_range
> 						    remove_vma_list
> 
> In this scenario, it is possible for Process A to share page tables with
> Process B that is trying to tear them down.  The i_mmap_mutex on its own
> does not prevent Process A walking Process B's page tables. At (1) above,
> the page tables are not shared yet so it unmaps the PMDs. Process A sets
> up page table sharing and at (2) faults a new entry. Process B then trips
> up on it in free_pgtables.
> 
> This patch fixes the problem by clearing VM_MAYSHARE during
> unmap_hugepage_range() under the i_mmap_mutex. This makes the VMA
> ineligible for sharing and avoids the race. Superficially this looks
> like it would then be vulnerable to truncate and madvise problems but
> this is avoided by the limitations of hugetlbfs.
> 
> madvise and truncate would be problems if removing VM_MAYSHARE in
> __unmap_hugepage_range() but it is removed in unmap_hugepage_range().
> This is only called by unmap_single_vma(): which is called via unmap_vmas()
> by unmap_region() or exit_mmap() just before free_pgtables() (the problem
> cases); or by madvise_dontneed() via zap_page_range(), which is disallowed
> on VM_HUGETLB; or by zap_page_range_single().
> 
> zap_page_range_single() is called by zap_vma_ptes(), which is only allowed
> on VM_PFNMAP; or by unmap_mapping_range_vma(), which looked like it was
> going to deadlock on i_mmap_mutex (with or without my patch) but does
> not as hugetlbfs has its own hugetlbfs_setattr() and hugetlb_vmtruncate()
> which don't use unmap_mapping_range() at all.
> 
> invalidate_inode_pages2() (and _range()) do use unmap_mapping_range(),
> but hugetlbfs doesn't support direct_IO, and otherwise they're called by a
> filesystem directly on its own inodes, which hugetlbfs does not.  If there's
> a deadlock on i_mmap_mutex somewhere in there, it's not introduced by the
> proposed patch.
> 
> This should be treated as a -stable candidate if it is merged.
> 
> Test program is as follows. The test case was mostly written by Michal
> Hocko with a few minor changes to reproduce this bug.
> 
> ==== CUT HERE ====
> 
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <sys/types.h>
> #include <sys/time.h>
> #include <sys/ipc.h>
> #include <sys/shm.h>
> #include <sys/wait.h>
> 
> #ifndef SHM_HUGETLB
> #define SHM_HUGETLB 04000	/* segment will use huge TLB pages */
> #endif
> 
> static size_t huge_page_size = (2UL << 20);
> static size_t nr_huge_page_A = 512;
> static size_t nr_huge_page_B = 5632;
> 
> unsigned int get_random(unsigned int max)
> {
> 	struct timeval tv;
> 
> 	gettimeofday(&tv, NULL);
> 	srandom(tv.tv_usec);
> 	return random() % max;
> }
> 
> static void play(void *addr, size_t size)
> {
> 	unsigned char *start = addr,
> 		      *end = start + size,
> 		      *a;
> 	start += get_random(size/2);
> 
> 	/* we could iterate on huge pages but let's give it more time. */
> 	for (a = start; a < end; a += 4096)
> 		*a = 0;
> }
> 
> int main(int argc, char **argv)
> {
> 	key_t key = IPC_PRIVATE;
> 	size_t sizeA = nr_huge_page_A * huge_page_size;
> 	size_t sizeB = nr_huge_page_B * huge_page_size;
> 	int shmidA, shmidB;
> 	void *addrA = NULL, *addrB = NULL;
> 	int nr_children = 300, n = 0;
> 
> 	if ((shmidA = shmget(key, sizeA, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
> 		perror("shmget:");
> 		return 1;
> 	}
> 
> 	if ((addrA = shmat(shmidA, addrA, SHM_R|SHM_W)) == (void *)-1UL) {
> 		perror("shmat");
> 		return 1;
> 	}
> 	if ((shmidB = shmget(key, sizeB, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
> 		perror("shmget:");
> 		return 1;
> 	}
> 
> 	if ((addrB = shmat(shmidB, addrB, SHM_R|SHM_W)) == (void *)-1UL) {
> 		perror("shmat");
> 		return 1;
> 	}
> 
> fork_child:
> 	switch(fork()) {
> 		case 0:
> 			switch (n%3) {
> 			case 0:
> 				play(addrA, sizeA);
> 				break;
> 			case 1:
> 				play(addrB, sizeB);
> 				break;
> 			case 2:
> 				break;
> 			}
> 			break;
> 		case -1:
> 			perror("fork:");
> 			break;
> 		default:
> 			if (++n < nr_children)
> 				goto fork_child;
> 			play(addrA, sizeA);
> 			break;
> 	}
> 	shmdt(addrA);
> 	shmdt(addrB);
> 	do {
> 		wait(NULL);
> 	} while (--n > 0);
> 	shmctl(shmidA, IPC_RMID, NULL);
> 	shmctl(shmidB, IPC_RMID, NULL);
> 	return 0;
> }
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Reviewed-by: Michal Hocko <mhocko@suse.cz>
> Reviewed-by: Mel Gorman <mgorman@suse.de>

Thanks a lot! The patch and the description look good and are really helpful!

> ---
>  mm/hugetlb.c |   25 +++++++++++++++++++++++--
>  1 file changed, 23 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index ae8f708..d488476 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2383,6 +2383,22 @@ void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
>  {
>  	mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
>  	__unmap_hugepage_range(vma, start, end, ref_page);
> +	/*
> +	 * Clear this flag so that x86's huge_pmd_share page_table_shareable
> +	 * test will fail on a vma being torn down, and not grab a page table
> +	 * on its way out.  We're lucky that the flag has such an appropriate
> +	 * name, and can in fact be safely cleared here. We could clear it
> +	 * before the __unmap_hugepage_range above, but all that's necessary
> +	 * is to clear it before releasing the i_mmap_mutex below.
> +	 *
> +	 * This works because in the contexts this is called, the VMA is
> +	 * going to be destroyed. It is not vulnerable to madvise(DONTNEED)
> +	 * because madvise is not supported on hugetlbfs. The same applies
> +	 * for direct IO. unmap_hugepage_range() is only being called just
> +	 * before free_pgtables() so clearing VM_MAYSHARE will not cause
> +	 * surprises later.
> +	 */
> +	vma->vm_flags &= ~VM_MAYSHARE;
>  	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
>  }
>  
> @@ -2949,9 +2965,14 @@ void hugetlb_change_protection(struct vm_area_struct *vma,
>  		}
>  	}
>  	spin_unlock(&mm->page_table_lock);
> -	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
> -
> +	/*
> +	 * Must flush TLB before releasing i_mmap_mutex: x86's huge_pmd_unshare
> +	 * may have cleared our pud entry and done put_page on the page table:
> +	 * once we release i_mmap_mutex, another task can do the final put_page
> +	 * and that page table be reused and filled with junk.
> +	 */
>  	flush_tlb_range(vma, start, end);
> +	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
>  }
>  
>  int hugetlb_reserve_pages(struct inode *inode,

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
  2012-07-24  9:34             ` Mel Gorman
  2012-07-24 10:04               ` Michal Hocko
@ 2012-07-24 19:23               ` Hugh Dickins
  2012-07-25  8:36                 ` Mel Gorman
  1 sibling, 1 reply; 50+ messages in thread
From: Hugh Dickins @ 2012-07-24 19:23 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Michal Hocko, Linux-MM, David Gibson, Ken Chen, Cong Wang,
	Aneesh Kumar K.V, LKML

On Tue, 24 Jul 2012, Mel Gorman wrote:
> On Mon, Jul 23, 2012 at 06:08:05PM -0700, Hugh Dickins wrote:
> > 
> > So, after a bout of anxiety, I think my &= ~VM_MAYSHARE remains good.
> > 
> 
> I agree with you. When I was thinking about the potential problems, I was
> thinking of them in the general context of the core VM and what we normally
> take into account.
> 
> I confess that I really find this working-by-coincidence very icky and am
> uncomfortable with it but your patch is the only patch that contains the
> mess to hugetlbfs. I fixed exit_mmap() for my version but only by changing
> the core to introduce exit_vmas() to take mmap_sem for write if a hugetlb
> VMA is found so I also affected the core.

"icky" is not quite the word I'd use, but yes, it feels like you only
have to dislodge a stone somewhere at the other end of the kernel,
and the whole lot would come tumbling down.

If I could think of a suitable VM_BUG_ON to insert next to the ~VM_MAYSHARE,
I would: to warn us when assumptions change.  If we were prepared to waste
another vm_flag on it (and just because there's now a type which lets them
expand does not mean we can be profligate with them), then you can imagine
a VM_GOINGAWAY flag set in unmap_region() and exit_mmap(), and we key off
that instead; or something of that kind.
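
Roughly, I imagine something like the sketch below (entirely hypothetical:
the flag value and the hook points are made up, and this is not the fix
being proposed):

#define VM_GOINGAWAY	0x80000000	/* hypothetical: would need a free bit */

/* called from unmap_region()/exit_mmap() before teardown begins */
static void mark_vmas_going_away(struct mm_struct *mm)
{
	struct vm_area_struct *vma;

	for (vma = mm->mmap; vma; vma = vma->vm_next)
		vma->vm_flags |= VM_GOINGAWAY;
}

page_table_shareable() would then refuse any vma with VM_GOINGAWAY set,
instead of relying on VM_MAYSHARE having been cleared.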

But I'm afraid I see that as TODO-list material: the one-liner is pretty
good for stable backporting, and I felt smiled-upon when it turned out to
be workable (and not even needing a change in arch/x86/mm, that really
surprised me).  It seems ungrateful not to seize the simple fix it offers,
which I found much easier to understand than the alternatives.

> 
> So, let's go with your patch but with all this documented! I stuck a
> changelog and an additional comment onto your patch and this is the end
> result.

Okay, thanks.  (I think you've copied rather more of my previous mail
into the commit description than it deserves, but it looks like you
like more words where I like less!)

> 
> Do you want to pick this up and send it to Andrew or will I?

Oh, please change your Reviewed-by to Signed-off-by: almost all of the
work and description comes from you and Michal; then please, you send it
in to Andrew - sorry, I really need to turn my attention to other things.

But I hadn't realized how messy it's going to be: I was concentrating on
3.5, not the mmotm tree, which as Michal points out is fairly different.
Yes, it definitely needs to revert to holding i_mmap_mutex when unsharing,
it was a mistake to have removed that (what stabilizes the page_count 1
check in huge_pmd_unshare? only i_mmap_mutex).
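
For reference, the check I mean is the one in huge_pmd_unshare(), roughly
as follows (condensed from memory of 3.4's arch/x86/mm/hugetlbpage.c, so
details may be off):

int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
{
	pgd_t *pgd = pgd_offset(mm, *addr);
	pud_t *pud = pud_offset(pgd, *addr);

	BUG_ON(page_count(virt_to_page(ptep)) == 0);
	if (page_count(virt_to_page(ptep)) == 1)
		return 0;	/* not shared: caller unmaps as usual */

	/* shared: drop this mm's reference to the pmd page instead */
	pud_clear(pud);
	put_page(virt_to_page(ptep));
	*addr = ALIGN(*addr, HPAGE_SIZE * PTRS_PER_PTE) - HPAGE_SIZE;
	return 1;
}

Without i_mmap_mutex held across it, that page_count() == 1 test can race
with another sharer taking or dropping a reference.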

I guess this fix needs to go in to 3.6 early, and "someone" rejig the
hugetlb area of mmotm before that goes on to Linus.  Urggh.  AArgh64.
Sorry, I'm not volunteering.

But an interesting aspect of the hugetlb changes there, is that
mmu_gather is now being used by __unmap_hugepage_range: I did want that
for one of the solutions to this bug that I was toying with.  Although
earlier I had been afraid of "doing free_pgtables work" down in unshare,
it occurred to me later that already we do that (pud_clear) with no harm
observed, and free_pgtables does not depend on having entries still
present at the lower levels.  It may be that there's a less tricky
fix available once the dust has settled here.

> 
> Thanks Hugh!
> 
> ---8<---
> mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables
> 
> If a process creates a large hugetlbfs mapping that is eligible for page
> table sharing and forks heavily with children some of whom fault and
> others which destroy the mapping then it is possible for page tables to
> get corrupted. Some teardowns of the mapping encounter a "bad pmd" and
> output a message to the kernel log. The final teardown will trigger a
> BUG_ON in mm/filemap.c.
> 
> This was reproduced in 3.4 but is known to have existed for a long time
> and goes back at least as far as 2.6.37. It was probably introduced in
> 2.6.20 by [39dde65c: shared page table for hugetlb page]. The messages
> look like this:
> 
> [  ..........] Lots of bad pmd messages followed by this
> [  127.164256] mm/memory.c:391: bad pmd ffff880412e04fe8(80000003de4000e7).
> [  127.164257] mm/memory.c:391: bad pmd ffff880412e04ff0(80000003de6000e7).
> [  127.164258] mm/memory.c:391: bad pmd ffff880412e04ff8(80000003de0000e7).
> [  127.186778] ------------[ cut here ]------------
> [  127.186781] kernel BUG at mm/filemap.c:134!
> [  127.186782] invalid opcode: 0000 [#1] SMP
> [  127.186783] CPU 7
> [  127.186784] Modules linked in: af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf ext3 jbd dm_mod coretemp crc32c_intel usb_storage ghash_clmulni_intel aesni_intel i2c_i801 r8169 mii uas sr_mod cdrom sg iTCO_wdt iTCO_vendor_support shpchp serio_raw cryptd aes_x86_64 e1000e pci_hotplug dcdbas aes_generic container microcode ext4 mbcache jbd2 crc16 sd_mod crc_t10dif i915 drm_kms_helper drm i2c_algo_bit ehci_hcd ahci libahci usbcore rtc_cmos usb_common button i2c_core intel_agp video intel_gtt fan processor thermal thermal_sys hwmon ata_generic pata_atiixp libata scsi_mod
> [  127.186801]
> [  127.186802] Pid: 9017, comm: hugetlbfs-test Not tainted 3.4.0-autobuild #53 Dell Inc. OptiPlex 990/06D7TR
> [  127.186804] RIP: 0010:[<ffffffff810ed6ce>]  [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
> [  127.186809] RSP: 0000:ffff8804144b5c08  EFLAGS: 00010002
> [  127.186810] RAX: 0000000000000001 RBX: ffffea000a5c9000 RCX: 00000000ffffffc0
> [  127.186811] RDX: 0000000000000000 RSI: 0000000000000009 RDI: ffff88042dfdad00
> [  127.186812] RBP: ffff8804144b5c18 R08: 0000000000000009 R09: 0000000000000003
> [  127.186813] R10: 0000000000000000 R11: 000000000000002d R12: ffff880412ff83d8
> [  127.186814] R13: ffff880412ff83d8 R14: 0000000000000000 R15: ffff880412ff83d8
> [  127.186815] FS:  00007fe18ed2c700(0000) GS:ffff88042dce0000(0000) knlGS:0000000000000000
> [  127.186816] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [  127.186817] CR2: 00007fe340000503 CR3: 0000000417a14000 CR4: 00000000000407e0
> [  127.186818] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  127.186819] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [  127.186820] Process hugetlbfs-test (pid: 9017, threadinfo ffff8804144b4000, task ffff880417f803c0)
> [  127.186821] Stack:
> [  127.186822]  ffffea000a5c9000 0000000000000000 ffff8804144b5c48 ffffffff810ed83b
> [  127.186824]  ffff8804144b5c48 000000000000138a 0000000000001387 ffff8804144b5c98
> [  127.186825]  ffff8804144b5d48 ffffffff811bc925 ffff8804144b5cb8 0000000000000000
> [  127.186827] Call Trace:
> [  127.186829]  [<ffffffff810ed83b>] delete_from_page_cache+0x3b/0x80
> [  127.186832]  [<ffffffff811bc925>] truncate_hugepages+0x115/0x220
> [  127.186834]  [<ffffffff811bca43>] hugetlbfs_evict_inode+0x13/0x30
> [  127.186837]  [<ffffffff811655c7>] evict+0xa7/0x1b0
> [  127.186839]  [<ffffffff811657a3>] iput_final+0xd3/0x1f0
> [  127.186840]  [<ffffffff811658f9>] iput+0x39/0x50
> [  127.186842]  [<ffffffff81162708>] d_kill+0xf8/0x130
> [  127.186843]  [<ffffffff81162812>] dput+0xd2/0x1a0
> [  127.186845]  [<ffffffff8114e2d0>] __fput+0x170/0x230
> [  127.186848]  [<ffffffff81236e0e>] ? rb_erase+0xce/0x150
> [  127.186849]  [<ffffffff8114e3ad>] fput+0x1d/0x30
> [  127.186851]  [<ffffffff81117db7>] remove_vma+0x37/0x80
> [  127.186853]  [<ffffffff81119182>] do_munmap+0x2d2/0x360
> [  127.186855]  [<ffffffff811cc639>] sys_shmdt+0xc9/0x170
> [  127.186857]  [<ffffffff81410a39>] system_call_fastpath+0x16/0x1b
> [  127.186858] Code: 0f 1f 44 00 00 48 8b 43 08 48 8b 00 48 8b 40 28 8b b0 40 03 00 00 85 f6 0f 88 df fe ff ff 48 89 df e8 e7 cb 05 00 e9 d2 fe ff ff <0f> 0b 55 83 e2 fd 48 89 e5 48 83 ec 30 48 89 5d d8 4c 89 65 e0
> [  127.186868] RIP  [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
> [  127.186870]  RSP <ffff8804144b5c08>
> [  127.186871] ---[ end trace 7cbac5d1db69f426 ]---
> 
> The bug is a race and not always easy to reproduce. To reproduce it I was
> doing the following on a single socket i7-based machine with 16G of RAM.
> 
> $ hugeadm --pool-pages-max DEFAULT:13G
> $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmmax
> $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmall
> $ for i in `seq 1 9000`; do ./hugetlbfs-test; done
> 
> On my particular machine, it usually triggers within 10 minutes but enabling
> debug options can change the timing such that it never hits. Once the bug is
> triggered, the machine is in trouble and needs to be rebooted. The machine
> will respond but processes accessing proc like "ps aux" will hang due to
> the BUG_ON. shutdown will also hang and needs a hard reset or a sysrq-b.
> 
> The basic problem is a race between page table sharing and teardown. For
> the most part page table sharing depends on i_mmap_mutex. In some cases,
> it is also taking the mm->page_table_lock for the PTE updates but with
> shared page tables, it is the i_mmap_mutex that is more important.
> 
> Unfortunately it appears to also be insufficient. Consider the following
> situation:
> 
> Process A					Process B
> ---------					---------
> hugetlb_fault					shmdt
>   						LockWrite(mmap_sem)
>     						  do_munmap
> 						    unmap_region
> 						      unmap_vmas
> 						        unmap_single_vma
> 						          unmap_hugepage_range
>       						            Lock(i_mmap_mutex)
> 							    Lock(mm->page_table_lock)
> 							    huge_pmd_unshare/unmap tables <--- (1)
> 							    Unlock(mm->page_table_lock)
>       						            Unlock(i_mmap_mutex)
>   huge_pte_alloc				      ...
>     Lock(i_mmap_mutex)				      ...
>     vma_prio_walk, find svma, spte		      ...
>     Lock(mm->page_table_lock)			      ...
>     share spte					      ...
>     Unlock(mm->page_table_lock)			      ...
>     Unlock(i_mmap_mutex)			      ...
>   hugetlb_no_page									  <--- (2)
> 						      free_pgtables
> 						        unlink_file_vma
> 							hugetlb_free_pgd_range
> 						    remove_vma_list
> 
> In this scenario, it is possible for Process A to share page tables with
> Process B that is trying to tear them down.  The i_mmap_mutex on its own
> does not prevent Process A walking Process B's page tables. At (1) above,
> the page tables are not shared yet so it unmaps the PMDs. Process A sets
> up page table sharing and at (2) faults a new entry. Process B then trips
> up on it in free_pgtables.
> 
> This patch fixes the problem by clearing VM_MAYSHARE during
> unmap_hugepage_range() under the i_mmap_mutex. This makes the VMA
> ineligible for sharing and avoids the race. Superficially this looks
> like it would then be vulnerable to truncate and madvise problems but
> this is avoided by the limitations of hugetlbfs.
> 
> madvise and truncate would be problems if removing VM_MAYSHARE in
> __unmap_hugepage_range() but it is removed in unmap_hugepage_range().
> This is only called by unmap_single_vma(): which is called via unmap_vmas()
> by unmap_region() or exit_mmap() just before free_pgtables() (the problem
> cases); or by madvise_dontneed() via zap_page_range(), which is disallowed
> on VM_HUGETLB; or by zap_page_range_single().
> 
> zap_page_range_single() is called by zap_vma_ptes(), which is only allowed
> on VM_PFNMAP; or by unmap_mapping_range_vma(), which looked like it was
> going to deadlock on i_mmap_mutex (with or without my patch) but does
> not as hugetlbfs has its own hugetlbfs_setattr() and hugetlb_vmtruncate()
> which don't use unmap_mapping_range() at all.
> 
> invalidate_inode_pages2() (and _range()) do use unmap_mapping_range(),
> but hugetlbfs doesn't support direct_IO, and otherwise they're called by a
> filesystem directly on its own inodes, which hugetlbfs does not.  If there's
> a deadlock on i_mmap_mutex somewhere in there, it's not introduced by the
> proposed patch.
> 
> This should be treated as a -stable candidate if it is merged.
> 
> Test program is as follows. The test case was mostly written by Michal
> Hocko with a few minor changes to reproduce this bug.
> 
> ==== CUT HERE ====
> 
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <sys/types.h>
> #include <sys/time.h>
> #include <sys/ipc.h>
> #include <sys/shm.h>
> #include <sys/wait.h>
> 
> #ifndef SHM_HUGETLB
> #define SHM_HUGETLB 04000	/* segment will use huge TLB pages */
> #endif
> 
> static size_t huge_page_size = (2UL << 20);
> static size_t nr_huge_page_A = 512;
> static size_t nr_huge_page_B = 5632;
> 
> unsigned int get_random(unsigned int max)
> {
> 	struct timeval tv;
> 
> 	gettimeofday(&tv, NULL);
> 	srandom(tv.tv_usec);
> 	return random() % max;
> }
> 
> static void play(void *addr, size_t size)
> {
> 	unsigned char *start = addr,
> 		      *end = start + size,
> 		      *a;
> 	start += get_random(size/2);
> 
> 	/* we could iterate on huge pages but let's give it more time. */
> 	for (a = start; a < end; a += 4096)
> 		*a = 0;
> }
> 
> int main(int argc, char **argv)
> {
> 	key_t key = IPC_PRIVATE;
> 	size_t sizeA = nr_huge_page_A * huge_page_size;
> 	size_t sizeB = nr_huge_page_B * huge_page_size;
> 	int shmidA, shmidB;
> 	void *addrA = NULL, *addrB = NULL;
> 	int nr_children = 300, n = 0;
> 
> 	if ((shmidA = shmget(key, sizeA, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
> 		perror("shmget:");
> 		return 1;
> 	}
> 
> 	if ((addrA = shmat(shmidA, addrA, SHM_R|SHM_W)) == (void *)-1UL) {
> 		perror("shmat");
> 		return 1;
> 	}
> 	if ((shmidB = shmget(key, sizeB, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
> 		perror("shmget:");
> 		return 1;
> 	}
> 
> 	if ((addrB = shmat(shmidB, addrB, SHM_R|SHM_W)) == (void *)-1UL) {
> 		perror("shmat");
> 		return 1;
> 	}
> 
> fork_child:
> 	switch(fork()) {
> 		case 0:
> 			switch (n%3) {
> 			case 0:
> 				play(addrA, sizeA);
> 				break;
> 			case 1:
> 				play(addrB, sizeB);
> 				break;
> 			case 2:
> 				break;
> 			}
> 			break;
> 		case -1:
> 			perror("fork:");
> 			break;
> 		default:
> 			if (++n < nr_children)
> 				goto fork_child;
> 			play(addrA, sizeA);
> 			break;
> 	}
> 	shmdt(addrA);
> 	shmdt(addrB);
> 	do {
> 		wait(NULL);
> 	} while (--n > 0);
> 	shmctl(shmidA, IPC_RMID, NULL);
> 	shmctl(shmidB, IPC_RMID, NULL);
> 	return 0;
> }
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Reviewed-by: Michal Hocko <mhocko@suse.cz>
> Reviewed-by: Mel Gorman <mgorman@suse.de>
> ---
>  mm/hugetlb.c |   25 +++++++++++++++++++++++--
>  1 file changed, 23 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index ae8f708..d488476 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2383,6 +2383,22 @@ void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
>  {
>  	mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
>  	__unmap_hugepage_range(vma, start, end, ref_page);
> +	/*
> +	 * Clear this flag so that x86's huge_pmd_share page_table_shareable
> +	 * test will fail on a vma being torn down, and not grab a page table
> +	 * on its way out.  We're lucky that the flag has such an appropriate
> +	 * name, and can in fact be safely cleared here. We could clear it
> +	 * before the __unmap_hugepage_range above, but all that's necessary
> +	 * is to clear it before releasing the i_mmap_mutex below.
> +	 *
> +	 * This works because in the contexts this is called, the VMA is
> +	 * going to be destroyed. It is not vulnerable to madvise(DONTNEED)
> +	 * because madvise is not supported on hugetlbfs. The same applies
> +	 * for direct IO. unmap_hugepage_range() is only being called just
> +	 * before free_pgtables() so clearing VM_MAYSHARE will not cause
> +	 * surprises later.
> +	 */
> +	vma->vm_flags &= ~VM_MAYSHARE;
>  	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
>  }
>  
> @@ -2949,9 +2965,14 @@ void hugetlb_change_protection(struct vm_area_struct *vma,
>  		}
>  	}
>  	spin_unlock(&mm->page_table_lock);
> -	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
> -
> +	/*
> +	 * Must flush TLB before releasing i_mmap_mutex: x86's huge_pmd_unshare
> +	 * may have cleared our pud entry and done put_page on the page table:
> +	 * once we release i_mmap_mutex, another task can do the final put_page
> +	 * and that page table be reused and filled with junk.
> +	 */
>  	flush_tlb_range(vma, start, end);
> +	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
>  }
>  
>  int hugetlb_reserve_pages(struct inode *inode,
> 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
  2012-07-24 19:23               ` Hugh Dickins
@ 2012-07-25  8:36                 ` Mel Gorman
  0 siblings, 0 replies; 50+ messages in thread
From: Mel Gorman @ 2012-07-25  8:36 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Michal Hocko, Linux-MM, David Gibson, Ken Chen, Cong Wang,
	Aneesh Kumar K.V, LKML

On Tue, Jul 24, 2012 at 12:23:58PM -0700, Hugh Dickins wrote:
> On Tue, 24 Jul 2012, Mel Gorman wrote:
> > On Mon, Jul 23, 2012 at 06:08:05PM -0700, Hugh Dickins wrote:
> > > 
> > > So, after a bout of anxiety, I think my &= ~VM_MAYSHARE remains good.
> > > 
> > 
> > I agree with you. When I was thinking about the potential problems, I was
> > thinking of them in the general context of the core VM and what we normally
> > take into account.
> > 
> > I confess that I really find this working-by-coincidence very icky and am
> > uncomfortable with it but your patch is the only patch that contains the
> > mess to hugetlbfs. I fixed exit_mmap() for my version but only by changing
> > the core to introduce exit_vmas() to take mmap_sem for write if a hugetlb
> > VMA is found, so I also affected the core.
> 
> "icky" is not quite the word I'd use, but yes, it feels like you only
> have to dislodge a stone somewhere at the other end of the kernel,
> and the whole lot would come tumbling down.
> 
> If I could think of a suitable VM_BUG_ON to insert next to the ~VM_MAYSHARE,
> I would: to warn us when assumptions change.  If we were prepared to waste
> another vm_flag on it (and just because there's now a type which lets them
> expand does not mean we can be profligate with them), then you can imagine
> a VM_GOINGAWAY flag set in unmap_region() and exit_mmap(), and we key off
> that instead; or something of that kind.
> 

A new VM flag would be overkill for this right now.

> But I'm afraid I see that as TODO-list material: the one-liner is pretty
> good for stable backporting, and I felt smiled-upon when it turned out to
> be workable (and not even needing a change in arch/x86/mm, that really
> surprised me).  It seems ungrateful not to seize the simple fix it offers,
> which I found much easier to understand than the alternatives.
> 

That's fair enough.

> > 
> > So, let's go with your patch but with all this documented! I stuck a
> > changelog and an additional comment onto your patch and this is the end
> > result.
> 
> Okay, thanks.  (I think you've copied rather more of my previous mail
> into the commit description than it deserves, but it looks like you
> like more words where I like less!)
> 

I did copy more than was necessary, I'll fix it.

> > 
> > Do you want to pick this up and send it to Andrew or will I?
> 
> Oh, please change your Reviewed-by to Signed-off-by: almost all of the
> work and description comes from you and Michal; then please, you send it
> in to Andrew - sorry, I really need to turn my attention to other things.
> 

That's fine, I'll pick it up. Thanks for working on this.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables v2
  2012-07-20 13:49 [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables v2 Mel Gorman
  2012-07-20 14:11 ` [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend) Mel Gorman
@ 2012-07-26 16:01 ` Larry Woodman
  2012-07-27  8:47   ` Mel Gorman
  2012-07-26 21:00 ` Rik van Riel
  2 siblings, 1 reply; 50+ messages in thread
From: Larry Woodman @ 2012-07-26 16:01 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Michal Hocko, Hugh Dickins, David Gibson, Ken Chen,
	Cong Wang, LKML

On 07/20/2012 09:49 AM, Mel Gorman wrote:
> +retry:
>   	mutex_lock(&mapping->i_mmap_mutex);
>   	vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) {
>   		if (svma == vma)
>   			continue;
> +		if (svma->vm_mm == vma->vm_mm)
> +			continue;
> +
> +		/*
> +		 * The target mm could be in the process of tearing down
> +		 * its page tables and the i_mmap_mutex on its own is
> +		 * not sufficient. To prevent races against teardown and
> +		 * pagetable updates, we acquire the mmap_sem and pagetable
> +		 * lock of the remote address space. down_read_trylock()
> +		 * is necessary as the other process could also be trying
> +		 * to share pagetables with the current mm. In the fork
> +		 * case, we are already both mm's so check for that
> +		 */
> +		if (locked_mm != svma->vm_mm) {
> +			if (!down_read_trylock(&svma->vm_mm->mmap_sem)) {
> +				mutex_unlock(&mapping->i_mmap_mutex);
> +				goto retry;
> +			}
> +			smmap_sem = &svma->vm_mm->mmap_sem;
> +		}
> +
> +		spage_table_lock = &svma->vm_mm->page_table_lock;
> +		spin_lock_nested(spage_table_lock, SINGLE_DEPTH_NESTING);
>
>   		saddr = page_table_shareable(svma, vma, addr, idx);
>   		if (saddr) {

Hi Mel, FYI I tried this and ran into a problem.  When there are multiple
processes in huge_pmd_share() just faulting in the same i_mmap they all
have their mmap_sem down for write so the
down_read_trylock(&svma->vm_mm->mmap_sem) never succeeds.  What am I
missing?

Thanks, Larry


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
  2012-07-23  4:04       ` Hugh Dickins
  2012-07-23 11:40         ` Mel Gorman
@ 2012-07-26 17:42         ` Rik van Riel
  2012-07-26 18:04           ` Larry Woodman
  2012-07-27  8:42           ` Mel Gorman
  2012-07-26 18:37         ` Rik van Riel
  2 siblings, 2 replies; 50+ messages in thread
From: Rik van Riel @ 2012-07-26 17:42 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Mel Gorman, Michal Hocko, Linux-MM, David Gibson, Ken Chen,
	Cong Wang, LKML, Larry Woodman

On 07/23/2012 12:04 AM, Hugh Dickins wrote:

> Please don't be upset if I say that I don't like either of your patches.
> Mainly for obvious reasons - I don't like Mel's because anything with
> trylock retries and nested spinlocks worries me before I can even start
> to think about it; and I don't like Michal's for the same reason as Mel,
> that it spreads more change around in common paths than we would like.

I have a naive question.

In huge_pmd_share, we protect ourselves by taking
the mapping->i_mmap_mutex.

Is there any reason we could not take the i_mmap_mutex
in the huge_pmd_unshare path?

I see that hugetlb_change_protection already takes that
lock. Is there something preventing __unmap_hugepage_range
from also taking mapping->i_mmap_mutex?

That way the sharing and the unsharing code are
protected by the same, per shm segment, lock.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
  2012-07-26 17:42         ` Rik van Riel
@ 2012-07-26 18:04           ` Larry Woodman
  2012-07-27  8:42           ` Mel Gorman
  1 sibling, 0 replies; 50+ messages in thread
From: Larry Woodman @ 2012-07-26 18:04 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Hugh Dickins, Mel Gorman, Michal Hocko, Linux-MM, David Gibson,
	Ken Chen, Cong Wang, LKML

On 07/26/2012 01:42 PM, Rik van Riel wrote:
> On 07/23/2012 12:04 AM, Hugh Dickins wrote:
>
>> Please don't be upset if I say that I don't like either of your patches.
>> Mainly for obvious reasons - I don't like Mel's because anything with
>> trylock retries and nested spinlocks worries me before I can even start
>> to think about it; and I don't like Michal's for the same reason as Mel,
>> that it spreads more change around in common paths than we would like.
>
> I have a naive question.
>
> In huge_pmd_share, we protect ourselves by taking
> the mapping->i_mmap_mutex.
>
> Is there any reason we could not take the i_mmap_mutex
> in the huge_pmd_unshare path?

I think it is already taken on every path into huge_pmd_unshare().

Larry
>
> I see that hugetlb_change_protection already takes that
> lock. Is there something preventing __unmap_hugepage_range
> from also taking mapping->i_mmap_mutex?
>
> That way the sharing and the unsharing code are
> protected by the same, per shm segment, lock.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
  2012-07-20 14:36   ` [PATCH -alternative] " Michal Hocko
  2012-07-20 14:51     ` Mel Gorman
@ 2012-07-26 18:31     ` Rik van Riel
  2012-07-27  9:02       ` Michal Hocko
  1 sibling, 1 reply; 50+ messages in thread
From: Rik van Riel @ 2012-07-26 18:31 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mel Gorman, Linux-MM, Hugh Dickins, David Gibson, Ken Chen,
	Cong Wang, LKML

On 07/20/2012 10:36 AM, Michal Hocko wrote:

> --- a/arch/x86/mm/hugetlbpage.c
> +++ b/arch/x86/mm/hugetlbpage.c
> @@ -81,7 +81,12 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>   		if (saddr) {
>   			spte = huge_pte_offset(svma->vm_mm, saddr);
>   			if (spte) {
> -				get_page(virt_to_page(spte));
> +				struct page *spte_page = virt_to_page(spte);
> +				if (!is_hugetlb_pmd_page_valid(spte_page)) {

What prevents somebody else from marking the hugetlb
pmd invalid, between here...

> +					spte = NULL;
> +					continue;
> +				}

... and here?

> +				get_page(spte_page);
>   				break;
>   			}

I think we need to take the refcount before checking whether
the hugetlb pmd is still valid.

Also, disregard my previous email in this thread, I just
read Mel's detailed explanation and wrapped my brain
around the bug :)

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
  2012-07-23  4:04       ` Hugh Dickins
  2012-07-23 11:40         ` Mel Gorman
  2012-07-26 17:42         ` Rik van Riel
@ 2012-07-26 18:37         ` Rik van Riel
  2012-07-26 21:03           ` Larry Woodman
  2012-07-27  3:48           ` Larry Woodman
  2 siblings, 2 replies; 50+ messages in thread
From: Rik van Riel @ 2012-07-26 18:37 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Mel Gorman, Michal Hocko, Linux-MM, David Gibson, Ken Chen,
	Cong Wang, LKML, Larry Woodman

On 07/23/2012 12:04 AM, Hugh Dickins wrote:

> I spent hours trying to dream up a better patch, trying various
> approaches.  I think I have a nice one now, what do you think?  And
> more importantly, does it work?  I have not tried to test it at all,
> that I'm hoping to leave to you, I'm sure you'll attack it with gusto!
>
> If you like it, please take it over and add your comments and signoff
> and send it in.  The second part won't come up in your testing, and could
> be made a separate patch if you prefer: it's a related point that struck
> me while I was playing with a different approach.
>
> I'm sorely tempted to leave a dangerous pair of eyes off the Cc,
> but that too would be unfair.
>
> Subject-to-your-testing-
> Signed-off-by: Hugh Dickins <hughd@google.com>

This patch looks good to me.

Larry, does Hugh's patch survive your testing?



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables v2
  2012-07-20 13:49 [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables v2 Mel Gorman
  2012-07-20 14:11 ` [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend) Mel Gorman
  2012-07-26 16:01 ` [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables v2 Larry Woodman
@ 2012-07-26 21:00 ` Rik van Riel
  2012-07-26 21:54   ` Hugh Dickins
  2012-07-27  8:52   ` Mel Gorman
  2 siblings, 2 replies; 50+ messages in thread
From: Rik van Riel @ 2012-07-26 21:00 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Michal Hocko, Hugh Dickins, David Gibson, Ken Chen,
	Cong Wang, LKML, Larry Woodman

On 07/20/2012 09:49 AM, Mel Gorman wrote:
> This V2 is still the mmap_sem approach that fixes a potential deadlock
> problem pointed out by Michal.

Larry and I were looking around the hugetlb code some
more, and found what looks like yet another race.

In hugetlb_no_page, we have the following code:


         spin_lock(&mm->page_table_lock);
         size = i_size_read(mapping->host) >> huge_page_shift(h);
         if (idx >= size)
                 goto backout;

         ret = 0;
         if (!huge_pte_none(huge_ptep_get(ptep)))
                 goto backout;

         if (anon_rmap)
                 hugepage_add_new_anon_rmap(page, vma, address);
         else
                 page_dup_rmap(page);
         new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE)
                                 && (vma->vm_flags & VM_SHARED)));
         set_huge_pte_at(mm, address, ptep, new_pte);
	...
	spin_unlock(&mm->page_table_lock);

Notice how we check !huge_pte_none with our own
mm->page_table_lock held.

This offers no protection at all against other
processes, that also hold their own page_table_lock.

In short, it looks like it is possible for multiple
processes to go through the above code simultaneously,
potentially resulting in:

1) one process overwriting the pte just created by
    another process

2) data corruption, as one partially written page
    gets superseded by a newly zeroed page, but no
    TLB invalidates get sent to other CPUs

3) a memory leak of a huge page

Is there anything that would make this race impossible,
or is this a real bug?

If so, are there more like it in the hugetlbfs code?

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
  2012-07-26 18:37         ` Rik van Riel
@ 2012-07-26 21:03           ` Larry Woodman
  2012-07-27  3:48           ` Larry Woodman
  1 sibling, 0 replies; 50+ messages in thread
From: Larry Woodman @ 2012-07-26 21:03 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Hugh Dickins, Mel Gorman, Michal Hocko, Linux-MM, David Gibson,
	Ken Chen, Cong Wang, LKML

On 07/26/2012 02:37 PM, Rik van Riel wrote:
> On 07/23/2012 12:04 AM, Hugh Dickins wrote:
>
>> I spent hours trying to dream up a better patch, trying various
>> approaches.  I think I have a nice one now, what do you think?  And
>> more importantly, does it work?  I have not tried to test it at all,
>> that I'm hoping to leave to you, I'm sure you'll attack it with gusto!
>>
>> If you like it, please take it over and add your comments and signoff
>> and send it in.  The second part won't come up in your testing, and 
>> could
>> be made a separate patch if you prefer: it's a related point that struck
>> me while I was playing with a different approach.
>>
>> I'm sorely tempted to leave a dangerous pair of eyes off the Cc,
>> but that too would be unfair.
>>
>> Subject-to-your-testing-
>> Signed-off-by: Hugh Dickins <hughd@google.com>
>
> This patch looks good to me.
>
> Larry, does Hugh's patch survive your testing?
>
>
It doesn't.  However it's got a slightly different footprint because this
is RHEL6 and there have been changes to the hugetlbfs inode code.  Also,
we are seeing the problem via group_exit() rather than shmdt().  Also, I
print out the actual _mapcount at the BUG and most of the time it's 1 but
I have seen it as high as 6.



dell-per620-01.lab.bos.redhat.com login: MAPCOUNT = 2
------------[ cut here ]------------
kernel BUG at mm/filemap.c:131!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/devices/system/cpu/cpu23/cache/index2/shared_cpu_map
CPU 8
Modules linked in: autofs4 sunrpc ipv6 acpi_pad power_meter dcdbas microcode sb_edac edac_core iTCO_wdt i]

Pid: 3106, comm: mpitest Not tainted 2.6.32-289.el6.sharedpte.x86_64 #17 Dell Inc. PowerEdge R620/07NDJ2
RIP: 0010:[<ffffffff81114a42>]  [<ffffffff81114a42>] __remove_from_page_cache+0xe2/0x100
RSP: 0018:ffff880434897b78  EFLAGS: 00010002
RAX: 0000000000000001 RBX: ffffea00074ec000 RCX: 00000000000010f6
RDX: 0000000000000000 RSI: 0000000000000046 RDI: 0000000000000046
RBP: ffff880434897b88 R08: ffffffff81c01a00 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000004 R12: ffff880432683d98
R13: ffff880432683db0 R14: 0000000000000000 R15: ffffea00074ec000
FS:  0000000000000000(0000) GS:ffff880028280000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000003a1d38c4a8 CR3: 0000000001a85000 CR4: 00000000000406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process mpitest (pid: 3106, threadinfo ffff880434896000, task ffff880431abb500)
Stack:
  ffffea00074ec000 0000000000000000 ffff880434897bb8 ffffffff81114ab4
<d> ffff880434897bb8 00000000000002ab 00000000000002a0 ffff880434897c08
<d> ffff880434897cb8 ffffffff811f758d ffff880000022dd8 0000000000000000
Call Trace:
  [<ffffffff81114ab4>] remove_from_page_cache+0x54/0x90
  [<ffffffff811f758d>] truncate_hugepages+0x11d/0x200
  [<ffffffff811f7670>] ? hugetlbfs_delete_inode+0x0/0x30
  [<ffffffff811f7688>] hugetlbfs_delete_inode+0x18/0x30
  [<ffffffff8119618e>] generic_delete_inode+0xde/0x1d0
  [<ffffffff811f76fd>] hugetlbfs_drop_inode+0x5d/0x70
  [<ffffffff81195132>] iput+0x62/0x70
  [<ffffffff81191c90>] dentry_iput+0x90/0x100
  [<ffffffff81191df1>] d_kill+0x31/0x60
  [<ffffffff8119381c>] dput+0x7c/0x150
  [<ffffffff8117c979>] __fput+0x189/0x210
  [<ffffffff8117ca25>] fput+0x25/0x30
  [<ffffffff8117844d>] filp_close+0x5d/0x90
  [<ffffffff8106e45f>] put_files_struct+0x7f/0xf0
  [<ffffffff8106e523>] exit_files+0x53/0x70
  [<ffffffff8107059d>] do_exit+0x18d/0x870
  [<ffffffff810d6cc2>] ? audit_syscall_entry+0x272/0x2a0
  [<ffffffff81070cd8>] do_group_exit+0x58/0xd0
  [<ffffffff81070d67>] sys_exit_group+0x17/0x20
  [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables v2
  2012-07-26 21:00 ` Rik van Riel
@ 2012-07-26 21:54   ` Hugh Dickins
  2012-07-27  8:52   ` Mel Gorman
  1 sibling, 0 replies; 50+ messages in thread
From: Hugh Dickins @ 2012-07-26 21:54 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Mel Gorman, Linux-MM, Michal Hocko, David Gibson, Ken Chen,
	Cong Wang, LKML, Larry Woodman

On Thu, 26 Jul 2012, Rik van Riel wrote:
> On 07/20/2012 09:49 AM, Mel Gorman wrote:
> > This V2 is still the mmap_sem approach that fixes a potential deadlock
> > problem pointed out by Michal.
> 
> Larry and I were looking around the hugetlb code some
> more, and found what looks like yet another race.
> 
> In hugetlb_no_page, we have the following code:
> 
> 
>         spin_lock(&mm->page_table_lock);
>         size = i_size_read(mapping->host) >> huge_page_shift(h);
>         if (idx >= size)
>                 goto backout;
> 
>         ret = 0;
>         if (!huge_pte_none(huge_ptep_get(ptep)))
>                 goto backout;
> 
>         if (anon_rmap)
>                 hugepage_add_new_anon_rmap(page, vma, address);
>         else
>                 page_dup_rmap(page);
>         new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE)
>                                 && (vma->vm_flags & VM_SHARED)));
>         set_huge_pte_at(mm, address, ptep, new_pte);
> 	...
> 	spin_unlock(&mm->page_table_lock);
> 
> Notice how we check !huge_pte_none with our own
> mm->page_table_lock held.
> 
> This offers no protection at all against other
> processes, that also hold their own page_table_lock.
> 
> In short, it looks like it is possible for multiple
> processes to go through the above code simultaneously,
> potentially resulting in:
> 
> 1) one process overwriting the pte just created by
>    another process
> 
> 2) data corruption, as one partially written page
>    gets superseded by a newly zeroed page, but no
>    TLB invalidates get sent to other CPUs
> 
> 3) a memory leak of a huge page
> 
> Is there anything that would make this race impossible,
> or is this a real bug?

I believe it's protected by the unloved hugetlb_instantiation_mutex.
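
That is, all hugetlb faults are funnelled through one mutex, roughly like
this (paraphrasing 3.4's hugetlb_fault(), abridged, so details may differ):

int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
		  unsigned long address, unsigned int flags)
{
	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
	...
	/*
	 * Serialize faults so that two processes cannot instantiate
	 * the same index concurrently, whatever order they take their
	 * own page_table_locks in.
	 */
	mutex_lock(&hugetlb_instantiation_mutex);
	entry = huge_ptep_get(ptep);
	if (huge_pte_none(entry)) {
		ret = hugetlb_no_page(mm, vma, address, ptep, flags);
		goto out_mutex;
	}
	...
out_mutex:
	mutex_unlock(&hugetlb_instantiation_mutex);
	return ret;
}

So the huge_pte_none() recheck under page_table_lock is only a backout
test, not the primary exclusion.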

> 
> If so, are there more like it in the hugetlbfs code?

What, more than one bug in that code?
Surely that would defy the laws of probability ;)

Hugh

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
  2012-07-26 18:37         ` Rik van Riel
  2012-07-26 21:03           ` Larry Woodman
@ 2012-07-27  3:48           ` Larry Woodman
  2012-07-27 10:10             ` Larry Woodman
  2012-07-27 10:23             ` Mel Gorman
  1 sibling, 2 replies; 50+ messages in thread
From: Larry Woodman @ 2012-07-27  3:48 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Hugh Dickins, Mel Gorman, Michal Hocko, Linux-MM, David Gibson,
	Ken Chen, Cong Wang, LKML

On 07/26/2012 02:37 PM, Rik van Riel wrote:
> On 07/23/2012 12:04 AM, Hugh Dickins wrote:
>
>> I spent hours trying to dream up a better patch, trying various
>> approaches.  I think I have a nice one now, what do you think?  And
>> more importantly, does it work?  I have not tried to test it at all,
>> that I'm hoping to leave to you, I'm sure you'll attack it with gusto!
>>
>> If you like it, please take it over and add your comments and signoff
>> and send it in.  The second part won't come up in your testing, and 
>> could
>> be made a separate patch if you prefer: it's a related point that struck
>> me while I was playing with a different approach.
>>
>> I'm sorely tempted to leave a dangerous pair of eyes off the Cc,
>> but that too would be unfair.
>>
>> Subject-to-your-testing-
>> Signed-off-by: Hugh Dickins <hughd@google.com>
>
> This patch looks good to me.
>
> Larry, does Hugh's patch survive your testing?
>
>
Like I said earlier, no.  However, I finally set up a reproducer that only
takes a few seconds on a large system, and this totally fixes the problem:

-------------------------------------------------------------------------------------------------------------------------
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c36febb..cc023b8 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2151,7 +2151,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			goto nomem;
 
 		/* If the pagetables are shared don't copy or take references */
-		if (dst_pte == src_pte)
+		if (*(unsigned long *)dst_pte == *(unsigned long *)src_pte)
 			continue;
 
 		spin_lock(&dst->page_table_lock);
---------------------------------------------------------------------------------------------------------------------------

When we compare what src_pte and dst_pte point to instead of their
addresses, everything works.  I suspect there is a missing memory barrier
somewhere?
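
To spell out what the two comparisons actually test (a sketch, not part
of the patch):

/*
 * dst_pte and src_pte point into the page tables of dst and src.
 *
 * Pointer identity is true only when both mms reference the very same
 * pmd page, i.e. the page tables really are shared:
 */
if (dst_pte == src_pte)
	continue;
/*
 * Comparing the current contents also fires when the two slots are
 * distinct but happen to hold the same value (for example, both still
 * none), so it may be hiding the race rather than explaining it.
 */
if (*(unsigned long *)dst_pte == *(unsigned long *)src_pte)
	continue;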

Larry


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
  2012-07-26 17:42         ` Rik van Riel
  2012-07-26 18:04           ` Larry Woodman
@ 2012-07-27  8:42           ` Mel Gorman
  1 sibling, 0 replies; 50+ messages in thread
From: Mel Gorman @ 2012-07-27  8:42 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Hugh Dickins, Michal Hocko, Linux-MM, David Gibson, Ken Chen,
	Cong Wang, LKML, Larry Woodman

On Thu, Jul 26, 2012 at 01:42:26PM -0400, Rik van Riel wrote:
> On 07/23/2012 12:04 AM, Hugh Dickins wrote:
> 
> >Please don't be upset if I say that I don't like either of your patches.
> >Mainly for obvious reasons - I don't like Mel's because anything with
> >trylock retries and nested spinlocks worries me before I can even start
> >to think about it; and I don't like Michal's for the same reason as Mel,
> >that it spreads more change around in common paths than we would like.
> 
> I have a naive question.
> 
> In huge_pmd_share, we protect ourselves by taking
> the mapping->i_mmap_mutex.
> 
> Is there any reason we could not take the i_mmap_mutex
> in the huge_pmd_unshare path?
> 

We do, in 3.4 at least - callers of __unmap_hugepage_range hold the
i_mmap_mutex. The locking changed in mmotm and there is a patch there that
needs to be reverted. What tree are you looking at?

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables v2
  2012-07-26 16:01 ` [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables v2 Larry Woodman
@ 2012-07-27  8:47   ` Mel Gorman
  0 siblings, 0 replies; 50+ messages in thread
From: Mel Gorman @ 2012-07-27  8:47 UTC (permalink / raw)
  To: Larry Woodman
  Cc: Linux-MM, Michal Hocko, Hugh Dickins, David Gibson, Ken Chen,
	Cong Wang, LKML

On Thu, Jul 26, 2012 at 12:01:04PM -0400, Larry Woodman wrote:
> On 07/20/2012 09:49 AM, Mel Gorman wrote:
> >+retry:
> >  	mutex_lock(&mapping->i_mmap_mutex);
> >  	vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) {
> >  		if (svma == vma)
> >  			continue;
> >+		if (svma->vm_mm == vma->vm_mm)
> >+			continue;
> >+
> >+		/*
> >+		 * The target mm could be in the process of tearing down
> >+		 * its page tables and the i_mmap_mutex on its own is
> >+		 * not sufficient. To prevent races against teardown and
> >+		 * pagetable updates, we acquire the mmap_sem and pagetable
> >+		 * lock of the remote address space. down_read_trylock()
> >+		 * is necessary as the other process could also be trying
> >+		 * to share pagetables with the current mm. In the fork
> >+		 * case, we are already both mm's so check for that
> >+		 */
> >+		if (locked_mm != svma->vm_mm) {
> >+			if (!down_read_trylock(&svma->vm_mm->mmap_sem)) {
> >+				mutex_unlock(&mapping->i_mmap_mutex);
> >+				goto retry;
> >+			}
> >+			smmap_sem = &svma->vm_mm->mmap_sem;
> >+		}
> >+
> >+		spage_table_lock = &svma->vm_mm->page_table_lock;
> >+		spin_lock_nested(spage_table_lock, SINGLE_DEPTH_NESTING);
> >
> >  		saddr = page_table_shareable(svma, vma, addr, idx);
> >  		if (saddr) {
> 
> Hi Mel, FYI I tried this and ran into a problem.  When there are
> multiple processes
> in huge_pmd_share() just faulting in the same i_map they all have
> their mmap_sem
> down for write so the down_read_trylock(&svma->vm_mm->mmap_sem) never
> succeeds.  What am I missing?
> 

Probably nothing, this version of the patch is flawed. In the final
(unreleased) version of this approach it had to check whether it had been
retrying the trylock for too long and, if so, bail out and fail to share
the page tables. I've dropped this approach to the problem as better
alternatives exist.
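
For illustration, the bail-out looked something like this (never released,
and the retry limit here is arbitrary):

	int retries = 0;

retry:
	mutex_lock(&mapping->i_mmap_mutex);
	vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) {
		...
		if (!down_read_trylock(&svma->vm_mm->mmap_sem)) {
			mutex_unlock(&mapping->i_mmap_mutex);
			if (++retries > 10)
				return;	/* give up, fall back to unshared */
			goto retry;
		}
		...
	}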

Thanks Larry!

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables v2
  2012-07-26 21:00 ` Rik van Riel
  2012-07-26 21:54   ` Hugh Dickins
@ 2012-07-27  8:52   ` Mel Gorman
  1 sibling, 0 replies; 50+ messages in thread
From: Mel Gorman @ 2012-07-27  8:52 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Linux-MM, Michal Hocko, Hugh Dickins, David Gibson, Ken Chen,
	Cong Wang, LKML, Larry Woodman

On Thu, Jul 26, 2012 at 05:00:28PM -0400, Rik van Riel wrote:
> On 07/20/2012 09:49 AM, Mel Gorman wrote:
> >This V2 is still the mmap_sem approach that fixes a potential deadlock
> >problem pointed out by Michal.
> 
> Larry and I were looking around the hugetlb code some
> more, and found what looks like yet another race.
> 
> In hugetlb_no_page, we have the following code:
> 
> 
>         spin_lock(&mm->page_table_lock);
>         size = i_size_read(mapping->host) >> huge_page_shift(h);
>         if (idx >= size)
>                 goto backout;
> 
>         ret = 0;
>         if (!huge_pte_none(huge_ptep_get(ptep)))
>                 goto backout;
> 
>         if (anon_rmap)
>                 hugepage_add_new_anon_rmap(page, vma, address);
>         else
>                 page_dup_rmap(page);
>         new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE)
>                                 && (vma->vm_flags & VM_SHARED)));
>         set_huge_pte_at(mm, address, ptep, new_pte);
> 	...
> 	spin_unlock(&mm->page_table_lock);
> 
> Notice how we check !huge_pte_none with our own
> mm->page_table_lock held.
> 
> This offers no protection at all against other
> processes, that also hold their own page_table_lock.
> 

Yes, the page_table_lock is close to useless once shared page tables are
involved. It's why, if we ever wanted to make shared page tables a core MM
thing, we'd have to revisit how PTE locking works at any level that can
share page tables.
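
Schematically, with one pmd page shared between two mms:

/*
 *   Process A                           Process B
 *   spin_lock(&mma->page_table_lock)    spin_lock(&mmb->page_table_lock)
 *   update the shared pmd page          update the same shared pmd page
 *   spin_unlock(...)                    spin_unlock(...)
 *
 * Both critical sections can run concurrently because each mm has its
 * own lock; neither lock covers the pmd page they share, so only
 * i_mmap_mutex can.
 */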

> In short, it looks like it is possible for multiple
> processes to go through the above code simultaneously,
> potentially resulting in:
> 
> 1) one process overwriting the pte just created by
>    another process
> 
> 2) data corruption, as one partially written page
>    gets superseded by a newly zeroed page, but no
>    TLB invalidates get sent to other CPUs
> 
> 3) a memory leak of a huge page
> 
> Is there anything that would make this race impossible,
> or is this a real bug?
> 

In this case it all happens under the hugetlb instantiation mutex in
hugetlb_fault(). It's yet another reason why removing that mutex would
be a serious undertaking and the gain for doing so is marginal.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
  2012-07-26 18:31     ` Rik van Riel
@ 2012-07-27  9:02       ` Michal Hocko
  0 siblings, 0 replies; 50+ messages in thread
From: Michal Hocko @ 2012-07-27  9:02 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Mel Gorman, Linux-MM, Hugh Dickins, David Gibson, Ken Chen,
	Cong Wang, LKML

On Thu 26-07-12 14:31:50, Rik van Riel wrote:
> On 07/20/2012 10:36 AM, Michal Hocko wrote:
> 
> >--- a/arch/x86/mm/hugetlbpage.c
> >+++ b/arch/x86/mm/hugetlbpage.c
> >@@ -81,7 +81,12 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> >  		if (saddr) {
> >  			spte = huge_pte_offset(svma->vm_mm, saddr);
> >  			if (spte) {
> >-				get_page(virt_to_page(spte));
> >+				struct page *spte_page = virt_to_page(spte);
> >+				if (!is_hugetlb_pmd_page_valid(spte_page)) {
> 
> What prevents somebody else from marking the hugetlb
> pmd invalid, between here...
> 
> >+					spte = NULL;
> >+					continue;
> >+				}
> 
> ... and here?

huge_ptep_get_and_clear is (or should be) called with i_mmap_mutex held,
which is not the case right now, as Mel already pointed out in another
email

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
  2012-07-27  3:48           ` Larry Woodman
@ 2012-07-27 10:10             ` Larry Woodman
  2012-07-27 10:23             ` Mel Gorman
  1 sibling, 0 replies; 50+ messages in thread
From: Larry Woodman @ 2012-07-27 10:10 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Rik van Riel, Hugh Dickins, Michal Hocko, Linux-MM, David Gibson,
	Ken Chen, Cong Wang, LKML

On 07/26/2012 11:48 PM, Larry Woodman wrote:


Mel, did you see this???

Larry

>> This patch looks good to me.
>>
>> Larry, does Hugh's patch survive your testing?
>>
>>
>
> Like I said earlier, no.  However, I finally set up a reproducer that
> only takes a few seconds on a large system and this totally fixes the
> problem:
>
> ------------------------------------------------------------------------------------------------------------------------- 
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index c36febb..cc023b8 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2151,7 +2151,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>                         goto nomem;
>
>                 /* If the pagetables are shared don't copy or take references */
> -               if (dst_pte == src_pte)
> +               if (*(unsigned long *)dst_pte == *(unsigned long *)src_pte)
>                         continue;
>
>                 spin_lock(&dst->page_table_lock);
> --------------------------------------------------------------------------------------------------------------------------- 
>
>
> When we compare what the src_pte & dst_pte point to instead of their
> addresses, everything works.
> I suspect there is a missing memory barrier somewhere???
>
> Larry
>


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
  2012-07-27  3:48           ` Larry Woodman
  2012-07-27 10:10             ` Larry Woodman
@ 2012-07-27 10:23             ` Mel Gorman
  2012-07-27 10:36               ` Larry Woodman
  2012-07-30 19:11               ` Larry Woodman
  1 sibling, 2 replies; 50+ messages in thread
From: Mel Gorman @ 2012-07-27 10:23 UTC (permalink / raw)
  To: Larry Woodman
  Cc: Rik van Riel, Hugh Dickins, Michal Hocko, Linux-MM, David Gibson,
	Ken Chen, Cong Wang, LKML

On Thu, Jul 26, 2012 at 11:48:56PM -0400, Larry Woodman wrote:
> On 07/26/2012 02:37 PM, Rik van Riel wrote:
> >On 07/23/2012 12:04 AM, Hugh Dickins wrote:
> >
> >>I spent hours trying to dream up a better patch, trying various
> >>approaches.  I think I have a nice one now, what do you think?  And
> >>more importantly, does it work?  I have not tried to test it at all,
> >>that I'm hoping to leave to you, I'm sure you'll attack it with gusto!
> >>
> >>If you like it, please take it over and add your comments and signoff
> >>and send it in.  The second part won't come up in your testing,
> >>and could
> >>be made a separate patch if you prefer: it's a related point that struck
> >>me while I was playing with a different approach.
> >>
> >>I'm sorely tempted to leave a dangerous pair of eyes off the Cc,
> >>but that too would be unfair.
> >>
> >>Subject-to-your-testing-
> >>Signed-off-by: Hugh Dickins <hughd@google.com>
> >
> >This patch looks good to me.
> >
> >Larry, does Hugh's patch survive your testing?
> >
> >
>
> Like I said earlier, no. 

That is a surprise. Can you try your test case on 3.4 and tell us if the
patch fixes the problem there? I would like to rule out the possibility
that the locking rules are slightly different in RHEL. If it hits on 3.4
then it's also possible you are seeing a different bug, more on this later.

> However, I finally set up a reproducer
> that only takes a few seconds
> on a large system and this totally fixes the problem:
> 

The other possibility is that your reproducer case is triggering a
different race to mine. Would it be possible to post?

> -------------------------------------------------------------------------------------------------------------------------
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index c36febb..cc023b8 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2151,7 +2151,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>                         goto nomem;
> 
>                 /* If the pagetables are shared don't copy or take references */
> -               if (dst_pte == src_pte)
> +               if (*(unsigned long *)dst_pte == *(unsigned long *)src_pte)
>                         continue;
> 
>                 spin_lock(&dst->page_table_lock);
> ---------------------------------------------------------------------------------------------------------------------------
> 
> When we compare what the src_pte & dst_pte point to instead of their
> addresses, everything works,

The dst_pte and src_pte are pointing to the PMD page, though, which is what
we're meant to be checking. Your patch appears to change that to check if
they are sharing data which is quite different. This is functionally
similar to if you just checked VM_MAYSHARE at the start of the function
and bailed if so. The PTEs would be populated at fault time instead.
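
In other words, the net effect is close to the following (a sketch of the
comparison, not a proposed patch):

	int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
				    struct vm_area_struct *vma)
	{
		/*
		 * Shared mappings: skip the copy entirely and let the
		 * PTEs be populated at fault time instead.
		 */
		if (vma->vm_flags & VM_MAYSHARE)
			return 0;

		/* ... existing copy loop for private mappings ... */
		return 0;
	}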

> I suspect there is a missing memory barrier somewhere ???
> 

Possibly but hard to tell whether it's barriers that are the real
problem during fork. The copy routine is suspicious.

On the barrier side - in normal PTE alloc routines there is a write
barrier which is documented in __pte_alloc. If hugepage table sharing is
successful, there is no similar barrier in huge_pmd_share before the PUD
is populated. By rights, there should be a smp_wmb() before the page table
spinlock is taken in huge_pmd_share().
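
Something along these lines, mirroring the comment above __pte_alloc()
(a sketch of where the barrier would go, untested):

	/* in huge_pmd_share(), once a shareable spte has been found */
	smp_wmb(); /* make the shared PMD page contents visible before
		    * the PUD entry that publishes them, as __pte_alloc()
		    * does for newly allocated page tables */
	spin_lock(&mm->page_table_lock);
	if (pud_none(*pud))
		pud_populate(mm, pud,
			(pmd_t *)((unsigned long)spte & PAGE_MASK));
	else
		put_page(virt_to_page(spte));
	spin_unlock(&mm->page_table_lock);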

The lack of a write barrier leads to possible snarls between fork()
and fault. Take three processes, parent, child and other. Parent is
forking to create child. Other is calling fault.

Other faults
	hugetlb_fault()->huge_pte_alloc->allocate a PMD (write barrier)
	It is about to enter hugetlb_no_page()

Parent forks() runs at the same time
	Child shares a page table page but NOT with the forking process (dst_pte
	!= src_pte) and calls huge_pte_offset.

As it's not reading the contents of the PMD page, there is no implicit read
barrier to pair with the write barrier from hugetlb_fault that updates
the PMD page and they are not serialised by the page table lock. Hard to
see exactly where that would cause a problem though.

Thing is, in this scenario I think it's possible that page table sharing
is not correctly detected by that dst_pte == src_pte check.  dst_pte !=
src_pte but that does not mean it's not sharing with somebody! If it's
sharing and it falls through then it copies the src PTE even though the
dst PTE could already be populated and updates the mapcount accordingly.
That would be a mess in its own right.

There might be two bugs here.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
  2012-07-27 10:23             ` Mel Gorman
@ 2012-07-27 10:36               ` Larry Woodman
  2012-07-30 19:11               ` Larry Woodman
  1 sibling, 0 replies; 50+ messages in thread
From: Larry Woodman @ 2012-07-27 10:36 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Rik van Riel, Hugh Dickins, Michal Hocko, Linux-MM, David Gibson,
	Ken Chen, Cong Wang, LKML

On 07/27/2012 06:23 AM, Mel Gorman wrote:
> On Thu, Jul 26, 2012 at 11:48:56PM -0400, Larry Woodman wrote:
>> On 07/26/2012 02:37 PM, Rik van Riel wrote:
>>> On 07/23/2012 12:04 AM, Hugh Dickins wrote:
>>>
>>>> I spent hours trying to dream up a better patch, trying various
>>>> approaches.  I think I have a nice one now, what do you think?  And
>>>> more importantly, does it work?  I have not tried to test it at all,
>>>> that I'm hoping to leave to you, I'm sure you'll attack it with gusto!
>>>>
>>>> If you like it, please take it over and add your comments and signoff
>>>> and send it in.  The second part won't come up in your testing,
>>>> and could
>>>> be made a separate patch if you prefer: it's a related point that struck
>>>> me while I was playing with a different approach.
>>>>
>>>> I'm sorely tempted to leave a dangerous pair of eyes off the Cc,
>>>> but that too would be unfair.
>>>>
>>>> Subject-to-your-testing-
>>>> Signed-off-by: Hugh Dickins <hughd@google.com>
>>> This patch looks good to me.
>>>
>>> Larry, does Hugh's patch survive your testing?
>>>
>>>
>> Like I said earlier, no.
> That is a surprise. Can you try your test case on 3.4 and tell us if the
> patch fixes the problem there? I would like to rule out the possibility
> that the locking rules are slightly different in RHEL. If it hits on 3.4
> then it's also possible you are seeing a different bug, more on this later.
Sure, it will take me a little while because the machine is shared between
several users.
>
>> However, I finally set up a reproducer
>> that only takes a few seconds
>> on a large system and this totally fixes the problem:
>>
> The other possibility is that your reproducer case is triggering a
> different race to mine. Would it be possible to post?
Let me ask; I only have the binary and don't know if it's OK to
distribute, so I don't know exactly what is going on.  I did some tracing
and saw forking, group exits, multi-threading, hugetlbfs file creation,
and mmap'ing, munmap'ing & deleting the hugetlbfs files.
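
From the tracing, a rough userspace sketch of that kind of workload would
look like the program below. To be clear, this is my guess at the shape of
it, not the actual binary; the mount point is an assumption, and the
mapping must span an aligned 1GB range for PMD sharing to kick in on x86.

	#include <fcntl.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <sys/wait.h>
	#include <unistd.h>

	#define LEN	(1UL << 30)	/* 1GB of 2MB huge pages */

	int main(void)
	{
		/* assumes hugetlbfs mounted at /mnt/huge */
		int i, fd = open("/mnt/huge/race-test", O_CREAT | O_RDWR, 0600);
		char *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
			       MAP_SHARED, fd, 0);

		if (fd < 0 || p == MAP_FAILED)
			return 1;

		for (i = 0; i < 64; i++) {
			if (fork() == 0) {
				/* some children fault, others tear down */
				if (i & 1)
					memset(p, i, LEN);
				else
					munmap(p, LEN);
				_exit(0);	/* group exit */
			}
		}
		while (wait(NULL) > 0)
			;
		munmap(p, LEN);
		close(fd);
		unlink("/mnt/huge/race-test");
		return 0;
	}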

>
>> -------------------------------------------------------------------------------------------------------------------------
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index c36febb..cc023b8 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -2151,7 +2151,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>>                          goto nomem;
>>
>>                  /* If the pagetables are shared don't copy or take references */
>> -               if (dst_pte == src_pte)
>> +               if (*(unsigned long *)dst_pte == *(unsigned long *)src_pte)
>>                          continue;
>>
>>                  spin_lock(&dst->page_table_lock);
>> ---------------------------------------------------------------------------------------------------------------------------
>>
>> When we compare what the src_pte & dst_pte point to instead of their
>> addresses, everything works,
> The dst_pte and src_pte are pointing to the PMD page though which is what
> we're meant to be checking. Your patch appears to change that to check if
> they are sharing data which is quite different. This is functionally
> similar to if you just checked VM_MAYSHARE at the start of the function
> and bailed if so. The PTEs would be populated at fault time instead.
>
>> I suspect there is a missing memory barrier somewhere ???
>>
> Possibly but hard to tell whether it's barriers that are the real
> problem during fork. The copy routine is suspicious.
>
> On the barrier side - in normal PTE alloc routines there is a write
> barrier which is documented in __pte_alloc. If hugepage table sharing is
> successful, there is no similar barrier in huge_pmd_share before the PUD
> is populated. By rights, there should be a smp_wmb() before the page table
> spinlock is taken in huge_pmd_share().
>
> The lack of a write barrier leads to possible snarls between fork()
> and fault. Take three processes, parent, child and other. Parent is
> forking to create child. Other is calling fault.
>
> Other faults
> 	hugetlb_fault()->huge_pte_alloc->allocate a PMD (write barrier)
> 	It is about to enter hugetlb_no_page()
>
> Parent forks() runs at the same time
> 	Child shares a page table page but NOT with the forking process (dst_pte
> 	!= src_pte) and calls huge_pte_offset.
>
> As it's not reading the contents of the PMD page, there is no implicit read
> barrier to pair with the write barrier from hugetlb_fault that updates
> the PMD page and they are not serialised by the page table lock. Hard to
> see exactly where that would cause a problem though.
>
> Thing is, in this scenario I think it's possible that page table sharing
> is not correctly detected by that dst_pte == src_pte check.  dst_pte !=
> src_pte but that does not mean it's not sharing with somebody! If it's
> sharing and it falls through then it copies the src PTE even though the
> dst PTE could already be populated and updates the mapcount accordingly.
> That would be a mess in its own right.
I think this is exactly what is happening.  I'll put more cave-man debugging
code in and let you know.

Larry

>
> There might be two bugs here.
>


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
  2012-07-27 10:23             ` Mel Gorman
  2012-07-27 10:36               ` Larry Woodman
@ 2012-07-30 19:11               ` Larry Woodman
  2012-07-31 12:16                 ` Hillf Danton
  2012-07-31 12:46                 ` Mel Gorman
  1 sibling, 2 replies; 50+ messages in thread
From: Larry Woodman @ 2012-07-30 19:11 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Rik van Riel, Hugh Dickins, Michal Hocko, Linux-MM, David Gibson,
	Ken Chen, Cong Wang, LKML

On 07/27/2012 06:23 AM, Mel Gorman wrote:
> On Thu, Jul 26, 2012 at 11:48:56PM -0400, Larry Woodman wrote:
>> On 07/26/2012 02:37 PM, Rik van Riel wrote:
>>> On 07/23/2012 12:04 AM, Hugh Dickins wrote:
>>>
>>>> I spent hours trying to dream up a better patch, trying various
>>>> approaches.  I think I have a nice one now, what do you think?  And
>>>> more importantly, does it work?  I have not tried to test it at all,
>>>> that I'm hoping to leave to you, I'm sure you'll attack it with gusto!
>>>>
>>>> If you like it, please take it over and add your comments and signoff
>>>> and send it in.  The second part won't come up in your testing,
>>>> and could
>>>> be made a separate patch if you prefer: it's a related point that struck
>>>> me while I was playing with a different approach.
>>>>
>>>> I'm sorely tempted to leave a dangerous pair of eyes off the Cc,
>>>> but that too would be unfair.
>>>>
>>>> Subject-to-your-testing-
>>>> Signed-off-by: Hugh Dickins <hughd@google.com>
>>> This patch looks good to me.
>>>
>>> Larry, does Hugh's patch survive your testing?
>>>
>>>
>> Like I said earlier, no.
> That is a surprise. Can you try your test case on 3.4 and tell us if the
> patch fixes the problem there? I would like to rule out the possibility
> that the locking rules are slightly different in RHEL. If it hits on 3.4
> then it's also possible you are seeing a different bug, more on this later.
>

Sorry for the delay Mel, here is the BUG() traceback from the 3.4 kernel
with your patches:

--------------------------------------------------------------------------------------------------------------------------------------------
[ 1106.156569] ------------[ cut here ]------------
[ 1106.161731] kernel BUG at mm/filemap.c:135!
[ 1106.166395] invalid opcode: 0000 [#1] SMP
[ 1106.170975] CPU 22
[ 1106.173115] Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi]
[ 1106.201770]
[ 1106.203426] Pid: 18001, comm: mpitest Tainted: G        W    3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2
[ 1106.213822] RIP: 0010:[<ffffffff8112cfed>]  [<ffffffff8112cfed>] __delete_from_page_cache+0x15d/0x170
[ 1106.224117] RSP: 0018:ffff880428973b88  EFLAGS: 00010002
[ 1106.230032] RAX: 0000000000000001 RBX: ffffea0006b80000 RCX: 00000000ffffffb0
[ 1106.237979] RDX: 0000000000016df1 RSI: 0000000000000009 RDI: ffff88043ffd9e00
[ 1106.245927] RBP: ffff880428973b98 R08: 0000000000000050 R09: 0000000000000003
[ 1106.253876] R10: 000000000000000d R11: 0000000000000000 R12: ffff880428708150
[ 1106.261826] R13: ffff880428708150 R14: 0000000000000000 R15: ffffea0006b80000
[ 1106.269780] FS:  0000000000000000(0000) GS:ffff88042fd60000(0000) knlGS:0000000000000000
[ 1106.278794] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1106.285193] CR2: 0000003a1d38c4a8 CR3: 000000000187d000 CR4: 00000000000406e0
[ 1106.293149] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1106.301097] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1106.309046] Process mpitest (pid: 18001, threadinfo ffff880428972000, task ffff880428b5cc20)
[ 1106.318447] Stack:
[ 1106.320690]  ffffea0006b80000 0000000000000000 ffff880428973bc8 ffffffff8112d040
[ 1106.328958]  ffff880428973bc8 00000000000002ab 00000000000002a0 ffff880428973c18
[ 1106.337234]  ffff880428973cc8 ffffffff8125b405 ffff880400000001 0000000000000000
[ 1106.345513] Call Trace:
[ 1106.348235]  [<ffffffff8112d040>] delete_from_page_cache+0x40/0x80
[ 1106.355128]  [<ffffffff8125b405>] truncate_hugepages+0x115/0x1f0
[ 1106.361826]  [<ffffffff8125b4f8>] hugetlbfs_evict_inode+0x18/0x30
[ 1106.368615]  [<ffffffff811ab1af>] evict+0x9f/0x1b0
[ 1106.373951]  [<ffffffff811ab3a3>] iput_final+0xe3/0x1e0
[ 1106.379773]  [<ffffffff811ab4de>] iput+0x3e/0x50
[ 1106.384922]  [<ffffffff811a8e18>] d_kill+0xf8/0x110
[ 1106.390356]  [<ffffffff811a8f12>] dput+0xe2/0x1b0
[ 1106.395595]  [<ffffffff81193612>] __fput+0x162/0x240
[ 1106.401124]  [<ffffffff81193715>] fput+0x25/0x30
[ 1106.406265]  [<ffffffff8118f6c3>] filp_close+0x63/0x90
[ 1106.411997]  [<ffffffff8106058f>] put_files_struct+0x7f/0xf0
[ 1106.418302]  [<ffffffff8106064c>] exit_files+0x4c/0x60
[ 1106.424025]  [<ffffffff810629d7>] do_exit+0x1a7/0x470
[ 1106.429652]  [<ffffffff81062cf5>] do_group_exit+0x55/0xd0
[ 1106.435665]  [<ffffffff81062d87>] sys_exit_group+0x17/0x20
[ 1106.441777]  [<ffffffff815d0229>] system_call_fastpath+0x16/0x1b
[ 1106.448474] Code: 66 0f 1f 44 00 00 48 8b 47 08 48 8b 00 48 8b 40 28 44 8b 80 38 03 00 00 45 85 c0 0f
[ 1106.470022] RIP  [<ffffffff8112cfed>] __delete_from_page_cache+0x15d/0x170
[ 1106.477693]  RSP <ffff880428973b88>
--------------------------------------------------------------------------------------------------------------------------------------

I'll see if I can distribute the program that causes the panic; I don't
have the source, only the binary.

Larry


BTW, the only way I've been able to get the panic to stop is:

--------------------------------------------------------------------------------------------------------------------------------------
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c36febb..cc023b8 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2151,7 +2151,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
                         goto nomem;

                 /* If the pagetables are shared don't copy or take references */
-               if (dst_pte == src_pte)
+               if (*(unsigned long *)dst_pte == *(unsigned long *)src_pte)
                         continue;

                 spin_lock(&dst->page_table_lock);
---------------------------------------------------------------------------------------------------------------------------------------



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
  2012-07-30 19:11               ` Larry Woodman
@ 2012-07-31 12:16                 ` Hillf Danton
  2012-07-31 12:46                 ` Mel Gorman
  1 sibling, 0 replies; 50+ messages in thread
From: Hillf Danton @ 2012-07-31 12:16 UTC (permalink / raw)
  To: lwoodman
  Cc: Mel Gorman, Rik van Riel, Hugh Dickins, Michal Hocko, Linux-MM,
	David Gibson, Ken Chen, Cong Wang, LKML

On Tue, Jul 31, 2012 at 3:11 AM, Larry Woodman <lwoodman@redhat.com> wrote:
> [ 1106.156569] ------------[ cut here ]------------
> [ 1106.161731] kernel BUG at mm/filemap.c:135!
> [ 1106.166395] invalid opcode: 0000 [#1] SMP
> [ 1106.170975] CPU 22
> [ 1106.173115] Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi]
> [ 1106.201770]
> [ 1106.203426] Pid: 18001, comm: mpitest Tainted: G        W    3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2
> [ 1106.213822] RIP: 0010:[<ffffffff8112cfed>]  [<ffffffff8112cfed>] __delete_from_page_cache+0x15d/0x170
> [ 1106.224117] RSP: 0018:ffff880428973b88  EFLAGS: 00010002
> [ 1106.230032] RAX: 0000000000000001 RBX: ffffea0006b80000 RCX: 00000000ffffffb0
> [ 1106.237979] RDX: 0000000000016df1 RSI: 0000000000000009 RDI: ffff88043ffd9e00
> [ 1106.245927] RBP: ffff880428973b98 R08: 0000000000000050 R09: 0000000000000003
> [ 1106.253876] R10: 000000000000000d R11: 0000000000000000 R12: ffff880428708150
> [ 1106.261826] R13: ffff880428708150 R14: 0000000000000000 R15: ffffea0006b80000
> [ 1106.269780] FS:  0000000000000000(0000) GS:ffff88042fd60000(0000) knlGS:0000000000000000
> [ 1106.278794] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1106.285193] CR2: 0000003a1d38c4a8 CR3: 000000000187d000 CR4: 00000000000406e0
> [ 1106.293149] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 1106.301097] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 1106.309046] Process mpitest (pid: 18001, threadinfo ffff880428972000, task ffff880428b5cc20)
> [ 1106.318447] Stack:
> [ 1106.320690]  ffffea0006b80000 0000000000000000 ffff880428973bc8 ffffffff8112d040
> [ 1106.328958]  ffff880428973bc8 00000000000002ab 00000000000002a0 ffff880428973c18
> [ 1106.337234]  ffff880428973cc8 ffffffff8125b405 ffff880400000001 0000000000000000
> [ 1106.345513] Call Trace:
> [ 1106.348235]  [<ffffffff8112d040>] delete_from_page_cache+0x40/0x80
> [ 1106.355128]  [<ffffffff8125b405>] truncate_hugepages+0x115/0x1f0
> [ 1106.361826]  [<ffffffff8125b4f8>] hugetlbfs_evict_inode+0x18/0x30
> [ 1106.368615]  [<ffffffff811ab1af>] evict+0x9f/0x1b0
> [ 1106.373951]  [<ffffffff811ab3a3>] iput_final+0xe3/0x1e0
> [ 1106.379773]  [<ffffffff811ab4de>] iput+0x3e/0x50
> [ 1106.384922]  [<ffffffff811a8e18>] d_kill+0xf8/0x110
> [ 1106.390356]  [<ffffffff811a8f12>] dput+0xe2/0x1b0
> [ 1106.395595]  [<ffffffff81193612>] __fput+0x162/0x240
> [ 1106.401124]  [<ffffffff81193715>] fput+0x25/0x30
> [ 1106.406265]  [<ffffffff8118f6c3>] filp_close+0x63/0x90
> [ 1106.411997]  [<ffffffff8106058f>] put_files_struct+0x7f/0xf0
> [ 1106.418302]  [<ffffffff8106064c>] exit_files+0x4c/0x60
> [ 1106.424025]  [<ffffffff810629d7>] do_exit+0x1a7/0x470
> [ 1106.429652]  [<ffffffff81062cf5>] do_group_exit+0x55/0xd0
> [ 1106.435665]  [<ffffffff81062d87>] sys_exit_group+0x17/0x20
> [ 1106.441777]  [<ffffffff815d0229>] system_call_fastpath+0x16/0x1b


Perhaps we have to remove rmap when evicting inode.

--- a/fs/hugetlbfs/inode.c	Tue Jul 31 19:59:32 2012
+++ b/fs/hugetlbfs/inode.c	Tue Jul 31 20:04:14 2012
@@ -390,9 +390,11 @@ static void truncate_hugepages(struct in
 	hugetlb_unreserve_pages(inode, start, freed);
 }

+static int hugetlb_vmtruncate(struct inode *, loff_t);
+
 static void hugetlbfs_evict_inode(struct inode *inode)
 {
-	truncate_hugepages(inode, 0);
+	hugetlb_vmtruncate(inode, 0);
 	clear_inode(inode);
 }

--

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
  2012-07-30 19:11               ` Larry Woodman
  2012-07-31 12:16                 ` Hillf Danton
@ 2012-07-31 12:46                 ` Mel Gorman
  2012-07-31 13:07                   ` Larry Woodman
                                     ` (3 more replies)
  1 sibling, 4 replies; 50+ messages in thread
From: Mel Gorman @ 2012-07-31 12:46 UTC (permalink / raw)
  To: Larry Woodman
  Cc: Rik van Riel, Hugh Dickins, Michal Hocko, Linux-MM, David Gibson,
	Ken Chen, Cong Wang, LKML

On Mon, Jul 30, 2012 at 03:11:27PM -0400, Larry Woodman wrote:
> > <SNIP>
> >That is a surprise. Can you try your test case on 3.4 and tell us if the
> >patch fixes the problem there? I would like to rule out the possibility
> >that the locking rules are slightly different in RHEL. If it hits on 3.4
> >then it's also possible you are seeing a different bug, more on this later.
> >
> 
> Sorry for the delay Mel, here is the BUG() traceback from the 3.4
> kernel with your patches:
> 
> --------------------------------------------------------------------------------------------------------------------------------------------
> [ 1106.156569] ------------[ cut here ]------------
> [ 1106.161731] kernel BUG at mm/filemap.c:135!
> [ 1106.166395] invalid opcode: 0000 [#1] SMP
> [ 1106.170975] CPU 22
> [ 1106.173115] Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi]
> [ 1106.201770]

Thanks, looks very similar.

> [ 1106.203426] Pid: 18001, comm: mpitest Tainted: G        W    3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2

You say this was a 3.4 kernel but the message says 3.3. Probably not
relevant, just interesting.

> I'll see if I can distribute the program that causes the panic, I
> dont have source, only binary.
> 
> Larry
> 
> 
> BTW, the only way Ilve been able to get the panic to stop is:
> 
> --------------------------------------------------------------------------------------------------------------------------------------
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index c36febb..cc023b8 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2151,7 +2151,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>                         goto nomem;
> 
>                 /* If the pagetables are shared don't copy or take references */
> -               if (dst_pte == src_pte)
> +               if (*(unsigned long *)dst_pte == *(unsigned long *)src_pte)
>                         continue;
>                 spin_lock(&dst->page_table_lock);

I think this is papering over the problem. Basically this check works
because after page table sharing, the parent and child are pointing to the
same data page even if they are not sharing page tables. As it's during
fork(), we cannot have faulted in parallel so the populated PTE must be
due to page table sharing. If the parent has not faulted the page, then
sharing is not attempted and again the problem is avoided. It would be a
new instance though of hugetlbfs just happening to work because of its
limitations - in this case, it works because we only share page tables
for MAP_SHARED.

Fundamentally I think the problem is that we are not correctly detecting
that page table sharing took place during huge_pte_alloc(). This patch is
longer and makes an API change but if I'm right, it addresses the underlying
problem. The first VM_MAYSHARE patch is still necessary but would you mind
testing this on top please?

---8<---
mm: hugetlbfs: Correctly detect if page tables have just been shared

Each page mapped in a process's address space must be correctly
accounted for in _mapcount. Normally the rules for this are
straightforward, but hugetlbfs page table sharing is different.
The page table pages at the PMD level are reference counted while
the mapcount remains the same. If this accounting is wrong, it causes
bugs like this one reported by Larry Woodman

[ 1106.156569] ------------[ cut here ]------------
[ 1106.161731] kernel BUG at mm/filemap.c:135!
[ 1106.166395] invalid opcode: 0000 [#1] SMP
[ 1106.170975] CPU 22
[ 1106.173115] Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi]
[ 1106.201770]
[ 1106.203426] Pid: 18001, comm: mpitest Tainted: G        W    3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2
[ 1106.213822] RIP: 0010:[<ffffffff8112cfed>]  [<ffffffff8112cfed>] __delete_from_page_cache+0x15d/0x170
[ 1106.224117] RSP: 0018:ffff880428973b88  EFLAGS: 00010002
[ 1106.230032] RAX: 0000000000000001 RBX: ffffea0006b80000 RCX: 00000000ffffffb0
[ 1106.237979] RDX: 0000000000016df1 RSI: 0000000000000009 RDI: ffff88043ffd9e00
[ 1106.245927] RBP: ffff880428973b98 R08: 0000000000000050 R09: 0000000000000003
[ 1106.253876] R10: 000000000000000d R11: 0000000000000000 R12: ffff880428708150
[ 1106.261826] R13: ffff880428708150 R14: 0000000000000000 R15: ffffea0006b80000
[ 1106.269780] FS:  0000000000000000(0000) GS:ffff88042fd60000(0000) knlGS:0000000000000000
[ 1106.278794] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1106.285193] CR2: 0000003a1d38c4a8 CR3: 000000000187d000 CR4: 00000000000406e0
[ 1106.293149] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1106.301097] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1106.309046] Process mpitest (pid: 18001, threadinfo ffff880428972000, task ffff880428b5cc20)
[ 1106.318447] Stack:
[ 1106.320690]  ffffea0006b80000 0000000000000000 ffff880428973bc8 ffffffff8112d040
[ 1106.328958]  ffff880428973bc8 00000000000002ab 00000000000002a0 ffff880428973c18
[ 1106.337234]  ffff880428973cc8 ffffffff8125b405 ffff880400000001 0000000000000000
[ 1106.345513] Call Trace:
[ 1106.348235]  [<ffffffff8112d040>] delete_from_page_cache+0x40/0x80
[ 1106.355128]  [<ffffffff8125b405>] truncate_hugepages+0x115/0x1f0
[ 1106.361826]  [<ffffffff8125b4f8>] hugetlbfs_evict_inode+0x18/0x30
[ 1106.368615]  [<ffffffff811ab1af>] evict+0x9f/0x1b0
[ 1106.373951]  [<ffffffff811ab3a3>] iput_final+0xe3/0x1e0
[ 1106.379773]  [<ffffffff811ab4de>] iput+0x3e/0x50
[ 1106.384922]  [<ffffffff811a8e18>] d_kill+0xf8/0x110
[ 1106.390356]  [<ffffffff811a8f12>] dput+0xe2/0x1b0
[ 1106.395595]  [<ffffffff81193612>] __fput+0x162/0x240

During fork(), copy_hugetlb_page_range() detects if huge_pte_alloc()
shared page tables with the check dst_pte == src_pte. The logic is if
the PMD page is the same, they must be shared. This assumes that the
sharing is between the parent and child. However, if the sharing is with
a different process entirely then this check fails as in this diagram.

parent
  |
  ------------>pmd
               src_pte----------> data page
                                      ^
other--------->pmd--------------------|
                ^
child-----------|
               dst_pte

For this situation to occur, it must be possible for Parent and Other
to have faulted and failed to share page tables with each other. This is
possible due to the following style of race.

PROC A						PROC B
copy_hugetlb_page_range				copy_hugetlb_page_range
  src_pte == huge_pte_offset			  src_pte == huge_pte_offset
  !src_pte so no sharing			  !src_pte so no sharing

(time passes)

hugetlb_fault					hugetlb_fault
  huge_pte_alloc				  huge_pte_alloc
    huge_pmd_share				   huge_pmd_share
      LOCK(i_mmap_mutex)
      find nothing, no sharing
      UNLOCK(i_mmap_mutex)
      						    LOCK(i_mmap_mutex)
      						    find nothing, no sharing
      						    UNLOCK(i_mmap_mutex)
    pmd_alloc					    pmd_alloc
    LOCK(instantiation_mutex)
    fault
    UNLOCK(instantiation_mutex)
    						LOCK(instantiation_mutex)
    						fault
    						UNLOCK(instantiation_mutex)

These two processes are now pointing to the same data page but are not
sharing page tables because the opportunity was missed. When either process
later forks, the src_pte == dst_pte check is potentially insufficient.  As
the check falls through, the wrong PTE information is copied in (harmless
but wrong) and the mapcount is bumped for a page mapped by a shared page
table, leading to the BUG_ON.
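
For completeness, the check that fires is the mapcount assertion in
__delete_from_page_cache() (3.4-era mm/filemap.c):

	BUG_ON(page_mapped(page));

i.e. the page is being removed from the page cache while _mapcount still
claims it is mapped somewhere.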

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 arch/ia64/mm/hugetlbpage.c    |    3 ++-
 arch/mips/mm/hugetlbpage.c    |    2 +-
 arch/powerpc/mm/hugetlbpage.c |    3 ++-
 arch/s390/mm/hugetlbpage.c    |    3 ++-
 arch/sh/mm/hugetlbpage.c      |    3 ++-
 arch/sparc/mm/hugetlbpage.c   |    3 ++-
 arch/tile/mm/hugetlbpage.c    |    3 ++-
 arch/x86/mm/hugetlbpage.c     |   13 ++++++++-----
 include/linux/hugetlb.h       |    3 ++-
 mm/hugetlb.c                  |   12 +++++++++---
 10 files changed, 32 insertions(+), 16 deletions(-)

diff --git a/arch/ia64/mm/hugetlbpage.c b/arch/ia64/mm/hugetlbpage.c
index 5ca674b..a0bb307 100644
--- a/arch/ia64/mm/hugetlbpage.c
+++ b/arch/ia64/mm/hugetlbpage.c
@@ -25,7 +25,8 @@ unsigned int hpage_shift = HPAGE_SHIFT_DEFAULT;
 EXPORT_SYMBOL(hpage_shift);
 
 pte_t *
-huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz)
+huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz,
+	       bool *shared)
 {
 	unsigned long taddr = htlbpage_to_page(addr);
 	pgd_t *pgd;
diff --git a/arch/mips/mm/hugetlbpage.c b/arch/mips/mm/hugetlbpage.c
index a7fee0d..06ca4a3 100644
--- a/arch/mips/mm/hugetlbpage.c
+++ b/arch/mips/mm/hugetlbpage.c
@@ -23,7 +23,7 @@
 #include <asm/tlbflush.h>
 
 pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr,
-		      unsigned long sz)
+		      unsigned long sz, bool *shared)
 {
 	pgd_t *pgd;
 	pud_t *pud;
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 1a6de0a..5fc6672 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -175,7 +175,8 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
 #define HUGEPD_PUD_SHIFT PMD_SHIFT
 #endif
 
-pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz)
+pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz,
+		      bool *shared)
 {
 	pgd_t *pg;
 	pud_t *pu;
diff --git a/arch/s390/mm/hugetlbpage.c b/arch/s390/mm/hugetlbpage.c
index 532525e..2c3b501 100644
--- a/arch/s390/mm/hugetlbpage.c
+++ b/arch/s390/mm/hugetlbpage.c
@@ -65,7 +65,8 @@ void arch_release_hugepage(struct page *page)
 }
 
 pte_t *huge_pte_alloc(struct mm_struct *mm,
-			unsigned long addr, unsigned long sz)
+			unsigned long addr, unsigned long sz,
+			bool *shared)
 {
 	pgd_t *pgdp;
 	pud_t *pudp;
diff --git a/arch/sh/mm/hugetlbpage.c b/arch/sh/mm/hugetlbpage.c
index d776234..bbe154d 100644
--- a/arch/sh/mm/hugetlbpage.c
+++ b/arch/sh/mm/hugetlbpage.c
@@ -22,7 +22,8 @@
 #include <asm/cacheflush.h>
 
 pte_t *huge_pte_alloc(struct mm_struct *mm,
-			unsigned long addr, unsigned long sz)
+			unsigned long addr, unsigned long sz,
+			bool *shared)
 {
 	pgd_t *pgd;
 	pud_t *pud;
diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index 07e1453..d3ef01b 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -194,7 +194,8 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 }
 
 pte_t *huge_pte_alloc(struct mm_struct *mm,
-			unsigned long addr, unsigned long sz)
+			unsigned long addr, unsigned long sz,
+			bool *shared)
 {
 	pgd_t *pgd;
 	pud_t *pud;
diff --git a/arch/tile/mm/hugetlbpage.c b/arch/tile/mm/hugetlbpage.c
index 812e2d0..db01091 100644
--- a/arch/tile/mm/hugetlbpage.c
+++ b/arch/tile/mm/hugetlbpage.c
@@ -84,7 +84,8 @@ static pte_t *pte_alloc_hugetlb(struct mm_struct *mm, pmd_t *pmd,
 #endif
 
 pte_t *huge_pte_alloc(struct mm_struct *mm,
-		      unsigned long addr, unsigned long sz)
+		      unsigned long addr, unsigned long sz,
+		      bool *shared)
 {
 	pgd_t *pgd;
 	pud_t *pud;
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index f6679a7..8c53064 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -58,7 +58,8 @@ static int vma_shareable(struct vm_area_struct *vma, unsigned long addr)
 /*
  * search for a shareable pmd page for hugetlb.
  */
-static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
+static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud,
+			   bool *shared)
 {
 	struct vm_area_struct *vma = find_vma(mm, addr);
 	struct address_space *mapping = vma->vm_file->f_mapping;
@@ -91,9 +92,10 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 		goto out;
 
 	spin_lock(&mm->page_table_lock);
-	if (pud_none(*pud))
+	if (pud_none(*pud)) {
 		pud_populate(mm, pud, (pmd_t *)((unsigned long)spte & PAGE_MASK));
-	else
+		*shared = true;
+	} else
 		put_page(virt_to_page(spte));
 	spin_unlock(&mm->page_table_lock);
 out:
@@ -128,7 +130,8 @@ int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
 }
 
 pte_t *huge_pte_alloc(struct mm_struct *mm,
-			unsigned long addr, unsigned long sz)
+			unsigned long addr, unsigned long sz,
+			bool *shared)
 {
 	pgd_t *pgd;
 	pud_t *pud;
@@ -142,7 +145,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 		} else {
 			BUG_ON(sz != PMD_SIZE);
 			if (pud_none(*pud))
-				huge_pmd_share(mm, addr, pud);
+				huge_pmd_share(mm, addr, pud, shared);
 			pte = (pte_t *) pmd_alloc(mm, pud, addr);
 		}
 	}
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 73c7782..68d2597 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -75,7 +75,8 @@ extern struct list_head huge_boot_pages;
 /* arch callbacks */
 
 pte_t *huge_pte_alloc(struct mm_struct *mm,
-			unsigned long addr, unsigned long sz);
+			unsigned long addr, unsigned long sz,
+			bool *shared);
 pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr);
 int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep);
 struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 71c93d7..45c2196 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2282,6 +2282,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	int cow;
 	struct hstate *h = hstate_vma(vma);
 	unsigned long sz = huge_page_size(h);
+	bool shared = false;
 
 	cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
 
@@ -2289,12 +2290,12 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		src_pte = huge_pte_offset(src, addr);
 		if (!src_pte)
 			continue;
-		dst_pte = huge_pte_alloc(dst, addr, sz);
+		dst_pte = huge_pte_alloc(dst, addr, sz, &shared);
 		if (!dst_pte)
 			goto nomem;
 
 		/* If the pagetables are shared don't copy or take references */
-		if (dst_pte == src_pte)
+		if (shared)
 			continue;
 
 		spin_lock(&dst->page_table_lock);
@@ -2817,6 +2818,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *pagecache_page = NULL;
 	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
 	struct hstate *h = hstate_vma(vma);
+	bool shared = false;
 
 	address &= huge_page_mask(h);
 
@@ -2831,10 +2833,14 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 				VM_FAULT_SET_HINDEX(hstate_index(h));
 	}
 
-	ptep = huge_pte_alloc(mm, address, huge_page_size(h));
+	ptep = huge_pte_alloc(mm, address, huge_page_size(h), &shared);
 	if (!ptep)
 		return VM_FAULT_OOM;
 
+	/* If the pagetable is shared, no other work is necessary */
+	if (shared)
+		return 0;
+
 	/*
 	 * Serialize hugepage allocation and instantiation, so that we don't
 	 * get spurious allocation failures if two CPUs race to instantiate

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
  2012-07-31 12:46                 ` Mel Gorman
@ 2012-07-31 13:07                   ` Larry Woodman
  2012-07-31 13:29                     ` Mel Gorman
  2012-07-31 13:21                   ` Michal Hocko
                                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 50+ messages in thread
From: Larry Woodman @ 2012-07-31 13:07 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Rik van Riel, Hugh Dickins, Michal Hocko, Linux-MM, David Gibson,
	Ken Chen, Cong Wang, LKML

On 07/31/2012 08:46 AM, Mel Gorman wrote:
> On Mon, Jul 30, 2012 at 03:11:27PM -0400, Larry Woodman wrote:
>>> <SNIP>
>>> That is a surprise. Can you try your test case on 3.4 and tell us if the
>>> patch fixes the problem there? I would like to rule out the possibility
>>> that the locking rules are slightly different in RHEL. If it hits on 3.4
>>> then it's also possible you are seeing a different bug, more on this later.
>>>
>> Sorry for the delay Mel, here is the BUG() traceback from the 3.4
>> kernel with your patches:
>>
>> --------------------------------------------------------------------------------------------------------------------------------------------
>> [ 1106.156569] ------------[ cut here ]------------
>> [ 1106.161731] kernel BUG at mm/filemap.c:135!
>> [ 1106.166395] invalid opcode: 0000 [#1] SMP
>> [ 1106.170975] CPU 22
>> [ 1106.173115] Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi]
>> [ 1106.201770]
> Thanks, looks very similar.
>
>> [ 1106.203426] Pid: 18001, comm: mpitest Tainted: G        W    3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2
> You say this was a 3.4 kernel but the message says 3.3. Probably not
> relevant, just interesting.
>
Oh, sorry, I posted the wrong traceback.  I tested both 3.3 & 3.4 and had
the same results.  I'll do it again and post the 3.4 traceback for you.

Larry


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
  2012-07-31 12:46                 ` Mel Gorman
  2012-07-31 13:07                   ` Larry Woodman
@ 2012-07-31 13:21                   ` Michal Hocko
  2012-07-31 17:49                   ` Larry Woodman
  2012-07-31 18:03                   ` Rik van Riel
  3 siblings, 0 replies; 50+ messages in thread
From: Michal Hocko @ 2012-07-31 13:21 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Larry Woodman, Rik van Riel, Hugh Dickins, Linux-MM,
	David Gibson, Ken Chen, Cong Wang, LKML

On Tue 31-07-12 13:46:50, Mel Gorman wrote:
[...]
> mm: hugetlbfs: Correctly detect if page tables have just been shared
> 
> Each page mapped in a process's address space must be correctly
> accounted for in _mapcount. Normally the rules for this are
> straightforward, but hugetlbfs page table sharing is different.
> The page table pages at the PMD level are reference counted while
> the mapcount remains the same. If this accounting is wrong, it causes
> bugs like this one reported by Larry Woodman
> 
> [ 1106.156569] ------------[ cut here ]------------
> [ 1106.161731] kernel BUG at mm/filemap.c:135!
> [ 1106.166395] invalid opcode: 0000 [#1] SMP
> [ 1106.170975] CPU 22
> [ 1106.173115] Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi]
> [ 1106.201770]
> [ 1106.203426] Pid: 18001, comm: mpitest Tainted: G        W    3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2
> [ 1106.213822] RIP: 0010:[<ffffffff8112cfed>]  [<ffffffff8112cfed>] __delete_from_page_cache+0x15d/0x170
> [ 1106.224117] RSP: 0018:ffff880428973b88  EFLAGS: 00010002
> [ 1106.230032] RAX: 0000000000000001 RBX: ffffea0006b80000 RCX: 00000000ffffffb0
> [ 1106.237979] RDX: 0000000000016df1 RSI: 0000000000000009 RDI: ffff88043ffd9e00
> [ 1106.245927] RBP: ffff880428973b98 R08: 0000000000000050 R09: 0000000000000003
> [ 1106.253876] R10: 000000000000000d R11: 0000000000000000 R12: ffff880428708150
> [ 1106.261826] R13: ffff880428708150 R14: 0000000000000000 R15: ffffea0006b80000
> [ 1106.269780] FS:  0000000000000000(0000) GS:ffff88042fd60000(0000) knlGS:0000000000000000
> [ 1106.278794] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1106.285193] CR2: 0000003a1d38c4a8 CR3: 000000000187d000 CR4: 00000000000406e0
> [ 1106.293149] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 1106.301097] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 1106.309046] Process mpitest (pid: 18001, threadinfo ffff880428972000, task ffff880428b5cc20)
> [ 1106.318447] Stack:
> [ 1106.320690]  ffffea0006b80000 0000000000000000 ffff880428973bc8 ffffffff8112d040
> [ 1106.328958]  ffff880428973bc8 00000000000002ab 00000000000002a0 ffff880428973c18
> [ 1106.337234]  ffff880428973cc8 ffffffff8125b405 ffff880400000001 0000000000000000
> [ 1106.345513] Call Trace:
> [ 1106.348235]  [<ffffffff8112d040>] delete_from_page_cache+0x40/0x80
> [ 1106.355128]  [<ffffffff8125b405>] truncate_hugepages+0x115/0x1f0
> [ 1106.361826]  [<ffffffff8125b4f8>] hugetlbfs_evict_inode+0x18/0x30
> [ 1106.368615]  [<ffffffff811ab1af>] evict+0x9f/0x1b0
> [ 1106.373951]  [<ffffffff811ab3a3>] iput_final+0xe3/0x1e0
> [ 1106.379773]  [<ffffffff811ab4de>] iput+0x3e/0x50
> [ 1106.384922]  [<ffffffff811a8e18>] d_kill+0xf8/0x110
> [ 1106.390356]  [<ffffffff811a8f12>] dput+0xe2/0x1b0
> [ 1106.395595]  [<ffffffff81193612>] __fput+0x162/0x240
> 
> During fork(), copy_hugetlb_page_range() detects if huge_pte_alloc()
> shared page tables with the check dst_pte == src_pte. The logic is if
> the PMD page is the same, they must be shared. This assumes that the
> sharing is between the parent and child. However, if the sharing is with
> a different process entirely then this check fails as in this diagram.
> 
> parent
>   |
>   ------------>pmd
>                src_pte----------> data page
>                                       ^
> other--------->pmd--------------------|
>                 ^
> child-----------|
>                dst_pte
> 
> For this situation to occur, it must be possible for Parent and Other
> to have faulted and failed to share page tables with each other. This is
> possible due to the following style of race.
> 
> PROC A						PROC B
> copy_hugetlb_page_range				copy_hugetlb_page_range
>   src_pte == huge_pte_offset			  src_pte == huge_pte_offset
>   !src_pte so no sharing			  !src_pte so no sharing
> 
> (time passes)
> 
> hugetlb_fault					hugetlb_fault
>   huge_pte_alloc				  huge_pte_alloc
>     huge_pmd_share				   huge_pmd_share
>       LOCK(i_mmap_mutex)
>       find nothing, no sharing
>       UNLOCK(i_mmap_mutex)
>       						    LOCK(i_mmap_mutex)
>       						    find nothing, no sharing
>       						    UNLOCK(i_mmap_mutex)
>     pmd_alloc					    pmd_alloc
>     LOCK(instantiation_mutex)
>     fault
>     UNLOCK(instantiation_mutex)
>     						LOCK(instantiation_mutex)
>     						fault
>     						UNLOCK(instantiation_mutex)

Makes sense. I was wondering how the child could share with somebody
else while its pmd would be different from the parent's, because the parent
is blocked by mmap_sem (held for write) so no concurrent faults are allowed.
The other process should see the parent's pmd if it is shareable. I obviously
underestimated that the sharing could fail earlier and that
mapping->i_mmap can contain 2 different candidates for sharing for the same
address.
Thanks!

> These two processes are now pointing to the same data page but are not
> sharing page tables because the opportunity was missed. When either process
> later forks, the src_pte == dst_pte check is potentially insufficient.  As
> the check falls through, the wrong PTE information is copied in (harmless
> but wrong) and the mapcount is bumped for a page mapped by a shared page
> table, leading to the BUG_ON.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Michal Hocko <mhocko@suse.cz>

> ---
>  arch/ia64/mm/hugetlbpage.c    |    3 ++-
>  arch/mips/mm/hugetlbpage.c    |    2 +-
>  arch/powerpc/mm/hugetlbpage.c |    3 ++-
>  arch/s390/mm/hugetlbpage.c    |    3 ++-
>  arch/sh/mm/hugetlbpage.c      |    3 ++-
>  arch/sparc/mm/hugetlbpage.c   |    3 ++-
>  arch/tile/mm/hugetlbpage.c    |    3 ++-
>  arch/x86/mm/hugetlbpage.c     |   13 ++++++++-----
>  include/linux/hugetlb.h       |    3 ++-
>  mm/hugetlb.c                  |   12 +++++++++---
>  10 files changed, 32 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/ia64/mm/hugetlbpage.c b/arch/ia64/mm/hugetlbpage.c
> index 5ca674b..a0bb307 100644
> --- a/arch/ia64/mm/hugetlbpage.c
> +++ b/arch/ia64/mm/hugetlbpage.c
> @@ -25,7 +25,8 @@ unsigned int hpage_shift = HPAGE_SHIFT_DEFAULT;
>  EXPORT_SYMBOL(hpage_shift);
>  
>  pte_t *
> -huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz)
> +huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz,
> +	       bool *shared)
>  {
>  	unsigned long taddr = htlbpage_to_page(addr);
>  	pgd_t *pgd;
> diff --git a/arch/mips/mm/hugetlbpage.c b/arch/mips/mm/hugetlbpage.c
> index a7fee0d..06ca4a3 100644
> --- a/arch/mips/mm/hugetlbpage.c
> +++ b/arch/mips/mm/hugetlbpage.c
> @@ -23,7 +23,7 @@
>  #include <asm/tlbflush.h>
>  
>  pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr,
> -		      unsigned long sz)
> +		      unsigned long sz, bool *shared)
>  {
>  	pgd_t *pgd;
>  	pud_t *pud;
> diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
> index 1a6de0a..5fc6672 100644
> --- a/arch/powerpc/mm/hugetlbpage.c
> +++ b/arch/powerpc/mm/hugetlbpage.c
> @@ -175,7 +175,8 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
>  #define HUGEPD_PUD_SHIFT PMD_SHIFT
>  #endif
>  
> -pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz)
> +pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz,
> +		      bool *shared)
>  {
>  	pgd_t *pg;
>  	pud_t *pu;
> diff --git a/arch/s390/mm/hugetlbpage.c b/arch/s390/mm/hugetlbpage.c
> index 532525e..2c3b501 100644
> --- a/arch/s390/mm/hugetlbpage.c
> +++ b/arch/s390/mm/hugetlbpage.c
> @@ -65,7 +65,8 @@ void arch_release_hugepage(struct page *page)
>  }
>  
>  pte_t *huge_pte_alloc(struct mm_struct *mm,
> -			unsigned long addr, unsigned long sz)
> +			unsigned long addr, unsigned long sz,
> +			bool *shared)
>  {
>  	pgd_t *pgdp;
>  	pud_t *pudp;
> diff --git a/arch/sh/mm/hugetlbpage.c b/arch/sh/mm/hugetlbpage.c
> index d776234..bbe154d 100644
> --- a/arch/sh/mm/hugetlbpage.c
> +++ b/arch/sh/mm/hugetlbpage.c
> @@ -22,7 +22,8 @@
>  #include <asm/cacheflush.h>
>  
>  pte_t *huge_pte_alloc(struct mm_struct *mm,
> -			unsigned long addr, unsigned long sz)
> +			unsigned long addr, unsigned long sz,
> +			bool *shared)
>  {
>  	pgd_t *pgd;
>  	pud_t *pud;
> diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
> index 07e1453..d3ef01b 100644
> --- a/arch/sparc/mm/hugetlbpage.c
> +++ b/arch/sparc/mm/hugetlbpage.c
> @@ -194,7 +194,8 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
>  }
>  
>  pte_t *huge_pte_alloc(struct mm_struct *mm,
> -			unsigned long addr, unsigned long sz)
> +			unsigned long addr, unsigned long sz,
> +			bool *shared)
>  {
>  	pgd_t *pgd;
>  	pud_t *pud;
> diff --git a/arch/tile/mm/hugetlbpage.c b/arch/tile/mm/hugetlbpage.c
> index 812e2d0..db01091 100644
> --- a/arch/tile/mm/hugetlbpage.c
> +++ b/arch/tile/mm/hugetlbpage.c
> @@ -84,7 +84,8 @@ static pte_t *pte_alloc_hugetlb(struct mm_struct *mm, pmd_t *pmd,
>  #endif
>  
>  pte_t *huge_pte_alloc(struct mm_struct *mm,
> -		      unsigned long addr, unsigned long sz)
> +		      unsigned long addr, unsigned long sz,
> +		      bool *shared)
>  {
>  	pgd_t *pgd;
>  	pud_t *pud;
> diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> index f6679a7..8c53064 100644
> --- a/arch/x86/mm/hugetlbpage.c
> +++ b/arch/x86/mm/hugetlbpage.c
> @@ -58,7 +58,8 @@ static int vma_shareable(struct vm_area_struct *vma, unsigned long addr)
>  /*
>   * search for a shareable pmd page for hugetlb.
>   */
> -static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> +static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud,
> +			   bool *shared)
>  {
>  	struct vm_area_struct *vma = find_vma(mm, addr);
>  	struct address_space *mapping = vma->vm_file->f_mapping;
> @@ -91,9 +92,10 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>  		goto out;
>  
>  	spin_lock(&mm->page_table_lock);
> -	if (pud_none(*pud))
> +	if (pud_none(*pud)) {
>  		pud_populate(mm, pud, (pmd_t *)((unsigned long)spte & PAGE_MASK));
> -	else
> +		*shared = true;
> +	} else
>  		put_page(virt_to_page(spte));
>  	spin_unlock(&mm->page_table_lock);
>  out:
> @@ -128,7 +130,8 @@ int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
>  }
>  
>  pte_t *huge_pte_alloc(struct mm_struct *mm,
> -			unsigned long addr, unsigned long sz)
> +			unsigned long addr, unsigned long sz,
> +			bool *shared)
>  {
>  	pgd_t *pgd;
>  	pud_t *pud;
> @@ -142,7 +145,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
>  		} else {
>  			BUG_ON(sz != PMD_SIZE);
>  			if (pud_none(*pud))
> -				huge_pmd_share(mm, addr, pud);
> +				huge_pmd_share(mm, addr, pud, shared);
>  			pte = (pte_t *) pmd_alloc(mm, pud, addr);
>  		}
>  	}
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 73c7782..68d2597 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -75,7 +75,8 @@ extern struct list_head huge_boot_pages;
>  /* arch callbacks */
>  
>  pte_t *huge_pte_alloc(struct mm_struct *mm,
> -			unsigned long addr, unsigned long sz);
> +			unsigned long addr, unsigned long sz,
> +			bool *shared);
>  pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr);
>  int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep);
>  struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 71c93d7..45c2196 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2282,6 +2282,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>  	int cow;
>  	struct hstate *h = hstate_vma(vma);
>  	unsigned long sz = huge_page_size(h);
> +	bool shared = false;
>  
>  	cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
>  
> @@ -2289,12 +2290,12 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>  		src_pte = huge_pte_offset(src, addr);
>  		if (!src_pte)
>  			continue;
> -		dst_pte = huge_pte_alloc(dst, addr, sz);
> +		dst_pte = huge_pte_alloc(dst, addr, sz, &shared);
>  		if (!dst_pte)
>  			goto nomem;
>  
>  		/* If the pagetables are shared don't copy or take references */
> -		if (dst_pte == src_pte)
> +		if (shared)
>  			continue;
>  
>  		spin_lock(&dst->page_table_lock);
> @@ -2817,6 +2818,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  	struct page *pagecache_page = NULL;
>  	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
>  	struct hstate *h = hstate_vma(vma);
> +	bool shared = false;
>  
>  	address &= huge_page_mask(h);
>  
> @@ -2831,10 +2833,14 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  				VM_FAULT_SET_HINDEX(hstate_index(h));
>  	}
>  
> -	ptep = huge_pte_alloc(mm, address, huge_page_size(h));
> +	ptep = huge_pte_alloc(mm, address, huge_page_size(h), &shared);
>  	if (!ptep)
>  		return VM_FAULT_OOM;
>  
> +	/* If the pagetable is shared, no other work is necessary */
> +	if (shared)
> +		return 0;
> +
>  	/*
>  	 * Serialize hugepage allocation and instantiation, so that we don't
>  	 * get spurious allocation failures if two CPUs race to instantiate

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
  2012-07-31 13:07                   ` Larry Woodman
@ 2012-07-31 13:29                     ` Mel Gorman
  0 siblings, 0 replies; 50+ messages in thread
From: Mel Gorman @ 2012-07-31 13:29 UTC (permalink / raw)
  To: Larry Woodman
  Cc: Rik van Riel, Hugh Dickins, Michal Hocko, Linux-MM, David Gibson,
	Ken Chen, Cong Wang, LKML

On Tue, Jul 31, 2012 at 09:07:14AM -0400, Larry Woodman wrote:
> On 07/31/2012 08:46 AM, Mel Gorman wrote:
> >On Mon, Jul 30, 2012 at 03:11:27PM -0400, Larry Woodman wrote:
> >>><SNIP>
> >>>That is a surprise. Can you try your test case on 3.4 and tell us if the
> >>>patch fixes the problem there? I would like to rule out the possibility
> >>>that the locking rules are slightly different in RHEL. If it hits on 3.4
> >>>then it's also possible you are seeing a different bug, more on this later.
> >>>
> >>Sorry for the delay Mel, here is the BUG() traceback from the 3.4
> >>kernel with your patches:
> >>
> >>--------------------------------------------------------------------------------------------------------------------------------------------
> >>[ 1106.156569] ------------[ cut here ]------------
> >>[ 1106.161731] kernel BUG at mm/filemap.c:135!
> >>[ 1106.166395] invalid opcode: 0000 [#1] SMP
> >>[ 1106.170975] CPU 22
> >>[ 1106.173115] Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi]
> >>[ 1106.201770]
> >Thanks, looks very similar.
> >
> >>[ 1106.203426] Pid: 18001, comm: mpitest Tainted: G        W    3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2
> >You say this was a 3.4 kernel but the message says 3.3. Probably not
> >relevant, just interesting.
> >
> Oh, sorry I posted the wrong traceback.  I tested both 3.3 & 3.4 and
> had the same results.
> I'll do it again and post the 3.4 traceback for you,

It'll probably be the same. The likelihood is that the bug is really old and
did not change between 3.3 and 3.4. I mentioned it in case you accidentally
tested with an old kernel that was not patched or patched with something
different. I considered this to be very unlikely though and you already
said that RHEL was affected so it's probably the same bug seen in all
three.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
  2012-07-31 12:46                 ` Mel Gorman
  2012-07-31 13:07                   ` Larry Woodman
  2012-07-31 13:21                   ` Michal Hocko
@ 2012-07-31 17:49                   ` Larry Woodman
  2012-07-31 20:06                     ` Michal Hocko
  2012-07-31 18:03                   ` Rik van Riel
  3 siblings, 1 reply; 50+ messages in thread
From: Larry Woodman @ 2012-07-31 17:49 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Rik van Riel, Hugh Dickins, Michal Hocko, Linux-MM, David Gibson,
	Ken Chen, Cong Wang, LKML

On 07/31/2012 08:46 AM, Mel Gorman wrote:
>
> Fundamentally I think the problem is that we are not correctly detecting
> that page table sharing took place during huge_pte_alloc(). This patch is
> longer and makes an API change but if I'm right, it addresses the underlying
> problem. The first VM_MAYSHARE patch is still necessary but would you mind
> testing this on top please?
Hi Mel, yes this does work just fine.  It ran for hours without a panic so
I'll Ack this one if you send it to the list.

Larry



> ---8<---
> mm: hugetlbfs: Correctly detect if page tables have just been shared
>
> Each page mapped in a process's address space must be correctly
> accounted for in _mapcount. Normally the rules for this are
> straightforward but hugetlbfs page table sharing is different.
> The page table pages at the PMD level are reference counted while
> the mapcount remains the same. If this accounting is wrong, it causes
> bugs like this one reported by Larry Woodman
>
> [ 1106.156569] ------------[ cut here ]------------
> [ 1106.161731] kernel BUG at mm/filemap.c:135!
> [ 1106.166395] invalid opcode: 0000 [#1] SMP
> [ 1106.170975] CPU 22
> [ 1106.173115] Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi]
> [ 1106.201770]
> [ 1106.203426] Pid: 18001, comm: mpitest Tainted: G        W    3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2
> [ 1106.213822] RIP: 0010:[<ffffffff8112cfed>]  [<ffffffff8112cfed>] __delete_from_page_cache+0x15d/0x170
> [ 1106.224117] RSP: 0018:ffff880428973b88  EFLAGS: 00010002
> [ 1106.230032] RAX: 0000000000000001 RBX: ffffea0006b80000 RCX: 00000000ffffffb0
> [ 1106.237979] RDX: 0000000000016df1 RSI: 0000000000000009 RDI: ffff88043ffd9e00
> [ 1106.245927] RBP: ffff880428973b98 R08: 0000000000000050 R09: 0000000000000003
> [ 1106.253876] R10: 000000000000000d R11: 0000000000000000 R12: ffff880428708150
> [ 1106.261826] R13: ffff880428708150 R14: 0000000000000000 R15: ffffea0006b80000
> [ 1106.269780] FS:  0000000000000000(0000) GS:ffff88042fd60000(0000) knlGS:0000000000000000
> [ 1106.278794] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1106.285193] CR2: 0000003a1d38c4a8 CR3: 000000000187d000 CR4: 00000000000406e0
> [ 1106.293149] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 1106.301097] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 1106.309046] Process mpitest (pid: 18001, threadinfo ffff880428972000, task ffff880428b5cc20)
> [ 1106.318447] Stack:
> [ 1106.320690]  ffffea0006b80000 0000000000000000 ffff880428973bc8 ffffffff8112d040
> [ 1106.328958]  ffff880428973bc8 00000000000002ab 00000000000002a0 ffff880428973c18
> [ 1106.337234]  ffff880428973cc8 ffffffff8125b405 ffff880400000001 0000000000000000
> [ 1106.345513] Call Trace:
> [ 1106.348235]  [<ffffffff8112d040>] delete_from_page_cache+0x40/0x80
> [ 1106.355128]  [<ffffffff8125b405>] truncate_hugepages+0x115/0x1f0
> [ 1106.361826]  [<ffffffff8125b4f8>] hugetlbfs_evict_inode+0x18/0x30
> [ 1106.368615]  [<ffffffff811ab1af>] evict+0x9f/0x1b0
> [ 1106.373951]  [<ffffffff811ab3a3>] iput_final+0xe3/0x1e0
> [ 1106.379773]  [<ffffffff811ab4de>] iput+0x3e/0x50
> [ 1106.384922]  [<ffffffff811a8e18>] d_kill+0xf8/0x110
> [ 1106.390356]  [<ffffffff811a8f12>] dput+0xe2/0x1b0
> [ 1106.395595]  [<ffffffff81193612>] __fput+0x162/0x240
>
> During fork(), copy_hugetlb_page_range() detects if huge_pte_alloc()
> shared page tables with the check dst_pte == src_pte. The logic is if
> the PMD page is the same, they must be shared. This assumes that the
> sharing is between the parent and child. However, if the sharing is with
> a different process entirely then this check fails as in this diagram.
>
> parent
>    |
>    ------------>pmd
>                 src_pte---------->  data page
>                                        ^
> other--------->pmd--------------------|
>                  ^
> child-----------|
>                 dst_pte
>
> For this situation to occur, it must be possible for Parent and Other
> to have faulted and failed to share page tables with each other. This is
> possible due to the following style of race.
>
> PROC A						PROC B
> copy_hugetlb_page_range				copy_hugetlb_page_range
>    src_pte == huge_pte_offset			  src_pte == huge_pte_offset
>    !src_pte so no sharing			  !src_pte so no sharing
>
> (time passes)
>
> hugetlb_fault					hugetlb_fault
>    huge_pte_alloc				  huge_pte_alloc
>      huge_pmd_share				   huge_pmd_share
>        LOCK(i_mmap_mutex)
>        find nothing, no sharing
>        UNLOCK(i_mmap_mutex)
>        						    LOCK(i_mmap_mutex)
>        						    find nothing, no sharing
>        						    UNLOCK(i_mmap_mutex)
>      pmd_alloc					    pmd_alloc
>      LOCK(instantiation_mutex)
>      fault
>      UNLOCK(instantiation_mutex)
>      						LOCK(instantiation_mutex)
>      						fault
>      						UNLOCK(instantiation_mutex)
>
> These two processes are now pointing to the same data page but are not sharing
> page tables because the opportunity was missed. When either process later
> forks, the src_pte == dst_pte check is potentially insufficient.  As the check
> falls through, the wrong PTE information is copied in (harmless but wrong)
> and the mapcount is bumped for a page mapped by a shared page table leading
> to the BUG_ON.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>   arch/ia64/mm/hugetlbpage.c    |    3 ++-
>   arch/mips/mm/hugetlbpage.c    |    2 +-
>   arch/powerpc/mm/hugetlbpage.c |    3 ++-
>   arch/s390/mm/hugetlbpage.c    |    3 ++-
>   arch/sh/mm/hugetlbpage.c      |    3 ++-
>   arch/sparc/mm/hugetlbpage.c   |    3 ++-
>   arch/tile/mm/hugetlbpage.c    |    3 ++-
>   arch/x86/mm/hugetlbpage.c     |   13 ++++++++-----
>   include/linux/hugetlb.h       |    3 ++-
>   mm/hugetlb.c                  |   12 +++++++++---
>   10 files changed, 32 insertions(+), 16 deletions(-)
>
> diff --git a/arch/ia64/mm/hugetlbpage.c b/arch/ia64/mm/hugetlbpage.c
> index 5ca674b..a0bb307 100644
> --- a/arch/ia64/mm/hugetlbpage.c
> +++ b/arch/ia64/mm/hugetlbpage.c
> @@ -25,7 +25,8 @@ unsigned int hpage_shift = HPAGE_SHIFT_DEFAULT;
>   EXPORT_SYMBOL(hpage_shift);
>
>   pte_t *
> -huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz)
> +huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz,
> +	       bool *shared)
>   {
>   	unsigned long taddr = htlbpage_to_page(addr);
>   	pgd_t *pgd;
> diff --git a/arch/mips/mm/hugetlbpage.c b/arch/mips/mm/hugetlbpage.c
> index a7fee0d..06ca4a3 100644
> --- a/arch/mips/mm/hugetlbpage.c
> +++ b/arch/mips/mm/hugetlbpage.c
> @@ -23,7 +23,7 @@
>   #include <asm/tlbflush.h>
>
>   pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr,
> -		      unsigned long sz)
> +		      unsigned long sz, bool *shared)
>   {
>   	pgd_t *pgd;
>   	pud_t *pud;
> diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
> index 1a6de0a..5fc6672 100644
> --- a/arch/powerpc/mm/hugetlbpage.c
> +++ b/arch/powerpc/mm/hugetlbpage.c
> @@ -175,7 +175,8 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
>   #define HUGEPD_PUD_SHIFT PMD_SHIFT
>   #endif
>
> -pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz)
> +pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz,
> +		      bool *shared)
>   {
>   	pgd_t *pg;
>   	pud_t *pu;
> diff --git a/arch/s390/mm/hugetlbpage.c b/arch/s390/mm/hugetlbpage.c
> index 532525e..2c3b501 100644
> --- a/arch/s390/mm/hugetlbpage.c
> +++ b/arch/s390/mm/hugetlbpage.c
> @@ -65,7 +65,8 @@ void arch_release_hugepage(struct page *page)
>   }
>
>   pte_t *huge_pte_alloc(struct mm_struct *mm,
> -			unsigned long addr, unsigned long sz)
> +			unsigned long addr, unsigned long sz,
> +			bool *shared)
>   {
>   	pgd_t *pgdp;
>   	pud_t *pudp;
> diff --git a/arch/sh/mm/hugetlbpage.c b/arch/sh/mm/hugetlbpage.c
> index d776234..bbe154d 100644
> --- a/arch/sh/mm/hugetlbpage.c
> +++ b/arch/sh/mm/hugetlbpage.c
> @@ -22,7 +22,8 @@
>   #include <asm/cacheflush.h>
>
>   pte_t *huge_pte_alloc(struct mm_struct *mm,
> -			unsigned long addr, unsigned long sz)
> +			unsigned long addr, unsigned long sz,
> +			bool *shared)
>   {
>   	pgd_t *pgd;
>   	pud_t *pud;
> diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
> index 07e1453..d3ef01b 100644
> --- a/arch/sparc/mm/hugetlbpage.c
> +++ b/arch/sparc/mm/hugetlbpage.c
> @@ -194,7 +194,8 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
>   }
>
>   pte_t *huge_pte_alloc(struct mm_struct *mm,
> -			unsigned long addr, unsigned long sz)
> +			unsigned long addr, unsigned long sz,
> +			bool *shared)
>   {
>   	pgd_t *pgd;
>   	pud_t *pud;
> diff --git a/arch/tile/mm/hugetlbpage.c b/arch/tile/mm/hugetlbpage.c
> index 812e2d0..db01091 100644
> --- a/arch/tile/mm/hugetlbpage.c
> +++ b/arch/tile/mm/hugetlbpage.c
> @@ -84,7 +84,8 @@ static pte_t *pte_alloc_hugetlb(struct mm_struct *mm, pmd_t *pmd,
>   #endif
>
>   pte_t *huge_pte_alloc(struct mm_struct *mm,
> -		      unsigned long addr, unsigned long sz)
> +		      unsigned long addr, unsigned long sz,
> +		      bool *shared)
>   {
>   	pgd_t *pgd;
>   	pud_t *pud;
> diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> index f6679a7..8c53064 100644
> --- a/arch/x86/mm/hugetlbpage.c
> +++ b/arch/x86/mm/hugetlbpage.c
> @@ -58,7 +58,8 @@ static int vma_shareable(struct vm_area_struct *vma, unsigned long addr)
>   /*
>    * search for a shareable pmd page for hugetlb.
>    */
> -static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> +static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud,
> +			   bool *shared)
>   {
>   	struct vm_area_struct *vma = find_vma(mm, addr);
>   	struct address_space *mapping = vma->vm_file->f_mapping;
> @@ -91,9 +92,10 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>   		goto out;
>
>   	spin_lock(&mm->page_table_lock);
> -	if (pud_none(*pud))
> +	if (pud_none(*pud)) {
>   		pud_populate(mm, pud, (pmd_t *)((unsigned long)spte & PAGE_MASK));
> -	else
> +		*shared = true;
> +	} else
>   		put_page(virt_to_page(spte));
>   	spin_unlock(&mm->page_table_lock);
>   out:
> @@ -128,7 +130,8 @@ int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
>   }
>
>   pte_t *huge_pte_alloc(struct mm_struct *mm,
> -			unsigned long addr, unsigned long sz)
> +			unsigned long addr, unsigned long sz,
> +			bool *shared)
>   {
>   	pgd_t *pgd;
>   	pud_t *pud;
> @@ -142,7 +145,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
>   		} else {
>   			BUG_ON(sz != PMD_SIZE);
>   			if (pud_none(*pud))
> -				huge_pmd_share(mm, addr, pud);
> +				huge_pmd_share(mm, addr, pud, shared);
>   			pte = (pte_t *) pmd_alloc(mm, pud, addr);
>   		}
>   	}
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 73c7782..68d2597 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -75,7 +75,8 @@ extern struct list_head huge_boot_pages;
>   /* arch callbacks */
>
>   pte_t *huge_pte_alloc(struct mm_struct *mm,
> -			unsigned long addr, unsigned long sz);
> +			unsigned long addr, unsigned long sz,
> +			bool *shared);
>   pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr);
>   int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep);
>   struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 71c93d7..45c2196 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2282,6 +2282,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>   	int cow;
>   	struct hstate *h = hstate_vma(vma);
>   	unsigned long sz = huge_page_size(h);
> +	bool shared = false;
>
>   	cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
>
> @@ -2289,12 +2290,12 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>   		src_pte = huge_pte_offset(src, addr);
>   		if (!src_pte)
>   			continue;
> -		dst_pte = huge_pte_alloc(dst, addr, sz);
> +		dst_pte = huge_pte_alloc(dst, addr, sz, &shared);
>   		if (!dst_pte)
>   			goto nomem;
>
>   		/* If the pagetables are shared don't copy or take references */
> -		if (dst_pte == src_pte)
> +		if (shared)
>   			continue;
>
>   		spin_lock(&dst->page_table_lock);
> @@ -2817,6 +2818,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>   	struct page *pagecache_page = NULL;
>   	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
>   	struct hstate *h = hstate_vma(vma);
> +	bool shared = false;
>
>   	address &= huge_page_mask(h);
>
> @@ -2831,10 +2833,14 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>   				VM_FAULT_SET_HINDEX(hstate_index(h));
>   	}
>
> -	ptep = huge_pte_alloc(mm, address, huge_page_size(h));
> +	ptep = huge_pte_alloc(mm, address, huge_page_size(h), &shared);
>   	if (!ptep)
>   		return VM_FAULT_OOM;
>
> +	/* If the pagetable is shared, no other work is necessary */
> +	if (shared)
> +		return 0;
> +
>   	/*
>   	 * Serialize hugepage allocation and instantiation, so that we don't
>   	 * get spurious allocation failures if two CPUs race to instantiate
>
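
[To make the missed-sharing failure above concrete, a standalone toy
(invented names; a plain int stands in for a pmd page) of why pointer
equality in copy_hugetlb_page_range() cannot see third-party sharing:

	#include <stdio.h>

	/* Toy model of the diagram in the changelog: "other" lost the
	 * race to share with "parent", so each got its own pmd page;
	 * the child then shared with "other" during fork.
	 */
	struct mm { int *pmd; };

	int main(void)
	{
		int pmd_parent = 0, pmd_other = 0;

		struct mm parent = { &pmd_parent };
		struct mm other  = { &pmd_other }; /* missed sharing window */
		struct mm child  = { &pmd_other }; /* shares with "other"   */

		/* the old test: shared iff child and parent use the same page */
		printf("child shares with parent: %d\n", child.pmd == parent.pmd);
		printf("child shares with other:  %d\n", child.pmd == other.pmd);
		/* 0 then 1: the copy path runs and wrongly bumps _mapcount
		 * for a page really mapped through a refcounted, shared pmd */
		return 0;
	}
]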


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
  2012-07-31 12:46                 ` Mel Gorman
                                     ` (2 preceding siblings ...)
  2012-07-31 17:49                   ` Larry Woodman
@ 2012-07-31 18:03                   ` Rik van Riel
  3 siblings, 0 replies; 50+ messages in thread
From: Rik van Riel @ 2012-07-31 18:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Larry Woodman, Hugh Dickins, Michal Hocko, Linux-MM,
	David Gibson, Ken Chen, Cong Wang, LKML

On 07/31/2012 08:46 AM, Mel Gorman wrote:

> mm: hugetlbfs: Correctly detect if page tables have just been shared
>
> Each page mapped in a process's address space must be correctly
> accounted for in _mapcount. Normally the rules for this are
> straightforward but hugetlbfs page table sharing is different.
> The page table pages at the PMD level are reference counted while
> the mapcount remains the same. If this accounting is wrong, it causes
> bugs like this one reported by Larry Woodman

> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
  2012-07-31 17:49                   ` Larry Woodman
@ 2012-07-31 20:06                     ` Michal Hocko
  2012-07-31 20:57                       ` Larry Woodman
  2012-08-01  2:45                       ` Larry Woodman
  0 siblings, 2 replies; 50+ messages in thread
From: Michal Hocko @ 2012-07-31 20:06 UTC (permalink / raw)
  To: Larry Woodman
  Cc: Mel Gorman, Rik van Riel, Hugh Dickins, Linux-MM, David Gibson,
	Ken Chen, Cong Wang, LKML

On Tue 31-07-12 13:49:21, Larry Woodman wrote:
> On 07/31/2012 08:46 AM, Mel Gorman wrote:
> >
> >Fundamentally I think the problem is that we are not correctly detecting
> >that page table sharing took place during huge_pte_alloc(). This patch is
> >longer and makes an API change but if I'm right, it addresses the underlying
> >problem. The first VM_MAYSHARE patch is still necessary but would you mind
> >testing this on top please?
> Hi Mel, yes this does work just fine.  It ran for hours without a panic so
> I'll Ack this one if you send it to the list.

Hi Larry, thanks for testing! I have a different patch which tries to
address this very same issue. I am not saying it is better or that it
should be merged instead of Mel's one but I would be really happy if you
could give it a try. We can discuss (dis)advantages of both approaches
later.

Thanks!
---
From 8cbf3bd27125fc0a2a46cd5b1085d9e63f9c01fd Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Tue, 31 Jul 2012 15:00:26 +0200
Subject: [PATCH] mm: hugetlbfs: Correctly populate shared pmd

Each page mapped in a process's address space must be correctly
accounted for in _mapcount. Normally the rules for this are
straightforward but hugetlbfs page table sharing is different.
The page table pages at the PMD level are reference counted while
the mapcount remains the same. If this accounting is wrong, it causes
bugs like this one reported by Larry Woodman

[ 1106.156569] ------------[ cut here ]------------
[ 1106.161731] kernel BUG at mm/filemap.c:135!
[ 1106.166395] invalid opcode: 0000 [#1] SMP
[ 1106.170975] CPU 22
[ 1106.173115] Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi]
[ 1106.201770]
[ 1106.203426] Pid: 18001, comm: mpitest Tainted: G        W    3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2
[ 1106.213822] RIP: 0010:[<ffffffff8112cfed>]  [<ffffffff8112cfed>] __delete_from_page_cache+0x15d/0x170
[ 1106.224117] RSP: 0018:ffff880428973b88  EFLAGS: 00010002
[ 1106.230032] RAX: 0000000000000001 RBX: ffffea0006b80000 RCX: 00000000ffffffb0
[ 1106.237979] RDX: 0000000000016df1 RSI: 0000000000000009 RDI: ffff88043ffd9e00
[ 1106.245927] RBP: ffff880428973b98 R08: 0000000000000050 R09: 0000000000000003
[ 1106.253876] R10: 000000000000000d R11: 0000000000000000 R12: ffff880428708150
[ 1106.261826] R13: ffff880428708150 R14: 0000000000000000 R15: ffffea0006b80000
[ 1106.269780] FS:  0000000000000000(0000) GS:ffff88042fd60000(0000) knlGS:0000000000000000
[ 1106.278794] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1106.285193] CR2: 0000003a1d38c4a8 CR3: 000000000187d000 CR4: 00000000000406e0
[ 1106.293149] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1106.301097] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1106.309046] Process mpitest (pid: 18001, threadinfo ffff880428972000, task ffff880428b5cc20)
[ 1106.318447] Stack:
[ 1106.320690]  ffffea0006b80000 0000000000000000 ffff880428973bc8 ffffffff8112d040
[ 1106.328958]  ffff880428973bc8 00000000000002ab 00000000000002a0 ffff880428973c18
[ 1106.337234]  ffff880428973cc8 ffffffff8125b405 ffff880400000001 0000000000000000
[ 1106.345513] Call Trace:
[ 1106.348235]  [<ffffffff8112d040>] delete_from_page_cache+0x40/0x80
[ 1106.355128]  [<ffffffff8125b405>] truncate_hugepages+0x115/0x1f0
[ 1106.361826]  [<ffffffff8125b4f8>] hugetlbfs_evict_inode+0x18/0x30
[ 1106.368615]  [<ffffffff811ab1af>] evict+0x9f/0x1b0
[ 1106.373951]  [<ffffffff811ab3a3>] iput_final+0xe3/0x1e0
[ 1106.379773]  [<ffffffff811ab4de>] iput+0x3e/0x50
[ 1106.384922]  [<ffffffff811a8e18>] d_kill+0xf8/0x110
[ 1106.390356]  [<ffffffff811a8f12>] dput+0xe2/0x1b0
[ 1106.395595]  [<ffffffff81193612>] __fput+0x162/0x240

During fork(), copy_hugetlb_page_range() detects if huge_pte_alloc()
shared page tables with the check dst_pte == src_pte. The logic is if
the PMD page is the same, they must be shared. This assumes that the
sharing is between the parent and child. However, if the sharing is with
a different process entirely then this check fails as in this diagram.

parent
  |
  ------------>pmd
               src_pte----------> data page
                                      ^
other--------->pmd--------------------|
                ^
child-----------|
               dst_pte

For this situation to occur, it must be possible for Parent and Other
to have faulted and failed to share page tables with each other. This is
possible due to the following style of race.

PROC A                                          PROC B
copy_hugetlb_page_range                         copy_hugetlb_page_range
  src_pte == huge_pte_offset                      src_pte == huge_pte_offset
  !src_pte so no sharing                          !src_pte so no sharing

(time passes)

hugetlb_fault                                   hugetlb_fault
  huge_pte_alloc                                  huge_pte_alloc
    huge_pmd_share                                 huge_pmd_share
      LOCK(i_mmap_mutex)
      find nothing, no sharing
      UNLOCK(i_mmap_mutex)
                                                    LOCK(i_mmap_mutex)
                                                    find nothing, no sharing
                                                    UNLOCK(i_mmap_mutex)
    pmd_alloc                                       pmd_alloc
    LOCK(instantiation_mutex)
    fault
    UNLOCK(instantiation_mutex)
                                                LOCK(instantiation_mutex)
                                                fault
                                                UNLOCK(instantiation_mutex)

These two processes are now pointing to the same data page but are not sharing
page tables because the opportunity was missed. When either process later
forks, the src_pte == dst_pte check is potentially insufficient.  As the check
falls through, the wrong PTE information is copied in (harmless but wrong)
and the mapcount is bumped for a page mapped by a shared page table leading
to the BUG_ON.

This patch addresses the issue by moving pmd_alloc into huge_pmd_share
which guarantees that the shared pud is populated in the same
critical section as pmd. This also means that huge_pte_offset test in
huge_pmd_share is serialized correctly now.

Changelog and race identified by Mel Gorman
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reported-by: Larry Woodman <lwoodman@redhat.com>
---
 arch/x86/mm/hugetlbpage.c |   10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index f6679a7..bb05f79 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -58,7 +58,7 @@ static int vma_shareable(struct vm_area_struct *vma, unsigned long addr)
 /*
  * search for a shareable pmd page for hugetlb.
  */
-static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
+static pte_t* huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 {
 	struct vm_area_struct *vma = find_vma(mm, addr);
 	struct address_space *mapping = vma->vm_file->f_mapping;
@@ -68,6 +68,7 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 	struct vm_area_struct *svma;
 	unsigned long saddr;
 	pte_t *spte = NULL;
+	pte_t *pte;
 
 	if (!vma_shareable(vma, addr))
 		return;
@@ -96,8 +97,10 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 	else
 		put_page(virt_to_page(spte));
 	spin_unlock(&mm->page_table_lock);
+	pte = pmd_alloc(mm, pud, addr);
 out:
 	mutex_unlock(&mapping->i_mmap_mutex);
+	return pte;
 }
 
 /*
@@ -142,8 +145,9 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 		} else {
 			BUG_ON(sz != PMD_SIZE);
 			if (pud_none(*pud))
-				huge_pmd_share(mm, addr, pud);
-			pte = (pte_t *) pmd_alloc(mm, pud, addr);
+				pte = huge_pmd_share(mm, addr, pud);
+			else
+				pte = (pte_t *) pmd_alloc(mm, pud, addr);
 		}
 	}
 	BUG_ON(pte && !pte_none(*pte) && !pte_huge(*pte));
-- 
1.7.10.4

-- 
Michal Hocko
SUSE Labs
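
[The design point of the patch above, sketched as a runnable toy (a
pthread mutex standing in for i_mmap_mutex, malloc for pmd_alloc(), all
names invented): searching and allocating inside one critical section
means the second racing fault always finds what the first installed.

	#include <pthread.h>
	#include <stdio.h>
	#include <stdlib.h>

	static pthread_mutex_t i_mmap_mutex = PTHREAD_MUTEX_INITIALIZER;
	static int *published;	/* the findable "shareable pmd page" */

	/* Find-or-allocate in a single critical section: whichever task
	 * takes the mutex second is guaranteed to see the first task's
	 * allocation, closing the window described in the changelog.
	 */
	static void *fault(void *arg)
	{
		int *pmd;

		(void)arg;
		pthread_mutex_lock(&i_mmap_mutex);
		pmd = published;
		if (!pmd)
			pmd = published = malloc(sizeof(*pmd)); /* "pmd_alloc" */
		pthread_mutex_unlock(&i_mmap_mutex);
		return pmd;
	}

	int main(void)
	{
		pthread_t a, b;
		void *pa, *pb;

		pthread_create(&a, NULL, fault, NULL);
		pthread_create(&b, NULL, fault, NULL);
		pthread_join(a, &pa);
		pthread_join(b, &pb);
		printf("page tables shared: %s\n", pa == pb ? "yes" : "no");
		return 0;
	}
]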

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
  2012-07-31 20:06                     ` Michal Hocko
@ 2012-07-31 20:57                       ` Larry Woodman
  2012-08-01  2:45                       ` Larry Woodman
  1 sibling, 0 replies; 50+ messages in thread
From: Larry Woodman @ 2012-07-31 20:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mel Gorman, Rik van Riel, Hugh Dickins, Linux-MM, David Gibson,
	Ken Chen, Cong Wang, LKML

On 07/31/2012 04:06 PM, Michal Hocko wrote:
> On Tue 31-07-12 13:49:21, Larry Woodman wrote:
>> On 07/31/2012 08:46 AM, Mel Gorman wrote:
>>> Fundamentally I think the problem is that we are not correctly detecting
>>> that page table sharing took place during huge_pte_alloc(). This patch is
>>> longer and makes an API change but if I'm right, it addresses the underlying
>>> problem. The first VM_MAYSHARE patch is still necessary but would you mind
>>> testing this on top please?
>> Hi Mel, yes this does work just fine.  It ran for hours without a panic so
>> I'll Ack this one if you send it to the list.
> Hi Larry, thanks for testing! I have a different patch which tries to
> address this very same issue. I am not saying it is better or that it
> should be merged instead of Mel's one but I would be really happy if you
> could give it a try. We can discuss (dis)advantages of both approaches
> later.
>
> Thanks!
Sure, it will take me a day since I keep losing the hardware to
reproduce the problem with.  I'll report back tomorrow.

Larry


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
  2012-07-31 20:06                     ` Michal Hocko
  2012-07-31 20:57                       ` Larry Woodman
@ 2012-08-01  2:45                       ` Larry Woodman
  2012-08-01  8:20                         ` Michal Hocko
  1 sibling, 1 reply; 50+ messages in thread
From: Larry Woodman @ 2012-08-01  2:45 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mel Gorman, Rik van Riel, Hugh Dickins, Linux-MM, David Gibson,
	Ken Chen, Cong Wang, LKML

On 07/31/2012 04:06 PM, Michal Hocko wrote:
> On Tue 31-07-12 13:49:21, Larry Woodman wrote:
>> On 07/31/2012 08:46 AM, Mel Gorman wrote:
>>> Fundamentally I think the problem is that we are not correctly detecting
>>> that page table sharing took place during huge_pte_alloc(). This patch is
>>> longer and makes an API change but if I'm right, it addresses the underlying
>>> problem. The first VM_MAYSHARE patch is still necessary but would you mind
>>> testing this on top please?
>> Hi Mel, yes this does work just fine.  It ran for hours without a panic so
>> I'll Ack this one if you send it to the list.
> Hi Larry, thanks for testing! I have a different patch which tries to
> address this very same issue. I am not saying it is better or that it
> should be merged instead of Mel's one but I would be really happy if you
> could give it a try. We can discuss (dis)advantages of both approaches
> later.
>
> Thanks!

Hi Michal, the system hung when I tested this patch on top of the
latest 3.5 kernel.  I won't have AltSysrq access to the system until
tomorrow AM.  I'll retry this kernel and get AltSysrq output and let
you know what's happening in the morning.

Larry

> ---
> From 8cbf3bd27125fc0a2a46cd5b1085d9e63f9c01fd Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.cz>
> Date: Tue, 31 Jul 2012 15:00:26 +0200
> Subject: [PATCH] mm: hugetlbfs: Correctly populate shared pmd
>
> Each page mapped in a process's address space must be correctly
> accounted for in _mapcount. Normally the rules for this are
> straightforward but hugetlbfs page table sharing is different.
> The page table pages at the PMD level are reference counted while
> the mapcount remains the same. If this accounting is wrong, it causes
> bugs like this one reported by Larry Woodman
>
> [ 1106.156569] ------------[ cut here ]------------
> [ 1106.161731] kernel BUG at mm/filemap.c:135!
> [ 1106.166395] invalid opcode: 0000 [#1] SMP
> [ 1106.170975] CPU 22
> [ 1106.173115] Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi]
> [ 1106.201770]
> [ 1106.203426] Pid: 18001, comm: mpitest Tainted: G        W    3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2
> [ 1106.213822] RIP: 0010:[<ffffffff8112cfed>]  [<ffffffff8112cfed>] __delete_from_page_cache+0x15d/0x170
> [ 1106.224117] RSP: 0018:ffff880428973b88  EFLAGS: 00010002
> [ 1106.230032] RAX: 0000000000000001 RBX: ffffea0006b80000 RCX: 00000000ffffffb0
> [ 1106.237979] RDX: 0000000000016df1 RSI: 0000000000000009 RDI: ffff88043ffd9e00
> [ 1106.245927] RBP: ffff880428973b98 R08: 0000000000000050 R09: 0000000000000003
> [ 1106.253876] R10: 000000000000000d R11: 0000000000000000 R12: ffff880428708150
> [ 1106.261826] R13: ffff880428708150 R14: 0000000000000000 R15: ffffea0006b80000
> [ 1106.269780] FS:  0000000000000000(0000) GS:ffff88042fd60000(0000) knlGS:0000000000000000
> [ 1106.278794] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1106.285193] CR2: 0000003a1d38c4a8 CR3: 000000000187d000 CR4: 00000000000406e0
> [ 1106.293149] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 1106.301097] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 1106.309046] Process mpitest (pid: 18001, threadinfo ffff880428972000, task ffff880428b5cc20)
> [ 1106.318447] Stack:
> [ 1106.320690]  ffffea0006b80000 0000000000000000 ffff880428973bc8 ffffffff8112d040
> [ 1106.328958]  ffff880428973bc8 00000000000002ab 00000000000002a0 ffff880428973c18
> [ 1106.337234]  ffff880428973cc8 ffffffff8125b405 ffff880400000001 0000000000000000
> [ 1106.345513] Call Trace:
> [ 1106.348235]  [<ffffffff8112d040>] delete_from_page_cache+0x40/0x80
> [ 1106.355128]  [<ffffffff8125b405>] truncate_hugepages+0x115/0x1f0
> [ 1106.361826]  [<ffffffff8125b4f8>] hugetlbfs_evict_inode+0x18/0x30
> [ 1106.368615]  [<ffffffff811ab1af>] evict+0x9f/0x1b0
> [ 1106.373951]  [<ffffffff811ab3a3>] iput_final+0xe3/0x1e0
> [ 1106.379773]  [<ffffffff811ab4de>] iput+0x3e/0x50
> [ 1106.384922]  [<ffffffff811a8e18>] d_kill+0xf8/0x110
> [ 1106.390356]  [<ffffffff811a8f12>] dput+0xe2/0x1b0
> [ 1106.395595]  [<ffffffff81193612>] __fput+0x162/0x240
>
> During fork(), copy_hugetlb_page_range() detects if huge_pte_alloc()
> shared page tables with the check dst_pte == src_pte. The logic is if
> the PMD page is the same, they must be shared. This assumes that the
> sharing is between the parent and child. However, if the sharing is with
> a different process entirely then this check fails as in this diagram.
>
> parent
>    |
>    ------------>pmd
>                 src_pte---------->  data page
>                                        ^
> other--------->pmd--------------------|
>                  ^
> child-----------|
>                 dst_pte
>
> For this situation to occur, it must be possible for Parent and Other
> to have faulted and failed to share page tables with each other. This is
> possible due to the following style of race.
>
> PROC A                                          PROC B
> copy_hugetlb_page_range                         copy_hugetlb_page_range
>    src_pte == huge_pte_offset                      src_pte == huge_pte_offset
>    !src_pte so no sharing                          !src_pte so no sharing
>
> (time passes)
>
> hugetlb_fault                                   hugetlb_fault
>    huge_pte_alloc                                  huge_pte_alloc
>      huge_pmd_share                                 huge_pmd_share
>        LOCK(i_mmap_mutex)
>        find nothing, no sharing
>        UNLOCK(i_mmap_mutex)
>                                                      LOCK(i_mmap_mutex)
>                                                      find nothing, no sharing
>                                                      UNLOCK(i_mmap_mutex)
>      pmd_alloc                                       pmd_alloc
>      LOCK(instantiation_mutex)
>      fault
>      UNLOCK(instantiation_mutex)
>                                                  LOCK(instantiation_mutex)
>                                                  fault
>                                                  UNLOCK(instantiation_mutex)
>
> These two processes are now pointing to the same data page but are not sharing
> page tables because the opportunity was missed. When either process later
> forks, the src_pte == dst_pte check is potentially insufficient.  As the check
> falls through, the wrong PTE information is copied in (harmless but wrong)
> and the mapcount is bumped for a page mapped by a shared page table leading
> to the BUG_ON.
>
> This patch addresses the issue by moving pmd_alloc into huge_pmd_share
> which guarantees that the shared pud is populated in the same
> critical section as pmd. This also means that huge_pte_offset test in
> huge_pmd_share is serialized correctly now.
>
> Changelog and race identified by Mel Gorman
> Signed-off-by: Michal Hocko <mhocko@suse.cz>
> Reported-by: Larry Woodman <lwoodman@redhat.com>
> ---
>   arch/x86/mm/hugetlbpage.c |   10 +++++++---
>   1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> index f6679a7..bb05f79 100644
> --- a/arch/x86/mm/hugetlbpage.c
> +++ b/arch/x86/mm/hugetlbpage.c
> @@ -58,7 +58,7 @@ static int vma_shareable(struct vm_area_struct *vma, unsigned long addr)
>   /*
>    * search for a shareable pmd page for hugetlb.
>    */
> -static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> +static pte_t* huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>   {
>   	struct vm_area_struct *vma = find_vma(mm, addr);
>   	struct address_space *mapping = vma->vm_file->f_mapping;
> @@ -68,6 +68,7 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>   	struct vm_area_struct *svma;
>   	unsigned long saddr;
>   	pte_t *spte = NULL;
> +	pte_t *pte;
>
>   	if (!vma_shareable(vma, addr))
>   		return;
> @@ -96,8 +97,10 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>   	else
>   		put_page(virt_to_page(spte));
>   	spin_unlock(&mm->page_table_lock);
> +	pte = pmd_alloc(mm, pud, addr);
>   out:
>   	mutex_unlock(&mapping->i_mmap_mutex);
> +	return pte;
>   }
>
>   /*
> @@ -142,8 +145,9 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
>   		} else {
>   			BUG_ON(sz != PMD_SIZE);
>   			if (pud_none(*pud))
> -				huge_pmd_share(mm, addr, pud);
> -			pte = (pte_t *) pmd_alloc(mm, pud, addr);
> +				pte = huge_pmd_share(mm, addr, pud);
> +			else
> +				pte = (pte_t *) pmd_alloc(mm, pud, addr);
>   		}
>   	}
>   	BUG_ON(pte && !pte_none(*pte) && !pte_huge(*pte));


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
  2012-08-01  2:45                       ` Larry Woodman
@ 2012-08-01  8:20                         ` Michal Hocko
  2012-08-01 12:32                           ` Michal Hocko
  0 siblings, 1 reply; 50+ messages in thread
From: Michal Hocko @ 2012-08-01  8:20 UTC (permalink / raw)
  To: Larry Woodman
  Cc: Mel Gorman, Rik van Riel, Hugh Dickins, Linux-MM, David Gibson,
	Ken Chen, Cong Wang, LKML

On Tue 31-07-12 22:45:43, Larry Woodman wrote:
> On 07/31/2012 04:06 PM, Michal Hocko wrote:
> >On Tue 31-07-12 13:49:21, Larry Woodman wrote:
> >>On 07/31/2012 08:46 AM, Mel Gorman wrote:
> >>>Fundamentally I think the problem is that we are not correctly detecting
> >>>that page table sharing took place during huge_pte_alloc(). This patch is
> >>>longer and makes an API change but if I'm right, it addresses the underlying
> >>>problem. The first VM_MAYSHARE patch is still necessary but would you mind
> >>>testing this on top please?
> >>Hi Mel, yes this does work just fine.  It ran for hours without a panic so
> >>I'll Ack this one if you send it to the list.
> >Hi Larry, thanks for testing! I have a different patch which tries to
> >address this very same issue. I am not saying it is better or that it
> >should be merged instead of Mel's one but I would be really happy if you
> >could give it a try. We can discuss (dis)advantages of both approaches
> >later.
> >
> >Thanks!
> 
> Hi Michal, the system hung when I tested this patch on top of the
> latest 3.5 kernel.  I won't have AltSysrq access to the system until
> tomorrow AM.  

Please hold on. The patch is crap. I forgot about the
if (!vma_shareable(vma, addr))
	return;

case, so somebody got an uninitialized pmd. The patch below handles
that.

Sorry about that, and thanks to Mel for noticing this.
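
[The shape of the fix, as a toy with invented names: once
huge_pmd_share() owns the pmd allocation, every exit has to produce
one, including the !vma_shareable() early exit that v1 left as a bare
return:

	#include <stdlib.h>

	/* Toy: both exits return an allocation.  v1's early exit
	 * returned no value from a pointer-returning function, so
	 * callers saw garbage, the "uninitialized pmd" above.
	 */
	static int *share_or_alloc(int shareable, int **published)
	{
		if (!shareable)
			return malloc(sizeof(int));	/* the path v1 forgot */

		if (!*published)
			*published = malloc(sizeof(int));
		return *published;		/* found or freshly made */
	}

	int main(void)
	{
		int *published = NULL;

		return share_or_alloc(0, &published) &&
		       share_or_alloc(1, &published) ? 0 : 1;
	}
]
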
---
From 14d1cfcb7e19369653e2367c38204b1012a398f9 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Tue, 31 Jul 2012 15:00:26 +0200
Subject: [PATCH] mm: hugetlbfs: Correctly populate shared pmd

Each page mapped in a process's address space must be correctly
accounted for in _mapcount. Normally the rules for this are
straightforward but hugetlbfs page table sharing is different.
The page table pages at the PMD level are reference counted while
the mapcount remains the same. If this accounting is wrong, it causes
bugs like this one reported by Larry Woodman

[ 1106.156569] ------------[ cut here ]------------
[ 1106.161731] kernel BUG at mm/filemap.c:135!
[ 1106.166395] invalid opcode: 0000 [#1] SMP
[ 1106.170975] CPU 22
[ 1106.173115] Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi]
[ 1106.201770]
[ 1106.203426] Pid: 18001, comm: mpitest Tainted: G        W    3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2
[ 1106.213822] RIP: 0010:[<ffffffff8112cfed>]  [<ffffffff8112cfed>] __delete_from_page_cache+0x15d/0x170
[ 1106.224117] RSP: 0018:ffff880428973b88  EFLAGS: 00010002
[ 1106.230032] RAX: 0000000000000001 RBX: ffffea0006b80000 RCX: 00000000ffffffb0
[ 1106.237979] RDX: 0000000000016df1 RSI: 0000000000000009 RDI: ffff88043ffd9e00
[ 1106.245927] RBP: ffff880428973b98 R08: 0000000000000050 R09: 0000000000000003
[ 1106.253876] R10: 000000000000000d R11: 0000000000000000 R12: ffff880428708150
[ 1106.261826] R13: ffff880428708150 R14: 0000000000000000 R15: ffffea0006b80000
[ 1106.269780] FS:  0000000000000000(0000) GS:ffff88042fd60000(0000) knlGS:0000000000000000
[ 1106.278794] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1106.285193] CR2: 0000003a1d38c4a8 CR3: 000000000187d000 CR4: 00000000000406e0
[ 1106.293149] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1106.301097] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1106.309046] Process mpitest (pid: 18001, threadinfo ffff880428972000, task ffff880428b5cc20)
[ 1106.318447] Stack:
[ 1106.320690]  ffffea0006b80000 0000000000000000 ffff880428973bc8 ffffffff8112d040
[ 1106.328958]  ffff880428973bc8 00000000000002ab 00000000000002a0 ffff880428973c18
[ 1106.337234]  ffff880428973cc8 ffffffff8125b405 ffff880400000001 0000000000000000
[ 1106.345513] Call Trace:
[ 1106.348235]  [<ffffffff8112d040>] delete_from_page_cache+0x40/0x80
[ 1106.355128]  [<ffffffff8125b405>] truncate_hugepages+0x115/0x1f0
[ 1106.361826]  [<ffffffff8125b4f8>] hugetlbfs_evict_inode+0x18/0x30
[ 1106.368615]  [<ffffffff811ab1af>] evict+0x9f/0x1b0
[ 1106.373951]  [<ffffffff811ab3a3>] iput_final+0xe3/0x1e0
[ 1106.379773]  [<ffffffff811ab4de>] iput+0x3e/0x50
[ 1106.384922]  [<ffffffff811a8e18>] d_kill+0xf8/0x110
[ 1106.390356]  [<ffffffff811a8f12>] dput+0xe2/0x1b0
[ 1106.395595]  [<ffffffff81193612>] __fput+0x162/0x240

During fork(), copy_hugetlb_page_range() detects if huge_pte_alloc()
shared page tables with the check dst_pte == src_pte. The logic is if
the PMD page is the same, they must be shared. This assumes that the
sharing is between the parent and child. However, if the sharing is with
a different process entirely then this check fails as in this diagram.

parent
  |
  ------------>pmd
               src_pte----------> data page
                                      ^
other--------->pmd--------------------|
                ^
child-----------|
               dst_pte

For this situation to occur, it must be possible for Parent and Other
to have faulted and failed to share page tables with each other. This is
possible due to the following style of race.

PROC A                                          PROC B
copy_hugetlb_page_range                         copy_hugetlb_page_range
  src_pte == huge_pte_offset                      src_pte == huge_pte_offset
  !src_pte so no sharing                          !src_pte so no sharing

(time passes)

hugetlb_fault                                   hugetlb_fault
  huge_pte_alloc                                  huge_pte_alloc
    huge_pmd_share                                 huge_pmd_share
      LOCK(i_mmap_mutex)
      find nothing, no sharing
      UNLOCK(i_mmap_mutex)
                                                    LOCK(i_mmap_mutex)
                                                    find nothing, no sharing
                                                    UNLOCK(i_mmap_mutex)
    pmd_alloc                                       pmd_alloc
    LOCK(instantiation_mutex)
    fault
    UNLOCK(instantiation_mutex)
                                                LOCK(instantiation_mutex)
                                                fault
                                                UNLOCK(instantiation_mutex)

These two processes are now pointing to the same data page but are not sharing
page tables because the opportunity was missed. When either process later
forks, the src_pte == dst_pte check is potentially insufficient.  As the check
falls through, the wrong PTE information is copied in (harmless but wrong)
and the mapcount is bumped for a page mapped by a shared page table leading
to the BUG_ON.

This patch addresses the issue by moving pmd_alloc into huge_pmd_share
which guarantees that the shared pud is populated in the same
critical section as pmd. This also means that huge_pte_offset test in
huge_pmd_share is serialized correctly now.

Changelog and race identified by Mel Gorman
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reported-by: Larry Woodman <lwoodman@redhat.com>
---
 arch/x86/mm/hugetlbpage.c |   12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index f6679a7..cc385c4 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -58,7 +58,7 @@ static int vma_shareable(struct vm_area_struct *vma, unsigned long addr)
 /*
  * search for a shareable pmd page for hugetlb.
  */
-static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
+static pte_t* huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 {
 	struct vm_area_struct *vma = find_vma(mm, addr);
 	struct address_space *mapping = vma->vm_file->f_mapping;
@@ -68,9 +68,10 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 	struct vm_area_struct *svma;
 	unsigned long saddr;
 	pte_t *spte = NULL;
+	pte_t *pte;
 
 	if (!vma_shareable(vma, addr))
-		return;
+		return pmd_alloc(mm, pud, addr);
 
 	mutex_lock(&mapping->i_mmap_mutex);
 	vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) {
@@ -96,8 +97,10 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 	else
 		put_page(virt_to_page(spte));
 	spin_unlock(&mm->page_table_lock);
+	pte = pmd_alloc(mm, pud, addr);
 out:
 	mutex_unlock(&mapping->i_mmap_mutex);
+	return pte;
 }
 
 /*
@@ -142,8 +145,9 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 		} else {
 			BUG_ON(sz != PMD_SIZE);
 			if (pud_none(*pud))
-				huge_pmd_share(mm, addr, pud);
-			pte = (pte_t *) pmd_alloc(mm, pud, addr);
+				pte = huge_pmd_share(mm, addr, pud);
+			else
+				pte = (pte_t *) pmd_alloc(mm, pud, addr);
 		}
 	}
 	BUG_ON(pte && !pte_none(*pte) && !pte_huge(*pte));
-- 
1.7.10.4

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
  2012-08-01  8:20                         ` Michal Hocko
@ 2012-08-01 12:32                           ` Michal Hocko
  2012-08-01 15:06                             ` Larry Woodman
  0 siblings, 1 reply; 50+ messages in thread
From: Michal Hocko @ 2012-08-01 12:32 UTC (permalink / raw)
  To: Larry Woodman
  Cc: Mel Gorman, Rik van Riel, Hugh Dickins, Linux-MM, David Gibson,
	Ken Chen, Cong Wang, LKML

On Wed 01-08-12 10:20:36, Michal Hocko wrote:
> On Tue 31-07-12 22:45:43, Larry Woodman wrote:
> > On 07/31/2012 04:06 PM, Michal Hocko wrote:
> > >On Tue 31-07-12 13:49:21, Larry Woodman wrote:
> > >>On 07/31/2012 08:46 AM, Mel Gorman wrote:
> > >>>Fundamentally I think the problem is that we are not correctly detecting
> > >>>that page table sharing took place during huge_pte_alloc(). This patch is
> > >>>longer and makes an API change but if I'm right, it addresses the underlying
> > >>>problem. The first VM_MAYSHARE patch is still necessary but would you mind
> > >>>testing this on top please?
> > >>Hi Mel, yes this does work just fine.  It ran for hours without a panic so
> > >>I'll Ack this one if you send it to the list.
> > >Hi Larry, thanks for testing! I have a different patch which tries to
> > >address this very same issue. I am not saying it is better or that it
> > >should be merged instead of Mel's one but I would be really happy if you
> > >could give it a try. We can discuss (dis)advantages of both approaches
> > >later.
> > >
> > >Thanks!
> > 
> > Hi Michal, the system hung when I tested this patch on top of the
> > latest 3.5 kernel.  I won't have AltSysrq access to the system until
> > tomorrow AM.  
> 
> Please hold on. The patch is crap. I forgot about the
> if (!vma_shareable(vma, addr))
> 	return;
> 
> case, so somebody got an uninitialized pmd. The patch below handles
> that.
> 

I am really lame :/. The previous patch is wrong as well for the goto out
branch. The updated patch is as follows:
---
From 886b79204491b500437e156aa8eb35e776a4bb07 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Tue, 31 Jul 2012 15:00:26 +0200
Subject: [PATCH] mm: hugetlbfs: Correctly populate shared pmd

Each page mapped in a process's address space must be correctly
accounted for in _mapcount. Normally the rules for this are
straightforward but hugetlbfs page table sharing is different.
The page table pages at the PMD level are reference counted while
the mapcount remains the same. If this accounting is wrong, it causes
bugs like this one reported by Larry Woodman

[ 1106.156569] ------------[ cut here ]------------
[ 1106.161731] kernel BUG at mm/filemap.c:135!
[ 1106.166395] invalid opcode: 0000 [#1] SMP
[ 1106.170975] CPU 22
[ 1106.173115] Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi]
[ 1106.201770]
[ 1106.203426] Pid: 18001, comm: mpitest Tainted: G        W    3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2
[ 1106.213822] RIP: 0010:[<ffffffff8112cfed>]  [<ffffffff8112cfed>] __delete_from_page_cache+0x15d/0x170
[ 1106.224117] RSP: 0018:ffff880428973b88  EFLAGS: 00010002
[ 1106.230032] RAX: 0000000000000001 RBX: ffffea0006b80000 RCX: 00000000ffffffb0
[ 1106.237979] RDX: 0000000000016df1 RSI: 0000000000000009 RDI: ffff88043ffd9e00
[ 1106.245927] RBP: ffff880428973b98 R08: 0000000000000050 R09: 0000000000000003
[ 1106.253876] R10: 000000000000000d R11: 0000000000000000 R12: ffff880428708150
[ 1106.261826] R13: ffff880428708150 R14: 0000000000000000 R15: ffffea0006b80000
[ 1106.269780] FS:  0000000000000000(0000) GS:ffff88042fd60000(0000) knlGS:0000000000000000
[ 1106.278794] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1106.285193] CR2: 0000003a1d38c4a8 CR3: 000000000187d000 CR4: 00000000000406e0
[ 1106.293149] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1106.301097] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1106.309046] Process mpitest (pid: 18001, threadinfo ffff880428972000, task ffff880428b5cc20)
[ 1106.318447] Stack:
[ 1106.320690]  ffffea0006b80000 0000000000000000 ffff880428973bc8 ffffffff8112d040
[ 1106.328958]  ffff880428973bc8 00000000000002ab 00000000000002a0 ffff880428973c18
[ 1106.337234]  ffff880428973cc8 ffffffff8125b405 ffff880400000001 0000000000000000
[ 1106.345513] Call Trace:
[ 1106.348235]  [<ffffffff8112d040>] delete_from_page_cache+0x40/0x80
[ 1106.355128]  [<ffffffff8125b405>] truncate_hugepages+0x115/0x1f0
[ 1106.361826]  [<ffffffff8125b4f8>] hugetlbfs_evict_inode+0x18/0x30
[ 1106.368615]  [<ffffffff811ab1af>] evict+0x9f/0x1b0
[ 1106.373951]  [<ffffffff811ab3a3>] iput_final+0xe3/0x1e0
[ 1106.379773]  [<ffffffff811ab4de>] iput+0x3e/0x50
[ 1106.384922]  [<ffffffff811a8e18>] d_kill+0xf8/0x110
[ 1106.390356]  [<ffffffff811a8f12>] dput+0xe2/0x1b0
[ 1106.395595]  [<ffffffff81193612>] __fput+0x162/0x240

During fork(), copy_hugetlb_page_range() detects if huge_pte_alloc()
shared page tables with the check dst_pte == src_pte. The logic is if
the PMD page is the same, they must be shared. This assumes that the
sharing is between the parent and child. However, if the sharing is with
a different process entirely then this check fails as in this diagram.

parent
  |
  ------------>pmd
               src_pte----------> data page
                                      ^
other--------->pmd--------------------|
                ^
child-----------|
               dst_pte

For this situation to occur, it must be possible for Parent and Other
to have faulted and failed to share page tables with each other. This is
possible due to the following style of race.

PROC A                                          PROC B
copy_hugetlb_page_range                         copy_hugetlb_page_range
  src_pte == huge_pte_offset                      src_pte == huge_pte_offset
  !src_pte so no sharing                          !src_pte so no sharing

(time passes)

hugetlb_fault                                   hugetlb_fault
  huge_pte_alloc                                  huge_pte_alloc
    huge_pmd_share                                 huge_pmd_share
      LOCK(i_mmap_mutex)
      find nothing, no sharing
      UNLOCK(i_mmap_mutex)
                                                    LOCK(i_mmap_mutex)
                                                    find nothing, no sharing
                                                    UNLOCK(i_mmap_mutex)
    pmd_alloc                                       pmd_alloc
    LOCK(instantiation_mutex)
    fault
    UNLOCK(instantiation_mutex)
                                                LOCK(instantiation_mutex)
                                                fault
                                                UNLOCK(instantiation_mutex)

These two processes are now pointing to the same data page but are not sharing
page tables because the opportunity was missed. When either process later
forks, the src_pte == dst_pte check is potentially insufficient.  As the check
falls through, the wrong PTE information is copied in (harmless but wrong)
and the mapcount is bumped for a page mapped by a shared page table leading
to the BUG_ON.

This patch addresses the issue by moving pmd_alloc into huge_pmd_share
which guarantees that the shared pud is populated in the same
critical section as pmd. This also means that huge_pte_offset test in
huge_pmd_share is serialized correctly now.

Changelog and race identified by Mel Gorman
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reported-by: Larry Woodman <lwoodman@redhat.com>
---
 arch/x86/mm/hugetlbpage.c |   13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index f6679a7..40b2500 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -58,7 +58,8 @@ static int vma_shareable(struct vm_area_struct *vma, unsigned long addr)
 /*
  * search for a shareable pmd page for hugetlb.
  */
-static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
+static pte_t*
+huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 {
 	struct vm_area_struct *vma = find_vma(mm, addr);
 	struct address_space *mapping = vma->vm_file->f_mapping;
@@ -68,9 +69,10 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 	struct vm_area_struct *svma;
 	unsigned long saddr;
 	pte_t *spte = NULL;
+	pte_t *pte;
 
 	if (!vma_shareable(vma, addr))
-		return;
+		return (pte_t *)pmd_alloc(mm, pud, addr);
 
 	mutex_lock(&mapping->i_mmap_mutex);
 	vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) {
@@ -97,7 +99,9 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 		put_page(virt_to_page(spte));
 	spin_unlock(&mm->page_table_lock);
 out:
+	pte = (pte_t *)pmd_alloc(mm, pud, addr);
 	mutex_unlock(&mapping->i_mmap_mutex);
+	return pte;
 }
 
 /*
@@ -142,8 +146,9 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 		} else {
 			BUG_ON(sz != PMD_SIZE);
 			if (pud_none(*pud))
-				huge_pmd_share(mm, addr, pud);
-			pte = (pte_t *) pmd_alloc(mm, pud, addr);
+				pte = huge_pmd_share(mm, addr, pud);
+			else
+				pte = (pte_t *) pmd_alloc(mm, pud, addr);
 		}
 	}
 	BUG_ON(pte && !pte_none(*pte) && !pte_huge(*pte));
-- 
1.7.10.4

-- 
Michal Hocko
SUSE Labs
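
[The v2 to v3 difference, reduced to a toy with invented names: the
allocation now sits after the out: label, so the goto taken when no
shareable mapping is found still initializes the returned pointer.

	#include <stdio.h>
	#include <stdlib.h>

	static int *share(int *found_spte)
	{
		int *pte;

		if (!found_spte)
			goto out;	/* in v2 this jumped past the
					 * allocation and returned an
					 * uninitialized pte */
		/* ...sharing bookkeeping would go here... */
	out:
		pte = malloc(sizeof(*pte));	/* after the label, as v3
						 * places pmd_alloc():
						 * every path sets pte */
		return pte;
	}

	int main(void)
	{
		printf("%s\n", share(NULL) ? "pte initialized" : "pte NULL");
		return 0;
	}
]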

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
  2012-08-01 12:32                           ` Michal Hocko
@ 2012-08-01 15:06                             ` Larry Woodman
  2012-08-02  7:19                               ` Michal Hocko
  0 siblings, 1 reply; 50+ messages in thread
From: Larry Woodman @ 2012-08-01 15:06 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mel Gorman, Rik van Riel, Hugh Dickins, Linux-MM, David Gibson,
	Ken Chen, Cong Wang, LKML

On 08/01/2012 08:32 AM, Michal Hocko wrote:
>
> I am really lame :/. The previous patch is wrong as well for the goto out
> branch. The updated patch is as follows:
This patch worked fine, Michal!  You and Mel can duke it out over whose
is best. :)

Larry


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
  2012-08-01 15:06                             ` Larry Woodman
@ 2012-08-02  7:19                               ` Michal Hocko
  2012-08-02  7:37                                 ` Mel Gorman
  0 siblings, 1 reply; 50+ messages in thread
From: Michal Hocko @ 2012-08-02  7:19 UTC (permalink / raw)
  To: Larry Woodman
  Cc: Mel Gorman, Rik van Riel, Hugh Dickins, Linux-MM, David Gibson,
	Ken Chen, Cong Wang, LKML

Hi Larry,

On Wed 01-08-12 11:06:33, Larry Woodman wrote:
> On 08/01/2012 08:32 AM, Michal Hocko wrote:
> >
> >I am really lame :/. The previous patch is wrong as well for the goto out
> >branch. The updated patch is as follows:
> This patch worked fine, Michal!  

Thanks for the good news!

> You and Mel can duke it out over whose is best. :)

The answer is clear here ;) Mel did the hard work of identifying the
culprit so kudos go to him.
I just tried to solve the issue more inside the x86 arch code. The pmd
allocation outside of the sharing code had seemed strange to me for quite
some time; I just underestimated its consequences completely.

Both approaches have some pros. Mel's patch is more resistant to other
not-yet-discovered races and it also makes the arch independent code
more robust because relying on the pmd trick is not ideal.
On the other hand, mine is more coupled with the sharing code so it
makes the code easier to follow and also makes the sharing more
effective because racing processes see pmd populated when checking for
shareable mappings.

So I am more inclined to mine but I don't want to push it because both
are good and make sense. What do other people think?

> 
> Larry
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
  2012-08-02  7:19                               ` Michal Hocko
@ 2012-08-02  7:37                                 ` Mel Gorman
  2012-08-02 12:36                                   ` Michal Hocko
  0 siblings, 1 reply; 50+ messages in thread
From: Mel Gorman @ 2012-08-02  7:37 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Larry Woodman, Rik van Riel, Hugh Dickins, Linux-MM,
	David Gibson, Ken Chen, Cong Wang, LKML

On Thu, Aug 02, 2012 at 09:19:34AM +0200, Michal Hocko wrote:
> Hi Larry,
> 
> On Wed 01-08-12 11:06:33, Larry Woodman wrote:
> > On 08/01/2012 08:32 AM, Michal Hocko wrote:
> > >
> > >I am really lame :/. The previous patch is wrong for the goto out
> > >branch as well. The updated patch is as follows:
> > This patch worked fine, Michal!
> 
> Thanks for the good news!
> 
> > You and Mel can duke it out over whose is best. :)
> 
> The answer is clear here ;) Mel did the hard work of identifying the
> culprit, so kudos go to him.

I'm happy once it's fixed!

> I just tried to solve the issue more within the x86 arch code. The pmd
> allocation outside of the sharing code had seemed strange to me for
> quite some time; I just completely underestimated its consequences.
> 
> Both approaches have some pros. Mel's patch is more resistant to other
> not-yet-discovered races and it also makes the arch independent code
> more robust because relying on the pmd trick is not ideal.

If there is another race then it is best to hear about it, understand
it and fix the underlying problem. More importantly, your patch ensures
that two processes faulting at the same time will share page tables with
each other. My patch only noted that this missed opportunity could cause
problems with fork.

> On the other hand, mine is more coupled with the sharing code so it
> makes the code easier to follow and also makes the sharing more
> effective because racing processes see pmd populated when checking for
> shareable mappings.
> 

It could do with a small comment above huge_pmd_share() explaining that
calling pmd_alloc() under the i_mmap_mutex prevents two parallel faults
from missing a sharing opportunity with each other, even though it's
not mandatory for correctness.
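
Concretely, the interleaving that pmd_alloc() under i_mmap_mutex closes
off looks something like this (illustrative trace, two tasks faulting
on the same shareable range):

	Task A				Task B
	huge_pmd_share()
	  lock i_mmap_mutex
	  no peer pmd found
	  unlock i_mmap_mutex
					huge_pmd_share()
					  lock i_mmap_mutex
					  A's pud is still pud_none()
					  -> sharing opportunity missed
					  unlock i_mmap_mutex
	pmd_alloc()			pmd_alloc()

Each task ends up with a private pmd where one shared pmd would have
done. Nothing breaks, which is why it's not mandatory, but the sharing
is lost.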

> > So I am more inclined to mine, but I don't want to push it because both
> > are good and make sense. What do other people think?
> 

I vote for yours

Reviewed-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
  2012-08-02  7:37                                 ` Mel Gorman
@ 2012-08-02 12:36                                   ` Michal Hocko
  2012-08-02 13:33                                     ` Mel Gorman
  0 siblings, 1 reply; 50+ messages in thread
From: Michal Hocko @ 2012-08-02 12:36 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Larry Woodman, Rik van Riel, Hugh Dickins, Linux-MM,
	David Gibson, Ken Chen, Cong Wang, LKML

On Thu 02-08-12 08:37:57, Mel Gorman wrote:
> On Thu, Aug 02, 2012 at 09:19:34AM +0200, Michal Hocko wrote:
[...]
> > On the other hand, mine is more coupled with the sharing code so it
> > makes the code easier to follow and also makes the sharing more
> > effective because racing processes see pmd populated when checking for
> > shareable mappings.
> > 
> 
> It could do with a small comment above huge_pmd_share() explaining that
> calling pmd_alloc() under the i_mmap_mutex prevents two parallel faults
> from missing a sharing opportunity with each other, even though it's
> not mandatory for correctness.

Sure, that's a good idea. What about the following:

diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 40b2500..51839d1 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -56,7 +56,13 @@ static int vma_shareable(struct vm_area_struct *vma, unsigned long addr)
 }
 
 /*
- * search for a shareable pmd page for hugetlb.
+ * search for a shareable pmd page for hugetlb. In any case, calls
+ * pmd_alloc() and returns the corresponding pte. While this is not
+ * necessary for the !shared pmd case because we can allocate the pmd
+ * later as well, it makes the code much cleaner. pmd allocation is
+ * essential for the shared case, though: the pud has to be populated
+ * inside the same i_mmap_mutex section, or racing tasks could miss
+ * the sharing (see huge_pte_offset) or select a bad pmd for sharing.
  */
 static pte_t*
 huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)

> 
> > So I am more inclined to mine, but I don't want to push it because both
> > are good and make sense. What do other people think?
> > 
> 
> I vote for yours
> 
> Reviewed-by: Mel Gorman <mgorman@suse.de>

Thanks!

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
  2012-08-02 12:36                                   ` Michal Hocko
@ 2012-08-02 13:33                                     ` Mel Gorman
  2012-08-02 13:53                                       ` Michal Hocko
  0 siblings, 1 reply; 50+ messages in thread
From: Mel Gorman @ 2012-08-02 13:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Larry Woodman, Rik van Riel, Hugh Dickins, Linux-MM,
	David Gibson, Ken Chen, Cong Wang, LKML

On Thu, Aug 02, 2012 at 02:36:58PM +0200, Michal Hocko wrote:
> On Thu 02-08-12 08:37:57, Mel Gorman wrote:
> > On Thu, Aug 02, 2012 at 09:19:34AM +0200, Michal Hocko wrote:
> [...]
> > > On the other hand, mine is more coupled with the sharing code so it
> > > makes the code easier to follow and also makes the sharing more
> > > effective because racing processes see pmd populated when checking for
> > > shareable mappings.
> > > 
> > 
> > It could do with a small comment above huge_pmd_share() explaining that
> > calling pmd_alloc() under the i_mmap_mutex prevents two parallel faults
> > from missing a sharing opportunity with each other, even though it's
> > not mandatory for correctness.
> 
> Sure, that's a good idea. What about the following:
> 
> diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> index 40b2500..51839d1 100644
> --- a/arch/x86/mm/hugetlbpage.c
> +++ b/arch/x86/mm/hugetlbpage.c
> @@ -56,7 +56,13 @@ static int vma_shareable(struct vm_area_struct *vma, unsigned long addr)
>  }
>  
>  /*
> - * search for a shareable pmd page for hugetlb.
> + * search for a shareable pmd page for hugetlb. In any case, calls
> + * pmd_alloc() and returns the corresponding pte. While this is not
> + * necessary for the !shared pmd case because we can allocate the pmd
> + * later as well, it makes the code much cleaner. pmd allocation is
> + * essential for the shared case, though: the pud has to be populated
> + * inside the same i_mmap_mutex section, or racing tasks could miss
> + * the sharing (see huge_pte_offset) or select a bad pmd for sharing.
>   */
>  static pte_t*
>  huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> 

Looks reasonable to me.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
  2012-08-02 13:33                                     ` Mel Gorman
@ 2012-08-02 13:53                                       ` Michal Hocko
  0 siblings, 0 replies; 50+ messages in thread
From: Michal Hocko @ 2012-08-02 13:53 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Larry Woodman, Rik van Riel, Hugh Dickins, Linux-MM,
	David Gibson, Ken Chen, Cong Wang, LKML

On Thu 02-08-12 14:33:10, Mel Gorman wrote:
> On Thu, Aug 02, 2012 at 02:36:58PM +0200, Michal Hocko wrote:
> > On Thu 02-08-12 08:37:57, Mel Gorman wrote:
> > > On Thu, Aug 02, 2012 at 09:19:34AM +0200, Michal Hocko wrote:
> > [...]
> > > > On the other hand, mine is more coupled with the sharing code so it
> > > > makes the code easier to follow and also makes the sharing more
> > > > effective because racing processes see pmd populated when checking for
> > > > shareable mappings.
> > > > 
> > > 
> > > It could do with a small comment above huge_pmd_share() explaining that
> > > calling pmd_alloc() under the i_mmap_mutex prevents two parallel faults
> > > from missing a sharing opportunity with each other, even though it's
> > > not mandatory for correctness.
> > 
> > Sure, that's a good idea. What about the following:
> > 
> > diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> > index 40b2500..51839d1 100644
> > --- a/arch/x86/mm/hugetlbpage.c
> > +++ b/arch/x86/mm/hugetlbpage.c
> > @@ -56,7 +56,13 @@ static int vma_shareable(struct vm_area_struct *vma, unsigned long addr)
> >  }
> >  
> >  /*
> > - * search for a shareable pmd page for hugetlb.
> > + * search for a shareable pmd page for hugetlb. In any case, calls
> > + * pmd_alloc() and returns the corresponding pte. While this is not
> > + * necessary for the !shared pmd case because we can allocate the pmd
> > + * later as well, it makes the code much cleaner. pmd allocation is
> > + * essential for the shared case, though: the pud has to be populated
> > + * inside the same i_mmap_mutex section, or racing tasks could miss
> > + * the sharing (see huge_pte_offset) or select a bad pmd for sharing.
> >   */
> >  static pte_t*
> >  huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> > 
> 
> Looks reasonable to me.

OK, added to the patch. I will send it to Andrew now.

Thanks a lot!
-- 
Michal Hocko
SUSE Labs


end of thread (newest: 2012-08-02 13:53 UTC)

Thread overview: 50+ messages
2012-07-20 13:49 [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables v2 Mel Gorman
2012-07-20 14:11 ` [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend) Mel Gorman
2012-07-20 14:29   ` Michal Hocko
2012-07-20 14:37     ` Mel Gorman
2012-07-20 14:40       ` Michal Hocko
2012-07-20 14:36   ` [PATCH -alternative] " Michal Hocko
2012-07-20 14:51     ` Mel Gorman
2012-07-23  4:04       ` Hugh Dickins
2012-07-23 11:40         ` Mel Gorman
2012-07-24  1:08           ` Hugh Dickins
2012-07-24  8:32             ` Michal Hocko
2012-07-24  9:34             ` Mel Gorman
2012-07-24 10:04               ` Michal Hocko
2012-07-24 19:23               ` Hugh Dickins
2012-07-25  8:36                 ` Mel Gorman
2012-07-26 17:42         ` Rik van Riel
2012-07-26 18:04           ` Larry Woodman
2012-07-27  8:42           ` Mel Gorman
2012-07-26 18:37         ` Rik van Riel
2012-07-26 21:03           ` Larry Woodman
2012-07-27  3:48           ` Larry Woodman
2012-07-27 10:10             ` Larry Woodman
2012-07-27 10:23             ` Mel Gorman
2012-07-27 10:36               ` Larry Woodman
2012-07-30 19:11               ` Larry Woodman
2012-07-31 12:16                 ` Hillf Danton
2012-07-31 12:46                 ` Mel Gorman
2012-07-31 13:07                   ` Larry Woodman
2012-07-31 13:29                     ` Mel Gorman
2012-07-31 13:21                   ` Michal Hocko
2012-07-31 17:49                   ` Larry Woodman
2012-07-31 20:06                     ` Michal Hocko
2012-07-31 20:57                       ` Larry Woodman
2012-08-01  2:45                       ` Larry Woodman
2012-08-01  8:20                         ` Michal Hocko
2012-08-01 12:32                           ` Michal Hocko
2012-08-01 15:06                             ` Larry Woodman
2012-08-02  7:19                               ` Michal Hocko
2012-08-02  7:37                                 ` Mel Gorman
2012-08-02 12:36                                   ` Michal Hocko
2012-08-02 13:33                                     ` Mel Gorman
2012-08-02 13:53                                       ` Michal Hocko
2012-07-31 18:03                   ` Rik van Riel
2012-07-26 18:31     ` Rik van Riel
2012-07-27  9:02       ` Michal Hocko
2012-07-26 16:01 ` [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables v2 Larry Woodman
2012-07-27  8:47   ` Mel Gorman
2012-07-26 21:00 ` Rik van Riel
2012-07-26 21:54   ` Hugh Dickins
2012-07-27  8:52   ` Mel Gorman
