From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: Greg KH <gregkh@linuxfoundation.org>,
	Larry Woodman <lwoodman@redhat.com>, Mel Gorman <mgorman@suse.de>,
	Michal Hocko <mhocko@suse.cz>, Rik van Riel <riel@redhat.com>,
	David Gibson <david@gibson.dropbear.id.au>,
	Ken Chen <kenchen@google.com>,
	Cong Wang <xiyou.wangcong@gmail.com>,
	Hillf Danton <dhillf@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: [ 14/46] mm: hugetlbfs: correctly populate shared pmd
Date: Wed, 12 Sep 2012 16:39:04 -0700
Message-ID: <20120912233819.157588671@linuxfoundation.org>
In-Reply-To: <20120912233817.662663809@linuxfoundation.org>

From: Greg KH <gregkh@linuxfoundation.org>

3.0-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Michal Hocko <mhocko@suse.cz>

commit eb48c071464757414538c68a6033c8f8c15196f8 upstream.

Each page mapped in a process's address space must be correctly
accounted for in _mapcount.  Normally the rules for this are
straightforward but hugetlbfs page table sharing is different.  The page
table pages at the PMD level are reference counted while the mapcount
remains the same.

If this accounting is wrong, it causes bugs like this one reported by
Larry Woodman:

  kernel BUG at mm/filemap.c:135!
  invalid opcode: 0000 [#1] SMP
  CPU 22
  Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi]
  Pid: 18001, comm: mpitest Tainted: G        W    3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2
  RIP: 0010:[<ffffffff8112cfed>]  [<ffffffff8112cfed>] __delete_from_page_cache+0x15d/0x170
  Process mpitest (pid: 18001, threadinfo ffff880428972000, task ffff880428b5cc20)
  Call Trace:
    delete_from_page_cache+0x40/0x80
    truncate_hugepages+0x115/0x1f0
    hugetlbfs_evict_inode+0x18/0x30
    evict+0x9f/0x1b0
    iput_final+0xe3/0x1e0
    iput+0x3e/0x50
    d_kill+0xf8/0x110
    dput+0xe2/0x1b0
    __fput+0x162/0x240

During fork(), copy_hugetlb_page_range() detects if huge_pte_alloc()
shared page tables with the check dst_pte == src_pte.  The logic is if
the PMD page is the same, they must be shared.  This assumes that the
sharing is between the parent and child.  However, if the sharing is
with a different process entirely then this check fails as in this
diagram:

  parent
    |
    ------------>pmd
                 src_pte----------> data page
                                        ^
  other--------->pmd--------------------|
                  ^
  child-----------|
                 dst_pte

For this situation to occur, it must be possible for Parent and Other to
have faulted and failed to share page tables with each other.  This is
possible due to the following style of race.

  PROC A                                          PROC B
  copy_hugetlb_page_range                         copy_hugetlb_page_range
    src_pte == huge_pte_offset                      src_pte == huge_pte_offset
    !src_pte so no sharing                          !src_pte so no sharing

  (time passes)

  hugetlb_fault                                   hugetlb_fault
    huge_pte_alloc                                  huge_pte_alloc
      huge_pmd_share                                 huge_pmd_share
        LOCK(i_mmap_mutex)
        find nothing, no sharing
        UNLOCK(i_mmap_mutex)
                                                      LOCK(i_mmap_mutex)
                                                      find nothing, no sharing
                                                      UNLOCK(i_mmap_mutex)
      pmd_alloc                                       pmd_alloc
      LOCK(instantiation_mutex)
      fault
      UNLOCK(instantiation_mutex)
                                                  LOCK(instantiation_mutex)
                                                  fault
                                                  UNLOCK(instantiation_mutex)

These two processes are now pointing to the same data page but are not
sharing page tables because the opportunity was missed.  When either
process later forks, the src_pte == dst_pte check is potentially
insufficient.  As the check falls through, the wrong PTE information is
copied in (harmless but wrong) and the mapcount is bumped for a page
mapped by a shared page table, leading to the BUG_ON.
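
The accounting error can be illustrated with a small userspace model.
This is only a sketch, not kernel code: struct data_page, struct
pmd_page and the model_*() helpers are hypothetical stand-ins invented
for illustration, and the pointer comparison in model_copy() merely
mimics the dst_pte == src_pte test described above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins, not the kernel's types. */
struct data_page {
	int mapcount;		/* one count per page table mapping the page */
};

struct pmd_page {
	int refcount;		/* one count per mm sharing this page table */
	struct data_page *entry;
};

/* Sharing a PMD page takes a reference on the table, not on the data page. */
static struct pmd_page *model_share(struct pmd_page *pmd)
{
	pmd->refcount++;
	return pmd;
}

/*
 * Mimics the fork-time check: identical PMD pages are treated as shared
 * and skipped; anything else is copied and the mapcount is bumped.
 * Returns 1 if the mapcount was bumped, 0 if the entry was skipped.
 */
static int model_copy(struct pmd_page *dst, struct pmd_page *src)
{
	assert(src->entry != NULL);
	if (dst == src)
		return 0;
	dst->entry = src->entry;	/* same value if dst already maps it */
	src->entry->mapcount++;
	return 1;
}

/* Child shares the parent's PMD page: the check recognises the sharing. */
int model_child_shares_with_parent(void)
{
	struct data_page page = { 1 };
	struct pmd_page parent = { 1, &page };
	struct pmd_page *child = model_share(&parent);

	model_copy(child, &parent);
	return page.mapcount;
}

/* Child shares a third process's PMD page: the check falls through. */
int model_child_shares_with_other(void)
{
	struct data_page page = { 2 };	/* mapped by parent and by other */
	struct pmd_page parent = { 1, &page };
	struct pmd_page other = { 1, &page };
	struct pmd_page *child = model_share(&other);

	model_copy(child, &parent);
	return page.mapcount;
}
```

In this model the first helper leaves the mapcount at 1, while the
second returns 3 even though only two page tables map the page.  That
extra count is the kind of imbalance that later trips the BUG_ON.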

This patch addresses the issue by moving pmd_alloc into huge_pmd_share,
which guarantees that the shared pud is populated in the same critical
section as the pmd.  This also means that the huge_pte_offset test in
huge_pmd_share is now serialized correctly, which in turn means that
sharing succeeds more often because the racing tasks see the pud and pmd
populated together.
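
The effect of the reordering can be sketched as a deterministic,
single-threaded replay of the two interleavings.  Again this is only an
illustrative model under stated assumptions: struct task, lookup_peer()
and populate() are hypothetical names, and locking is represented purely
by the order in which the steps run.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins, not the kernel's types. */
struct pmd_page {
	int refcount;
};

struct task {
	struct pmd_page *pud;	/* NULL models pud_none() */
	struct pmd_page own;	/* storage used when the task allocates */
};

/* What a task sees when it scans the mapping for a populated peer. */
static struct pmd_page *lookup_peer(struct task *peer)
{
	return peer->pud;
}

/* Populate the pud: reuse the PMD page that was found, else allocate. */
static void populate(struct task *t, struct pmd_page *found)
{
	assert(t->pud == NULL);
	if (found) {
		found->refcount++;
		t->pud = found;
	} else {
		t->own.refcount = 1;
		t->pud = &t->own;
	}
}

/* Old ordering: both locked lookups finish before either pud is populated. */
int model_old_order_shares(void)
{
	struct task a = { NULL, { 0 } }, b = { NULL, { 0 } };
	struct pmd_page *found_a = lookup_peer(&b);	/* finds nothing */
	struct pmd_page *found_b = lookup_peer(&a);	/* finds nothing */

	populate(&a, found_a);	/* allocation happens outside the lock */
	populate(&b, found_b);
	return a.pud == b.pud;
}

/* New ordering: lookup and populate form one critical section per task. */
int model_new_order_shares(void)
{
	struct task a = { NULL, { 0 } }, b = { NULL, { 0 } };

	populate(&a, lookup_peer(&b));	/* A: nothing to share, allocates */
	populate(&b, lookup_peer(&a));	/* B: sees A's pmd and shares it */
	return a.pud == b.pud;
}
```

With the old ordering the two tasks end up with separate PMD pages
(the function returns 0); with lookup and populate kept together the
second task always finds the first task's PMD page (returns 1).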

Race identified and changelog written mostly by Mel Gorman.

[akpm@linux-foundation.org: attempt to make the huge_pmd_share() comment comprehensible, clean up coding style]
Reported-by: Larry Woodman <lwoodman@redhat.com>
Tested-by: Larry Woodman <lwoodman@redhat.com>
Reviewed-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Ken Chen <kenchen@google.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 arch/x86/mm/hugetlbpage.c |   21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -56,9 +56,16 @@ static int vma_shareable(struct vm_area_
 }
 
 /*
- * search for a shareable pmd page for hugetlb.
+ * Search for a shareable pmd page for hugetlb. In any case calls pmd_alloc()
+ * and returns the corresponding pte. While this is not necessary for the
+ * !shared pmd case because we can allocate the pmd later as well, it makes the
+ * code much cleaner. pmd allocation is essential for the shared case because
+ * pud has to be populated inside the same i_mmap_mutex section - otherwise
+ * racing tasks could either miss the sharing (see huge_pte_offset) or select a
+ * bad pmd for sharing.
  */
-static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
+static pte_t *
+huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 {
 	struct vm_area_struct *vma = find_vma(mm, addr);
 	struct address_space *mapping = vma->vm_file->f_mapping;
@@ -68,9 +75,10 @@ static void huge_pmd_share(struct mm_str
 	struct vm_area_struct *svma;
 	unsigned long saddr;
 	pte_t *spte = NULL;
+	pte_t *pte;
 
 	if (!vma_shareable(vma, addr))
-		return;
+		return (pte_t *)pmd_alloc(mm, pud, addr);
 
 	mutex_lock(&mapping->i_mmap_mutex);
 	vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) {
@@ -97,7 +105,9 @@ static void huge_pmd_share(struct mm_str
 		put_page(virt_to_page(spte));
 	spin_unlock(&mm->page_table_lock);
 out:
+	pte = (pte_t *)pmd_alloc(mm, pud, addr);
 	mutex_unlock(&mapping->i_mmap_mutex);
+	return pte;
 }
 
 /*
@@ -142,8 +152,9 @@ pte_t *huge_pte_alloc(struct mm_struct *
 		} else {
 			BUG_ON(sz != PMD_SIZE);
 			if (pud_none(*pud))
-				huge_pmd_share(mm, addr, pud);
-			pte = (pte_t *) pmd_alloc(mm, pud, addr);
+				pte = huge_pmd_share(mm, addr, pud);
+			else
+				pte = (pte_t *)pmd_alloc(mm, pud, addr);
 		}
 	}
 	BUG_ON(pte && !pte_none(*pte) && !pte_huge(*pte));



