Linux-mm Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH STABLE 4.4 0/8] page refcount overflow backports
@ 2019-11-08  9:38 Vlastimil Babka
  2019-11-08  9:38 ` [PATCH STABLE 4.4 1/8] mm, gup: remove broken VM_BUG_ON_PAGE compound check for hugepages Vlastimil Babka
                   ` (7 more replies)
  0 siblings, 8 replies; 9+ messages in thread
From: Vlastimil Babka @ 2019-11-08  9:38 UTC (permalink / raw)
  To: stable
  Cc: linux-mm, linux-kernel, Ajay Kaher, Vlastimil Babka, Al Viro,
	Andrew Morton, Andy Lutomirski, Aneesh Kumar K.V,
	Borislav Petkov, Catalin Marinas, Dave Hansen, Hillf Danton,
	Ingo Molnar, Jann Horn, Juergen Gross, Kirill A. Shutemov,
	Linus Torvalds, Mark Rutland, Matthew Wilcox, Michal Hocko,
	Mike Kravetz, Miklos Szeredi, Naoya Horiguchi, Oscar Salvador,
	Peter Zijlstra, Punit Agrawal, Steve Capper, Thomas Gleixner,
	Vitaly Kuznetsov, Will Deacon

Hi,

this series backports the CVE-2019-11487 fixes (page refcount overflow) to
4.4 stable. It differs from Ajay's series [1] in the following:

- gup.c variants of fast gup for x86 and s390 are fixed too. I've not fixed
  sparc, mips, sh. It's unlikely the known overflow scenario based on FUSE,
  which needs 140GB of RAM, is a problem for those architectures, and I don't
  feel confident enough to patch them. I've sent the same fixup for 4.9 [3]
- there are some differences in backport adaptations, hopefully not important.
  My version is taken from our 4.4 based kernel, which was just simpler for me
  than adding the missing parts to Ajay's version
- The last patch fixes another problem in the fast gup implementation on x86,
  that I've previously posted and got merged to 4.9 stable [2].

[1] https://lore.kernel.org/linux-mm/1570581863-12090-1-git-send-email-akaher@vmware.com/
[2] https://lore.kernel.org/linux-mm/20190802160614.8089-1-vbabka@suse.cz/
[3] https://lore.kernel.org/linux-mm/9c130fa4-e52d-f8bd-c450-42341c7ab441@suse.cz/

Linus Torvalds (3):
  mm: make page ref count overflow check tighter and more explicit
  mm: add 'try_get_page()' helper function
  mm: prevent get_user_pages() from overflowing page refcount

Matthew Wilcox (1):
  fs: prevent page refcount overflow in pipe_buf_get

Miklos Szeredi (1):
  pipe: add pipe_buf_get() helper

Punit Agrawal (1):
  mm, gup: ensure real head page is ref-counted when using hugepages

Vlastimil Babka (1):
  x86, mm, gup: prevent get_page() race with munmap in paravirt guest

Will Deacon (1):
  mm, gup: remove broken VM_BUG_ON_PAGE compound check for hugepages

 arch/s390/mm/gup.c        |  6 +++--
 arch/x86/mm/gup.c         | 23 ++++++++++++++++++-
 fs/fuse/dev.c             | 12 +++++-----
 fs/pipe.c                 |  4 ++--
 fs/splice.c               | 12 ++++++++--
 include/linux/mm.h        | 26 ++++++++++++++++++++-
 include/linux/pipe_fs_i.h | 17 ++++++++++++--
 kernel/trace/trace.c      |  6 ++++-
 mm/gup.c                  | 48 +++++++++++++++++++++++++++------------
 mm/huge_memory.c          |  2 +-
 mm/hugetlb.c              | 18 +++++++++++++--
 mm/internal.h             | 17 ++++++++++----
 12 files changed, 152 insertions(+), 39 deletions(-)

-- 
2.23.0



^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH STABLE 4.4 1/8] mm, gup: remove broken VM_BUG_ON_PAGE compound check for hugepages
  2019-11-08  9:38 [PATCH STABLE 4.4 0/8] page refcount overflow backports Vlastimil Babka
@ 2019-11-08  9:38 ` Vlastimil Babka
  2019-11-08  9:38 ` [PATCH STABLE 4.4 2/8] mm, gup: ensure real head page is ref-counted when using hugepages Vlastimil Babka
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Vlastimil Babka @ 2019-11-08  9:38 UTC (permalink / raw)
  To: stable
  Cc: linux-mm, linux-kernel, Ajay Kaher, Will Deacon, Punit Agrawal,
	Steve Capper, Kirill A . Shutemov, Aneesh Kumar K . V,
	Catalin Marinas, Naoya Horiguchi, Mark Rutland, Hillf Danton,
	Michal Hocko, Mike Kravetz, Andrew Morton, Linus Torvalds,
	Vlastimil Babka

From: Will Deacon <will.deacon@arm.com>

commit a3e328556d41bb61c55f9dfcc62d6a826ea97b85 upstream.

When operating on hugepages with DEBUG_VM enabled, the GUP code checks
the compound head for each tail page prior to calling
page_cache_add_speculative.  This is broken, because on the fast-GUP
path (where we don't hold any page table locks) we can be racing with a
concurrent invocation of split_huge_page_to_list.

split_huge_page_to_list deals with this race by using page_ref_freeze to
freeze the page and force concurrent GUPs to fail whilst the component
pages are modified.  This modification includes clearing the
compound_head field for the tail pages, so checking this prior to a
successful call to page_cache_add_speculative can lead to false
positives: In fact, page_cache_add_speculative *already* has this check
once the page refcount has been successfully updated, so we can simply
remove the broken calls to VM_BUG_ON_PAGE.

Link: http://lkml.kernel.org/r/20170522133604.11392-2-punit.agrawal@arm.com
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
Acked-by: Steve Capper <steve.capper@arm.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/gup.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 2cd3b31e3666..6f9088cb8ebe 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1134,7 +1134,6 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 	page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
 	tail = page;
 	do {
-		VM_BUG_ON_PAGE(compound_head(page) != head, page);
 		pages[*nr] = page;
 		(*nr)++;
 		page++;
@@ -1181,7 +1180,6 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 	page = head + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
 	tail = page;
 	do {
-		VM_BUG_ON_PAGE(compound_head(page) != head, page);
 		pages[*nr] = page;
 		(*nr)++;
 		page++;
@@ -1224,7 +1222,6 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
 	page = head + ((addr & ~PGDIR_MASK) >> PAGE_SHIFT);
 	tail = page;
 	do {
-		VM_BUG_ON_PAGE(compound_head(page) != head, page);
 		pages[*nr] = page;
 		(*nr)++;
 		page++;
-- 
2.23.0



^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH STABLE 4.4 2/8] mm, gup: ensure real head page is ref-counted when using hugepages
  2019-11-08  9:38 [PATCH STABLE 4.4 0/8] page refcount overflow backports Vlastimil Babka
  2019-11-08  9:38 ` [PATCH STABLE 4.4 1/8] mm, gup: remove broken VM_BUG_ON_PAGE compound check for hugepages Vlastimil Babka
@ 2019-11-08  9:38 ` Vlastimil Babka
  2019-11-08  9:38 ` [PATCH STABLE 4.4 3/8] mm: make page ref count overflow check tighter and more explicit Vlastimil Babka
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Vlastimil Babka @ 2019-11-08  9:38 UTC (permalink / raw)
  To: stable
  Cc: linux-mm, linux-kernel, Ajay Kaher, Punit Agrawal, Steve Capper,
	Michal Hocko, Kirill A. Shutemov, Aneesh Kumar K . V,
	Catalin Marinas, Will Deacon, Naoya Horiguchi, Mark Rutland,
	Hillf Danton, Mike Kravetz, Andrew Morton, Linus Torvalds,
	Vlastimil Babka

From: Punit Agrawal <punit.agrawal@arm.com>

commit d63206ee32b6e64b0e12d46e5d6004afd9913713 upstream.

When speculatively taking references to a hugepage using
page_cache_add_speculative() in gup_huge_pmd(), it is assumed that the
page returned by pmd_page() is the head page.  Although normally true,
this assumption doesn't hold when the hugepage comprises of successive
page table entries such as when using contiguous bit on arm64 at PTE or
PMD levels.

This can be addressed by ensuring that the page passed to
page_cache_add_speculative() is the real head or by de-referencing the
head page within the function.

We take the first approach to keep the usage pattern aligned with
page_cache_get_speculative() where users already pass the appropriate
page, i.e., the de-referenced head.

Apply the same logic to fix gup_huge_[pud|pgd]() as well.

[punit.agrawal@arm.com: fix arm64 ltp failure]
  Link: http://lkml.kernel.org/r/20170619170145.25577-5-punit.agrawal@arm.com
Link: http://lkml.kernel.org/r/20170522133604.11392-3-punit.agrawal@arm.com
Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
Acked-by: Steve Capper <steve.capper@arm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/gup.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 6f9088cb8ebe..71e9d0093a35 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1130,8 +1130,7 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 		return 0;
 
 	refs = 0;
-	head = pmd_page(orig);
-	page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+	page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
 	tail = page;
 	do {
 		pages[*nr] = page;
@@ -1140,6 +1139,7 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 		refs++;
 	} while (addr += PAGE_SIZE, addr != end);
 
+	head = compound_head(pmd_page(orig));
 	if (!page_cache_add_speculative(head, refs)) {
 		*nr -= refs;
 		return 0;
@@ -1176,8 +1176,7 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 		return 0;
 
 	refs = 0;
-	head = pud_page(orig);
-	page = head + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
+	page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
 	tail = page;
 	do {
 		pages[*nr] = page;
@@ -1186,6 +1185,7 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 		refs++;
 	} while (addr += PAGE_SIZE, addr != end);
 
+	head = compound_head(pud_page(orig));
 	if (!page_cache_add_speculative(head, refs)) {
 		*nr -= refs;
 		return 0;
@@ -1218,8 +1218,7 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
 		return 0;
 
 	refs = 0;
-	head = pgd_page(orig);
-	page = head + ((addr & ~PGDIR_MASK) >> PAGE_SHIFT);
+	page = pgd_page(orig) + ((addr & ~PGDIR_MASK) >> PAGE_SHIFT);
 	tail = page;
 	do {
 		pages[*nr] = page;
@@ -1228,6 +1227,7 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
 		refs++;
 	} while (addr += PAGE_SIZE, addr != end);
 
+	head = compound_head(pgd_page(orig));
 	if (!page_cache_add_speculative(head, refs)) {
 		*nr -= refs;
 		return 0;
-- 
2.23.0



^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH STABLE 4.4 3/8] mm: make page ref count overflow check tighter and more explicit
  2019-11-08  9:38 [PATCH STABLE 4.4 0/8] page refcount overflow backports Vlastimil Babka
  2019-11-08  9:38 ` [PATCH STABLE 4.4 1/8] mm, gup: remove broken VM_BUG_ON_PAGE compound check for hugepages Vlastimil Babka
  2019-11-08  9:38 ` [PATCH STABLE 4.4 2/8] mm, gup: ensure real head page is ref-counted when using hugepages Vlastimil Babka
@ 2019-11-08  9:38 ` Vlastimil Babka
  2019-11-08  9:38 ` [PATCH STABLE 4.4 4/8] mm: add 'try_get_page()' helper function Vlastimil Babka
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Vlastimil Babka @ 2019-11-08  9:38 UTC (permalink / raw)
  To: stable
  Cc: linux-mm, linux-kernel, Ajay Kaher, Linus Torvalds,
	Matthew Wilcox, Jann Horn, stable, Vlastimil Babka

From: Linus Torvalds <torvalds@linux-foundation.org>

commit f958d7b528b1b40c44cfda5eabe2d82760d868c3 upstream.

[ 4.4 backport: page_ref_count() doesn't exist, introduce it to reduce churn.
		Change also two similar checks in mm/internal.h		  ]

We have a VM_BUG_ON() to check that the page reference count doesn't
underflow (or get close to overflow) by checking the sign of the count.

That's all fine, but we actually want to allow people to use a "get page
ref unless it's already very high" helper function, and we want that one
to use the sign of the page ref (without triggering this VM_BUG_ON).

Change the VM_BUG_ON to only check for small underflows (or _very_ close
to overflowing), and ignore overflows which have strayed into negative
territory.

Acked-by: Matthew Wilcox <willy@infradead.org>
Cc: Jann Horn <jannh@google.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/mm.h | 11 ++++++++++-
 mm/internal.h      |  5 +++--
 2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ed653ba47c46..997edfcb0a30 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -488,6 +488,15 @@ static inline void get_huge_page_tail(struct page *page)
 
 extern bool __get_page_tail(struct page *page);
 
+static inline int page_ref_count(struct page *page)
+{
+	return atomic_read(&page->_count);
+}
+
+/* 127: arbitrary random number, small enough to assemble well */
+#define page_ref_zero_or_close_to_overflow(page) \
+	((unsigned int) page_ref_count(page) + 127u <= 127u)
+
 static inline void get_page(struct page *page)
 {
 	if (unlikely(PageTail(page)))
@@ -497,7 +506,7 @@ static inline void get_page(struct page *page)
 	 * Getting a normal page or the head of a compound page
 	 * requires to already have an elevated page->_count.
 	 */
-	VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
+	VM_BUG_ON_PAGE(page_ref_zero_or_close_to_overflow(page), page);
 	atomic_inc(&page->_count);
 }
 
diff --git a/mm/internal.h b/mm/internal.h
index f63f4393d633..a6639c72780a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -81,7 +81,8 @@ static inline void __get_page_tail_foll(struct page *page,
 	 * speculative page access (like in
 	 * page_cache_get_speculative()) on tail pages.
 	 */
-	VM_BUG_ON_PAGE(atomic_read(&compound_head(page)->_count) <= 0, page);
+	VM_BUG_ON_PAGE(page_ref_zero_or_close_to_overflow(compound_head(page)),
+									page);
 	if (get_page_head)
 		atomic_inc(&compound_head(page)->_count);
 	get_huge_page_tail(page);
@@ -106,7 +107,7 @@ static inline void get_page_foll(struct page *page)
 		 * Getting a normal page or the head of a compound page
 		 * requires to already have an elevated page->_count.
 		 */
-		VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
+		VM_BUG_ON_PAGE(page_ref_zero_or_close_to_overflow(page), page);
 		atomic_inc(&page->_count);
 	}
 }
-- 
2.23.0



^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH STABLE 4.4 4/8] mm: add 'try_get_page()' helper function
  2019-11-08  9:38 [PATCH STABLE 4.4 0/8] page refcount overflow backports Vlastimil Babka
                   ` (2 preceding siblings ...)
  2019-11-08  9:38 ` [PATCH STABLE 4.4 3/8] mm: make page ref count overflow check tighter and more explicit Vlastimil Babka
@ 2019-11-08  9:38 ` Vlastimil Babka
  2019-11-08  9:38 ` [PATCH STABLE 4.4 5/8] mm: prevent get_user_pages() from overflowing page refcount Vlastimil Babka
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Vlastimil Babka @ 2019-11-08  9:38 UTC (permalink / raw)
  To: stable
  Cc: linux-mm, linux-kernel, Ajay Kaher, Linus Torvalds,
	Matthew Wilcox, Jann Horn, stable, Vlastimil Babka

From: Linus Torvalds <torvalds@linux-foundation.org>

commit 88b1a17dfc3ed7728316478fae0f5ad508f50397 upstream.

[ 4.4 backport: get_page() is more complicated due to special handling
  of tail pages via __get_page_tail(). But in all cases, eventually the
  compound head page's refcount is incremented. So try_get_page() just
  checks compound head's refcount for overflow and then simply calls
  get_page().								 ]

This is the same as the traditional 'get_page()' function, but instead
of unconditionally incrementing the reference count of the page, it only
does so if the count was "safe".  It returns whether the reference count
was incremented (and is marked __must_check, since the caller obviously
has to be aware of it).

Also like 'get_page()', you can't use this function unless you already
had a reference to the page.  The intent is that you can use this
exactly like get_page(), but in situations where you want to limit the
maximum reference count.

The code currently does an unconditional WARN_ON_ONCE() if we ever hit
the reference count issues (either zero or negative), as a notification
that the conditional non-increment actually happened.

NOTE! The count access for the "safety" check is inherently racy, but
that doesn't matter since the buffer we use is basically half the range
of the reference count (ie we look at the sign of the count).

Acked-by: Matthew Wilcox <willy@infradead.org>
Cc: Jann Horn <jannh@google.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/mm.h | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 997edfcb0a30..78358aeb7732 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -510,6 +510,21 @@ static inline void get_page(struct page *page)
 	atomic_inc(&page->_count);
 }
 
+static inline __must_check bool try_get_page(struct page *page)
+{
+	struct page *head = compound_head(page);
+
+	/*
+	 * get_page() increases always head page's refcount, either directly or
+	 * via __get_page_tail() for tail page, so we check that
+	 */
+	if (WARN_ON_ONCE(page_ref_count(head) <= 0))
+		return false;
+
+	get_page(page);
+	return true;
+}
+
 static inline struct page *virt_to_head_page(const void *x)
 {
 	struct page *page = virt_to_page(x);
-- 
2.23.0



^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH STABLE 4.4 5/8] mm: prevent get_user_pages() from overflowing page refcount
  2019-11-08  9:38 [PATCH STABLE 4.4 0/8] page refcount overflow backports Vlastimil Babka
                   ` (3 preceding siblings ...)
  2019-11-08  9:38 ` [PATCH STABLE 4.4 4/8] mm: add 'try_get_page()' helper function Vlastimil Babka
@ 2019-11-08  9:38 ` Vlastimil Babka
  2019-11-08  9:38 ` [PATCH STABLE 4.4 6/8] pipe: add pipe_buf_get() helper Vlastimil Babka
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Vlastimil Babka @ 2019-11-08  9:38 UTC (permalink / raw)
  To: stable
  Cc: linux-mm, linux-kernel, Ajay Kaher, Linus Torvalds, Jann Horn,
	Matthew Wilcox, stable, Vlastimil Babka

From: Linus Torvalds <torvalds@linux-foundation.org>

commit 8fde12ca79aff9b5ba951fce1a2641901b8d8e64 upstream.

[ 4.4 backport: there's get_page_foll(), so add try_get_page()-like checks
                in there, enabled by a new parameter, which is false where
                upstream patch doesn't replace get_page() with try_get_page()
                (the THP and hugetlb callers).
		In gup_pte_range(), we don't expect tail pages, so just check
                page ref count instead of try_get_compound_head()
		Also patch arch-specific variants of gup.c for x86 and s390,
		leaving mips, sh, sparc alone				      ]

If the page refcount wraps around past zero, it will be freed while
there are still four billion references to it.  One of the possible
avenues for an attacker to try to make this happen is by doing direct IO
on a page multiple times.  This patch makes get_user_pages() refuse to
take a new page reference if there are already more than two billion
references to the page.

Reported-by: Jann Horn <jannh@google.com>
Acked-by: Matthew Wilcox <willy@infradead.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 arch/s390/mm/gup.c |  6 ++++--
 arch/x86/mm/gup.c  |  9 ++++++++-
 mm/gup.c           | 39 +++++++++++++++++++++++++++++++--------
 mm/huge_memory.c   |  2 +-
 mm/hugetlb.c       | 18 ++++++++++++++++--
 mm/internal.h      | 12 +++++++++---
 6 files changed, 69 insertions(+), 17 deletions(-)

diff --git a/arch/s390/mm/gup.c b/arch/s390/mm/gup.c
index 7ad41be8b373..bdaa5f7b652c 100644
--- a/arch/s390/mm/gup.c
+++ b/arch/s390/mm/gup.c
@@ -37,7 +37,8 @@ static inline int gup_pte_range(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
 			return 0;
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 		page = pte_page(pte);
-		if (!page_cache_get_speculative(page))
+		if (unlikely(WARN_ON_ONCE(page_ref_count(page) < 0)
+		    || !page_cache_get_speculative(page)))
 			return 0;
 		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
 			put_page(page);
@@ -76,7 +77,8 @@ static inline int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
 		refs++;
 	} while (addr += PAGE_SIZE, addr != end);
 
-	if (!page_cache_add_speculative(head, refs)) {
+	if (unlikely(WARN_ON_ONCE(page_ref_count(head) < 0)
+	    || !page_cache_add_speculative(head, refs))) {
 		*nr -= refs;
 		return 0;
 	}
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 7d2542ad346a..6612d532e42e 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -95,7 +95,10 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
 		}
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 		page = pte_page(pte);
-		get_page(page);
+		if (unlikely(!try_get_page(page))) {
+			pte_unmap(ptep);
+			return 0;
+		}
 		SetPageReferenced(page);
 		pages[*nr] = page;
 		(*nr)++;
@@ -132,6 +135,8 @@ static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
 
 	refs = 0;
 	head = pmd_page(pmd);
+	if (WARN_ON_ONCE(page_ref_count(head) <= 0))
+		return 0;
 	page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
 	do {
 		VM_BUG_ON_PAGE(compound_head(page) != head, page);
@@ -208,6 +213,8 @@ static noinline int gup_huge_pud(pud_t pud, unsigned long addr,
 
 	refs = 0;
 	head = pud_page(pud);
+	if (WARN_ON_ONCE(page_ref_count(head) <= 0))
+		return 0;
 	page = head + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
 	do {
 		VM_BUG_ON_PAGE(compound_head(page) != head, page);
diff --git a/mm/gup.c b/mm/gup.c
index 71e9d0093a35..fc8e2dca99fc 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -127,7 +127,10 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 	}
 
 	if (flags & FOLL_GET)
-		get_page_foll(page);
+		if (!get_page_foll(page, true)) {
+			page = ERR_PTR(-ENOMEM);
+			goto out;
+		}
 	if (flags & FOLL_TOUCH) {
 		if ((flags & FOLL_WRITE) &&
 		    !pte_dirty(pte) && !PageDirty(page))
@@ -289,7 +292,10 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
 			goto unmap;
 		*page = pte_page(*pte);
 	}
-	get_page(*page);
+	if (unlikely(!try_get_page(*page))) {
+		ret = -ENOMEM;
+		goto unmap;
+	}
 out:
 	ret = 0;
 unmap:
@@ -1053,6 +1059,20 @@ struct page *get_dump_page(unsigned long addr)
  */
 #ifdef CONFIG_HAVE_GENERIC_RCU_GUP
 
+/*
+ * Return the compund head page with ref appropriately incremented,
+ * or NULL if that failed.
+ */
+static inline struct page *try_get_compound_head(struct page *page, int refs)
+{
+	struct page *head = compound_head(page);
+	if (WARN_ON_ONCE(page_ref_count(head) < 0))
+		return NULL;
+	if (unlikely(!page_cache_add_speculative(head, refs)))
+		return NULL;
+	return head;
+}
+
 #ifdef __HAVE_ARCH_PTE_SPECIAL
 static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 			 int write, struct page **pages, int *nr)
@@ -1083,6 +1103,9 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 		page = pte_page(pte);
 
+		if (WARN_ON_ONCE(page_ref_count(page) < 0))
+			goto pte_unmap;
+
 		if (!page_cache_get_speculative(page))
 			goto pte_unmap;
 
@@ -1139,8 +1162,8 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 		refs++;
 	} while (addr += PAGE_SIZE, addr != end);
 
-	head = compound_head(pmd_page(orig));
-	if (!page_cache_add_speculative(head, refs)) {
+	head = try_get_compound_head(pmd_page(orig), refs);
+	if (!head) {
 		*nr -= refs;
 		return 0;
 	}
@@ -1185,8 +1208,8 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 		refs++;
 	} while (addr += PAGE_SIZE, addr != end);
 
-	head = compound_head(pud_page(orig));
-	if (!page_cache_add_speculative(head, refs)) {
+	head = try_get_compound_head(pud_page(orig), refs);
+	if (!head) {
 		*nr -= refs;
 		return 0;
 	}
@@ -1227,8 +1250,8 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
 		refs++;
 	} while (addr += PAGE_SIZE, addr != end);
 
-	head = compound_head(pgd_page(orig));
-	if (!page_cache_add_speculative(head, refs)) {
+	head = try_get_compound_head(pgd_page(orig), refs);
+	if (!head) {
 		*nr -= refs;
 		return 0;
 	}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 465786cd6490..6087277981a6 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1322,7 +1322,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
 	VM_BUG_ON_PAGE(!PageCompound(page), page);
 	if (flags & FOLL_GET)
-		get_page_foll(page);
+		get_page_foll(page, false);
 
 out:
 	return page;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index fd932e7a25dd..b4a8a18fa3a5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3886,6 +3886,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long vaddr = *position;
 	unsigned long remainder = *nr_pages;
 	struct hstate *h = hstate_vma(vma);
+	int err = -EFAULT;
 
 	while (vaddr < vma->vm_end && remainder) {
 		pte_t *pte;
@@ -3957,10 +3958,23 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 		pfn_offset = (vaddr & ~huge_page_mask(h)) >> PAGE_SHIFT;
 		page = pte_page(huge_ptep_get(pte));
+
+		/*
+		 * Instead of doing 'try_get_page()' below in the same_page
+		 * loop, just check the count once here.
+		 */
+		if (unlikely(page_count(page) <= 0)) {
+			if (pages) {
+				spin_unlock(ptl);
+				remainder = 0;
+				err = -ENOMEM;
+				break;
+			}
+		}
 same_page:
 		if (pages) {
 			pages[i] = mem_map_offset(page, pfn_offset);
-			get_page_foll(pages[i]);
+			get_page_foll(pages[i], false);
 		}
 
 		if (vmas)
@@ -3983,7 +3997,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	*nr_pages = remainder;
 	*position = vaddr;
 
-	return i ? i : -EFAULT;
+	return i ? i : err;
 }
 
 unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
diff --git a/mm/internal.h b/mm/internal.h
index a6639c72780a..b52041969d06 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -93,23 +93,29 @@ static inline void __get_page_tail_foll(struct page *page,
  * follow_page() and it must be called while holding the proper PT
  * lock while the pte (or pmd_trans_huge) is still mapping the page.
  */
-static inline void get_page_foll(struct page *page)
+static inline bool get_page_foll(struct page *page, bool check)
 {
-	if (unlikely(PageTail(page)))
+	if (unlikely(PageTail(page))) {
 		/*
 		 * This is safe only because
 		 * __split_huge_page_refcount() can't run under
 		 * get_page_foll() because we hold the proper PT lock.
 		 */
+		if (check && WARN_ON_ONCE(
+				page_ref_count(compound_head(page)) <= 0))
+			return false;
 		__get_page_tail_foll(page, true);
-	else {
+	} else {
 		/*
 		 * Getting a normal page or the head of a compound page
 		 * requires to already have an elevated page->_count.
 		 */
 		VM_BUG_ON_PAGE(page_ref_zero_or_close_to_overflow(page), page);
+		if (check && WARN_ON_ONCE(page_ref_count(page) <= 0))
+			return false;
 		atomic_inc(&page->_count);
 	}
+	return true;
 }
 
 extern unsigned long highest_memmap_pfn;
-- 
2.23.0



^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH STABLE 4.4 6/8] pipe: add pipe_buf_get() helper
  2019-11-08  9:38 [PATCH STABLE 4.4 0/8] page refcount overflow backports Vlastimil Babka
                   ` (4 preceding siblings ...)
  2019-11-08  9:38 ` [PATCH STABLE 4.4 5/8] mm: prevent get_user_pages() from overflowing page refcount Vlastimil Babka
@ 2019-11-08  9:38 ` Vlastimil Babka
  2019-11-08  9:38 ` [PATCH STABLE 4.4 7/8] fs: prevent page refcount overflow in pipe_buf_get Vlastimil Babka
  2019-11-08  9:38 ` [PATCH STABLE 4.4 8/8] x86, mm, gup: prevent get_page() race with munmap in paravirt guest Vlastimil Babka
  7 siblings, 0 replies; 9+ messages in thread
From: Vlastimil Babka @ 2019-11-08  9:38 UTC (permalink / raw)
  To: stable
  Cc: linux-mm, linux-kernel, Ajay Kaher, Miklos Szeredi, Al Viro,
	Vlastimil Babka

From: Miklos Szeredi <mszeredi@redhat.com>

commit 7bf2d1df80822ec056363627e2014990f068f7aa upstream.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 fs/fuse/dev.c             |  2 +-
 fs/splice.c               |  4 ++--
 include/linux/pipe_fs_i.h | 11 +++++++++++
 3 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index f5d2d2340b44..36a5df92eb9c 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -2052,7 +2052,7 @@ static ssize_t fuse_dev_splice_write(struct pipe_inode_info *pipe,
 			pipe->curbuf = (pipe->curbuf + 1) & (pipe->buffers - 1);
 			pipe->nrbufs--;
 		} else {
-			ibuf->ops->get(pipe, ibuf);
+			pipe_buf_get(pipe, ibuf);
 			*obuf = *ibuf;
 			obuf->flags &= ~PIPE_BUF_FLAG_GIFT;
 			obuf->len = rem;
diff --git a/fs/splice.c b/fs/splice.c
index 8398974e1538..fde126369966 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1876,7 +1876,7 @@ static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe,
 			 * Get a reference to this pipe buffer,
 			 * so we can copy the contents over.
 			 */
-			ibuf->ops->get(ipipe, ibuf);
+			pipe_buf_get(ipipe, ibuf);
 			*obuf = *ibuf;
 
 			/*
@@ -1948,7 +1948,7 @@ static int link_pipe(struct pipe_inode_info *ipipe,
 		 * Get a reference to this pipe buffer,
 		 * so we can copy the contents over.
 		 */
-		ibuf->ops->get(ipipe, ibuf);
+		pipe_buf_get(ipipe, ibuf);
 
 		obuf = opipe->bufs + nbuf;
 		*obuf = *ibuf;
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index 24f5470d3944..10876f3cb3da 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -115,6 +115,17 @@ struct pipe_buf_operations {
 	void (*get)(struct pipe_inode_info *, struct pipe_buffer *);
 };
 
+/**
+ * pipe_buf_get - get a reference to a pipe_buffer
+ * @pipe:	the pipe that the buffer belongs to
+ * @buf:	the buffer to get a reference to
+ */
+static inline void pipe_buf_get(struct pipe_inode_info *pipe,
+				struct pipe_buffer *buf)
+{
+	buf->ops->get(pipe, buf);
+}
+
 /* Differs from PIPE_BUF in that PIPE_SIZE is the length of the actual
    memory allocation, whereas PIPE_BUF makes atomicity guarantees.  */
 #define PIPE_SIZE		PAGE_SIZE
-- 
2.23.0



^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH STABLE 4.4 7/8] fs: prevent page refcount overflow in pipe_buf_get
  2019-11-08  9:38 [PATCH STABLE 4.4 0/8] page refcount overflow backports Vlastimil Babka
                   ` (5 preceding siblings ...)
  2019-11-08  9:38 ` [PATCH STABLE 4.4 6/8] pipe: add pipe_buf_get() helper Vlastimil Babka
@ 2019-11-08  9:38 ` Vlastimil Babka
  2019-11-08  9:38 ` [PATCH STABLE 4.4 8/8] x86, mm, gup: prevent get_page() race with munmap in paravirt guest Vlastimil Babka
  7 siblings, 0 replies; 9+ messages in thread
From: Vlastimil Babka @ 2019-11-08  9:38 UTC (permalink / raw)
  To: stable
  Cc: linux-mm, linux-kernel, Ajay Kaher, Matthew Wilcox, Jann Horn,
	stable, Linus Torvalds, Vlastimil Babka

From: Matthew Wilcox <willy@infradead.org>

commit 15fab63e1e57be9fdb5eec1bbc5916e9825e9acb upstream.

Change pipe_buf_get() to return a bool indicating whether it succeeded
in raising the refcount of the page (if the thing in the pipe is a page).
This removes another mechanism for overflowing the page refcount.  All
callers converted to handle a failure.

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 fs/fuse/dev.c             | 12 ++++++------
 fs/pipe.c                 |  4 ++--
 fs/splice.c               | 12 ++++++++++--
 include/linux/pipe_fs_i.h | 10 ++++++----
 kernel/trace/trace.c      |  6 +++++-
 5 files changed, 29 insertions(+), 15 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 36a5df92eb9c..16891f5364af 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -2031,10 +2031,8 @@ static ssize_t fuse_dev_splice_write(struct pipe_inode_info *pipe,
 		rem += pipe->bufs[(pipe->curbuf + idx) & (pipe->buffers - 1)].len;
 
 	ret = -EINVAL;
-	if (rem < len) {
-		pipe_unlock(pipe);
-		goto out;
-	}
+	if (rem < len)
+		goto out_free;
 
 	rem = len;
 	while (rem) {
@@ -2052,7 +2050,9 @@ static ssize_t fuse_dev_splice_write(struct pipe_inode_info *pipe,
 			pipe->curbuf = (pipe->curbuf + 1) & (pipe->buffers - 1);
 			pipe->nrbufs--;
 		} else {
-			pipe_buf_get(pipe, ibuf);
+			if (!pipe_buf_get(pipe, ibuf))
+				goto out_free;
+
 			*obuf = *ibuf;
 			obuf->flags &= ~PIPE_BUF_FLAG_GIFT;
 			obuf->len = rem;
@@ -2075,13 +2075,13 @@ static ssize_t fuse_dev_splice_write(struct pipe_inode_info *pipe,
 	ret = fuse_dev_do_write(fud, &cs, len);
 
 	pipe_lock(pipe);
+out_free:
 	for (idx = 0; idx < nbuf; idx++) {
 		struct pipe_buffer *buf = &bufs[idx];
 		buf->ops->release(pipe, buf);
 	}
 	pipe_unlock(pipe);
 
-out:
 	kfree(bufs);
 	return ret;
 }
diff --git a/fs/pipe.c b/fs/pipe.c
index 1e7263bb837a..6534470a6c19 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -178,9 +178,9 @@ EXPORT_SYMBOL(generic_pipe_buf_steal);
  *	in the tee() system call, when we duplicate the buffers in one
  *	pipe into another.
  */
-void generic_pipe_buf_get(struct pipe_inode_info *pipe, struct pipe_buffer *buf)
+bool generic_pipe_buf_get(struct pipe_inode_info *pipe, struct pipe_buffer *buf)
 {
-	page_cache_get(buf->page);
+	return try_get_page(buf->page);
 }
 EXPORT_SYMBOL(generic_pipe_buf_get);
 
diff --git a/fs/splice.c b/fs/splice.c
index fde126369966..57ccc583a172 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1876,7 +1876,11 @@ static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe,
 			 * Get a reference to this pipe buffer,
 			 * so we can copy the contents over.
 			 */
-			pipe_buf_get(ipipe, ibuf);
+			if (!pipe_buf_get(ipipe, ibuf)) {
+				if (ret == 0)
+					ret = -EFAULT;
+				break;
+			}
 			*obuf = *ibuf;
 
 			/*
@@ -1948,7 +1952,11 @@ static int link_pipe(struct pipe_inode_info *ipipe,
 		 * Get a reference to this pipe buffer,
 		 * so we can copy the contents over.
 		 */
-		pipe_buf_get(ipipe, ibuf);
+		if (!pipe_buf_get(ipipe, ibuf)) {
+			if (ret == 0)
+				ret = -EFAULT;
+			break;
+		}
 
 		obuf = opipe->bufs + nbuf;
 		*obuf = *ibuf;
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index 10876f3cb3da..0b28b65c12fb 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -112,18 +112,20 @@ struct pipe_buf_operations {
 	/*
 	 * Get a reference to the pipe buffer.
 	 */
-	void (*get)(struct pipe_inode_info *, struct pipe_buffer *);
+	bool (*get)(struct pipe_inode_info *, struct pipe_buffer *);
 };
 
 /**
  * pipe_buf_get - get a reference to a pipe_buffer
  * @pipe:	the pipe that the buffer belongs to
  * @buf:	the buffer to get a reference to
+ *
+ * Return: %true if the reference was successfully obtained.
  */
-static inline void pipe_buf_get(struct pipe_inode_info *pipe,
+static inline __must_check bool pipe_buf_get(struct pipe_inode_info *pipe,
 				struct pipe_buffer *buf)
 {
-	buf->ops->get(pipe, buf);
+	return buf->ops->get(pipe, buf);
 }
 
 /* Differs from PIPE_BUF in that PIPE_SIZE is the length of the actual
@@ -148,7 +150,7 @@ struct pipe_inode_info *alloc_pipe_info(void);
 void free_pipe_info(struct pipe_inode_info *);
 
 /* Generic pipe buffer ops functions */
-void generic_pipe_buf_get(struct pipe_inode_info *, struct pipe_buffer *);
+bool generic_pipe_buf_get(struct pipe_inode_info *, struct pipe_buffer *);
 int generic_pipe_buf_confirm(struct pipe_inode_info *, struct pipe_buffer *);
 int generic_pipe_buf_steal(struct pipe_inode_info *, struct pipe_buffer *);
 void generic_pipe_buf_release(struct pipe_inode_info *, struct pipe_buffer *);
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index c6e4e3e7f685..32cc4ea93ad6 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -5748,12 +5748,16 @@ static void buffer_pipe_buf_release(struct pipe_inode_info *pipe,
 	buf->private = 0;
 }
 
-static void buffer_pipe_buf_get(struct pipe_inode_info *pipe,
+static bool buffer_pipe_buf_get(struct pipe_inode_info *pipe,
 				struct pipe_buffer *buf)
 {
 	struct buffer_ref *ref = (struct buffer_ref *)buf->private;
 
+	if (ref->ref > INT_MAX/2)
+		return false;
+
 	ref->ref++;
+	return true;
 }
 
 /* Pipe buffer operations for a buffer. */
-- 
2.23.0



^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH STABLE 4.4 8/8] x86, mm, gup: prevent get_page() race with munmap in paravirt guest
  2019-11-08  9:38 [PATCH STABLE 4.4 0/8] page refcount overflow backports Vlastimil Babka
                   ` (6 preceding siblings ...)
  2019-11-08  9:38 ` [PATCH STABLE 4.4 7/8] fs: prevent page refcount overflow in pipe_buf_get Vlastimil Babka
@ 2019-11-08  9:38 ` Vlastimil Babka
  7 siblings, 0 replies; 9+ messages in thread
From: Vlastimil Babka @ 2019-11-08  9:38 UTC (permalink / raw)
  To: stable
  Cc: linux-mm, linux-kernel, Ajay Kaher, Vlastimil Babka,
	Oscar Salvador, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	Juergen Gross, Kirill A . Shutemov, Vitaly Kuznetsov,
	Linus Torvalds, Borislav Petkov, Dave Hansen, Andy Lutomirski

The x86 version of get_user_pages_fast() relies on disabled interrupts to
synchronize gup_pte_range() between gup_get_pte(ptep); and get_page() against
a parallel munmap. The munmap side nulls the pte, then flushes TLBs, then
releases the page. As TLB flush is done synchronously via IPI disabling
interrupts blocks the page release, and get_page(), which assumes existing
reference on page, is thus safe.
However when TLB flush is done by a hypercall, e.g. in a Xen PV guest, there is
no blocking thanks to disabled interrupts, and get_page() can succeed on a page
that was already freed or even reused.

We have recently seen this happen with our 4.4 and 4.12 based kernels, with
userspace (java) that exits a thread, where mm_release() performs a futex_wake()
on tsk->clear_child_tid, and another thread in parallel unmaps the page where
tsk->clear_child_tid points to. The spurious get_page() succeeds, but futex code
immediately releases the page again, while it's already on a freelist. Symptoms
include a bad page state warning, general protection faults acessing a poisoned
list prev/next pointer in the freelist, or free page pcplists of two cpus joined
together in a single list. Oscar has also reproduced this scenario, with a
patch inserting delays before the get_page() to make the race window larger.

Fix this by removing the dependency on TLB flush interrupts the same way as the
generic get_user_pages_fast() code by using page_cache_add_speculative() and
revalidating the PTE contents after pinning the page. Mainline is safe since
4.13 where the x86 gup code was removed in favor of the common code. Accessing
the page table itself safely also relies on disabled interrupts and TLB flush
IPIs that don't happen with hypercalls, which was acknowledged in commit
9e52fc2b50de ("x86/mm: Enable RCU based page table freeing
(CONFIG_HAVE_RCU_TABLE_FREE=y)"). That commit with follups should also be
backported for full safety, although our reproducer didn't hit a problem
without that backport.

Reproduced-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 arch/x86/mm/gup.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 6612d532e42e..6379a4883c0a 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -9,6 +9,7 @@
 #include <linux/vmstat.h>
 #include <linux/highmem.h>
 #include <linux/swap.h>
+#include <linux/pagemap.h>
 
 #include <asm/pgtable.h>
 
@@ -95,10 +96,23 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
 		}
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 		page = pte_page(pte);
-		if (unlikely(!try_get_page(page))) {
+
+		if (WARN_ON_ONCE(page_ref_count(page) < 0)) {
+			pte_unmap(ptep);
+			return 0;
+		}
+
+		if (!page_cache_get_speculative(page)) {
 			pte_unmap(ptep);
 			return 0;
 		}
+
+		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
+			put_page(page);
+			pte_unmap(ptep);
+			return 0;
+		}
+
 		SetPageReferenced(page);
 		pages[*nr] = page;
 		(*nr)++;
-- 
2.23.0



^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, back to index

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-08  9:38 [PATCH STABLE 4.4 0/8] page refcount overflow backports Vlastimil Babka
2019-11-08  9:38 ` [PATCH STABLE 4.4 1/8] mm, gup: remove broken VM_BUG_ON_PAGE compound check for hugepages Vlastimil Babka
2019-11-08  9:38 ` [PATCH STABLE 4.4 2/8] mm, gup: ensure real head page is ref-counted when using hugepages Vlastimil Babka
2019-11-08  9:38 ` [PATCH STABLE 4.4 3/8] mm: make page ref count overflow check tighter and more explicit Vlastimil Babka
2019-11-08  9:38 ` [PATCH STABLE 4.4 4/8] mm: add 'try_get_page()' helper function Vlastimil Babka
2019-11-08  9:38 ` [PATCH STABLE 4.4 5/8] mm: prevent get_user_pages() from overflowing page refcount Vlastimil Babka
2019-11-08  9:38 ` [PATCH STABLE 4.4 6/8] pipe: add pipe_buf_get() helper Vlastimil Babka
2019-11-08  9:38 ` [PATCH STABLE 4.4 7/8] fs: prevent page refcount overflow in pipe_buf_get Vlastimil Babka
2019-11-08  9:38 ` [PATCH STABLE 4.4 8/8] x86, mm, gup: prevent get_page() race with munmap in paravirt guest Vlastimil Babka

Linux-mm Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-mm/0 linux-mm/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-mm linux-mm/ https://lore.kernel.org/linux-mm \
		linux-mm@kvack.org
	public-inbox-index linux-mm

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kvack.linux-mm


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git