* incoming
@ 2021-06-25  1:38 Andrew Morton
  2021-06-25  1:39 ` [patch 01/24] mm: page_vma_mapped_walk(): use page for pvmw->page Andrew Morton
                   ` (23 more replies)
  0 siblings, 24 replies; 25+ messages in thread
From: Andrew Morton @ 2021-06-25  1:38 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: mm-commits, linux-mm

24 patches, based on 4a09d388f2ab382f217a764e6a152b3f614246f6.

Subsystems affected by this patch series:

  mm/thp
  nilfs2
  mm/vmalloc
  kthread
  mm/hugetlb
  mm/memory-failure
  mm/pagealloc
  MAINTAINERS
  mailmap

Subsystem: mm/thp

    Hugh Dickins <hughd@google.com>:
    Patch series "mm: page_vma_mapped_walk() cleanup and THP fixes":
      mm: page_vma_mapped_walk(): use page for pvmw->page
      mm: page_vma_mapped_walk(): settle PageHuge on entry
      mm: page_vma_mapped_walk(): use pmde for *pvmw->pmd
      mm: page_vma_mapped_walk(): prettify PVMW_MIGRATION block
      mm: page_vma_mapped_walk(): crossing page table boundary
      mm: page_vma_mapped_walk(): add a level of indentation
      mm: page_vma_mapped_walk(): use goto instead of while (1)
      mm: page_vma_mapped_walk(): get vma_address_end() earlier
      mm/thp: fix page_vma_mapped_walk() if THP mapped by ptes
      mm/thp: another PVMW_SYNC fix in page_vma_mapped_walk()

Subsystem: nilfs2

    Pavel Skripkin <paskripkin@gmail.com>:
      nilfs2: fix memory leak in nilfs_sysfs_delete_device_group

Subsystem: mm/vmalloc

    Claudio Imbrenda <imbrenda@linux.ibm.com>:
    Patch series "mm: add vmalloc_no_huge and use it", v4:
      mm/vmalloc: add vmalloc_no_huge
      KVM: s390: prepare for hugepage vmalloc

    Daniel Axtens <dja@axtens.net>:
      mm/vmalloc: unbreak kasan vmalloc support

Subsystem: kthread

    Petr Mladek <pmladek@suse.com>:
    Patch series "kthread_worker: Fix race between kthread_mod_delayed_work() and kthread_cancel_delayed_work_sync()":
      kthread_worker: split code for canceling the delayed work timer
      kthread: prevent deadlock when kthread_mod_delayed_work() races with kthread_cancel_delayed_work_sync()

Subsystem: mm/hugetlb

    Hugh Dickins <hughd@google.com>:
      mm, futex: fix shared futex pgoff on shmem huge page

Subsystem: mm/memory-failure

    Tony Luck <tony.luck@intel.com>:
    Patch series "mm,hwpoison: fix sending SIGBUS for Action Required MCE", v5:
      mm/memory-failure: use a mutex to avoid memory_failure() races

    Aili Yao <yaoaili@kingsoft.com>:
      mm,hwpoison: return -EHWPOISON to denote that the page has already been poisoned

    Naoya Horiguchi <naoya.horiguchi@nec.com>:
      mm/hwpoison: do not lock page again when me_huge_page() successfully recovers

Subsystem: mm/pagealloc

    Rasmus Villemoes <linux@rasmusvillemoes.dk>:
      mm/page_alloc: __alloc_pages_bulk(): do bounds check before accessing array

    Mel Gorman <mgorman@techsingularity.net>:
      mm/page_alloc: do bulk array bounds check after checking populated elements

Subsystem: MAINTAINERS

    Marek Behún <kabel@kernel.org>:
      MAINTAINERS: fix Marek's identity again

Subsystem: mailmap

    Marek Behún <kabel@kernel.org>:
      mailmap: add Marek's other e-mail address and identity without diacritics

 .mailmap                |    2 
 MAINTAINERS             |    4 
 arch/s390/kvm/pv.c      |    7 +
 fs/nilfs2/sysfs.c       |    1 
 include/linux/hugetlb.h |   16 ---
 include/linux/pagemap.h |   13 +-
 include/linux/vmalloc.h |    1 
 kernel/futex.c          |    3 
 kernel/kthread.c        |   81 ++++++++++------
 mm/hugetlb.c            |    5 -
 mm/memory-failure.c     |   83 +++++++++++------
 mm/page_alloc.c         |    6 +
 mm/page_vma_mapped.c    |  233 +++++++++++++++++++++++++++---------------------
 mm/vmalloc.c            |   41 ++++++--
 14 files changed, 297 insertions(+), 199 deletions(-)



* [patch 01/24] mm: page_vma_mapped_walk(): use page for pvmw->page
  2021-06-25  1:38 incoming Andrew Morton
@ 2021-06-25  1:39 ` Andrew Morton
  2021-06-25  1:39 ` [patch 02/24] mm: page_vma_mapped_walk(): settle PageHuge on entry Andrew Morton
                   ` (22 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Andrew Morton @ 2021-06-25  1:39 UTC (permalink / raw)
  To: akpm, apopple, hughd, kirill.shutemov, linux-mm, mm-commits,
	peterx, rcampbell, shy828301, stable, torvalds, wangyugui, will,
	willy, ziy

From: Hugh Dickins <hughd@google.com>
Subject: mm: page_vma_mapped_walk(): use page for pvmw->page

Patch series "mm: page_vma_mapped_walk() cleanup and THP fixes".

I've marked all of these for stable: many are merely cleanups,
but I think they are much better before the main fix than after.


This patch (of 11):

page_vma_mapped_walk() cleanup: sometimes the local copy of pvmw->page was
used, sometimes pvmw->page itself: use the local copy "page" throughout.

Link: https://lkml.kernel.org/r/589b358c-febc-c88e-d4c2-7834b37fa7bf@google.com
Link: https://lkml.kernel.org/r/88e67645-f467-c279-bf5e-af4b5c6b13eb@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Will Deacon <will@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_vma_mapped.c |    9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

--- a/mm/page_vma_mapped.c~mm-page_vma_mapped_walk-use-page-for-pvmw-page
+++ a/mm/page_vma_mapped.c
@@ -156,7 +156,7 @@ bool page_vma_mapped_walk(struct page_vm
 	if (pvmw->pte)
 		goto next_pte;
 
-	if (unlikely(PageHuge(pvmw->page))) {
+	if (unlikely(PageHuge(page))) {
 		/* when pud is not present, pte will be NULL */
 		pvmw->pte = huge_pte_offset(mm, pvmw->address, page_size(page));
 		if (!pvmw->pte)
@@ -217,8 +217,7 @@ restart:
 		 * cannot return prematurely, while zap_huge_pmd() has
 		 * cleared *pmd but not decremented compound_mapcount().
 		 */
-		if ((pvmw->flags & PVMW_SYNC) &&
-		    PageTransCompound(pvmw->page)) {
+		if ((pvmw->flags & PVMW_SYNC) && PageTransCompound(page)) {
 			spinlock_t *ptl = pmd_lock(mm, pvmw->pmd);
 
 			spin_unlock(ptl);
@@ -234,9 +233,9 @@ restart:
 			return true;
 next_pte:
 		/* Seek to next pte only makes sense for THP */
-		if (!PageTransHuge(pvmw->page) || PageHuge(pvmw->page))
+		if (!PageTransHuge(page) || PageHuge(page))
 			return not_found(pvmw);
-		end = vma_address_end(pvmw->page, pvmw->vma);
+		end = vma_address_end(page, pvmw->vma);
 		do {
 			pvmw->address += PAGE_SIZE;
 			if (pvmw->address >= end)
_


* [patch 02/24] mm: page_vma_mapped_walk(): settle PageHuge on entry
  2021-06-25  1:38 incoming Andrew Morton
  2021-06-25  1:39 ` [patch 01/24] mm: page_vma_mapped_walk(): use page for pvmw->page Andrew Morton
@ 2021-06-25  1:39 ` Andrew Morton
  2021-06-25  1:39 ` [patch 03/24] mm: page_vma_mapped_walk(): use pmde for *pvmw->pmd Andrew Morton
                   ` (21 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Andrew Morton @ 2021-06-25  1:39 UTC (permalink / raw)
  To: akpm, apopple, hughd, kirill.shutemov, linux-mm, mm-commits,
	peterx, rcampbell, shy828301, stable, torvalds, wangyugui, will,
	willy, ziy

From: Hugh Dickins <hughd@google.com>
Subject: mm: page_vma_mapped_walk(): settle PageHuge on entry

page_vma_mapped_walk() cleanup: get the hugetlbfs PageHuge case out of the
way at the start, so no need to worry about it later.

Link: https://lkml.kernel.org/r/e31a483c-6d73-a6bb-26c5-43c3b880a2@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_vma_mapped.c |   12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

--- a/mm/page_vma_mapped.c~mm-page_vma_mapped_walk-settle-pagehuge-on-entry
+++ a/mm/page_vma_mapped.c
@@ -153,10 +153,11 @@ bool page_vma_mapped_walk(struct page_vm
 	if (pvmw->pmd && !pvmw->pte)
 		return not_found(pvmw);
 
-	if (pvmw->pte)
-		goto next_pte;
-
 	if (unlikely(PageHuge(page))) {
+		/* The only possible mapping was handled on last iteration */
+		if (pvmw->pte)
+			return not_found(pvmw);
+
 		/* when pud is not present, pte will be NULL */
 		pvmw->pte = huge_pte_offset(mm, pvmw->address, page_size(page));
 		if (!pvmw->pte)
@@ -168,6 +169,9 @@ bool page_vma_mapped_walk(struct page_vm
 			return not_found(pvmw);
 		return true;
 	}
+
+	if (pvmw->pte)
+		goto next_pte;
 restart:
 	pgd = pgd_offset(mm, pvmw->address);
 	if (!pgd_present(*pgd))
@@ -233,7 +237,7 @@ restart:
 			return true;
 next_pte:
 		/* Seek to next pte only makes sense for THP */
-		if (!PageTransHuge(page) || PageHuge(page))
+		if (!PageTransHuge(page))
 			return not_found(pvmw);
 		end = vma_address_end(page, pvmw->vma);
 		do {
_


* [patch 03/24] mm: page_vma_mapped_walk(): use pmde for *pvmw->pmd
  2021-06-25  1:38 incoming Andrew Morton
  2021-06-25  1:39 ` [patch 01/24] mm: page_vma_mapped_walk(): use page for pvmw->page Andrew Morton
  2021-06-25  1:39 ` [patch 02/24] mm: page_vma_mapped_walk(): settle PageHuge on entry Andrew Morton
@ 2021-06-25  1:39 ` Andrew Morton
  2021-06-25  1:39 ` [patch 04/24] mm: page_vma_mapped_walk(): prettify PVMW_MIGRATION block Andrew Morton
                   ` (20 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Andrew Morton @ 2021-06-25  1:39 UTC (permalink / raw)
  To: akpm, apopple, hughd, kirill.shutemov, linux-mm, mm-commits,
	peterx, rcampbell, shy828301, stable, torvalds, wangyugui, will,
	willy, ziy

From: Hugh Dickins <hughd@google.com>
Subject: mm: page_vma_mapped_walk(): use pmde for *pvmw->pmd

page_vma_mapped_walk() cleanup: re-evaluate pmde after taking lock, then
use it in subsequent tests, instead of repeatedly dereferencing pointer.

Link: https://lkml.kernel.org/r/53fbc9d-891e-46b2-cb4b-468c3b19238e@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_vma_mapped.c |   11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

--- a/mm/page_vma_mapped.c~mm-page_vma_mapped_walk-use-pmde-for-pvmw-pmd
+++ a/mm/page_vma_mapped.c
@@ -191,18 +191,19 @@ restart:
 	pmde = READ_ONCE(*pvmw->pmd);
 	if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) {
 		pvmw->ptl = pmd_lock(mm, pvmw->pmd);
-		if (likely(pmd_trans_huge(*pvmw->pmd))) {
+		pmde = *pvmw->pmd;
+		if (likely(pmd_trans_huge(pmde))) {
 			if (pvmw->flags & PVMW_MIGRATION)
 				return not_found(pvmw);
-			if (pmd_page(*pvmw->pmd) != page)
+			if (pmd_page(pmde) != page)
 				return not_found(pvmw);
 			return true;
-		} else if (!pmd_present(*pvmw->pmd)) {
+		} else if (!pmd_present(pmde)) {
 			if (thp_migration_supported()) {
 				if (!(pvmw->flags & PVMW_MIGRATION))
 					return not_found(pvmw);
-				if (is_migration_entry(pmd_to_swp_entry(*pvmw->pmd))) {
-					swp_entry_t entry = pmd_to_swp_entry(*pvmw->pmd);
+				if (is_migration_entry(pmd_to_swp_entry(pmde))) {
+					swp_entry_t entry = pmd_to_swp_entry(pmde);
 
 					if (migration_entry_to_page(entry) != page)
 						return not_found(pvmw);
_


* [patch 04/24] mm: page_vma_mapped_walk(): prettify PVMW_MIGRATION block
  2021-06-25  1:38 incoming Andrew Morton
                   ` (2 preceding siblings ...)
  2021-06-25  1:39 ` [patch 03/24] mm: page_vma_mapped_walk(): use pmde for *pvmw->pmd Andrew Morton
@ 2021-06-25  1:39 ` Andrew Morton
  2021-06-25  1:39 ` [patch 05/24] mm: page_vma_mapped_walk(): crossing page table boundary Andrew Morton
                   ` (19 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Andrew Morton @ 2021-06-25  1:39 UTC (permalink / raw)
  To: akpm, apopple, hughd, kirill.shutemov, linux-mm, mm-commits,
	peterx, rcampbell, shy828301, stable, torvalds, wangyugui, will,
	willy, ziy

From: Hugh Dickins <hughd@google.com>
Subject: mm: page_vma_mapped_walk(): prettify PVMW_MIGRATION block

page_vma_mapped_walk() cleanup: rearrange the !pmd_present() block to
follow the same "return not_found, return not_found, return true" pattern
as the block above it (note: returning not_found there is never premature,
since existence or prior existence of huge pmd guarantees good alignment).

Link: https://lkml.kernel.org/r/378c8650-1488-2edf-9647-32a53cf2e21@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_vma_mapped.c |   30 ++++++++++++++----------------
 1 file changed, 14 insertions(+), 16 deletions(-)

--- a/mm/page_vma_mapped.c~mm-page_vma_mapped_walk-prettify-pvmw_migration-block
+++ a/mm/page_vma_mapped.c
@@ -198,24 +198,22 @@ restart:
 			if (pmd_page(pmde) != page)
 				return not_found(pvmw);
 			return true;
-		} else if (!pmd_present(pmde)) {
-			if (thp_migration_supported()) {
-				if (!(pvmw->flags & PVMW_MIGRATION))
-					return not_found(pvmw);
-				if (is_migration_entry(pmd_to_swp_entry(pmde))) {
-					swp_entry_t entry = pmd_to_swp_entry(pmde);
+		}
+		if (!pmd_present(pmde)) {
+			swp_entry_t entry;
 
-					if (migration_entry_to_page(entry) != page)
-						return not_found(pvmw);
-					return true;
-				}
-			}
-			return not_found(pvmw);
-		} else {
-			/* THP pmd was split under us: handle on pte level */
-			spin_unlock(pvmw->ptl);
-			pvmw->ptl = NULL;
+			if (!thp_migration_supported() ||
+			    !(pvmw->flags & PVMW_MIGRATION))
+				return not_found(pvmw);
+			entry = pmd_to_swp_entry(pmde);
+			if (!is_migration_entry(entry) ||
+			    migration_entry_to_page(entry) != page)
+				return not_found(pvmw);
+			return true;
 		}
+		/* THP pmd was split under us: handle on pte level */
+		spin_unlock(pvmw->ptl);
+		pvmw->ptl = NULL;
 	} else if (!pmd_present(pmde)) {
 		/*
 		 * If PVMW_SYNC, take and drop THP pmd lock so that we
_


* [patch 05/24] mm: page_vma_mapped_walk(): crossing page table boundary
  2021-06-25  1:38 incoming Andrew Morton
                   ` (3 preceding siblings ...)
  2021-06-25  1:39 ` [patch 04/24] mm: page_vma_mapped_walk(): prettify PVMW_MIGRATION block Andrew Morton
@ 2021-06-25  1:39 ` Andrew Morton
  2021-06-25  1:39 ` [patch 06/24] mm: page_vma_mapped_walk(): add a level of indentation Andrew Morton
                   ` (18 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Andrew Morton @ 2021-06-25  1:39 UTC (permalink / raw)
  To: akpm, apopple, hughd, kirill.shutemov, linux-mm, mm-commits,
	peterx, rcampbell, shy828301, stable, torvalds, wangyugui, will,
	willy, ziy

From: Hugh Dickins <hughd@google.com>
Subject: mm: page_vma_mapped_walk(): crossing page table boundary

page_vma_mapped_walk() cleanup: adjust the test for crossing page table
boundary - I believe pvmw->address is always page-aligned, but nothing
else here assumed that; and remember to reset pvmw->pte to NULL after
unmapping the page table, though I never saw any bug from that.

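For page-aligned addresses the old and new boundary tests agree; the sketch
below (illustrative userspace code, not part of the patch, assuming 4kB
pages and 2MB PMDs) shows that only the new form also tolerates a sub-page
offset:

/* Illustrative model of the old and new "crossed a PMD boundary?" tests. */
#include <assert.h>

#define DEMO_PAGE_SIZE  0x1000UL        /* assumed 4kB pages */
#define DEMO_PMD_SIZE   0x200000UL      /* assumed 2MB PMDs  */

static int old_test(unsigned long addr)
{
        return addr % DEMO_PMD_SIZE == 0;
}

static int new_test(unsigned long addr)
{
        return (addr & (DEMO_PMD_SIZE - DEMO_PAGE_SIZE)) == 0;
}

int main(void)
{
        /* Page-aligned addresses: the two tests agree. */
        assert(old_test(0x200000UL) && new_test(0x200000UL));
        assert(!old_test(0x201000UL) && !new_test(0x201000UL));
        /* Sub-page offset: only the new test still reports a boundary. */
        assert(!old_test(0x200123UL) && new_test(0x200123UL));
        return 0;
}
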
Link: https://lkml.kernel.org/r/799b3f9c-2a9e-dfef-5d89-26e9f76fd97@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_vma_mapped.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

--- a/mm/page_vma_mapped.c~mm-page_vma_mapped_walk-crossing-page-table-boundary
+++ a/mm/page_vma_mapped.c
@@ -244,16 +244,16 @@ next_pte:
 			if (pvmw->address >= end)
 				return not_found(pvmw);
 			/* Did we cross page table boundary? */
-			if (pvmw->address % PMD_SIZE == 0) {
-				pte_unmap(pvmw->pte);
+			if ((pvmw->address & (PMD_SIZE - PAGE_SIZE)) == 0) {
 				if (pvmw->ptl) {
 					spin_unlock(pvmw->ptl);
 					pvmw->ptl = NULL;
 				}
+				pte_unmap(pvmw->pte);
+				pvmw->pte = NULL;
 				goto restart;
-			} else {
-				pvmw->pte++;
 			}
+			pvmw->pte++;
 		} while (pte_none(*pvmw->pte));
 
 		if (!pvmw->ptl) {
_


* [patch 06/24] mm: page_vma_mapped_walk(): add a level of indentation
  2021-06-25  1:38 incoming Andrew Morton
                   ` (4 preceding siblings ...)
  2021-06-25  1:39 ` [patch 05/24] mm: page_vma_mapped_walk(): crossing page table boundary Andrew Morton
@ 2021-06-25  1:39 ` Andrew Morton
  2021-06-25  1:39 ` [patch 07/24] mm: page_vma_mapped_walk(): use goto instead of while (1) Andrew Morton
                   ` (17 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Andrew Morton @ 2021-06-25  1:39 UTC (permalink / raw)
  To: akpm, apopple, hughd, kirill.shutemov, linux-mm, mm-commits,
	peterx, rcampbell, shy828301, stable, torvalds, wangyugui, will,
	willy, ziy

From: Hugh Dickins <hughd@google.com>
Subject: mm: page_vma_mapped_walk(): add a level of indentation

page_vma_mapped_walk() cleanup: add a level of indentation to much of the
body, making no functional change in this commit, but reducing the later
diff when this is all converted to a loop.

[hughd@google.com: page_vma_mapped_walk(): add a level of indentation fix]
  Link: https://lkml.kernel.org/r/7f817555-3ce1-c785-e438-87d8efdcaf26@google.com
Link: https://lkml.kernel.org/r/efde211-f3e2-fe54-977-ef481419e7f3@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_vma_mapped.c |  105 +++++++++++++++++++++--------------------
 1 file changed, 55 insertions(+), 50 deletions(-)

--- a/mm/page_vma_mapped.c~mm-page_vma_mapped_walk-add-a-level-of-indentation
+++ a/mm/page_vma_mapped.c
@@ -173,62 +173,67 @@ bool page_vma_mapped_walk(struct page_vm
 	if (pvmw->pte)
 		goto next_pte;
 restart:
-	pgd = pgd_offset(mm, pvmw->address);
-	if (!pgd_present(*pgd))
-		return false;
-	p4d = p4d_offset(pgd, pvmw->address);
-	if (!p4d_present(*p4d))
-		return false;
-	pud = pud_offset(p4d, pvmw->address);
-	if (!pud_present(*pud))
-		return false;
-	pvmw->pmd = pmd_offset(pud, pvmw->address);
-	/*
-	 * Make sure the pmd value isn't cached in a register by the
-	 * compiler and used as a stale value after we've observed a
-	 * subsequent update.
-	 */
-	pmde = READ_ONCE(*pvmw->pmd);
-	if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) {
-		pvmw->ptl = pmd_lock(mm, pvmw->pmd);
-		pmde = *pvmw->pmd;
-		if (likely(pmd_trans_huge(pmde))) {
-			if (pvmw->flags & PVMW_MIGRATION)
-				return not_found(pvmw);
-			if (pmd_page(pmde) != page)
-				return not_found(pvmw);
-			return true;
-		}
-		if (!pmd_present(pmde)) {
-			swp_entry_t entry;
+	{
+		pgd = pgd_offset(mm, pvmw->address);
+		if (!pgd_present(*pgd))
+			return false;
+		p4d = p4d_offset(pgd, pvmw->address);
+		if (!p4d_present(*p4d))
+			return false;
+		pud = pud_offset(p4d, pvmw->address);
+		if (!pud_present(*pud))
+			return false;
 
-			if (!thp_migration_supported() ||
-			    !(pvmw->flags & PVMW_MIGRATION))
-				return not_found(pvmw);
-			entry = pmd_to_swp_entry(pmde);
-			if (!is_migration_entry(entry) ||
-			    migration_entry_to_page(entry) != page)
-				return not_found(pvmw);
-			return true;
-		}
-		/* THP pmd was split under us: handle on pte level */
-		spin_unlock(pvmw->ptl);
-		pvmw->ptl = NULL;
-	} else if (!pmd_present(pmde)) {
+		pvmw->pmd = pmd_offset(pud, pvmw->address);
 		/*
-		 * If PVMW_SYNC, take and drop THP pmd lock so that we
-		 * cannot return prematurely, while zap_huge_pmd() has
-		 * cleared *pmd but not decremented compound_mapcount().
+		 * Make sure the pmd value isn't cached in a register by the
+		 * compiler and used as a stale value after we've observed a
+		 * subsequent update.
 		 */
-		if ((pvmw->flags & PVMW_SYNC) && PageTransCompound(page)) {
-			spinlock_t *ptl = pmd_lock(mm, pvmw->pmd);
+		pmde = READ_ONCE(*pvmw->pmd);
+
+		if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) {
+			pvmw->ptl = pmd_lock(mm, pvmw->pmd);
+			pmde = *pvmw->pmd;
+			if (likely(pmd_trans_huge(pmde))) {
+				if (pvmw->flags & PVMW_MIGRATION)
+					return not_found(pvmw);
+				if (pmd_page(pmde) != page)
+					return not_found(pvmw);
+				return true;
+			}
+			if (!pmd_present(pmde)) {
+				swp_entry_t entry;
 
-			spin_unlock(ptl);
+				if (!thp_migration_supported() ||
+				    !(pvmw->flags & PVMW_MIGRATION))
+					return not_found(pvmw);
+				entry = pmd_to_swp_entry(pmde);
+				if (!is_migration_entry(entry) ||
+				    migration_entry_to_page(entry) != page)
+					return not_found(pvmw);
+				return true;
+			}
+			/* THP pmd was split under us: handle on pte level */
+			spin_unlock(pvmw->ptl);
+			pvmw->ptl = NULL;
+		} else if (!pmd_present(pmde)) {
+			/*
+			 * If PVMW_SYNC, take and drop THP pmd lock so that we
+			 * cannot return prematurely, while zap_huge_pmd() has
+			 * cleared *pmd but not decremented compound_mapcount().
+			 */
+			if ((pvmw->flags & PVMW_SYNC) &&
+			    PageTransCompound(page)) {
+				spinlock_t *ptl = pmd_lock(mm, pvmw->pmd);
+
+				spin_unlock(ptl);
+			}
+			return false;
 		}
-		return false;
+		if (!map_pte(pvmw))
+			goto next_pte;
 	}
-	if (!map_pte(pvmw))
-		goto next_pte;
 	while (1) {
 		unsigned long end;
 
_


* [patch 07/24] mm: page_vma_mapped_walk(): use goto instead of while (1)
  2021-06-25  1:38 incoming Andrew Morton
                   ` (5 preceding siblings ...)
  2021-06-25  1:39 ` [patch 06/24] mm: page_vma_mapped_walk(): add a level of indentation Andrew Morton
@ 2021-06-25  1:39 ` Andrew Morton
  2021-06-25  1:39 ` [patch 08/24] mm: page_vma_mapped_walk(): get vma_address_end() earlier Andrew Morton
                   ` (16 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Andrew Morton @ 2021-06-25  1:39 UTC (permalink / raw)
  To: akpm, apopple, hughd, kirill.shutemov, linux-mm, mm-commits,
	peterx, rcampbell, shy828301, stable, torvalds, wangyugui, will,
	willy, ziy

From: Hugh Dickins <hughd@google.com>
Subject: mm: page_vma_mapped_walk(): use goto instead of while (1)

page_vma_mapped_walk() cleanup: add a label this_pte, matching next_pte,
and use "goto this_pte", in place of the "while (1)" loop at the end.

Link: https://lkml.kernel.org/r/a52b234a-851-3616-2525-f42736e8934@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_vma_mapped.c |    7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

--- a/mm/page_vma_mapped.c~mm-page_vma_mapped_walk-use-goto-instead-of-while-1
+++ a/mm/page_vma_mapped.c
@@ -144,6 +144,7 @@ bool page_vma_mapped_walk(struct page_vm
 {
 	struct mm_struct *mm = pvmw->vma->vm_mm;
 	struct page *page = pvmw->page;
+	unsigned long end;
 	pgd_t *pgd;
 	p4d_t *p4d;
 	pud_t *pud;
@@ -233,10 +234,7 @@ restart:
 		}
 		if (!map_pte(pvmw))
 			goto next_pte;
-	}
-	while (1) {
-		unsigned long end;
-
+this_pte:
 		if (check_pte(pvmw))
 			return true;
 next_pte:
@@ -265,6 +263,7 @@ next_pte:
 			pvmw->ptl = pte_lockptr(mm, pvmw->pmd);
 			spin_lock(pvmw->ptl);
 		}
+		goto this_pte;
 	}
 }
 
_


* [patch 08/24] mm: page_vma_mapped_walk(): get vma_address_end() earlier
  2021-06-25  1:38 incoming Andrew Morton
                   ` (6 preceding siblings ...)
  2021-06-25  1:39 ` [patch 07/24] mm: page_vma_mapped_walk(): use goto instead of while (1) Andrew Morton
@ 2021-06-25  1:39 ` Andrew Morton
  2021-06-25  1:39 ` [patch 09/24] mm/thp: fix page_vma_mapped_walk() if THP mapped by ptes Andrew Morton
                   ` (15 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Andrew Morton @ 2021-06-25  1:39 UTC (permalink / raw)
  To: akpm, apopple, hughd, kirill.shutemov, linux-mm, mm-commits,
	peterx, rcampbell, shy828301, stable, torvalds, wangyugui, will,
	willy, ziy

From: Hugh Dickins <hughd@google.com>
Subject: mm: page_vma_mapped_walk(): get vma_address_end() earlier

page_vma_mapped_walk() cleanup: get THP's vma_address_end() at the start,
rather than later at next_pte.  It's a little unnecessary overhead on the
first call, but makes for a simpler loop in the following commit.

Link: https://lkml.kernel.org/r/4542b34d-862f-7cb4-bb22-e0df6ce830a2@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_vma_mapped.c |   13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

--- a/mm/page_vma_mapped.c~mm-page_vma_mapped_walk-get-vma_address_end-earlier
+++ a/mm/page_vma_mapped.c
@@ -171,6 +171,15 @@ bool page_vma_mapped_walk(struct page_vm
 		return true;
 	}
 
+	/*
+	 * Seek to next pte only makes sense for THP.
+	 * But more important than that optimization, is to filter out
+	 * any PageKsm page: whose page->index misleads vma_address()
+	 * and vma_address_end() to disaster.
+	 */
+	end = PageTransCompound(page) ?
+		vma_address_end(page, pvmw->vma) :
+		pvmw->address + PAGE_SIZE;
 	if (pvmw->pte)
 		goto next_pte;
 restart:
@@ -238,10 +247,6 @@ this_pte:
 		if (check_pte(pvmw))
 			return true;
 next_pte:
-		/* Seek to next pte only makes sense for THP */
-		if (!PageTransHuge(page))
-			return not_found(pvmw);
-		end = vma_address_end(page, pvmw->vma);
 		do {
 			pvmw->address += PAGE_SIZE;
 			if (pvmw->address >= end)
_


* [patch 09/24] mm/thp: fix page_vma_mapped_walk() if THP mapped by ptes
  2021-06-25  1:38 incoming Andrew Morton
                   ` (7 preceding siblings ...)
  2021-06-25  1:39 ` [patch 08/24] mm: page_vma_mapped_walk(): get vma_address_end() earlier Andrew Morton
@ 2021-06-25  1:39 ` Andrew Morton
  2021-06-25  1:39 ` [patch 10/24] mm/thp: another PVMW_SYNC fix in page_vma_mapped_walk() Andrew Morton
                   ` (14 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Andrew Morton @ 2021-06-25  1:39 UTC (permalink / raw)
  To: akpm, apopple, hughd, kirill.shutemov, linux-mm, mm-commits,
	peterx, rcampbell, shy828301, stable, torvalds, wangyugui, will,
	willy, ziy

From: Hugh Dickins <hughd@google.com>
Subject: mm/thp: fix page_vma_mapped_walk() if THP mapped by ptes

Running certain tests with a DEBUG_VM kernel would crash within hours, on
the total_mapcount BUG() in split_huge_page_to_list(), while trying to
free up some memory by punching a hole in a shmem huge page: split's
try_to_unmap() was unable to find all the mappings of the page (which, on
a !DEBUG_VM kernel, would then keep the huge page pinned in memory).

Crash dumps showed two tail pages of a shmem huge page remained mapped by
pte: ptes in a non-huge-aligned vma of a gVisor process, at the end of a
long unmapped range; and no page table had yet been allocated for the head
of the huge page to be mapped into.

Although designed to handle these odd misaligned huge-page-mapped-by-pte
cases, page_vma_mapped_walk() falls short by returning false prematurely
when !pmd_present or !pud_present or !p4d_present or !pgd_present: there
are cases when a huge page may span the boundary, with ptes present in the
next page table.

Restructure page_vma_mapped_walk() as a loop to continue in these cases,
while keeping its layout much as before.  Add a step_forward() helper to
advance pvmw->address across those boundaries: originally I tried to use
mm's standard p?d_addr_end() macros, but hit the same crash 512 times less
often: because of the way redundant levels are folded together, but folded
differently in different configurations, it was just too difficult to use
them correctly; and step_forward() is simpler anyway.

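A userspace model of step_forward() (illustrative only, mirroring the helper
added by the diff below; the kernel version updates pvmw->address in place)
shows how the address is rounded up to the next boundary of the given size,
saturating to ULONG_MAX instead of wrapping to zero:

#include <assert.h>
#include <limits.h>

static unsigned long step_forward(unsigned long address, unsigned long size)
{
        address = (address + size) & ~(size - 1);
        return address ? address : ULONG_MAX;
}

int main(void)
{
        /* A mid-PMD address advances to the next 2MB boundary. */
        assert(step_forward(0x201000UL, 0x200000UL) == 0x400000UL);
        /* An address already on a boundary steps a full PMD_SIZE. */
        assert(step_forward(0x200000UL, 0x200000UL) == 0x400000UL);
        /* Wrapping past the top of the address space saturates. */
        assert(step_forward(ULONG_MAX & ~0x1fffffUL, 0x200000UL) == ULONG_MAX);
        return 0;
}
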
Link: https://lkml.kernel.org/r/fedb8632-1798-de42-f39e-873551d5bc81@google.com
Fixes: ace71a19cec5 ("mm: introduce page_vma_mapped_walk()")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_vma_mapped.c |   34 +++++++++++++++++++++++++---------
 1 file changed, 25 insertions(+), 9 deletions(-)

--- a/mm/page_vma_mapped.c~mm-thp-fix-page_vma_mapped_walk-if-thp-mapped-by-ptes
+++ a/mm/page_vma_mapped.c
@@ -116,6 +116,13 @@ static bool check_pte(struct page_vma_ma
 	return pfn_is_match(pvmw->page, pfn);
 }
 
+static void step_forward(struct page_vma_mapped_walk *pvmw, unsigned long size)
+{
+	pvmw->address = (pvmw->address + size) & ~(size - 1);
+	if (!pvmw->address)
+		pvmw->address = ULONG_MAX;
+}
+
 /**
  * page_vma_mapped_walk - check if @pvmw->page is mapped in @pvmw->vma at
  * @pvmw->address
@@ -183,16 +190,22 @@ bool page_vma_mapped_walk(struct page_vm
 	if (pvmw->pte)
 		goto next_pte;
 restart:
-	{
+	do {
 		pgd = pgd_offset(mm, pvmw->address);
-		if (!pgd_present(*pgd))
-			return false;
+		if (!pgd_present(*pgd)) {
+			step_forward(pvmw, PGDIR_SIZE);
+			continue;
+		}
 		p4d = p4d_offset(pgd, pvmw->address);
-		if (!p4d_present(*p4d))
-			return false;
+		if (!p4d_present(*p4d)) {
+			step_forward(pvmw, P4D_SIZE);
+			continue;
+		}
 		pud = pud_offset(p4d, pvmw->address);
-		if (!pud_present(*pud))
-			return false;
+		if (!pud_present(*pud)) {
+			step_forward(pvmw, PUD_SIZE);
+			continue;
+		}
 
 		pvmw->pmd = pmd_offset(pud, pvmw->address);
 		/*
@@ -239,7 +252,8 @@ restart:
 
 				spin_unlock(ptl);
 			}
-			return false;
+			step_forward(pvmw, PMD_SIZE);
+			continue;
 		}
 		if (!map_pte(pvmw))
 			goto next_pte;
@@ -269,7 +283,9 @@ next_pte:
 			spin_lock(pvmw->ptl);
 		}
 		goto this_pte;
-	}
+	} while (pvmw->address < end);
+
+	return false;
 }
 
 /**
_


* [patch 10/24] mm/thp: another PVMW_SYNC fix in page_vma_mapped_walk()
  2021-06-25  1:38 incoming Andrew Morton
                   ` (8 preceding siblings ...)
  2021-06-25  1:39 ` [patch 09/24] mm/thp: fix page_vma_mapped_walk() if THP mapped by ptes Andrew Morton
@ 2021-06-25  1:39 ` Andrew Morton
  2021-06-25  1:39 ` [patch 11/24] nilfs2: fix memory leak in nilfs_sysfs_delete_device_group Andrew Morton
                   ` (13 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Andrew Morton @ 2021-06-25  1:39 UTC (permalink / raw)
  To: akpm, apopple, hughd, kirill.shutemov, linux-mm, mm-commits,
	peterx, rcampbell, shy828301, stable, torvalds, wangyugui, will,
	willy, ziy

From: Hugh Dickins <hughd@google.com>
Subject: mm/thp: another PVMW_SYNC fix in page_vma_mapped_walk()

Aha!  Shouldn't that quick scan over pte_none()s make sure that it holds
ptlock in the PVMW_SYNC case?  That too might have been responsible for
BUGs or WARNs in split_huge_page_to_list() or its unmap_page(), though
I've never seen any.

Link: https://lkml.kernel.org/r/1bdf384c-8137-a149-2a1e-475a4791c3c@google.com
Link: https://lore.kernel.org/linux-mm/20210412180659.B9E3.409509F4@e16-tech.com/
Fixes: ace71a19cec5 ("mm: introduce page_vma_mapped_walk()")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Wang Yugui <wangyugui@e16-tech.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_vma_mapped.c |    4 ++++
 1 file changed, 4 insertions(+)

--- a/mm/page_vma_mapped.c~mm-thp-another-pvmw_sync-fix-in-page_vma_mapped_walk
+++ a/mm/page_vma_mapped.c
@@ -276,6 +276,10 @@ next_pte:
 				goto restart;
 			}
 			pvmw->pte++;
+			if ((pvmw->flags & PVMW_SYNC) && !pvmw->ptl) {
+				pvmw->ptl = pte_lockptr(mm, pvmw->pmd);
+				spin_lock(pvmw->ptl);
+			}
 		} while (pte_none(*pvmw->pte));
 
 		if (!pvmw->ptl) {
_


* [patch 11/24] nilfs2: fix memory leak in nilfs_sysfs_delete_device_group
  2021-06-25  1:38 incoming Andrew Morton
                   ` (9 preceding siblings ...)
  2021-06-25  1:39 ` [patch 10/24] mm/thp: another PVMW_SYNC fix in page_vma_mapped_walk() Andrew Morton
@ 2021-06-25  1:39 ` Andrew Morton
  2021-06-25  1:39 ` [patch 12/24] mm/vmalloc: add vmalloc_no_huge Andrew Morton
                   ` (12 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Andrew Morton @ 2021-06-25  1:39 UTC (permalink / raw)
  To: akpm, konishi.ryusuke, linux-mm, mlsemon35, mm-commits,
	paskripkin, torvalds

From: Pavel Skripkin <paskripkin@gmail.com>
Subject: nilfs2: fix memory leak in nilfs_sysfs_delete_device_group

My local syzbot instance hit a memory leak in nilfs2.  The problem was a
missing kobject_put() in nilfs_sysfs_delete_device_group().

kobject_del() does not call kobject_cleanup() for the passed kobject, so
the duped kobject name is leaked if kobject_put() is never called.

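In other words, the usual teardown pattern pairs kobject_del() with
kobject_put().  A minimal generic sketch of that pairing (not nilfs2 code;
example_ktype and the "example" name are placeholders, and a real ktype
needs a release() callback):

#include <linux/kobject.h>

static struct kobj_type example_ktype;  /* placeholder; needs a release() */
static struct kobject example_kobj;

static int example_register(struct kobject *parent)
{
        /* kobject_init_and_add() dupes the name string "example". */
        return kobject_init_and_add(&example_kobj, &example_ktype,
                                    parent, "example");
}

static void example_unregister(void)
{
        kobject_del(&example_kobj);     /* removes it from sysfs */
        kobject_put(&example_kobj);     /* drops the reference, which runs
                                         * kobject_cleanup() and frees the
                                         * duped name
                                         */
}
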
Fail log:

BUG: memory leak
unreferenced object 0xffff8880596171e0 (size 8):
  comm "syz-executor379", pid 8381, jiffies 4294980258 (age 21.100s)
  hex dump (first 8 bytes):
    6c 6f 6f 70 30 00 00 00                          loop0...
  backtrace:
    [<ffffffff81a2c656>] kstrdup+0x36/0x70 mm/util.c:60
    [<ffffffff81a2c6e3>] kstrdup_const+0x53/0x80 mm/util.c:83
    [<ffffffff83ccc698>] kvasprintf_const+0x108/0x190 lib/kasprintf.c:48
    [<ffffffff83e99926>] kobject_set_name_vargs+0x56/0x150 lib/kobject.c:289
    [<ffffffff83e99fd9>] kobject_add_varg lib/kobject.c:384 [inline]
    [<ffffffff83e99fd9>] kobject_init_and_add+0xc9/0x160 lib/kobject.c:473
    [<ffffffff8304e840>] nilfs_sysfs_create_device_group+0x150/0x800 fs/nilfs2/sysfs.c:999
    [<ffffffff830159f6>] init_nilfs+0xe26/0x12b0 fs/nilfs2/the_nilfs.c:637

Link: https://lkml.kernel.org/r/20210612140559.20022-1-paskripkin@gmail.com
Fixes: da7141fb78db ("nilfs2: add /sys/fs/nilfs2/<device> group")
Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: Michael L. Semon <mlsemon35@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/nilfs2/sysfs.c |    1 +
 1 file changed, 1 insertion(+)

--- a/fs/nilfs2/sysfs.c~nilfs2-fix-memory-leak-in-nilfs_sysfs_delete_device_group
+++ a/fs/nilfs2/sysfs.c
@@ -1053,6 +1053,7 @@ void nilfs_sysfs_delete_device_group(str
 	nilfs_sysfs_delete_superblock_group(nilfs);
 	nilfs_sysfs_delete_segctor_group(nilfs);
 	kobject_del(&nilfs->ns_dev_kobj);
+	kobject_put(&nilfs->ns_dev_kobj);
 	kfree(nilfs->ns_dev_subgroups);
 }
 
_


* [patch 12/24] mm/vmalloc: add vmalloc_no_huge
  2021-06-25  1:38 incoming Andrew Morton
                   ` (10 preceding siblings ...)
  2021-06-25  1:39 ` [patch 11/24] nilfs2: fix memory leak in nilfs_sysfs_delete_device_group Andrew Morton
@ 2021-06-25  1:39 ` Andrew Morton
  2021-06-25  1:39 ` [patch 13/24] KVM: s390: prepare for hugepage vmalloc Andrew Morton
                   ` (11 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Andrew Morton @ 2021-06-25  1:39 UTC (permalink / raw)
  To: akpm, catalin.marinas, cohuck, david, hch, imbrenda, linux-mm,
	mingo, mm-commits, npiggin, rientjes, tglx, torvalds, urezki

From: Claudio Imbrenda <imbrenda@linux.ibm.com>
Subject: mm/vmalloc: add vmalloc_no_huge

Patch series "mm: add vmalloc_no_huge and use it", v4.

Add vmalloc_no_huge() and export it, so modules can allocate memory with
small pages.

Use the newly added vmalloc_no_huge() in KVM on s390 to get around a
hardware limitation.


This patch (of 2):

Commit 121e6f3258fe3 ("mm/vmalloc: hugepage vmalloc mappings") added
support for hugepage vmalloc mappings.  It also added the flag
VM_NO_HUGE_VMAP for __vmalloc_node_range to request that the allocation be
performed with 0-order non-huge pages.  That flag is not accessible when
calling vmalloc; the only option is to call __vmalloc_node_range directly,
which is not exported.

This means that a module can't vmalloc memory with small pages.

Case in point: KVM on s390x needs to vmalloc a large area, and it needs to
be mapped with non-huge pages, because of a hardware limitation.

This patch adds the function vmalloc_no_huge, which works like vmalloc,
but it is guaranteed to always back the mapping using small pages.  This
new function is exported, therefore it is usable by modules.

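A minimal usage sketch (hypothetical module code, not part of this patch;
the real user is the KVM/s390 conversion in the next patch):

#include <linux/errno.h>
#include <linux/string.h>
#include <linux/vmalloc.h>

static void *example_buf;

static int example_alloc(unsigned long size)
{
        /*
         * Like vmalloc(), but guaranteed to be backed by 0-order pages.
         * Unlike vzalloc(), it does not zero the memory.
         */
        example_buf = vmalloc_no_huge(size);
        if (!example_buf)
                return -ENOMEM;
        memset(example_buf, 0, size);   /* zero it only if the caller needs that */
        return 0;
}

static void example_free(void)
{
        vfree(example_buf);
        example_buf = NULL;
}
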
[akpm@linux-foundation.org: whitespace fixes, per Christoph]
Link: https://lkml.kernel.org/r/20210614132357.10202-1-imbrenda@linux.ibm.com
Link: https://lkml.kernel.org/r/20210614132357.10202-2-imbrenda@linux.ibm.com
Fixes: 121e6f3258fe3 ("mm/vmalloc: hugepage vmalloc mappings") 
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Acked-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/vmalloc.h |    1 +
 mm/vmalloc.c            |   17 +++++++++++++++++
 2 files changed, 18 insertions(+)

--- a/include/linux/vmalloc.h~mm-vmalloc-add-vmalloc_no_huge
+++ a/include/linux/vmalloc.h
@@ -135,6 +135,7 @@ extern void *__vmalloc_node_range(unsign
 			const void *caller);
 void *__vmalloc_node(unsigned long size, unsigned long align, gfp_t gfp_mask,
 		int node, const void *caller);
+void *vmalloc_no_huge(unsigned long size);
 
 extern void vfree(const void *addr);
 extern void vfree_atomic(const void *addr);
--- a/mm/vmalloc.c~mm-vmalloc-add-vmalloc_no_huge
+++ a/mm/vmalloc.c
@@ -2999,6 +2999,23 @@ void *vmalloc(unsigned long size)
 EXPORT_SYMBOL(vmalloc);
 
 /**
+ * vmalloc_no_huge - allocate virtually contiguous memory using small pages
+ * @size:    allocation size
+ *
+ * Allocate enough non-huge pages to cover @size from the page level
+ * allocator and map them into contiguous kernel virtual space.
+ *
+ * Return: pointer to the allocated memory or %NULL on error
+ */
+void *vmalloc_no_huge(unsigned long size)
+{
+	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
+				    GFP_KERNEL, PAGE_KERNEL, VM_NO_HUGE_VMAP,
+				    NUMA_NO_NODE, __builtin_return_address(0));
+}
+EXPORT_SYMBOL(vmalloc_no_huge);
+
+/**
  * vzalloc - allocate virtually contiguous memory with zero fill
  * @size:    allocation size
  *
_


* [patch 13/24] KVM: s390: prepare for hugepage vmalloc
  2021-06-25  1:38 incoming Andrew Morton
                   ` (11 preceding siblings ...)
  2021-06-25  1:39 ` [patch 12/24] mm/vmalloc: add vmalloc_no_huge Andrew Morton
@ 2021-06-25  1:39 ` Andrew Morton
  2021-06-25  1:39 ` [patch 14/24] mm/vmalloc: unbreak kasan vmalloc support Andrew Morton
                   ` (10 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Andrew Morton @ 2021-06-25  1:39 UTC (permalink / raw)
  To: akpm, borntraeger, catalin.marinas, david, frankja, hch,
	imbrenda, linux-mm, mingo, mm-commits, npiggin, rientjes, tglx,
	torvalds, urezki

From: Claudio Imbrenda <imbrenda@linux.ibm.com>
Subject: KVM: s390: prepare for hugepage vmalloc

The Create Secure Configuration Ultravisor Call does not support using
large pages for the virtual memory area.  This is a hardware limitation.

This patch replaces the vzalloc call with an almost equivalent call to the
newly introduced vmalloc_no_huge function, which guarantees that only
small pages will be used for the backing.

The new call will not clear the allocated memory, but that has never been
an actual requirement.

Link: https://lkml.kernel.org/r/20210614132357.10202-3-imbrenda@linux.ibm.com
Fixes: 121e6f3258fe3 ("mm/vmalloc: hugepage vmalloc mappings") 
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/s390/kvm/pv.c |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

--- a/arch/s390/kvm/pv.c~kvm-s390-prepare-for-hugepage-vmalloc
+++ a/arch/s390/kvm/pv.c
@@ -140,7 +140,12 @@ static int kvm_s390_pv_alloc_vm(struct k
 	/* Allocate variable storage */
 	vlen = ALIGN(virt * ((npages * PAGE_SIZE) / HPAGE_SIZE), PAGE_SIZE);
 	vlen += uv_info.guest_virt_base_stor_len;
-	kvm->arch.pv.stor_var = vzalloc(vlen);
+	/*
+	 * The Create Secure Configuration Ultravisor Call does not support
+	 * using large pages for the virtual memory area.
+	 * This is a hardware limitation.
+	 */
+	kvm->arch.pv.stor_var = vmalloc_no_huge(vlen);
 	if (!kvm->arch.pv.stor_var)
 		goto out_err;
 	return 0;
_


* [patch 14/24] mm/vmalloc: unbreak kasan vmalloc support
  2021-06-25  1:38 incoming Andrew Morton
                   ` (12 preceding siblings ...)
  2021-06-25  1:39 ` [patch 13/24] KVM: s390: prepare for hugepage vmalloc Andrew Morton
@ 2021-06-25  1:39 ` Andrew Morton
  2021-06-25  1:39 ` [patch 15/24] kthread_worker: split code for canceling the delayed work timer Andrew Morton
                   ` (9 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Andrew Morton @ 2021-06-25  1:39 UTC (permalink / raw)
  To: akpm, andreyknvl, davidgow, dja, dvyukov, linux-mm, mm-commits,
	npiggin, torvalds, urezki

From: Daniel Axtens <dja@axtens.net>
Subject: mm/vmalloc: unbreak kasan vmalloc support

In commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings"),
__vmalloc_node_range was changed such that __get_vm_area_node was no
longer called with the requested/real size of the vmalloc allocation, but
rather with a rounded-up size.

This means that __get_vm_area_node called kasan_unpoison_vmalloc() with a
rounded-up size rather than the real size.  This led to it allowing access
to too much memory and so missing vmalloc OOBs and failing the kasan kunit
tests.

Pass the real size and the desired shift into __get_vm_area_node.  This
allows it to round up the size for the underlying allocators while still
unpoisoning the correct quantity of shadow memory.

Adjust the other call-sites to pass in PAGE_SHIFT for the shift value.

Link: https://lkml.kernel.org/r/20210617081330.98629-1-dja@axtens.net
Link: https://bugzilla.kernel.org/show_bug.cgi?id=213335
Fixes: 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings")
Signed-off-by: Daniel Axtens <dja@axtens.net>
Tested-by: David Gow <davidgow@google.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Tested-by: Andrey Konovalov <andreyknvl@gmail.com>
Acked-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/vmalloc.c |   24 ++++++++++++++----------
 1 file changed, 14 insertions(+), 10 deletions(-)

--- a/mm/vmalloc.c~mm-vmalloc-unbreak-kasan-vmalloc-support
+++ a/mm/vmalloc.c
@@ -2344,15 +2344,16 @@ static void clear_vm_uninitialized_flag(
 }
 
 static struct vm_struct *__get_vm_area_node(unsigned long size,
-		unsigned long align, unsigned long flags, unsigned long start,
-		unsigned long end, int node, gfp_t gfp_mask, const void *caller)
+		unsigned long align, unsigned long shift, unsigned long flags,
+		unsigned long start, unsigned long end, int node,
+		gfp_t gfp_mask, const void *caller)
 {
 	struct vmap_area *va;
 	struct vm_struct *area;
 	unsigned long requested_size = size;
 
 	BUG_ON(in_interrupt());
-	size = PAGE_ALIGN(size);
+	size = ALIGN(size, 1ul << shift);
 	if (unlikely(!size))
 		return NULL;
 
@@ -2384,8 +2385,8 @@ struct vm_struct *__get_vm_area_caller(u
 				       unsigned long start, unsigned long end,
 				       const void *caller)
 {
-	return __get_vm_area_node(size, 1, flags, start, end, NUMA_NO_NODE,
-				  GFP_KERNEL, caller);
+	return __get_vm_area_node(size, 1, PAGE_SHIFT, flags, start, end,
+				  NUMA_NO_NODE, GFP_KERNEL, caller);
 }
 
 /**
@@ -2401,7 +2402,8 @@ struct vm_struct *__get_vm_area_caller(u
  */
 struct vm_struct *get_vm_area(unsigned long size, unsigned long flags)
 {
-	return __get_vm_area_node(size, 1, flags, VMALLOC_START, VMALLOC_END,
+	return __get_vm_area_node(size, 1, PAGE_SHIFT, flags,
+				  VMALLOC_START, VMALLOC_END,
 				  NUMA_NO_NODE, GFP_KERNEL,
 				  __builtin_return_address(0));
 }
@@ -2409,7 +2411,8 @@ struct vm_struct *get_vm_area(unsigned l
 struct vm_struct *get_vm_area_caller(unsigned long size, unsigned long flags,
 				const void *caller)
 {
-	return __get_vm_area_node(size, 1, flags, VMALLOC_START, VMALLOC_END,
+	return __get_vm_area_node(size, 1, PAGE_SHIFT, flags,
+				  VMALLOC_START, VMALLOC_END,
 				  NUMA_NO_NODE, GFP_KERNEL, caller);
 }
 
@@ -2902,9 +2905,9 @@ void *__vmalloc_node_range(unsigned long
 	}
 
 again:
-	size = PAGE_ALIGN(size);
-	area = __get_vm_area_node(size, align, VM_ALLOC | VM_UNINITIALIZED |
-				vm_flags, start, end, node, gfp_mask, caller);
+	area = __get_vm_area_node(real_size, align, shift, VM_ALLOC |
+				  VM_UNINITIALIZED | vm_flags, start, end, node,
+				  gfp_mask, caller);
 	if (!area) {
 		warn_alloc(gfp_mask, NULL,
 			   "vmalloc size %lu allocation failure: "
@@ -2923,6 +2926,7 @@ again:
 	 */
 	clear_vm_uninitialized_flag(area);
 
+	size = PAGE_ALIGN(size);
 	kmemleak_vmalloc(area, size, gfp_mask);
 
 	return addr;
_


* [patch 15/24] kthread_worker: split code for canceling the delayed work timer
  2021-06-25  1:38 incoming Andrew Morton
                   ` (13 preceding siblings ...)
  2021-06-25  1:39 ` [patch 14/24] mm/vmalloc: unbreak kasan vmalloc support Andrew Morton
@ 2021-06-25  1:39 ` Andrew Morton
  2021-06-25  1:39 ` [patch 16/24] kthread: prevent deadlock when kthread_mod_delayed_work() races with kthread_cancel_delayed_work_sync() Andrew Morton
                   ` (8 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Andrew Morton @ 2021-06-25  1:39 UTC (permalink / raw)
  To: akpm, jenhaochen, linux-mm, liumartin, minchan, mm-commits,
	nathan, ndesaulniers, oleg, pmladek, stable, tj, torvalds

From: Petr Mladek <pmladek@suse.com>
Subject: kthread_worker: split code for canceling the delayed work timer

Patch series "kthread_worker: Fix race between kthread_mod_delayed_work()
and kthread_cancel_delayed_work_sync()".

This patchset fixes the race between kthread_mod_delayed_work() and
kthread_cancel_delayed_work_sync() including proper return value handling.


This patch (of 2):

Simple code refactoring as a preparation step for fixing a race between
kthread_mod_delayed_work() and kthread_cancel_delayed_work_sync().

It does not modify the existing behavior.

Link: https://lkml.kernel.org/r/20210610133051.15337-2-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Cc: <jenhaochen@google.com>
Cc: Martin Liu <liumartin@google.com>
Cc: Minchan Kim <minchan@google.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 kernel/kthread.c |   46 ++++++++++++++++++++++++++++-----------------
 1 file changed, 29 insertions(+), 17 deletions(-)

--- a/kernel/kthread.c~kthread_worker-split-code-for-canceling-the-delayed-work-timer
+++ a/kernel/kthread.c
@@ -1093,6 +1093,33 @@ void kthread_flush_work(struct kthread_w
 EXPORT_SYMBOL_GPL(kthread_flush_work);
 
 /*
+ * Make sure that the timer is neither set nor running and could
+ * not manipulate the work list_head any longer.
+ *
+ * The function is called under worker->lock. The lock is temporary
+ * released but the timer can't be set again in the meantime.
+ */
+static void kthread_cancel_delayed_work_timer(struct kthread_work *work,
+					      unsigned long *flags)
+{
+	struct kthread_delayed_work *dwork =
+		container_of(work, struct kthread_delayed_work, work);
+	struct kthread_worker *worker = work->worker;
+
+	/*
+	 * del_timer_sync() must be called to make sure that the timer
+	 * callback is not running. The lock must be temporary released
+	 * to avoid a deadlock with the callback. In the meantime,
+	 * any queuing is blocked by setting the canceling counter.
+	 */
+	work->canceling++;
+	raw_spin_unlock_irqrestore(&worker->lock, *flags);
+	del_timer_sync(&dwork->timer);
+	raw_spin_lock_irqsave(&worker->lock, *flags);
+	work->canceling--;
+}
+
+/*
  * This function removes the work from the worker queue. Also it makes sure
  * that it won't get queued later via the delayed work's timer.
  *
@@ -1106,23 +1133,8 @@ static bool __kthread_cancel_work(struct
 				  unsigned long *flags)
 {
 	/* Try to cancel the timer if exists. */
-	if (is_dwork) {
-		struct kthread_delayed_work *dwork =
-			container_of(work, struct kthread_delayed_work, work);
-		struct kthread_worker *worker = work->worker;
-
-		/*
-		 * del_timer_sync() must be called to make sure that the timer
-		 * callback is not running. The lock must be temporary released
-		 * to avoid a deadlock with the callback. In the meantime,
-		 * any queuing is blocked by setting the canceling counter.
-		 */
-		work->canceling++;
-		raw_spin_unlock_irqrestore(&worker->lock, *flags);
-		del_timer_sync(&dwork->timer);
-		raw_spin_lock_irqsave(&worker->lock, *flags);
-		work->canceling--;
-	}
+	if (is_dwork)
+		kthread_cancel_delayed_work_timer(work, flags);
 
 	/*
 	 * Try to remove the work from a worker list. It might either
_


* [patch 16/24] kthread: prevent deadlock when kthread_mod_delayed_work() races with kthread_cancel_delayed_work_sync()
  2021-06-25  1:38 incoming Andrew Morton
                   ` (14 preceding siblings ...)
  2021-06-25  1:39 ` [patch 15/24] kthread_worker: split code for canceling the delayed work timer Andrew Morton
@ 2021-06-25  1:39 ` Andrew Morton
  2021-06-25  1:39 ` [patch 17/24] mm, futex: fix shared futex pgoff on shmem huge page Andrew Morton
                   ` (7 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Andrew Morton @ 2021-06-25  1:39 UTC (permalink / raw)
  To: akpm, jenhaochen, linux-mm, liumartin, minchan, mm-commits,
	nathan, ndesaulniers, oleg, pmladek, stable, tj, torvalds

From: Petr Mladek <pmladek@suse.com>
Subject: kthread: prevent deadlock when kthread_mod_delayed_work() races with kthread_cancel_delayed_work_sync()

The system might hang with the following backtrace:

	schedule+0x80/0x100
	schedule_timeout+0x48/0x138
	wait_for_common+0xa4/0x134
	wait_for_completion+0x1c/0x2c
	kthread_flush_work+0x114/0x1cc
	kthread_cancel_work_sync.llvm.16514401384283632983+0xe8/0x144
	kthread_cancel_delayed_work_sync+0x18/0x2c
	xxxx_pm_notify+0xb0/0xd8
	blocking_notifier_call_chain_robust+0x80/0x194
	pm_notifier_call_chain_robust+0x28/0x4c
	suspend_prepare+0x40/0x260
	enter_state+0x80/0x3f4
	pm_suspend+0x60/0xdc
	state_store+0x108/0x144
	kobj_attr_store+0x38/0x88
	sysfs_kf_write+0x64/0xc0
	kernfs_fop_write_iter+0x108/0x1d0
	vfs_write+0x2f4/0x368
	ksys_write+0x7c/0xec

It is caused by the following race between kthread_mod_delayed_work() and
kthread_cancel_delayed_work_sync():

CPU0				CPU1

Context: Thread A		Context: Thread B

kthread_mod_delayed_work()
  spin_lock()
  __kthread_cancel_work()
     spin_unlock()
     del_timer_sync()
				kthread_cancel_delayed_work_sync()
				  spin_lock()
				  __kthread_cancel_work()
				    spin_unlock()
				    del_timer_sync()
				    spin_lock()

				  work->canceling++
				  spin_unlock
     spin_lock()
   queue_delayed_work()
     // dwork is put into the worker->delayed_work_list

   spin_unlock()

				  kthread_flush_work()
     // flush_work is put at the tail of the dwork

				    wait_for_completion()

Context: IRQ

  kthread_delayed_work_timer_fn()
    spin_lock()
    list_del_init(&work->node);
    spin_unlock()

BANG: flush_work is no longer linked and will never be processed.

The problem is that kthread_mod_delayed_work() checks the work->canceling
flag before canceling the timer.

A simple solution is to (re)check work->canceling after
__kthread_cancel_work().  But then it is not clear what should be returned
when __kthread_cancel_work() removed the work from the queue (list) and it
can't queue it again with the new @delay.

The return value might be used for reference counting.  The caller has to
know whether a new work has been queued or an existing one was replaced.

The proper solution is that kthread_mod_delayed_work() will remove the
work from the queue (list) _only_ when work->canceling is not set.  The
flag must be checked after the timer is stopped and the remaining
operations can be done under worker->lock.

Note that kthread_mod_delayed_work() could remove the timer and then bail
out.  It is fine.  The other canceling caller needs to cancel the timer as
well.  The important thing is that the queue (list) manipulation is done
atomically under worker->lock.
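
For illustration, a minimal sketch of the fixed ordering in
kthread_mod_delayed_work() (simplified from the hunk below; the fast_queue
path for not-yet-queued work and the WARN_ON_ONCE check are elided):

	raw_spin_lock_irqsave(&worker->lock, flags);

	/* Stop the timer first; worker->lock is released and re-taken inside. */
	kthread_cancel_delayed_work_timer(work, &flags);

	if (!work->canceling) {
		/* Queue (list) manipulation stays atomic under worker->lock. */
		ret = __kthread_cancel_work(work);
		__kthread_queue_delayed_work(worker, dwork, delay);
	}

	raw_spin_unlock_irqrestore(&worker->lock, flags);
	return ret;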

Link: https://lkml.kernel.org/r/20210610133051.15337-3-pmladek@suse.com
Fixes: 9a6b06c8d9a220860468a ("kthread: allow to modify delayed kthread work")
Signed-off-by: Petr Mladek <pmladek@suse.com>
Reported-by: Martin Liu <liumartin@google.com>
Cc: <jenhaochen@google.com>
Cc: Minchan Kim <minchan@google.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 kernel/kthread.c |   35 ++++++++++++++++++++++++-----------
 1 file changed, 24 insertions(+), 11 deletions(-)

--- a/kernel/kthread.c~kthread-prevent-deadlock-when-kthread_mod_delayed_work-races-with-kthread_cancel_delayed_work_sync
+++ a/kernel/kthread.c
@@ -1120,8 +1120,11 @@ static void kthread_cancel_delayed_work_
 }
 
 /*
- * This function removes the work from the worker queue. Also it makes sure
- * that it won't get queued later via the delayed work's timer.
+ * This function removes the work from the worker queue.
+ *
+ * It is called under worker->lock. The caller must make sure that
+ * the timer used by delayed work is not running, e.g. by calling
+ * kthread_cancel_delayed_work_timer().
  *
  * The work might still be in use when this function finishes. See the
  * current_work proceed by the worker.
@@ -1129,13 +1132,8 @@ static void kthread_cancel_delayed_work_
  * Return: %true if @work was pending and successfully canceled,
  *	%false if @work was not pending
  */
-static bool __kthread_cancel_work(struct kthread_work *work, bool is_dwork,
-				  unsigned long *flags)
+static bool __kthread_cancel_work(struct kthread_work *work)
 {
-	/* Try to cancel the timer if exists. */
-	if (is_dwork)
-		kthread_cancel_delayed_work_timer(work, flags);
-
 	/*
 	 * Try to remove the work from a worker list. It might either
 	 * be from worker->work_list or from worker->delayed_work_list.
@@ -1188,11 +1186,23 @@ bool kthread_mod_delayed_work(struct kth
 	/* Work must not be used with >1 worker, see kthread_queue_work() */
 	WARN_ON_ONCE(work->worker != worker);
 
-	/* Do not fight with another command that is canceling this work. */
+	/*
+	 * Temporary cancel the work but do not fight with another command
+	 * that is canceling the work as well.
+	 *
+	 * It is a bit tricky because of possible races with another
+	 * mod_delayed_work() and cancel_delayed_work() callers.
+	 *
+	 * The timer must be canceled first because worker->lock is released
+	 * when doing so. But the work can be removed from the queue (list)
+	 * only when it can be queued again so that the return value can
+	 * be used for reference counting.
+	 */
+	kthread_cancel_delayed_work_timer(work, &flags);
 	if (work->canceling)
 		goto out;
+	ret = __kthread_cancel_work(work);
 
-	ret = __kthread_cancel_work(work, true, &flags);
 fast_queue:
 	__kthread_queue_delayed_work(worker, dwork, delay);
 out:
@@ -1214,7 +1224,10 @@ static bool __kthread_cancel_work_sync(s
 	/* Work must not be used with >1 worker, see kthread_queue_work(). */
 	WARN_ON_ONCE(work->worker != worker);
 
-	ret = __kthread_cancel_work(work, is_dwork, &flags);
+	if (is_dwork)
+		kthread_cancel_delayed_work_timer(work, &flags);
+
+	ret = __kthread_cancel_work(work);
 
 	if (worker->current_work != work)
 		goto out_fast;
_

* [patch 17/24] mm, futex: fix shared futex pgoff on shmem huge page
  2021-06-25  1:38 incoming Andrew Morton
                   ` (15 preceding siblings ...)
  2021-06-25  1:39 ` [patch 16/24] kthread: prevent deadlock when kthread_mod_delayed_work() races with kthread_cancel_delayed_work_sync() Andrew Morton
@ 2021-06-25  1:39 ` Andrew Morton
  2021-06-25  1:39 ` [patch 18/24] mm/memory-failure: use a mutex to avoid memory_failure() races Andrew Morton
                   ` (6 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Andrew Morton @ 2021-06-25  1:39 UTC (permalink / raw)
  To: akpm, dave, dvhart, hughd, kirill.shutemov, linux-mm, mgorman,
	mike.kravetz, mingo, mm-commits, neelnatu, peterz, stable, tglx,
	torvalds, wetpzy, willy

From: Hugh Dickins <hughd@google.com>
Subject: mm, futex: fix shared futex pgoff on shmem huge page

If more than one futex is placed on a shmem huge page, it can happen that
waking the second wakes the first instead, and leaves the second waiting:
the key's shared.pgoff is wrong.

When 3.11 commit 13d60f4b6ab5 ("futex: Take hugepages into account when
generating futex_key") went in, the only shared huge pages came from
hugetlbfs, and the code added to deal with its exceptional page->index was
put into the hugetlb source.  Then that was missed when 4.8 added shmem
huge pages.

page_to_pgoff() is what others use for this nowadays: except that, as
currently written, it gives the right answer on a hugetlbfs head page but
nonsense on hugetlbfs tail pages.  Fix that by calling the
hugetlbfs-specific hugetlb_basepage_index() on PageHuge tails as well as
on heads.
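
For illustration, a worked example with hypothetical numbers, assuming 2MB
huge pages (512 base pages per huge page):

	/*
	 * A hugetlbfs head page at file offset 4MB has page->index == 2
	 * (hugetlbfs keeps ->index in huge-page-size units), so
	 * hugetlb_basepage_index(head) == 2 << 9 == 1024.  For a futex on,
	 * say, the 7th base page of that huge page,
	 * hugetlb_basepage_index(tail) == (2 << 9) + 7 == 1031.  The old
	 * page_to_pgoff() fell through to page_to_index() for tails and
	 * produced 2 + 7 == 9, hence the wrong shared.pgoff.
	 */
	key->shared.pgoff = page_to_pgoff(tail);	/* now correct for tails too */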

Yes, it's unconventional to declare hugetlb_basepage_index() there in
pagemap.h, rather than in hugetlb.h; but I do not expect anything but
page_to_pgoff() ever to need it.

[akpm@linux-foundation.org: give hugetlb_basepage_index() prototype the correct scope]
Link: https://lkml.kernel.org/r/b17d946b-d09-326e-b42a-52884c36df32@google.com
Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
Reported-by: Neel Natu <neelnatu@google.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Zhang Yi <wetpzy@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Darren Hart <dvhart@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/hugetlb.h |   16 ----------------
 include/linux/pagemap.h |   13 +++++++------
 kernel/futex.c          |    3 +--
 mm/hugetlb.c            |    5 +----
 4 files changed, 9 insertions(+), 28 deletions(-)

--- a/include/linux/hugetlb.h~mm-futex-fix-shared-futex-pgoff-on-shmem-huge-page
+++ a/include/linux/hugetlb.h
@@ -741,17 +741,6 @@ static inline int hstate_index(struct hs
 	return h - hstates;
 }
 
-pgoff_t __basepage_index(struct page *page);
-
-/* Return page->index in PAGE_SIZE units */
-static inline pgoff_t basepage_index(struct page *page)
-{
-	if (!PageCompound(page))
-		return page->index;
-
-	return __basepage_index(page);
-}
-
 extern int dissolve_free_huge_page(struct page *page);
 extern int dissolve_free_huge_pages(unsigned long start_pfn,
 				    unsigned long end_pfn);
@@ -988,11 +977,6 @@ static inline int hstate_index(struct hs
 	return 0;
 }
 
-static inline pgoff_t basepage_index(struct page *page)
-{
-	return page->index;
-}
-
 static inline int dissolve_free_huge_page(struct page *page)
 {
 	return 0;
--- a/include/linux/pagemap.h~mm-futex-fix-shared-futex-pgoff-on-shmem-huge-page
+++ a/include/linux/pagemap.h
@@ -516,7 +516,7 @@ static inline struct page *read_mapping_
 }
 
 /*
- * Get index of the page with in radix-tree
+ * Get index of the page within radix-tree (but not for hugetlb pages).
  * (TODO: remove once hugetlb pages will have ->index in PAGE_SIZE)
  */
 static inline pgoff_t page_to_index(struct page *page)
@@ -535,15 +535,16 @@ static inline pgoff_t page_to_index(stru
 	return pgoff;
 }
 
+extern pgoff_t hugetlb_basepage_index(struct page *page);
+
 /*
- * Get the offset in PAGE_SIZE.
- * (TODO: hugepage should have ->index in PAGE_SIZE)
+ * Get the offset in PAGE_SIZE (even for hugetlb pages).
+ * (TODO: hugetlb pages should have ->index in PAGE_SIZE)
  */
 static inline pgoff_t page_to_pgoff(struct page *page)
 {
-	if (unlikely(PageHeadHuge(page)))
-		return page->index << compound_order(page);
-
+	if (unlikely(PageHuge(page)))
+		return hugetlb_basepage_index(page);
 	return page_to_index(page);
 }
 
--- a/kernel/futex.c~mm-futex-fix-shared-futex-pgoff-on-shmem-huge-page
+++ a/kernel/futex.c
@@ -35,7 +35,6 @@
 #include <linux/jhash.h>
 #include <linux/pagemap.h>
 #include <linux/syscalls.h>
-#include <linux/hugetlb.h>
 #include <linux/freezer.h>
 #include <linux/memblock.h>
 #include <linux/fault-inject.h>
@@ -650,7 +649,7 @@ again:
 
 		key->both.offset |= FUT_OFF_INODE; /* inode-based key */
 		key->shared.i_seq = get_inode_sequence_number(inode);
-		key->shared.pgoff = basepage_index(tail);
+		key->shared.pgoff = page_to_pgoff(tail);
 		rcu_read_unlock();
 	}
 
--- a/mm/hugetlb.c~mm-futex-fix-shared-futex-pgoff-on-shmem-huge-page
+++ a/mm/hugetlb.c
@@ -1588,15 +1588,12 @@ struct address_space *hugetlb_page_mappi
 	return NULL;
 }
 
-pgoff_t __basepage_index(struct page *page)
+pgoff_t hugetlb_basepage_index(struct page *page)
 {
 	struct page *page_head = compound_head(page);
 	pgoff_t index = page_index(page_head);
 	unsigned long compound_idx;
 
-	if (!PageHuge(page_head))
-		return page_index(page);
-
 	if (compound_order(page_head) >= MAX_ORDER)
 		compound_idx = page_to_pfn(page) - page_to_pfn(page_head);
 	else
_

* [patch 18/24] mm/memory-failure: use a mutex to avoid memory_failure() races
  2021-06-25  1:38 incoming Andrew Morton
                   ` (16 preceding siblings ...)
  2021-06-25  1:39 ` [patch 17/24] mm, futex: fix shared futex pgoff on shmem huge page Andrew Morton
@ 2021-06-25  1:39 ` Andrew Morton
  2021-06-25  1:39 ` [patch 19/24] mm,hwpoison: return -EHWPOISON to denote that the page has already been poisoned Andrew Morton
                   ` (5 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Andrew Morton @ 2021-06-25  1:39 UTC (permalink / raw)
  To: akpm, bp, bp, david, juew, linux-mm, luto, mm-commits,
	naoya.horiguchi, osalvador, stable, tony.luck, torvalds, yaoaili

From: Tony Luck <tony.luck@intel.com>
Subject: mm/memory-failure: use a mutex to avoid memory_failure() races

Patch series "mm,hwpoison: fix sending SIGBUS for Action Required MCE", v5.

I wrote this patchset to implement what I think is the currently
acceptable solution from the previous discussion [1].  I simply borrowed
Tony's mutex patch and Aili's return code patch, then queued another one
to find the error virtual address in a best-effort manner.  I know that
this is not a perfect solution, but it should work for the typical cases.

[1]: https://lore.kernel.org/linux-mm/20210331192540.2141052f@alex-virtual-machine/


This patch (of 2):

There can be races when multiple CPUs consume poison from the same page.
The first CPU to enter memory_failure() atomically sets the HWPoison page
flag and begins hunting for tasks that map this page.  Eventually it
invalidates those mappings and may send a SIGBUS to the affected tasks.

But while all that work is going on, other CPUs see a "success" return
code from memory_failure() and so they believe the error has been handled
and continue executing.

Fix by wrapping most of the internal parts of memory_failure() in a mutex.
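
The shape of the fix, as a simplified sketch of the hunks below (the
actual body of memory_failure() is elided):

	int memory_failure(unsigned long pfn, int flags)
	{
		static DEFINE_MUTEX(mf_mutex);
		int res = 0;

		/* ... early checks that do not touch page state ... */

		mutex_lock(&mf_mutex);

		/*
		 * TestSetPageHWPoison(), unmapping and recovery all run with
		 * the mutex held, so concurrent callers for the same page
		 * serialize instead of seeing a premature "success"; every
		 * former early "return x" becomes "res = x; goto unlock_mutex"
		 * so the mutex is always dropped on the way out.
		 */

		mutex_unlock(&mf_mutex);
		return res;
	}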

[akpm@linux-foundation.org: make mf_mutex local to memory_failure()]
Link: https://lkml.kernel.org/r/20210521030156.2612074-1-nao.horiguchi@gmail.com
Link: https://lkml.kernel.org/r/20210521030156.2612074-2-nao.horiguchi@gmail.com
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Aili Yao <yaoaili@kingsoft.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jue Wang <juew@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory-failure.c |   36 +++++++++++++++++++++++-------------
 1 file changed, 23 insertions(+), 13 deletions(-)

--- a/mm/memory-failure.c~mm-memory-failure-use-a-mutex-to-avoid-memory_failure-races
+++ a/mm/memory-failure.c
@@ -1429,9 +1429,10 @@ int memory_failure(unsigned long pfn, in
 	struct page *hpage;
 	struct page *orig_head;
 	struct dev_pagemap *pgmap;
-	int res;
+	int res = 0;
 	unsigned long page_flags;
 	bool retry = true;
+	static DEFINE_MUTEX(mf_mutex);
 
 	if (!sysctl_memory_failure_recovery)
 		panic("Memory failure on page %lx", pfn);
@@ -1449,13 +1450,18 @@ int memory_failure(unsigned long pfn, in
 		return -ENXIO;
 	}
 
+	mutex_lock(&mf_mutex);
+
 try_again:
-	if (PageHuge(p))
-		return memory_failure_hugetlb(pfn, flags);
+	if (PageHuge(p)) {
+		res = memory_failure_hugetlb(pfn, flags);
+		goto unlock_mutex;
+	}
+
 	if (TestSetPageHWPoison(p)) {
 		pr_err("Memory failure: %#lx: already hardware poisoned\n",
 			pfn);
-		return 0;
+		goto unlock_mutex;
 	}
 
 	orig_head = hpage = compound_head(p);
@@ -1488,17 +1494,19 @@ try_again:
 				res = MF_FAILED;
 			}
 			action_result(pfn, MF_MSG_BUDDY, res);
-			return res == MF_RECOVERED ? 0 : -EBUSY;
+			res = res == MF_RECOVERED ? 0 : -EBUSY;
 		} else {
 			action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED);
-			return -EBUSY;
+			res = -EBUSY;
 		}
+		goto unlock_mutex;
 	}
 
 	if (PageTransHuge(hpage)) {
 		if (try_to_split_thp_page(p, "Memory Failure") < 0) {
 			action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
-			return -EBUSY;
+			res = -EBUSY;
+			goto unlock_mutex;
 		}
 		VM_BUG_ON_PAGE(!page_count(p), p);
 	}
@@ -1522,7 +1530,7 @@ try_again:
 	if (PageCompound(p) && compound_head(p) != orig_head) {
 		action_result(pfn, MF_MSG_DIFFERENT_COMPOUND, MF_IGNORED);
 		res = -EBUSY;
-		goto out;
+		goto unlock_page;
 	}
 
 	/*
@@ -1542,14 +1550,14 @@ try_again:
 		num_poisoned_pages_dec();
 		unlock_page(p);
 		put_page(p);
-		return 0;
+		goto unlock_mutex;
 	}
 	if (hwpoison_filter(p)) {
 		if (TestClearPageHWPoison(p))
 			num_poisoned_pages_dec();
 		unlock_page(p);
 		put_page(p);
-		return 0;
+		goto unlock_mutex;
 	}
 
 	/*
@@ -1573,7 +1581,7 @@ try_again:
 	if (!hwpoison_user_mappings(p, pfn, flags, &p)) {
 		action_result(pfn, MF_MSG_UNMAP_FAILED, MF_IGNORED);
 		res = -EBUSY;
-		goto out;
+		goto unlock_page;
 	}
 
 	/*
@@ -1582,13 +1590,15 @@ try_again:
 	if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
 		action_result(pfn, MF_MSG_TRUNCATED_LRU, MF_IGNORED);
 		res = -EBUSY;
-		goto out;
+		goto unlock_page;
 	}
 
 identify_page_state:
 	res = identify_page_state(pfn, p, page_flags);
-out:
+unlock_page:
 	unlock_page(p);
+unlock_mutex:
+	mutex_unlock(&mf_mutex);
 	return res;
 }
 EXPORT_SYMBOL_GPL(memory_failure);
_

* [patch 19/24] mm,hwpoison: return -EHWPOISON to denote that the page has already been poisoned
  2021-06-25  1:38 incoming Andrew Morton
                   ` (17 preceding siblings ...)
  2021-06-25  1:39 ` [patch 18/24] mm/memory-failure: use a mutex to avoid memory_failure() races Andrew Morton
@ 2021-06-25  1:39 ` Andrew Morton
  2021-06-25  1:40 ` [patch 20/24] mm/hwpoison: do not lock page again when me_huge_page() successfully recovers Andrew Morton
                   ` (4 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Andrew Morton @ 2021-06-25  1:39 UTC (permalink / raw)
  To: akpm, bp, bp, david, juew, linux-mm, luto, mm-commits,
	naoya.horiguchi, osalvador, stable, tony.luck, torvalds, yaoaili

From: Aili Yao <yaoaili@kingsoft.com>
Subject: mm,hwpoison: return -EHWPOISON to denote that the page has already been poisoned

When memory_failure() is called with MF_ACTION_REQUIRED on the page that
has already been hwpoisoned, memory_failure() could fail to send SIGBUS to
the affected process, which results in infinite loop of MCEs.

Currently memory_failure() returns 0 if it is called for an already
hwpoisoned page, so the caller, kill_me_maybe(), can return without
sending SIGBUS to the current process.  An action-required MCE is raised
when the current process accesses the broken memory, so the missing SIGBUS
means that the current process continues to run, accesses the error page
again soon, and falls into an MCE loop.

This issue can arise for example in the following scenarios:

- Two or more threads access the poisoned page concurrently.  If local
  MCE is enabled, the MCE handler handles the MCE events independently,
  so there is a race among the MCE events and the second or later
  threads fall into the situation in question.

- If there was a preceding memory error event and memory_failure() for
  that event failed to unmap the error page for some reason, the
  subsequent memory access to the error page triggers the MCE loop
  situation.

To fix the issue, make memory_failure() return an error code when the
error page has already been hwpoisoned.  This allows the memory error
handler to control how it sends signals to userspace.  Also make sure
that any process touching a hwpoisoned page gets a SIGBUS even in the
"already hwpoisoned" path of memory_failure(), as is done in the page
fault path.
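
A hypothetical caller-side sketch of what the new return value enables
(this is not part of the patch below; addr, pfn and flags stand for values
the MCE handler already has, and the signal call is only illustrative):

	ret = memory_failure(pfn, flags);
	if (!ret)
		return;		/* recovered, let the task retry the access */
	if (ret == -EHWPOISON) {
		/*
		 * The page was already poisoned: still send SIGBUS to the
		 * current task so it does not re-execute the access and
		 * loop on the same MCE.
		 */
		force_sig_mceerr(BUS_MCEERR_AR, (void __user *)addr, PAGE_SHIFT);
		return;
	}
	/* other errors: fall back to the existing failure handling */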

Link: https://lkml.kernel.org/r/20210521030156.2612074-3-nao.horiguchi@gmail.com
Signed-off-by: Aili Yao <yaoaili@kingsoft.com>
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jue Wang <juew@google.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory-failure.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/mm/memory-failure.c~mmhwpoison-return-ehwpoison-to-denote-that-the-page-has-already-been-poisoned
+++ a/mm/memory-failure.c
@@ -1253,7 +1253,7 @@ static int memory_failure_hugetlb(unsign
 	if (TestSetPageHWPoison(head)) {
 		pr_err("Memory failure: %#lx: already hardware poisoned\n",
 		       pfn);
-		return 0;
+		return -EHWPOISON;
 	}
 
 	num_poisoned_pages_inc();
@@ -1461,6 +1461,7 @@ try_again:
 	if (TestSetPageHWPoison(p)) {
 		pr_err("Memory failure: %#lx: already hardware poisoned\n",
 			pfn);
+		res = -EHWPOISON;
 		goto unlock_mutex;
 	}
 
_

* [patch 20/24] mm/hwpoison: do not lock page again when me_huge_page() successfully recovers
  2021-06-25  1:38 incoming Andrew Morton
                   ` (18 preceding siblings ...)
  2021-06-25  1:39 ` [patch 19/24] mm,hwpoison: return -EHWPOISON to denote that the page has already been poisoned Andrew Morton
@ 2021-06-25  1:40 ` Andrew Morton
  2021-06-25  1:40 ` [patch 21/24] mm/page_alloc: __alloc_pages_bulk(): do bounds check before accessing array Andrew Morton
                   ` (3 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Andrew Morton @ 2021-06-25  1:40 UTC (permalink / raw)
  To: akpm, aneesh.kumar, linux-mm, mhocko, mm-commits,
	naoya.horiguchi, osalvador, stable, tony.luck, torvalds

From: Naoya Horiguchi <naoya.horiguchi@nec.com>
Subject: mm/hwpoison: do not lock page again when me_huge_page() successfully recovers

Currently me_huge_page() temporarily unlocks the page to perform some
actions, then locks it again later.  My testcase (which calls hard-offline
on some tail page in a hugetlb page, then accesses an address in the
hugetlb range) showed that the page allocation code detects this page lock
on a buddy page and prints out a "BUG: Bad page state" message.

check_new_page_bad() does not consider a page with __PG_HWPOISON set a bad
page, so this flag works as a kind of filter, but the filtering does not
help in this case because the "bad page" is not the actual hwpoisoned
page.  So stop locking the page again.  The actions to be taken depend on
the page type of the error, so the page unlocking should be done in the
->action() callbacks.  Make that the assumed convention and change all
existing callbacks accordingly.
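
With that convention, each ->action() callback ends up with the following
shape (a minimal sketch; me_example() and do_recovery() are placeholders,
not functions in this patch):

	/* Called with the page locked; must unlock it on every return path. */
	static int me_example(struct page *p, unsigned long pfn)
	{
		int ret = do_recovery(p) ? MF_RECOVERED : MF_FAILED;

		unlock_page(p);
		return ret;
	}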

Link: https://lkml.kernel.org/r/20210609072029.74645-1-nao.horiguchi@gmail.com
Fixes: 78bb920344b8 ("mm: hwpoison: dissolve in-use hugepage in unrecoverable memory error")
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory-failure.c |   44 ++++++++++++++++++++++++++++--------------
 1 file changed, 30 insertions(+), 14 deletions(-)

--- a/mm/memory-failure.c~mm-hwpoison-do-not-lock-page-again-when-me_huge_page-successfully-recovers
+++ a/mm/memory-failure.c
@@ -658,6 +658,7 @@ static int truncate_error_page(struct pa
  */
 static int me_kernel(struct page *p, unsigned long pfn)
 {
+	unlock_page(p);
 	return MF_IGNORED;
 }
 
@@ -667,6 +668,7 @@ static int me_kernel(struct page *p, uns
 static int me_unknown(struct page *p, unsigned long pfn)
 {
 	pr_err("Memory failure: %#lx: Unknown page state\n", pfn);
+	unlock_page(p);
 	return MF_FAILED;
 }
 
@@ -675,6 +677,7 @@ static int me_unknown(struct page *p, un
  */
 static int me_pagecache_clean(struct page *p, unsigned long pfn)
 {
+	int ret;
 	struct address_space *mapping;
 
 	delete_from_lru_cache(p);
@@ -683,8 +686,10 @@ static int me_pagecache_clean(struct pag
 	 * For anonymous pages we're done the only reference left
 	 * should be the one m_f() holds.
 	 */
-	if (PageAnon(p))
-		return MF_RECOVERED;
+	if (PageAnon(p)) {
+		ret = MF_RECOVERED;
+		goto out;
+	}
 
 	/*
 	 * Now truncate the page in the page cache. This is really
@@ -698,7 +703,8 @@ static int me_pagecache_clean(struct pag
 		/*
 		 * Page has been teared down in the meanwhile
 		 */
-		return MF_FAILED;
+		ret = MF_FAILED;
+		goto out;
 	}
 
 	/*
@@ -706,7 +712,10 @@ static int me_pagecache_clean(struct pag
 	 *
 	 * Open: to take i_mutex or not for this? Right now we don't.
 	 */
-	return truncate_error_page(p, pfn, mapping);
+	ret = truncate_error_page(p, pfn, mapping);
+out:
+	unlock_page(p);
+	return ret;
 }
 
 /*
@@ -782,24 +791,26 @@ static int me_pagecache_dirty(struct pag
  */
 static int me_swapcache_dirty(struct page *p, unsigned long pfn)
 {
+	int ret;
+
 	ClearPageDirty(p);
 	/* Trigger EIO in shmem: */
 	ClearPageUptodate(p);
 
-	if (!delete_from_lru_cache(p))
-		return MF_DELAYED;
-	else
-		return MF_FAILED;
+	ret = delete_from_lru_cache(p) ? MF_FAILED : MF_DELAYED;
+	unlock_page(p);
+	return ret;
 }
 
 static int me_swapcache_clean(struct page *p, unsigned long pfn)
 {
+	int ret;
+
 	delete_from_swap_cache(p);
 
-	if (!delete_from_lru_cache(p))
-		return MF_RECOVERED;
-	else
-		return MF_FAILED;
+	ret = delete_from_lru_cache(p) ? MF_FAILED : MF_RECOVERED;
+	unlock_page(p);
+	return ret;
 }
 
 /*
@@ -820,6 +831,7 @@ static int me_huge_page(struct page *p,
 	mapping = page_mapping(hpage);
 	if (mapping) {
 		res = truncate_error_page(hpage, pfn, mapping);
+		unlock_page(hpage);
 	} else {
 		res = MF_FAILED;
 		unlock_page(hpage);
@@ -834,7 +846,6 @@ static int me_huge_page(struct page *p,
 			page_ref_inc(p);
 			res = MF_RECOVERED;
 		}
-		lock_page(hpage);
 	}
 
 	return res;
@@ -866,6 +877,8 @@ static struct page_state {
 	unsigned long mask;
 	unsigned long res;
 	enum mf_action_page_type type;
+
+	/* Callback ->action() has to unlock the relevant page inside it. */
 	int (*action)(struct page *p, unsigned long pfn);
 } error_states[] = {
 	{ reserved,	reserved,	MF_MSG_KERNEL,	me_kernel },
@@ -929,6 +942,7 @@ static int page_action(struct page_state
 	int result;
 	int count;
 
+	/* page p should be unlocked after returning from ps->action().  */
 	result = ps->action(p, pfn);
 
 	count = page_count(p) - 1;
@@ -1313,7 +1327,7 @@ static int memory_failure_hugetlb(unsign
 		goto out;
 	}
 
-	res = identify_page_state(pfn, p, page_flags);
+	return identify_page_state(pfn, p, page_flags);
 out:
 	unlock_page(head);
 	return res;
@@ -1596,6 +1610,8 @@ try_again:
 
 identify_page_state:
 	res = identify_page_state(pfn, p, page_flags);
+	mutex_unlock(&mf_mutex);
+	return res;
 unlock_page:
 	unlock_page(p);
 unlock_mutex:
_

* [patch 21/24] mm/page_alloc: __alloc_pages_bulk(): do bounds check before accessing array
  2021-06-25  1:38 incoming Andrew Morton
                   ` (19 preceding siblings ...)
  2021-06-25  1:40 ` [patch 20/24] mm/hwpoison: do not lock page again when me_huge_page() successfully recovers Andrew Morton
@ 2021-06-25  1:40 ` Andrew Morton
  2021-06-25  1:40 ` [patch 22/24] mm/page_alloc: do bulk array bounds check after checking populated elements Andrew Morton
                   ` (2 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Andrew Morton @ 2021-06-25  1:40 UTC (permalink / raw)
  To: akpm, linux-mm, linux, mgorman, mm-commits, torvalds

From: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Subject: mm/page_alloc: __alloc_pages_bulk(): do bounds check before accessing array

In the event that somebody calls this with an already fully populated
page_array, the last loop iteration would access beyond the end of
page_array.

It's of course extremely unlikely that this would ever be done, but it
triggers my internal static analyzer.  Also, if it really is not supposed
to be invoked this way (i.e., with no NULL entries in page_array), the
nr_populated < nr_pages check could simply be removed instead.
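
A minimal user-space illustration of why the operand order matters
(hypothetical example, not kernel code):

	#include <stdio.h>

	int main(void)
	{
		int page_array[4] = { 1, 2, 3, 4 };	/* "fully populated" */
		int nr_pages = 4, nr_populated = 0;

		/* Buggy order: page_array[4] is read before the bounds check. */
		/* while (page_array[nr_populated] && nr_populated < nr_pages) */

		/* Fixed order: && short-circuits, so the read never goes past the end. */
		while (nr_populated < nr_pages && page_array[nr_populated])
			nr_populated++;

		printf("populated entries: %d\n", nr_populated);
		return 0;
	}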

Link: https://lkml.kernel.org/r/20210507064504.1712559-1-linux@rasmusvillemoes.dk
Fixes: 0f87d9d30f21 ("mm/page_alloc: add an array-based interface to the bulk page allocator")
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_alloc.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/page_alloc.c~mm-page_alloc-__alloc_pages_bulk-do-bounds-check-before-accessing-array
+++ a/mm/page_alloc.c
@@ -5053,7 +5053,7 @@ unsigned long __alloc_pages_bulk(gfp_t g
 	 * Skip populated array elements to determine if any pages need
 	 * to be allocated before disabling IRQs.
 	 */
-	while (page_array && page_array[nr_populated] && nr_populated < nr_pages)
+	while (page_array && nr_populated < nr_pages && page_array[nr_populated])
 		nr_populated++;
 
 	/* Use the single page allocator for one page. */
_

* [patch 22/24] mm/page_alloc: do bulk array bounds check after checking populated elements
  2021-06-25  1:38 incoming Andrew Morton
                   ` (20 preceding siblings ...)
  2021-06-25  1:40 ` [patch 21/24] mm/page_alloc: __alloc_pages_bulk(): do bounds check before accessing array Andrew Morton
@ 2021-06-25  1:40 ` Andrew Morton
  2021-06-25  1:40 ` [patch 23/24] MAINTAINERS: fix Marek's identity again Andrew Morton
  2021-06-25  1:40 ` [patch 24/24] mailmap: add Marek's other e-mail address and identity without diacritics Andrew Morton
  23 siblings, 0 replies; 25+ messages in thread
From: Andrew Morton @ 2021-06-25  1:40 UTC (permalink / raw)
  To: akpm, brouer, dan.carpenter, linux-mm, mgorman, mgorman,
	mm-commits, torvalds, vbabka

From: Mel Gorman <mgorman@techsingularity.net>
Subject: mm/page_alloc: do bulk array bounds check after checking populated elements

Dan Carpenter reported the following:

  The patch 0f87d9d30f21: "mm/page_alloc: add an array-based interface
  to the bulk page allocator" from Apr 29, 2021, leads to the following
  static checker warning:

        mm/page_alloc.c:5338 __alloc_pages_bulk()
        warn: potentially one past the end of array 'page_array[nr_populated]'

The problem can occur if an array is passed in that is fully populated. 
That potentially ends up allocating a single page and storing it past the
end of the array.  This patch returns 0 if the array is fully populated.
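
A hypothetical caller sketch, assuming the alloc_pages_bulk_array()
wrapper provided by the bulk allocator (error handling omitted):

	struct page *pages[16] = { NULL };	/* in real use, some slots may already hold pages */
	unsigned long filled;

	/*
	 * Only NULL slots are filled; already-populated entries are skipped,
	 * and a completely full array now returns early instead of
	 * potentially storing one page past the end of the array.
	 */
	filled = alloc_pages_bulk_array(GFP_KERNEL, ARRAY_SIZE(pages), pages);
	if (filled < ARRAY_SIZE(pages)) {
		/* fall back to single-page allocations for the rest */
	}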

Link: https://lkml.kernel.org/r/20210618125102.GU30378@techsingularity.net
Fixes: 0f87d9d30f21 ("mm/page_alloc: add an array-based interface to the bulk page allocator")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_alloc.c |    4 ++++
 1 file changed, 4 insertions(+)

--- a/mm/page_alloc.c~mm-page_alloc-do-bulk-array-bounds-check-after-checking-populated-elements
+++ a/mm/page_alloc.c
@@ -5056,6 +5056,10 @@ unsigned long __alloc_pages_bulk(gfp_t g
 	while (page_array && nr_populated < nr_pages && page_array[nr_populated])
 		nr_populated++;
 
+	/* Already populated array? */
+	if (unlikely(page_array && nr_pages - nr_populated == 0))
+		return 0;
+
 	/* Use the single page allocator for one page. */
 	if (nr_pages - nr_populated == 1)
 		goto failed;
_

* [patch 23/24] MAINTAINERS: fix Marek's identity again
  2021-06-25  1:38 incoming Andrew Morton
                   ` (21 preceding siblings ...)
  2021-06-25  1:40 ` [patch 22/24] mm/page_alloc: do bulk array bounds check after checking populated elements Andrew Morton
@ 2021-06-25  1:40 ` Andrew Morton
  2021-06-25  1:40 ` [patch 24/24] mailmap: add Marek's other e-mail address and identity without diacritics Andrew Morton
  23 siblings, 0 replies; 25+ messages in thread
From: Andrew Morton @ 2021-06-25  1:40 UTC (permalink / raw)
  To: akpm, kabel, linux-mm, mm-commits, torvalds

From: Marek Behún <kabel@kernel.org>
Subject: MAINTAINERS: fix Marek's identity again

Fix my name to use diacritics, since MAINTAINERS supports it.

Fix my e-mail address in MAINTAINERS' marvell10g PHY driver description,
I accidentally put my other e-mail address here.

Link: https://lkml.kernel.org/r/20210616113624.19351-1-kabel@kernel.org
Signed-off-by: Marek Behún <kabel@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 MAINTAINERS |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/MAINTAINERS~maintainers-fix-my-identity-again
+++ a/MAINTAINERS
@@ -1816,7 +1816,7 @@ F:	drivers/pinctrl/pinctrl-gemini.c
 F:	drivers/rtc/rtc-ftrtc010.c
 
 ARM/CZ.NIC TURRIS SUPPORT
-M:	Marek Behun <kabel@kernel.org>
+M:	Marek Behún <kabel@kernel.org>
 S:	Maintained
 W:	https://www.turris.cz/
 F:	Documentation/ABI/testing/debugfs-moxtet
@@ -10946,7 +10946,7 @@ F:	include/linux/mv643xx.h
 
 MARVELL MV88X3310 PHY DRIVER
 M:	Russell King <linux@armlinux.org.uk>
-M:	Marek Behun <marek.behun@nic.cz>
+M:	Marek Behún <kabel@kernel.org>
 L:	netdev@vger.kernel.org
 S:	Maintained
 F:	drivers/net/phy/marvell10g.c
_

* [patch 24/24] mailmap: add Marek's other e-mail address and identity without diacritics
  2021-06-25  1:38 incoming Andrew Morton
                   ` (22 preceding siblings ...)
  2021-06-25  1:40 ` [patch 23/24] MAINTAINERS: fix Marek's identity again Andrew Morton
@ 2021-06-25  1:40 ` Andrew Morton
  23 siblings, 0 replies; 25+ messages in thread
From: Andrew Morton @ 2021-06-25  1:40 UTC (permalink / raw)
  To: akpm, kabel, linux-mm, mm-commits, torvalds

From: Marek Behún <kabel@kernel.org>
Subject: mailmap: add Marek's other e-mail address and identity without diacritics

Some of my commits were sent with identities
  Marek Behun <marek.behun@nic.cz>
  Marek Behún <marek.behun@nic.cz>
while the correct one is
  Marek Behún <kabel@kernel.org>

Put this into mailmap so that git shortlog prints all my commits under
one identity.

Link: https://lkml.kernel.org/r/20210616113624.19351-2-kabel@kernel.org
Signed-off-by: Marek Behún <kabel@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 .mailmap |    2 ++
 1 file changed, 2 insertions(+)

--- a/.mailmap~mailmap-add-my-other-e-mail-address-and-identity-without-diacritics
+++ a/.mailmap
@@ -212,6 +212,8 @@ Manivannan Sadhasivam <mani@kernel.org>
 Manivannan Sadhasivam <mani@kernel.org> <manivannan.sadhasivam@linaro.org>
 Marcin Nowakowski <marcin.nowakowski@mips.com> <marcin.nowakowski@imgtec.com>
 Marc Zyngier <maz@kernel.org> <marc.zyngier@arm.com>
+Marek Behún <kabel@kernel.org> <marek.behun@nic.cz>
+Marek Behún <kabel@kernel.org> Marek Behun <marek.behun@nic.cz>
 Mark Brown <broonie@sirena.org.uk>
 Mark Starovoytov <mstarovo@pm.me> <mstarovoitov@marvell.com>
 Mark Yao <markyao0591@gmail.com> <mark.yao@rock-chips.com>
_

end of thread

Thread overview: 25+ messages
2021-06-25  1:38 incoming Andrew Morton
2021-06-25  1:39 ` [patch 01/24] mm: page_vma_mapped_walk(): use page for pvmw->page Andrew Morton
2021-06-25  1:39 ` [patch 02/24] mm: page_vma_mapped_walk(): settle PageHuge on entry Andrew Morton
2021-06-25  1:39 ` [patch 03/24] mm: page_vma_mapped_walk(): use pmde for *pvmw->pmd Andrew Morton
2021-06-25  1:39 ` [patch 04/24] mm: page_vma_mapped_walk(): prettify PVMW_MIGRATION block Andrew Morton
2021-06-25  1:39 ` [patch 05/24] mm: page_vma_mapped_walk(): crossing page table boundary Andrew Morton
2021-06-25  1:39 ` [patch 06/24] mm: page_vma_mapped_walk(): add a level of indentation Andrew Morton
2021-06-25  1:39 ` [patch 07/24] mm: page_vma_mapped_walk(): use goto instead of while (1) Andrew Morton
2021-06-25  1:39 ` [patch 08/24] mm: page_vma_mapped_walk(): get vma_address_end() earlier Andrew Morton
2021-06-25  1:39 ` [patch 09/24] mm/thp: fix page_vma_mapped_walk() if THP mapped by ptes Andrew Morton
2021-06-25  1:39 ` [patch 10/24] mm/thp: another PVMW_SYNC fix in page_vma_mapped_walk() Andrew Morton
2021-06-25  1:39 ` [patch 11/24] nilfs2: fix memory leak in nilfs_sysfs_delete_device_group Andrew Morton
2021-06-25  1:39 ` [patch 12/24] mm/vmalloc: add vmalloc_no_huge Andrew Morton
2021-06-25  1:39 ` [patch 13/24] KVM: s390: prepare for hugepage vmalloc Andrew Morton
2021-06-25  1:39 ` [patch 14/24] mm/vmalloc: unbreak kasan vmalloc support Andrew Morton
2021-06-25  1:39 ` [patch 15/24] kthread_worker: split code for canceling the delayed work timer Andrew Morton
2021-06-25  1:39 ` [patch 16/24] kthread: prevent deadlock when kthread_mod_delayed_work() races with kthread_cancel_delayed_work_sync() Andrew Morton
2021-06-25  1:39 ` [patch 17/24] mm, futex: fix shared futex pgoff on shmem huge page Andrew Morton
2021-06-25  1:39 ` [patch 18/24] mm/memory-failure: use a mutex to avoid memory_failure() races Andrew Morton
2021-06-25  1:39 ` [patch 19/24] mm,hwpoison: return -EHWPOISON to denote that the page has already been poisoned Andrew Morton
2021-06-25  1:40 ` [patch 20/24] mm/hwpoison: do not lock page again when me_huge_page() successfully recovers Andrew Morton
2021-06-25  1:40 ` [patch 21/24] mm/page_alloc: __alloc_pages_bulk(): do bounds check before accessing array Andrew Morton
2021-06-25  1:40 ` [patch 22/24] mm/page_alloc: do bulk array bounds check after checking populated elements Andrew Morton
2021-06-25  1:40 ` [patch 23/24] MAINTAINERS: fix Marek's identity again Andrew Morton
2021-06-25  1:40 ` [patch 24/24] mailmap: add Marek's other e-mail address and identity without diacritics Andrew Morton
