* incoming
@ 2021-08-20  2:03 Andrew Morton
From: Andrew Morton @ 2021-08-20  2:03 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-mm, mm-commits

10 patches, based on 614cb2751d3150850d459bee596c397f344a7936.

Subsystems affected by this patch series:

  mm/shmem
  mm/pagealloc
  mm/tracing
  MAINTAINERS
  mm/memcg
  mm/memory-failure
  mm/vmscan
  mm/kfence
  mm/hugetlb

Subsystem: mm/shmem

    Yang Shi <shy828301@gmail.com>:
      Revert "mm/shmem: fix shmem_swapin() race with swapoff"
      Revert "mm: swap: check if swap backing device is congested or not"

Subsystem: mm/pagealloc

    Doug Berger <opendmb@gmail.com>:
      mm/page_alloc: don't corrupt pcppage_migratetype

Subsystem: mm/tracing

    Mike Rapoport <rppt@linux.ibm.com>:
      mmflags.h: add missing __GFP_ZEROTAGS and __GFP_SKIP_KASAN_POISON names

Subsystem: MAINTAINERS

    Nathan Chancellor <nathan@kernel.org>:
      MAINTAINERS: update ClangBuiltLinux IRC chat

Subsystem: mm/memcg

    Johannes Weiner <hannes@cmpxchg.org>:
      mm: memcontrol: fix occasional OOMs due to proportional memory.low reclaim

Subsystem: mm/memory-failure

    Naoya Horiguchi <naoya.horiguchi@nec.com>:
      mm/hwpoison: retry with shake_page() for unhandlable pages

Subsystem: mm/vmscan

    Johannes Weiner <hannes@cmpxchg.org>:
      mm: vmscan: fix missing psi annotation for node_reclaim()

Subsystem: mm/kfence

    Marco Elver <elver@google.com>:
      kfence: fix is_kfence_address() for addresses below KFENCE_POOL_SIZE

Subsystem: mm/hugetlb

    Mike Kravetz <mike.kravetz@oracle.com>:
      hugetlb: don't pass page cache pages to restore_reserve_on_error

 MAINTAINERS                    |    2 +-
 include/linux/kfence.h         |    7 ++++---
 include/linux/memcontrol.h     |   29 +++++++++++++++--------------
 include/trace/events/mmflags.h |    4 +++-
 mm/hugetlb.c                   |   19 ++++++++++++++-----
 mm/memory-failure.c            |   12 +++++++++---
 mm/page_alloc.c                |   25 ++++++++++++-------------
 mm/shmem.c                     |   14 +-------------
 mm/swap_state.c                |    7 -------
 mm/vmscan.c                    |   30 ++++++++++++++++++++++--------
 10 files changed, 81 insertions(+), 68 deletions(-)



* [patch 01/10] Revert "mm/shmem: fix shmem_swapin() race with swapoff"
From: Andrew Morton @ 2021-08-20  2:04 UTC (permalink / raw)
  To: akpm, david, hannes, hughd, iamjoonsoo.kim, linmiaohe, linux-mm,
	mhocko, minchan, mm-commits, shy828301, torvalds, willy,
	ying.huang

From: Yang Shi <shy828301@gmail.com>
Subject: Revert "mm/shmem: fix shmem_swapin() race with swapoff"

Since the block layer changed how it detects congestion, the
justification for commit 8fd2e0b505d1 ("mm: swap: check if swap backing
device is congested or not") no longer stands, and that commit can simply
be reverted.  Reverting it also resolves the race reported by commit
2efa33fc7f6e ("mm/shmem: fix shmem_swapin() race with swapoff"), so the
fix commit can be reverted as well.

That fix is also somewhat buggy, as discussed in [1] and [2].

[1] https://lore.kernel.org/linux-mm/24187e5e-069-9f3f-cefe-39ac70783753@google.com/
[2] https://lore.kernel.org/linux-mm/e82380b9-3ad4-4a52-be50-6d45c7f2b5da@google.com/
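
For reference, the idiom being removed is the swap device pinning pattern
(a minimal sketch of the reverted fix, not the literal shmem_swapin_page()
body; the real hunk follows):

	struct swap_info_struct *si;

	/* Pin the swap device so swapoff cannot release it under us. */
	si = get_swap_device(swap);
	if (!si)
		return -EINVAL;		/* raced with swapoff */

	/* ... look up / read in the page from swap ... */

	put_swap_device(si);

One of the oddities in the reverted fix is visible in the hunk below: its
error path sets "error = EINVAL" rather than the conventional "-EINVAL".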

Link: https://lkml.kernel.org/r/20210810202936.2672-2-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Suggested-by: Hugh Dickins <hughd@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/shmem.c |   14 +-------------
 1 file changed, 1 insertion(+), 13 deletions(-)

--- a/mm/shmem.c~revert-mm-shmem-fix-shmem_swapin-race-with-swapoff
+++ a/mm/shmem.c
@@ -1696,8 +1696,7 @@ static int shmem_swapin_page(struct inod
 	struct address_space *mapping = inode->i_mapping;
 	struct shmem_inode_info *info = SHMEM_I(inode);
 	struct mm_struct *charge_mm = vma ? vma->vm_mm : NULL;
-	struct swap_info_struct *si;
-	struct page *page = NULL;
+	struct page *page;
 	swp_entry_t swap;
 	int error;
 
@@ -1705,12 +1704,6 @@ static int shmem_swapin_page(struct inod
 	swap = radix_to_swp_entry(*pagep);
 	*pagep = NULL;
 
-	/* Prevent swapoff from happening to us. */
-	si = get_swap_device(swap);
-	if (!si) {
-		error = EINVAL;
-		goto failed;
-	}
 	/* Look it up and read it in.. */
 	page = lookup_swap_cache(swap, NULL, 0);
 	if (!page) {
@@ -1772,8 +1765,6 @@ static int shmem_swapin_page(struct inod
 	swap_free(swap);
 
 	*pagep = page;
-	if (si)
-		put_swap_device(si);
 	return 0;
 failed:
 	if (!shmem_confirm_swap(mapping, index, swap))
@@ -1784,9 +1775,6 @@ unlock:
 		put_page(page);
 	}
 
-	if (si)
-		put_swap_device(si);
-
 	return error;
 }
 
_


* [patch 02/10] Revert "mm: swap: check if swap backing device is congested or not"
From: Andrew Morton @ 2021-08-20  2:04 UTC (permalink / raw)
  To: akpm, david, hannes, hughd, iamjoonsoo.kim, linmiaohe, linux-mm,
	mhocko, minchan, mm-commits, shy828301, torvalds, willy,
	ying.huang

From: Yang Shi <shy828301@gmail.com>
Subject: Revert "mm: swap: check if swap backing device is congested or not"

Since the block layer changed how it detects congestion, the
justification for commit 8fd2e0b505d1 ("mm: swap: check if swap backing
device is congested or not") no longer stands, and the commit can simply
be reverted in order to resolve the race reported by commit 2efa33fc7f6e
("mm/shmem: fix shmem_swapin() race with swapoff").  The fix itself was
reverted by the previous patch.

Link: https://lkml.kernel.org/r/20210810202936.2672-3-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Suggested-by: Hugh Dickins <hughd@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/swap_state.c |    7 -------
 1 file changed, 7 deletions(-)

--- a/mm/swap_state.c~revert-mm-swap-check-if-swap-backing-device-is-congested-or-not
+++ a/mm/swap_state.c
@@ -628,13 +628,6 @@ struct page *swap_cluster_readahead(swp_
 	if (!mask)
 		goto skip;
 
-	/* Test swap type to make sure the dereference is safe */
-	if (likely(si->flags & (SWP_BLKDEV | SWP_FS_OPS))) {
-		struct inode *inode = si->swap_file->f_mapping->host;
-		if (inode_read_congested(inode))
-			goto skip;
-	}
-
 	do_poll = false;
 	/* Read a page_cluster sized and aligned cluster around offset. */
 	start_offset = offset & ~mask;
_


* [patch 03/10] mm/page_alloc: don't corrupt pcppage_migratetype
From: Andrew Morton @ 2021-08-20  2:04 UTC (permalink / raw)
  To: akpm, linux-mm, mgorman, mm-commits, opendmb, peterz, torvalds, vbabka

From: Doug Berger <opendmb@gmail.com>
Subject: mm/page_alloc: don't corrupt pcppage_migratetype

When placing pages on a pcp list, migratetype values over MIGRATE_PCPTYPES
get added to the MIGRATE_MOVABLE pcp list.

However, the actual migratetype is preserved in the page and should not be
changed to MIGRATE_MOVABLE, or the page may end up on the wrong free_list.

The impact is that HIGHATOMIC or CMA pages getting bulk freed from the PCP
lists could end up on the wrong buddy list.  There are various
consequences, but minimally NR_FREE_CMA_PAGES accounting could get screwed
up.
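
The corrected flow clamps only a local copy in the commit phase, leaving
the migratetype cached in the page itself untouched (sketch, assuming the
mm/page_alloc.c helpers behave as in the hunks below):

	migratetype = get_pcppage_migratetype(page);

	/* Clamp the local copy only; the page keeps its real migratetype. */
	if (unlikely(migratetype >= MIGRATE_PCPTYPES))
		migratetype = MIGRATE_MOVABLE;

	free_unref_page_commit(page, pfn, migratetype, 0);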

[mgorman@techsingularity.net: changelog update]
Link: https://lkml.kernel.org/r/20210811182917.2607994-1-opendmb@gmail.com
Fixes: df1acc856923 ("mm/page_alloc: avoid conflating IRQs disabled with zone->lock")
Signed-off-by: Doug Berger <opendmb@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_alloc.c |   25 ++++++++++++-------------
 1 file changed, 12 insertions(+), 13 deletions(-)

--- a/mm/page_alloc.c~mm-page_alloc-dont-corrupt-pcppage_migratetype
+++ a/mm/page_alloc.c
@@ -3453,19 +3453,10 @@ void free_unref_page_list(struct list_he
 		 * comment in free_unref_page.
 		 */
 		migratetype = get_pcppage_migratetype(page);
-		if (unlikely(migratetype >= MIGRATE_PCPTYPES)) {
-			if (unlikely(is_migrate_isolate(migratetype))) {
-				list_del(&page->lru);
-				free_one_page(page_zone(page), page, pfn, 0,
-							migratetype, FPI_NONE);
-				continue;
-			}
-
-			/*
-			 * Non-isolated types over MIGRATE_PCPTYPES get added
-			 * to the MIGRATE_MOVABLE pcp list.
-			 */
-			set_pcppage_migratetype(page, MIGRATE_MOVABLE);
+		if (unlikely(is_migrate_isolate(migratetype))) {
+			list_del(&page->lru);
+			free_one_page(page_zone(page), page, pfn, 0, migratetype, FPI_NONE);
+			continue;
 		}
 
 		set_page_private(page, pfn);
@@ -3475,7 +3466,15 @@ void free_unref_page_list(struct list_he
 	list_for_each_entry_safe(page, next, list, lru) {
 		pfn = page_private(page);
 		set_page_private(page, 0);
+
+		/*
+		 * Non-isolated types over MIGRATE_PCPTYPES get added
+		 * to the MIGRATE_MOVABLE pcp list.
+		 */
 		migratetype = get_pcppage_migratetype(page);
+		if (unlikely(migratetype >= MIGRATE_PCPTYPES))
+			migratetype = MIGRATE_MOVABLE;
+
 		trace_mm_page_free_batched(page);
 		free_unref_page_commit(page, pfn, migratetype, 0);
 
_


* [patch 04/10] mmflags.h: add missing __GFP_ZEROTAGS and __GFP_SKIP_KASAN_POISON names
From: Andrew Morton @ 2021-08-20  2:04 UTC (permalink / raw)
  To: akpm, linux-mm, mm-commits, pcc, rostedt, rppt, torvalds

From: Mike Rapoport <rppt@linux.ibm.com>
Subject: mmflags.h: add missing __GFP_ZEROTAGS and __GFP_SKIP_KASAN_POISON names

printk("%pGg") outputs these two flags as hexadecimal number, rather
than as a string, e.g:

	GFP_KERNEL|0x1800000

Fix this by adding missing names of __GFP_ZEROTAGS and
__GFP_SKIP_KASAN_POISON flags to __def_gfpflag_names.
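
For reference, %pGg consumes a pointer to the gfp mask, so once the names
are present in __def_gfpflag_names a debug print like the following
(illustrative only) decodes every flag symbolically:

	gfp_t gfp = GFP_KERNEL | __GFP_ZEROTAGS | __GFP_SKIP_KASAN_POISON;

	/* Before: "GFP_KERNEL|0x1800000"; after: all three flags by name. */
	printk("%pGg\n", &gfp);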

Link: https://lkml.kernel.org/r/20210816133502.590-1-rppt@kernel.org
Fixes: 013bb59dbb7c ("arm64: mte: handle tags zeroing at page allocation time")
Fixes: c275c5c6d50a ("kasan: disable freed user page poisoning with HW tags")
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/trace/events/mmflags.h |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

--- a/include/trace/events/mmflags.h~mmflagsh-add-missing-__gfp_zerotags-and-__gfp_skip_kasan_poison-names
+++ a/include/trace/events/mmflags.h
@@ -48,7 +48,9 @@
 	{(unsigned long)__GFP_WRITE,		"__GFP_WRITE"},		\
 	{(unsigned long)__GFP_RECLAIM,		"__GFP_RECLAIM"},	\
 	{(unsigned long)__GFP_DIRECT_RECLAIM,	"__GFP_DIRECT_RECLAIM"},\
-	{(unsigned long)__GFP_KSWAPD_RECLAIM,	"__GFP_KSWAPD_RECLAIM"}\
+	{(unsigned long)__GFP_KSWAPD_RECLAIM,	"__GFP_KSWAPD_RECLAIM"},\
+	{(unsigned long)__GFP_ZEROTAGS,		"__GFP_ZEROTAGS"},	\
+	{(unsigned long)__GFP_SKIP_KASAN_POISON,"__GFP_SKIP_KASAN_POISON"}\
 
 #define show_gfp_flags(flags)						\
 	(flags) ? __print_flags(flags, "|",				\
_


* [patch 05/10] MAINTAINERS: update ClangBuiltLinux IRC chat
From: Andrew Morton @ 2021-08-20  2:04 UTC (permalink / raw)
  To: akpm, keescook, linux-mm, mm-commits, nathan, ndesaulniers, torvalds

From: Nathan Chancellor <nathan@kernel.org>
Subject: MAINTAINERS: update ClangBuiltLinux IRC chat

Everyone has moved from Freenode to Libera, so update the channel entry
in MAINTAINERS.

Link: https://github.com/ClangBuiltLinux/linux/issues/1402
Link: https://lkml.kernel.org/r/20210818022339.3863058-1-nathan@kernel.org
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 MAINTAINERS |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/MAINTAINERS~maintainers-update-clangbuiltlinux-irc-chat
+++ a/MAINTAINERS
@@ -4498,7 +4498,7 @@ L:	clang-built-linux@googlegroups.com
 S:	Supported
 W:	https://clangbuiltlinux.github.io/
 B:	https://github.com/ClangBuiltLinux/linux/issues
-C:	irc://chat.freenode.net/clangbuiltlinux
+C:	irc://irc.libera.chat/clangbuiltlinux
 F:	Documentation/kbuild/llvm.rst
 F:	include/linux/compiler-clang.h
 F:	scripts/clang-tools/
_


* [patch 06/10] mm: memcontrol: fix occasional OOMs due to proportional memory.low reclaim
From: Andrew Morton @ 2021-08-20  2:04 UTC (permalink / raw)
  To: akpm, chris, guro, hannes, linux-mm, lnyng, mhocko, mm-commits,
	riel, shakeelb, stable, torvalds

From: Johannes Weiner <hannes@cmpxchg.org>
Subject: mm: memcontrol: fix occasional OOMs due to proportional memory.low reclaim

We've noticed occasional OOM killing when memory.low settings are in
effect for cgroups.  This is unexpected and undesirable as memory.low
is supposed to express non-OOMing memory priorities between cgroups.

The reason for this is proportional memory.low reclaim.  When cgroups are
below their memory.low threshold, reclaim passes them over in the first
round, and then retries if it couldn't find pages anywhere else.  But when
cgroups are slightly above their memory.low setting, page scan force is
scaled down in proportion to the overage, to the point where it can cause
reclaim to fail as well.  Only in that case we currently don't retry; we
trigger OOM instead.

To fix this, hook proportional reclaim into the same retry logic we
have in place for when cgroups are skipped entirely.  This way if
reclaim fails and some cgroups were scanned with diminished pressure,
we'll try another full-force cycle before giving up and OOMing.
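
The retry logic being hooked into looks roughly like this, simplified
from do_try_to_free_pages() (not a literal hunk from this patch):

	/*
	 * Nothing was reclaimed, but some cgroups were scanned with
	 * diminished pressure or skipped over due to memory.low:
	 * go back for one full-force cycle before OOMing.
	 */
	if (sc->memcg_low_skipped) {
		sc->priority = initial_priority;
		sc->memcg_low_reclaim = 1;
		sc->memcg_low_skipped = 0;
		goto retry;
	}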

[akpm@linux-foundation.org: coding-style fixes]
Link: https://lkml.kernel.org/r/20210817180506.220056-1-hannes@cmpxchg.org
Fixes: 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Leon Yang <lnyng@fb.com>
Reviewed-by: Rik van Riel <riel@surriel.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Chris Down <chris@chrisdown.name>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>		[5.4+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/memcontrol.h |   29 +++++++++++++++--------------
 mm/vmscan.c                |   27 +++++++++++++++++++--------
 2 files changed, 34 insertions(+), 22 deletions(-)

--- a/include/linux/memcontrol.h~mm-memcontrol-fix-occasional-ooms-due-to-proportional-memorylow-reclaim
+++ a/include/linux/memcontrol.h
@@ -612,12 +612,15 @@ static inline bool mem_cgroup_disabled(v
 	return !cgroup_subsys_enabled(memory_cgrp_subsys);
 }
 
-static inline unsigned long mem_cgroup_protection(struct mem_cgroup *root,
-						  struct mem_cgroup *memcg,
-						  bool in_low_reclaim)
+static inline void mem_cgroup_protection(struct mem_cgroup *root,
+					 struct mem_cgroup *memcg,
+					 unsigned long *min,
+					 unsigned long *low)
 {
+	*min = *low = 0;
+
 	if (mem_cgroup_disabled())
-		return 0;
+		return;
 
 	/*
 	 * There is no reclaim protection applied to a targeted reclaim.
@@ -653,13 +656,10 @@ static inline unsigned long mem_cgroup_p
 	 *
 	 */
 	if (root == memcg)
-		return 0;
-
-	if (in_low_reclaim)
-		return READ_ONCE(memcg->memory.emin);
+		return;
 
-	return max(READ_ONCE(memcg->memory.emin),
-		   READ_ONCE(memcg->memory.elow));
+	*min = READ_ONCE(memcg->memory.emin);
+	*low = READ_ONCE(memcg->memory.elow);
 }
 
 void mem_cgroup_calculate_protection(struct mem_cgroup *root,
@@ -1147,11 +1147,12 @@ static inline void memcg_memory_event_mm
 {
 }
 
-static inline unsigned long mem_cgroup_protection(struct mem_cgroup *root,
-						  struct mem_cgroup *memcg,
-						  bool in_low_reclaim)
+static inline void mem_cgroup_protection(struct mem_cgroup *root,
+					 struct mem_cgroup *memcg,
+					 unsigned long *min,
+					 unsigned long *low)
 {
-	return 0;
+	*min = *low = 0;
 }
 
 static inline void mem_cgroup_calculate_protection(struct mem_cgroup *root,
--- a/mm/vmscan.c~mm-memcontrol-fix-occasional-ooms-due-to-proportional-memorylow-reclaim
+++ a/mm/vmscan.c
@@ -100,9 +100,12 @@ struct scan_control {
 	unsigned int may_swap:1;
 
 	/*
-	 * Cgroups are not reclaimed below their configured memory.low,
-	 * unless we threaten to OOM. If any cgroups are skipped due to
-	 * memory.low and nothing was reclaimed, go back for memory.low.
+	 * Cgroup memory below memory.low is protected as long as we
+	 * don't threaten to OOM. If any cgroup is reclaimed at
+	 * reduced force or passed over entirely due to its memory.low
+	 * setting (memcg_low_skipped), and nothing is reclaimed as a
+	 * result, then go back for one more cycle that reclaims the protected
+	 * memory (memcg_low_reclaim) to avert OOM.
 	 */
 	unsigned int memcg_low_reclaim:1;
 	unsigned int memcg_low_skipped:1;
@@ -2537,15 +2540,14 @@ out:
 	for_each_evictable_lru(lru) {
 		int file = is_file_lru(lru);
 		unsigned long lruvec_size;
+		unsigned long low, min;
 		unsigned long scan;
-		unsigned long protection;
 
 		lruvec_size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
-		protection = mem_cgroup_protection(sc->target_mem_cgroup,
-						   memcg,
-						   sc->memcg_low_reclaim);
+		mem_cgroup_protection(sc->target_mem_cgroup, memcg,
+				      &min, &low);
 
-		if (protection) {
+		if (min || low) {
 			/*
 			 * Scale a cgroup's reclaim pressure by proportioning
 			 * its current usage to its memory.low or memory.min
@@ -2576,6 +2578,15 @@ out:
 			 * hard protection.
 			 */
 			unsigned long cgroup_size = mem_cgroup_size(memcg);
+			unsigned long protection;
+
+			/* memory.low scaling, make sure we retry before OOM */
+			if (!sc->memcg_low_reclaim && low > min) {
+				protection = low;
+				sc->memcg_low_skipped = 1;
+			} else {
+				protection = min;
+			}
 
 			/* Avoid TOCTOU with earlier protection check */
 			cgroup_size = max(cgroup_size, protection);
_


* [patch 07/10] mm/hwpoison: retry with shake_page() for unhandlable pages
From: Andrew Morton @ 2021-08-20  2:04 UTC (permalink / raw)
  To: akpm, linux-mm, mhocko, mike.kravetz, mm-commits,
	naoya.horiguchi, osalvador, shy828301, songmuchun, stable,
	tony.luck, torvalds

From: Naoya Horiguchi <naoya.horiguchi@nec.com>
Subject: mm/hwpoison: retry with shake_page() for unhandlable pages

HWPoisonHandlable() sometimes returns false for typical user pages due to
races with ordinary memory events like transfers over LRU lists.  This
causes failures in hwpoison handling.

There is retry code for such a case, but it does not work because the
retry loop reaches its limit too quickly, before the page settles into a
handlable state.  Fix it by letting get_any_page() call shake_page().
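
For reference, shake_page() nudges a transiently unhandlable page toward
a stable state, roughly like this (simplified from mm/memory-failure.c of
this era):

	void shake_page(struct page *p, int access)
	{
		if (PageHuge(p))
			return;

		if (!PageSlab(p)) {
			lru_add_drain_all();	/* flush per-CPU LRU caches */
			if (PageLRU(p) || is_free_buddy_page(p))
				return;		/* page has settled */
		}

		/* Last resort: shrink slab caches on the page's node. */
		if (access)
			drop_slab_node(page_to_nid(p));
	}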

[naoya.horiguchi@nec.com: get_any_page(): return -EIO when retry limit reached]
  Link: https://lkml.kernel.org/r/20210819001958.2365157-1-naoya.horiguchi@linux.dev
Link: https://lkml.kernel.org/r/20210817053703.2267588-1-naoya.horiguchi@linux.dev
Fixes: 25182f05ffed ("mm,hwpoison: fix race with hugetlb page allocation")
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reported-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>		[5.13+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory-failure.c |   12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

--- a/mm/memory-failure.c~mm-hwpoison-retry-with-shake_page-for-unhandlable-pages
+++ a/mm/memory-failure.c
@@ -1146,7 +1146,7 @@ static int __get_hwpoison_page(struct pa
 	 * unexpected races caused by taking a page refcount.
 	 */
 	if (!HWPoisonHandlable(head))
-		return 0;
+		return -EBUSY;
 
 	if (PageTransHuge(head)) {
 		/*
@@ -1199,9 +1199,15 @@ try_again:
 			}
 			goto out;
 		} else if (ret == -EBUSY) {
-			/* We raced with freeing huge page to buddy, retry. */
-			if (pass++ < 3)
+			/*
+			 * We raced with (possibly temporary) unhandlable
+			 * page, retry.
+			 */
+			if (pass++ < 3) {
+				shake_page(p, 1);
 				goto try_again;
+			}
+			ret = -EIO;
 			goto out;
 		}
 	}
_


* [patch 08/10] mm: vmscan: fix missing psi annotation for node_reclaim()
From: Andrew Morton @ 2021-08-20  2:04 UTC (permalink / raw)
  To: akpm, hannes, linux-mm, mm-commits, riel, shakeelb, torvalds

From: Johannes Weiner <hannes@cmpxchg.org>
Subject: mm: vmscan: fix missing psi annotation for node_reclaim()

In a debugging session the other day, Rik noticed that node_reclaim() was
missing memstall annotations.  This means we'll miss pressure and lost
productivity resulting from reclaim on an overloaded local NUMA node when
vm.zone_reclaim_mode is enabled.

There haven't been any reports, but that's likely because
vm.zone_reclaim_mode hasn't been a commonly used feature recently, and the
intersection between such setups and psi users is probably nil.  However,
secondary memory such as CXL-connected DIMMs and persistent memory, and
the page demotion patches that handle them
(https://lore.kernel.org/lkml/20210401183216.443C4443@viggo.jf.intel.com/),
could soon make this a more common code path again.
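
The fix follows the same enter/leave annotation pattern used by the other
direct reclaim paths (sketch; the actual hunk is below):

	unsigned long pflags;

	psi_memstall_enter(&pflags);	/* account this task as stalled */
	/* ... fs_reclaim_acquire(), shrink the node, cleanup ... */
	psi_memstall_leave(&pflags);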

Link: https://lkml.kernel.org/r/20210818152457.35846-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Rik van Riel <riel@surriel.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/vmscan.c |    3 +++
 1 file changed, 3 insertions(+)

--- a/mm/vmscan.c~mm-vmscan-fix-missing-psi-annotation-for-node_reclaim
+++ a/mm/vmscan.c
@@ -4424,11 +4424,13 @@ static int __node_reclaim(struct pglist_
 		.may_swap = 1,
 		.reclaim_idx = gfp_zone(gfp_mask),
 	};
+	unsigned long pflags;
 
 	trace_mm_vmscan_node_reclaim_begin(pgdat->node_id, order,
 					   sc.gfp_mask);
 
 	cond_resched();
+	psi_memstall_enter(&pflags);
 	fs_reclaim_acquire(sc.gfp_mask);
 	/*
 	 * We need to be able to allocate from the reserves for RECLAIM_UNMAP
@@ -4453,6 +4455,7 @@ static int __node_reclaim(struct pglist_
 	current->flags &= ~PF_SWAPWRITE;
 	memalloc_noreclaim_restore(noreclaim_flag);
 	fs_reclaim_release(sc.gfp_mask);
+	psi_memstall_leave(&pflags);
 
 	trace_mm_vmscan_node_reclaim_end(sc.nr_reclaimed);
 
_


* [patch 09/10] kfence: fix is_kfence_address() for addresses below KFENCE_POOL_SIZE
From: Andrew Morton @ 2021-08-20  2:04 UTC (permalink / raw)
  To: akpm, dvyukov, elver, glider, Kuan-Ying.Lee, linux-mm,
	mm-commits, stable, torvalds

From: Marco Elver <elver@google.com>
Subject: kfence: fix is_kfence_address() for addresses below KFENCE_POOL_SIZE

Originally the addr != NULL check was meant to take care of the case where
__kfence_pool == NULL (KFENCE is disabled).  However, this does not work
for addresses where addr > 0 && addr < KFENCE_POOL_SIZE.

This can be the case on NULL-deref where addr > 0 && addr < PAGE_SIZE or
any other faulting access with addr < KFENCE_POOL_SIZE.  While the kernel
would likely crash, the stack traces and report might be confusing due to
double faults upon KFENCE's attempt to unprotect such an address.

Fix it by just checking that __kfence_pool != NULL instead.
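
A worked example of the failure mode (addresses illustrative):

	/*
	 * KFENCE disabled: __kfence_pool == NULL.  A NULL-deref fault
	 * at addr == 0x100 hits the old check:
	 *
	 *   (unsigned long)((char *)0x100 - NULL) == 0x100
	 *   0x100 < KFENCE_POOL_SIZE  -> range check passes
	 *   addr != NULL              -> old check wrongly returns true
	 *
	 * Testing __kfence_pool != NULL instead correctly yields false
	 * whenever the pool was never allocated.
	 */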

Link: https://lkml.kernel.org/r/20210818130300.2482437-1-elver@google.com
Fixes: 0ce20dd84089 ("mm: add Kernel Electric-Fence infrastructure")
Signed-off-by: Marco Elver <elver@google.com>
Reported-by: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com>
Acked-by: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: <stable@vger.kernel.org>    [5.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/kfence.h |    7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

--- a/include/linux/kfence.h~kfence-fix-is_kfence_address-for-addresses-below-kfence_pool_size
+++ a/include/linux/kfence.h
@@ -51,10 +51,11 @@ extern atomic_t kfence_allocation_gate;
 static __always_inline bool is_kfence_address(const void *addr)
 {
 	/*
-	 * The non-NULL check is required in case the __kfence_pool pointer was
-	 * never initialized; keep it in the slow-path after the range-check.
+	 * The __kfence_pool != NULL check is required to deal with the case
+	 * where __kfence_pool == NULL && addr < KFENCE_POOL_SIZE. Keep it in
+	 * the slow-path after the range-check!
 	 */
-	return unlikely((unsigned long)((char *)addr - __kfence_pool) < KFENCE_POOL_SIZE && addr);
+	return unlikely((unsigned long)((char *)addr - __kfence_pool) < KFENCE_POOL_SIZE && __kfence_pool);
 }
 
 /**
_


* [patch 10/10] hugetlb: don't pass page cache pages to restore_reserve_on_error
From: Andrew Morton @ 2021-08-20  2:04 UTC (permalink / raw)
  To: akpm, almasrymina, axelrasmussen, linux-mm, mhocko, mike.kravetz,
	mm-commits, naoya.horiguchi, peterx, songmuchun, stable,
	syzbot+67654e51e54455f1c585, torvalds

From: Mike Kravetz <mike.kravetz@oracle.com>
Subject: hugetlb: don't pass page cache pages to restore_reserve_on_error

syzbot hit kernel BUG at fs/hugetlbfs/inode.c:532 as described in [1].
This BUG triggers if the HPageRestoreReserve flag is set on a page in
the page cache.  It should never be set, as the routine
huge_add_to_page_cache explicitly clears the flag after adding a page
to the cache.

The only code other than huge page allocation which sets the flag is
restore_reserve_on_error.  It will potentially set the flag in rare out
of memory conditions.  syzbot was injecting errors to cause memory
allocation errors which exercised this specific path.

The code in restore_reserve_on_error is doing the right thing.  However,
there are instances where pages in the page cache were being passed to
restore_reserve_on_error.  This is incorrect, as once a page goes into
the cache reservation information will not be modified for the page until
it is removed from the cache.  Error paths do not remove pages from the
cache, so even in the case of error, the page will remain in the cache
and no reservation adjustment is needed.

Modify routines that potentially call restore_reserve_on_error with a
page cache page to no longer do so.

Note on fixes tag:
Prior to commit 846be08578ed ("mm/hugetlb: expand restore_reserve_on_error
functionality") the routine would not process page cache pages because
the HPageRestoreReserve flag is not set on such pages.  Therefore, this
issue could not be triggered.  The code added by commit 846be08578ed
("mm/hugetlb: expand restore_reserve_on_error functionality") is needed
and correct.  It exposed incorrect calls to restore_reserve_on_error which
is the root cause addressed by this commit.

[1] https://lore.kernel.org/linux-mm/00000000000050776d05c9b7c7f0@google.com/
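
The rule the patch enforces, in shorthand (new_page and new_pagecache_page
are the flags used by the hunks below):

	/*
	 * Error path after alloc_huge_page(): a page that made it into
	 * the page cache keeps its reservation accounting until it is
	 * removed from the cache, so only newly allocated pages that
	 * never entered the cache get their reserve restored.
	 */
	if (new_page && !new_pagecache_page)
		restore_reserve_on_error(h, vma, haddr, page);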

Link: https://lkml.kernel.org/r/20210818213304.37038-1-mike.kravetz@oracle.com
Fixes: 846be08578ed ("mm/hugetlb: expand restore_reserve_on_error functionality")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: <syzbot+67654e51e54455f1c585@syzkaller.appspotmail.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/hugetlb.c |   19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

--- a/mm/hugetlb.c~hugetlb-dont-pass-page-cache-pages-to-restore_reserve_on_error
+++ a/mm/hugetlb.c
@@ -2476,7 +2476,7 @@ void restore_reserve_on_error(struct hst
 		if (!rc) {
 			/*
 			 * This indicates there is an entry in the reserve map
-			 * added by alloc_huge_page.  We know it was added
+			 * not added by alloc_huge_page.  We know it was added
 			 * before the alloc_huge_page call, otherwise
 			 * HPageRestoreReserve would be set on the page.
 			 * Remove the entry so that a subsequent allocation
@@ -4660,7 +4660,9 @@ retry_avoidcopy:
 	spin_unlock(ptl);
 	mmu_notifier_invalidate_range_end(&range);
 out_release_all:
-	restore_reserve_on_error(h, vma, haddr, new_page);
+	/* No restore in case of successful pagetable update (Break COW) */
+	if (new_page != old_page)
+		restore_reserve_on_error(h, vma, haddr, new_page);
 	put_page(new_page);
 out_release_old:
 	put_page(old_page);
@@ -4776,7 +4778,7 @@ static vm_fault_t hugetlb_no_page(struct
 	pte_t new_pte;
 	spinlock_t *ptl;
 	unsigned long haddr = address & huge_page_mask(h);
-	bool new_page = false;
+	bool new_page, new_pagecache_page = false;
 
 	/*
 	 * Currently, we are forced to kill the process in the event the
@@ -4799,6 +4801,7 @@ static vm_fault_t hugetlb_no_page(struct
 		goto out;
 
 retry:
+	new_page = false;
 	page = find_lock_page(mapping, idx);
 	if (!page) {
 		/* Check for page in userfault range */
@@ -4842,6 +4845,7 @@ retry:
 					goto retry;
 				goto out;
 			}
+			new_pagecache_page = true;
 		} else {
 			lock_page(page);
 			if (unlikely(anon_vma_prepare(vma))) {
@@ -4926,7 +4930,9 @@ backout:
 	spin_unlock(ptl);
 backout_unlocked:
 	unlock_page(page);
-	restore_reserve_on_error(h, vma, haddr, page);
+	/* restore reserve for newly allocated pages not in page cache */
+	if (new_page && !new_pagecache_page)
+		restore_reserve_on_error(h, vma, haddr, page);
 	put_page(page);
 	goto out;
 }
@@ -5135,6 +5141,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_s
 	int ret = -ENOMEM;
 	struct page *page;
 	int writable;
+	bool new_pagecache_page = false;
 
 	if (is_continue) {
 		ret = -EFAULT;
@@ -5228,6 +5235,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_s
 		ret = huge_add_to_page_cache(page, mapping, idx);
 		if (ret)
 			goto out_release_nounlock;
+		new_pagecache_page = true;
 	}
 
 	ptl = huge_pte_lockptr(h, dst_mm, dst_pte);
@@ -5291,7 +5299,8 @@ out_release_unlock:
 	if (vm_shared || is_continue)
 		unlock_page(page);
 out_release_nounlock:
-	restore_reserve_on_error(h, dst_vma, dst_addr, page);
+	if (!new_pagecache_page)
+		restore_reserve_on_error(h, dst_vma, dst_addr, page);
 	put_page(page);
 	goto out;
 }
_
