stable.vger.kernel.org archive mirror
* [patch 01/19] x86/setup: don't remove E820_TYPE_RAM for pfn 0
       [not found] <20210123210029.a7c704d0cab204683e41ad10@linux-foundation.org>
@ 2021-01-24  5:00 ` Andrew Morton
  2021-01-24  5:01 ` [patch 02/19] mm: fix initialization of struct page for holes in memory layout Andrew Morton
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 10+ messages in thread
From: Andrew Morton @ 2021-01-24  5:00 UTC (permalink / raw)
  To: akpm, bhe, bp, cai, david, hpa, linux-mm, mgorman, mhocko, mingo,
	mm-commits, rppt, stable, tglx, torvalds, vbabka

From: Mike Rapoport <rppt@linux.ibm.com>
Subject: x86/setup: don't remove E820_TYPE_RAM for pfn 0

Patch series "mm: fix initialization of struct page for holes in memory layout", v3.

Commit 73a6e474cb37 ("mm: memmap_init: iterate over
memblock regions rather that check each PFN") exposed several issues with
the memory map initialization and these patches fix those issues.

Initially there were crashes during compaction that Qian Cai reported back
in April [1]. It seemed back then that the problem was fixed, but a few
weeks ago Andrea Arcangeli hit the same bug [2] and there was an additional
discussion at [3].

[1] https://lore.kernel.org/lkml/8C537EB7-85EE-4DCF-943E-3CC0ED0DF56D@lca.pw
[2] https://lore.kernel.org/lkml/20201121194506.13464-1-aarcange@redhat.com
[3] https://lore.kernel.org/mm-commits/20201206005401.qKuAVgOXr%akpm@linux-foundation.org


This patch (of 2):

The first 4Kb of memory is a BIOS-owned area and, to avoid its allocation
for the kernel, it was not listed in the e820 tables as memory.  As a
result, pfn 0 was never recognised by the generic memory management and it
is part of neither node 0 nor ZONE_DMA.

If set_pfnblock_flags_mask() were ever called for the pageblock
corresponding to the first 2Mbytes of memory, having pfn 0 outside of
ZONE_DMA would trigger

	VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);

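For reference, zone_spans_pfn() boils down to a simple containment check;
the following is a sketch simplified from include/linux/mmzone.h, not the
verbatim helper:

	static inline bool zone_spans_pfn(const struct zone *zone,
					  unsigned long pfn)
	{
		/* pfn 0 outside ZONE_DMA fails exactly this range test */
		return zone->zone_start_pfn <= pfn &&
		       pfn < zone->zone_start_pfn + zone->spanned_pages;
	}
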
Along with reserving the first 4Kb in the e820 tables, the first several
pages are reserved with memblock in several places during setup_arch().
These reservations are enough to ensure the kernel does not touch the BIOS
area, so it is not necessary to remove E820_TYPE_RAM for pfn 0.

Remove the update of the e820 table that changes the type of pfn 0, and
move the comment describing why it was done to trim_low_memory_range(),
which reserves the beginning of memory.

Link: https://lkml.kernel.org/r/20210111194017.22696-2-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/x86/kernel/setup.c |   20 +++++++++-----------
 1 file changed, 9 insertions(+), 11 deletions(-)

--- a/arch/x86/kernel/setup.c~x86-setup-dont-remove-e820_type_ram-for-pfn-0
+++ a/arch/x86/kernel/setup.c
@@ -661,17 +661,6 @@ static void __init trim_platform_memory_
 static void __init trim_bios_range(void)
 {
 	/*
-	 * A special case is the first 4Kb of memory;
-	 * This is a BIOS owned area, not kernel ram, but generally
-	 * not listed as such in the E820 table.
-	 *
-	 * This typically reserves additional memory (64KiB by default)
-	 * since some BIOSes are known to corrupt low memory.  See the
-	 * Kconfig help text for X86_RESERVE_LOW.
-	 */
-	e820__range_update(0, PAGE_SIZE, E820_TYPE_RAM, E820_TYPE_RESERVED);
-
-	/*
 	 * special case: Some BIOSes report the PC BIOS
 	 * area (640Kb -> 1Mb) as RAM even though it is not.
 	 * take them out.
@@ -728,6 +717,15 @@ early_param("reservelow", parse_reservel
 
 static void __init trim_low_memory_range(void)
 {
+	/*
+	 * A special case is the first 4Kb of memory;
+	 * This is a BIOS owned area, not kernel ram, but generally
+	 * not listed as such in the E820 table.
+	 *
+	 * This typically reserves additional memory (64KiB by default)
+	 * since some BIOSes are known to corrupt low memory.  See the
+	 * Kconfig help text for X86_RESERVE_LOW.
+	 */
 	memblock_reserve(0, ALIGN(reserve_low, PAGE_SIZE));
 }
 	
_


* [patch 02/19] mm: fix initialization of struct page for holes in memory layout
       [not found] <20210123210029.a7c704d0cab204683e41ad10@linux-foundation.org>
  2021-01-24  5:00 ` [patch 01/19] x86/setup: don't remove E820_TYPE_RAM for pfn 0 Andrew Morton
@ 2021-01-24  5:01 ` Andrew Morton
  2021-01-24  5:01 ` [patch 03/19] mm: memcg/slab: optimize objcg stock draining Andrew Morton
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 10+ messages in thread
From: Andrew Morton @ 2021-01-24  5:01 UTC (permalink / raw)
  To: aarcange, akpm, bhe, bp, cai, david, hpa, linux-mm, mgorman,
	mhocko, mingo, mm-commits, rppt, stable, tglx, torvalds, vbabka

From: Mike Rapoport <rppt@linux.ibm.com>
Subject: mm: fix initialization of struct page for holes in memory layout

There could be struct pages that are not backed by actual physical
memory.  This can happen when the actual memory bank is not a multiple of
SECTION_SIZE, or when an architecture does not register memory holes
reserved by the firmware in memblock.memory.

Such pages are currently initialized by the init_unavailable_mem()
function, which iterates through the PFNs in holes in memblock.memory and,
if there is a struct page corresponding to a PFN, sets the fields of this
page to default values and marks the page as Reserved.

init_unavailable_mem() does not take into account the zone and node the
page belongs to, and sets both the zone and node links in struct page to
zero.

On a system that has firmware-reserved holes in a zone above ZONE_DMA, for
instance in the configuration below:

	# grep -A1 E820 /proc/iomem
	7a17b000-7a216fff : Unknown E820 type
	7a217000-7bffffff : System RAM

the unset zone link in struct page will trigger

	VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);

because the same pageblock then contains pages from both ZONE_DMA32 and
ZONE_DMA (the latter due to the unset zone link in struct page).
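
To make the arithmetic concrete (assuming the default 2MB pageblock on
x86, i.e. pageblock_nr_pages == 512 == 0x200 pfns), the layout above
gives:

	hole: 0x7a17b000-0x7a216fff -> pfns 0x7a17b-0x7a216 (zone link 0)
	RAM:  0x7a217000-0x7bffffff -> pfns 0x7a217-0x7bfff (ZONE_DMA32)

	pageblock [0x7a200, 0x7a400) therefore mixes hole pages whose zone
	link was left at 0 (ZONE_DMA) with System RAM pages in ZONE_DMA32,
	so the VM_BUG_ON_PAGE() above fires for the former.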

Update init_unavailable_mem() to use the zone constraints defined by the
architecture to properly set up the zone link, and use the node ID of the
adjacent range in memblock.memory to set the node link.

Link: https://lkml.kernel.org/r/20210111194017.22696-3-rppt@kernel.org
Fixes: 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather that check each PFN")
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_alloc.c |   84 +++++++++++++++++++++++++++-------------------
 1 file changed, 50 insertions(+), 34 deletions(-)

--- a/mm/page_alloc.c~mm-fix-initialization-of-struct-page-for-holes-in-memory-layout
+++ a/mm/page_alloc.c
@@ -7078,23 +7078,26 @@ void __init free_area_init_memoryless_no
  * Initialize all valid struct pages in the range [spfn, epfn) and mark them
  * PageReserved(). Return the number of struct pages that were initialized.
  */
-static u64 __init init_unavailable_range(unsigned long spfn, unsigned long epfn)
+static u64 __init init_unavailable_range(unsigned long spfn, unsigned long epfn,
+					 int zone, int nid)
 {
-	unsigned long pfn;
+	unsigned long pfn, zone_spfn, zone_epfn;
 	u64 pgcnt = 0;
 
+	zone_spfn = arch_zone_lowest_possible_pfn[zone];
+	zone_epfn = arch_zone_highest_possible_pfn[zone];
+
+	spfn = clamp(spfn, zone_spfn, zone_epfn);
+	epfn = clamp(epfn, zone_spfn, zone_epfn);
+
 	for (pfn = spfn; pfn < epfn; pfn++) {
 		if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) {
 			pfn = ALIGN_DOWN(pfn, pageblock_nr_pages)
 				+ pageblock_nr_pages - 1;
 			continue;
 		}
-		/*
-		 * Use a fake node/zone (0) for now. Some of these pages
-		 * (in memblock.reserved but not in memblock.memory) will
-		 * get re-initialized via reserve_bootmem_region() later.
-		 */
-		__init_single_page(pfn_to_page(pfn), pfn, 0, 0);
+
+		__init_single_page(pfn_to_page(pfn), pfn, zone, nid);
 		__SetPageReserved(pfn_to_page(pfn));
 		pgcnt++;
 	}
@@ -7103,51 +7106,64 @@ static u64 __init init_unavailable_range
 }
 
 /*
- * Only struct pages that are backed by physical memory are zeroed and
- * initialized by going through __init_single_page(). But, there are some
- * struct pages which are reserved in memblock allocator and their fields
- * may be accessed (for example page_to_pfn() on some configuration accesses
- * flags). We must explicitly initialize those struct pages.
+ * Only struct pages that correspond to ranges defined by memblock.memory
+ * are zeroed and initialized by going through __init_single_page() during
+ * memmap_init().
  *
- * This function also addresses a similar issue where struct pages are left
- * uninitialized because the physical address range is not covered by
- * memblock.memory or memblock.reserved. That could happen when memblock
- * layout is manually configured via memmap=, or when the highest physical
- * address (max_pfn) does not end on a section boundary.
+ * But, there could be struct pages that correspond to holes in
+ * memblock.memory. This can happen because of the following reasons:
+ * - physical memory bank size is not necessarily an exact multiple of the
+ *   arbitrary section size
+ * - early reserved memory may not be listed in memblock.memory
+ * - memory layouts defined with memmap= kernel parameter may not align
+ *   nicely with memmap sections
+ *
+ * Explicitly initialize those struct pages so that:
+ * - PG_reserved is set
+ * - zone link is set according to the architecture constraints
+ * - node is set to node id of the next populated region except for the
+ *   trailing hole where last node id is used
  */
-static void __init init_unavailable_mem(void)
+static void __init init_zone_unavailable_mem(int zone)
 {
-	phys_addr_t start, end;
-	u64 i, pgcnt;
-	phys_addr_t next = 0;
+	unsigned long start, end;
+	int i, nid;
+	u64 pgcnt;
+	unsigned long next = 0;
 
 	/*
-	 * Loop through unavailable ranges not covered by memblock.memory.
+	 * Loop through holes in memblock.memory and initialize struct
+	 * pages corresponding to these holes
 	 */
 	pgcnt = 0;
-	for_each_mem_range(i, &start, &end) {
+	for_each_mem_pfn_range(i, MAX_NUMNODES, &start, &end, &nid) {
 		if (next < start)
-			pgcnt += init_unavailable_range(PFN_DOWN(next),
-							PFN_UP(start));
+			pgcnt += init_unavailable_range(next, start, zone, nid);
 		next = end;
 	}
 
 	/*
-	 * Early sections always have a fully populated memmap for the whole
-	 * section - see pfn_valid(). If the last section has holes at the
-	 * end and that section is marked "online", the memmap will be
-	 * considered initialized. Make sure that memmap has a well defined
-	 * state.
+	 * Last section may surpass the actual end of memory (e.g. we can
+	 * have a 1Gb section and only 512Mb of RAM populated).
+	 * Make sure that memmap has a well defined state in this case.
 	 */
-	pgcnt += init_unavailable_range(PFN_DOWN(next),
-					round_up(max_pfn, PAGES_PER_SECTION));
+	end = round_up(max_pfn, PAGES_PER_SECTION);
+	pgcnt += init_unavailable_range(next, end, zone, nid);
 
 	/*
 	 * Struct pages that do not have backing memory. This could be because
 	 * firmware is using some of this memory, or for some other reasons.
 	 */
 	if (pgcnt)
-		pr_info("Zeroed struct page in unavailable ranges: %lld pages", pgcnt);
+		pr_info("Zone %s: zeroed struct page in unavailable ranges: %lld pages", zone_names[zone], pgcnt);
+}
+
+static void __init init_unavailable_mem(void)
+{
+	int zone;
+
+	for (zone = 0; zone < ZONE_MOVABLE; zone++)
+		init_zone_unavailable_mem(zone);
 }
 #else
 static inline void __init init_unavailable_mem(void)
_


* [patch 03/19] mm: memcg/slab: optimize objcg stock draining
       [not found] <20210123210029.a7c704d0cab204683e41ad10@linux-foundation.org>
  2021-01-24  5:00 ` [patch 01/19] x86/setup: don't remove E820_TYPE_RAM for pfn 0 Andrew Morton
  2021-01-24  5:01 ` [patch 02/19] mm: fix initialization of struct page for holes in memory layout Andrew Morton
@ 2021-01-24  5:01 ` Andrew Morton
  2021-01-24  5:01 ` [patch 04/19] mm: memcg: fix memcg file_dirty numa stat Andrew Morton
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 10+ messages in thread
From: Andrew Morton @ 2021-01-24  5:01 UTC (permalink / raw)
  To: akpm, guro, hannes, imran.f.khan, linux-mm, mkoutny, mm-commits,
	shakeelb, stable, torvalds

From: Roman Gushchin <guro@fb.com>
Subject: mm: memcg/slab: optimize objcg stock draining

Imran Khan reported a 16% regression in hackbench results caused by the
commit f2fe7b09a52b ("mm: memcg/slab: charge individual slab objects
instead of pages").  The regression is noticeable in the case of a
consequent allocation of several relatively large slab objects, e.g. 
skb's.  As soon as the amount of stocked bytes exceeds PAGE_SIZE,
drain_obj_stock() and __memcg_kmem_uncharge() are called, and it leads
to a number of atomic operations in page_counter_uncharge().

The corresponding call graph is below (provided by Imran Khan):
  |__alloc_skb
  |    |
  |    |__kmalloc_reserve.isra.61
  |    |    |
  |    |    |__kmalloc_node_track_caller
  |    |    |    |
  |    |    |    |slab_pre_alloc_hook.constprop.88
  |    |    |     obj_cgroup_charge
  |    |    |    |    |
  |    |    |    |    |__memcg_kmem_charge
  |    |    |    |    |    |
  |    |    |    |    |    |page_counter_try_charge
  |    |    |    |    |
  |    |    |    |    |refill_obj_stock
  |    |    |    |    |    |
  |    |    |    |    |    |drain_obj_stock.isra.68
  |    |    |    |    |    |    |
  |    |    |    |    |    |    |__memcg_kmem_uncharge
  |    |    |    |    |    |    |    |
  |    |    |    |    |    |    |    |page_counter_uncharge
  |    |    |    |    |    |    |    |    |
  |    |    |    |    |    |    |    |    |page_counter_cancel
  |    |    |    |
  |    |    |    |
  |    |    |    |__slab_alloc
  |    |    |    |    |
  |    |    |    |    |___slab_alloc
  |    |    |    |    |
  |    |    |    |slab_post_alloc_hook

Instead of directly uncharging the accounted kernel memory, it's possible
to refill the generic page-sized per-cpu stock.  This is a much faster
operation, especially on the default hierarchy.  As a bonus,
__memcg_kmem_uncharge_page() will also get faster, so the freeing of
page-sized kernel allocations (e.g. large kmallocs) will become faster.

A similar change has been done earlier for the socket memory by the commit
475d0487a2ad ("mm: memcontrol: use per-cpu stocks for socket memory
uncharging").

Link: https://lkml.kernel.org/r/20210106042239.2860107-1-guro@fb.com
Fixes: f2fe7b09a52b ("mm: memcg/slab: charge individual slab objects instead of pages")
Signed-off-by: Roman Gushchin <guro@fb.com>
Reported-by: Imran Khan <imran.f.khan@oracle.com>
Tested-by: Imran Khan <imran.f.khan@oracle.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memcontrol.c |    4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

--- a/mm/memcontrol.c~mm-memcg-slab-optimize-objcg-stock-draining
+++ a/mm/memcontrol.c
@@ -3115,9 +3115,7 @@ void __memcg_kmem_uncharge(struct mem_cg
 	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
 		page_counter_uncharge(&memcg->kmem, nr_pages);
 
-	page_counter_uncharge(&memcg->memory, nr_pages);
-	if (do_memsw_account())
-		page_counter_uncharge(&memcg->memsw, nr_pages);
+	refill_stock(memcg, nr_pages);
 }
 
 /**
_


* [patch 04/19] mm: memcg: fix memcg file_dirty numa stat
       [not found] <20210123210029.a7c704d0cab204683e41ad10@linux-foundation.org>
                   ` (2 preceding siblings ...)
  2021-01-24  5:01 ` [patch 03/19] mm: memcg/slab: optimize objcg stock draining Andrew Morton
@ 2021-01-24  5:01 ` Andrew Morton
  2021-01-24  5:01 ` [patch 05/19] mm: fix numa stats for thp migration Andrew Morton
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 10+ messages in thread
From: Andrew Morton @ 2021-01-24  5:01 UTC (permalink / raw)
  To: akpm, guro, hannes, linux-mm, mhocko, mm-commits, shakeelb,
	shy828301, songmuchun, stable, torvalds

From: Shakeel Butt <shakeelb@google.com>
Subject: mm: memcg: fix memcg file_dirty numa stat

The kernel updates the per-node NR_FILE_DIRTY stats on page migration but
not the memcg numa stats.  That was not an issue until the recent commit
5f9a4f4a7096 ("mm: memcontrol: add the missing numa_stat interface for
cgroup v2") exposed the numa stats for the memcg.  So fix the file_dirty
per-memcg numa stat.
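
Why swapping the helpers fixes it: __dec_node_state() only touches the
per-node vmstat counter, while the lruvec variant updates the memcg stats
as well.  Roughly (simplified from mm/memcontrol.c of that era):

	void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
				int val)
	{
		/* Update the per-node counter, as before */
		__mod_node_page_state(lruvec_pgdat(lruvec), idx, val);

		/* Additionally update the memcg / per-memcg-per-node stats */
		if (!mem_cgroup_disabled())
			__mod_memcg_lruvec_state(lruvec, idx, val);
	}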

Link: https://lkml.kernel.org/r/20210108155813.2914586-1-shakeelb@google.com
Fixes: 5f9a4f4a7096 ("mm: memcontrol: add the missing numa_stat interface for cgroup v2")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/migrate.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/mm/migrate.c~mm-memcg-fix-memcg-file_dirty-numa-stat
+++ a/mm/migrate.c
@@ -500,9 +500,9 @@ int migrate_page_move_mapping(struct add
 			__inc_lruvec_state(new_lruvec, NR_SHMEM);
 		}
 		if (dirty && mapping_can_writeback(mapping)) {
-			__dec_node_state(oldzone->zone_pgdat, NR_FILE_DIRTY);
+			__dec_lruvec_state(old_lruvec, NR_FILE_DIRTY);
 			__dec_zone_state(oldzone, NR_ZONE_WRITE_PENDING);
-			__inc_node_state(newzone->zone_pgdat, NR_FILE_DIRTY);
+			__inc_lruvec_state(new_lruvec, NR_FILE_DIRTY);
 			__inc_zone_state(newzone, NR_ZONE_WRITE_PENDING);
 		}
 	}
_


* [patch 05/19] mm: fix numa stats for thp migration
       [not found] <20210123210029.a7c704d0cab204683e41ad10@linux-foundation.org>
                   ` (3 preceding siblings ...)
  2021-01-24  5:01 ` [patch 04/19] mm: memcg: fix memcg file_dirty numa stat Andrew Morton
@ 2021-01-24  5:01 ` Andrew Morton
  2021-01-24  5:01 ` [patch 06/19] mm: memcontrol: prevent starvation when writing memory.high Andrew Morton
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 10+ messages in thread
From: Andrew Morton @ 2021-01-24  5:01 UTC (permalink / raw)
  To: akpm, guro, hannes, linux-mm, mhocko, mm-commits, shakeelb,
	shy828301, songmuchun, stable, torvalds

From: Shakeel Butt <shakeelb@google.com>
Subject: mm: fix numa stats for thp migration

Currently the kernel does not correctly update the numa stats for
NR_FILE_PAGES and NR_SHMEM on THP migration.  Fix that.  For NR_FILE_DIRTY
and NR_ZONE_WRITE_PENDING there is at the moment no need to handle THP
migration, as the kernel does not yet have write support for file THP, but
to be more future proof this patch adds THP support for those stats as
well.
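
For context, thp_nr_pages() is the helper the patch caches in 'nr'
(a sketch simplified from include/linux/huge_mm.h of that era):

	static inline int thp_nr_pages(struct page *page)
	{
		/* number of base pages the (possibly huge) page spans */
		if (PageHead(page))
			return HPAGE_PMD_NR;	/* 512 with 4K base pages on x86-64 */
		return 1;
	}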

Link: https://lkml.kernel.org/r/20210108155813.2914586-2-shakeelb@google.com
Fixes: e71769ae52609 ("mm: enable thp migration for shmem thp")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/migrate.c |   23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

--- a/mm/migrate.c~mm-fix-numa-stats-for-thp-migration
+++ a/mm/migrate.c
@@ -402,6 +402,7 @@ int migrate_page_move_mapping(struct add
 	struct zone *oldzone, *newzone;
 	int dirty;
 	int expected_count = expected_page_refs(mapping, page) + extra_count;
+	int nr = thp_nr_pages(page);
 
 	if (!mapping) {
 		/* Anonymous page without mapping */
@@ -437,7 +438,7 @@ int migrate_page_move_mapping(struct add
 	 */
 	newpage->index = page->index;
 	newpage->mapping = page->mapping;
-	page_ref_add(newpage, thp_nr_pages(page)); /* add cache reference */
+	page_ref_add(newpage, nr); /* add cache reference */
 	if (PageSwapBacked(page)) {
 		__SetPageSwapBacked(newpage);
 		if (PageSwapCache(page)) {
@@ -459,7 +460,7 @@ int migrate_page_move_mapping(struct add
 	if (PageTransHuge(page)) {
 		int i;
 
-		for (i = 1; i < HPAGE_PMD_NR; i++) {
+		for (i = 1; i < nr; i++) {
 			xas_next(&xas);
 			xas_store(&xas, newpage);
 		}
@@ -470,7 +471,7 @@ int migrate_page_move_mapping(struct add
 	 * to one less reference.
 	 * We know this isn't the last reference.
 	 */
-	page_ref_unfreeze(page, expected_count - thp_nr_pages(page));
+	page_ref_unfreeze(page, expected_count - nr);
 
 	xas_unlock(&xas);
 	/* Leave irq disabled to prevent preemption while updating stats */
@@ -493,17 +494,17 @@ int migrate_page_move_mapping(struct add
 		old_lruvec = mem_cgroup_lruvec(memcg, oldzone->zone_pgdat);
 		new_lruvec = mem_cgroup_lruvec(memcg, newzone->zone_pgdat);
 
-		__dec_lruvec_state(old_lruvec, NR_FILE_PAGES);
-		__inc_lruvec_state(new_lruvec, NR_FILE_PAGES);
+		__mod_lruvec_state(old_lruvec, NR_FILE_PAGES, -nr);
+		__mod_lruvec_state(new_lruvec, NR_FILE_PAGES, nr);
 		if (PageSwapBacked(page) && !PageSwapCache(page)) {
-			__dec_lruvec_state(old_lruvec, NR_SHMEM);
-			__inc_lruvec_state(new_lruvec, NR_SHMEM);
+			__mod_lruvec_state(old_lruvec, NR_SHMEM, -nr);
+			__mod_lruvec_state(new_lruvec, NR_SHMEM, nr);
 		}
 		if (dirty && mapping_can_writeback(mapping)) {
-			__dec_lruvec_state(old_lruvec, NR_FILE_DIRTY);
-			__dec_zone_state(oldzone, NR_ZONE_WRITE_PENDING);
-			__inc_lruvec_state(new_lruvec, NR_FILE_DIRTY);
-			__inc_zone_state(newzone, NR_ZONE_WRITE_PENDING);
+			__mod_lruvec_state(old_lruvec, NR_FILE_DIRTY, -nr);
+			__mod_zone_page_state(oldzone, NR_ZONE_WRITE_PENDING, -nr);
+			__mod_lruvec_state(new_lruvec, NR_FILE_DIRTY, nr);
+			__mod_zone_page_state(newzone, NR_ZONE_WRITE_PENDING, nr);
 		}
 	}
 	local_irq_enable();
_


* [patch 06/19] mm: memcontrol: prevent starvation when writing memory.high
       [not found] <20210123210029.a7c704d0cab204683e41ad10@linux-foundation.org>
                   ` (4 preceding siblings ...)
  2021-01-24  5:01 ` [patch 05/19] mm: fix numa stats for thp migration Andrew Morton
@ 2021-01-24  5:01 ` Andrew Morton
  2021-01-24 18:01   ` Shakeel Butt
  2021-01-24  5:01 ` [patch 13/19] mm: fix page reference leak in soft_offline_page() Andrew Morton
  2021-01-24  5:02 ` [patch 18/19] proc_sysctl: fix oops caused by incorrect command parameters Andrew Morton
  7 siblings, 1 reply; 10+ messages in thread
From: Andrew Morton @ 2021-01-24  5:01 UTC (permalink / raw)
  To: akpm, guro, hannes, linux-mm, mhocko, mkoutny, mm-commits,
	shakeelb, stable, tj, torvalds

From: Johannes Weiner <hannes@cmpxchg.org>
Subject: mm: memcontrol: prevent starvation when writing memory.high

When a value is written to a cgroup's memory.high control file, the
write() context first tries to reclaim the cgroup to size before putting
the limit in place for the workload.  Concurrent charges from the workload
can keep such a write() looping in reclaim indefinitely.

In the past, a write to memory.high would first put the limit in place for
the workload, then do targeted reclaim until the new limit has been met -
similar to how we do it for memory.max.  This wasn't prone to the
described starvation issue.  However, this sequence could cause excessive
latencies in the workload, when allocating threads could be put into long
penalty sleeps on the sudden memory.high overage created by the write(),
before that had a chance to work it off.

Now that memory_high_write() performs reclaim before enforcing the new
limit, reflect that the cgroup may well fail to converge due to concurrent
workload activity.  Bail out of the loop after a few tries.
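
Condensed, the loop after this patch has the following shape (a sketch
following the commit message; signal handling and the stock-draining step
are omitted, see the diff below for the actual change):

	unsigned int nr_retries = MAX_RECLAIM_RETRIES;

	for (;;) {
		unsigned long nr_pages = page_counter_read(&memcg->memory);

		if (nr_pages <= high)
			break;
		try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
					     GFP_KERNEL, true);
		if (!nr_retries--)
			break;	/* concurrent charges keep us above high; give up */
	}
	page_counter_set_high(memcg, high);	/* enforce the limit afterwards */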

Link: https://lkml.kernel.org/r/20210112163011.127833-1-hannes@cmpxchg.org
Fixes: 536d3bf261a2 ("mm: memcontrol: avoid workload stalls when lowering memory.high")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Reported-by: Tejun Heo <tj@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>	[5.8+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memcontrol.c |    7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

--- a/mm/memcontrol.c~mm-memcontrol-prevent-starvation-when-writing-memoryhigh
+++ a/mm/memcontrol.c
@@ -6273,7 +6273,6 @@ static ssize_t memory_high_write(struct
 
 	for (;;) {
 		unsigned long nr_pages = page_counter_read(&memcg->memory);
-		unsigned long reclaimed;
 
 		if (nr_pages <= high)
 			break;
@@ -6287,10 +6286,10 @@ static ssize_t memory_high_write(struct
 			continue;
 		}
 
-		reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
-							 GFP_KERNEL, true);
+		try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
+					     GFP_KERNEL, true);
 
-		if (!reclaimed && !nr_retries--)
+		if (!nr_retries--)
 			break;
 	}
 
_


* [patch 13/19] mm: fix page reference leak in soft_offline_page()
       [not found] <20210123210029.a7c704d0cab204683e41ad10@linux-foundation.org>
                   ` (5 preceding siblings ...)
  2021-01-24  5:01 ` [patch 06/19] mm: memcontrol: prevent starvation when writing memory.high Andrew Morton
@ 2021-01-24  5:01 ` Andrew Morton
  2021-01-24  5:02 ` [patch 18/19] proc_sysctl: fix oops caused by incorrect command parameters Andrew Morton
  7 siblings, 0 replies; 10+ messages in thread
From: Andrew Morton @ 2021-01-24  5:01 UTC (permalink / raw)
  To: akpm, cai, dan.j.williams, david, linux-mm, mhocko, mm-commits,
	naoya.horiguchi, osalvador, stable, torvalds

From: Dan Williams <dan.j.williams@intel.com>
Subject: mm: fix page reference leak in soft_offline_page()

The conversion to move pfn_to_online_page() internal to
soft_offline_page() missed that the get_user_pages() reference taken by
the madvise() path needs to be dropped when pfn_to_online_page() fails. 
Note that the direct sysfs path to soft_offline_page() does not perform a
get_user_pages() lookup.

When soft_offline_page() is handed a pfn that is pfn_valid() but
!pfn_to_online_page(), the kernel hangs at dax-device shutdown due to the
leaked reference.
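
In essence, the leaked path before the fix was (a sketch based on the
commit message; MF_COUNT_INCREASED marks the reference taken by
get_user_pages() on the madvise() path):

	page = pfn_to_online_page(pfn);	/* NULL for valid-but-offline pfns, e.g. dax */
	if (!page)
		return -EIO;		/* get_user_pages() reference never dropped */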

Link: https://lkml.kernel.org/r/161058501210.1840162.8108917599181157327.stgit@dwillia2-desk3.amr.corp.intel.com
Fixes: feec24a6139d ("mm, soft-offline: convert parameter to pfn")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Qian Cai <cai@lca.pw>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory-failure.c |   20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

--- a/mm/memory-failure.c~mm-fix-page-reference-leak-in-soft_offline_page
+++ a/mm/memory-failure.c
@@ -1885,6 +1885,12 @@ static int soft_offline_free_page(struct
 	return rc;
 }
 
+static void put_ref_page(struct page *page)
+{
+	if (page)
+		put_page(page);
+}
+
 /**
  * soft_offline_page - Soft offline a page.
  * @pfn: pfn to soft-offline
@@ -1910,20 +1916,26 @@ static int soft_offline_free_page(struct
 int soft_offline_page(unsigned long pfn, int flags)
 {
 	int ret;
-	struct page *page;
 	bool try_again = true;
+	struct page *page, *ref_page = NULL;
+
+	WARN_ON_ONCE(!pfn_valid(pfn) && (flags & MF_COUNT_INCREASED));
 
 	if (!pfn_valid(pfn))
 		return -ENXIO;
+	if (flags & MF_COUNT_INCREASED)
+		ref_page = pfn_to_page(pfn);
+
 	/* Only online pages can be soft-offlined (esp., not ZONE_DEVICE). */
 	page = pfn_to_online_page(pfn);
-	if (!page)
+	if (!page) {
+		put_ref_page(ref_page);
 		return -EIO;
+	}
 
 	if (PageHWPoison(page)) {
 		pr_info("%s: %#lx page already poisoned\n", __func__, pfn);
-		if (flags & MF_COUNT_INCREASED)
-			put_page(page);
+		put_ref_page(ref_page);
 		return 0;
 	}
 
_


* [patch 18/19] proc_sysctl: fix oops caused by incorrect command parameters
       [not found] <20210123210029.a7c704d0cab204683e41ad10@linux-foundation.org>
                   ` (6 preceding siblings ...)
  2021-01-24  5:01 ` [patch 13/19] mm: fix page reference leak in soft_offline_page() Andrew Morton
@ 2021-01-24  5:02 ` Andrew Morton
  7 siblings, 0 replies; 10+ messages in thread
From: Andrew Morton @ 2021-01-24  5:02 UTC (permalink / raw)
  To: adobriyan, akpm, hkallweit1, keescook, linux-mm, mcgrof,
	mhiramat, mhocko, mm-commits, nixiaoming, rdunlap, stable,
	torvalds, vbabka, yzaikin

From: Xiaoming Ni <nixiaoming@huawei.com>
Subject: proc_sysctl: fix oops caused by incorrect command parameters

process_sysctl_arg() does not check whether val is empty before invoking
strlen(val).  If a command line parameter is incorrectly configured and
val is empty, an oops is triggered.

For example:
  "hung_task_panic=1" is incorrectly written as "hung_task_panic", oops is
  triggered. The call stack is as follows:
    Kernel command line: .... hung_task_panic
    ......
    Call trace:
    __pi_strlen+0x10/0x98
    parse_args+0x278/0x344
    do_sysctl_args+0x8c/0xfc
    kernel_init+0x5c/0xf4
    ret_from_fork+0x10/0x30

To fix it, check whether "val" is empty when "param" is a sysctl field.
Error codes are returned in the failure branch, and the error logs are
generated by parse_args().
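
The crash in miniature (a hypothetical standalone snippet, not kernel
code): for a parameter written without "=value", parse_args() hands the
handler a NULL val:

	char *val = NULL;		/* "hung_task_panic" carries no value */
	size_t len = strlen(val);	/* NULL dereference -> oops */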

Link: https://lkml.kernel.org/r/20210118133029.28580-1-nixiaoming@huawei.com
Fixes: 3db978d480e2843 ("kernel/sysctl: support setting sysctl parameters from kernel command line")
Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Heiner Kallweit <hkallweit1@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: <stable@vger.kernel.org>	[5.8+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/proc/proc_sysctl.c |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

--- a/fs/proc/proc_sysctl.c~proc_sysctl-fix-oops-caused-by-incorrect-command-parameters
+++ a/fs/proc/proc_sysctl.c
@@ -1770,6 +1770,12 @@ static int process_sysctl_arg(char *para
 			return 0;
 	}
 
+	if (!val)
+		return -EINVAL;
+	len = strlen(val);
+	if (len == 0)
+		return -EINVAL;
+
 	/*
 	 * To set sysctl options, we use a temporary mount of proc, look up the
 	 * respective sys/ file and write to it. To avoid mounting it when no
@@ -1811,7 +1817,6 @@ static int process_sysctl_arg(char *para
 				file, param, val);
 		goto out;
 	}
-	len = strlen(val);
 	wret = kernel_write(file, val, len, &pos);
 	if (wret < 0) {
 		err = wret;
_


* Re: [patch 06/19] mm: memcontrol: prevent starvation when writing memory.high
  2021-01-24  5:01 ` [patch 06/19] mm: memcontrol: prevent starvation when writing memory.high Andrew Morton
@ 2021-01-24 18:01   ` Shakeel Butt
  2021-01-24 18:35     ` Linus Torvalds
  0 siblings, 1 reply; 10+ messages in thread
From: Shakeel Butt @ 2021-01-24 18:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Roman Gushchin, Johannes Weiner, Linux MM, Michal Hocko,
	Michal Koutný,
	mm-commits, stable, Tejun Heo, Linus Torvalds

On Sat, Jan 23, 2021 at 9:01 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> From: Johannes Weiner <hannes@cmpxchg.org>
> Subject: mm: memcontrol: prevent starvation when writing memory.high
>
> When a value is written to a cgroup's memory.high control file, the
> write() context first tries to reclaim the cgroup to size before putting
> the limit in place for the workload.  Concurrent charges from the workload
> can keep such a write() looping in reclaim indefinitely.
>
> In the past, a write to memory.high would first put the limit in place for
> the workload, then do targeted reclaim until the new limit has been met -
> similar to how we do it for memory.max.  This wasn't prone to the
> described starvation issue.  However, this sequence could cause excessive
> latencies in the workload, when allocating threads could be put into long
> penalty sleeps on the sudden memory.high overage created by the write(),
> before that had a chance to work it off.
>
> Now that memory_high_write() performs reclaim before enforcing the new
> limit, reflect that the cgroup may well fail to converge due to concurrent
> workload activity.  Bail out of the loop after a few tries.
>
> Link: https://lkml.kernel.org/r/20210112163011.127833-1-hannes@cmpxchg.org
> Fixes: 536d3bf261a2 ("mm: memcontrol: avoid workload stalls when lowering memory.high")
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Reviewed-by: Shakeel Butt <shakeelb@google.com>
> Reported-by: Tejun Heo <tj@kernel.org>
> Acked-by: Roman Gushchin <guro@fb.com>
> Reviewed-by: Michal Koutný <mkoutny@suse.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: <stable@vger.kernel.org>    [5.8+]
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Johannes requested to replace this patch with
https://lore.kernel.org/linux-mm/20210122184341.292461-1-hannes@cmpxchg.org/


* Re: [patch 06/19] mm: memcontrol: prevent starvation when writing memory.high
  2021-01-24 18:01   ` Shakeel Butt
@ 2021-01-24 18:35     ` Linus Torvalds
  0 siblings, 0 replies; 10+ messages in thread
From: Linus Torvalds @ 2021-01-24 18:35 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Roman Gushchin, Johannes Weiner, Linux MM,
	Michal Hocko, Michal Koutný,
	mm-commits, stable, Tejun Heo

On Sun, Jan 24, 2021 at 10:02 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> Johannes requested to replace this patch with
> https://lore.kernel.org/linux-mm/20210122184341.292461-1-hannes@cmpxchg.org/

I've dropped it (not replaced it - will wait for Andrew to
comment/send) from my queue.

          Linus

