mm-commits.vger.kernel.org archive mirror
* incoming
@ 2021-03-13  5:06 Andrew Morton
  2021-03-13  5:07 ` [patch 01/29] memblock: fix section mismatch warning Andrew Morton
                   ` (28 more replies)
  0 siblings, 29 replies; 32+ messages in thread
From: Andrew Morton @ 2021-03-13  5:06 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: mm-commits, linux-mm


29 patches, based on f78d76e72a4671ea52d12752d92077788b4f5d50.

Subsystems affected by this patch series:

  mm/memblock
  core-kernel
  kconfig
  mm/pagealloc
  fork
  mm/hugetlb
  mm/highmem
  binfmt
  MAINTAINERS
  kbuild
  mm/kfence
  mm/oom-kill
  mm/madvise
  mm/kasan
  mm/userfaultfd
  mm/memory-failure
  ia64
  mm/memcg
  mm/zram

Subsystem: mm/memblock

    Arnd Bergmann <arnd@arndb.de>:
      memblock: fix section mismatch warning

Subsystem: core-kernel

    Arnd Bergmann <arnd@arndb.de>:
      stop_machine: mark helpers __always_inline

Subsystem: kconfig

    Masahiro Yamada <masahiroy@kernel.org>:
      init/Kconfig: make COMPILE_TEST depend on HAS_IOMEM

Subsystem: mm/pagealloc

    Mike Rapoport <rppt@linux.ibm.com>:
      mm/page_alloc.c: refactor initialization of struct page for holes in memory layout

Subsystem: fork

    Fenghua Yu <fenghua.yu@intel.com>:
      mm/fork: clear PASID for new mm

Subsystem: mm/hugetlb

    Peter Xu <peterx@redhat.com>:
    Patch series "mm/hugetlb: Early cow on fork, and a few cleanups", v5:
      hugetlb: dedup the code to add a new file_region
      hugetlb: break earlier in add_reservation_in_range() when we can
      mm: introduce page_needs_cow_for_dma() for deciding whether cow
      mm: use is_cow_mapping() across tree where proper
      hugetlb: do early cow when page pinned on src mm

Subsystem: mm/highmem

    OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>:
      mm/highmem.c: fix zero_user_segments() with start > end

Subsystem: binfmt

    Lior Ribak <liorribak@gmail.com>:
      binfmt_misc: fix possible deadlock in bm_register_write

Subsystem: MAINTAINERS

    Vlastimil Babka <vbabka@suse.cz>:
      MAINTAINERS: exclude uapi directories in API/ABI section

Subsystem: kbuild

    Arnd Bergmann <arnd@arndb.de>:
      linux/compiler-clang.h: define HAVE_BUILTIN_BSWAP*

Subsystem: mm/kfence

    Marco Elver <elver@google.com>:
      kfence: fix printk format for ptrdiff_t
      kfence, slab: fix cache_alloc_debugcheck_after() for bulk allocations
      kfence: fix reports if constant function prefixes exist

Subsystem: mm/oom-kill

    "Matthew Wilcox (Oracle)" <willy@infradead.org>:
      include/linux/sched/mm.h: use rcu_dereference in in_vfork()

Subsystem: mm/madvise

    Suren Baghdasaryan <surenb@google.com>:
      mm/madvise: replace ptrace attach requirement for process_madvise

Subsystem: mm/kasan

    Andrey Konovalov <andreyknvl@google.com>:
      kasan, mm: fix crash with HW_TAGS and DEBUG_PAGEALLOC
      kasan: fix KASAN_STACK dependency for HW_TAGS

Subsystem: mm/userfaultfd

    Nadav Amit <namit@vmware.com>:
      mm/userfaultfd: fix memory corruption due to writeprotect

Subsystem: mm/memory-failure

    Naoya Horiguchi <naoya.horiguchi@nec.com>:
      mm, hwpoison: do not lock page again when me_huge_page() successfully recovers

Subsystem: ia64

    Sergei Trofimovich <slyfox@gentoo.org>:
      ia64: fix ia64_syscall_get_set_arguments() for break-based syscalls
      ia64: fix ptrace(PTRACE_SYSCALL_INFO_EXIT) sign

Subsystem: mm/memcg

    Zhou Guanghui <zhouguanghui1@huawei.com>:
      mm/memcg: rename mem_cgroup_split_huge_fixup to split_page_memcg and add nr_pages argument
      mm/memcg: set memcg when splitting page

Subsystem: mm/zram

    Minchan Kim <minchan@kernel.org>:
      zram: fix return value on writeback_store
      zram: fix broken page writeback

 MAINTAINERS                                |    4 
 arch/ia64/include/asm/syscall.h            |    2 
 arch/ia64/kernel/ptrace.c                  |   24 +++-
 drivers/block/zram/zram_drv.c              |   17 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_page_dirty.c |    4 
 drivers/gpu/drm/vmwgfx/vmwgfx_ttm_glue.c   |    2 
 fs/binfmt_misc.c                           |   29 ++---
 fs/proc/task_mmu.c                         |    2 
 include/linux/compiler-clang.h             |    6 +
 include/linux/memblock.h                   |    4 
 include/linux/memcontrol.h                 |    6 -
 include/linux/mm.h                         |   21 +++
 include/linux/mm_types.h                   |    1 
 include/linux/sched/mm.h                   |    3 
 include/linux/stop_machine.h               |   11 +
 init/Kconfig                               |    3 
 kernel/fork.c                              |    8 +
 lib/Kconfig.kasan                          |    1 
 mm/highmem.c                               |   17 ++
 mm/huge_memory.c                           |   10 -
 mm/hugetlb.c                               |  123 +++++++++++++++------
 mm/internal.h                              |    5 
 mm/kfence/report.c                         |   30 +++--
 mm/madvise.c                               |   13 ++
 mm/memcontrol.c                            |   15 +-
 mm/memory-failure.c                        |    4 
 mm/memory.c                                |   16 +-
 mm/page_alloc.c                            |  167 ++++++++++++++---------------
 mm/slab.c                                  |    2 
 29 files changed, 334 insertions(+), 216 deletions(-)



* [patch 01/29] memblock: fix section mismatch warning
  2021-03-13  5:06 incoming Andrew Morton
@ 2021-03-13  5:07 ` Andrew Morton
  2021-03-13  5:07 ` [patch 02/29] stop_machine: mark helpers __always_inline Andrew Morton
                   ` (27 subsequent siblings)
  28 siblings, 0 replies; 32+ messages in thread
From: Andrew Morton @ 2021-03-13  5:07 UTC (permalink / raw)
  To: akpm, arnd, aslan, bhe, david, faiyazm, linux-mm, mm-commits,
	nathan, ndesaulniers, rppt, torvalds, tsbogend

From: Arnd Bergmann <arnd@arndb.de>
Subject: memblock: fix section mismatch warning

The inlining logic in clang-13 was rewritten and now often declines to
inline functions that all earlier compilers inlined.

In case of the memblock interfaces, this exposed a harmless bug of a
missing __init annotation:

WARNING: modpost: vmlinux.o(.text+0x507c0a): Section mismatch in reference from the function memblock_bottom_up() to the variable .meminit.data:memblock
The function memblock_bottom_up() references
the variable __meminitdata memblock.
This is often because memblock_bottom_up lacks a __meminitdata
annotation or the annotation of memblock is wrong.

Interestingly, these annotations were present originally, but got removed
with the explanation that the __init annotation prevents the function from
getting inlined.  I checked this again and found that while this is the
case with clang, gcc (versions 7 through 10; I did not test others) does
inline the functions regardless.
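
For illustration, a minimal sketch (simplified names, not the actual
memblock code) of the pattern modpost warns about: a variable placed in
.meminit.data referenced by a function the compiler chose not to inline,
so the reference ends up coming from plain .text:

	/* sketch only -- simplified, not the kernel's memblock code */
	struct memblock_state {
		bool bottom_up;
	};

	static struct memblock_state state __meminitdata;

	/* Without __init, an out-of-line copy lands in .text and its
	 * reference to .meminit.data is what modpost flags. */
	static inline bool state_bottom_up_broken(void)
	{
		return state.bottom_up;
	}

	/* With __init, any out-of-line copy lands in .init.text, where
	 * referencing __meminitdata is allowed. */
	static inline __init bool state_bottom_up_fixed(void)
	{
		return state.bottom_up;
	}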

As the previous change was apparently intended to help the clang builds,
reverting it to help the newer clang versions seems appropriate as well. 
gcc builds don't seem to care either way.

Link: https://lkml.kernel.org/r/20210225133808.2188581-1-arnd@kernel.org
Fixes: 5bdba520c1b3 ("mm: memblock: drop __init from memblock functions to make it inline")
Reference: 2cfb3665e864 ("include/linux/memblock.h: add __init to memblock_set_bottom_up()")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Faiyaz Mohammed <faiyazm@codeaurora.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Aslan Bakirov <aslan@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/memblock.h |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/include/linux/memblock.h~memblock-fix-section-mismatch-warning
+++ a/include/linux/memblock.h
@@ -460,7 +460,7 @@ static inline void memblock_free_late(ph
 /*
  * Set the allocation direction to bottom-up or top-down.
  */
-static inline void memblock_set_bottom_up(bool enable)
+static inline __init void memblock_set_bottom_up(bool enable)
 {
 	memblock.bottom_up = enable;
 }
@@ -470,7 +470,7 @@ static inline void memblock_set_bottom_u
  * if this is true, that said, memblock will allocate memory
  * in bottom-up direction.
  */
-static inline bool memblock_bottom_up(void)
+static inline __init bool memblock_bottom_up(void)
 {
 	return memblock.bottom_up;
 }
_


* [patch 02/29] stop_machine: mark helpers __always_inline
  2021-03-13  5:06 incoming Andrew Morton
  2021-03-13  5:07 ` [patch 01/29] memblock: fix section mismatch warning Andrew Morton
@ 2021-03-13  5:07 ` Andrew Morton
  2021-03-13  5:07 ` [patch 03/29] init/Kconfig: make COMPILE_TEST depend on HAS_IOMEM Andrew Morton
                   ` (26 subsequent siblings)
  28 siblings, 0 replies; 32+ messages in thread
From: Andrew Morton @ 2021-03-13  5:07 UTC (permalink / raw)
  To: akpm, arnd, bigeasy, bristot, linux-mm, mingo, mm-commits,
	nathan, ndesaulniers, paulmck, peterz, prarit, tglx, torvalds,
	valentin.schneider

From: Arnd Bergmann <arnd@arndb.de>
Subject: stop_machine: mark helpers __always_inline

With clang-13, some functions only get partially inlined, with a
specialized version referring to a global variable.  This triggers a
harmless build-time warning for the intel-rng driver:

WARNING: modpost: drivers/char/hw_random/intel-rng.o(.text+0xe): Section mismatch in reference from the function stop_machine() to the function .init.text:intel_rng_hw_init()
The function stop_machine() references
the function __init intel_rng_hw_init().
This is often because stop_machine lacks a __init
annotation or the annotation of intel_rng_hw_init is wrong.

In this instance, an easy workaround is to force the stop_machine()
function to be inline, along with related interfaces that do not show the
same behavior at the moment but theoretically could.

The combination of the two patches listed below triggers the behavior in
clang-13, but individually these commits are correct.
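
For illustration, a hedged sketch (loosely modeled on the intel-rng probe
path, heavily simplified) of the caller pattern involved: an __init caller
hands stop_machine() an __init callback, which is only safe as long as the
stop_machine() stub is fully inlined into that caller rather than emitted
as an out-of-line copy in plain .text:

	/* sketch only -- not the actual driver code */
	static int __init intel_rng_hw_init(void *arg)
	{
		/* ... program the RNG control registers ... */
		return 0;
	}

	static int __init intel_init_hw(void)
	{
		/*
		 * Fine when stop_machine() is inlined here
		 * (.init.text -> .init.text); a partially inlined copy of
		 * stop_machine() in .text would reference the __init
		 * callback and trip the modpost warning above.
		 */
		return stop_machine(intel_rng_hw_init, NULL, NULL);
	}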

Link: https://lkml.kernel.org/r/20210225130153.1956990-1-arnd@kernel.org
Fixes: fe5595c07400 ("stop_machine: Provide stop_machine_cpuslocked()")
Fixes: ee527cd3a20c ("Use stop_machine_run in the Intel RNG driver")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/stop_machine.h |   11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

--- a/include/linux/stop_machine.h~stop_machine-mark-helpers-__always_inline
+++ a/include/linux/stop_machine.h
@@ -128,7 +128,7 @@ int stop_machine_from_inactive_cpu(cpu_s
 				   const struct cpumask *cpus);
 #else	/* CONFIG_SMP || CONFIG_HOTPLUG_CPU */
 
-static inline int stop_machine_cpuslocked(cpu_stop_fn_t fn, void *data,
+static __always_inline int stop_machine_cpuslocked(cpu_stop_fn_t fn, void *data,
 					  const struct cpumask *cpus)
 {
 	unsigned long flags;
@@ -139,14 +139,15 @@ static inline int stop_machine_cpuslocke
 	return ret;
 }
 
-static inline int stop_machine(cpu_stop_fn_t fn, void *data,
-			       const struct cpumask *cpus)
+static __always_inline int
+stop_machine(cpu_stop_fn_t fn, void *data, const struct cpumask *cpus)
 {
 	return stop_machine_cpuslocked(fn, data, cpus);
 }
 
-static inline int stop_machine_from_inactive_cpu(cpu_stop_fn_t fn, void *data,
-						 const struct cpumask *cpus)
+static __always_inline int
+stop_machine_from_inactive_cpu(cpu_stop_fn_t fn, void *data,
+			       const struct cpumask *cpus)
 {
 	return stop_machine(fn, data, cpus);
 }
_


* [patch 03/29] init/Kconfig: make COMPILE_TEST depend on HAS_IOMEM
  2021-03-13  5:06 incoming Andrew Morton
  2021-03-13  5:07 ` [patch 01/29] memblock: fix section mismatch warning Andrew Morton
  2021-03-13  5:07 ` [patch 02/29] stop_machine: mark helpers __always_inline Andrew Morton
@ 2021-03-13  5:07 ` Andrew Morton
  2021-03-13  5:07 ` [patch 04/29] mm/page_alloc.c: refactor initialization of struct page for holes in memory layout Andrew Morton
                   ` (25 subsequent siblings)
  28 siblings, 0 replies; 32+ messages in thread
From: Andrew Morton @ 2021-03-13  5:07 UTC (permalink / raw)
  To: akpm, arnd, daniel, hannes, hca, keescook, kpsingh, linux-mm,
	linux, lkml, masahiroy, mm-commits, nathan, qperret, terrelln,
	torvalds, valentin.schneider

From: Masahiro Yamada <masahiroy@kernel.org>
Subject: init/Kconfig: make COMPILE_TEST depend on HAS_IOMEM

I read the commit log of the following two:

- bc083a64b6c0 ("init/Kconfig: make COMPILE_TEST depend on !UML")
- 334ef6ed06fa ("init/Kconfig: make COMPILE_TEST depend on !S390")

Both talk about the HAS_IOMEM dependency missing in many drivers.

So, 'depends on HAS_IOMEM' seems like the direct, sensible solution to me.

This does not change the behavior of UML. UML still cannot enable
COMPILE_TEST because it does not provide HAS_IOMEM.

The current dependency for S390 is too strong: with CONFIG_PCI=y, S390
provides HAS_IOMEM and hence can enable COMPILE_TEST.

I also removed the meaningless 'default n'.

Link: https://lkml.kernel.org/r/20210224140809.1067582-1-masahiroy@kernel.org
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Arnd Bergmann <arnd@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KP Singh <kpsingh@google.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Cc: Quentin Perret <qperret@google.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: "Enrico Weigelt, metux IT consult" <lkml@metux.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 init/Kconfig |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

--- a/init/Kconfig~init-kconfig-make-compile_test-depend-on-has_iomem
+++ a/init/Kconfig
@@ -119,8 +119,7 @@ config INIT_ENV_ARG_LIMIT
 
 config COMPILE_TEST
 	bool "Compile also drivers which will not load"
-	depends on !UML && !S390
-	default n
+	depends on HAS_IOMEM
 	help
 	  Some drivers can be compiled on a different platform than they are
 	  intended to be run on. Despite they cannot be loaded there (or even
_


* [patch 04/29] mm/page_alloc.c: refactor initialization of struct page for holes in memory layout
  2021-03-13  5:06 incoming Andrew Morton
                   ` (2 preceding siblings ...)
  2021-03-13  5:07 ` [patch 03/29] init/Kconfig: make COMPILE_TEST depend on HAS_IOMEM Andrew Morton
@ 2021-03-13  5:07 ` Andrew Morton
  2021-03-13  5:07 ` [patch 05/29] mm/fork: clear PASID for new mm Andrew Morton
                   ` (24 subsequent siblings)
  28 siblings, 0 replies; 32+ messages in thread
From: Andrew Morton @ 2021-03-13  5:07 UTC (permalink / raw)
  To: aarcange, akpm, bhe, bp, cai, chris, david, hpa, linux-mm, lma,
	mgorman, mhocko, mingo, mm-commits, rppt, stable, tglx,
	tomi.p.sarvela, torvalds, vbabka

From: Mike Rapoport <rppt@linux.ibm.com>
Subject: mm/page_alloc.c: refactor initialization of struct page for holes in memory layout

There could be struct pages that are not backed by actual physical memory.
This can happen when the actual memory bank is not a multiple of
SECTION_SIZE or when an architecture does not register memory holes
reserved by the firmware as memblock.memory.

Such pages are currently initialized by the init_unavailable_mem()
function, which iterates through the PFNs in holes in memblock.memory and,
if there is a struct page corresponding to a PFN, sets the fields of that
page to default values and marks it as Reserved.

init_unavailable_mem() does not take into account the zone and node the
page belongs to, and sets both the zone and node links in struct page to
zero.

Before commit 73a6e474cb37 ("mm: memmap_init: iterate over memblock
regions rather that check each PFN") the holes inside a zone were
re-initialized during memmap_init() and got their zone/node links right. 
However, after that commit nothing updates the struct pages representing
such holes.

On a system that has firmware reserved holes in a zone above ZONE_DMA, for
instance in a configuration below:

	# grep -A1 E820 /proc/iomem
	7a17b000-7a216fff : Unknown E820 type
	7a217000-7bffffff : System RAM

the unset zone link in struct page will trigger

	VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);

in set_pfnblock_flags_mask() when it is called with a struct page from a
range other than E820_TYPE_RAM, because such pages lie in the range of
ZONE_DMA32 but the unset zone link in struct page makes them appear to be
part of ZONE_DMA.

Interleave initialization of the unavailable pages with the normal
initialization of the memory map, so that zone and node information will
be properly set on struct pages that are not backed by actual memory.

With this change the pages for holes inside a zone will get proper
zone/node links, and the pages that are not spanned by any node will get
links to the adjacent zone/node.  The holes between nodes will be
prepended to the zone/node above the hole, and the trailing pages in the
last section will be appended to the zone/node below.
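
As a hedged sketch (simplified from the diff below; clamping, locking and
the pr_info accounting are left out), the new flow interleaves hole
initialization with the per-range memmap initialization, so each hole
inherits the zone/node of the zone currently being walked:

	/* sketch only -- see the real memmap_init_zone() change below */
	void memmap_init_zone_sketch(struct zone *zone)
	{
		static unsigned long hole_pfn;	/* carried across calls */
		int i, nid = zone_to_nid(zone), zone_id = zone_idx(zone);
		unsigned long start_pfn, end_pfn;

		for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
			/* real memory: memmap_init_range() as before */

			/* the hole that ended right before this range */
			if (hole_pfn < start_pfn)
				init_unavailable_range(hole_pfn, start_pfn,
						       zone_id, nid);
			hole_pfn = end_pfn;
		}
		/* trailing hole up to the section boundary handled here */
	}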

[akpm@linux-foundation.org: don't initialize static to zero, use %llu for u64]
Link: https://lkml.kernel.org/r/20210225224351.7356-2-rppt@kernel.org
Fixes: 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather that check each PFN")
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reported-by: Qian Cai <cai@lca.pw>
Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Łukasz Majczak <lma@semihalf.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: "Sarvela, Tomi P" <tomi.p.sarvela@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_alloc.c |  158 +++++++++++++++++++++-------------------------
 1 file changed, 75 insertions(+), 83 deletions(-)

--- a/mm/page_alloc.c~mm-page_allocc-refactor-initialization-of-struct-page-for-holes-in-memory-layout
+++ a/mm/page_alloc.c
@@ -6259,12 +6259,65 @@ static void __meminit zone_init_free_lis
 	}
 }
 
+#if !defined(CONFIG_FLAT_NODE_MEM_MAP)
+/*
+ * Only struct pages that correspond to ranges defined by memblock.memory
+ * are zeroed and initialized by going through __init_single_page() during
+ * memmap_init_zone().
+ *
+ * But, there could be struct pages that correspond to holes in
+ * memblock.memory. This can happen because of the following reasons:
+ * - physical memory bank size is not necessarily the exact multiple of the
+ *   arbitrary section size
+ * - early reserved memory may not be listed in memblock.memory
+ * - memory layouts defined with memmap= kernel parameter may not align
+ *   nicely with memmap sections
+ *
+ * Explicitly initialize those struct pages so that:
+ * - PG_Reserved is set
+ * - zone and node links point to zone and node that span the page if the
+ *   hole is in the middle of a zone
+ * - zone and node links point to adjacent zone/node if the hole falls on
+ *   the zone boundary; the pages in such holes will be prepended to the
+ *   zone/node above the hole except for the trailing pages in the last
+ *   section that will be appended to the zone/node below.
+ */
+static u64 __meminit init_unavailable_range(unsigned long spfn,
+					    unsigned long epfn,
+					    int zone, int node)
+{
+	unsigned long pfn;
+	u64 pgcnt = 0;
+
+	for (pfn = spfn; pfn < epfn; pfn++) {
+		if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) {
+			pfn = ALIGN_DOWN(pfn, pageblock_nr_pages)
+				+ pageblock_nr_pages - 1;
+			continue;
+		}
+		__init_single_page(pfn_to_page(pfn), pfn, zone, node);
+		__SetPageReserved(pfn_to_page(pfn));
+		pgcnt++;
+	}
+
+	return pgcnt;
+}
+#else
+static inline u64 init_unavailable_range(unsigned long spfn, unsigned long epfn,
+					 int zone, int node)
+{
+	return 0;
+}
+#endif
+
 void __meminit __weak memmap_init_zone(struct zone *zone)
 {
 	unsigned long zone_start_pfn = zone->zone_start_pfn;
 	unsigned long zone_end_pfn = zone_start_pfn + zone->spanned_pages;
 	int i, nid = zone_to_nid(zone), zone_id = zone_idx(zone);
+	static unsigned long hole_pfn;
 	unsigned long start_pfn, end_pfn;
+	u64 pgcnt = 0;
 
 	for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
 		start_pfn = clamp(start_pfn, zone_start_pfn, zone_end_pfn);
@@ -6274,7 +6327,29 @@ void __meminit __weak memmap_init_zone(s
 			memmap_init_range(end_pfn - start_pfn, nid,
 					zone_id, start_pfn, zone_end_pfn,
 					MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
+
+		if (hole_pfn < start_pfn)
+			pgcnt += init_unavailable_range(hole_pfn, start_pfn,
+							zone_id, nid);
+		hole_pfn = end_pfn;
 	}
+
+#ifdef CONFIG_SPARSEMEM
+	/*
+	 * Initialize the hole in the range [zone_end_pfn, section_end].
+	 * If zone boundary falls in the middle of a section, this hole
+	 * will be re-initialized during the call to this function for the
+	 * higher zone.
+	 */
+	end_pfn = round_up(zone_end_pfn, PAGES_PER_SECTION);
+	if (hole_pfn < end_pfn)
+		pgcnt += init_unavailable_range(hole_pfn, end_pfn,
+						zone_id, nid);
+#endif
+
+	if (pgcnt)
+		pr_info("  %s zone: %llu pages in unavailable ranges\n",
+			zone->name, pgcnt);
 }
 
 static int zone_batchsize(struct zone *zone)
@@ -7071,88 +7146,6 @@ void __init free_area_init_memoryless_no
 	free_area_init_node(nid);
 }
 
-#if !defined(CONFIG_FLAT_NODE_MEM_MAP)
-/*
- * Initialize all valid struct pages in the range [spfn, epfn) and mark them
- * PageReserved(). Return the number of struct pages that were initialized.
- */
-static u64 __init init_unavailable_range(unsigned long spfn, unsigned long epfn)
-{
-	unsigned long pfn;
-	u64 pgcnt = 0;
-
-	for (pfn = spfn; pfn < epfn; pfn++) {
-		if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) {
-			pfn = ALIGN_DOWN(pfn, pageblock_nr_pages)
-				+ pageblock_nr_pages - 1;
-			continue;
-		}
-		/*
-		 * Use a fake node/zone (0) for now. Some of these pages
-		 * (in memblock.reserved but not in memblock.memory) will
-		 * get re-initialized via reserve_bootmem_region() later.
-		 */
-		__init_single_page(pfn_to_page(pfn), pfn, 0, 0);
-		__SetPageReserved(pfn_to_page(pfn));
-		pgcnt++;
-	}
-
-	return pgcnt;
-}
-
-/*
- * Only struct pages that are backed by physical memory are zeroed and
- * initialized by going through __init_single_page(). But, there are some
- * struct pages which are reserved in memblock allocator and their fields
- * may be accessed (for example page_to_pfn() on some configuration accesses
- * flags). We must explicitly initialize those struct pages.
- *
- * This function also addresses a similar issue where struct pages are left
- * uninitialized because the physical address range is not covered by
- * memblock.memory or memblock.reserved. That could happen when memblock
- * layout is manually configured via memmap=, or when the highest physical
- * address (max_pfn) does not end on a section boundary.
- */
-static void __init init_unavailable_mem(void)
-{
-	phys_addr_t start, end;
-	u64 i, pgcnt;
-	phys_addr_t next = 0;
-
-	/*
-	 * Loop through unavailable ranges not covered by memblock.memory.
-	 */
-	pgcnt = 0;
-	for_each_mem_range(i, &start, &end) {
-		if (next < start)
-			pgcnt += init_unavailable_range(PFN_DOWN(next),
-							PFN_UP(start));
-		next = end;
-	}
-
-	/*
-	 * Early sections always have a fully populated memmap for the whole
-	 * section - see pfn_valid(). If the last section has holes at the
-	 * end and that section is marked "online", the memmap will be
-	 * considered initialized. Make sure that memmap has a well defined
-	 * state.
-	 */
-	pgcnt += init_unavailable_range(PFN_DOWN(next),
-					round_up(max_pfn, PAGES_PER_SECTION));
-
-	/*
-	 * Struct pages that do not have backing memory. This could be because
-	 * firmware is using some of this memory, or for some other reasons.
-	 */
-	if (pgcnt)
-		pr_info("Zeroed struct page in unavailable ranges: %lld pages", pgcnt);
-}
-#else
-static inline void __init init_unavailable_mem(void)
-{
-}
-#endif /* !CONFIG_FLAT_NODE_MEM_MAP */
-
 #if MAX_NUMNODES > 1
 /*
  * Figure out the number of possible node ids.
@@ -7576,7 +7569,6 @@ void __init free_area_init(unsigned long
 	/* Initialise every node */
 	mminit_verify_pageflags_layout();
 	setup_nr_node_ids();
-	init_unavailable_mem();
 	for_each_online_node(nid) {
 		pg_data_t *pgdat = NODE_DATA(nid);
 		free_area_init_node(nid);
_


* [patch 05/29] mm/fork: clear PASID for new mm
  2021-03-13  5:06 incoming Andrew Morton
                   ` (3 preceding siblings ...)
  2021-03-13  5:07 ` [patch 04/29] mm/page_alloc.c: refactor initialization of struct page for holes in memory layout Andrew Morton
@ 2021-03-13  5:07 ` Andrew Morton
  2021-03-13  5:07 ` [patch 06/29] hugetlb: dedup the code to add a new file_region Andrew Morton
                   ` (23 subsequent siblings)
  28 siblings, 0 replies; 32+ messages in thread
From: Andrew Morton @ 2021-03-13  5:07 UTC (permalink / raw)
  To: akpm, fenghua.yu, jacob.jun.pan, jean-philippe, linux-mm,
	mm-commits, tony.luck, torvalds

From: Fenghua Yu <fenghua.yu@intel.com>
Subject: mm/fork: clear PASID for new mm

When a new mm is created, its PASID should be cleared, i.e.  the PASID is
initialized to its init state 0 on both ARM and X86.

This patch was part of the series introducing mm->pasid, but got lost
along the way [1].  It still makes sense to have it, because each address
space has a different PASID.  And the IOMMU code in
iommu_sva_alloc_pasid() expects the pasid field of a new mm struct to be
cleared.

[1] https://lore.kernel.org/linux-iommu/YDgh53AcQHT+T3L0@otcwcpicx3.sc.intel.com/

Link: https://lkml.kernel.org/r/20210302103837.2562625-1-jean-philippe@linaro.org
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: Jacob Pan <jacob.jun.pan@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/mm_types.h |    1 +
 kernel/fork.c            |    8 ++++++++
 2 files changed, 9 insertions(+)

--- a/include/linux/mm_types.h~mm-fork-clear-pasid-for-new-mm
+++ a/include/linux/mm_types.h
@@ -23,6 +23,7 @@
 #endif
 #define AT_VECTOR_SIZE (2*(AT_VECTOR_SIZE_ARCH + AT_VECTOR_SIZE_BASE + 1))
 
+#define INIT_PASID	0
 
 struct address_space;
 struct mem_cgroup;
--- a/kernel/fork.c~mm-fork-clear-pasid-for-new-mm
+++ a/kernel/fork.c
@@ -994,6 +994,13 @@ static void mm_init_owner(struct mm_stru
 #endif
 }
 
+static void mm_init_pasid(struct mm_struct *mm)
+{
+#ifdef CONFIG_IOMMU_SUPPORT
+	mm->pasid = INIT_PASID;
+#endif
+}
+
 static void mm_init_uprobes_state(struct mm_struct *mm)
 {
 #ifdef CONFIG_UPROBES
@@ -1024,6 +1031,7 @@ static struct mm_struct *mm_init(struct
 	mm_init_cpumask(mm);
 	mm_init_aio(mm);
 	mm_init_owner(mm, p);
+	mm_init_pasid(mm);
 	RCU_INIT_POINTER(mm->exe_file, NULL);
 	mmu_notifier_subscriptions_init(mm);
 	init_tlb_flush_pending(mm);
_


* [patch 06/29] hugetlb: dedup the code to add a new file_region
  2021-03-13  5:06 incoming Andrew Morton
                   ` (4 preceding siblings ...)
  2021-03-13  5:07 ` [patch 05/29] mm/fork: clear PASID for new mm Andrew Morton
@ 2021-03-13  5:07 ` Andrew Morton
  2021-03-13  5:07 ` [patch 07/29] hugetlb: break earlier in add_reservation_in_range() when we can Andrew Morton
                   ` (22 subsequent siblings)
  28 siblings, 0 replies; 32+ messages in thread
From: Andrew Morton @ 2021-03-13  5:07 UTC (permalink / raw)
  To: aarcange, adobriyan, airlied, akpm, daniel, david, galpress, hch,
	jack, jannh, jgg, kirill, ktkhai, linmiaohe,
	linux-graphics-maintainer, linux-mm, mike.kravetz, mm-commits,
	peterx, rppt, sroland, torvalds, willy, wzam

From: Peter Xu <peterx@redhat.com>
Subject: hugetlb: dedup the code to add a new file_region

Patch series "mm/hugetlb: Early cow on fork, and a few cleanups", v5.

As reported by Gal [1], we still miss the code to handle early cow for
the hugetlb case, which is true.  Again, to me it still feels odd to
fork() after using a few huge pages, especially if they're privately
mapped.  However, I do agree with Gal and Jason that we should still have
it, since that will at least complete the early cow-on-fork effort, and it
will still fix issues where buffers are not well under control and
MADV_DONTFORK is not easy to apply.

The first two patches (1-2) are some cleanups I noticed when reading into
the hugetlb reserve map code.  I think they're good to have, but they're
not necessary for fixing the fork issue.

The last two patches (3-4) are the real fix.

I tested this with a fork() after some vfio-pci assignment, so I'm pretty
sure the page copy path can trigger (the page will be accounted right
after the fork()), but I didn't do a data check since the card I assigned
is just a random NIC.

  https://github.com/xzpeter/linux/tree/fork-cow-pin-huge

[1] https://lore.kernel.org/lkml/27564187-4a08-f187-5a84-3df50009f6ca@amazon.com/

Introduce a hugetlb_resv_map_add() helper to add a new file_region rather
than duplicating similar code twice in add_reservation_in_range().

Link: https://lkml.kernel.org/r/20210217233547.93892-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20210217233547.93892-2-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Gal Pressman <galpress@amazon.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Wei Zhang <wzam@amazon.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jann Horn <jannh@google.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Kirill Shutemov <kirill@shutemov.name>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: David Airlie <airlied@linux.ie>
Cc: Roland Scheidegger <sroland@vmware.com>
Cc: VMware Graphics <linux-graphics-maintainer@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/hugetlb.c |   51 +++++++++++++++++++++++++------------------------
 1 file changed, 27 insertions(+), 24 deletions(-)

--- a/mm/hugetlb.c~hugetlb-dedup-the-code-to-add-a-new-file_region
+++ a/mm/hugetlb.c
@@ -331,6 +331,24 @@ static void coalesce_file_region(struct
 	}
 }
 
+static inline long
+hugetlb_resv_map_add(struct resv_map *map, struct file_region *rg, long from,
+		     long to, struct hstate *h, struct hugetlb_cgroup *cg,
+		     long *regions_needed)
+{
+	struct file_region *nrg;
+
+	if (!regions_needed) {
+		nrg = get_file_region_entry_from_cache(map, from, to);
+		record_hugetlb_cgroup_uncharge_info(cg, h, map, nrg);
+		list_add(&nrg->link, rg->link.prev);
+		coalesce_file_region(map, nrg);
+	} else
+		*regions_needed += 1;
+
+	return to - from;
+}
+
 /*
  * Must be called with resv->lock held.
  *
@@ -346,7 +364,7 @@ static long add_reservation_in_range(str
 	long add = 0;
 	struct list_head *head = &resv->regions;
 	long last_accounted_offset = f;
-	struct file_region *rg = NULL, *trg = NULL, *nrg = NULL;
+	struct file_region *rg = NULL, *trg = NULL;
 
 	if (regions_needed)
 		*regions_needed = 0;
@@ -375,18 +393,11 @@ static long add_reservation_in_range(str
 		/* Add an entry for last_accounted_offset -> rg->from, and
 		 * update last_accounted_offset.
 		 */
-		if (rg->from > last_accounted_offset) {
-			add += rg->from - last_accounted_offset;
-			if (!regions_needed) {
-				nrg = get_file_region_entry_from_cache(
-					resv, last_accounted_offset, rg->from);
-				record_hugetlb_cgroup_uncharge_info(h_cg, h,
-								    resv, nrg);
-				list_add(&nrg->link, rg->link.prev);
-				coalesce_file_region(resv, nrg);
-			} else
-				*regions_needed += 1;
-		}
+		if (rg->from > last_accounted_offset)
+			add += hugetlb_resv_map_add(resv, rg,
+						    last_accounted_offset,
+						    rg->from, h, h_cg,
+						    regions_needed);
 
 		last_accounted_offset = rg->to;
 	}
@@ -394,17 +405,9 @@ static long add_reservation_in_range(str
 	/* Handle the case where our range extends beyond
 	 * last_accounted_offset.
 	 */
-	if (last_accounted_offset < t) {
-		add += t - last_accounted_offset;
-		if (!regions_needed) {
-			nrg = get_file_region_entry_from_cache(
-				resv, last_accounted_offset, t);
-			record_hugetlb_cgroup_uncharge_info(h_cg, h, resv, nrg);
-			list_add(&nrg->link, rg->link.prev);
-			coalesce_file_region(resv, nrg);
-		} else
-			*regions_needed += 1;
-	}
+	if (last_accounted_offset < t)
+		add += hugetlb_resv_map_add(resv, rg, last_accounted_offset,
+					    t, h, h_cg, regions_needed);
 
 	VM_BUG_ON(add < 0);
 	return add;
_


* [patch 07/29] hugetlb: break earlier in add_reservation_in_range() when we can
  2021-03-13  5:06 incoming Andrew Morton
                   ` (5 preceding siblings ...)
  2021-03-13  5:07 ` [patch 06/29] hugetlb: dedup the code to add a new file_region Andrew Morton
@ 2021-03-13  5:07 ` Andrew Morton
  2021-03-13  5:07 ` [patch 08/29] mm: introduce page_needs_cow_for_dma() for deciding whether cow Andrew Morton
                   ` (21 subsequent siblings)
  28 siblings, 0 replies; 32+ messages in thread
From: Andrew Morton @ 2021-03-13  5:07 UTC (permalink / raw)
  To: aarcange, adobriyan, airlied, akpm, daniel, david, galpress, hch,
	jack, jannh, jgg, kirill, ktkhai, linmiaohe,
	linux-graphics-maintainer, linux-mm, mike.kravetz, mm-commits,
	peterx, rppt, sroland, torvalds, willy, wzam

From: Peter Xu <peterx@redhat.com>
Subject: hugetlb: break earlier in add_reservation_in_range() when we can

All the regions maintained in the hugetlb reserve map are inclusive on
"from" but exclusive on "to".  We can break earlier even if rg->from==t,
because that already means no possible intersection.

This does not need a Fixes tag, because when it happens (rg->from==t) we
fail to break out of the loop when we should, but the next thing we do is
still add the last file_region we need and quit the loop in the next
round.  So this change is not a bugfix (the old code should still run
okay, iiuc), but we'd better still touch it up to make it logically sane.
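
A small worked illustration (values invented) of why rg->from==t already
means no intersection for the half-open [from, to) regions:

	/*
	 * Request range: f = 4, t = 8, i.e. [4, 8).
	 * Existing region: rg->from = 8, rg->to = 12, i.e. [8, 12).
	 * [4, 8) and [8, 12) share no offset, so the loop can stop here
	 * instead of doing one more round just to add the final
	 * [last_accounted_offset, t) piece and then bail out.
	 */
	if (rg->from >= t)	/* was: rg->from > t */
		break;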

Link: https://lkml.kernel.org/r/20210217233547.93892-3-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: David Airlie <airlied@linux.ie>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Gal Pressman <galpress@amazon.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kirill Shutemov <kirill@shutemov.name>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Roland Scheidegger <sroland@vmware.com>
Cc: VMware Graphics <linux-graphics-maintainer@vmware.com>
Cc: Wei Zhang <wzam@amazon.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/hugetlb.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/hugetlb.c~hugetlg-break-earlier-in-add_reservation_in_range-when-we-can
+++ a/mm/hugetlb.c
@@ -387,7 +387,7 @@ static long add_reservation_in_range(str
 		/* When we find a region that starts beyond our range, we've
 		 * finished.
 		 */
-		if (rg->from > t)
+		if (rg->from >= t)
 			break;
 
 		/* Add an entry for last_accounted_offset -> rg->from, and
_


* [patch 08/29] mm: introduce page_needs_cow_for_dma() for deciding whether cow
  2021-03-13  5:06 incoming Andrew Morton
                   ` (6 preceding siblings ...)
  2021-03-13  5:07 ` [patch 07/29] hugetlb: break earlier in add_reservation_in_range() when we can Andrew Morton
@ 2021-03-13  5:07 ` Andrew Morton
  2021-03-13  5:07 ` [patch 09/29] mm: use is_cow_mapping() across tree where proper Andrew Morton
                   ` (20 subsequent siblings)
  28 siblings, 0 replies; 32+ messages in thread
From: Andrew Morton @ 2021-03-13  5:07 UTC (permalink / raw)
  To: aarcange, adobriyan, airlied, akpm, daniel, david, galpress, hch,
	jack, jannh, jgg, kirill, ktkhai, linmiaohe,
	linux-graphics-maintainer, linux-mm, mike.kravetz, mm-commits,
	peterx, rppt, sroland, torvalds, willy, wzam

From: Peter Xu <peterx@redhat.com>
Subject: mm: introduce page_needs_cow_for_dma() for deciding whether cow

We've got quite a few places (pte, pmd, pud) that explicitly check
whether we should break the cow right now during fork().  It's easier to
provide a helper, especially before we do the same thing for hugetlbfs.

Since we'll reference is_cow_mapping() in mm.h, move it there too.  It
actually suits mm.h better anyway, since internal.h is mm/ only while
mm.h is exported to the whole kernel.  With that, expect another patch to
use is_cow_mapping() wherever we can across the kernel, since we use it
quite a lot but it is always open-coded against the VM_* flags.

Link: https://lkml.kernel.org/r/20210217233547.93892-4-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: David Airlie <airlied@linux.ie>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Gal Pressman <galpress@amazon.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Kirill Shutemov <kirill@shutemov.name>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Roland Scheidegger <sroland@vmware.com>
Cc: VMware Graphics <linux-graphics-maintainer@vmware.com>
Cc: Wei Zhang <wzam@amazon.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/mm.h |   21 +++++++++++++++++++++
 mm/huge_memory.c   |    8 ++------
 mm/internal.h      |    5 -----
 mm/memory.c        |    8 +-------
 4 files changed, 24 insertions(+), 18 deletions(-)

--- a/include/linux/mm.h~mm-introduce-page_needs_cow_for_dma-for-deciding-whether-cow
+++ a/include/linux/mm.h
@@ -1300,6 +1300,27 @@ static inline bool page_maybe_dma_pinned
 		GUP_PIN_COUNTING_BIAS;
 }
 
+static inline bool is_cow_mapping(vm_flags_t flags)
+{
+	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
+}
+
+/*
+ * This should most likely only be called during fork() to see whether we
+ * should break the cow immediately for a page on the src mm.
+ */
+static inline bool page_needs_cow_for_dma(struct vm_area_struct *vma,
+					  struct page *page)
+{
+	if (!is_cow_mapping(vma->vm_flags))
+		return false;
+
+	if (!atomic_read(&vma->vm_mm->has_pinned))
+		return false;
+
+	return page_maybe_dma_pinned(page);
+}
+
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
 #define SECTION_IN_PAGE_FLAGS
 #endif
--- a/mm/huge_memory.c~mm-introduce-page_needs_cow_for_dma-for-deciding-whether-cow
+++ a/mm/huge_memory.c
@@ -1100,9 +1100,7 @@ int copy_huge_pmd(struct mm_struct *dst_
 	 * best effort that the pinned pages won't be replaced by another
 	 * random page during the coming copy-on-write.
 	 */
-	if (unlikely(is_cow_mapping(vma->vm_flags) &&
-		     atomic_read(&src_mm->has_pinned) &&
-		     page_maybe_dma_pinned(src_page))) {
+	if (unlikely(page_needs_cow_for_dma(vma, src_page))) {
 		pte_free(dst_mm, pgtable);
 		spin_unlock(src_ptl);
 		spin_unlock(dst_ptl);
@@ -1214,9 +1212,7 @@ int copy_huge_pud(struct mm_struct *dst_
 	}
 
 	/* Please refer to comments in copy_huge_pmd() */
-	if (unlikely(is_cow_mapping(vma->vm_flags) &&
-		     atomic_read(&src_mm->has_pinned) &&
-		     page_maybe_dma_pinned(pud_page(pud)))) {
+	if (unlikely(page_needs_cow_for_dma(vma, pud_page(pud)))) {
 		spin_unlock(src_ptl);
 		spin_unlock(dst_ptl);
 		__split_huge_pud(vma, src_pud, addr);
--- a/mm/internal.h~mm-introduce-page_needs_cow_for_dma-for-deciding-whether-cow
+++ a/mm/internal.h
@@ -296,11 +296,6 @@ static inline unsigned int buddy_order(s
  */
 #define buddy_order_unsafe(page)	READ_ONCE(page_private(page))
 
-static inline bool is_cow_mapping(vm_flags_t flags)
-{
-	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
-}
-
 /*
  * These three helpers classifies VMAs for virtual memory accounting.
  */
--- a/mm/memory.c~mm-introduce-page_needs_cow_for_dma-for-deciding-whether-cow
+++ a/mm/memory.c
@@ -809,12 +809,8 @@ copy_present_page(struct vm_area_struct
 		  pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
 		  struct page **prealloc, pte_t pte, struct page *page)
 {
-	struct mm_struct *src_mm = src_vma->vm_mm;
 	struct page *new_page;
 
-	if (!is_cow_mapping(src_vma->vm_flags))
-		return 1;
-
 	/*
 	 * What we want to do is to check whether this page may
 	 * have been pinned by the parent process.  If so,
@@ -828,9 +824,7 @@ copy_present_page(struct vm_area_struct
 	 * the page count. That might give false positives for
 	 * for pinning, but it will work correctly.
 	 */
-	if (likely(!atomic_read(&src_mm->has_pinned)))
-		return 1;
-	if (likely(!page_maybe_dma_pinned(page)))
+	if (likely(!page_needs_cow_for_dma(src_vma, page)))
 		return 1;
 
 	new_page = *prealloc;
_


* [patch 09/29] mm: use is_cow_mapping() across tree where proper
  2021-03-13  5:06 incoming Andrew Morton
                   ` (7 preceding siblings ...)
  2021-03-13  5:07 ` [patch 08/29] mm: introduce page_needs_cow_for_dma() for deciding whether cow Andrew Morton
@ 2021-03-13  5:07 ` Andrew Morton
  2021-03-13  5:07 ` [patch 10/29] hugetlb: do early cow when page pinned on src mm Andrew Morton
                   ` (19 subsequent siblings)
  28 siblings, 0 replies; 32+ messages in thread
From: Andrew Morton @ 2021-03-13  5:07 UTC (permalink / raw)
  To: aarcange, adobriyan, airlied, akpm, daniel, david, galpress, hch,
	jack, jannh, jgg, kirill, ktkhai, linmiaohe,
	linux-graphics-maintainer, linux-mm, mike.kravetz, mm-commits,
	peterx, rppt, sroland, torvalds, willy, wzam

From: Peter Xu <peterx@redhat.com>
Subject: mm: use is_cow_mapping() across tree where proper

Now that is_cow_mapping() is exported in mm.h, replace the manual checks
elsewhere throughout the tree with the new helper.

Link: https://lkml.kernel.org/r/20210217233547.93892-5-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@ziepe.ca>
Cc: VMware Graphics <linux-graphics-maintainer@vmware.com>
Cc: Roland Scheidegger <sroland@vmware.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Gal Pressman <galpress@amazon.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Kirill Shutemov <kirill@shutemov.name>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Wei Zhang <wzam@amazon.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/gpu/drm/vmwgfx/vmwgfx_page_dirty.c |    4 +---
 drivers/gpu/drm/vmwgfx/vmwgfx_ttm_glue.c   |    2 +-
 fs/proc/task_mmu.c                         |    2 --
 mm/hugetlb.c                               |    4 +---
 4 files changed, 3 insertions(+), 9 deletions(-)

--- a/drivers/gpu/drm/vmwgfx/vmwgfx_page_dirty.c~mm-use-is_cow_mapping-across-tree-where-proper
+++ a/drivers/gpu/drm/vmwgfx/vmwgfx_page_dirty.c
@@ -500,8 +500,6 @@ vm_fault_t vmw_bo_vm_huge_fault(struct v
 	vm_fault_t ret;
 	pgoff_t fault_page_size;
 	bool write = vmf->flags & FAULT_FLAG_WRITE;
-	bool is_cow_mapping =
-		(vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
 
 	switch (pe_size) {
 	case PE_SIZE_PMD:
@@ -518,7 +516,7 @@ vm_fault_t vmw_bo_vm_huge_fault(struct v
 	}
 
 	/* Always do write dirty-tracking and COW on PTE level. */
-	if (write && (READ_ONCE(vbo->dirty) || is_cow_mapping))
+	if (write && (READ_ONCE(vbo->dirty) || is_cow_mapping(vma->vm_flags)))
 		return VM_FAULT_FALLBACK;
 
 	ret = ttm_bo_vm_reserve(bo, vmf);
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_glue.c~mm-use-is_cow_mapping-across-tree-where-proper
+++ a/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_glue.c
@@ -49,7 +49,7 @@ int vmw_mmap(struct file *filp, struct v
 	vma->vm_ops = &vmw_vm_ops;
 
 	/* Use VM_PFNMAP rather than VM_MIXEDMAP if not a COW mapping */
-	if ((vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) != VM_MAYWRITE)
+	if (!is_cow_mapping(vma->vm_flags))
 		vma->vm_flags = (vma->vm_flags & ~VM_MIXEDMAP) | VM_PFNMAP;
 
 	return 0;
--- a/fs/proc/task_mmu.c~mm-use-is_cow_mapping-across-tree-where-proper
+++ a/fs/proc/task_mmu.c
@@ -1036,8 +1036,6 @@ struct clear_refs_private {
 
 #ifdef CONFIG_MEM_SOFT_DIRTY
 
-#define is_cow_mapping(flags) (((flags) & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE)
-
 static inline bool pte_is_pinned(struct vm_area_struct *vma, unsigned long addr, pte_t pte)
 {
 	struct page *page;
--- a/mm/hugetlb.c~mm-use-is_cow_mapping-across-tree-where-proper
+++ a/mm/hugetlb.c
@@ -3734,15 +3734,13 @@ int copy_hugetlb_page_range(struct mm_st
 	pte_t *src_pte, *dst_pte, entry, dst_entry;
 	struct page *ptepage;
 	unsigned long addr;
-	int cow;
+	bool cow = is_cow_mapping(vma->vm_flags);
 	struct hstate *h = hstate_vma(vma);
 	unsigned long sz = huge_page_size(h);
 	struct address_space *mapping = vma->vm_file->f_mapping;
 	struct mmu_notifier_range range;
 	int ret = 0;
 
-	cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
-
 	if (cow) {
 		mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, src,
 					vma->vm_start,
_


* [patch 10/29] hugetlb: do early cow when page pinned on src mm
  2021-03-13  5:06 incoming Andrew Morton
                   ` (8 preceding siblings ...)
  2021-03-13  5:07 ` [patch 09/29] mm: use is_cow_mapping() across tree where proper Andrew Morton
@ 2021-03-13  5:07 ` Andrew Morton
  2021-03-13  5:07 ` [patch 11/29] mm/highmem.c: fix zero_user_segments() with start > end Andrew Morton
                   ` (18 subsequent siblings)
  28 siblings, 0 replies; 32+ messages in thread
From: Andrew Morton @ 2021-03-13  5:07 UTC (permalink / raw)
  To: aarcange, adobriyan, airlied, akpm, daniel, david, galpress, hch,
	jack, jannh, jgg, kirill, ktkhai, linmiaohe,
	linux-graphics-maintainer, linux-mm, mike.kravetz, mm-commits,
	peterx, rppt, sroland, torvalds, willy, wzam

From: Peter Xu <peterx@redhat.com>
Subject: hugetlb: do early cow when page pinned on src mm

This is the last missing piece of the COW-during-fork effort for the case
where pinned pages are found.  One can reference 70e806e4e645 ("mm: Do
early cow for pinned pages during fork() for ptes", 2020-09-27) for more
information; we do a similar thing here, but for hugetlb rather than ptes.

Note that after Jason's recent work in 57efa1fe5957 ("mm/gup: prevent
gup_fast from racing with COW during fork", 2020-12-15), which is safer
and easier to understand, the whole of copy_page_range() is now safe
against gup-fast, so we no longer need the wr-protect trick proposed in
70e806e4e645.
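
Condensed into a hedged sketch (error handling, locking and the pte
re-check after re-taking the locks are omitted; see the diff below for
the real flow), fork() now handles a pinned hugetlb source page by
copying it eagerly instead of sharing it write-protected:

	/* sketch only */
	if (unlikely(page_needs_cow_for_dma(vma, ptepage))) {
		struct page *new;

		/* drop the pgtable locks: allocation and copy may sleep */
		new = alloc_huge_page(vma, addr, 1);	/* don't use reserves */
		copy_user_huge_page(new, ptepage, addr, vma, npages);
		hugetlb_install_page(vma, dst_pte, addr, new);
	} else {
		if (cow)
			huge_ptep_set_wrprotect(src, addr, src_pte);
		page_dup_rmap(ptepage, true);	/* share the page as before */
		set_huge_pte_at(dst, addr, dst_pte, entry);
	}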

Link: https://lkml.kernel.org/r/20210217233547.93892-6-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: David Airlie <airlied@linux.ie>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Gal Pressman <galpress@amazon.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Kirill Shutemov <kirill@shutemov.name>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Roland Scheidegger <sroland@vmware.com>
Cc: VMware Graphics <linux-graphics-maintainer@vmware.com>
Cc: Wei Zhang <wzam@amazon.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/hugetlb.c |   66 ++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 62 insertions(+), 4 deletions(-)

--- a/mm/hugetlb.c~hugetlb-do-early-cow-when-page-pinned-on-src-mm
+++ a/mm/hugetlb.c
@@ -3728,6 +3728,18 @@ static bool is_hugetlb_entry_hwpoisoned(
 		return false;
 }
 
+static void
+hugetlb_install_page(struct vm_area_struct *vma, pte_t *ptep, unsigned long addr,
+		     struct page *new_page)
+{
+	__SetPageUptodate(new_page);
+	set_huge_pte_at(vma->vm_mm, addr, ptep, make_huge_pte(vma, new_page, 1));
+	hugepage_add_new_anon_rmap(new_page, vma, addr);
+	hugetlb_count_add(pages_per_huge_page(hstate_vma(vma)), vma->vm_mm);
+	ClearHPageRestoreReserve(new_page);
+	SetHPageMigratable(new_page);
+}
+
 int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			    struct vm_area_struct *vma)
 {
@@ -3737,6 +3749,7 @@ int copy_hugetlb_page_range(struct mm_st
 	bool cow = is_cow_mapping(vma->vm_flags);
 	struct hstate *h = hstate_vma(vma);
 	unsigned long sz = huge_page_size(h);
+	unsigned long npages = pages_per_huge_page(h);
 	struct address_space *mapping = vma->vm_file->f_mapping;
 	struct mmu_notifier_range range;
 	int ret = 0;
@@ -3785,6 +3798,7 @@ int copy_hugetlb_page_range(struct mm_st
 		spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
 		entry = huge_ptep_get(src_pte);
 		dst_entry = huge_ptep_get(dst_pte);
+again:
 		if (huge_pte_none(entry) || !huge_pte_none(dst_entry)) {
 			/*
 			 * Skip if src entry none.  Also, skip in the
@@ -3808,6 +3822,52 @@ int copy_hugetlb_page_range(struct mm_st
 			}
 			set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz);
 		} else {
+			entry = huge_ptep_get(src_pte);
+			ptepage = pte_page(entry);
+			get_page(ptepage);
+
+			/*
+			 * This is a rare case where we see pinned hugetlb
+			 * pages while they're prone to COW.  We need to do the
+			 * COW earlier during fork.
+			 *
+			 * When pre-allocating the page or copying data, we
+			 * need to be without the pgtable locks since we could
+			 * sleep during the process.
+			 */
+			if (unlikely(page_needs_cow_for_dma(vma, ptepage))) {
+				pte_t src_pte_old = entry;
+				struct page *new;
+
+				spin_unlock(src_ptl);
+				spin_unlock(dst_ptl);
+				/* Do not use reserve as it's private owned */
+				new = alloc_huge_page(vma, addr, 1);
+				if (IS_ERR(new)) {
+					put_page(ptepage);
+					ret = PTR_ERR(new);
+					break;
+				}
+				copy_user_huge_page(new, ptepage, addr, vma,
+						    npages);
+				put_page(ptepage);
+
+				/* Install the new huge page if src pte stable */
+				dst_ptl = huge_pte_lock(h, dst, dst_pte);
+				src_ptl = huge_pte_lockptr(h, src, src_pte);
+				spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
+				entry = huge_ptep_get(src_pte);
+				if (!pte_same(src_pte_old, entry)) {
+					put_page(new);
+					/* dst_entry won't change as in child */
+					goto again;
+				}
+				hugetlb_install_page(vma, dst_pte, addr, new);
+				spin_unlock(src_ptl);
+				spin_unlock(dst_ptl);
+				continue;
+			}
+
 			if (cow) {
 				/*
 				 * No need to notify as we are downgrading page
@@ -3818,12 +3878,10 @@ int copy_hugetlb_page_range(struct mm_st
 				 */
 				huge_ptep_set_wrprotect(src, addr, src_pte);
 			}
-			entry = huge_ptep_get(src_pte);
-			ptepage = pte_page(entry);
-			get_page(ptepage);
+
 			page_dup_rmap(ptepage, true);
 			set_huge_pte_at(dst, addr, dst_pte, entry);
-			hugetlb_count_add(pages_per_huge_page(h), dst);
+			hugetlb_count_add(npages, dst);
 		}
 		spin_unlock(src_ptl);
 		spin_unlock(dst_ptl);
_


* [patch 11/29] mm/highmem.c: fix zero_user_segments() with start > end
  2021-03-13  5:06 incoming Andrew Morton
                   ` (9 preceding siblings ...)
  2021-03-13  5:07 ` [patch 10/29] hugetlb: do early cow when page pinned on src mm Andrew Morton
@ 2021-03-13  5:07 ` Andrew Morton
  2021-03-13  5:07 ` [patch 12/29] binfmt_misc: fix possible deadlock in bm_register_write Andrew Morton
                   ` (17 subsequent siblings)
  28 siblings, 0 replies; 32+ messages in thread
From: Andrew Morton @ 2021-03-13  5:07 UTC (permalink / raw)
  To: akpm, hirofumi, linux-mm, mm-commits, stable, torvalds, willy

From: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Subject: mm/highmem.c: fix zero_user_segments() with start > end

zero_user_segments() is used from __block_write_begin_int(), for example
like the following:

	zero_user_segments(page, 4096, 1024, 512, 918)

But the new zero_user_segments() implementation for HIGHMEM +
TRANSPARENT_HUGEPAGE doesn't handle the "start > end" case correctly, and
hits BUG_ON().  (We could fix __block_write_begin_int() instead, but this
is old usage and there are multiple callers like it.)

Also, it calls kmap_atomic() unnecessarily when start == end == 0.

Link: https://lkml.kernel.org/r/87v9ab60r4.fsf@mail.parknet.co.jp
Fixes: 0060ef3b4e6d ("mm: support THPs in zero_user_segments")
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/highmem.c |   17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

--- a/mm/highmem.c~fix-zero_user_segments-with-start-end
+++ a/mm/highmem.c
@@ -368,20 +368,24 @@ void zero_user_segments(struct page *pag
 
 	BUG_ON(end1 > page_size(page) || end2 > page_size(page));
 
+	if (start1 >= end1)
+		start1 = end1 = 0;
+	if (start2 >= end2)
+		start2 = end2 = 0;
+
 	for (i = 0; i < compound_nr(page); i++) {
 		void *kaddr = NULL;
 
-		if (start1 < PAGE_SIZE || start2 < PAGE_SIZE)
-			kaddr = kmap_atomic(page + i);
-
 		if (start1 >= PAGE_SIZE) {
 			start1 -= PAGE_SIZE;
 			end1 -= PAGE_SIZE;
 		} else {
 			unsigned this_end = min_t(unsigned, end1, PAGE_SIZE);
 
-			if (end1 > start1)
+			if (end1 > start1) {
+				kaddr = kmap_atomic(page + i);
 				memset(kaddr + start1, 0, this_end - start1);
+			}
 			end1 -= this_end;
 			start1 = 0;
 		}
@@ -392,8 +396,11 @@ void zero_user_segments(struct page *pag
 		} else {
 			unsigned this_end = min_t(unsigned, end2, PAGE_SIZE);
 
-			if (end2 > start2)
+			if (end2 > start2) {
+				if (!kaddr)
+					kaddr = kmap_atomic(page + i);
 				memset(kaddr + start2, 0, this_end - start2);
+			}
 			end2 -= this_end;
 			start2 = 0;
 		}
_


* [patch 12/29] binfmt_misc: fix possible deadlock in bm_register_write
  2021-03-13  5:06 incoming Andrew Morton
                   ` (10 preceding siblings ...)
  2021-03-13  5:07 ` [patch 11/29] mm/highmem.c: fix zero_user_segments() with start > end Andrew Morton
@ 2021-03-13  5:07 ` Andrew Morton
  2021-03-13  5:07 ` [patch 13/29] MAINTAINERS: exclude uapi directories in API/ABI section Andrew Morton
                   ` (16 subsequent siblings)
  28 siblings, 0 replies; 32+ messages in thread
From: Andrew Morton @ 2021-03-13  5:07 UTC (permalink / raw)
  To: akpm, deller, linux-mm, liorribak, mm-commits, stable, torvalds, viro

From: Lior Ribak <liorribak@gmail.com>
Subject: binfmt_misc: fix possible deadlock in bm_register_write

There is a deadlock in bm_register_write:

First, at the beginning of the function, a lock is taken on the binfmt_misc
root inode with inode_lock(d_inode(root)).

Then, if the user used the MISC_FMT_OPEN_FILE flag, the function will call
open_exec on the user-provided interpreter.

open_exec will do a path lookup, and if the path lookup process includes
the root of binfmt_misc, it will try to take a shared lock on its inode
again; but the inode is already locked, and the code gets stuck in a
deadlock.
To reproduce the bug:
$ echo ":iiiii:E::ii::/proc/sys/fs/binfmt_misc/bla:F" > /proc/sys/fs/binfmt_misc/register

backtrace of where the lock occurs (#5):
0  schedule () at ./arch/x86/include/asm/current.h:15
1  0xffffffff81b51237 in rwsem_down_read_slowpath (sem=0xffff888003b202e0, count=<optimized out>, state=state@entry=2) at kernel/locking/rwsem.c:992
2  0xffffffff81b5150a in __down_read_common (state=2, sem=<optimized out>) at kernel/locking/rwsem.c:1213
3  __down_read (sem=<optimized out>) at kernel/locking/rwsem.c:1222
4  down_read (sem=<optimized out>) at kernel/locking/rwsem.c:1355
5  0xffffffff811ee22a in inode_lock_shared (inode=<optimized out>) at ./include/linux/fs.h:783
6  open_last_lookups (op=0xffffc9000022fe34, file=0xffff888004098600, nd=0xffffc9000022fd10) at fs/namei.c:3177
7  path_openat (nd=nd@entry=0xffffc9000022fd10, op=op@entry=0xffffc9000022fe34, flags=flags@entry=65) at fs/namei.c:3366
8  0xffffffff811efe1c in do_filp_open (dfd=<optimized out>, pathname=pathname@entry=0xffff8880031b9000, op=op@entry=0xffffc9000022fe34) at fs/namei.c:3396
9  0xffffffff811e493f in do_open_execat (fd=fd@entry=-100, name=name@entry=0xffff8880031b9000, flags=<optimized out>, flags@entry=0) at fs/exec.c:913
10 0xffffffff811e4a92 in open_exec (name=<optimized out>) at fs/exec.c:948
11 0xffffffff8124aa84 in bm_register_write (file=<optimized out>, buffer=<optimized out>, count=19, ppos=<optimized out>) at fs/binfmt_misc.c:682
12 0xffffffff811decd2 in vfs_write (file=file@entry=0xffff888004098500, buf=buf@entry=0xa758d0 ":iiiii:E::ii::i:CF
", count=count@entry=19, pos=pos@entry=0xffffc9000022ff10) at fs/read_write.c:603
13 0xffffffff811defda in ksys_write (fd=<optimized out>, buf=0xa758d0 ":iiiii:E::ii::i:CF
", count=19) at fs/read_write.c:658
14 0xffffffff81b49813 in do_syscall_64 (nr=<optimized out>, regs=0xffffc9000022ff58) at arch/x86/entry/common.c:46
15 0xffffffff81c0007c in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:120

To solve the issue, the open_exec() call is moved to before
bm_register_write() takes the exclusive lock on the root inode.

Link: https://lkml.kernel.org/r/20210228224414.95962-1-liorribak@gmail.com
Fixes: 948b701a607f1 ("binfmt_misc: add persistent opened binary handler for containers")
Signed-off-by: Lior Ribak <liorribak@gmail.com>
Acked-by: Helge Deller <deller@gmx.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/binfmt_misc.c |   29 ++++++++++++++---------------
 1 file changed, 14 insertions(+), 15 deletions(-)

--- a/fs/binfmt_misc.c~binfmt_misc-fix-possible-deadlock-in-bm_register_write
+++ a/fs/binfmt_misc.c
@@ -649,12 +649,24 @@ static ssize_t bm_register_write(struct
 	struct super_block *sb = file_inode(file)->i_sb;
 	struct dentry *root = sb->s_root, *dentry;
 	int err = 0;
+	struct file *f = NULL;
 
 	e = create_entry(buffer, count);
 
 	if (IS_ERR(e))
 		return PTR_ERR(e);
 
+	if (e->flags & MISC_FMT_OPEN_FILE) {
+		f = open_exec(e->interpreter);
+		if (IS_ERR(f)) {
+			pr_notice("register: failed to install interpreter file %s\n",
+				 e->interpreter);
+			kfree(e);
+			return PTR_ERR(f);
+		}
+		e->interp_file = f;
+	}
+
 	inode_lock(d_inode(root));
 	dentry = lookup_one_len(e->name, root, strlen(e->name));
 	err = PTR_ERR(dentry);
@@ -678,21 +690,6 @@ static ssize_t bm_register_write(struct
 		goto out2;
 	}
 
-	if (e->flags & MISC_FMT_OPEN_FILE) {
-		struct file *f;
-
-		f = open_exec(e->interpreter);
-		if (IS_ERR(f)) {
-			err = PTR_ERR(f);
-			pr_notice("register: failed to install interpreter file %s\n", e->interpreter);
-			simple_release_fs(&bm_mnt, &entry_count);
-			iput(inode);
-			inode = NULL;
-			goto out2;
-		}
-		e->interp_file = f;
-	}
-
 	e->dentry = dget(dentry);
 	inode->i_private = e;
 	inode->i_fop = &bm_entry_operations;
@@ -709,6 +706,8 @@ out:
 	inode_unlock(d_inode(root));
 
 	if (err) {
+		if (f)
+			filp_close(f, NULL);
 		kfree(e);
 		return err;
 	}
_

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [patch 13/29] MAINTAINERS: exclude uapi directories in API/ABI section
  2021-03-13  5:06 incoming Andrew Morton
                   ` (11 preceding siblings ...)
  2021-03-13  5:07 ` [patch 12/29] binfmt_misc: fix possible deadlock in bm_register_write Andrew Morton
@ 2021-03-13  5:07 ` Andrew Morton
  2021-03-13  5:07 ` [patch 14/29] linux/compiler-clang.h: define HAVE_BUILTIN_BSWAP* Andrew Morton
                   ` (15 subsequent siblings)
  28 siblings, 0 replies; 32+ messages in thread
From: Andrew Morton @ 2021-03-13  5:07 UTC (permalink / raw)
  To: akpm, linux-mm, mm-commits, mtk.manpages, torvalds, vbabka

From: Vlastimil Babka <vbabka@suse.cz>
Subject: MAINTAINERS: exclude uapi directories in API/ABI section

Commit 7b4693e644cb ("MAINTAINERS: add uapi directories to API/ABI
section") added include/uapi/ and arch/*/include/uapi/ so that patches
modifying them would CC linux-api.  However, that was already done in the
past and resulted in too much noise, and was thus later removed, as
explained in b14fd334ff3d ("MAINTAINERS: trim the file triggers for ABI/API").

To prevent another round of addition and removal in the future, change the
entries to X: (explicit exclusion) for documentation purposes, although
they are not subdirectories of broader included directories, as there is
apparently no defined way to add plain comments in subsystem sections.

Link: https://lkml.kernel.org/r/20210301100255.25229-1-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Michael Kerrisk (man-pages) <mtk.manpages@gmail.com>
Acked-by: Michael Kerrisk (man-pages) <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 MAINTAINERS |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/MAINTAINERS~maintainers-exclude-uapi-directories-in-api-abi-section
+++ a/MAINTAINERS
@@ -261,8 +261,8 @@ ABI/API
 L:	linux-api@vger.kernel.org
 F:	include/linux/syscalls.h
 F:	kernel/sys_ni.c
-F:	include/uapi/
-F:	arch/*/include/uapi/
+X:	include/uapi/
+X:	arch/*/include/uapi/
 
 ABIT UGURU 1,2 HARDWARE MONITOR DRIVER
 M:	Hans de Goede <hdegoede@redhat.com>
_

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [patch 14/29] linux/compiler-clang.h: define HAVE_BUILTIN_BSWAP*
  2021-03-13  5:06 incoming Andrew Morton
                   ` (12 preceding siblings ...)
  2021-03-13  5:07 ` [patch 13/29] MAINTAINERS: exclude uapi directories in API/ABI section Andrew Morton
@ 2021-03-13  5:07 ` Andrew Morton
  2021-03-13  5:07 ` [patch 15/29] kfence: fix printk format for ptrdiff_t Andrew Morton
                   ` (14 subsequent siblings)
  28 siblings, 0 replies; 32+ messages in thread
From: Andrew Morton @ 2021-03-13  5:07 UTC (permalink / raw)
  To: akpm, aou, arnd, deanbo422, elver, green.hu, guoren, keescook,
	linux-mm, luc.vanoostenryck, masahiroy, mm-commits, nathan,
	ndesaulniers, nickhu, nivedita, ojeda, palmer, paul.walmsley,
	rdunlap, samitolvanen, stable, torvalds

From: Arnd Bergmann <arnd@arndb.de>
Subject: linux/compiler-clang.h: define HAVE_BUILTIN_BSWAP*

Separating compiler-clang.h from compiler-gcc.h inadvertently dropped the
definitions of the three HAVE_BUILTIN_BSWAP macros, forcing a fallback to
the open-coded version and hoping that the compiler detects it.

Since all versions of clang support the __builtin_bswap interfaces, add
back the flags and have the headers pick these up automatically.

This results in a 4% improvement of compilation speed for arm defconfig.
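
As a hedged userspace illustration (not the kernel's <linux/swab.h>; the
macro spelling is borrowed for the sketch), this is what defining
__HAVE_BUILTIN_BSWAP32__ buys: the builtin is used when the macro is
defined (compile with -D__HAVE_BUILTIN_BSWAP32__ to see it), otherwise an
open-coded fallback is used and the compiler has to recognize the pattern.

#include <stdint.h>
#include <stdio.h>

#ifdef __HAVE_BUILTIN_BSWAP32__
#define my_swab32(x) __builtin_bswap32(x)
#else
static inline uint32_t my_swab32(uint32_t x)
{
	return ((x & 0x000000ffU) << 24) |
	       ((x & 0x0000ff00U) <<  8) |
	       ((x & 0x00ff0000U) >>  8) |
	       ((x & 0xff000000U) >> 24);
}
#endif

int main(void)
{
	/* prints 78563412 either way; only the generated code differs */
	printf("%08x\n", (unsigned)my_swab32(0x12345678u));
	return 0;
}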

Note: it might also be worth revisiting which architectures set
CONFIG_ARCH_USE_BUILTIN_BSWAP for one compiler or the other, today this is
set on six architectures (arm32, csky, mips, powerpc, s390, x86), while
another ten architectures define custom helpers (alpha, arc, ia64, m68k,
mips, nios2, parisc, sh, sparc, xtensa), and the rest (arm64, h8300,
hexagon, microblaze, nds32, openrisc, riscv) just get the unoptimized
version and rely on the compiler to detect it.

A long time ago, the compiler builtins were architecture specific, but
nowadays, all compilers that are able to build the kernel have correct
implementations of them, though some may not be as optimized as the inline
asm versions.

The patch that dropped the optimization landed in v4.19, so as discussed
it would be fairly safe to backport this revert to the 4.19/5.4/5.10
stable kernels.  There is a remaining risk of regressions, but the change
has no known side-effects besides compile speed.

Link: https://lkml.kernel.org/r/20210226161151.2629097-1-arnd@kernel.org
Link: https://lore.kernel.org/lkml/20210225164513.3667778-1-arnd@kernel.org/
Fixes: 815f0ddb346c ("include/linux/compiler*.h: make compiler-*.h mutually exclusive")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Miguel Ojeda <ojeda@kernel.org>
Acked-by: Nick Desaulniers <ndesaulniers@google.com>
Acked-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Guo Ren <guoren@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Arvind Sankar <nivedita@alum.mit.edu>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/compiler-clang.h |    6 ++++++
 1 file changed, 6 insertions(+)

--- a/include/linux/compiler-clang.h~linux-compiler-clangh-define-have_builtin_bswap
+++ a/include/linux/compiler-clang.h
@@ -31,6 +31,12 @@
 #define __no_sanitize_thread
 #endif
 
+#if defined(CONFIG_ARCH_USE_BUILTIN_BSWAP)
+#define __HAVE_BUILTIN_BSWAP32__
+#define __HAVE_BUILTIN_BSWAP64__
+#define __HAVE_BUILTIN_BSWAP16__
+#endif /* CONFIG_ARCH_USE_BUILTIN_BSWAP */
+
 #if __has_feature(undefined_behavior_sanitizer)
 /* GCC does not have __SANITIZE_UNDEFINED__ */
 #define __no_sanitize_undefined \
_

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [patch 15/29] kfence: fix printk format for ptrdiff_t
  2021-03-13  5:06 incoming Andrew Morton
                   ` (13 preceding siblings ...)
  2021-03-13  5:07 ` [patch 14/29] linux/compiler-clang.h: define HAVE_BUILTIN_BSWAP* Andrew Morton
@ 2021-03-13  5:07 ` Andrew Morton
  2021-03-13  5:07 ` [patch 16/29] kfence, slab: fix cache_alloc_debugcheck_after() for bulk allocations Andrew Morton
                   ` (13 subsequent siblings)
  28 siblings, 0 replies; 32+ messages in thread
From: Andrew Morton @ 2021-03-13  5:07 UTC (permalink / raw)
  To: akpm, andreyknvl, christophe.leroy, dvyukov, elver, glider,
	jannh, linux-mm, mm-commits, torvalds

From: Marco Elver <elver@google.com>
Subject: kfence: fix printk format for ptrdiff_t

Use %td for ptrdiff_t.
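
As a minimal userspace illustration (not the kfence code): the difference
of two pointers has type ptrdiff_t, and %td is the printf conversion
defined for that type.

#include <stdio.h>
#include <stddef.h>

int main(void)
{
	int objects[8];
	int *obj = &objects[5];
	ptrdiff_t index = obj - objects;	/* pointer difference is ptrdiff_t */

	printf("kfence-#%td\n", index);		/* %td matches ptrdiff_t */
	return 0;
}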

Link: https://lkml.kernel.org/r/3abbe4c9-16ad-c168-a90f-087978ccd8f7@csgroup.eu
Link: https://lkml.kernel.org/r/20210303121157.3430807-1-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Reported-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/kfence/report.c |   12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

--- a/mm/kfence/report.c~kfence-fix-printk-format-for-ptrdiff_t
+++ a/mm/kfence/report.c
@@ -116,12 +116,12 @@ void kfence_print_object(struct seq_file
 	lockdep_assert_held(&meta->lock);
 
 	if (meta->state == KFENCE_OBJECT_UNUSED) {
-		seq_con_printf(seq, "kfence-#%zd unused\n", meta - kfence_metadata);
+		seq_con_printf(seq, "kfence-#%td unused\n", meta - kfence_metadata);
 		return;
 	}
 
 	seq_con_printf(seq,
-		       "kfence-#%zd [0x%p-0x%p"
+		       "kfence-#%td [0x%p-0x%p"
 		       ", size=%d, cache=%s] allocated by task %d:\n",
 		       meta - kfence_metadata, (void *)start, (void *)(start + size - 1), size,
 		       (cache && cache->name) ? cache->name : "<destroyed>", meta->alloc_track.pid);
@@ -204,7 +204,7 @@ void kfence_report_error(unsigned long a
 
 		pr_err("BUG: KFENCE: out-of-bounds %s in %pS\n\n", get_access_type(is_write),
 		       (void *)stack_entries[skipnr]);
-		pr_err("Out-of-bounds %s at 0x%p (%luB %s of kfence-#%zd):\n",
+		pr_err("Out-of-bounds %s at 0x%p (%luB %s of kfence-#%td):\n",
 		       get_access_type(is_write), (void *)address,
 		       left_of_object ? meta->addr - address : address - meta->addr,
 		       left_of_object ? "left" : "right", object_index);
@@ -213,14 +213,14 @@ void kfence_report_error(unsigned long a
 	case KFENCE_ERROR_UAF:
 		pr_err("BUG: KFENCE: use-after-free %s in %pS\n\n", get_access_type(is_write),
 		       (void *)stack_entries[skipnr]);
-		pr_err("Use-after-free %s at 0x%p (in kfence-#%zd):\n",
+		pr_err("Use-after-free %s at 0x%p (in kfence-#%td):\n",
 		       get_access_type(is_write), (void *)address, object_index);
 		break;
 	case KFENCE_ERROR_CORRUPTION:
 		pr_err("BUG: KFENCE: memory corruption in %pS\n\n", (void *)stack_entries[skipnr]);
 		pr_err("Corrupted memory at 0x%p ", (void *)address);
 		print_diff_canary(address, 16, meta);
-		pr_cont(" (in kfence-#%zd):\n", object_index);
+		pr_cont(" (in kfence-#%td):\n", object_index);
 		break;
 	case KFENCE_ERROR_INVALID:
 		pr_err("BUG: KFENCE: invalid %s in %pS\n\n", get_access_type(is_write),
@@ -230,7 +230,7 @@ void kfence_report_error(unsigned long a
 		break;
 	case KFENCE_ERROR_INVALID_FREE:
 		pr_err("BUG: KFENCE: invalid free in %pS\n\n", (void *)stack_entries[skipnr]);
-		pr_err("Invalid free of 0x%p (in kfence-#%zd):\n", (void *)address,
+		pr_err("Invalid free of 0x%p (in kfence-#%td):\n", (void *)address,
 		       object_index);
 		break;
 	}
_

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [patch 16/29] kfence, slab: fix cache_alloc_debugcheck_after() for bulk allocations
  2021-03-13  5:06 incoming Andrew Morton
                   ` (14 preceding siblings ...)
  2021-03-13  5:07 ` [patch 15/29] kfence: fix printk format for ptrdiff_t Andrew Morton
@ 2021-03-13  5:07 ` Andrew Morton
  2021-03-13  5:08 ` [patch 17/29] kfence: fix reports if constant function prefixes exist Andrew Morton
                   ` (12 subsequent siblings)
  28 siblings, 0 replies; 32+ messages in thread
From: Andrew Morton @ 2021-03-13  5:07 UTC (permalink / raw)
  To: akpm, andreyknvl, dvyukov, elver, glider, jannh, linux-mm,
	mm-commits, torvalds

From: Marco Elver <elver@google.com>
Subject: kfence, slab: fix cache_alloc_debugcheck_after() for bulk allocations

cache_alloc_debugcheck_after() performs checks on an object, including
adjusting the returned pointer.  None of this should apply to KFENCE
objects.  While the checks are skipped for non-bulk allocations when we
allocate via KFENCE, for bulk allocations cache_alloc_debugcheck_after()
is still called via cache_alloc_debugcheck_after_bulk().

Fix it by skipping cache_alloc_debugcheck_after() for KFENCE objects.

Link: https://lkml.kernel.org/r/20210304205256.2162309-1-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slab.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/slab.c~kfence-slab-fix-cache_alloc_debugcheck_after-for-bulk-allocations
+++ a/mm/slab.c
@@ -2992,7 +2992,7 @@ static void *cache_alloc_debugcheck_afte
 				gfp_t flags, void *objp, unsigned long caller)
 {
 	WARN_ON_ONCE(cachep->ctor && (flags & __GFP_ZERO));
-	if (!objp)
+	if (!objp || is_kfence_address(objp))
 		return objp;
 	if (cachep->flags & SLAB_POISON) {
 		check_poison_obj(cachep, objp);
_

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [patch 17/29] kfence: fix reports if constant function prefixes exist
  2021-03-13  5:06 incoming Andrew Morton
                   ` (15 preceding siblings ...)
  2021-03-13  5:07 ` [patch 16/29] kfence, slab: fix cache_alloc_debugcheck_after() for bulk allocations Andrew Morton
@ 2021-03-13  5:08 ` Andrew Morton
  2021-03-13  5:08 ` [patch 18/29] include/linux/sched/mm.h: use rcu_dereference in in_vfork() Andrew Morton
                   ` (11 subsequent siblings)
  28 siblings, 0 replies; 32+ messages in thread
From: Andrew Morton @ 2021-03-13  5:08 UTC (permalink / raw)
  To: akpm, andreyknvl, christophe.leroy, dvyukov, elver, glider,
	jannh, linux-mm, mm-commits, torvalds

From: Marco Elver <elver@google.com>
Subject: kfence: fix reports if constant function prefixes exist

Some architectures prefix all functions with a constant string ('.' on
ppc64).  Add ARCH_FUNC_PREFIX, which may optionally be defined in
<asm/kfence.h>, so that get_stack_skipnr() can work properly.
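
A hedged userspace sketch of the mechanism (not the kernel code; the "."
prefix is a ppc64-style example and would really come from the
architecture's <asm/kfence.h>): ARCH_FUNC_PREFIX "kfence_" relies on
compile-time string literal concatenation, so one definition covers every
prefix check.

#include <stdio.h>
#include <string.h>

#define ARCH_FUNC_PREFIX "."	/* illustrative only */

static int has_prefix(const char *sym, const char *prefix)
{
	return strncmp(sym, prefix, strlen(prefix)) == 0;
}

int main(void)
{
	/* ".kfence_" via literal concatenation, matching ".kfence_alloc" */
	printf("%d\n", has_prefix(".kfence_alloc", ARCH_FUNC_PREFIX "kfence_"));
	return 0;
}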

Link: https://lkml.kernel.org/r/f036c53d-7e81-763c-47f4-6024c6c5f058@csgroup.eu
Link: https://lkml.kernel.org/r/20210304144000.1148590-1-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Reported-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/kfence/report.c |   18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

--- a/mm/kfence/report.c~kfence-fix-reports-if-constant-function-prefixes-exist
+++ a/mm/kfence/report.c
@@ -20,6 +20,11 @@
 
 #include "kfence.h"
 
+/* May be overridden by <asm/kfence.h>. */
+#ifndef ARCH_FUNC_PREFIX
+#define ARCH_FUNC_PREFIX ""
+#endif
+
 extern bool no_hash_pointers;
 
 /* Helper function to either print to a seq_file or to console. */
@@ -67,8 +72,9 @@ static int get_stack_skipnr(const unsign
 	for (skipnr = 0; skipnr < num_entries; skipnr++) {
 		int len = scnprintf(buf, sizeof(buf), "%ps", (void *)stack_entries[skipnr]);
 
-		if (str_has_prefix(buf, "kfence_") || str_has_prefix(buf, "__kfence_") ||
-		    !strncmp(buf, "__slab_free", len)) {
+		if (str_has_prefix(buf, ARCH_FUNC_PREFIX "kfence_") ||
+		    str_has_prefix(buf, ARCH_FUNC_PREFIX "__kfence_") ||
+		    !strncmp(buf, ARCH_FUNC_PREFIX "__slab_free", len)) {
 			/*
 			 * In case of tail calls from any of the below
 			 * to any of the above.
@@ -77,10 +83,10 @@ static int get_stack_skipnr(const unsign
 		}
 
 		/* Also the *_bulk() variants by only checking prefixes. */
-		if (str_has_prefix(buf, "kfree") ||
-		    str_has_prefix(buf, "kmem_cache_free") ||
-		    str_has_prefix(buf, "__kmalloc") ||
-		    str_has_prefix(buf, "kmem_cache_alloc"))
+		if (str_has_prefix(buf, ARCH_FUNC_PREFIX "kfree") ||
+		    str_has_prefix(buf, ARCH_FUNC_PREFIX "kmem_cache_free") ||
+		    str_has_prefix(buf, ARCH_FUNC_PREFIX "__kmalloc") ||
+		    str_has_prefix(buf, ARCH_FUNC_PREFIX "kmem_cache_alloc"))
 			goto found;
 	}
 	if (fallback < num_entries)
_

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [patch 18/29] include/linux/sched/mm.h: use rcu_dereference in in_vfork()
  2021-03-13  5:06 incoming Andrew Morton
                   ` (16 preceding siblings ...)
  2021-03-13  5:08 ` [patch 17/29] kfence: fix reports if constant function prefixes exist Andrew Morton
@ 2021-03-13  5:08 ` Andrew Morton
  2021-03-13  5:08 ` [patch 19/29] mm/madvise: replace ptrace attach requirement for process_madvise Andrew Morton
                   ` (10 subsequent siblings)
  28 siblings, 0 replies; 32+ messages in thread
From: Andrew Morton @ 2021-03-13  5:08 UTC (permalink / raw)
  To: akpm, linmiaohe, linux-mm, mhocko, mm-commits, torvalds, willy

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Subject: include/linux/sched/mm.h: use rcu_dereference in in_vfork()

Fix a sparse warning by using rcu_dereference().  Technically this is a
bug and a sufficiently aggressive compiler could reload the `real_parent'
pointer outside the protection of the rcu lock (and access freed memory),
but I think it's pretty unlikely to happen.

Link: https://lkml.kernel.org/r/20210221194207.1351703-1-willy@infradead.org
Fixes: b18dc5f291c0 ("mm, oom: skip vforked tasks from being selected")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/sched/mm.h |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/include/linux/sched/mm.h~mm-use-rcu_dereference-in-in_vfork
+++ a/include/linux/sched/mm.h
@@ -140,7 +140,8 @@ static inline bool in_vfork(struct task_
 	 * another oom-unkillable task does this it should blame itself.
 	 */
 	rcu_read_lock();
-	ret = tsk->vfork_done && tsk->real_parent->mm == tsk->mm;
+	ret = tsk->vfork_done &&
+			rcu_dereference(tsk->real_parent)->mm == tsk->mm;
 	rcu_read_unlock();
 
 	return ret;
_

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [patch 19/29] mm/madvise: replace ptrace attach requirement for process_madvise
  2021-03-13  5:06 incoming Andrew Morton
                   ` (17 preceding siblings ...)
  2021-03-13  5:08 ` [patch 18/29] include/linux/sched/mm.h: use rcu_dereference in in_vfork() Andrew Morton
@ 2021-03-13  5:08 ` Andrew Morton
  2021-03-13  5:08 ` [patch 20/29] kasan, mm: fix crash with HW_TAGS and DEBUG_PAGEALLOC Andrew Morton
                   ` (9 subsequent siblings)
  28 siblings, 0 replies; 32+ messages in thread
From: Andrew Morton @ 2021-03-13  5:08 UTC (permalink / raw)
  To: akpm, fweimer, jannh, jeffv, jmorris, keescook, linux-mm, mhocko,
	minchan, mm-commits, oleg, rientjes, shakeelb, stable, surenb,
	timmurray, torvalds

From: Suren Baghdasaryan <surenb@google.com>
Subject: mm/madvise: replace ptrace attach requirement for process_madvise

process_madvise currently requires ptrace attach capability. 
PTRACE_MODE_ATTACH gives one process complete control over another
process.  It effectively removes the security boundary between the two
processes (in one direction).  Granting ptrace attach capability even to a
system process is considered dangerous since it creates an attack surface.
This severely limits the usage of this API.

The operations process_madvise can perform do not affect the correctness
of the operation of the target process; they only affect where the data is
physically located (and therefore, how fast it can be accessed).  What we
want is the ability for one process to influence another process in order
to optimize performance across the entire system while leaving the
security boundary intact.

Replace PTRACE_MODE_ATTACH with a combination of PTRACE_MODE_READ and
CAP_SYS_NICE.  PTRACE_MODE_READ to prevent leaking ASLR metadata and
CAP_SYS_NICE for influencing process performance.
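
For context, a hedged userspace sketch of a caller of this interface
(illustrative only: the target pid and address range are made up, and
SYS_pidfd_open/SYS_process_madvise require reasonably recent kernel
headers).  With this change the caller needs PTRACE_MODE_READ-level access
to the target plus CAP_SYS_NICE rather than full ptrace attach.

#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/uio.h>
#include <sys/mman.h>
#include <unistd.h>
#include <stdio.h>

#ifndef MADV_COLD
#define MADV_COLD 20	/* value from <linux/mman.h>, for older userspace */
#endif

int main(void)
{
	pid_t target = 1234;	/* hypothetical target pid */
	struct iovec iov = {
		.iov_base = (void *)0x7f0000000000,	/* hypothetical range */
		.iov_len  = 1 << 20,
	};

	int pidfd = syscall(SYS_pidfd_open, target, 0);
	if (pidfd < 0) {
		perror("pidfd_open");
		return 1;
	}

	long ret = syscall(SYS_process_madvise, pidfd, &iov, 1, MADV_COLD, 0);
	if (ret < 0)
		perror("process_madvise");
	else
		printf("advised %ld bytes\n", ret);

	close(pidfd);
	return 0;
}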

Link: https://lkml.kernel.org/r/20210303185807.2160264-1-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: <stable@vger.kernel.org>	[5.10+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/madvise.c |   13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

--- a/mm/madvise.c~mm-madvise-replace-ptrace-attach-requirement-for-process_madvise
+++ a/mm/madvise.c
@@ -1198,12 +1198,22 @@ SYSCALL_DEFINE5(process_madvise, int, pi
 		goto release_task;
 	}
 
-	mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS);
+	/* Require PTRACE_MODE_READ to avoid leaking ASLR metadata. */
+	mm = mm_access(task, PTRACE_MODE_READ_FSCREDS);
 	if (IS_ERR_OR_NULL(mm)) {
 		ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
 		goto release_task;
 	}
 
+	/*
+	 * Require CAP_SYS_NICE for influencing process performance. Note that
+	 * only non-destructive hints are currently supported.
+	 */
+	if (!capable(CAP_SYS_NICE)) {
+		ret = -EPERM;
+		goto release_mm;
+	}
+
 	total_len = iov_iter_count(&iter);
 
 	while (iov_iter_count(&iter)) {
@@ -1218,6 +1228,7 @@ SYSCALL_DEFINE5(process_madvise, int, pi
 	if (ret == 0)
 		ret = total_len - iov_iter_count(&iter);
 
+release_mm:
 	mmput(mm);
 release_task:
 	put_task_struct(task);
_

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [patch 20/29] kasan, mm: fix crash with HW_TAGS and DEBUG_PAGEALLOC
  2021-03-13  5:06 incoming Andrew Morton
                   ` (18 preceding siblings ...)
  2021-03-13  5:08 ` [patch 19/29] mm/madvise: replace ptrace attach requirement for process_madvise Andrew Morton
@ 2021-03-13  5:08 ` Andrew Morton
  2021-03-13  5:08 ` [patch 21/29] kasan: fix KASAN_STACK dependency for HW_TAGS Andrew Morton
                   ` (8 subsequent siblings)
  28 siblings, 0 replies; 32+ messages in thread
From: Andrew Morton @ 2021-03-13  5:08 UTC (permalink / raw)
  To: akpm, andreyknvl, aryabinin, Branislav.Rankov, catalin.marinas,
	dvyukov, elver, eugenis, glider, hch, kevin.brodsky, linux-mm,
	mm-commits, pcc, stable, torvalds, vincenzo.frascino,
	will.deacon

From: Andrey Konovalov <andreyknvl@google.com>
Subject: kasan, mm: fix crash with HW_TAGS and DEBUG_PAGEALLOC

Currently, kasan_free_nondeferred_pages()->kasan_free_pages() is called
after debug_pagealloc_unmap_pages(). This causes a crash when
debug_pagealloc is enabled, as HW_TAGS KASAN can't set tags on an
unmapped page.

This patch puts kasan_free_nondeferred_pages() before
debug_pagealloc_unmap_pages() and arch_free_page(), which can also make
the page unavailable.

Link: https://lkml.kernel.org/r/24cd7db274090f0e5bc3adcdc7399243668e3171.1614987311.git.andreyknvl@google.com
Fixes: 94ab5b61ee16 ("kasan, arm64: enable CONFIG_KASAN_HW_TAGS")
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Branislav Rankov <Branislav.Rankov@arm.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_alloc.c |    8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

--- a/mm/page_alloc.c~kasan-mm-fix-crash-with-hw_tags-and-debug_pagealloc
+++ a/mm/page_alloc.c
@@ -1282,6 +1282,12 @@ static __always_inline bool free_pages_p
 	kernel_poison_pages(page, 1 << order);
 
 	/*
+	 * With hardware tag-based KASAN, memory tags must be set before the
+	 * page becomes unavailable via debug_pagealloc or arch_free_page.
+	 */
+	kasan_free_nondeferred_pages(page, order);
+
+	/*
 	 * arch_free_page() can make the page's contents inaccessible.  s390
 	 * does this.  So nothing which can access the page's contents should
 	 * happen after this.
@@ -1290,8 +1296,6 @@ static __always_inline bool free_pages_p
 
 	debug_pagealloc_unmap_pages(page, 1 << order);
 
-	kasan_free_nondeferred_pages(page, order);
-
 	return true;
 }
 
_

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [patch 21/29] kasan: fix KASAN_STACK dependency for HW_TAGS
  2021-03-13  5:06 incoming Andrew Morton
                   ` (19 preceding siblings ...)
  2021-03-13  5:08 ` [patch 20/29] kasan, mm: fix crash with HW_TAGS and DEBUG_PAGEALLOC Andrew Morton
@ 2021-03-13  5:08 ` Andrew Morton
  2021-03-13  5:08 ` [patch 22/29] mm/userfaultfd: fix memory corruption due to writeprotect Andrew Morton
                   ` (7 subsequent siblings)
  28 siblings, 0 replies; 32+ messages in thread
From: Andrew Morton @ 2021-03-13  5:08 UTC (permalink / raw)
  To: akpm, andreyknvl, aryabinin, Branislav.Rankov, catalin.marinas,
	dvyukov, elver, eugenis, glider, kevin.brodsky, linux-mm,
	mm-commits, pcc, stable, torvalds, vincenzo.frascino,
	will.deacon

From: Andrey Konovalov <andreyknvl@google.com>
Subject: kasan: fix KASAN_STACK dependency for HW_TAGS

There's a runtime failure when running HW_TAGS-enabled kernel built with
GCC on hardware that doesn't support MTE.  GCC-built kernels always have
CONFIG_KASAN_STACK enabled, even though stack instrumentation isn't
supported by HW_TAGS.  Having that config enabled causes KASAN to issue
MTE-only instructions to unpoison kernel stacks, which causes the failure.

Fix the issue by disallowing CONFIG_KASAN_STACK when HW_TAGS is used.

(The commit that introduced CONFIG_KASAN_HW_TAGS specified proper
 dependency for CONFIG_KASAN_STACK_ENABLE but not for CONFIG_KASAN_STACK.)

Link: https://lkml.kernel.org/r/59e75426241dbb5611277758c8d4d6f5f9298dac.1615215441.git.andreyknvl@google.com
Fixes: 6a63a63ff1ac ("kasan: introduce CONFIG_KASAN_HW_TAGS")
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reported-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: <stable@vger.kernel.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Branislav Rankov <Branislav.Rankov@arm.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/Kconfig.kasan |    1 +
 1 file changed, 1 insertion(+)

--- a/lib/Kconfig.kasan~kasan-fix-kasan_stack-dependency-for-hw_tags
+++ a/lib/Kconfig.kasan
@@ -156,6 +156,7 @@ config KASAN_STACK_ENABLE
 
 config KASAN_STACK
 	int
+	depends on KASAN_GENERIC || KASAN_SW_TAGS
 	default 1 if KASAN_STACK_ENABLE || CC_IS_GCC
 	default 0
 
_

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [patch 22/29] mm/userfaultfd: fix memory corruption due to writeprotect
  2021-03-13  5:06 incoming Andrew Morton
                   ` (20 preceding siblings ...)
  2021-03-13  5:08 ` [patch 21/29] kasan: fix KASAN_STACK dependency for HW_TAGS Andrew Morton
@ 2021-03-13  5:08 ` Andrew Morton
  2021-03-13  5:08 ` [patch 23/29] mm, hwpoison: do not lock page again when me_huge_page() successfully recovers Andrew Morton
                   ` (6 subsequent siblings)
  28 siblings, 0 replies; 32+ messages in thread
From: Andrew Morton @ 2021-03-13  5:08 UTC (permalink / raw)
  To: aarcange, akpm, linux-mm, luto, mike.kravetz, minchan,
	mm-commits, namit, peterx, peterz, rppt, stable, torvalds, will,
	xemul, yuzhao

From: Nadav Amit <namit@vmware.com>
Subject: mm/userfaultfd: fix memory corruption due to writeprotect

Userfaultfd self-test fails occasionally, indicating a memory corruption.

Analyzing this problem indicates that there is a real bug since mmap_lock
is only taken for read in mwriteprotect_range() and defers flushes, and
since there is insufficient consideration of concurrent deferred TLB
flushes in wp_page_copy().  Although the PTE is flushed from the TLBs in
wp_page_copy(), this flush takes place after the copy has already been
performed, and therefore changes of the page are possible between the time
of the copy and the time in which the PTE is flushed.

To make matters worse, memory-unprotection using userfaultfd also poses a
problem.  Although memory unprotection is logically a promotion of PTE
permissions, and therefore should not require a TLB flush, the current
userfaultfd code might actually cause a demotion of the architectural PTE
permission: when userfaultfd_writeprotect() unprotects a memory region, it
unintentionally *clears* the RW-bit if it was already set.  Note that
unprotecting a PTE that is not write-protected is a valid use-case: the
userfaultfd monitor might ask to unprotect a region that holds both
write-protected and write-unprotected PTEs.

The scenario that happens in selftests/vm/userfaultfd is as follows:

cpu0				cpu1			cpu2
----				----			----
							[ Writable PTE
							  cached in TLB ]
userfaultfd_writeprotect()
[ write-*unprotect* ]
mwriteprotect_range()
mmap_read_lock()
change_protection()

change_protection_range()
...
change_pte_range()
[ *clear* “write”-bit ]
[ defer TLB flushes ]
				[ page-fault ]
				...
				wp_page_copy()
				 cow_user_page()
				  [ copy page ]
							[ write to old
							  page ]
				...
				 set_pte_at_notify()

A similar scenario can happen:

cpu0		cpu1		cpu2		cpu3
----		----		----		----
						[ Writable PTE
				  		  cached in TLB ]
userfaultfd_writeprotect()
[ write-protect ]
[ deferred TLB flush ]
		userfaultfd_writeprotect()
		[ write-unprotect ]
		[ deferred TLB flush]
				[ page-fault ]
				wp_page_copy()
				 cow_user_page()
				 [ copy page ]
				 ...		[ write to page ]
				set_pte_at_notify()

This race exists since commit 292924b26024 ("userfaultfd: wp: apply
_PAGE_UFFD_WP bit").  Yet, as Yu Zhao pointed, these races became apparent
since commit 09854ba94c6a ("mm: do_wp_page() simplification") which made
wp_page_copy() more likely to take place, specifically if page_count(page)
> 1.

To resolve the aforementioned races, check whether there are pending
flushes on uffd-write-protected VMAs, and if there are, perform a flush
before doing the COW.

Further optimizations will follow to avoid unnecessary PTE write-protection
and TLB flushes during uffd-write-unprotect.

Link: https://lkml.kernel.org/r/20210304095423.3825684-1-namit@vmware.com
Fixes: 09854ba94c6a ("mm: do_wp_page() simplification")
Signed-off-by: Nadav Amit <namit@vmware.com>
Suggested-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Tested-by: Peter Xu <peterx@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@vger.kernel.org>	[5.9+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory.c |    8 ++++++++
 1 file changed, 8 insertions(+)

--- a/mm/memory.c~mm-userfaultfd-fix-memory-corruption-due-to-writeprotect
+++ a/mm/memory.c
@@ -3097,6 +3097,14 @@ static vm_fault_t do_wp_page(struct vm_f
 		return handle_userfault(vmf, VM_UFFD_WP);
 	}
 
+	/*
+	 * Userfaultfd write-protect can defer flushes. Ensure the TLB
+	 * is flushed in this case before copying.
+	 */
+	if (unlikely(userfaultfd_wp(vmf->vma) &&
+		     mm_tlb_flush_pending(vmf->vma->vm_mm)))
+		flush_tlb_page(vmf->vma, vmf->address);
+
 	vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte);
 	if (!vmf->page) {
 		/*
_

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [patch 23/29] mm, hwpoison: do not lock page again when me_huge_page() successfully recovers
  2021-03-13  5:06 incoming Andrew Morton
                   ` (21 preceding siblings ...)
  2021-03-13  5:08 ` [patch 22/29] mm/userfaultfd: fix memory corruption due to writeprotect Andrew Morton
@ 2021-03-13  5:08 ` Andrew Morton
  2021-03-13 19:23   ` Linus Torvalds
  2021-03-13  5:08 ` [patch 24/29] ia64: fix ia64_syscall_get_set_arguments() for break-based syscalls Andrew Morton
                   ` (5 subsequent siblings)
  28 siblings, 1 reply; 32+ messages in thread
From: Andrew Morton @ 2021-03-13  5:08 UTC (permalink / raw)
  To: akpm, aneesh.kumar, linux-mm, mhocko, mm-commits,
	naoya.horiguchi, osalvador, stable, tony.luck, torvalds

From: Naoya Horiguchi <naoya.horiguchi@nec.com>
Subject: mm, hwpoison: do not lock page again when me_huge_page() successfully recovers

Currently me_huge_page() temporarily unlocks the page to perform some
actions, then locks it again later.  My testcase (which calls hard-offline
on some tail page in a hugetlb page, then accesses an address in the
hugetlb range) showed that the page allocation code detects this page lock
on a buddy page and prints out a "BUG: Bad page state" message.

check_new_page_bad() does not consider a page with __PG_HWPOISON as a bad
page, so this flag works as a kind of filter, but this filtering doesn't
work in this case because the "bad page" is not the actual hwpoisoned
page.

This patch suggests dropping the 2nd page lock to fix the issue.

Link: https://lkml.kernel.org/r/20210304064437.962442-1-nao.horiguchi@gmail.com
Fixes: commit 78bb920344b8 ("mm: hwpoison: dissolve in-use hugepage in unrecoverable memory error")
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory-failure.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/mm/memory-failure.c~mm-hwpoison-do-not-lock-page-again-when-me_huge_page-successfully-recovers
+++ a/mm/memory-failure.c
@@ -834,7 +834,6 @@ static int me_huge_page(struct page *p,
 			page_ref_inc(p);
 			res = MF_RECOVERED;
 		}
-		lock_page(hpage);
 	}
 
 	return res;
@@ -1290,7 +1289,8 @@ static int memory_failure_hugetlb(unsign
 
 	res = identify_page_state(pfn, p, page_flags);
 out:
-	unlock_page(head);
+	if (PageLocked(head))
+		unlock_page(head);
 	return res;
 }
 
_

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [patch 24/29] ia64: fix ia64_syscall_get_set_arguments() for break-based syscalls
  2021-03-13  5:06 incoming Andrew Morton
                   ` (22 preceding siblings ...)
  2021-03-13  5:08 ` [patch 23/29] mm, hwpoison: do not lock page again when me_huge_page() successfully recovers Andrew Morton
@ 2021-03-13  5:08 ` Andrew Morton
  2021-03-13  5:08 ` [patch 25/29] ia64: fix ptrace(PTRACE_SYSCALL_INFO_EXIT) sign Andrew Morton
                   ` (4 subsequent siblings)
  28 siblings, 0 replies; 32+ messages in thread
From: Andrew Morton @ 2021-03-13  5:08 UTC (permalink / raw)
  To: akpm, glaubitz, ldv, linux-mm, mm-commits, oleg, slyfox, torvalds

From: Sergei Trofimovich <slyfox@gentoo.org>
Subject: ia64: fix ia64_syscall_get_set_arguments() for break-based syscalls

In https://bugs.gentoo.org/769614 Dmitry noticed that
`ptrace(PTRACE_GET_SYSCALL_INFO)` does not work for syscalls called via
glibc's syscall() wrapper.

ia64 has two ways to call syscalls from userspace: via `break` and via
`eps` instructions.

The difference is in stack layout:

1. `eps` creates a simple stack frame: no locals, in{0..7} == out{0..8}
2. `break` uses a userspace stack frame: there may be locals (glibc
   provides one), in{0..7} == out{0..8}.

Both work fine in the syscall handling code itself.

But `ptrace(PTRACE_GET_SYSCALL_INFO)` uses the unwind mechanism to
re-extract syscall arguments, and it does not account for locals.

The change always skips the local registers.  It should not change the
`eps` path, as the kernel's handler already enforces locals=0, and it
fixes the `break` path.

Tested on v5.10 on rx3600 machine (ia64 9040 CPU).

Link: https://lkml.kernel.org/r/20210221002554.333076-1-slyfox@gentoo.org
Link: https://bugs.gentoo.org/769614
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Reported-by: Dmitry V. Levin <ldv@altlinux.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/ia64/kernel/ptrace.c |   24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

--- a/arch/ia64/kernel/ptrace.c~ia64-fix-ia64_syscall_get_set_arguments-for-break-based-syscalls
+++ a/arch/ia64/kernel/ptrace.c
@@ -2013,27 +2013,39 @@ static void syscall_get_set_args_cb(stru
 {
 	struct syscall_get_set_args *args = data;
 	struct pt_regs *pt = args->regs;
-	unsigned long *krbs, cfm, ndirty;
+	unsigned long *krbs, cfm, ndirty, nlocals, nouts;
 	int i, count;
 
 	if (unw_unwind_to_user(info) < 0)
 		return;
 
+	/*
+	 * We get here via a few paths:
+	 * - break instruction: cfm is shared with caller.
+	 *   syscall args are in out= regs, locals are non-empty.
+	 * - epsinstruction: cfm is set by br.call
+	 *   locals don't exist.
+	 *
+	 * For both cases argguments are reachable in cfm.sof - cfm.sol.
+	 * CFM: [ ... | sor: 17..14 | sol : 13..7 | sof : 6..0 ]
+	 */
 	cfm = pt->cr_ifs;
+	nlocals = (cfm >> 7) & 0x7f; /* aka sol */
+	nouts = (cfm & 0x7f) - nlocals; /* aka sof - sol */
 	krbs = (unsigned long *)info->task + IA64_RBS_OFFSET/8;
 	ndirty = ia64_rse_num_regs(krbs, krbs + (pt->loadrs >> 19));
 
 	count = 0;
 	if (in_syscall(pt))
-		count = min_t(int, args->n, cfm & 0x7f);
+		count = min_t(int, args->n, nouts);
 
+	/* Iterate over outs. */
 	for (i = 0; i < count; i++) {
+		int j = ndirty + nlocals + i + args->i;
 		if (args->rw)
-			*ia64_rse_skip_regs(krbs, ndirty + i + args->i) =
-				args->args[i];
+			*ia64_rse_skip_regs(krbs, j) = args->args[i];
 		else
-			args->args[i] = *ia64_rse_skip_regs(krbs,
-				ndirty + i + args->i);
+			args->args[i] = *ia64_rse_skip_regs(krbs, j);
 	}
 
 	if (!args->rw) {
_

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [patch 25/29] ia64: fix ptrace(PTRACE_SYSCALL_INFO_EXIT) sign
  2021-03-13  5:06 incoming Andrew Morton
                   ` (23 preceding siblings ...)
  2021-03-13  5:08 ` [patch 24/29] ia64: fix ia64_syscall_get_set_arguments() for break-based syscalls Andrew Morton
@ 2021-03-13  5:08 ` Andrew Morton
  2021-03-13  5:08 ` [patch 26/29] mm/memcg: rename mem_cgroup_split_huge_fixup to split_page_memcg and add nr_pages argument Andrew Morton
                   ` (3 subsequent siblings)
  28 siblings, 0 replies; 32+ messages in thread
From: Andrew Morton @ 2021-03-13  5:08 UTC (permalink / raw)
  To: akpm, glaubitz, ldv, linux-mm, mm-commits, oleg, slyfox, torvalds

From: Sergei Trofimovich <slyfox@gentoo.org>
Subject: ia64: fix ptrace(PTRACE_SYSCALL_INFO_EXIT) sign

In https://bugs.gentoo.org/769614 Dmitry noticed that
`ptrace(PTRACE_GET_SYSCALL_INFO)` does not return the error sign properly.

The bug is in a mismatch between how the error is set and how it is read back:

static inline long syscall_get_error(struct task_struct *task,
                                     struct pt_regs *regs)
{
        return regs->r10 == -1 ? regs->r8:0;
}

static inline long syscall_get_return_value(struct task_struct *task,
                                            struct pt_regs *regs)
{
        return regs->r8;
}

static inline void syscall_set_return_value(struct task_struct *task,
                                            struct pt_regs *regs,
                                            int error, long val)
{
        if (error) {
                /* error < 0, but ia64 uses > 0 return value */
                regs->r8 = -error;
                regs->r10 = -1;
        } else {
                regs->r8 = val;
                regs->r10 = 0;
        }
}

Tested on v5.10 on rx3600 machine (ia64 9040 CPU).

Link: https://lkml.kernel.org/r/20210221002554.333076-2-slyfox@gentoo.org
Link: https://bugs.gentoo.org/769614
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Reported-by: Dmitry V. Levin <ldv@altlinux.org>
Reviewed-by: Dmitry V. Levin <ldv@altlinux.org>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/ia64/include/asm/syscall.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/arch/ia64/include/asm/syscall.h~ia64-fix-ptraceptrace_syscall_info_exit-sign
+++ a/arch/ia64/include/asm/syscall.h
@@ -32,7 +32,7 @@ static inline void syscall_rollback(stru
 static inline long syscall_get_error(struct task_struct *task,
 				     struct pt_regs *regs)
 {
-	return regs->r10 == -1 ? regs->r8:0;
+	return regs->r10 == -1 ? -regs->r8:0;
 }
 
 static inline long syscall_get_return_value(struct task_struct *task,
_

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [patch 26/29] mm/memcg: rename mem_cgroup_split_huge_fixup to split_page_memcg and add nr_pages argument
  2021-03-13  5:06 incoming Andrew Morton
                   ` (24 preceding siblings ...)
  2021-03-13  5:08 ` [patch 25/29] ia64: fix ptrace(PTRACE_SYSCALL_INFO_EXIT) sign Andrew Morton
@ 2021-03-13  5:08 ` Andrew Morton
  2021-03-13  5:08 ` [patch 27/29] mm/memcg: set memcg when splitting page Andrew Morton
                   ` (2 subsequent siblings)
  28 siblings, 0 replies; 32+ messages in thread
From: Andrew Morton @ 2021-03-13  5:08 UTC (permalink / raw)
  To: akpm, chenweilong, dingtianhong, guohanjun, hannes, hughd,
	kirill.shutemov, linux-mm, mhocko, mm-commits, npiggin,
	rui.xiang, shakeelb, stable, torvalds, wangkefeng.wang,
	zhouguanghui1, ziy

From: Zhou Guanghui <zhouguanghui1@huawei.com>
Subject: mm/memcg: rename mem_cgroup_split_huge_fixup to split_page_memcg and add nr_pages argument

Rename mem_cgroup_split_huge_fixup to split_page_memcg and explicitly pass
in the number of pages as an argument.

In this way, the interface name is more generic and can be used by other
potential users.  In addition, the complete memcg information (memcg
pointer and flags) needs to be set on the tail pages.

Link: https://lkml.kernel.org/r/20210304074053.65527-2-zhouguanghui1@huawei.com
Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Tianhong Ding <dingtianhong@huawei.com>
Cc: Weilong Chen <chenweilong@huawei.com>
Cc: Rui Xiang <rui.xiang@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/memcontrol.h |    6 ++----
 mm/huge_memory.c           |    2 +-
 mm/memcontrol.c            |   15 ++++++---------
 3 files changed, 9 insertions(+), 14 deletions(-)

--- a/include/linux/memcontrol.h~mm-memcg-rename-mem_cgroup_split_huge_fixup-to-split_page_memcg
+++ a/include/linux/memcontrol.h
@@ -1061,9 +1061,7 @@ static inline void memcg_memory_event_mm
 	rcu_read_unlock();
 }
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-void mem_cgroup_split_huge_fixup(struct page *head);
-#endif
+void split_page_memcg(struct page *head, unsigned int nr);
 
 #else /* CONFIG_MEMCG */
 
@@ -1400,7 +1398,7 @@ unsigned long mem_cgroup_soft_limit_recl
 	return 0;
 }
 
-static inline void mem_cgroup_split_huge_fixup(struct page *head)
+static inline void split_page_memcg(struct page *head, unsigned int nr)
 {
 }
 
--- a/mm/huge_memory.c~mm-memcg-rename-mem_cgroup_split_huge_fixup-to-split_page_memcg
+++ a/mm/huge_memory.c
@@ -2467,7 +2467,7 @@ static void __split_huge_page(struct pag
 	int i;
 
 	/* complete memcg works before add pages to LRU */
-	mem_cgroup_split_huge_fixup(head);
+	split_page_memcg(head, nr);
 
 	if (PageAnon(head) && PageSwapCache(head)) {
 		swp_entry_t entry = { .val = page_private(head) };
--- a/mm/memcontrol.c~mm-memcg-rename-mem_cgroup_split_huge_fixup-to-split_page_memcg
+++ a/mm/memcontrol.c
@@ -3287,24 +3287,21 @@ void obj_cgroup_uncharge(struct obj_cgro
 
 #endif /* CONFIG_MEMCG_KMEM */
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 /*
- * Because page_memcg(head) is not set on compound tails, set it now.
+ * Because page_memcg(head) is not set on tails, set it now.
  */
-void mem_cgroup_split_huge_fixup(struct page *head)
+void split_page_memcg(struct page *head, unsigned int nr)
 {
 	struct mem_cgroup *memcg = page_memcg(head);
 	int i;
 
-	if (mem_cgroup_disabled())
+	if (mem_cgroup_disabled() || !memcg)
 		return;
 
-	for (i = 1; i < HPAGE_PMD_NR; i++) {
-		css_get(&memcg->css);
-		head[i].memcg_data = (unsigned long)memcg;
-	}
+	for (i = 1; i < nr; i++)
+		head[i].memcg_data = head->memcg_data;
+	css_get_many(&memcg->css, nr - 1);
 }
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #ifdef CONFIG_MEMCG_SWAP
 /**
_

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [patch 27/29] mm/memcg: set memcg when splitting page
  2021-03-13  5:06 incoming Andrew Morton
                   ` (25 preceding siblings ...)
  2021-03-13  5:08 ` [patch 26/29] mm/memcg: rename mem_cgroup_split_huge_fixup to split_page_memcg and add nr_pages argument Andrew Morton
@ 2021-03-13  5:08 ` Andrew Morton
  2021-03-13  5:08 ` [patch 28/29] zram: fix return value on writeback_store Andrew Morton
  2021-03-13  5:08 ` [patch 29/29] zram: fix broken page writeback Andrew Morton
  28 siblings, 0 replies; 32+ messages in thread
From: Andrew Morton @ 2021-03-13  5:08 UTC (permalink / raw)
  To: akpm, chenweilong, dingtianhong, guohanjun, hannes, hughd,
	kirill.shutemov, linux-mm, mhocko, mm-commits, npiggin,
	rui.xiang, shakeelb, stable, torvalds, wangkefeng.wang,
	zhouguanghui1, ziy

From: Zhou Guanghui <zhouguanghui1@huawei.com>
Subject: mm/memcg: set memcg when splitting page

As described in the split_page() comment, for a non-compound high-order
page the sub-pages must be freed individually.  If only the first page's
memcg is set, the tail pages cannot be uncharged when they are freed.

For example, when alloc_pages_exact() is used to allocate 1MB of contiguous
physical memory, 2MB is charged (kmemcg is enabled and __GFP_ACCOUNT is
set).  When make_alloc_exact() frees the unused 1MB and free_pages_exact()
frees the requested 1MB, actually only 4KB (one page) is uncharged.

Therefore, the memcg of the tail page needs to be set when splitting a
page.
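
A reduced kernel-style sketch of the allocation pattern described above
(illustrative only, with made-up function names around the two real calls):
the whole high-order block is charged up front, but without
split_page_memcg() only the head page carries memcg_data, so the later
page-by-page free uncharges just one page.

#include <linux/gfp.h>
#include <linux/sizes.h>
#include <linux/errno.h>

static void *exact_buf;

static int alloc_accounted_buf(void)
{
	/* charges the full high-order block to the current memcg;
	 * make_alloc_exact() then returns the unused tail pages to buddy */
	exact_buf = alloc_pages_exact(SZ_1M, GFP_KERNEL_ACCOUNT);
	return exact_buf ? 0 : -ENOMEM;
}

static void free_accounted_buf(void)
{
	/* frees page by page; if the tail pages have no memcg_data set,
	 * they are never uncharged */
	free_pages_exact(exact_buf, SZ_1M);
}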

Michel:

There are at least two explicit users of __GFP_ACCOUNT with
alloc_exact_pages added recently.  See 7efe8ef274024 ("KVM: arm64:
Allocate stage-2 pgd pages with GFP_KERNEL_ACCOUNT") and c419621873713
("KVM: s390: Add memcg accounting to KVM allocations"), so this is not
just a theoretical issue.

Link: https://lkml.kernel.org/r/20210304074053.65527-3-zhouguanghui1@huawei.com
Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Rui Xiang <rui.xiang@huawei.com>
Cc: Tianhong Ding <dingtianhong@huawei.com>
Cc: Weilong Chen <chenweilong@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_alloc.c |    1 +
 1 file changed, 1 insertion(+)

--- a/mm/page_alloc.c~mm-memcg-set-memcg-when-split-page
+++ a/mm/page_alloc.c
@@ -3314,6 +3314,7 @@ void split_page(struct page *page, unsig
 	for (i = 1; i < (1 << order); i++)
 		set_page_refcounted(page + i);
 	split_page_owner(page, 1 << order);
+	split_page_memcg(page, 1 << order);
 }
 EXPORT_SYMBOL_GPL(split_page);
 
_

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [patch 28/29] zram: fix return value on writeback_store
  2021-03-13  5:06 incoming Andrew Morton
                   ` (26 preceding siblings ...)
  2021-03-13  5:08 ` [patch 27/29] mm/memcg: set memcg when splitting page Andrew Morton
@ 2021-03-13  5:08 ` Andrew Morton
  2021-03-13  5:08 ` [patch 29/29] zram: fix broken page writeback Andrew Morton
  28 siblings, 0 replies; 32+ messages in thread
From: Andrew Morton @ 2021-03-13  5:08 UTC (permalink / raw)
  To: akpm, colin.king, joaodias, linux-mm, minchan, mm-commits,
	sergey.senozhatsky, stable, torvalds

From: Minchan Kim <minchan@kernel.org>
Subject: zram: fix return value on writeback_store

writeback_store's return value is overwritten by submit_bio_wait's return
value.  Thus, writeback_store will return zero when there was no IO
error.  In the end, the write syscall from userspace sees zero as the
return value, which could make the process stall, retrying the write over
and over until it succeeds.
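
As a hedged userspace illustration (not the zram code) of why this matters:
a conventional retry loop around write(2) treats 0 as "nothing written, try
again", so a store handler that reports 0 instead of the real error leaves
such callers spinning.

#include <stdio.h>
#include <unistd.h>

/* Illustration only: a typical "write everything" retry loop. */
static int write_all(int fd, const char *buf, size_t len)
{
	size_t off = 0;

	while (off < len) {
		ssize_t n = write(fd, buf + off, len - off);
		if (n < 0) {
			perror("write");	/* a real error is visible here */
			return -1;
		}
		off += n;	/* if n is always 0, no progress is ever made */
	}
	return 0;
}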

Link: https://lkml.kernel.org/r/20210312173949.2197662-1-minchan@kernel.org
Fixes: 3b82a051c101("drivers/block/zram/zram_drv.c: fix error return codes not being returned in writeback_store")
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: John Dias <joaodias@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/block/zram/zram_drv.c |   11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

--- a/drivers/block/zram/zram_drv.c~zram-fix-return-value-on-writeback_store
+++ a/drivers/block/zram/zram_drv.c
@@ -627,7 +627,7 @@ static ssize_t writeback_store(struct de
 	struct bio_vec bio_vec;
 	struct page *page;
 	ssize_t ret = len;
-	int mode;
+	int mode, err;
 	unsigned long blk_idx = 0;
 
 	if (sysfs_streq(buf, "idle"))
@@ -728,12 +728,17 @@ static ssize_t writeback_store(struct de
 		 * XXX: A single page IO would be inefficient for write
 		 * but it would be not bad as starter.
 		 */
-		ret = submit_bio_wait(&bio);
-		if (ret) {
+		err = submit_bio_wait(&bio);
+		if (err) {
 			zram_slot_lock(zram, index);
 			zram_clear_flag(zram, index, ZRAM_UNDER_WB);
 			zram_clear_flag(zram, index, ZRAM_IDLE);
 			zram_slot_unlock(zram, index);
+			/*
+			 * Return last IO error unless every IO were
+			 * not suceeded.
+			 */
+			ret = err;
 			continue;
 		}
 
_

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [patch 29/29] zram: fix broken page writeback
  2021-03-13  5:06 incoming Andrew Morton
                   ` (27 preceding siblings ...)
  2021-03-13  5:08 ` [patch 28/29] zram: fix return value on writeback_store Andrew Morton
@ 2021-03-13  5:08 ` Andrew Morton
  28 siblings, 0 replies; 32+ messages in thread
From: Andrew Morton @ 2021-03-13  5:08 UTC (permalink / raw)
  To: akpm, amosbianchi, joaodias, linux-mm, minchan, mm-commits,
	sergey.senozhatsky, stable, torvalds

From: Minchan Kim <minchan@kernel.org>
Subject: zram: fix broken page writeback

Commit 0d8359620d9b ("zram: support page writeback") introduced two
problems.  It overwrites writeback_store's return value with kstrtol's
return value, which makes the return value zero, so the user could see zero
as the return value of the write syscall even though the data was written
successfully.

It also breaks the index value in the loop in that the index is no longer
incremented.  This means only the first starting block index can ever be
written back, so the user couldn't write back all idle pages in the zram
device and loses the memory-saving opportunity.

This patch fixes those issues.
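
A minimal userspace illustration of the loop-index half of the fix (the
starting block index is made up): decrementing only the count keeps
re-processing the same index, while advancing the index together with the
count walks the whole requested range.

#include <stdio.h>

int main(void)
{
	unsigned long index = 3, nr_pages = 4;

	/* broken shape: while (nr_pages--) { use index; }
	 * the index never moves, so only block 3 is ever written back */

	/* fixed shape: visits blocks 3, 4, 5, 6 */
	for (; nr_pages != 0; index++, nr_pages--)
		printf("writeback block %lu\n", index);

	return 0;
}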

Link: https://lkml.kernel.org/r/20210312173949.2197662-2-minchan@kernel.org
Fixes: 0d8359620d9b("zram: support page writeback")
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: Amos Bianchi <amosbianchi@google.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: John Dias <joaodias@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/block/zram/zram_drv.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

--- a/drivers/block/zram/zram_drv.c~zram-fix-broken-page-writeback
+++ a/drivers/block/zram/zram_drv.c
@@ -638,8 +638,8 @@ static ssize_t writeback_store(struct de
 		if (strncmp(buf, PAGE_WB_SIG, sizeof(PAGE_WB_SIG) - 1))
 			return -EINVAL;
 
-		ret = kstrtol(buf + sizeof(PAGE_WB_SIG) - 1, 10, &index);
-		if (ret || index >= nr_pages)
+		if (kstrtol(buf + sizeof(PAGE_WB_SIG) - 1, 10, &index) ||
+				index >= nr_pages)
 			return -EINVAL;
 
 		nr_pages = 1;
@@ -663,7 +663,7 @@ static ssize_t writeback_store(struct de
 		goto release_init_lock;
 	}
 
-	while (nr_pages--) {
+	for (; nr_pages != 0; index++, nr_pages--) {
 		struct bio_vec bvec;
 
 		bvec.bv_page = page;
_

^ permalink raw reply	[flat|nested] 32+ messages in thread
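
Again only as illustration, a small self-contained C sketch of the two fixes
above: the parse result is merely tested, never stored in the variable that
carries the syscall's return value, and the loop advances the index so the
requested pages are actually visited.  parse_index_sketch() and
writeback_store_sketch() are hypothetical names standing in for kstrtol()
and writeback_store(); this is not the real driver code.

#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for kstrtol(): parse a decimal page index. */
static int parse_index_sketch(const char *buf, long *index)
{
	char *end;

	errno = 0;
	*index = strtol(buf, &end, 10);
	return (errno || end == buf) ? -EINVAL : 0;
}

static long writeback_store_sketch(const char *buf, long len,
				   unsigned long nr_pages)
{
	long ret = len;		/* what a successful write reports */
	long index;

	/* Only test the parse result; do not let it clobber ret. */
	if (parse_index_sketch(buf, &index) || index >= (long)nr_pages)
		return -EINVAL;

	nr_pages = 1;		/* single-page mode, as in the driver */
	for (; nr_pages != 0; index++, nr_pages--)
		;		/* submit the IO for "index" here */

	return ret;
}

A well-formed write such as "12" then reports len on success instead of the
parser's zero, and -EINVAL only for a malformed or out-of-range index.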

* Re: [patch 23/29] mm, hwpoison: do not lock page again when me_huge_page() successfully recovers
  2021-03-13  5:08 ` [patch 23/29] mm, hwpoison: do not lock page again when me_huge_page() successfully recovers Andrew Morton
@ 2021-03-13 19:23   ` Linus Torvalds
  2021-03-14  6:36     ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2021-03-13 19:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: aneesh.kumar, Linux-MM, Michal Hocko, mm-commits,
	naoya.horiguchi, Oscar Salvador, stable, Tony Luck

On Fri, Mar 12, 2021 at 9:08 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Subject: mm, hwpoison: do not lock page again when me_huge_page() successfully recovers

This patch can not possibly be correct, and is dangerous and very very wrong.

>  out:
> -       unlock_page(head);
> +       if (PageLocked(head))
> +               unlock_page(head);

You absolutely CANNOT do things like this. It is fundamentally insane.

You have two situations:

 (a) you know you locked the page.

     In this case an unconditional "unlock_page()" is the only correct
thing to do.

 (b) you don't know whether you maybe unlocked it again.

     In this case SOMEBODY ELSE might have locked the page, and you
doing "if it's locked, then unlock it" is pure and utter drivel, and
fundamentally and seriously wrong. You're unlocking SOMEBODY ELSES
lock, after having already unlocked your own once.

Now, it is possible that this kind of pattern could be ok if you
absolutely 100% know - and it is obvious from the code - that you are
the only thread that can possibly access the page. But if that is the
case, then the page should never have been locked in the first place,
because that would be an insane and pointless operation, since the
whole - and only - point of locking is to enforce some kind of
exclusivity - and if exclusivity is explicit in the code-path, then
locking the page is pointless.

And yes, in this case I can see a remote theoretical model: "all good
cases keep the page locked, and the only thing that unlocks the page
also guarantees nothing else can possibly see it".

But no. I don't actually believe this is guaranteed to be the case here,
and even if it were, I don't think this is a code sequence we can or
should accept.

End result: there is no possible situation that I can think of where the above

       if (PageLocked(head))
               unlock_page(head);

kind of sequence can EVER possibly be correct, and it shows a complete
disregard of everything that locking is all about.

Now, the memory error handling may be so special that this effectively
"works" in practice, but there is no way I can apply such a
fundamentally broken patch.

I see two options:

 - make the result of identify_page_state() *tell* you whether the
page was already unlocked (and then make the unlock conditional on
*that*)

   This is valid, because now you removed that whole "somebody else
might have transferred the lock" issue: you know you already unlocked
the page based on the return value, so you don't unlock it again.
That's perfectly valid.

 - probably better: make identify_page_state() itself always unlock
the page, and turn the two different code sequences that currently do

        res = identify_page_state(pfn, p, page_flags);
  out:
        unlock_page(head);
        return res;

into just doing

        return identify_page_state(pfn, p, page_flags);
  out:
        unlock_page(head);
        return -EBUSY;

instead, and move that now pointless "res" variable into the only
if-statement that actually uses it (for other things entirely! It
should be a "enum mf_result" there!)

And yes, that second (and imho better) rule means that now
"page_action()" itself needs to be the thing that unlocks the page,
and that in turn might be done a few different ways:

 (a) either add a separate "MF_UNLOCKED" state bit to the possible
return codes and make the me_huge_page() that unlocks the page set
that bit (NOTE! It still needs to also return either MF_FAILED or
MF_RECOVERED, so this "MF_UNLOCKED" bit really should be a separate
bitmask entirely from the other MF_xyz bits)

 (b) make the rule be that *all* the ->action() functions need to just
unlock the page.

I suspect (b) is the better model to aim for, because honestly, static
code rules for locking are basically almost always superior to dynamic
"based on this flag" locking rules. You can in theory actually have
static tooling check that the locking nests correctly with the
unlocking that way (and it's just conceptually simpler to have a hard
rule about "this function always unlocks the page").

But that existing patch is *so* fundamentally wrong that I cannot
possibly apply it, and I feel bad about the fact that it has made it
all the way to me with sign-offs and reviewed-by's and everything.

                  Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread
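
Purely as an illustration of option (b) above, here is a minimal,
self-contained C sketch of the "callee always unlocks" convention.  The
*_sketch names are hypothetical stand-ins, not the real mm/memory-failure.c
code; the point is only that the handler unlocks the page unconditionally on
every path, so the caller never has to guess with PageLocked().

/* Hypothetical stand-ins for struct page and lock_page()/unlock_page(). */
struct page_sketch { int locked; };

static void lock_page_sketch(struct page_sketch *p)   { p->locked = 1; }
static void unlock_page_sketch(struct page_sketch *p) { p->locked = 0; }

/* Handlers handed a locked page unlock it themselves, on every path. */
static int me_huge_page_sketch(struct page_sketch *p)
{
	int res = 0;

	/* ... attempt recovery while the page is still locked ... */

	unlock_page_sketch(p);	/* unconditional: success and failure alike */
	return res;
}

static int identify_page_state_sketch(struct page_sketch *p)
{
	/* dispatch to the handler; the handler owns the unlock from here on */
	return me_huge_page_sketch(p);
}

static int memory_failure_sketch(struct page_sketch *p)
{
	lock_page_sketch(p);
	/*
	 * No unlock and no "if (PageLocked())" guess here: the static rule
	 * is that the callee returns with the page already unlocked.
	 */
	return identify_page_state_sketch(p);
}

With that static rule, a conditional unlock in the caller is unnecessary by
construction, which is also what makes the convention checkable by tooling.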

* Re: [patch 23/29] mm, hwpoison: do not lock page again when me_huge_page() successfully recovers
  2021-03-13 19:23   ` Linus Torvalds
@ 2021-03-14  6:36     ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 0 replies; 32+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2021-03-14  6:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, aneesh.kumar, Linux-MM, Michal Hocko, mm-commits,
	Oscar Salvador, stable, Tony Luck

On Sat, Mar 13, 2021 at 11:23:40AM -0800, Linus Torvalds wrote:
> On Fri, Mar 12, 2021 at 9:08 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > Subject: mm, hwpoison: do not lock page again when me_huge_page() successfully recovers
> 
> This patch can not possibly be correct, and is dangerous and very very wrong.
> 
> >  out:
> > -       unlock_page(head);
> > +       if (PageLocked(head))
> > +               unlock_page(head);
> 
> You absolutely CANNOT do things like this. It is fundamentally insane.
> 
> You have two situations:
> 
>  (a) you know you locked the page.
> 
>      In this case an unconditional "unlock_page()" is the only correct
> thing to do.
> 
>  (b) you don't know whether you maybe unlocked it again.
> 
>      In this case SOMEBODY ELSE might have locked the page, and you
> doing "if it's locked, then unlock it" is pure and utter drivel, and
> fundamentally and seriously wrong. You're unlocking SOMEBODY ELSES
> lock, after having already unlocked your own once.
> 
> Now, it is possible that this kind of pattern could be ok if you
> absolutely 100% know - and it is obvious from the code - that you are
> the only thread that can possibly access the page. But if that is the
> case, then the page should never have been locked in the first place,
> because that would be an insane and pointless operation, since the
> whole - and only - point of locking is to enforce some kind of
> exclusivity - and if exclusivity is explicit in the code-path, then
> locking the page is pointless.
> 
> And yes, in this case I can see a remote theoretical model: "all good
> cases keep the page locked, and the only thing that unlocks the page
> also guarantees nothing else can possibly see it".
> 
> But no. I don't actually believe this is guaranteed to be the case here,
> and even if it were, I don't think this is a code sequence we can or
> should accept.
> 
> End result: there is no possible situation that I can think of where the above
> 
>        if (PageLocked(head))
>                unlock_page(head);
> 
> kind of sequence can EVER possibly be correct, and it shows a complete
> disregard of everything that locking is all about.
> 
> Now, the memory error handling may be so special that this effectively
> "works" in practice, but there is no way I can apply such a
> fundamentally broken patch.

OK, I'll update the patch with the second approach you mentioned below.

> 
> I see two options:
> 
>  - make the result of identify_page_state() *tell* you whether the
> page was already unlocked (and then make the unlock conditional on
> *that*)
> 
>    This is valid, because now you removed that whole "somebody else
> might have transferred the lock" issue: you know you already unlocked
> the page based on the return value, so you don't unlock it again.
> That's perfectly valid.
> 
>  - probably better: make identify_page_state() itself always unlock
> the page, and turn the two different code sequences that currently do
> 
>         res = identify_page_state(pfn, p, page_flags);
>   out:
>         unlock_page(head);
>         return res;
> 
> into just doing
> 
>         return identify_page_state(pfn, p, page_flags);
>   out:
>         unlock_page(head);
>         return -EBUSY;
> 
> instead, and move that now pointless "res" variable into the only
> if-statement that actually uses it (for other things entirely! It
> should be a "enum mf_result" there!)
> 
> And yes, that second (and imho better) rule means that now
> "page_action()" itself needs to be the thing that unlocks the page,
> and that in turn might be done a few different ways:
> 
>  (a) either add a separate "MF_UNLOCKED" state bit to the possible
> return codes and make the me_huge_page() that unlocks the page set
> that bit (NOTE! It still needs to also return either MF_FAILED or
> MF_RECOVERED, so this "MF_UNLOCKED" bit really should be a separate
> bitmask entirely from the other MF_xyz bits)
> 
>  (b) make the rule be that *all* the ->action() functions need to just
> unlock the page.
> 
> I suspect (b) is the better model to aim for, because honestly, static
> code rules for locking are basically almost always superior to dynamic
> "based on this flag" locking rules. You can in theory actually have
> static tooling check that the locking nests correctly with the
> unlocking that way (and it's just conceptually simpler to have a hard
> rule about "this function always unlocks the page").

I'll try (b). Thank you for the detailed advice.

- Naoya Horiguchi

> 
> But that existing patch is *so* fundamentally wrong that I cannot
> possibly apply it, and I feel bad about the fact that it has made it
> all the way to me with sign-offs and reviewed-by's and everything.
> 
>                   Linus
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2021-03-14  6:37 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-13  5:06 incoming Andrew Morton
2021-03-13  5:07 ` [patch 01/29] memblock: fix section mismatch warning Andrew Morton
2021-03-13  5:07 ` [patch 02/29] stop_machine: mark helpers __always_inline Andrew Morton
2021-03-13  5:07 ` [patch 03/29] init/Kconfig: make COMPILE_TEST depend on HAS_IOMEM Andrew Morton
2021-03-13  5:07 ` [patch 04/29] mm/page_alloc.c: refactor initialization of struct page for holes in memory layout Andrew Morton
2021-03-13  5:07 ` [patch 05/29] mm/fork: clear PASID for new mm Andrew Morton
2021-03-13  5:07 ` [patch 06/29] hugetlb: dedup the code to add a new file_region Andrew Morton
2021-03-13  5:07 ` [patch 07/29] hugetlb: break earlier in add_reservation_in_range() when we can Andrew Morton
2021-03-13  5:07 ` [patch 08/29] mm: introduce page_needs_cow_for_dma() for deciding whether cow Andrew Morton
2021-03-13  5:07 ` [patch 09/29] mm: use is_cow_mapping() across tree where proper Andrew Morton
2021-03-13  5:07 ` [patch 10/29] hugetlb: do early cow when page pinned on src mm Andrew Morton
2021-03-13  5:07 ` [patch 11/29] mm/highmem.c: fix zero_user_segments() with start > end Andrew Morton
2021-03-13  5:07 ` [patch 12/29] binfmt_misc: fix possible deadlock in bm_register_write Andrew Morton
2021-03-13  5:07 ` [patch 13/29] MAINTAINERS: exclude uapi directories in API/ABI section Andrew Morton
2021-03-13  5:07 ` [patch 14/29] linux/compiler-clang.h: define HAVE_BUILTIN_BSWAP* Andrew Morton
2021-03-13  5:07 ` [patch 15/29] kfence: fix printk format for ptrdiff_t Andrew Morton
2021-03-13  5:07 ` [patch 16/29] kfence, slab: fix cache_alloc_debugcheck_after() for bulk allocations Andrew Morton
2021-03-13  5:08 ` [patch 17/29] kfence: fix reports if constant function prefixes exist Andrew Morton
2021-03-13  5:08 ` [patch 18/29] include/linux/sched/mm.h: use rcu_dereference in in_vfork() Andrew Morton
2021-03-13  5:08 ` [patch 19/29] mm/madvise: replace ptrace attach requirement for process_madvise Andrew Morton
2021-03-13  5:08 ` [patch 20/29] kasan, mm: fix crash with HW_TAGS and DEBUG_PAGEALLOC Andrew Morton
2021-03-13  5:08 ` [patch 21/29] kasan: fix KASAN_STACK dependency for HW_TAGS Andrew Morton
2021-03-13  5:08 ` [patch 22/29] mm/userfaultfd: fix memory corruption due to writeprotect Andrew Morton
2021-03-13  5:08 ` [patch 23/29] mm, hwpoison: do not lock page again when me_huge_page() successfully recovers Andrew Morton
2021-03-13 19:23   ` Linus Torvalds
2021-03-14  6:36     ` HORIGUCHI NAOYA(堀口 直也)
2021-03-13  5:08 ` [patch 24/29] ia64: fix ia64_syscall_get_set_arguments() for break-based syscalls Andrew Morton
2021-03-13  5:08 ` [patch 25/29] ia64: fix ptrace(PTRACE_SYSCALL_INFO_EXIT) sign Andrew Morton
2021-03-13  5:08 ` [patch 26/29] mm/memcg: rename mem_cgroup_split_huge_fixup to split_page_memcg and add nr_pages argument Andrew Morton
2021-03-13  5:08 ` [patch 27/29] mm/memcg: set memcg when splitting page Andrew Morton
2021-03-13  5:08 ` [patch 28/29] zram: fix return value on writeback_store Andrew Morton
2021-03-13  5:08 ` [patch 29/29] zram: fix broken page writeback Andrew Morton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).