linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* incoming
@ 2021-10-18 22:14 Andrew Morton
  2021-10-18 22:15 ` [patch 01/19] mm/userfaultfd: selftests: fix memory corruption with thp enabled Andrew Morton
                   ` (18 more replies)
  0 siblings, 19 replies; 20+ messages in thread
From: Andrew Morton @ 2021-10-18 22:14 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-mm, mm-commits


19 patches, based on 519d81956ee277b4419c723adfb154603c2565ba.

Subsystems affected by this patch series:

  mm/userfaultfd
  mm/migration
  ocfs2
  mm/memblock
  mm/mempolicy
  mm/slub
  binfmt
  vfs
  mm/secretmem
  mm/thp
  misc

Subsystem: mm/userfaultfd

    Peter Xu <peterx@redhat.com>:
      mm/userfaultfd: selftests: fix memory corruption with thp enabled

    Nadav Amit <namit@vmware.com>:
      userfaultfd: fix a race between writeprotect and exit_mmap()

Subsystem: mm/migration

    Dave Hansen <dave.hansen@linux.intel.com>:
    Patch series "mm/migrate: 5.15 fixes for automatic demotion", v2:
      mm/migrate: optimize hotplug-time demotion order updates
      mm/migrate: add CPU hotplug to demotion #ifdef

    Huang Ying <ying.huang@intel.com>:
      mm/migrate: fix CPUHP state to update node demotion order

Subsystem: ocfs2

    Jan Kara <jack@suse.cz>:
      ocfs2: fix data corruption after conversion from inline format

    Valentin Vidic <vvidic@valentin-vidic.from.hr>:
      ocfs2: mount fails with buffer overflow in strlen

Subsystem: mm/memblock

    Peng Fan <peng.fan@nxp.com>:
      memblock: check memory total_size

Subsystem: mm/mempolicy

    Eric Dumazet <edumazet@google.com>:
      mm/mempolicy: do not allow illegal MPOL_F_NUMA_BALANCING | MPOL_LOCAL in mbind()

Subsystem: mm/slub

    Miaohe Lin <linmiaohe@huawei.com>:
    Patch series "Fixups for slub":
      mm, slub: fix two bugs in slab_debug_trace_open()
      mm, slub: fix mismatch between reconstructed freelist depth and cnt
      mm, slub: fix potential memoryleak in kmem_cache_open()
      mm, slub: fix potential use-after-free in slab_debugfs_fops
      mm, slub: fix incorrect memcg slab count for bulk free

Subsystem: binfmt

    Lukas Bulwahn <lukas.bulwahn@gmail.com>:
      elfcore: correct reference to CONFIG_UML

Subsystem: vfs

    "Matthew Wilcox (Oracle)" <willy@infradead.org>:
      vfs: check fd has read access in kernel_read_file_from_fd()

Subsystem: mm/secretmem

    Sean Christopherson <seanjc@google.com>:
      mm/secretmem: fix NULL page->mapping dereference in page_is_secretmem()

Subsystem: mm/thp

    Marek Szyprowski <m.szyprowski@samsung.com>:
      mm/thp: decrease nr_thps in file's mapping on THP split

Subsystem: misc

    Andrej Shadura <andrew.shadura@collabora.co.uk>:
      mailmap: add Andrej Shadura

 .mailmap                                 |    2 +
 fs/kernel_read_file.c                    |    2 -
 fs/ocfs2/alloc.c                         |   46 ++++++-----------------
 fs/ocfs2/super.c                         |   14 +++++--
 fs/userfaultfd.c                         |   12 ++++--
 include/linux/cpuhotplug.h               |    4 ++
 include/linux/elfcore.h                  |    2 -
 include/linux/memory.h                   |    5 ++
 include/linux/secretmem.h                |    2 -
 mm/huge_memory.c                         |    6 ++-
 mm/memblock.c                            |    2 -
 mm/mempolicy.c                           |   16 ++------
 mm/migrate.c                             |   62 ++++++++++++++++++-------------
 mm/page_ext.c                            |    4 --
 mm/slab.c                                |    4 +-
 mm/slub.c                                |   31 ++++++++++++---
 tools/testing/selftests/vm/userfaultfd.c |   23 ++++++++++-
 17 files changed, 138 insertions(+), 99 deletions(-)



^ permalink raw reply	[flat|nested] 20+ messages in thread

* [patch 01/19] mm/userfaultfd: selftests: fix memory corruption with thp enabled
  2021-10-18 22:14 incoming Andrew Morton
@ 2021-10-18 22:15 ` Andrew Morton
  2021-10-18 22:15 ` [patch 02/19] userfaultfd: fix a race between writeprotect and exit_mmap() Andrew Morton
                   ` (17 subsequent siblings)
  18 siblings, 0 replies; 20+ messages in thread
From: Andrew Morton @ 2021-10-18 22:15 UTC (permalink / raw)
  To: aarcange, akpm, axelrasmussen, linux-mm, liwan, liwang,
	mm-commits, nadav.amit, peterx, stable, torvalds

From: Peter Xu <peterx@redhat.com>
Subject: mm/userfaultfd: selftests: fix memory corruption with thp enabled

In RHEL's gating selftests we've encountered memory corruption in the uffd
event test even with upstream kernel:

        # ./userfaultfd anon 128 4
        nr_pages: 32768, nr_pages_per_cpu: 32768
        bounces: 3, mode: rnd racing read, userfaults: 6240 missing (6240) 14729 wp (14729)
        bounces: 2, mode: racing read, userfaults: 1444 missing (1444) 28877 wp (28877)
        bounces: 1, mode: rnd read, userfaults: 6055 missing (6055) 14699 wp (14699)
        bounces: 0, mode: read, userfaults: 82 missing (82) 25196 wp (25196)
        testing uffd-wp with pagemap (pgsize=4096): done
        testing uffd-wp with pagemap (pgsize=2097152): done
        testing events (fork, remap, remove): ERROR: nr 32427 memory corruption 0 1 (errno=0, line=963)
        ERROR: faulting process failed (errno=0, line=1117)

It can be easily reproduced when global thp enabled, which is the default for
RHEL.

It's also known as a side effect of commit 0db282ba2c12 ("selftest: use
mmap instead of posix_memalign to allocate memory", 2021-07-23), which is
imho right itself on using mmap() to make sure the addresses will be
untagged even on arm.

The problem is, for each test we allocate buffers using two
allocate_area() calls.  We assumed these two buffers won't affect each
other, however they could, because mmap() could have found that the two
buffers are near each other and having the same VMA flags, so they got
merged into one VMA.

It won't be a big problem if thp is not enabled, but when thp is
agressively enabled it means when initializing the src buffer it could
accidentally setup part of the dest buffer too when there's a shared THP
that overlaps the two regions.  Then some of the dest buffer won't be able
to be trapped by userfaultfd missing mode, then it'll cause memory
corruption as described.

To fix it, do release_pages() after initializing the src buffer.

Since the previous two release_pages() calls are after
uffd_test_ctx_clear() which will unmap all the buffers anyway (which is
stronger than release pages; as unmap() also tear town pgtables), drop
them as they shouldn't really be anything useful.

We can mark the Fixes tag upon 0db282ba2c12 as it's reported to only
happen there, however the real "Fixes" IMHO should be 8ba6e8640844, as
before that commit we'll always do explicit release_pages() before
registration of uffd, and 8ba6e8640844 changed that logic by adding extra
unmap/map and we didn't release the pages at the right place.  Meanwhile I
don't have a solid glue anyway on whether posix_memalign() could always
avoid triggering this bug, hence it's safer to attach this fix to commit
8ba6e8640844.

Link: https://lkml.kernel.org/r/20210923232512.210092-1-peterx@redhat.com
Fixes: 8ba6e8640844 ("userfaultfd/selftests: reinitialize test context in each test")
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1994931
Signed-off-by: Peter Xu <peterx@redhat.com>
Reported-by: Li Wang <liwan@redhat.com>
Tested-by: Li Wang <liwang@redhat.com>
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 tools/testing/selftests/vm/userfaultfd.c |   23 ++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

--- a/tools/testing/selftests/vm/userfaultfd.c~mm-userfaultfd-selftests-fix-memory-corruption-with-thp-enabled
+++ a/tools/testing/selftests/vm/userfaultfd.c
@@ -414,9 +414,6 @@ static void uffd_test_ctx_init_ext(uint6
 	uffd_test_ops->allocate_area((void **)&area_src);
 	uffd_test_ops->allocate_area((void **)&area_dst);
 
-	uffd_test_ops->release_pages(area_src);
-	uffd_test_ops->release_pages(area_dst);
-
 	userfaultfd_open(features);
 
 	count_verify = malloc(nr_pages * sizeof(unsigned long long));
@@ -437,6 +434,26 @@ static void uffd_test_ctx_init_ext(uint6
 		*(area_count(area_src, nr) + 1) = 1;
 	}
 
+	/*
+	 * After initialization of area_src, we must explicitly release pages
+	 * for area_dst to make sure it's fully empty.  Otherwise we could have
+	 * some area_dst pages be errornously initialized with zero pages,
+	 * hence we could hit memory corruption later in the test.
+	 *
+	 * One example is when THP is globally enabled, above allocate_area()
+	 * calls could have the two areas merged into a single VMA (as they
+	 * will have the same VMA flags so they're mergeable).  When we
+	 * initialize the area_src above, it's possible that some part of
+	 * area_dst could have been faulted in via one huge THP that will be
+	 * shared between area_src and area_dst.  It could cause some of the
+	 * area_dst won't be trapped by missing userfaults.
+	 *
+	 * This release_pages() will guarantee even if that happened, we'll
+	 * proactively split the thp and drop any accidentally initialized
+	 * pages within area_dst.
+	 */
+	uffd_test_ops->release_pages(area_dst);
+
 	pipefd = malloc(sizeof(int) * nr_cpus * 2);
 	if (!pipefd)
 		err("pipefd");
_


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [patch 02/19] userfaultfd: fix a race between writeprotect and exit_mmap()
  2021-10-18 22:14 incoming Andrew Morton
  2021-10-18 22:15 ` [patch 01/19] mm/userfaultfd: selftests: fix memory corruption with thp enabled Andrew Morton
@ 2021-10-18 22:15 ` Andrew Morton
  2021-10-18 22:15 ` [patch 03/19] mm/migrate: optimize hotplug-time demotion order updates Andrew Morton
                   ` (16 subsequent siblings)
  18 siblings, 0 replies; 20+ messages in thread
From: Andrew Morton @ 2021-10-18 22:15 UTC (permalink / raw)
  To: aarcange, akpm, linux-mm, liwang, mm-commits, namit, peterx,
	stable, torvalds

From: Nadav Amit <namit@vmware.com>
Subject: userfaultfd: fix a race between writeprotect and exit_mmap()

A race is possible when a process exits, its VMAs are removed by
exit_mmap() and at the same time userfaultfd_writeprotect() is called.

The race was detected by KASAN on a development kernel, but it appears to
be possible on vanilla kernels as well.

Use mmget_not_zero() to prevent the race as done in other userfaultfd
operations.

Link: https://lkml.kernel.org/r/20210921200247.25749-1-namit@vmware.com
Fixes: 63b2d4174c4ad ("userfaultfd: wp: add the writeprotect API to userfaultfd ioctl")
Signed-off-by: Nadav Amit <namit@vmware.com>
Tested-by: Li  Wang <liwang@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/userfaultfd.c |   12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

--- a/fs/userfaultfd.c~userfaultfd-fix-a-race-between-writeprotect-and-exit_mmap
+++ a/fs/userfaultfd.c
@@ -1827,9 +1827,15 @@ static int userfaultfd_writeprotect(stru
 	if (mode_wp && mode_dontwake)
 		return -EINVAL;
 
-	ret = mwriteprotect_range(ctx->mm, uffdio_wp.range.start,
-				  uffdio_wp.range.len, mode_wp,
-				  &ctx->mmap_changing);
+	if (mmget_not_zero(ctx->mm)) {
+		ret = mwriteprotect_range(ctx->mm, uffdio_wp.range.start,
+					  uffdio_wp.range.len, mode_wp,
+					  &ctx->mmap_changing);
+		mmput(ctx->mm);
+	} else {
+		return -ESRCH;
+	}
+
 	if (ret)
 		return ret;
 
_


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [patch 03/19] mm/migrate: optimize hotplug-time demotion order updates
  2021-10-18 22:14 incoming Andrew Morton
  2021-10-18 22:15 ` [patch 01/19] mm/userfaultfd: selftests: fix memory corruption with thp enabled Andrew Morton
  2021-10-18 22:15 ` [patch 02/19] userfaultfd: fix a race between writeprotect and exit_mmap() Andrew Morton
@ 2021-10-18 22:15 ` Andrew Morton
  2021-10-18 22:15 ` [patch 04/19] mm/migrate: add CPU hotplug to demotion #ifdef Andrew Morton
                   ` (15 subsequent siblings)
  18 siblings, 0 replies; 20+ messages in thread
From: Andrew Morton @ 2021-10-18 22:15 UTC (permalink / raw)
  To: akpm, dan.j.williams, dave.hansen, david, gthelen, linux-mm,
	mhocko, mm-commits, oliver.sang, osalvador, rientjes, torvalds,
	weixugc, yang.shi, ying.huang

From: Dave Hansen <dave.hansen@linux.intel.com>
Subject: mm/migrate: optimize hotplug-time demotion order updates

Patch series "mm/migrate: 5.15 fixes for automatic demotion", v2.

This contains two fixes for the "automatic demotion" code which was
merged into 5.15:

 * Fix memory hotplug performance regression by watching
   suppressing any real action on irrelevant hotplug events.
 * Ensure CPU hotplug handler is registered when memory hotplug
   is disabled.


This patch (of 2):

== tl;dr ==

Automatic demotion opted for a simple, lazy approach to handling hotplug
events.  This noticeably slows down memory hotplug[1].  Optimize away
updates to the demotion order when memory hotplug events should have no
effect.

This has no effect on CPU hotplug.  There is no known problem on the CPU
side and any work there will be in a separate series.

== Background ==

Automatic demotion is a memory migration strategy to ensure that new
allocations have room in faster memory tiers on tiered memory systems. 
The kernel maintains an array (node_demotion[]) to drive these migrations.

The node_demotion[] path is calculated by starting at nodes with CPUs and
then "walking" to nodes with memory.  Only hotplug events which online or
offline a node with memory (N_ONLINE) or CPUs (N_CPU) will actually affect
the migration order.

== Problem ==

However, the current code is lazy.  It completely regenerates the
migration order on *any* CPU or memory hotplug event.  The logic was that
these events are extremely rare and that the overhead from indiscriminate
order regeneration is minimal.

Part of the update logic involves a synchronize_rcu(), which is a pretty
big hammer.  Its overhead was large enough to be detected by some 0day
tests that watch memory hotplug performance[1].

== Solution ==

Add a new helper (node_demotion_topo_changed()) which can differentiate
between superfluous and impactful hotplug events.  Skip the expensive
update operation for superfluous events.

== Aside: Locking ==

It took me a few moments to declare the locking to be safe enough for
node_demotion_topo_changed() to work.  It all hinges on the memory hotplug
lock:

During memory hotplug events, 'mem_hotplug_lock' is held for write.  This
ensures that two memory hotplug events can not be called simultaneously.

CPU hotplug has a similar lock (cpuhp_state_mutex) which also provides
mutual exclusion between CPU hotplug events.  In addition, the demotion
code acquire and hold the mem_hotplug_lock for read during its CPU hotplug
handlers.  This provides mutual exclusion between the demotion memory
hotplug callbacks and the CPU hotplug callbacks.

This effectively allows treating the migration target generation code to
act as if it is single-threaded.

1. https://lore.kernel.org/all/20210905135932.GE15026@xsang-OptiPlex-9020/

Link: https://lkml.kernel.org/r/20210924161251.093CCD06@davehans-spike.ostc.intel.com
Link: https://lkml.kernel.org/r/20210924161253.D7673E31@davehans-spike.ostc.intel.com
Fixes: 884a6e5d1f93 ("mm/migrate: update node demotion order on hotplug events")
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/migrate.c |   12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

--- a/mm/migrate.c~mm-migrate-optimize-hotplug-time-demotion-order-updates
+++ a/mm/migrate.c
@@ -3239,8 +3239,18 @@ static int migration_offline_cpu(unsigne
  * set_migration_target_nodes().
  */
 static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
-						 unsigned long action, void *arg)
+						 unsigned long action, void *_arg)
 {
+	struct memory_notify *arg = _arg;
+
+	/*
+	 * Only update the node migration order when a node is
+	 * changing status, like online->offline.  This avoids
+	 * the overhead of synchronize_rcu() in most cases.
+	 */
+	if (arg->status_change_nid < 0)
+		return notifier_from_errno(0);
+
 	switch (action) {
 	case MEM_GOING_OFFLINE:
 		/*
_


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [patch 04/19] mm/migrate: add CPU hotplug to demotion #ifdef
  2021-10-18 22:14 incoming Andrew Morton
                   ` (2 preceding siblings ...)
  2021-10-18 22:15 ` [patch 03/19] mm/migrate: optimize hotplug-time demotion order updates Andrew Morton
@ 2021-10-18 22:15 ` Andrew Morton
  2021-10-18 22:15 ` [patch 05/19] mm/migrate: fix CPUHP state to update node demotion order Andrew Morton
                   ` (14 subsequent siblings)
  18 siblings, 0 replies; 20+ messages in thread
From: Andrew Morton @ 2021-10-18 22:15 UTC (permalink / raw)
  To: akpm, arnd, dan.j.williams, dave.hansen, david, gthelen,
	linux-mm, mhocko, mm-commits, osalvador, rientjes, torvalds,
	weixugc, yang.shi, ying.huang

From: Dave Hansen <dave.hansen@linux.intel.com>
Subject: mm/migrate: add CPU hotplug to demotion #ifdef

Once upon a time, the node demotion updates were driven solely by memory
hotplug events.  But now, there are handlers for both CPU and memory
hotplug.

However, the #ifdef around the code checks only memory hotplug.  A system
that has HOTPLUG_CPU=y but MEMORY_HOTPLUG=n would miss CPU hotplug events.

Update the #ifdef around the common code.  Add memory and CPU-specific
#ifdefs for their handlers.  These memory/CPU #ifdefs avoid unused
function warnings when their Kconfig option is off.

[arnd@arndb.de: rework hotplug_memory_notifier() stub]
  Link: https://lkml.kernel.org/r/20211013144029.2154629-1-arnd@kernel.org
Link: https://lkml.kernel.org/r/20210924161255.E5FE8F7E@davehans-spike.ostc.intel.com
Fixes: 884a6e5d1f93 ("mm/migrate: update node demotion order on hotplug events")
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/memory.h |    5 +++-
 mm/migrate.c           |   42 +++++++++++++++++++--------------------
 mm/page_ext.c          |    4 ---
 mm/slab.c              |    4 +--
 4 files changed, 28 insertions(+), 27 deletions(-)

--- a/include/linux/memory.h~mm-migrate-add-cpu-hotplug-to-demotion-ifdef
+++ a/include/linux/memory.h
@@ -160,7 +160,10 @@ int walk_dynamic_memory_groups(int nid,
 #define register_hotmemory_notifier(nb)		register_memory_notifier(nb)
 #define unregister_hotmemory_notifier(nb) 	unregister_memory_notifier(nb)
 #else
-#define hotplug_memory_notifier(fn, pri)	({ 0; })
+static inline int hotplug_memory_notifier(notifier_fn_t fn, int pri)
+{
+	return 0;
+}
 /* These aren't inline functions due to a GCC bug. */
 #define register_hotmemory_notifier(nb)    ({ (void)(nb); 0; })
 #define unregister_hotmemory_notifier(nb)  ({ (void)(nb); })
--- a/mm/migrate.c~mm-migrate-add-cpu-hotplug-to-demotion-ifdef
+++ a/mm/migrate.c
@@ -3066,7 +3066,7 @@ void migrate_vma_finalize(struct migrate
 EXPORT_SYMBOL(migrate_vma_finalize);
 #endif /* CONFIG_DEVICE_PRIVATE */
 
-#if defined(CONFIG_MEMORY_HOTPLUG)
+#if defined(CONFIG_HOTPLUG_CPU)
 /* Disable reclaim-based migration. */
 static void __disable_all_migrate_targets(void)
 {
@@ -3209,25 +3209,6 @@ static void set_migration_target_nodes(v
 }
 
 /*
- * React to hotplug events that might affect the migration targets
- * like events that online or offline NUMA nodes.
- *
- * The ordering is also currently dependent on which nodes have
- * CPUs.  That means we need CPU on/offline notification too.
- */
-static int migration_online_cpu(unsigned int cpu)
-{
-	set_migration_target_nodes();
-	return 0;
-}
-
-static int migration_offline_cpu(unsigned int cpu)
-{
-	set_migration_target_nodes();
-	return 0;
-}
-
-/*
  * This leaves migrate-on-reclaim transiently disabled between
  * the MEM_GOING_OFFLINE and MEM_OFFLINE events.  This runs
  * whether reclaim-based migration is enabled or not, which
@@ -3284,6 +3265,25 @@ static int __meminit migrate_on_reclaim_
 	return notifier_from_errno(0);
 }
 
+/*
+ * React to hotplug events that might affect the migration targets
+ * like events that online or offline NUMA nodes.
+ *
+ * The ordering is also currently dependent on which nodes have
+ * CPUs.  That means we need CPU on/offline notification too.
+ */
+static int migration_online_cpu(unsigned int cpu)
+{
+	set_migration_target_nodes();
+	return 0;
+}
+
+static int migration_offline_cpu(unsigned int cpu)
+{
+	set_migration_target_nodes();
+	return 0;
+}
+
 static int __init migrate_on_reclaim_init(void)
 {
 	int ret;
@@ -3303,4 +3303,4 @@ static int __init migrate_on_reclaim_ini
 	return 0;
 }
 late_initcall(migrate_on_reclaim_init);
-#endif /* CONFIG_MEMORY_HOTPLUG */
+#endif /* CONFIG_HOTPLUG_CPU */
--- a/mm/page_ext.c~mm-migrate-add-cpu-hotplug-to-demotion-ifdef
+++ a/mm/page_ext.c
@@ -269,7 +269,7 @@ static int __meminit init_section_page_e
 	total_usage += table_size;
 	return 0;
 }
-#ifdef CONFIG_MEMORY_HOTPLUG
+
 static void free_page_ext(void *addr)
 {
 	if (is_vmalloc_addr(addr)) {
@@ -374,8 +374,6 @@ static int __meminit page_ext_callback(s
 	return notifier_from_errno(ret);
 }
 
-#endif
-
 void __init page_ext_init(void)
 {
 	unsigned long pfn;
--- a/mm/slab.c~mm-migrate-add-cpu-hotplug-to-demotion-ifdef
+++ a/mm/slab.c
@@ -1095,7 +1095,7 @@ static int slab_offline_cpu(unsigned int
 	return 0;
 }
 
-#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
+#if defined(CONFIG_NUMA)
 /*
  * Drains freelist for a node on each slab cache, used for memory hot-remove.
  * Returns -EBUSY if all objects cannot be drained so that the node is not
@@ -1157,7 +1157,7 @@ static int __meminit slab_memory_callbac
 out:
 	return notifier_from_errno(ret);
 }
-#endif /* CONFIG_NUMA && CONFIG_MEMORY_HOTPLUG */
+#endif /* CONFIG_NUMA */
 
 /*
  * swap the static kmem_cache_node with kmalloced memory
_


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [patch 05/19] mm/migrate: fix CPUHP state to update node demotion order
  2021-10-18 22:14 incoming Andrew Morton
                   ` (3 preceding siblings ...)
  2021-10-18 22:15 ` [patch 04/19] mm/migrate: add CPU hotplug to demotion #ifdef Andrew Morton
@ 2021-10-18 22:15 ` Andrew Morton
  2021-10-18 22:15 ` [patch 06/19] ocfs2: fix data corruption after conversion from inline format Andrew Morton
                   ` (13 subsequent siblings)
  18 siblings, 0 replies; 20+ messages in thread
From: Andrew Morton @ 2021-10-18 22:15 UTC (permalink / raw)
  To: akpm, dan.j.williams, dave.hansen, david, gthelen, kbusch,
	linux-mm, mhocko, mm-commits, osalvador, rientjes, shy828301,
	tglx, torvalds, weixugc, ying.huang, ziy

From: Huang Ying <ying.huang@intel.com>
Subject: mm/migrate: fix CPUHP state to update node demotion order

The node demotion order needs to be updated during CPU hotplug.  Because
whether a NUMA node has CPU may influence the demotion order.  The update
function should be called during CPU online/offline after the
node_states[N_CPU] has been updated.  That is done in CPUHP_AP_ONLINE_DYN
during CPU online and in CPUHP_MM_VMSTAT_DEAD during CPU offline.  But in
commit 884a6e5d1f93 ("mm/migrate: update node demotion order on hotplug
events"), the function to update node demotion order is called in
CPUHP_AP_ONLINE_DYN during CPU online/offline.  This doesn't satisfy the
order requirement.

For example, there are 4 CPUs (P0, P1, P2, P3) in 2 sockets (P0, P1 in S0
and P2, P3 in S1), the demotion order is

- S0 -> NUMA_NO_NODE
- S1 -> NUMA_NO_NODE

After P2 and P3 is offlined, because S1 has no CPU now, the demotion
order should have been changed to

- S0 -> S1
- S1 -> NO_NODE

but it isn't changed, because the order updating callback for CPU hotplug
doesn't see the new nodemask.  After that, if P1 is offlined, the demotion
order is changed to the expected order as above.

So in this patch, we added CPUHP_AP_MM_DEMOTION_ONLINE and
CPUHP_MM_DEMOTION_DEAD to be called after CPUHP_AP_ONLINE_DYN and
CPUHP_MM_VMSTAT_DEAD during CPU online and offline, and register the
update function on them.

Link: https://lkml.kernel.org/r/20210929060351.7293-1-ying.huang@intel.com
Fixes: 884a6e5d1f93 ("mm/migrate: update node demotion order on hotplug events")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Keith Busch <kbusch@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/cpuhotplug.h |    4 ++++
 mm/migrate.c               |    8 +++++---
 2 files changed, 9 insertions(+), 3 deletions(-)

--- a/include/linux/cpuhotplug.h~mm-migrate-fix-cpuhp-state-to-update-node-demotion-order
+++ a/include/linux/cpuhotplug.h
@@ -72,6 +72,8 @@ enum cpuhp_state {
 	CPUHP_SLUB_DEAD,
 	CPUHP_DEBUG_OBJ_DEAD,
 	CPUHP_MM_WRITEBACK_DEAD,
+	/* Must be after CPUHP_MM_VMSTAT_DEAD */
+	CPUHP_MM_DEMOTION_DEAD,
 	CPUHP_MM_VMSTAT_DEAD,
 	CPUHP_SOFTIRQ_DEAD,
 	CPUHP_NET_MVNETA_DEAD,
@@ -240,6 +242,8 @@ enum cpuhp_state {
 	CPUHP_AP_BASE_CACHEINFO_ONLINE,
 	CPUHP_AP_ONLINE_DYN,
 	CPUHP_AP_ONLINE_DYN_END		= CPUHP_AP_ONLINE_DYN + 30,
+	/* Must be after CPUHP_AP_ONLINE_DYN for node_states[N_CPU] update */
+	CPUHP_AP_MM_DEMOTION_ONLINE,
 	CPUHP_AP_X86_HPET_ONLINE,
 	CPUHP_AP_X86_KVM_CLK_ONLINE,
 	CPUHP_AP_DTPM_CPU_ONLINE,
--- a/mm/migrate.c~mm-migrate-fix-cpuhp-state-to-update-node-demotion-order
+++ a/mm/migrate.c
@@ -3288,9 +3288,8 @@ static int __init migrate_on_reclaim_ini
 {
 	int ret;
 
-	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "migrate on reclaim",
-				migration_online_cpu,
-				migration_offline_cpu);
+	ret = cpuhp_setup_state_nocalls(CPUHP_MM_DEMOTION_DEAD, "mm/demotion:offline",
+					NULL, migration_offline_cpu);
 	/*
 	 * In the unlikely case that this fails, the automatic
 	 * migration targets may become suboptimal for nodes
@@ -3298,6 +3297,9 @@ static int __init migrate_on_reclaim_ini
 	 * rare case, do not bother trying to do anything special.
 	 */
 	WARN_ON(ret < 0);
+	ret = cpuhp_setup_state(CPUHP_AP_MM_DEMOTION_ONLINE, "mm/demotion:online",
+				migration_online_cpu, NULL);
+	WARN_ON(ret < 0);
 
 	hotplug_memory_notifier(migrate_on_reclaim_callback, 100);
 	return 0;
_


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [patch 06/19] ocfs2: fix data corruption after conversion from inline format
  2021-10-18 22:14 incoming Andrew Morton
                   ` (4 preceding siblings ...)
  2021-10-18 22:15 ` [patch 05/19] mm/migrate: fix CPUHP state to update node demotion order Andrew Morton
@ 2021-10-18 22:15 ` Andrew Morton
  2021-10-18 22:15 ` [patch 07/19] ocfs2: mount fails with buffer overflow in strlen Andrew Morton
                   ` (12 subsequent siblings)
  18 siblings, 0 replies; 20+ messages in thread
From: Andrew Morton @ 2021-10-18 22:15 UTC (permalink / raw)
  To: akpm, gechangwei, ghe, jack, jlbec, joseph.qi, junxiao.bi,
	linux-mm, mark, Markov.Andrey, mm-commits, piaojun, stable,
	torvalds

From: Jan Kara <jack@suse.cz>
Subject: ocfs2: fix data corruption after conversion from inline format

Commit 6dbf7bb55598 ("fs: Don't invalidate page buffers in
block_write_full_page()") uncovered a latent bug in ocfs2 conversion
from inline inode format to a normal inode format.

The code in
ocfs2_convert_inline_data_to_extents() attempts to zero out the whole
cluster allocated for file data by grabbing, zeroing, and dirtying all
pages covering this cluster. However these pages are beyond i_size, thus
writeback code generally ignores these dirty pages and no blocks were
ever actually zeroed on the disk.

This oversight was fixed by commit 693c241a5f6a ("ocfs2: No need to zero
pages past i_size.") for standard ocfs2 write path, inline conversion path
was apparently forgotten; the commit log also has a reasoning why the
zeroing actually is not needed.

After commit 6dbf7bb55598, things became worse as writeback code stopped
invalidating buffers on pages beyond i_size and thus these pages end up
with clean PageDirty bit but with buffers attached to these pages being
still dirty.  So when a file is converted from inline format, then
writeback triggers, and then the file is grown so that these pages become
valid, the invalid dirtiness state is preserved, mark_buffer_dirty() does
nothing on these pages (buffers are already dirty) but page is never
written back because it is clean.  So data written to these pages is lost
once pages are reclaimed.

Simple reproducer for the problem is:

xfs_io -f -c "pwrite 0 2000" -c "pwrite 2000 2000" -c "fsync" \
  -c "pwrite 4000 2000" ocfs2_file

After unmounting and mounting the fs again, you can observe that end of
'ocfs2_file' has lost its contents.

Fix the problem by not doing the pointless zeroing during conversion
from inline format similarly as in the standard write path.

[akpm@linux-foundation.org: fix whitespace, per Joseph]
Link: https://lkml.kernel.org/r/20210930095405.21433-1-jack@suse.cz
Fixes: 6dbf7bb55598 ("fs: Don't invalidate page buffers in block_write_full_page()")
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Gang He <ghe@suse.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: "Markov, Andrey" <Markov.Andrey@Dell.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/ocfs2/alloc.c |   46 +++++++++++----------------------------------
 1 file changed, 12 insertions(+), 34 deletions(-)

--- a/fs/ocfs2/alloc.c~ocfs2-fix-data-corruption-after-conversion-from-inline-format
+++ a/fs/ocfs2/alloc.c
@@ -7045,7 +7045,7 @@ void ocfs2_set_inode_data_inline(struct
 int ocfs2_convert_inline_data_to_extents(struct inode *inode,
 					 struct buffer_head *di_bh)
 {
-	int ret, i, has_data, num_pages = 0;
+	int ret, has_data, num_pages = 0;
 	int need_free = 0;
 	u32 bit_off, num;
 	handle_t *handle;
@@ -7054,26 +7054,17 @@ int ocfs2_convert_inline_data_to_extents
 	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
 	struct ocfs2_dinode *di = (struct ocfs2_dinode *)di_bh->b_data;
 	struct ocfs2_alloc_context *data_ac = NULL;
-	struct page **pages = NULL;
-	loff_t end = osb->s_clustersize;
+	struct page *page = NULL;
 	struct ocfs2_extent_tree et;
 	int did_quota = 0;
 
 	has_data = i_size_read(inode) ? 1 : 0;
 
 	if (has_data) {
-		pages = kcalloc(ocfs2_pages_per_cluster(osb->sb),
-				sizeof(struct page *), GFP_NOFS);
-		if (pages == NULL) {
-			ret = -ENOMEM;
-			mlog_errno(ret);
-			return ret;
-		}
-
 		ret = ocfs2_reserve_clusters(osb, 1, &data_ac);
 		if (ret) {
 			mlog_errno(ret);
-			goto free_pages;
+			goto out;
 		}
 	}
 
@@ -7093,7 +7084,8 @@ int ocfs2_convert_inline_data_to_extents
 	}
 
 	if (has_data) {
-		unsigned int page_end;
+		unsigned int page_end = min_t(unsigned, PAGE_SIZE,
+							osb->s_clustersize);
 		u64 phys;
 
 		ret = dquot_alloc_space_nodirty(inode,
@@ -7117,15 +7109,8 @@ int ocfs2_convert_inline_data_to_extents
 		 */
 		block = phys = ocfs2_clusters_to_blocks(inode->i_sb, bit_off);
 
-		/*
-		 * Non sparse file systems zero on extend, so no need
-		 * to do that now.
-		 */
-		if (!ocfs2_sparse_alloc(osb) &&
-		    PAGE_SIZE < osb->s_clustersize)
-			end = PAGE_SIZE;
-
-		ret = ocfs2_grab_eof_pages(inode, 0, end, pages, &num_pages);
+		ret = ocfs2_grab_eof_pages(inode, 0, page_end, &page,
+					   &num_pages);
 		if (ret) {
 			mlog_errno(ret);
 			need_free = 1;
@@ -7136,20 +7121,15 @@ int ocfs2_convert_inline_data_to_extents
 		 * This should populate the 1st page for us and mark
 		 * it up to date.
 		 */
-		ret = ocfs2_read_inline_data(inode, pages[0], di_bh);
+		ret = ocfs2_read_inline_data(inode, page, di_bh);
 		if (ret) {
 			mlog_errno(ret);
 			need_free = 1;
 			goto out_unlock;
 		}
 
-		page_end = PAGE_SIZE;
-		if (PAGE_SIZE > osb->s_clustersize)
-			page_end = osb->s_clustersize;
-
-		for (i = 0; i < num_pages; i++)
-			ocfs2_map_and_dirty_page(inode, handle, 0, page_end,
-						 pages[i], i > 0, &phys);
+		ocfs2_map_and_dirty_page(inode, handle, 0, page_end, page, 0,
+					 &phys);
 	}
 
 	spin_lock(&oi->ip_lock);
@@ -7180,8 +7160,8 @@ int ocfs2_convert_inline_data_to_extents
 	}
 
 out_unlock:
-	if (pages)
-		ocfs2_unlock_and_free_pages(pages, num_pages);
+	if (page)
+		ocfs2_unlock_and_free_pages(&page, num_pages);
 
 out_commit:
 	if (ret < 0 && did_quota)
@@ -7205,8 +7185,6 @@ out_commit:
 out:
 	if (data_ac)
 		ocfs2_free_alloc_context(data_ac);
-free_pages:
-	kfree(pages);
 	return ret;
 }
 
_


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [patch 07/19] ocfs2: mount fails with buffer overflow in strlen
  2021-10-18 22:14 incoming Andrew Morton
                   ` (5 preceding siblings ...)
  2021-10-18 22:15 ` [patch 06/19] ocfs2: fix data corruption after conversion from inline format Andrew Morton
@ 2021-10-18 22:15 ` Andrew Morton
  2021-10-18 22:15 ` [patch 08/19] memblock: check memory total_size Andrew Morton
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 20+ messages in thread
From: Andrew Morton @ 2021-10-18 22:15 UTC (permalink / raw)
  To: akpm, gechangwei, ghe, jlbec, joseph.qi, junxiao.bi, linux-mm,
	mark, mm-commits, piaojun, stable, torvalds, vvidic

From: Valentin Vidic <vvidic@valentin-vidic.from.hr>
Subject: ocfs2: mount fails with buffer overflow in strlen

Starting with kernel 5.11 built with CONFIG_FORTIFY_SOURCE mouting an
ocfs2 filesystem with either o2cb or pcmk cluster stack fails with the
trace below.  Problem seems to be that strings for cluster stack and
cluster name are not guaranteed to be null terminated in the disk
representation, while strlcpy assumes that the source string is always
null terminated.  This causes a read outside of the source string
triggering the buffer overflow detection.

detected buffer overflow in strlen
------------[ cut here ]------------
kernel BUG at lib/string.c:1149!
invalid opcode: 0000 [#1] SMP PTI
CPU: 1 PID: 910 Comm: mount.ocfs2 Not tainted 5.14.0-1-amd64 #1
  Debian 5.14.6-2
RIP: 0010:fortify_panic+0xf/0x11
...
Call Trace:
 ocfs2_initialize_super.isra.0.cold+0xc/0x18 [ocfs2]
 ocfs2_fill_super+0x359/0x19b0 [ocfs2]
 mount_bdev+0x185/0x1b0
 ? ocfs2_remount+0x440/0x440 [ocfs2]
 legacy_get_tree+0x27/0x40
 vfs_get_tree+0x25/0xb0
 path_mount+0x454/0xa20
 __x64_sys_mount+0x103/0x140
 do_syscall_64+0x3b/0xc0
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Link: https://lkml.kernel.org/r/20210929180654.32460-1-vvidic@valentin-vidic.from.hr
Signed-off-by: Valentin Vidic <vvidic@valentin-vidic.from.hr>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/ocfs2/super.c |   14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

--- a/fs/ocfs2/super.c~ocfs2-mount-fails-with-buffer-overflow-in-strlen
+++ a/fs/ocfs2/super.c
@@ -2167,11 +2167,17 @@ static int ocfs2_initialize_super(struct
 	}
 
 	if (ocfs2_clusterinfo_valid(osb)) {
+		/*
+		 * ci_stack and ci_cluster in ocfs2_cluster_info may not be null
+		 * terminated, so make sure no overflow happens here by using
+		 * memcpy. Destination strings will always be null terminated
+		 * because osb is allocated using kzalloc.
+		 */
 		osb->osb_stackflags =
 			OCFS2_RAW_SB(di)->s_cluster_info.ci_stackflags;
-		strlcpy(osb->osb_cluster_stack,
+		memcpy(osb->osb_cluster_stack,
 		       OCFS2_RAW_SB(di)->s_cluster_info.ci_stack,
-		       OCFS2_STACK_LABEL_LEN + 1);
+		       OCFS2_STACK_LABEL_LEN);
 		if (strlen(osb->osb_cluster_stack) != OCFS2_STACK_LABEL_LEN) {
 			mlog(ML_ERROR,
 			     "couldn't mount because of an invalid "
@@ -2180,9 +2186,9 @@ static int ocfs2_initialize_super(struct
 			status = -EINVAL;
 			goto bail;
 		}
-		strlcpy(osb->osb_cluster_name,
+		memcpy(osb->osb_cluster_name,
 			OCFS2_RAW_SB(di)->s_cluster_info.ci_cluster,
-			OCFS2_CLUSTER_NAME_LEN + 1);
+			OCFS2_CLUSTER_NAME_LEN);
 	} else {
 		/* The empty string is identical with classic tools that
 		 * don't know about s_cluster_info. */
_


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [patch 08/19] memblock: check memory total_size
  2021-10-18 22:14 incoming Andrew Morton
                   ` (6 preceding siblings ...)
  2021-10-18 22:15 ` [patch 07/19] ocfs2: mount fails with buffer overflow in strlen Andrew Morton
@ 2021-10-18 22:15 ` Andrew Morton
  2021-10-18 22:15 ` [patch 09/19] mm/mempolicy: do not allow illegal MPOL_F_NUMA_BALANCING | MPOL_LOCAL in mbind() Andrew Morton
                   ` (10 subsequent siblings)
  18 siblings, 0 replies; 20+ messages in thread
From: Andrew Morton @ 2021-10-18 22:15 UTC (permalink / raw)
  To: akpm, david, geert+renesas, linux-mm, mm-commits, peng.fan, rppt,
	torvalds

From: Peng Fan <peng.fan@nxp.com>
Subject: memblock: check memory total_size

mem=[X][G|M] is broken on ARM64 platform, there are cases that even
type.cnt is 1, but total_size is not 0 because regions are merged into 1. 
So only check 'cnt' is not enough, total_size should be used, othersize
bootargs 'mem=[X][G|B]' not work anymore.

Link: https://lkml.kernel.org/r/20210930024437.32598-1-peng.fan@oss.nxp.com
Fixes: e888fa7bb882 ("memblock: Check memory add/cap ordering")
Signed-off-by: Peng Fan <peng.fan@nxp.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memblock.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/memblock.c~memblock-check-memory-total_size
+++ a/mm/memblock.c
@@ -1692,7 +1692,7 @@ void __init memblock_cap_memory_range(ph
 	if (!size)
 		return;
 
-	if (memblock.memory.cnt <= 1) {
+	if (!memblock_memory->total_size) {
 		pr_warn("%s: No memory registered yet\n", __func__);
 		return;
 	}
_


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [patch 09/19] mm/mempolicy: do not allow illegal MPOL_F_NUMA_BALANCING | MPOL_LOCAL in mbind()
  2021-10-18 22:14 incoming Andrew Morton
                   ` (7 preceding siblings ...)
  2021-10-18 22:15 ` [patch 08/19] memblock: check memory total_size Andrew Morton
@ 2021-10-18 22:15 ` Andrew Morton
  2021-10-18 22:15 ` [patch 10/19] mm, slub: fix two bugs in slab_debug_trace_open() Andrew Morton
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 20+ messages in thread
From: Andrew Morton @ 2021-10-18 22:15 UTC (permalink / raw)
  To: akpm, edumazet, linux-mm, mgorman, mm-commits, stable, syzkaller,
	torvalds, willy, ying.huang

From: Eric Dumazet <edumazet@google.com>
Subject: mm/mempolicy: do not allow illegal MPOL_F_NUMA_BALANCING | MPOL_LOCAL in mbind()

syzbot reported access to unitialized memory in mbind() [1]

Issue came with commit bda420b98505 ("numa balancing: migrate on fault
among multiple bound nodes")

This commit added a new bit in MPOL_MODE_FLAGS, but only checked valid
combination (MPOL_F_NUMA_BALANCING can only be used with MPOL_BIND) in
do_set_mempolicy()

This patch moves the check in sanitize_mpol_flags() so that it is also
used by mbind()

[1]
BUG: KMSAN: uninit-value in __mpol_equal+0x567/0x590 mm/mempolicy.c:2260
 __mpol_equal+0x567/0x590 mm/mempolicy.c:2260
 mpol_equal include/linux/mempolicy.h:105 [inline]
 vma_merge+0x4a1/0x1e60 mm/mmap.c:1190
 mbind_range+0xcc8/0x1e80 mm/mempolicy.c:811
 do_mbind+0xf42/0x15f0 mm/mempolicy.c:1333
 kernel_mbind mm/mempolicy.c:1483 [inline]
 __do_sys_mbind mm/mempolicy.c:1490 [inline]
 __se_sys_mbind+0x437/0xb80 mm/mempolicy.c:1486
 __x64_sys_mbind+0x19d/0x200 mm/mempolicy.c:1486
 do_syscall_x64 arch/x86/entry/common.c:51 [inline]
 do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Uninit was created at:
 slab_alloc_node mm/slub.c:3221 [inline]
 slab_alloc mm/slub.c:3230 [inline]
 kmem_cache_alloc+0x751/0xff0 mm/slub.c:3235
 mpol_new mm/mempolicy.c:293 [inline]
 do_mbind+0x912/0x15f0 mm/mempolicy.c:1289
 kernel_mbind mm/mempolicy.c:1483 [inline]
 __do_sys_mbind mm/mempolicy.c:1490 [inline]
 __se_sys_mbind+0x437/0xb80 mm/mempolicy.c:1486
 __x64_sys_mbind+0x19d/0x200 mm/mempolicy.c:1486
 do_syscall_x64 arch/x86/entry/common.c:51 [inline]
 do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82
 entry_SYSCALL_64_after_hwframe+0x44/0xae
=====================================================
Kernel panic - not syncing: panic_on_kmsan set ...
CPU: 0 PID: 15049 Comm: syz-executor.0 Tainted: G    B             5.15.0-rc2-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0x1ff/0x28e lib/dump_stack.c:106
 dump_stack+0x25/0x28 lib/dump_stack.c:113
 panic+0x44f/0xdeb kernel/panic.c:232
 kmsan_report+0x2ee/0x300 mm/kmsan/report.c:186
 __msan_warning+0xd7/0x150 mm/kmsan/instrumentation.c:208
 __mpol_equal+0x567/0x590 mm/mempolicy.c:2260
 mpol_equal include/linux/mempolicy.h:105 [inline]
 vma_merge+0x4a1/0x1e60 mm/mmap.c:1190
 mbind_range+0xcc8/0x1e80 mm/mempolicy.c:811
 do_mbind+0xf42/0x15f0 mm/mempolicy.c:1333
 kernel_mbind mm/mempolicy.c:1483 [inline]
 __do_sys_mbind mm/mempolicy.c:1490 [inline]
 __se_sys_mbind+0x437/0xb80 mm/mempolicy.c:1486
 __x64_sys_mbind+0x19d/0x200 mm/mempolicy.c:1486
 do_syscall_x64 arch/x86/entry/common.c:51 [inline]
 do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f4a41b2c709
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f4a3f0a3188 EFLAGS: 00000246 ORIG_RAX: 00000000000000ed
RAX: ffffffffffffffda RBX: 00007f4a41c30f60 RCX: 00007f4a41b2c709
RDX: 0000000000002001 RSI: 0000000000c00007 RDI: 0000000020012000
RBP: 00007f4a41b86cb4 R08: 0000000000000000 R09: 0000010000000002
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007f4a42164b2f R14: 00007f4a3f0a3300 R15: 0000000000022000

Link: https://lkml.kernel.org/r/20211001215630.810592-1-eric.dumazet@gmail.com
Fixes: bda420b98505 ("numa balancing: migrate on fault among multiple bound nodes")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/mempolicy.c |   16 +++++-----------
 1 file changed, 5 insertions(+), 11 deletions(-)

--- a/mm/mempolicy.c~mm-mempolicy-do-not-allow-illegal-mpol_f_numa_balancing-mpol_local-in-mbind
+++ a/mm/mempolicy.c
@@ -856,16 +856,6 @@ static long do_set_mempolicy(unsigned sh
 		goto out;
 	}
 
-	if (flags & MPOL_F_NUMA_BALANCING) {
-		if (new && new->mode == MPOL_BIND) {
-			new->flags |= (MPOL_F_MOF | MPOL_F_MORON);
-		} else {
-			ret = -EINVAL;
-			mpol_put(new);
-			goto out;
-		}
-	}
-
 	ret = mpol_set_nodemask(new, nodes, scratch);
 	if (ret) {
 		mpol_put(new);
@@ -1458,7 +1448,11 @@ static inline int sanitize_mpol_flags(in
 		return -EINVAL;
 	if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES))
 		return -EINVAL;
-
+	if (*flags & MPOL_F_NUMA_BALANCING) {
+		if (*mode != MPOL_BIND)
+			return -EINVAL;
+		*flags |= (MPOL_F_MOF | MPOL_F_MORON);
+	}
 	return 0;
 }
 
_


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [patch 10/19] mm, slub: fix two bugs in slab_debug_trace_open()
  2021-10-18 22:14 incoming Andrew Morton
                   ` (8 preceding siblings ...)
  2021-10-18 22:15 ` [patch 09/19] mm/mempolicy: do not allow illegal MPOL_F_NUMA_BALANCING | MPOL_LOCAL in mbind() Andrew Morton
@ 2021-10-18 22:15 ` Andrew Morton
  2021-10-18 22:15 ` [patch 11/19] mm, slub: fix mismatch between reconstructed freelist depth and cnt Andrew Morton
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 20+ messages in thread
From: Andrew Morton @ 2021-10-18 22:15 UTC (permalink / raw)
  To: akpm, andreyknvl, bharata, cl, faiyazm, gregkh, guro,
	iamjoonsoo.kim, keescook, linmiaohe, linux-mm, mm-commits,
	penberg, rientjes, ryabinin.a.a, stable, torvalds, vbabka

From: Miaohe Lin <linmiaohe@huawei.com>
Subject: mm, slub: fix two bugs in slab_debug_trace_open()

Patch series "Fixups for slub".

This series contains various bug fixes for slub.  We fix memoryleak,
use-afer-free, NULL pointer dereferencing and so on in slub.  More details
can be found in the respective changelogs.


This patch (of 5):

It's possible that __seq_open_private() will return NULL.  So we should
check it before using lest dereferencing NULL pointer.  And in error
paths, we forgot to release private buffer via seq_release_private(). 
Memory will leak in these paths.

Link: https://lkml.kernel.org/r/20210916123920.48704-1-linmiaohe@huawei.com
Link: https://lkml.kernel.org/r/20210916123920.48704-2-linmiaohe@huawei.com
Fixes: 64dd68497be7 ("mm: slub: move sysfs slab alloc/free interfaces to debugfs")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Faiyaz Mohammed <faiyazm@codeaurora.org>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Bharata B Rao <bharata@linux.ibm.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slub.c |    8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

--- a/mm/slub.c~mm-slub-fix-two-bugs-in-slab_debug_trace_open
+++ a/mm/slub.c
@@ -6108,9 +6108,14 @@ static int slab_debug_trace_open(struct
 	struct kmem_cache *s = file_inode(filep)->i_private;
 	unsigned long *obj_map;
 
+	if (!t)
+		return -ENOMEM;
+
 	obj_map = bitmap_alloc(oo_objects(s->oo), GFP_KERNEL);
-	if (!obj_map)
+	if (!obj_map) {
+		seq_release_private(inode, filep);
 		return -ENOMEM;
+	}
 
 	if (strcmp(filep->f_path.dentry->d_name.name, "alloc_traces") == 0)
 		alloc = TRACK_ALLOC;
@@ -6119,6 +6124,7 @@ static int slab_debug_trace_open(struct
 
 	if (!alloc_loc_track(t, PAGE_SIZE / sizeof(struct location), GFP_KERNEL)) {
 		bitmap_free(obj_map);
+		seq_release_private(inode, filep);
 		return -ENOMEM;
 	}
 
_


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [patch 11/19] mm, slub: fix mismatch between reconstructed freelist depth and cnt
  2021-10-18 22:14 incoming Andrew Morton
                   ` (9 preceding siblings ...)
  2021-10-18 22:15 ` [patch 10/19] mm, slub: fix two bugs in slab_debug_trace_open() Andrew Morton
@ 2021-10-18 22:15 ` Andrew Morton
  2021-10-18 22:15 ` [patch 12/19] mm, slub: fix potential memoryleak in kmem_cache_open() Andrew Morton
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 20+ messages in thread
From: Andrew Morton @ 2021-10-18 22:15 UTC (permalink / raw)
  To: akpm, andreyknvl, bharata, cl, faiyazm, gregkh, guro,
	iamjoonsoo.kim, keescook, linmiaohe, linux-mm, mm-commits,
	penberg, rientjes, ryabinin.a.a, stable, torvalds, vbabka

From: Miaohe Lin <linmiaohe@huawei.com>
Subject: mm, slub: fix mismatch between reconstructed freelist depth and cnt

If object's reuse is delayed, it will be excluded from the reconstructed
freelist.  But we forgot to adjust the cnt accordingly.  So there will be
a mismatch between reconstructed freelist depth and cnt.  This will lead
to free_debug_processing() complaining about freelist count or a incorrect
slub inuse count.

Link: https://lkml.kernel.org/r/20210916123920.48704-3-linmiaohe@huawei.com
Fixes: c3895391df38 ("kasan, slub: fix handling of kasan_slab_free hook")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Bharata B Rao <bharata@linux.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Faiyaz Mohammed <faiyazm@codeaurora.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slub.c |   11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

--- a/mm/slub.c~mm-slub-fix-mismatch-between-reconstructed-freelist-depth-and-cnt
+++ a/mm/slub.c
@@ -1701,7 +1701,8 @@ static __always_inline bool slab_free_ho
 }
 
 static inline bool slab_free_freelist_hook(struct kmem_cache *s,
-					   void **head, void **tail)
+					   void **head, void **tail,
+					   int *cnt)
 {
 
 	void *object;
@@ -1728,6 +1729,12 @@ static inline bool slab_free_freelist_ho
 			*head = object;
 			if (!*tail)
 				*tail = object;
+		} else {
+			/*
+			 * Adjust the reconstructed freelist depth
+			 * accordingly if object's reuse is delayed.
+			 */
+			--(*cnt);
 		}
 	} while (object != old_tail);
 
@@ -3480,7 +3487,7 @@ static __always_inline void slab_free(st
 	 * With KASAN enabled slab_free_freelist_hook modifies the freelist
 	 * to remove objects, whose reuse must be delayed.
 	 */
-	if (slab_free_freelist_hook(s, &head, &tail))
+	if (slab_free_freelist_hook(s, &head, &tail, &cnt))
 		do_slab_free(s, page, head, tail, cnt, addr);
 }
 
_


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [patch 12/19] mm, slub: fix potential memoryleak in kmem_cache_open()
  2021-10-18 22:14 incoming Andrew Morton
                   ` (10 preceding siblings ...)
  2021-10-18 22:15 ` [patch 11/19] mm, slub: fix mismatch between reconstructed freelist depth and cnt Andrew Morton
@ 2021-10-18 22:15 ` Andrew Morton
  2021-10-18 22:16 ` [patch 13/19] mm, slub: fix potential use-after-free in slab_debugfs_fops Andrew Morton
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 20+ messages in thread
From: Andrew Morton @ 2021-10-18 22:15 UTC (permalink / raw)
  To: akpm, andreyknvl, bharata, cl, faiyazm, gregkh, guro,
	iamjoonsoo.kim, keescook, linmiaohe, linux-mm, mm-commits,
	penberg, rientjes, ryabinin.a.a, stable, torvalds, vbabka

From: Miaohe Lin <linmiaohe@huawei.com>
Subject: mm, slub: fix potential memoryleak in kmem_cache_open()

In error path, the random_seq of slub cache might be leaked.  Fix this by
using __kmem_cache_release() to release all the relevant resources.

Link: https://lkml.kernel.org/r/20210916123920.48704-4-linmiaohe@huawei.com
Fixes: 210e7a43fa90 ("mm: SLUB freelist randomization")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Bharata B Rao <bharata@linux.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Faiyaz Mohammed <faiyazm@codeaurora.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slub.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/slub.c~mm-slub-fix-potential-memoryleak-in-kmem_cache_open
+++ a/mm/slub.c
@@ -4210,8 +4210,8 @@ static int kmem_cache_open(struct kmem_c
 	if (alloc_kmem_cache_cpus(s))
 		return 0;
 
-	free_kmem_cache_nodes(s);
 error:
+	__kmem_cache_release(s);
 	return -EINVAL;
 }
 
_


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [patch 13/19] mm, slub: fix potential use-after-free in slab_debugfs_fops
  2021-10-18 22:14 incoming Andrew Morton
                   ` (11 preceding siblings ...)
  2021-10-18 22:15 ` [patch 12/19] mm, slub: fix potential memoryleak in kmem_cache_open() Andrew Morton
@ 2021-10-18 22:16 ` Andrew Morton
  2021-10-18 22:16 ` [patch 14/19] mm, slub: fix incorrect memcg slab count for bulk free Andrew Morton
                   ` (5 subsequent siblings)
  18 siblings, 0 replies; 20+ messages in thread
From: Andrew Morton @ 2021-10-18 22:16 UTC (permalink / raw)
  To: akpm, andreyknvl, bharata, cl, faiyazm, gregkh, guro,
	iamjoonsoo.kim, keescook, linmiaohe, linux-mm, mm-commits,
	penberg, rientjes, ryabinin.a.a, stable, torvalds, vbabka

From: Miaohe Lin <linmiaohe@huawei.com>
Subject: mm, slub: fix potential use-after-free in slab_debugfs_fops

When sysfs_slab_add failed, we shouldn't call debugfs_slab_add() for s
because s will be freed soon.  And slab_debugfs_fops will use s later
leading to a use-after-free.

Link: https://lkml.kernel.org/r/20210916123920.48704-5-linmiaohe@huawei.com
Fixes: 64dd68497be7 ("mm: slub: move sysfs slab alloc/free interfaces to debugfs")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Bharata B Rao <bharata@linux.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Faiyaz Mohammed <faiyazm@codeaurora.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slub.c |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

--- a/mm/slub.c~mm-slub-fix-potential-use-after-free-in-slab_debugfs_fops
+++ a/mm/slub.c
@@ -4887,13 +4887,15 @@ int __kmem_cache_create(struct kmem_cach
 		return 0;
 
 	err = sysfs_slab_add(s);
-	if (err)
+	if (err) {
 		__kmem_cache_release(s);
+		return err;
+	}
 
 	if (s->flags & SLAB_STORE_USER)
 		debugfs_slab_add(s);
 
-	return err;
+	return 0;
 }
 
 void *__kmalloc_track_caller(size_t size, gfp_t gfpflags, unsigned long caller)
_


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [patch 14/19] mm, slub: fix incorrect memcg slab count for bulk free
  2021-10-18 22:14 incoming Andrew Morton
                   ` (12 preceding siblings ...)
  2021-10-18 22:16 ` [patch 13/19] mm, slub: fix potential use-after-free in slab_debugfs_fops Andrew Morton
@ 2021-10-18 22:16 ` Andrew Morton
  2021-10-18 22:16 ` [patch 15/19] elfcore: correct reference to CONFIG_UML Andrew Morton
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 20+ messages in thread
From: Andrew Morton @ 2021-10-18 22:16 UTC (permalink / raw)
  To: akpm, andreyknvl, bharata, cl, faiyazm, gregkh, guro,
	iamjoonsoo.kim, keescook, linmiaohe, linux-mm, mm-commits,
	penberg, rientjes, ryabinin.a.a, stable, torvalds, vbabka

From: Miaohe Lin <linmiaohe@huawei.com>
Subject: mm, slub: fix incorrect memcg slab count for bulk free

kmem_cache_free_bulk() will call memcg_slab_free_hook() for all objects
when doing bulk free.  So we shouldn't call memcg_slab_free_hook() again
for bulk free to avoid incorrect memcg slab count.

Link: https://lkml.kernel.org/r/20210916123920.48704-6-linmiaohe@huawei.com
Fixes: d1b2cf6cb84a ("mm: memcg/slab: uncharge during kmem_cache_free_bulk()")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Bharata B Rao <bharata@linux.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Faiyaz Mohammed <faiyazm@codeaurora.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slub.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

--- a/mm/slub.c~mm-slub-fix-incorrect-memcg-slab-count-for-bulk-free
+++ a/mm/slub.c
@@ -3420,7 +3420,9 @@ static __always_inline void do_slab_free
 	struct kmem_cache_cpu *c;
 	unsigned long tid;
 
-	memcg_slab_free_hook(s, &head, 1);
+	/* memcg_slab_free_hook() is already called for bulk free. */
+	if (!tail)
+		memcg_slab_free_hook(s, &head, 1);
 redo:
 	/*
 	 * Determine the currently cpus per cpu slab.
_


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [patch 15/19] elfcore: correct reference to CONFIG_UML
  2021-10-18 22:14 incoming Andrew Morton
                   ` (13 preceding siblings ...)
  2021-10-18 22:16 ` [patch 14/19] mm, slub: fix incorrect memcg slab count for bulk free Andrew Morton
@ 2021-10-18 22:16 ` Andrew Morton
  2021-10-18 22:16 ` [patch 16/19] vfs: check fd has read access in kernel_read_file_from_fd() Andrew Morton
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 20+ messages in thread
From: Andrew Morton @ 2021-10-18 22:16 UTC (permalink / raw)
  To: akpm, arnd, brho, catalin.marinas, linux-mm, lukas.bulwahn,
	mm-commits, nathan, ndesaulniers, stable, torvalds

From: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Subject: elfcore: correct reference to CONFIG_UML

Commit 6e7b64b9dd6d ("elfcore: fix building with clang") introduces
special handling for two architectures, ia64 and User Mode Linux. 
However, the wrong name, i.e., CONFIG_UM, for the intended Kconfig symbol
for User-Mode Linux was used.

Although the directory for User Mode Linux is ./arch/um; the Kconfig
symbol for this architecture is called CONFIG_UML.

Luckily, ./scripts/checkkconfigsymbols.py warns on non-existing configs:

UM
Referencing files: include/linux/elfcore.h
Similar symbols: UML, NUMA

Correct the name of the config to the intended one.

[akpm@linux-foundation.org: fix um/x86_64, per Catalin]
  Link: https://lkml.kernel.org/r/20211006181119.2851441-1-catalin.marinas@arm.com
  Link: https://lkml.kernel.org/r/YV6pejGzLy5ppEpt@arm.com
Link: https://lkml.kernel.org/r/20211006082209.417-1-lukas.bulwahn@gmail.com
Fixes: 6e7b64b9dd6d ("elfcore: fix building with clang")
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Barret Rhoden <brho@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/elfcore.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/include/linux/elfcore.h~elfcore-correct-reference-to-config_uml
+++ a/include/linux/elfcore.h
@@ -109,7 +109,7 @@ static inline int elf_core_copy_task_fpr
 #endif
 }
 
-#if defined(CONFIG_UM) || defined(CONFIG_IA64)
+#if (defined(CONFIG_UML) && defined(CONFIG_X86_32)) || defined(CONFIG_IA64)
 /*
  * These functions parameterize elf_core_dump in fs/binfmt_elf.c to write out
  * extra segments containing the gate DSO contents.  Dumping its
_


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [patch 16/19] vfs: check fd has read access in kernel_read_file_from_fd()
  2021-10-18 22:14 incoming Andrew Morton
                   ` (14 preceding siblings ...)
  2021-10-18 22:16 ` [patch 15/19] elfcore: correct reference to CONFIG_UML Andrew Morton
@ 2021-10-18 22:16 ` Andrew Morton
  2021-10-18 22:16 ` [patch 17/19] mm/secretmem: fix NULL page->mapping dereference in page_is_secretmem() Andrew Morton
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 20+ messages in thread
From: Andrew Morton @ 2021-10-18 22:16 UTC (permalink / raw)
  To: akpm, christian.brauner, keescook, linux-mm, mm-commits, stable,
	sunhao.th, torvalds, viro, willy, zohar

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Subject: vfs: check fd has read access in kernel_read_file_from_fd()

If we open a file without read access and then pass the fd to a syscall
whose implementation calls kernel_read_file_from_fd(), we get a warning
from __kernel_read():

        if (WARN_ON_ONCE(!(file->f_mode & FMODE_READ)))

This currently affects both finit_module() and kexec_file_load(), but it
could affect other syscalls in the future.

Link: https://lkml.kernel.org/r/20211007220110.600005-1-willy@infradead.org
Fixes: b844f0ecbc56 ("vfs: define kernel_copy_file_from_fd()")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reported-by: Hao Sun <sunhao.th@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Mimi Zohar <zohar@linux.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/kernel_read_file.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/fs/kernel_read_file.c~vfs-check-fd-has-read-access-in-kernel_read_file_from_fd
+++ a/fs/kernel_read_file.c
@@ -178,7 +178,7 @@ int kernel_read_file_from_fd(int fd, lof
 	struct fd f = fdget(fd);
 	int ret = -EBADF;
 
-	if (!f.file)
+	if (!f.file || !(f.file->f_mode & FMODE_READ))
 		goto out;
 
 	ret = kernel_read_file(f.file, offset, buf, buf_size, file_size, id);
_


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [patch 17/19] mm/secretmem: fix NULL page->mapping dereference in page_is_secretmem()
  2021-10-18 22:14 incoming Andrew Morton
                   ` (15 preceding siblings ...)
  2021-10-18 22:16 ` [patch 16/19] vfs: check fd has read access in kernel_read_file_from_fd() Andrew Morton
@ 2021-10-18 22:16 ` Andrew Morton
  2021-10-18 22:16 ` [patch 18/19] mm/thp: decrease nr_thps in file's mapping on THP split Andrew Morton
  2021-10-18 22:16 ` [patch 19/19] mailmap: add Andrej Shadura Andrew Morton
  18 siblings, 0 replies; 20+ messages in thread
From: Andrew Morton @ 2021-10-18 22:16 UTC (permalink / raw)
  To: akpm, david, djwong, linux-mm, mm-commits, rppt, seanjc, stable,
	stephenackerman16, torvalds

From: Sean Christopherson <seanjc@google.com>
Subject: mm/secretmem: fix NULL page->mapping dereference in page_is_secretmem()

Check for a NULL page->mapping before dereferencing the mapping in
page_is_secretmem(), as the page's mapping can be nullified while gup() is
running, e.g.  by reclaim or truncation.

  BUG: kernel NULL pointer dereference, address: 0000000000000068
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0
  Oops: 0000 [#1] PREEMPT SMP NOPTI
  CPU: 6 PID: 4173897 Comm: CPU 3/KVM Tainted: G        W
  RIP: 0010:internal_get_user_pages_fast+0x621/0x9d0
  Code: <48> 81 7a 68 80 08 04 bc 0f 85 21 ff ff 8 89 c7 be
  RSP: 0018:ffffaa90087679b0 EFLAGS: 00010046
  RAX: ffffe3f37905b900 RBX: 00007f2dd561e000 RCX: ffffe3f37905b934
  RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffe3f37905b900
  ...
  CR2: 0000000000000068 CR3: 00000004c5898003 CR4: 00000000001726e0
  Call Trace:
   get_user_pages_fast_only+0x13/0x20
   hva_to_pfn+0xa9/0x3e0
   try_async_pf+0xa1/0x270
   direct_page_fault+0x113/0xad0
   kvm_mmu_page_fault+0x69/0x680
   vmx_handle_exit+0xe1/0x5d0
   kvm_arch_vcpu_ioctl_run+0xd81/0x1c70
   kvm_vcpu_ioctl+0x267/0x670
   __x64_sys_ioctl+0x83/0xa0
   do_syscall_64+0x56/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xae

Link: https://lkml.kernel.org/r/20211007231502.3552715-1-seanjc@google.com
Fixes: 1507f51255c9 ("mm: introduce memfd_secret system call to create "secret" memory areas")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reported-by: Darrick J. Wong <djwong@kernel.org>
Reported-by: Stephen <stephenackerman16@gmail.com>
Tested-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/secretmem.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/include/linux/secretmem.h~mm-fix-null-page-mapping-dereference-in-page_is_secretmem
+++ a/include/linux/secretmem.h
@@ -23,7 +23,7 @@ static inline bool page_is_secretmem(str
 	mapping = (struct address_space *)
 		((unsigned long)page->mapping & ~PAGE_MAPPING_FLAGS);
 
-	if (mapping != page->mapping)
+	if (!mapping || mapping != page->mapping)
 		return false;
 
 	return mapping->a_ops == &secretmem_aops;
_


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [patch 18/19] mm/thp: decrease nr_thps in file's mapping on THP split
  2021-10-18 22:14 incoming Andrew Morton
                   ` (16 preceding siblings ...)
  2021-10-18 22:16 ` [patch 17/19] mm/secretmem: fix NULL page->mapping dereference in page_is_secretmem() Andrew Morton
@ 2021-10-18 22:16 ` Andrew Morton
  2021-10-18 22:16 ` [patch 19/19] mailmap: add Andrej Shadura Andrew Morton
  18 siblings, 0 replies; 20+ messages in thread
From: Andrew Morton @ 2021-10-18 22:16 UTC (permalink / raw)
  To: akpm, hannes, hdanton, hughd, kirill.shutemov, linux-mm,
	m.szyprowski, mm-commits, oleg, riel, sfoon.kim, shy828301,
	songliubraving, torvalds, william.kucharski, willy

From: Marek Szyprowski <m.szyprowski@samsung.com>
Subject: mm/thp: decrease nr_thps in file's mapping on THP split

Decrease nr_thps counter in file's mapping to ensure that the page cache
won't be dropped excessively on file write access if page has been
already split.

I've tried a test scenario, where one runs a big binary, kernel remaps it
with THPs, then one forces THP split with
/sys/kernel/debug/split_huge_pages.  During any further open of that
binary with O_RDWR or O_WRITEONLY kernel drops page cache for it, because
of non-zero thps counter.

Link: https://lkml.kernel.org/r/20211012120237.2600-1-m.szyprowski@samsung.com
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Fixes: 09d91cda0e82 ("mm,thp: avoid writes to file with THP in pagecache")
Fixes: 06d3eff62d9d ("mm/thp: fix node page state in split_huge_page_to_list()")
Acked-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: <sfoon.kim@samsung.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/huge_memory.c |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

--- a/mm/huge_memory.c~mm-thp-decrease-nr_thps-in-files-mapping-on-thp-split
+++ a/mm/huge_memory.c
@@ -2700,12 +2700,14 @@ int split_huge_page_to_list(struct page
 		if (mapping) {
 			int nr = thp_nr_pages(head);
 
-			if (PageSwapBacked(head))
+			if (PageSwapBacked(head)) {
 				__mod_lruvec_page_state(head, NR_SHMEM_THPS,
 							-nr);
-			else
+			} else {
 				__mod_lruvec_page_state(head, NR_FILE_THPS,
 							-nr);
+				filemap_nr_thps_dec(mapping);
+			}
 		}
 
 		__split_huge_page(page, list, end);
_


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [patch 19/19] mailmap: add Andrej Shadura
  2021-10-18 22:14 incoming Andrew Morton
                   ` (17 preceding siblings ...)
  2021-10-18 22:16 ` [patch 18/19] mm/thp: decrease nr_thps in file's mapping on THP split Andrew Morton
@ 2021-10-18 22:16 ` Andrew Morton
  18 siblings, 0 replies; 20+ messages in thread
From: Andrew Morton @ 2021-10-18 22:16 UTC (permalink / raw)
  To: akpm, andrew.shadura, linux-mm, mm-commits, torvalds

From: Andrej Shadura <andrew.shadura@collabora.co.uk>
Subject: mailmap: add Andrej Shadura

Add a mapping for my old work email for BelDisplayTech to the personal
email, and make sure the Collabora email has the correct spelling of the
first name.

Link: https://lkml.kernel.org/r/20210917091016.30232-1-andrew.shadura@collabora.co.uk
Signed-off-by: Andrej Shadura <andrew.shadura@collabora.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 .mailmap |    2 ++
 1 file changed, 2 insertions(+)

--- a/.mailmap~mailmap-add-andrej-shadura
+++ a/.mailmap
@@ -33,6 +33,8 @@ Al Viro <viro@zenIV.linux.org.uk>
 Andi Kleen <ak@linux.intel.com> <ak@suse.de>
 Andi Shyti <andi@etezian.org> <andi.shyti@samsung.com>
 Andreas Herrmann <aherrman@de.ibm.com>
+Andrej Shadura <andrew.shadura@collabora.co.uk>
+Andrej Shadura <andrew@shadura.me> <andrew@beldisplaytech.com>
 Andrew Morton <akpm@linux-foundation.org>
 Andrew Murray <amurray@thegoodpenguin.co.uk> <amurray@embedded-bits.co.uk>
 Andrew Murray <amurray@thegoodpenguin.co.uk> <andrew.murray@arm.com>
_


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2021-10-18 22:16 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-18 22:14 incoming Andrew Morton
2021-10-18 22:15 ` [patch 01/19] mm/userfaultfd: selftests: fix memory corruption with thp enabled Andrew Morton
2021-10-18 22:15 ` [patch 02/19] userfaultfd: fix a race between writeprotect and exit_mmap() Andrew Morton
2021-10-18 22:15 ` [patch 03/19] mm/migrate: optimize hotplug-time demotion order updates Andrew Morton
2021-10-18 22:15 ` [patch 04/19] mm/migrate: add CPU hotplug to demotion #ifdef Andrew Morton
2021-10-18 22:15 ` [patch 05/19] mm/migrate: fix CPUHP state to update node demotion order Andrew Morton
2021-10-18 22:15 ` [patch 06/19] ocfs2: fix data corruption after conversion from inline format Andrew Morton
2021-10-18 22:15 ` [patch 07/19] ocfs2: mount fails with buffer overflow in strlen Andrew Morton
2021-10-18 22:15 ` [patch 08/19] memblock: check memory total_size Andrew Morton
2021-10-18 22:15 ` [patch 09/19] mm/mempolicy: do not allow illegal MPOL_F_NUMA_BALANCING | MPOL_LOCAL in mbind() Andrew Morton
2021-10-18 22:15 ` [patch 10/19] mm, slub: fix two bugs in slab_debug_trace_open() Andrew Morton
2021-10-18 22:15 ` [patch 11/19] mm, slub: fix mismatch between reconstructed freelist depth and cnt Andrew Morton
2021-10-18 22:15 ` [patch 12/19] mm, slub: fix potential memoryleak in kmem_cache_open() Andrew Morton
2021-10-18 22:16 ` [patch 13/19] mm, slub: fix potential use-after-free in slab_debugfs_fops Andrew Morton
2021-10-18 22:16 ` [patch 14/19] mm, slub: fix incorrect memcg slab count for bulk free Andrew Morton
2021-10-18 22:16 ` [patch 15/19] elfcore: correct reference to CONFIG_UML Andrew Morton
2021-10-18 22:16 ` [patch 16/19] vfs: check fd has read access in kernel_read_file_from_fd() Andrew Morton
2021-10-18 22:16 ` [patch 17/19] mm/secretmem: fix NULL page->mapping dereference in page_is_secretmem() Andrew Morton
2021-10-18 22:16 ` [patch 18/19] mm/thp: decrease nr_thps in file's mapping on THP split Andrew Morton
2021-10-18 22:16 ` [patch 19/19] mailmap: add Andrej Shadura Andrew Morton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).