stable.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [patch 02/15] ipc: WARN if trying to remove ipc object which is absent
       [not found] <20211119164248.50feee07c5d2cc6cc4addf97@linux-foundation.org>
@ 2021-11-20  0:43 ` Andrew Morton
  2021-11-20  0:43 ` [patch 03/15] shm: extend forced shm destroy to support objects from several IPC nses Andrew Morton
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 10+ messages in thread
From: Andrew Morton @ 2021-11-20  0:43 UTC (permalink / raw)
  To: akpm, alexander.mikhalitsyn, avagin, dave, ebiederm, gregkh,
	linux-mm, manfred, mm-commits, ptikhomirov, stable, torvalds,
	vvs

From: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Subject: ipc: WARN if trying to remove ipc object which is absent

Patch series "shm: shm_rmid_forced feature fixes".

Some time ago I met kernel crash after CRIU restore procedure,
fortunately, it was CRIU restore, so, I had dump files and could do
restore many times and crash reproduced easily.  After some investigation
I've constructed the minimal reproducer.  It was found that it's
use-after-free and it happens only if sysctl kernel.shm_rmid_forced = 1.

The key of the problem is that the exit_shm() function not handles shp's
object destroy when task->sysvshm.shm_clist contains items from different
IPC namespaces.  In most cases this list will contain only items from one
IPC namespace.

Why this list may contain object from different namespaces?  Function
exit_shm() designed to clean up this list always when process leaves IPC
namespace.  But we made a mistake a long time ago and not add exit_shm()
call into setns() syscall procedures.  1st second idea was just to add
this call to setns() syscall but it's obviously changes semantics of
setns() syscall and that's userspace-visible change.  So, I gave up this
idea.

First real attempt to address the issue was just to omit forced destroy if
we meet shp object not from current task IPC namespace [1].  But that was
not the best idea because task->sysvshm.shm_clist was protected by rwsem
which belongs to current task IPC namespace.  It means that list
corruption may occur.

Second approach is just extend exit_shm() to properly handle shp's from
different IPC namespaces [2].  This is really non-trivial thing, I've put
a lot of effort into that but not believed that it's possible to make it
fully safe, clean and clear.

Thanks to the efforts of Manfred Spraul working an elegant solution was
designed.  Thanks a lot, Manfred!

Eric also suggested the way to address the issue in ("[RFC][PATCH] shm: In
shm_exit destroy all created and never attached segments") Eric's idea was
to maintain a list of shm_clists one per IPC namespace, use lock-less
lists.  But there is some extra memory consumption-related concerns.

Alternative solution which was suggested by me was implemented in ("shm:
reset shm_clist on setns but omit forced shm destroy") Idea is pretty
simple, we add exit_shm() syscall to setns() but DO NOT destroy shm
segments even if sysctl kernel.shm_rmid_forced = 1, we just clean up the
task->sysvshm.shm_clist list.  This chages semantics of setns() syscall a
little bit but in comparision to "naive" solution when we just add
exit_shm() without any special exclusions this looks like a safer option.

[1] https://lkml.org/lkml/2021/7/6/1108
[2] https://lkml.org/lkml/2021/7/14/736


This patch (of 2):

Let's produce a warning if we trying to remove non-existing IPC object
from IPC namespace kht/idr structures.

This allows to catch possible bugs when ipc_rmid() function was called
with inconsistent struct ipc_ids*, struct kern_ipc_perm* arguments.

Link: https://lkml.kernel.org/r/20211027224348.611025-1-alexander.mikhalitsyn@virtuozzo.com
Link: https://lkml.kernel.org/r/20211027224348.611025-2-alexander.mikhalitsyn@virtuozzo.com
Co-developed-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Cc: Vasily Averin <vvs@virtuozzo.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 ipc/util.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

--- a/ipc/util.c~ipc-warn-if-trying-to-remove-ipc-object-which-is-absent
+++ a/ipc/util.c
@@ -447,8 +447,8 @@ static int ipcget_public(struct ipc_name
 static void ipc_kht_remove(struct ipc_ids *ids, struct kern_ipc_perm *ipcp)
 {
 	if (ipcp->key != IPC_PRIVATE)
-		rhashtable_remove_fast(&ids->key_ht, &ipcp->khtnode,
-				       ipc_kht_params);
+		WARN_ON_ONCE(rhashtable_remove_fast(&ids->key_ht, &ipcp->khtnode,
+				       ipc_kht_params));
 }
 
 /**
@@ -498,7 +498,7 @@ void ipc_rmid(struct ipc_ids *ids, struc
 {
 	int idx = ipcid_to_idx(ipcp->id);
 
-	idr_remove(&ids->ipcs_idr, idx);
+	WARN_ON_ONCE(idr_remove(&ids->ipcs_idr, idx) != ipcp);
 	ipc_kht_remove(ids, ipcp);
 	ids->in_use--;
 	ipcp->deleted = true;
_

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [patch 03/15] shm: extend forced shm destroy to support objects from several IPC nses
       [not found] <20211119164248.50feee07c5d2cc6cc4addf97@linux-foundation.org>
  2021-11-20  0:43 ` [patch 02/15] ipc: WARN if trying to remove ipc object which is absent Andrew Morton
@ 2021-11-20  0:43 ` Andrew Morton
  2021-11-20  0:43 ` [patch 05/15] hexagon: export raw I/O routines for modules Andrew Morton
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 10+ messages in thread
From: Andrew Morton @ 2021-11-20  0:43 UTC (permalink / raw)
  To: akpm, alexander.mikhalitsyn, avagin, dave, ebiederm, gregkh,
	linux-mm, manfred, mm-commits, ptikhomirov, stable, torvalds,
	vvs

From: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Subject: shm: extend forced shm destroy to support objects from several IPC nses

Currently, exit_shm function not designed to work properly when
task->sysvshm.shm_clist holds shm objects from different IPC namespaces.

This is a real pain when sysctl kernel.shm_rmid_forced = 1, because it
leads to use-after-free (reproducer exists).

That particular patch is attempt to fix the problem by extending exit_shm
mechanism to handle shm's destroy from several IPC ns'es.

To achieve that we do several things:

1. add namespace (non-refcounted) pointer to the struct shmid_kernel

2. during new shm object creation (newseg()/shmget syscall) we
   initialize this pointer by current task IPC ns

3. exit_shm() fully reworked such that it traverses over all shp's in
   task->sysvshm.shm_clist and gets IPC namespace not from current task as
   it was before but from shp's object itself, then call shm_destroy(shp,
   ns).

Note.  We need to be really careful here, because as it was said before
(1), our pointer to IPC ns non-refcnt'ed.  To be on the safe side we using
special helper get_ipc_ns_not_zero() which allows to get IPC ns refcounter
only if IPC ns not in the "state of destruction".

Q/A

Q: Why we can access shp->ns memory using non-refcounted pointer?
A: Because shp object lifetime is always shorther
than IPC namespace lifetime, so, if we get shp object from the
task->sysvshm.shm_clist while holding task_lock(task) nobody can
steal our namespace.

Q: Does this patch change semantics of unshare/setns/clone syscalls?
A: Not. It's just fixes non-covered case when process may leave
IPC namespace without getting task->sysvshm.shm_clist list cleaned up.

Link: https://lkml.kernel.org/r/67bb03e5-f79c-1815-e2bf-949c67047418@colorfullife.com
Link: https://lkml.kernel.org/r/20211109151501.4921-1-manfred@colorfullife.com
Fixes: ab602f79915 ("shm: make exit_shm work proportional to task activity")
Co-developed-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Cc: Vasily Averin <vvs@virtuozzo.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/ipc_namespace.h |   15 ++
 include/linux/sched/task.h    |    2 
 ipc/shm.c                     |  189 ++++++++++++++++++++++++--------
 3 files changed, 159 insertions(+), 47 deletions(-)

--- a/include/linux/ipc_namespace.h~shm-extend-forced-shm-destroy-to-support-objects-from-several-ipc-nses-simplified
+++ a/include/linux/ipc_namespace.h
@@ -131,6 +131,16 @@ static inline struct ipc_namespace *get_
 	return ns;
 }
 
+static inline struct ipc_namespace *get_ipc_ns_not_zero(struct ipc_namespace *ns)
+{
+	if (ns) {
+		if (refcount_inc_not_zero(&ns->ns.count))
+			return ns;
+	}
+
+	return NULL;
+}
+
 extern void put_ipc_ns(struct ipc_namespace *ns);
 #else
 static inline struct ipc_namespace *copy_ipcs(unsigned long flags,
@@ -146,6 +156,11 @@ static inline struct ipc_namespace *get_
 {
 	return ns;
 }
+
+static inline struct ipc_namespace *get_ipc_ns_not_zero(struct ipc_namespace *ns)
+{
+	return ns;
+}
 
 static inline void put_ipc_ns(struct ipc_namespace *ns)
 {
--- a/include/linux/sched/task.h~shm-extend-forced-shm-destroy-to-support-objects-from-several-ipc-nses-simplified
+++ a/include/linux/sched/task.h
@@ -158,7 +158,7 @@ static inline struct vm_struct *task_sta
  * Protects ->fs, ->files, ->mm, ->group_info, ->comm, keyring
  * subscriptions and synchronises with wait4().  Also used in procfs.  Also
  * pins the final release of task.io_context.  Also protects ->cpuset and
- * ->cgroup.subsys[]. And ->vfork_done.
+ * ->cgroup.subsys[]. And ->vfork_done. And ->sysvshm.shm_clist.
  *
  * Nests both inside and outside of read_lock(&tasklist_lock).
  * It must not be nested with write_lock_irq(&tasklist_lock),
--- a/ipc/shm.c~shm-extend-forced-shm-destroy-to-support-objects-from-several-ipc-nses-simplified
+++ a/ipc/shm.c
@@ -62,9 +62,18 @@ struct shmid_kernel /* private to the ke
 	struct pid		*shm_lprid;
 	struct ucounts		*mlock_ucounts;
 
-	/* The task created the shm object.  NULL if the task is dead. */
+	/*
+	 * The task created the shm object, for
+	 * task_lock(shp->shm_creator)
+	 */
 	struct task_struct	*shm_creator;
-	struct list_head	shm_clist;	/* list by creator */
+
+	/*
+	 * List by creator. task_lock(->shm_creator) required for read/write.
+	 * If list_empty(), then the creator is dead already.
+	 */
+	struct list_head	shm_clist;
+	struct ipc_namespace	*ns;
 } __randomize_layout;
 
 /* shm_mode upper byte flags */
@@ -115,6 +124,7 @@ static void do_shm_rmid(struct ipc_names
 	struct shmid_kernel *shp;
 
 	shp = container_of(ipcp, struct shmid_kernel, shm_perm);
+	WARN_ON(ns != shp->ns);
 
 	if (shp->shm_nattch) {
 		shp->shm_perm.mode |= SHM_DEST;
@@ -225,10 +235,43 @@ static void shm_rcu_free(struct rcu_head
 	kfree(shp);
 }
 
-static inline void shm_rmid(struct ipc_namespace *ns, struct shmid_kernel *s)
+/*
+ * It has to be called with shp locked.
+ * It must be called before ipc_rmid()
+ */
+static inline void shm_clist_rm(struct shmid_kernel *shp)
+{
+	struct task_struct *creator;
+
+	/* ensure that shm_creator does not disappear */
+	rcu_read_lock();
+
+	/*
+	 * A concurrent exit_shm may do a list_del_init() as well.
+	 * Just do nothing if exit_shm already did the work
+	 */
+	if (!list_empty(&shp->shm_clist)) {
+		/*
+		 * shp->shm_creator is guaranteed to be valid *only*
+		 * if shp->shm_clist is not empty.
+		 */
+		creator = shp->shm_creator;
+
+		task_lock(creator);
+		/*
+		 * list_del_init() is a nop if the entry was already removed
+		 * from the list.
+		 */
+		list_del_init(&shp->shm_clist);
+		task_unlock(creator);
+	}
+	rcu_read_unlock();
+}
+
+static inline void shm_rmid(struct shmid_kernel *s)
 {
-	list_del(&s->shm_clist);
-	ipc_rmid(&shm_ids(ns), &s->shm_perm);
+	shm_clist_rm(s);
+	ipc_rmid(&shm_ids(s->ns), &s->shm_perm);
 }
 
 
@@ -283,7 +326,7 @@ static void shm_destroy(struct ipc_names
 	shm_file = shp->shm_file;
 	shp->shm_file = NULL;
 	ns->shm_tot -= (shp->shm_segsz + PAGE_SIZE - 1) >> PAGE_SHIFT;
-	shm_rmid(ns, shp);
+	shm_rmid(shp);
 	shm_unlock(shp);
 	if (!is_file_hugepages(shm_file))
 		shmem_lock(shm_file, 0, shp->mlock_ucounts);
@@ -303,10 +346,10 @@ static void shm_destroy(struct ipc_names
  *
  * 2) sysctl kernel.shm_rmid_forced is set to 1.
  */
-static bool shm_may_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
+static bool shm_may_destroy(struct shmid_kernel *shp)
 {
 	return (shp->shm_nattch == 0) &&
-	       (ns->shm_rmid_forced ||
+	       (shp->ns->shm_rmid_forced ||
 		(shp->shm_perm.mode & SHM_DEST));
 }
 
@@ -337,7 +380,7 @@ static void shm_close(struct vm_area_str
 	ipc_update_pid(&shp->shm_lprid, task_tgid(current));
 	shp->shm_dtim = ktime_get_real_seconds();
 	shp->shm_nattch--;
-	if (shm_may_destroy(ns, shp))
+	if (shm_may_destroy(shp))
 		shm_destroy(ns, shp);
 	else
 		shm_unlock(shp);
@@ -358,10 +401,10 @@ static int shm_try_destroy_orphaned(int
 	 *
 	 * As shp->* are changed under rwsem, it's safe to skip shp locking.
 	 */
-	if (shp->shm_creator != NULL)
+	if (!list_empty(&shp->shm_clist))
 		return 0;
 
-	if (shm_may_destroy(ns, shp)) {
+	if (shm_may_destroy(shp)) {
 		shm_lock_by_ptr(shp);
 		shm_destroy(ns, shp);
 	}
@@ -379,48 +422,97 @@ void shm_destroy_orphaned(struct ipc_nam
 /* Locking assumes this will only be called with task == current */
 void exit_shm(struct task_struct *task)
 {
-	struct ipc_namespace *ns = task->nsproxy->ipc_ns;
-	struct shmid_kernel *shp, *n;
+	for (;;) {
+		struct shmid_kernel *shp;
+		struct ipc_namespace *ns;
 
-	if (list_empty(&task->sysvshm.shm_clist))
-		return;
+		task_lock(task);
+
+		if (list_empty(&task->sysvshm.shm_clist)) {
+			task_unlock(task);
+			break;
+		}
+
+		shp = list_first_entry(&task->sysvshm.shm_clist, struct shmid_kernel,
+				shm_clist);
 
-	/*
-	 * If kernel.shm_rmid_forced is not set then only keep track of
-	 * which shmids are orphaned, so that a later set of the sysctl
-	 * can clean them up.
-	 */
-	if (!ns->shm_rmid_forced) {
-		down_read(&shm_ids(ns).rwsem);
-		list_for_each_entry(shp, &task->sysvshm.shm_clist, shm_clist)
-			shp->shm_creator = NULL;
 		/*
-		 * Only under read lock but we are only called on current
-		 * so no entry on the list will be shared.
+		 * 1) Get pointer to the ipc namespace. It is worth to say
+		 * that this pointer is guaranteed to be valid because
+		 * shp lifetime is always shorter than namespace lifetime
+		 * in which shp lives.
+		 * We taken task_lock it means that shp won't be freed.
 		 */
-		list_del(&task->sysvshm.shm_clist);
-		up_read(&shm_ids(ns).rwsem);
-		return;
-	}
+		ns = shp->ns;
 
-	/*
-	 * Destroy all already created segments, that were not yet mapped,
-	 * and mark any mapped as orphan to cover the sysctl toggling.
-	 * Destroy is skipped if shm_may_destroy() returns false.
-	 */
-	down_write(&shm_ids(ns).rwsem);
-	list_for_each_entry_safe(shp, n, &task->sysvshm.shm_clist, shm_clist) {
-		shp->shm_creator = NULL;
+		/*
+		 * 2) If kernel.shm_rmid_forced is not set then only keep track of
+		 * which shmids are orphaned, so that a later set of the sysctl
+		 * can clean them up.
+		 */
+		if (!ns->shm_rmid_forced)
+			goto unlink_continue;
 
-		if (shm_may_destroy(ns, shp)) {
-			shm_lock_by_ptr(shp);
-			shm_destroy(ns, shp);
+		/*
+		 * 3) get a reference to the namespace.
+		 *    The refcount could be already 0. If it is 0, then
+		 *    the shm objects will be free by free_ipc_work().
+		 */
+		ns = get_ipc_ns_not_zero(ns);
+		if (!ns) {
+unlink_continue:
+			list_del_init(&shp->shm_clist);
+			task_unlock(task);
+			continue;
 		}
-	}
 
-	/* Remove the list head from any segments still attached. */
-	list_del(&task->sysvshm.shm_clist);
-	up_write(&shm_ids(ns).rwsem);
+		/*
+		 * 4) get a reference to shp.
+		 *   This cannot fail: shm_clist_rm() is called before
+		 *   ipc_rmid(), thus the refcount cannot be 0.
+		 */
+		WARN_ON(!ipc_rcu_getref(&shp->shm_perm));
+
+		/*
+		 * 5) unlink the shm segment from the list of segments
+		 *    created by current.
+		 *    This must be done last. After unlinking,
+		 *    only the refcounts obtained above prevent IPC_RMID
+		 *    from destroying the segment or the namespace.
+		 */
+		list_del_init(&shp->shm_clist);
+
+		task_unlock(task);
+
+		/*
+		 * 6) we have all references
+		 *    Thus lock & if needed destroy shp.
+		 */
+		down_write(&shm_ids(ns).rwsem);
+		shm_lock_by_ptr(shp);
+		/*
+		 * rcu_read_lock was implicitly taken in shm_lock_by_ptr, it's
+		 * safe to call ipc_rcu_putref here
+		 */
+		ipc_rcu_putref(&shp->shm_perm, shm_rcu_free);
+
+		if (ipc_valid_object(&shp->shm_perm)) {
+			if (shm_may_destroy(shp))
+				shm_destroy(ns, shp);
+			else
+				shm_unlock(shp);
+		} else {
+			/*
+			 * Someone else deleted the shp from namespace
+			 * idr/kht while we have waited.
+			 * Just unlock and continue.
+			 */
+			shm_unlock(shp);
+		}
+
+		up_write(&shm_ids(ns).rwsem);
+		put_ipc_ns(ns); /* paired with get_ipc_ns_not_zero */
+	}
 }
 
 static vm_fault_t shm_fault(struct vm_fault *vmf)
@@ -676,7 +768,11 @@ static int newseg(struct ipc_namespace *
 	if (error < 0)
 		goto no_id;
 
+	shp->ns = ns;
+
+	task_lock(current);
 	list_add(&shp->shm_clist, &current->sysvshm.shm_clist);
+	task_unlock(current);
 
 	/*
 	 * shmid gets reported as "inode#" in /proc/pid/maps.
@@ -1567,7 +1663,8 @@ out_nattch:
 	down_write(&shm_ids(ns).rwsem);
 	shp = shm_lock(ns, shmid);
 	shp->shm_nattch--;
-	if (shm_may_destroy(ns, shp))
+
+	if (shm_may_destroy(shp))
 		shm_destroy(ns, shp);
 	else
 		shm_unlock(shp);
_

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [patch 05/15] hexagon: export raw I/O routines for modules
       [not found] <20211119164248.50feee07c5d2cc6cc4addf97@linux-foundation.org>
  2021-11-20  0:43 ` [patch 02/15] ipc: WARN if trying to remove ipc object which is absent Andrew Morton
  2021-11-20  0:43 ` [patch 03/15] shm: extend forced shm destroy to support objects from several IPC nses Andrew Morton
@ 2021-11-20  0:43 ` Andrew Morton
  2021-11-20  0:43 ` [patch 06/15] hexagon: clean up timer-regs.h Andrew Morton
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 10+ messages in thread
From: Andrew Morton @ 2021-11-20  0:43 UTC (permalink / raw)
  To: akpm, bcain, linux-mm, mm-commits, nathan, ndesaulniers, stable,
	torvalds

From: Nathan Chancellor <nathan@kernel.org>
Subject: hexagon: export raw I/O routines for modules

Patch series "Fixes for ARCH=hexagon allmodconfig", v2.

This series fixes some issues noticed with ARCH=hexagon allmodconfig.


This patch (of 3):

When building ARCH=hexagon allmodconfig, the following errors occur:

ERROR: modpost: "__raw_readsl" [drivers/i3c/master/svc-i3c-master.ko] undefined!
ERROR: modpost: "__raw_writesl" [drivers/i3c/master/dw-i3c-master.ko] undefined!
ERROR: modpost: "__raw_readsl" [drivers/i3c/master/dw-i3c-master.ko] undefined!
ERROR: modpost: "__raw_writesl" [drivers/i3c/master/i3c-master-cdns.ko] undefined!
ERROR: modpost: "__raw_readsl" [drivers/i3c/master/i3c-master-cdns.ko] undefined!

Export these symbols so that modules can use them without any errors.

Link: https://lkml.kernel.org/r/20211115174250.1994179-1-nathan@kernel.org
Link: https://lkml.kernel.org/r/20211115174250.1994179-2-nathan@kernel.org
Fixes: 013bf24c3829 ("Hexagon: Provide basic implementation and/or stubs for I/O routines.")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Acked-by: Brian Cain <bcain@codeaurora.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/hexagon/lib/io.c |    4 ++++
 1 file changed, 4 insertions(+)

--- a/arch/hexagon/lib/io.c~hexagon-export-raw-i-o-routines-for-modules
+++ a/arch/hexagon/lib/io.c
@@ -27,6 +27,7 @@ void __raw_readsw(const void __iomem *ad
 		*dst++ = *src;
 
 }
+EXPORT_SYMBOL(__raw_readsw);
 
 /*
  * __raw_writesw - read words a short at a time
@@ -47,6 +48,7 @@ void __raw_writesw(void __iomem *addr, c
 
 
 }
+EXPORT_SYMBOL(__raw_writesw);
 
 /*  Pretty sure len is pre-adjusted for the length of the access already */
 void __raw_readsl(const void __iomem *addr, void *data, int len)
@@ -62,6 +64,7 @@ void __raw_readsl(const void __iomem *ad
 
 
 }
+EXPORT_SYMBOL(__raw_readsl);
 
 void __raw_writesl(void __iomem *addr, const void *data, int len)
 {
@@ -76,3 +79,4 @@ void __raw_writesl(void __iomem *addr, c
 
 
 }
+EXPORT_SYMBOL(__raw_writesl);
_

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [patch 06/15] hexagon: clean up timer-regs.h
       [not found] <20211119164248.50feee07c5d2cc6cc4addf97@linux-foundation.org>
                   ` (2 preceding siblings ...)
  2021-11-20  0:43 ` [patch 05/15] hexagon: export raw I/O routines for modules Andrew Morton
@ 2021-11-20  0:43 ` Andrew Morton
  2021-11-20  0:43 ` [patch 08/15] mm: kmemleak: slob: respect SLAB_NOLEAKTRACE flag Andrew Morton
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 10+ messages in thread
From: Andrew Morton @ 2021-11-20  0:43 UTC (permalink / raw)
  To: akpm, bcain, linux-mm, mm-commits, nathan, ndesaulniers, stable,
	torvalds

From: Nathan Chancellor <nathan@kernel.org>
Subject: hexagon: clean up timer-regs.h

When building allmodconfig, there is a warning about TIMER_ENABLE being
redefined:

 drivers/clocksource/timer-oxnas-rps.c:39:9: error: 'TIMER_ENABLE' macro redefined [-Werror,-Wmacro-redefined]
 #define TIMER_ENABLE            BIT(7)
         ^
 ./arch/hexagon/include/asm/timer-regs.h:13:9: note: previous definition is here
 #define TIMER_ENABLE            0
         ^
 1 error generated.

The values in this header are only used in one file each, if they are
used at all. Remove the header and sink all of the constants into their
respective files.

TCX0_CLK_RATE is only used in arch/hexagon/include/asm/timex.h

TIMER_ENABLE, RTOS_TIMER_INT, RTOS_TIMER_REGS_ADDR are only used in
arch/hexagon/kernel/time.c.

SLEEP_CLK_RATE and TIMER_CLR_ON_MATCH have both been unused since the
file's introduction in commit 71e4a47f32f4 ("Hexagon: Add time and timer
functions").

TIMER_ENABLE is redefined as BIT(0) so the shift is moved into the
definition, rather than its use.

Link: https://lkml.kernel.org/r/20211115174250.1994179-3-nathan@kernel.org
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Acked-by: Brian Cain <bcain@codeaurora.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/hexagon/include/asm/timer-regs.h |   26 ------------------------
 arch/hexagon/include/asm/timex.h      |    3 --
 arch/hexagon/kernel/time.c            |   12 +++++++++--
 3 files changed, 11 insertions(+), 30 deletions(-)

--- a/arch/hexagon/include/asm/timer-regs.h
+++ /dev/null
@@ -1,26 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-only */
-/*
- * Timer support for Hexagon
- *
- * Copyright (c) 2010-2011, The Linux Foundation. All rights reserved.
- */
-
-#ifndef _ASM_TIMER_REGS_H
-#define _ASM_TIMER_REGS_H
-
-/*  This stuff should go into a platform specific file  */
-#define TCX0_CLK_RATE		19200
-#define TIMER_ENABLE		0
-#define TIMER_CLR_ON_MATCH	1
-
-/*
- * 8x50 HDD Specs 5-8.  Simulator co-sim not fixed until
- * release 1.1, and then it's "adjustable" and probably not defaulted.
- */
-#define RTOS_TIMER_INT		3
-#ifdef CONFIG_HEXAGON_COMET
-#define RTOS_TIMER_REGS_ADDR	0xAB000000UL
-#endif
-#define SLEEP_CLK_RATE		32000
-
-#endif
--- a/arch/hexagon/include/asm/timex.h~hexagon-clean-up-timer-regsh
+++ a/arch/hexagon/include/asm/timex.h
@@ -7,11 +7,10 @@
 #define _ASM_TIMEX_H
 
 #include <asm-generic/timex.h>
-#include <asm/timer-regs.h>
 #include <asm/hexagon_vm.h>
 
 /* Using TCX0 as our clock.  CLOCK_TICK_RATE scheduled to be removed. */
-#define CLOCK_TICK_RATE              TCX0_CLK_RATE
+#define CLOCK_TICK_RATE              19200
 
 #define ARCH_HAS_READ_CURRENT_TIMER
 
--- a/arch/hexagon/kernel/time.c~hexagon-clean-up-timer-regsh
+++ a/arch/hexagon/kernel/time.c
@@ -17,9 +17,10 @@
 #include <linux/of_irq.h>
 #include <linux/module.h>
 
-#include <asm/timer-regs.h>
 #include <asm/hexagon_vm.h>
 
+#define TIMER_ENABLE		BIT(0)
+
 /*
  * For the clocksource we need:
  *	pcycle frequency (600MHz)
@@ -33,6 +34,13 @@ cycles_t	pcycle_freq_mhz;
 cycles_t	thread_freq_mhz;
 cycles_t	sleep_clk_freq;
 
+/*
+ * 8x50 HDD Specs 5-8.  Simulator co-sim not fixed until
+ * release 1.1, and then it's "adjustable" and probably not defaulted.
+ */
+#define RTOS_TIMER_INT		3
+#define RTOS_TIMER_REGS_ADDR	0xAB000000UL
+
 static struct resource rtos_timer_resources[] = {
 	{
 		.start	= RTOS_TIMER_REGS_ADDR,
@@ -80,7 +88,7 @@ static int set_next_event(unsigned long
 	iowrite32(0, &rtos_timer->clear);
 
 	iowrite32(delta, &rtos_timer->match);
-	iowrite32(1 << TIMER_ENABLE, &rtos_timer->enable);
+	iowrite32(TIMER_ENABLE, &rtos_timer->enable);
 	return 0;
 }
 
_

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [patch 08/15] mm: kmemleak: slob: respect SLAB_NOLEAKTRACE flag
       [not found] <20211119164248.50feee07c5d2cc6cc4addf97@linux-foundation.org>
                   ` (3 preceding siblings ...)
  2021-11-20  0:43 ` [patch 06/15] hexagon: clean up timer-regs.h Andrew Morton
@ 2021-11-20  0:43 ` Andrew Morton
  2021-11-20  0:43 ` [patch 10/15] hugetlb, userfaultfd: fix reservation restore on userfaultfd error Andrew Morton
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 10+ messages in thread
From: Andrew Morton @ 2021-11-20  0:43 UTC (permalink / raw)
  To: akpm, catalin.marinas, cl, glommer, gregkh, iamjoonsoo.kim,
	linux-mm, mm-commits, penberg, rientjes, rkovhaev, songmuchun,
	stable, torvalds, vbabka

From: Rustam Kovhaev <rkovhaev@gmail.com>
Subject: mm: kmemleak: slob: respect SLAB_NOLEAKTRACE flag

When kmemleak is enabled for SLOB, system does not boot and does not print
anything to the console.  At the very early stage in the boot process we
hit infinite recursion from kmemleak_init() and eventually kernel crashes.

kmemleak_init() specifies SLAB_NOLEAKTRACE for KMEM_CACHE(), but
kmem_cache_create_usercopy() removes it because CACHE_CREATE_MASK is not
valid for SLOB.

Let's fix CACHE_CREATE_MASK and make kmemleak work with SLOB

Link: https://lkml.kernel.org/r/20211115020850.3154366-1-rkovhaev@gmail.com
Fixes: d8843922fba4 ("slab: Ignore internal flags in cache creation")
Signed-off-by: Rustam Kovhaev <rkovhaev@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Glauber Costa <glommer@parallels.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slab.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/slab.h~mm-kmemleak-slob-respect-slab_noleaktrace-flag
+++ a/mm/slab.h
@@ -147,7 +147,7 @@ static inline slab_flags_t kmem_cache_fl
 #define SLAB_CACHE_FLAGS (SLAB_NOLEAKTRACE | SLAB_RECLAIM_ACCOUNT | \
 			  SLAB_TEMPORARY | SLAB_ACCOUNT)
 #else
-#define SLAB_CACHE_FLAGS (0)
+#define SLAB_CACHE_FLAGS (SLAB_NOLEAKTRACE)
 #endif
 
 /* Common flags available with current configuration */
_

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [patch 10/15] hugetlb, userfaultfd: fix reservation restore on userfaultfd error
       [not found] <20211119164248.50feee07c5d2cc6cc4addf97@linux-foundation.org>
                   ` (4 preceding siblings ...)
  2021-11-20  0:43 ` [patch 08/15] mm: kmemleak: slob: respect SLAB_NOLEAKTRACE flag Andrew Morton
@ 2021-11-20  0:43 ` Andrew Morton
  2021-11-20  0:43 ` [patch 12/15] mm/damon/dbgfs: use '__GFP_NOWARN' for user-specified size buffer allocation Andrew Morton
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 10+ messages in thread
From: Andrew Morton @ 2021-11-20  0:43 UTC (permalink / raw)
  To: akpm, almasrymina, jthoughton, linux-mm, mike.kravetz,
	mm-commits, stable, torvalds, weixugc

From: Mina Almasry <almasrymina@google.com>
Subject: hugetlb, userfaultfd: fix reservation restore on userfaultfd error

Currently in the is_continue case in hugetlb_mcopy_atomic_pte(), if we
bail out using "goto out_release_unlock;" in the cases where idx >= size,
or !huge_pte_none(), the code will detect that new_pagecache_page ==
false, and so call restore_reserve_on_error().  In this case I see
restore_reserve_on_error() delete the reservation, and the following call
to remove_inode_hugepages() will increment h->resv_hugepages causing a
100% reproducible leak.

We should treat the is_continue case similar to adding a page into the
pagecache and set new_pagecache_page to true, to indicate that there is no
reservation to restore on the error path, and we need not call
restore_reserve_on_error().  Rename new_pagecache_page to
page_in_pagecache to make that clear.

Link: https://lkml.kernel.org/r/20211117193825.378528-1-almasrymina@google.com
Fixes: c7b1850dfb41 ("hugetlb: don't pass page cache pages to restore_reserve_on_error")
Signed-off-by: Mina Almasry <almasrymina@google.com>
Reported-by: James Houghton <jthoughton@google.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Wei Xu <weixugc@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/hugetlb.c |    7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

--- a/mm/hugetlb.c~hugetlb-userfaultfd-fix-reservation-restore-on-userfaultfd-error
+++ a/mm/hugetlb.c
@@ -5736,13 +5736,14 @@ int hugetlb_mcopy_atomic_pte(struct mm_s
 	int ret = -ENOMEM;
 	struct page *page;
 	int writable;
-	bool new_pagecache_page = false;
+	bool page_in_pagecache = false;
 
 	if (is_continue) {
 		ret = -EFAULT;
 		page = find_lock_page(mapping, idx);
 		if (!page)
 			goto out;
+		page_in_pagecache = true;
 	} else if (!*pagep) {
 		/* If a page already exists, then it's UFFDIO_COPY for
 		 * a non-missing case. Return -EEXIST.
@@ -5830,7 +5831,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_s
 		ret = huge_add_to_page_cache(page, mapping, idx);
 		if (ret)
 			goto out_release_nounlock;
-		new_pagecache_page = true;
+		page_in_pagecache = true;
 	}
 
 	ptl = huge_pte_lockptr(h, dst_mm, dst_pte);
@@ -5894,7 +5895,7 @@ out_release_unlock:
 	if (vm_shared || is_continue)
 		unlock_page(page);
 out_release_nounlock:
-	if (!new_pagecache_page)
+	if (!page_in_pagecache)
 		restore_reserve_on_error(h, dst_vma, dst_addr, page);
 	put_page(page);
 	goto out;
_

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [patch 12/15] mm/damon/dbgfs: use '__GFP_NOWARN' for user-specified size buffer allocation
       [not found] <20211119164248.50feee07c5d2cc6cc4addf97@linux-foundation.org>
                   ` (5 preceding siblings ...)
  2021-11-20  0:43 ` [patch 10/15] hugetlb, userfaultfd: fix reservation restore on userfaultfd error Andrew Morton
@ 2021-11-20  0:43 ` Andrew Morton
  2021-11-20  0:43 ` [patch 13/15] mm/damon/dbgfs: fix missed use of damon_dbgfs_lock Andrew Morton
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 10+ messages in thread
From: Andrew Morton @ 2021-11-20  0:43 UTC (permalink / raw)
  To: akpm, linux-mm, mm-commits, sj, stable, torvalds

From: SeongJae Park <sj@kernel.org>
Subject: mm/damon/dbgfs: use '__GFP_NOWARN' for user-specified size buffer allocation

Patch series "DAMON fixes".


This patch (of 2):

DAMON users can trigger below warning in '__alloc_pages()' by invoking
write() to some DAMON debugfs files with arbitrarily high count argument,
because DAMON debugfs interface allocates some buffers based on the
user-specified 'count'.

        if (unlikely(order >= MAX_ORDER)) {
                WARN_ON_ONCE(!(gfp & __GFP_NOWARN));
                return NULL;
        }

Because the DAMON debugfs interface code checks failure of the
'kmalloc()', this commit simply suppresses the warnings by adding
'__GFP_NOWARN' flag.

Link: https://lkml.kernel.org/r/20211110145758.16558-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20211110145758.16558-2-sj@kernel.org
Fixes: 4bc05954d007 ("mm/damon: implement a debugfs-based user space interface")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/damon/dbgfs.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

--- a/mm/damon/dbgfs.c~mm-damon-dbgfs-use-__gfp_nowarn-for-user-specified-size-buffer-allocation
+++ a/mm/damon/dbgfs.c
@@ -32,7 +32,7 @@ static char *user_input_str(const char _
 	if (*ppos)
 		return ERR_PTR(-EINVAL);
 
-	kbuf = kmalloc(count + 1, GFP_KERNEL);
+	kbuf = kmalloc(count + 1, GFP_KERNEL | __GFP_NOWARN);
 	if (!kbuf)
 		return ERR_PTR(-ENOMEM);
 
@@ -133,7 +133,7 @@ static ssize_t dbgfs_schemes_read(struct
 	char *kbuf;
 	ssize_t len;
 
-	kbuf = kmalloc(count, GFP_KERNEL);
+	kbuf = kmalloc(count, GFP_KERNEL | __GFP_NOWARN);
 	if (!kbuf)
 		return -ENOMEM;
 
@@ -452,7 +452,7 @@ static ssize_t dbgfs_init_regions_read(s
 	char *kbuf;
 	ssize_t len;
 
-	kbuf = kmalloc(count, GFP_KERNEL);
+	kbuf = kmalloc(count, GFP_KERNEL | __GFP_NOWARN);
 	if (!kbuf)
 		return -ENOMEM;
 
@@ -578,7 +578,7 @@ static ssize_t dbgfs_kdamond_pid_read(st
 	char *kbuf;
 	ssize_t len;
 
-	kbuf = kmalloc(count, GFP_KERNEL);
+	kbuf = kmalloc(count, GFP_KERNEL | __GFP_NOWARN);
 	if (!kbuf)
 		return -ENOMEM;
 
_

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [patch 13/15] mm/damon/dbgfs: fix missed use of damon_dbgfs_lock
       [not found] <20211119164248.50feee07c5d2cc6cc4addf97@linux-foundation.org>
                   ` (6 preceding siblings ...)
  2021-11-20  0:43 ` [patch 12/15] mm/damon/dbgfs: use '__GFP_NOWARN' for user-specified size buffer allocation Andrew Morton
@ 2021-11-20  0:43 ` Andrew Morton
  2021-11-20  0:43 ` [patch 14/15] kmap_local: don't assume kmap PTEs are linear arrays in memory Andrew Morton
  2021-11-20  0:43 ` [patch 15/15] proc/vmcore: fix clearing user buffer by properly using clear_user() Andrew Morton
  9 siblings, 0 replies; 10+ messages in thread
From: Andrew Morton @ 2021-11-20  0:43 UTC (permalink / raw)
  To: akpm, linux-mm, mm-commits, sj, stable, torvalds

From: SeongJae Park <sj@kernel.org>
Subject: mm/damon/dbgfs: fix missed use of damon_dbgfs_lock

DAMON debugfs is supposed to protect dbgfs_ctxs, dbgfs_nr_ctxs, and
dbgfs_dirs using damon_dbgfs_lock.  However, some of the code is accessing
the variables without the protection.  This commit fixes it by protecting
all such accesses.

Link: https://lkml.kernel.org/r/20211110145758.16558-3-sj@kernel.org
Fixes: 75c1c2b53c78 ("mm/damon/dbgfs: support multiple contexts")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/damon/dbgfs.c |   12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

--- a/mm/damon/dbgfs.c~mm-damon-dbgfs-fix-missed-use-of-damon_dbgfs_lock
+++ a/mm/damon/dbgfs.c
@@ -877,12 +877,14 @@ static ssize_t dbgfs_monitor_on_write(st
 		return -EINVAL;
 	}
 
+	mutex_lock(&damon_dbgfs_lock);
 	if (!strncmp(kbuf, "on", count)) {
 		int i;
 
 		for (i = 0; i < dbgfs_nr_ctxs; i++) {
 			if (damon_targets_empty(dbgfs_ctxs[i])) {
 				kfree(kbuf);
+				mutex_unlock(&damon_dbgfs_lock);
 				return -EINVAL;
 			}
 		}
@@ -892,6 +894,7 @@ static ssize_t dbgfs_monitor_on_write(st
 	} else {
 		ret = -EINVAL;
 	}
+	mutex_unlock(&damon_dbgfs_lock);
 
 	if (!ret)
 		ret = count;
@@ -944,15 +947,16 @@ static int __init __damon_dbgfs_init(voi
 
 static int __init damon_dbgfs_init(void)
 {
-	int rc;
+	int rc = -ENOMEM;
 
+	mutex_lock(&damon_dbgfs_lock);
 	dbgfs_ctxs = kmalloc(sizeof(*dbgfs_ctxs), GFP_KERNEL);
 	if (!dbgfs_ctxs)
-		return -ENOMEM;
+		goto out;
 	dbgfs_ctxs[0] = dbgfs_new_ctx();
 	if (!dbgfs_ctxs[0]) {
 		kfree(dbgfs_ctxs);
-		return -ENOMEM;
+		goto out;
 	}
 	dbgfs_nr_ctxs = 1;
 
@@ -963,6 +967,8 @@ static int __init damon_dbgfs_init(void)
 		pr_err("%s: dbgfs init failed\n", __func__);
 	}
 
+out:
+	mutex_unlock(&damon_dbgfs_lock);
 	return rc;
 }
 
_

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [patch 14/15] kmap_local: don't assume kmap PTEs are linear arrays in memory
       [not found] <20211119164248.50feee07c5d2cc6cc4addf97@linux-foundation.org>
                   ` (7 preceding siblings ...)
  2021-11-20  0:43 ` [patch 13/15] mm/damon/dbgfs: fix missed use of damon_dbgfs_lock Andrew Morton
@ 2021-11-20  0:43 ` Andrew Morton
  2021-11-20  0:43 ` [patch 15/15] proc/vmcore: fix clearing user buffer by properly using clear_user() Andrew Morton
  9 siblings, 0 replies; 10+ messages in thread
From: Andrew Morton @ 2021-11-20  0:43 UTC (permalink / raw)
  To: akpm, ardb, linus.walleij, linux-mm, mm-commits, quanyang.wang,
	rmk+kernel, stable, tglx, torvalds

From: Ard Biesheuvel <ardb@kernel.org>
Subject: kmap_local: don't assume kmap PTEs are linear arrays in memory

The kmap_local conversion broke the ARM architecture, because the new code
assumes that all PTEs used for creating kmaps form a linear array in
memory, and uses array indexing to look up the kmap PTE belonging to a
certain kmap index.

On ARM, this cannot work, not only because the PTE pages may be
non-adjacent in memory, but also because ARM/!LPAE interleaves hardware
entries and extended entries (carrying software-only bits) in a way that
is not compatible with array indexing.

Fortunately, this only seems to affect configurations with more than 8
CPUs, due to the way the per-CPU kmap slots are organized in memory.

Work around this by permitting an architecture to set a Kconfig symbol
that signifies that the kmap PTEs do not form a lineary array in memory,
and so the only way to locate the appropriate one is to walk the page
tables.

Link: https://lore.kernel.org/linux-arm-kernel/20211026131249.3731275-1-ardb@kernel.org/
Link: https://lkml.kernel.org/r/20211116094737.7391-1-ardb@kernel.org
Fixes: 2a15ba82fa6c ("ARM: highmem: Switch to generic kmap atomic")
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Reported-by: Quanyang Wang <quanyang.wang@windriver.com>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/arm/Kconfig |    1 +
 mm/Kconfig       |    3 +++
 mm/highmem.c     |   32 +++++++++++++++++++++-----------
 3 files changed, 25 insertions(+), 11 deletions(-)

--- a/arch/arm/Kconfig~kmap_local-dont-assume-kmap-ptes-are-linear-arrays-in-memory
+++ a/arch/arm/Kconfig
@@ -1463,6 +1463,7 @@ config HIGHMEM
 	bool "High Memory Support"
 	depends on MMU
 	select KMAP_LOCAL
+	select KMAP_LOCAL_NON_LINEAR_PTE_ARRAY
 	help
 	  The address space of ARM processors is only 4 Gigabytes large
 	  and it has to accommodate user address space, kernel address
--- a/mm/highmem.c~kmap_local-dont-assume-kmap-ptes-are-linear-arrays-in-memory
+++ a/mm/highmem.c
@@ -503,16 +503,22 @@ static inline int kmap_local_calc_idx(in
 
 static pte_t *__kmap_pte;
 
-static pte_t *kmap_get_pte(void)
+static pte_t *kmap_get_pte(unsigned long vaddr, int idx)
 {
+	if (IS_ENABLED(CONFIG_KMAP_LOCAL_NON_LINEAR_PTE_ARRAY))
+		/*
+		 * Set by the arch if __kmap_pte[-idx] does not produce
+		 * the correct entry.
+		 */
+		return virt_to_kpte(vaddr);
 	if (!__kmap_pte)
 		__kmap_pte = virt_to_kpte(__fix_to_virt(FIX_KMAP_BEGIN));
-	return __kmap_pte;
+	return &__kmap_pte[-idx];
 }
 
 void *__kmap_local_pfn_prot(unsigned long pfn, pgprot_t prot)
 {
-	pte_t pteval, *kmap_pte = kmap_get_pte();
+	pte_t pteval, *kmap_pte;
 	unsigned long vaddr;
 	int idx;
 
@@ -524,9 +530,10 @@ void *__kmap_local_pfn_prot(unsigned lon
 	preempt_disable();
 	idx = arch_kmap_local_map_idx(kmap_local_idx_push(), pfn);
 	vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
-	BUG_ON(!pte_none(*(kmap_pte - idx)));
+	kmap_pte = kmap_get_pte(vaddr, idx);
+	BUG_ON(!pte_none(*kmap_pte));
 	pteval = pfn_pte(pfn, prot);
-	arch_kmap_local_set_pte(&init_mm, vaddr, kmap_pte - idx, pteval);
+	arch_kmap_local_set_pte(&init_mm, vaddr, kmap_pte, pteval);
 	arch_kmap_local_post_map(vaddr, pteval);
 	current->kmap_ctrl.pteval[kmap_local_idx()] = pteval;
 	preempt_enable();
@@ -559,7 +566,7 @@ EXPORT_SYMBOL(__kmap_local_page_prot);
 void kunmap_local_indexed(void *vaddr)
 {
 	unsigned long addr = (unsigned long) vaddr & PAGE_MASK;
-	pte_t *kmap_pte = kmap_get_pte();
+	pte_t *kmap_pte;
 	int idx;
 
 	if (addr < __fix_to_virt(FIX_KMAP_END) ||
@@ -584,8 +591,9 @@ void kunmap_local_indexed(void *vaddr)
 	idx = arch_kmap_local_unmap_idx(kmap_local_idx(), addr);
 	WARN_ON_ONCE(addr != __fix_to_virt(FIX_KMAP_BEGIN + idx));
 
+	kmap_pte = kmap_get_pte(addr, idx);
 	arch_kmap_local_pre_unmap(addr);
-	pte_clear(&init_mm, addr, kmap_pte - idx);
+	pte_clear(&init_mm, addr, kmap_pte);
 	arch_kmap_local_post_unmap(addr);
 	current->kmap_ctrl.pteval[kmap_local_idx()] = __pte(0);
 	kmap_local_idx_pop();
@@ -607,7 +615,7 @@ EXPORT_SYMBOL(kunmap_local_indexed);
 void __kmap_local_sched_out(void)
 {
 	struct task_struct *tsk = current;
-	pte_t *kmap_pte = kmap_get_pte();
+	pte_t *kmap_pte;
 	int i;
 
 	/* Clear kmaps */
@@ -634,8 +642,9 @@ void __kmap_local_sched_out(void)
 		idx = arch_kmap_local_map_idx(i, pte_pfn(pteval));
 
 		addr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
+		kmap_pte = kmap_get_pte(addr, idx);
 		arch_kmap_local_pre_unmap(addr);
-		pte_clear(&init_mm, addr, kmap_pte - idx);
+		pte_clear(&init_mm, addr, kmap_pte);
 		arch_kmap_local_post_unmap(addr);
 	}
 }
@@ -643,7 +652,7 @@ void __kmap_local_sched_out(void)
 void __kmap_local_sched_in(void)
 {
 	struct task_struct *tsk = current;
-	pte_t *kmap_pte = kmap_get_pte();
+	pte_t *kmap_pte;
 	int i;
 
 	/* Restore kmaps */
@@ -663,7 +672,8 @@ void __kmap_local_sched_in(void)
 		/* See comment in __kmap_local_sched_out() */
 		idx = arch_kmap_local_map_idx(i, pte_pfn(pteval));
 		addr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
-		set_pte_at(&init_mm, addr, kmap_pte - idx, pteval);
+		kmap_pte = kmap_get_pte(addr, idx);
+		set_pte_at(&init_mm, addr, kmap_pte, pteval);
 		arch_kmap_local_post_map(addr, pteval);
 	}
 }
--- a/mm/Kconfig~kmap_local-dont-assume-kmap-ptes-are-linear-arrays-in-memory
+++ a/mm/Kconfig
@@ -890,6 +890,9 @@ config MAPPING_DIRTY_HELPERS
 config KMAP_LOCAL
 	bool
 
+config KMAP_LOCAL_NON_LINEAR_PTE_ARRAY
+	bool
+
 # struct io_mapping based helper.  Selected by drivers that need them
 config IO_MAPPING
 	bool
_

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [patch 15/15] proc/vmcore: fix clearing user buffer by properly using clear_user()
       [not found] <20211119164248.50feee07c5d2cc6cc4addf97@linux-foundation.org>
                   ` (8 preceding siblings ...)
  2021-11-20  0:43 ` [patch 14/15] kmap_local: don't assume kmap PTEs are linear arrays in memory Andrew Morton
@ 2021-11-20  0:43 ` Andrew Morton
  9 siblings, 0 replies; 10+ messages in thread
From: Andrew Morton @ 2021-11-20  0:43 UTC (permalink / raw)
  To: akpm, bhe, david, dyoung, linux-mm, mm-commits, prudo, stable,
	torvalds, vgoyal

From: David Hildenbrand <david@redhat.com>
Subject: proc/vmcore: fix clearing user buffer by properly using clear_user()

To clear a user buffer we cannot simply use memset, we have to use
clear_user().  With a virtio-mem device that registers a vmcore_cb and has
some logically unplugged memory inside an added Linux memory block, I can
easily trigger a BUG by copying the vmcore via "cp":

[   11.327580] systemd[1]: Starting Kdump Vmcore Save Service...
[   11.339697] kdump[420]: Kdump is using the default log level(3).
[   11.370964] kdump[453]: saving to /sysroot/var/crash/127.0.0.1-2021-11-11-14:59:22/
[   11.373997] kdump[458]: saving vmcore-dmesg.txt to /sysroot/var/crash/127.0.0.1-2021-11-11-14:59:22/
[   11.385357] kdump[465]: saving vmcore-dmesg.txt complete
[   11.386722] kdump[467]: saving vmcore
[   16.531275] BUG: unable to handle page fault for address: 00007f2374e01000
[   16.531705] #PF: supervisor write access in kernel mode
[   16.532037] #PF: error_code(0x0003) - permissions violation
[   16.532396] PGD 7a523067 P4D 7a523067 PUD 7a528067 PMD 7a525067 PTE 800000007048f867
[   16.532872] Oops: 0003 [#1] PREEMPT SMP NOPTI
[   16.533154] CPU: 0 PID: 468 Comm: cp Not tainted 5.15.0+ #6
[   16.533513] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-27-g64f37cc530f1-prebuilt.qemu.org 04/01/2014
[   16.534198] RIP: 0010:read_from_oldmem.part.0.cold+0x1d/0x86
[   16.534552] Code: ff ff ff e8 05 ff fe ff e9 b9 e9 7f ff 48 89 de 48 c7 c7 38 3b 60 82 e8 f1 fe fe ff 83 fd 08 72 3c 49 8d 7d 08 4c 89 e9 89 e8 <49> c7 45 00 00 00 00 00 49 c7 44 05 f8 00 00 00 00 48 83 e7 f81
[   16.535670] RSP: 0018:ffffc9000073be08 EFLAGS: 00010212
[   16.535998] RAX: 0000000000001000 RBX: 00000000002fd000 RCX: 00007f2374e01000
[   16.536441] RDX: 0000000000000001 RSI: 00000000ffffdfff RDI: 00007f2374e01008
[   16.536878] RBP: 0000000000001000 R08: 0000000000000000 R09: ffffc9000073bc50
[   16.537315] R10: ffffc9000073bc48 R11: ffffffff829461a8 R12: 000000000000f000
[   16.537755] R13: 00007f2374e01000 R14: 0000000000000000 R15: ffff88807bd421e8
[   16.538200] FS:  00007f2374e12140(0000) GS:ffff88807f000000(0000) knlGS:0000000000000000
[   16.538696] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   16.539055] CR2: 00007f2374e01000 CR3: 000000007a4aa000 CR4: 0000000000350eb0
[   16.539510] Call Trace:
[   16.539679]  <TASK>
[   16.539828]  read_vmcore+0x236/0x2c0
[   16.540063]  ? enqueue_hrtimer+0x2f/0x80
[   16.540323]  ? inode_security+0x22/0x60
[   16.540572]  proc_reg_read+0x55/0xa0
[   16.540807]  vfs_read+0x95/0x190
[   16.541022]  ksys_read+0x4f/0xc0
[   16.541238]  do_syscall_64+0x3b/0x90
[   16.541475]  entry_SYSCALL_64_after_hwframe+0x44/0xae

Some x86-64 CPUs have a CPU feature called "Supervisor Mode Access
Prevention (SMAP)", which is used to detect wrong access from the kernel
to user buffers like this: SMAP triggers a permissions violation on wrong
access.  In the x86-64 variant of clear_user(), SMAP is properly handled
via clac()+stac().

To fix, properly use clear_user() when we're dealing with a user buffer.

Link: https://lkml.kernel.org/r/20211112092750.6921-1-david@redhat.com
Fixes: 997c136f518c ("fs/proc/vmcore.c: add hook to read_from_oldmem() to check for non-ram pages")
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Philipp Rudo <prudo@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/proc/vmcore.c |   20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

--- a/fs/proc/vmcore.c~proc-vmcore-fix-clearing-user-buffer-by-properly-using-clear_user
+++ a/fs/proc/vmcore.c
@@ -154,9 +154,13 @@ ssize_t read_from_oldmem(char *buf, size
 			nr_bytes = count;
 
 		/* If pfn is not ram, return zeros for sparse dump files */
-		if (!pfn_is_ram(pfn))
-			memset(buf, 0, nr_bytes);
-		else {
+		if (!pfn_is_ram(pfn)) {
+			tmp = 0;
+			if (!userbuf)
+				memset(buf, 0, nr_bytes);
+			else if (clear_user(buf, nr_bytes))
+				tmp = -EFAULT;
+		} else {
 			if (encrypted)
 				tmp = copy_oldmem_page_encrypted(pfn, buf,
 								 nr_bytes,
@@ -165,12 +169,12 @@ ssize_t read_from_oldmem(char *buf, size
 			else
 				tmp = copy_oldmem_page(pfn, buf, nr_bytes,
 						       offset, userbuf);
-
-			if (tmp < 0) {
-				up_read(&vmcore_cb_rwsem);
-				return tmp;
-			}
 		}
+		if (tmp < 0) {
+			up_read(&vmcore_cb_rwsem);
+			return tmp;
+		}
+
 		*ppos += nr_bytes;
 		count -= nr_bytes;
 		buf += nr_bytes;
_

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2021-11-20  0:44 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20211119164248.50feee07c5d2cc6cc4addf97@linux-foundation.org>
2021-11-20  0:43 ` [patch 02/15] ipc: WARN if trying to remove ipc object which is absent Andrew Morton
2021-11-20  0:43 ` [patch 03/15] shm: extend forced shm destroy to support objects from several IPC nses Andrew Morton
2021-11-20  0:43 ` [patch 05/15] hexagon: export raw I/O routines for modules Andrew Morton
2021-11-20  0:43 ` [patch 06/15] hexagon: clean up timer-regs.h Andrew Morton
2021-11-20  0:43 ` [patch 08/15] mm: kmemleak: slob: respect SLAB_NOLEAKTRACE flag Andrew Morton
2021-11-20  0:43 ` [patch 10/15] hugetlb, userfaultfd: fix reservation restore on userfaultfd error Andrew Morton
2021-11-20  0:43 ` [patch 12/15] mm/damon/dbgfs: use '__GFP_NOWARN' for user-specified size buffer allocation Andrew Morton
2021-11-20  0:43 ` [patch 13/15] mm/damon/dbgfs: fix missed use of damon_dbgfs_lock Andrew Morton
2021-11-20  0:43 ` [patch 14/15] kmap_local: don't assume kmap PTEs are linear arrays in memory Andrew Morton
2021-11-20  0:43 ` [patch 15/15] proc/vmcore: fix clearing user buffer by properly using clear_user() Andrew Morton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).