linux-mm.kvack.org archive mirror
* [PATCH 0/4] Signal: Fix hard lockup problem in flush_sigqueue()
@ 2019-03-21 21:45 Waiman Long
  2019-03-21 21:45 ` [PATCH 1/4] mm: Implement kmem objects freeing queue Waiman Long
                   ` (4 more replies)
  0 siblings, 5 replies; 22+ messages in thread
From: Waiman Long @ 2019-03-21 21:45 UTC (permalink / raw)
  To: Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim
  Cc: linux-kernel, linux-mm, selinux, Paul Moore, Stephen Smalley,
	Eric Paris, Peter Zijlstra (Intel),
	Oleg Nesterov, Waiman Long

It was found that if a process has accumulated a sufficient number of
pending signals, the exit of that process may cause its parent to hit a
hard lockup when running on a debug kernel with a slow memory-freeing
path (such as with KASAN enabled).

  release_task() => flush_sigqueue()

The lockup condition can be reproduced on a large system with a lot of
memory and relatively slow CPUs running LTP's sigqueue_9-1 test on a
debug kernel.

This patchset mitigates the problem by introducing a new kernel memory
freeing queue mechanism, modelled after the wake_q mechanism for waking
up tasks. flush_sigqueue() and release_task() are then modified to use
the freeing queue to defer the actual freeing of memory objects until
after tasklist_lock is released and irqs are re-enabled.

With the patchset applied, the hard lockup problem was no longer
reproducible on the debug kernel.

Waiman Long (4):
  mm: Implement kmem objects freeing queue
  signal: Make flush_sigqueue() use free_q to release memory
  signal: Add free_uid_to_q()
  mm: Do periodic rescheduling when freeing objects in kmem_free_up_q()

 include/linux/sched/user.h |  3 +++
 include/linux/signal.h     |  4 ++-
 include/linux/slab.h       | 28 +++++++++++++++++++++
 kernel/exit.c              | 12 ++++++---
 kernel/signal.c            | 29 +++++++++++++---------
 kernel/user.c              | 17 ++++++++++---
 mm/slab_common.c           | 50 ++++++++++++++++++++++++++++++++++++++
 security/selinux/hooks.c   |  8 ++++--
 8 files changed, 128 insertions(+), 23 deletions(-)

-- 
2.18.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 1/4] mm: Implement kmem objects freeing queue
  2019-03-21 21:45 [PATCH 0/4] Signal: Fix hard lockup problem in flush_sigqueue() Waiman Long
@ 2019-03-21 21:45 ` Waiman Long
  2019-03-22 17:47   ` Christopher Lameter
  2019-03-21 21:45 ` [PATCH 2/4] signal: Make flush_sigqueue() use free_q to release memory Waiman Long
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 22+ messages in thread
From: Waiman Long @ 2019-03-21 21:45 UTC (permalink / raw)
  To: Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim
  Cc: linux-kernel, linux-mm, selinux, Paul Moore, Stephen Smalley,
	Eric Paris, Peter Zijlstra (Intel),
	Oleg Nesterov, Waiman Long

When releasing kernel data structures, freeing the memory occupied
by those objects is usually the last step. To avoid races, the release
operation is commonly done with a lock held. However, the freeing
operations do not need to be done under the lock, yet in many cases
they are.

In some complex cases where the locks protect many different memory
objects, that can be a problem especially if some memory debugging
features like KASAN are enabled. In those cases, freeing memory objects
under lock can greatly lengthen the lock hold time. This can even lead
to soft/hard lockups in some extreme cases.

To make it easier to defer freeing memory objects until after unlock,
a kernel memory freeing queue mechanism is now added. It is modelled
after the wake_q mechanism for waking up tasks without holding a lock.

kmem_free_q_add() can be called to add memory objects to a freeing
queue. Later, kmem_free_up_q() can be called to free all the memory
objects in the queue after the lock has been released.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/slab.h | 28 ++++++++++++++++++++++++++++
 mm/slab_common.c     | 41 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 69 insertions(+)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 11b45f7ae405..6116fcecbd8f 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -762,4 +762,32 @@ int slab_dead_cpu(unsigned int cpu);
 #define slab_dead_cpu		NULL
 #endif
 
+/*
+ * Freeing queue node for freeing kmem_cache slab objects later.
+ * The node is put at the beginning of the memory object and so the object
+ * size cannot be smaller than sizeof(kmem_free_q_node).
+ */
+struct kmem_free_q_node {
+	struct kmem_free_q_node *next;
+	struct kmem_cache *cachep;	/* NULL if alloc'ed by kmalloc */
+};
+
+struct kmem_free_q_head {
+	struct kmem_free_q_node *first;
+	struct kmem_free_q_node **lastp;
+};
+
+#define DEFINE_KMEM_FREE_Q(name)	\
+	struct kmem_free_q_head name = { NULL, &name.first }
+
+static inline void kmem_free_q_init(struct kmem_free_q_head *head)
+{
+	head->first = NULL;
+	head->lastp = &head->first;
+}
+
+extern void kmem_free_q_add(struct kmem_free_q_head *head,
+			    struct kmem_cache *cachep, void *object);
+extern void kmem_free_up_q(struct kmem_free_q_head *head);
+
 #endif	/* _LINUX_SLAB_H */
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 03eeb8b7b4b1..dba20b4208f1 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1597,6 +1597,47 @@ void kzfree(const void *p)
 }
 EXPORT_SYMBOL(kzfree);
 
+/**
+ * kmem_free_q_add - add a kmem object to a freeing queue
+ * @head: freeing queue head
+ * @cachep: kmem_cache pointer (NULL for kmalloc'ed objects)
+ * @object: kmem object to be put into the freeing queue
+ *
+ * Put a kmem object into the freeing queue to be freed later.
+ */
+void kmem_free_q_add(struct kmem_free_q_head *head, struct kmem_cache *cachep,
+		     void *object)
+{
+	struct kmem_free_q_node *node = object;
+
+	WARN_ON_ONCE(cachep && cachep->object_size < sizeof(*node));
+	node->next = NULL;
+	node->cachep = cachep;
+	*(head->lastp) = node;
+	head->lastp = &node->next;
+}
+EXPORT_SYMBOL_GPL(kmem_free_q_add);
+
+/**
+ * kmem_free_up_q - free all the objects in the freeing queue
+ * @head: freeing queue head
+ *
+ * Free all the objects in the freeing queue.
+ */
+void kmem_free_up_q(struct kmem_free_q_head *head)
+{
+	struct kmem_free_q_node *node, *next;
+
+	for (node = head->first; node; node = next) {
+		next = node->next;
+		if (node->cachep)
+			kmem_cache_free(node->cachep, node);
+		else
+			kfree(node);
+	}
+}
+EXPORT_SYMBOL_GPL(kmem_free_up_q);
+
 /* Tracepoints definitions. */
 EXPORT_TRACEPOINT_SYMBOL(kmalloc);
 EXPORT_TRACEPOINT_SYMBOL(kmem_cache_alloc);
-- 
2.18.1



* [PATCH 2/4] signal: Make flush_sigqueue() use free_q to release memory
  2019-03-21 21:45 [PATCH 0/4] Signal: Fix hard lockup problem in flush_sigqueue() Waiman Long
  2019-03-21 21:45 ` [PATCH 1/4] mm: Implement kmem objects freeing queue Waiman Long
@ 2019-03-21 21:45 ` Waiman Long
  2019-03-22  1:52   ` Matthew Wilcox
  2019-03-21 21:45 ` [PATCH 3/4] signal: Add free_uid_to_q() Waiman Long
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 22+ messages in thread
From: Waiman Long @ 2019-03-21 21:45 UTC (permalink / raw)
  To: Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim
  Cc: linux-kernel, linux-mm, selinux, Paul Moore, Stephen Smalley,
	Eric Paris, Peter Zijlstra (Intel),
	Oleg Nesterov, Waiman Long

It was found that if a process had many pending signals (e.g. millions),
the act of exiting that process might cause its parent to hit a hard
lockup, especially on a debug kernel with features like KASAN enabled.
This was because flush_sigqueue() was called in release_task() with
tasklist_lock held and irqs disabled.

  [ 3133.105601] NMI watchdog: Watchdog detected hard LOCKUP on cpu 37
    :
  [ 3133.105709] CPU: 37 PID: 11200 Comm: bash Kdump: loaded Not tainted 4.18.0-80.el8.x86_64+debug #1
    :
  [ 3133.105750]  slab_free_freelist_hook+0xa0/0x120
  [ 3133.105758]  kmem_cache_free+0x9d/0x310
  [ 3133.105762]  flush_sigqueue+0x12b/0x1d0
  [ 3133.105766]  release_task.part.14+0xaf7/0x1520
  [ 3133.105784]  wait_consider_task+0x28da/0x3350
  [ 3133.105804]  do_wait+0x3eb/0x8c0
  [ 3133.105819]  kernel_wait4+0xe4/0x1b0
  [ 3133.105834]  __do_sys_wait4+0xe0/0xf0
  [ 3133.105864]  do_syscall_64+0xa5/0x4a0
  [ 3133.105868]  entry_SYSCALL_64_after_hwframe+0x6a/0xdf

[ All the "?" stack trace entries were removed from above. ]

To avoid this dire condition and reduce lock hold time of tasklist_lock,
flush_sigqueue() is modified to pass in a freeing queue pointer so that
the actual freeing of memory objects can be deferred until after the
tasklist_lock is released and irq re-enabled.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/signal.h   |  4 +++-
 kernel/exit.c            | 12 ++++++++----
 kernel/signal.c          | 27 ++++++++++++++++-----------
 security/selinux/hooks.c |  8 ++++++--
 4 files changed, 33 insertions(+), 18 deletions(-)

diff --git a/include/linux/signal.h b/include/linux/signal.h
index 9702016734b1..a9562e502122 100644
--- a/include/linux/signal.h
+++ b/include/linux/signal.h
@@ -5,6 +5,7 @@
 #include <linux/bug.h>
 #include <linux/signal_types.h>
 #include <linux/string.h>
+#include <linux/slab.h>
 
 struct task_struct;
 
@@ -254,7 +255,8 @@ static inline void init_sigpending(struct sigpending *sig)
 	INIT_LIST_HEAD(&sig->list);
 }
 
-extern void flush_sigqueue(struct sigpending *queue);
+extern void flush_sigqueue(struct sigpending *queue,
+			   struct kmem_free_q_head *head);
 
 /* Test if 'sig' is valid signal. Use this instead of testing _NSIG directly */
 static inline int valid_signal(unsigned long sig)
diff --git a/kernel/exit.c b/kernel/exit.c
index 2166c2d92ddc..ee707a63edfd 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -88,7 +88,8 @@ static void __unhash_process(struct task_struct *p, bool group_dead)
 /*
  * This function expects the tasklist_lock write-locked.
  */
-static void __exit_signal(struct task_struct *tsk)
+static void __exit_signal(struct task_struct *tsk,
+			  struct kmem_free_q_head *free_q)
 {
 	struct signal_struct *sig = tsk->signal;
 	bool group_dead = thread_group_leader(tsk);
@@ -160,14 +161,14 @@ static void __exit_signal(struct task_struct *tsk)
 	 * Do this under ->siglock, we can race with another thread
 	 * doing sigqueue_free() if we have SIGQUEUE_PREALLOC signals.
 	 */
-	flush_sigqueue(&tsk->pending);
+	flush_sigqueue(&tsk->pending, free_q);
 	tsk->sighand = NULL;
 	spin_unlock(&sighand->siglock);
 
 	__cleanup_sighand(sighand);
 	clear_tsk_thread_flag(tsk, TIF_SIGPENDING);
 	if (group_dead) {
-		flush_sigqueue(&sig->shared_pending);
+		flush_sigqueue(&sig->shared_pending, free_q);
 		tty_kref_put(tty);
 	}
 }
@@ -186,6 +187,8 @@ void release_task(struct task_struct *p)
 {
 	struct task_struct *leader;
 	int zap_leader;
+	DEFINE_KMEM_FREE_Q(free_q);
+
 repeat:
 	/* don't need to get the RCU readlock here - the process is dead and
 	 * can't be modifying its own credentials. But shut RCU-lockdep up */
@@ -197,7 +200,7 @@ void release_task(struct task_struct *p)
 
 	write_lock_irq(&tasklist_lock);
 	ptrace_release_task(p);
-	__exit_signal(p);
+	__exit_signal(p, &free_q);
 
 	/*
 	 * If we are the last non-leader member of the thread
@@ -219,6 +222,7 @@ void release_task(struct task_struct *p)
 	}
 
 	write_unlock_irq(&tasklist_lock);
+	kmem_free_up_q(&free_q);
 	cgroup_release(p);
 	release_thread(p);
 	call_rcu(&p->rcu, delayed_put_task_struct);
diff --git a/kernel/signal.c b/kernel/signal.c
index b7953934aa99..04fb202c16bd 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -435,16 +435,19 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t flags, int override_rlimi
 	return q;
 }
 
-static void __sigqueue_free(struct sigqueue *q)
+static void __sigqueue_free(struct sigqueue *q, struct kmem_free_q_head *free_q)
 {
 	if (q->flags & SIGQUEUE_PREALLOC)
 		return;
 	atomic_dec(&q->user->sigpending);
 	free_uid(q->user);
-	kmem_cache_free(sigqueue_cachep, q);
+	if (free_q)
+		kmem_free_q_add(free_q, sigqueue_cachep, q);
+	else
+		kmem_cache_free(sigqueue_cachep, q);
 }
 
-void flush_sigqueue(struct sigpending *queue)
+void flush_sigqueue(struct sigpending *queue, struct kmem_free_q_head *free_q)
 {
 	struct sigqueue *q;
 
@@ -452,7 +455,7 @@ void flush_sigqueue(struct sigpending *queue)
 	while (!list_empty(&queue->list)) {
 		q = list_entry(queue->list.next, struct sigqueue , list);
 		list_del_init(&q->list);
-		__sigqueue_free(q);
+		__sigqueue_free(q, free_q);
 	}
 }
 
@@ -462,12 +465,14 @@ void flush_sigqueue(struct sigpending *queue)
 void flush_signals(struct task_struct *t)
 {
 	unsigned long flags;
+	DEFINE_KMEM_FREE_Q(free_q);
 
 	spin_lock_irqsave(&t->sighand->siglock, flags);
 	clear_tsk_thread_flag(t, TIF_SIGPENDING);
-	flush_sigqueue(&t->pending);
-	flush_sigqueue(&t->signal->shared_pending);
+	flush_sigqueue(&t->pending, &free_q);
+	flush_sigqueue(&t->signal->shared_pending, &free_q);
 	spin_unlock_irqrestore(&t->sighand->siglock, flags);
+	kmem_free_up_q(&free_q);
 }
 EXPORT_SYMBOL(flush_signals);
 
@@ -488,7 +493,7 @@ static void __flush_itimer_signals(struct sigpending *pending)
 		} else {
 			sigdelset(&signal, sig);
 			list_del_init(&q->list);
-			__sigqueue_free(q);
+			__sigqueue_free(q, NULL);
 		}
 	}
 
@@ -580,7 +585,7 @@ static void collect_signal(int sig, struct sigpending *list, kernel_siginfo_t *i
 			(info->si_code == SI_TIMER) &&
 			(info->si_sys_private);
 
-		__sigqueue_free(first);
+		__sigqueue_free(first, NULL);
 	} else {
 		/*
 		 * Ok, it wasn't in the queue.  This must be
@@ -728,7 +733,7 @@ static int dequeue_synchronous_signal(kernel_siginfo_t *info)
 still_pending:
 	list_del_init(&sync->list);
 	copy_siginfo(info, &sync->info);
-	__sigqueue_free(sync);
+	__sigqueue_free(sync, NULL);
 	return info->si_signo;
 }
 
@@ -776,7 +781,7 @@ static void flush_sigqueue_mask(sigset_t *mask, struct sigpending *s)
 	list_for_each_entry_safe(q, n, &s->list, list) {
 		if (sigismember(mask, q->info.si_signo)) {
 			list_del_init(&q->list);
-			__sigqueue_free(q);
+			__sigqueue_free(q, NULL);
 		}
 	}
 }
@@ -1749,7 +1754,7 @@ void sigqueue_free(struct sigqueue *q)
 	spin_unlock_irqrestore(lock, flags);
 
 	if (q)
-		__sigqueue_free(q);
+		__sigqueue_free(q, NULL);
 }
 
 int send_sigqueue(struct sigqueue *q, struct pid *pid, enum pid_type type)
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 1d0b37af2444..8ca571a0b2ac 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2548,6 +2548,8 @@ static void selinux_bprm_committed_creds(struct linux_binprm *bprm)
 	rc = avc_has_perm(&selinux_state,
 			  osid, sid, SECCLASS_PROCESS, PROCESS__SIGINH, NULL);
 	if (rc) {
+		DEFINE_KMEM_FREE_Q(free_q);
+
 		if (IS_ENABLED(CONFIG_POSIX_TIMERS)) {
 			memset(&itimer, 0, sizeof itimer);
 			for (i = 0; i < 3; i++)
@@ -2555,13 +2557,15 @@ static void selinux_bprm_committed_creds(struct linux_binprm *bprm)
 		}
 		spin_lock_irq(&current->sighand->siglock);
 		if (!fatal_signal_pending(current)) {
-			flush_sigqueue(&current->pending);
-			flush_sigqueue(&current->signal->shared_pending);
+			flush_sigqueue(&current->pending, &free_q);
+			flush_sigqueue(&current->signal->shared_pending,
+				       &free_q);
 			flush_signal_handlers(current, 1);
 			sigemptyset(&current->blocked);
 			recalc_sigpending();
 		}
 		spin_unlock_irq(&current->sighand->siglock);
+		kmem_free_up_q(&free_q);
 	}
 
 	/* Wake up the parent if it is waiting so that it can recheck
-- 
2.18.1



* [PATCH 3/4] signal: Add free_uid_to_q()
  2019-03-21 21:45 [PATCH 0/4] Signal: Fix hard lockup problem in flush_sigqueue() Waiman Long
  2019-03-21 21:45 ` [PATCH 1/4] mm: Implement kmem objects freeing queue Waiman Long
  2019-03-21 21:45 ` [PATCH 2/4] signal: Make flush_sigqueue() use free_q to release memory Waiman Long
@ 2019-03-21 21:45 ` Waiman Long
  2019-03-21 21:45 ` [PATCH 4/4] mm: Do periodic rescheduling when freeing objects in kmem_free_up_q() Waiman Long
  2019-03-22 10:15 ` [PATCH 0/4] Signal: Fix hard lockup problem in flush_sigqueue() Matthew Wilcox
  4 siblings, 0 replies; 22+ messages in thread
From: Waiman Long @ 2019-03-21 21:45 UTC (permalink / raw)
  To: Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim
  Cc: linux-kernel, linux-mm, selinux, Paul Moore, Stephen Smalley,
	Eric Paris, Peter Zijlstra (Intel),
	Oleg Nesterov, Waiman Long

Add a new free_uid_to_q() function to put the user structure on a
freeing queue instead of freeing it directly. The new function is then
called from __sigqueue_free() with its free_q parameter.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/sched/user.h |  3 +++
 kernel/signal.c            |  2 +-
 kernel/user.c              | 17 +++++++++++++----
 3 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index c7b5f86b91a1..77f28d5cb940 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -63,6 +63,9 @@ static inline struct user_struct *get_uid(struct user_struct *u)
 	refcount_inc(&u->__count);
 	return u;
 }
+
+struct kmem_free_q_head;
 extern void free_uid(struct user_struct *);
+extern void free_uid_to_q(struct user_struct *u, struct kmem_free_q_head *q);
 
 #endif /* _LINUX_SCHED_USER_H */
diff --git a/kernel/signal.c b/kernel/signal.c
index 04fb202c16bd..2ecb23b540eb 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -440,7 +440,7 @@ static void __sigqueue_free(struct sigqueue *q, struct kmem_free_q_head *free_q)
 	if (q->flags & SIGQUEUE_PREALLOC)
 		return;
 	atomic_dec(&q->user->sigpending);
-	free_uid(q->user);
+	free_uid_to_q(q->user, free_q);
 	if (free_q)
 		kmem_free_q_add(free_q, sigqueue_cachep, q);
 	else
diff --git a/kernel/user.c b/kernel/user.c
index 0df9b1640b2a..d92629bae546 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -135,14 +135,18 @@ static struct user_struct *uid_hash_find(kuid_t uid, struct hlist_head *hashent)
  * IRQ state (as stored in flags) is restored and uidhash_lock released
  * upon function exit.
  */
-static void free_user(struct user_struct *up, unsigned long flags)
+static void free_user(struct user_struct *up, unsigned long flags,
+		      struct kmem_free_q_head *free_q)
 	__releases(&uidhash_lock)
 {
 	uid_hash_remove(up);
 	spin_unlock_irqrestore(&uidhash_lock, flags);
 	key_put(up->uid_keyring);
 	key_put(up->session_keyring);
-	kmem_cache_free(uid_cachep, up);
+	if (free_q)
+		kmem_free_q_add(free_q, uid_cachep, up);
+	else
+		kmem_cache_free(uid_cachep, up);
 }
 
 /*
@@ -162,7 +166,7 @@ struct user_struct *find_user(kuid_t uid)
 	return ret;
 }
 
-void free_uid(struct user_struct *up)
+void free_uid_to_q(struct user_struct *up, struct kmem_free_q_head *free_q)
 {
 	unsigned long flags;
 
@@ -170,7 +174,12 @@ void free_uid(struct user_struct *up)
 		return;
 
 	if (refcount_dec_and_lock_irqsave(&up->__count, &uidhash_lock, &flags))
-		free_user(up, flags);
+		free_user(up, flags, free_q);
+}
+
+void free_uid(struct user_struct *up)
+{
+	free_uid_to_q(up, NULL);
 }
 
 struct user_struct *alloc_uid(kuid_t uid)
-- 
2.18.1



* [PATCH 4/4] mm: Do periodic rescheduling when freeing objects in kmem_free_up_q()
  2019-03-21 21:45 [PATCH 0/4] Signal: Fix hard lockup problem in flush_sigqueue() Waiman Long
                   ` (2 preceding siblings ...)
  2019-03-21 21:45 ` [PATCH 3/4] signal: Add free_uid_to_q() Waiman Long
@ 2019-03-21 21:45 ` Waiman Long
  2019-03-21 22:00   ` Peter Zijlstra
  2019-03-22 10:15 ` [PATCH 0/4] Signal: Fix hard lockup problem in flush_sigqueue() Matthew Wilcox
  4 siblings, 1 reply; 22+ messages in thread
From: Waiman Long @ 2019-03-21 21:45 UTC (permalink / raw)
  To: Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim
  Cc: linux-kernel, linux-mm, selinux, Paul Moore, Stephen Smalley,
	Eric Paris, Peter Zijlstra (Intel),
	Oleg Nesterov, Waiman Long

If the freeing queue has many objects, freeing all of them consecutively
may cause a soft lockup, especially on a debug kernel. So kmem_free_up_q()
is modified to call cond_resched() when running in process context.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 mm/slab_common.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/mm/slab_common.c b/mm/slab_common.c
index dba20b4208f1..633a1d0f6d20 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1622,11 +1622,14 @@ EXPORT_SYMBOL_GPL(kmem_free_q_add);
  * kmem_free_up_q - free all the objects in the freeing queue
  * @head: freeing queue head
  *
- * Free all the objects in the freeing queue.
+ * Free all the objects in the freeing queue. The caller cannot hold any
+ * non-sleeping locks.
  */
 void kmem_free_up_q(struct kmem_free_q_head *head)
 {
 	struct kmem_free_q_node *node, *next;
+	bool do_resched = !in_irq();
+	int cnt = 0;
 
 	for (node = head->first; node; node = next) {
 		next = node->next;
@@ -1634,6 +1637,12 @@ void kmem_free_up_q(struct kmem_free_q_head *head)
 			kmem_cache_free(node->cachep, node);
 		else
 			kfree(node);
+		/*
+		 * Call cond_resched() every 256 objects freed when in
+		 * process context.
+		 */
+		if (do_resched && !(++cnt & 0xff))
+			cond_resched();
 	}
 }
 EXPORT_SYMBOL_GPL(kmem_free_up_q);
-- 
2.18.1



* Re: [PATCH 4/4] mm: Do periodic rescheduling when freeing objects in kmem_free_up_q()
  2019-03-21 21:45 ` [PATCH 4/4] mm: Do periodic rescheduling when freeing objects in kmem_free_up_q() Waiman Long
@ 2019-03-21 22:00   ` Peter Zijlstra
  2019-03-22 14:35     ` Waiman Long
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2019-03-21 22:00 UTC (permalink / raw)
  To: Waiman Long
  Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, linux-kernel, linux-mm, selinux, Paul Moore,
	Stephen Smalley, Eric Paris, Oleg Nesterov

On Thu, Mar 21, 2019 at 05:45:12PM -0400, Waiman Long wrote:
> If the freeing queue has many objects, freeing all of them consecutively
> may cause soft lockup especially on a debug kernel. So kmem_free_up_q()
> is modified to call cond_resched() if running in the process context.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  mm/slab_common.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index dba20b4208f1..633a1d0f6d20 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -1622,11 +1622,14 @@ EXPORT_SYMBOL_GPL(kmem_free_q_add);
>   * kmem_free_up_q - free all the objects in the freeing queue
>   * @head: freeing queue head
>   *
> - * Free all the objects in the freeing queue.
> + * Free all the objects in the freeing queue. The caller cannot hold any
> + * non-sleeping locks.
>   */
>  void kmem_free_up_q(struct kmem_free_q_head *head)
>  {
>  	struct kmem_free_q_node *node, *next;
> +	bool do_resched = !in_irq();
> +	int cnt = 0;
>  
>  	for (node = head->first; node; node = next) {
>  		next = node->next;
> @@ -1634,6 +1637,12 @@ void kmem_free_up_q(struct kmem_free_q_head *head)
>  			kmem_cache_free(node->cachep, node);
>  		else
>  			kfree(node);
> +		/*
> +		 * Call cond_resched() every 256 objects freed when in
> +		 * process context.
> +		 */
> +		if (do_resched && !(++cnt & 0xff))
> +			cond_resched();

Why not just: cond_resched() ?

>  	}
>  }
>  EXPORT_SYMBOL_GPL(kmem_free_up_q);
> -- 
> 2.18.1
> 



* Re: [PATCH 2/4] signal: Make flush_sigqueue() use free_q to release memory
  2019-03-21 21:45 ` [PATCH 2/4] signal: Make flush_sigqueue() use free_q to release memory Waiman Long
@ 2019-03-22  1:52   ` Matthew Wilcox
  2019-03-22 11:16     ` Oleg Nesterov
  0 siblings, 1 reply; 22+ messages in thread
From: Matthew Wilcox @ 2019-03-22  1:52 UTC (permalink / raw)
  To: Waiman Long
  Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, linux-kernel, linux-mm, selinux, Paul Moore,
	Stephen Smalley, Eric Paris, Peter Zijlstra (Intel),
	Oleg Nesterov

On Thu, Mar 21, 2019 at 05:45:10PM -0400, Waiman Long wrote:
> It was found that if a process had many pending signals (e.g. millions),
> the act of exiting that process might cause its parent to have a hard
> lockup especially on a debug kernel with features like KASAN enabled.
> It was because the flush_sigqueue() was called in release_task() with
> tasklist_lock held and irq disabled.

This rather apocalyptic language is a bit uncalled for.  I appreciate the
warning is scary, but all that's really happening is that the debug code
is taking too long to execute.

> To avoid this dire condition and reduce lock hold time of tasklist_lock,
> flush_sigqueue() is modified to pass in a freeing queue pointer so that
> the actual freeing of memory objects can be deferred until after the
> tasklist_lock is released and irq re-enabled.

I think this is a really bad solution.  It looks kind of generic,
but isn't.  It's terribly inefficient, and all it's really doing is
deferring the debugging code until we've re-enabled interrupts.

We'd be much better off just having a list_head in the caller
and list_splice() the queue->list onto that caller.  Then call
__sigqueue_free() for each signal on the queue.



* Re: [PATCH 0/4] Signal: Fix hard lockup problem in flush_sigqueue()
  2019-03-21 21:45 [PATCH 0/4] Signal: Fix hard lockup problem in flush_sigqueue() Waiman Long
                   ` (3 preceding siblings ...)
  2019-03-21 21:45 ` [PATCH 4/4] mm: Do periodic rescheduling when freeing objects in kmem_free_up_q() Waiman Long
@ 2019-03-22 10:15 ` Matthew Wilcox
  2019-03-22 11:49   ` Oleg Nesterov
  4 siblings, 1 reply; 22+ messages in thread
From: Matthew Wilcox @ 2019-03-22 10:15 UTC (permalink / raw)
  To: Waiman Long
  Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, linux-kernel, linux-mm, selinux, Paul Moore,
	Stephen Smalley, Eric Paris, Peter Zijlstra (Intel),
	Oleg Nesterov

On Thu, Mar 21, 2019 at 05:45:08PM -0400, Waiman Long wrote:
> It was found that if a process has accumulated sufficient number of
> pending signals, the exiting of that process may cause its parent to
> have hard lockup when running on a debug kernel with a slow memory
> freeing path (like with KASAN enabled).

I appreciate these are "reliable" signals, but why do we accumulate so
many signals to a task which will never receive them?  Can we detect at
signal delivery time that the task is going to die and avoid queueing
them in the first place?



* Re: [PATCH 2/4] signal: Make flush_sigqueue() use free_q to release memory
  2019-03-22  1:52   ` Matthew Wilcox
@ 2019-03-22 11:16     ` Oleg Nesterov
  2019-03-22 16:10       ` Waiman Long
  0 siblings, 1 reply; 22+ messages in thread
From: Oleg Nesterov @ 2019-03-22 11:16 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Waiman Long, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, linux-kernel, linux-mm, selinux,
	Paul Moore, Stephen Smalley, Eric Paris, Peter Zijlstra (Intel)

On 03/21, Matthew Wilcox wrote:
>
> On Thu, Mar 21, 2019 at 05:45:10PM -0400, Waiman Long wrote:
>
> > To avoid this dire condition and reduce lock hold time of tasklist_lock,
> > flush_sigqueue() is modified to pass in a freeing queue pointer so that
> > the actual freeing of memory objects can be deferred until after the
> > tasklist_lock is released and irq re-enabled.
>
> I think this is a really bad solution.  It looks kind of generic,
> but isn't.  It's terribly inefficient, and all it's really doing is
> deferring the debugging code until we've re-enabled interrupts.

Agreed.

> We'd be much better off just having a list_head in the caller
> and list_splice() the queue->list onto that caller.  Then call
> __sigqueue_free() for each signal on the queue.

This won't work; note the comment that explains the race with sigqueue_free().

Let me think about it... at least we can do something like

	close_the_race_with_sigqueue_free(struct sigpending *queue)
	{
		struct sigqueue *q, *t;

		list_for_each_entry_safe(q, t, ...) {
			if (q->flags & SIGQUEUE_PREALLOC)
				list_del_init(&q->list);
		}
	}

called with ->siglock held, tasklist_lock is not needed.

After that flush_sigqueue() can be called locklessly in release_task().

I'll try to make the patch tomorrow.

Oleg.



* Re: [PATCH 0/4] Signal: Fix hard lockup problem in flush_sigqueue()
  2019-03-22 10:15 ` [PATCH 0/4] Signal: Fix hard lockup problem in flush_sigqueue() Matthew Wilcox
@ 2019-03-22 11:49   ` Oleg Nesterov
  0 siblings, 0 replies; 22+ messages in thread
From: Oleg Nesterov @ 2019-03-22 11:49 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Waiman Long, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, linux-kernel, linux-mm, selinux,
	Paul Moore, Stephen Smalley, Eric Paris, Peter Zijlstra (Intel)

On 03/22, Matthew Wilcox wrote:
>
> On Thu, Mar 21, 2019 at 05:45:08PM -0400, Waiman Long wrote:
> > It was found that if a process has accumulated sufficient number of
> > pending signals, the exiting of that process may cause its parent to
> > have hard lockup when running on a debug kernel with a slow memory
> > freeing path (like with KASAN enabled).
>
> I appreciate these are "reliable" signals, but why do we accumulate so
> many signals to a task which will never receive them?  Can we detect at
> signal delivery time that the task is going to die and avoid queueing
> them in the first place?

A task can block the signal and accumulate up to RLIMIT_SIGPENDING signals,
then it can exit.

Oleg.



* Re: [PATCH 4/4] mm: Do periodic rescheduling when freeing objects in kmem_free_up_q()
  2019-03-21 22:00   ` Peter Zijlstra
@ 2019-03-22 14:35     ` Waiman Long
  0 siblings, 0 replies; 22+ messages in thread
From: Waiman Long @ 2019-03-22 14:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, linux-kernel, linux-mm, selinux, Paul Moore,
	Stephen Smalley, Eric Paris, Oleg Nesterov

On 03/21/2019 06:00 PM, Peter Zijlstra wrote:
> On Thu, Mar 21, 2019 at 05:45:12PM -0400, Waiman Long wrote:
>> If the freeing queue has many objects, freeing all of them consecutively
>> may cause soft lockup especially on a debug kernel. So kmem_free_up_q()
>> is modified to call cond_resched() if running in the process context.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>  mm/slab_common.c | 11 ++++++++++-
>>  1 file changed, 10 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/slab_common.c b/mm/slab_common.c
>> index dba20b4208f1..633a1d0f6d20 100644
>> --- a/mm/slab_common.c
>> +++ b/mm/slab_common.c
>> @@ -1622,11 +1622,14 @@ EXPORT_SYMBOL_GPL(kmem_free_q_add);
>>   * kmem_free_up_q - free all the objects in the freeing queue
>>   * @head: freeing queue head
>>   *
>> - * Free all the objects in the freeing queue.
>> + * Free all the objects in the freeing queue. The caller cannot hold any
>> + * non-sleeping locks.
>>   */
>>  void kmem_free_up_q(struct kmem_free_q_head *head)
>>  {
>>  	struct kmem_free_q_node *node, *next;
>> +	bool do_resched = !in_irq();
>> +	int cnt = 0;
>>  
>>  	for (node = head->first; node; node = next) {
>>  		next = node->next;
>> @@ -1634,6 +1637,12 @@ void kmem_free_up_q(struct kmem_free_q_head *head)
>>  			kmem_cache_free(node->cachep, node);
>>  		else
>>  			kfree(node);
>> +		/*
>> +		 * Call cond_resched() every 256 objects freed when in
>> +		 * process context.
>> +		 */
>> +		if (do_resched && !(++cnt & 0xff))
>> +			cond_resched();
> Why not just: cond_resched() ?

cond_resched() calls ___might_sleep(), so it is prudent to check for
process context first to avoid an erroneous warning. Yes, I could call
cond_resched() after every free; I added the count just to avoid calling
it too frequently.

Cheers,
Longman



* Re: [PATCH 2/4] signal: Make flush_sigqueue() use free_q to release memory
  2019-03-22 11:16     ` Oleg Nesterov
@ 2019-03-22 16:10       ` Waiman Long
  2019-03-22 17:50         ` Christopher Lameter
  0 siblings, 1 reply; 22+ messages in thread
From: Waiman Long @ 2019-03-22 16:10 UTC (permalink / raw)
  To: Oleg Nesterov, Matthew Wilcox
  Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, linux-kernel, linux-mm, selinux, Paul Moore,
	Stephen Smalley, Eric Paris, Peter Zijlstra (Intel)

On 03/22/2019 07:16 AM, Oleg Nesterov wrote:
> On 03/21, Matthew Wilcox wrote:
>> On Thu, Mar 21, 2019 at 05:45:10PM -0400, Waiman Long wrote:
>>
>>> To avoid this dire condition and reduce lock hold time of tasklist_lock,
>>> flush_sigqueue() is modified to pass in a freeing queue pointer so that
>>> the actual freeing of memory objects can be deferred until after the
>>> tasklist_lock is released and irq re-enabled.
>> I think this is a really bad solution.  It looks kind of generic,
>> but isn't.  It's terribly inefficient, and all it's really doing is
>> deferring the debugging code until we've re-enabled interrupts.
> Agreed.

Thanks for looking into that. As I am not knowledgeable enough about the
signal handling code path, I chose the lowest-risk approach: deferring
memory deallocation until after releasing the tasklist_lock, without
trying to change the code flow.

>> We'd be much better off just having a list_head in the caller
>> and list_splice() the queue->list onto that caller.  Then call
>> __sigqueue_free() for each signal on the queue.
> This won't work, note the comment which explains the race with sigqueue_free().
>
> Let me think about it... at least we can do something like
>
> 	close_the_race_with_sigqueue_free(struct sigpending *queue)
> 	{
> 		struct sigqueue *q, *t;
>
> 		list_for_each_entry_safe(q, t, ...) {
> 			if (q->flags & SIGQUEUE_PREALLOC)
> 				list_del_init(&q->list);
> 		}
> 	}
>
> called with ->siglock held, tasklist_lock is not needed.
>
> After that flush_sigqueue() can be called lockless in release_task().
>
> I'll try to make the patch tomorrow.
>
> Oleg.
>
I am looking forward to it.

Thanks,
Longman




* Re: [PATCH 1/4] mm: Implement kmem objects freeing queue
  2019-03-21 21:45 ` [PATCH 1/4] mm: Implement kmem objects freeing queue Waiman Long
@ 2019-03-22 17:47   ` Christopher Lameter
  0 siblings, 0 replies; 22+ messages in thread
From: Christopher Lameter @ 2019-03-22 17:47 UTC (permalink / raw)
  To: Waiman Long
  Cc: Andrew Morton, Pekka Enberg, David Rientjes, Joonsoo Kim,
	linux-kernel, linux-mm, selinux, Paul Moore, Stephen Smalley,
	Eric Paris, Peter Zijlstra (Intel),
	Oleg Nesterov

On Thu, 21 Mar 2019, Waiman Long wrote:

> When releasing kernel data structures, freeing up the memory
> occupied by those objects is usually the last step. To avoid races,
> the release operation is commonly done with a lock held. However, the
> freeing operations do not need to be under lock, but are in many cases.
>
> In some complex cases where the locks protect many different memory
> objects, that can be a problem especially if some memory debugging
> features like KASAN are enabled. In those cases, freeing memory objects
> under lock can greatly lengthen the lock hold time. This can even lead
> to soft/hard lockups in some extreme cases.
>
> To make it easier to defer freeing memory objects until after unlock,
> a kernel memory freeing queue mechanism is now added. It is modelled
> after the wake_q mechanism for waking up tasks without holding a lock.

It is already pretty easy. You just store the pointer to the slab object
in a local variable, finish all the unlocks and then free the objects.
This is done in numerous places of the kernel.

I fear that the automated mechanism will make the code more difficult to
read and result in a loss of clarity of the sequencing of events in
releasing locks and objects.

Also there is already kfree_rcu which does a similar thing to what you are
proposing here and is used in numerous places.



* Re: [PATCH 2/4] signal: Make flush_sigqueue() use free_q to release memory
  2019-03-22 16:10       ` Waiman Long
@ 2019-03-22 17:50         ` Christopher Lameter
  2019-03-22 18:12           ` Waiman Long
  2019-03-26 13:29           ` Oleg Nesterov
  0 siblings, 2 replies; 22+ messages in thread
From: Christopher Lameter @ 2019-03-22 17:50 UTC (permalink / raw)
  To: Waiman Long
  Cc: Oleg Nesterov, Matthew Wilcox, Andrew Morton, Pekka Enberg,
	David Rientjes, Joonsoo Kim, linux-kernel, linux-mm, selinux,
	Paul Moore, Stephen Smalley, Eric Paris, Peter Zijlstra (Intel)

On Fri, 22 Mar 2019, Waiman Long wrote:

> I am looking forward to it.

There is also already rcu being used in these paths. Would kfree_rcu() not
be enough? It is an established mechanism that is mature and well
understood.



* Re: [PATCH 2/4] signal: Make flush_sigqueue() use free_q to release memory
  2019-03-22 17:50         ` Christopher Lameter
@ 2019-03-22 18:12           ` Waiman Long
  2019-03-22 19:39             ` Christopher Lameter
  2019-03-26 13:29           ` Oleg Nesterov
  1 sibling, 1 reply; 22+ messages in thread
From: Waiman Long @ 2019-03-22 18:12 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Oleg Nesterov, Matthew Wilcox, Andrew Morton, Pekka Enberg,
	David Rientjes, Joonsoo Kim, linux-kernel, linux-mm, selinux,
	Paul Moore, Stephen Smalley, Eric Paris, Peter Zijlstra (Intel)

On 03/22/2019 01:50 PM, Christopher Lameter wrote:
> On Fri, 22 Mar 2019, Waiman Long wrote:
>
>> I am looking forward to it.
> There is also already rcu being used in these paths. Would kfree_rcu() not
> be enough? It is an established mechanism that is mature and well
> understood.
>
In this case, the memory objects are from kmem caches, so they can't be
freed using kfree_rcu().

There is certainly overhead in using kfree_rcu(), or a kfree_rcu()-like
mechanism. Also, I think the actual freeing is done in SoftIRQ context,
which can be a problem if there are too many memory objects to free.

I think what Oleg is trying to do is probably the most efficient way.

Cheers,
Longman




* Re: [PATCH 2/4] signal: Make flush_sigqueue() use free_q to release memory
  2019-03-22 18:12           ` Waiman Long
@ 2019-03-22 19:39             ` Christopher Lameter
  2019-03-22 19:59               ` Matthew Wilcox
  0 siblings, 1 reply; 22+ messages in thread
From: Christopher Lameter @ 2019-03-22 19:39 UTC (permalink / raw)
  To: Waiman Long
  Cc: Oleg Nesterov, Matthew Wilcox, Andrew Morton, Pekka Enberg,
	David Rientjes, Joonsoo Kim, linux-kernel, linux-mm, selinux,
	Paul Moore, Stephen Smalley, Eric Paris, Peter Zijlstra (Intel)

On Fri, 22 Mar 2019, Waiman Long wrote:

> >
> >> I am looking forward to it.
> > There is also already rcu being used in these paths. Would kfree_rcu() not
> > be enough? It is an established mechanism that is mature and well
> > understood.
> >
> In this case, the memory objects are from kmem caches, so they can't be
> freed using kfree_rcu().

Oh they can. kfree() can free memory from any slab cache.

> There is certainly overhead in using kfree_rcu(), or a
> kfree_rcu()-like mechanism. Also, I think the actual freeing is done in
> SoftIRQ context, which can be a problem if there are too many memory
> objects to free.

No, there is none that I am aware of. And this has survived testing of HPC
loads with gazillions of objects that have to be freed from multiple
processors. We really should not rearchitect this stuff... it took us
quite a long time to make this scale well under all loads.

Please use kfree_rcu().




* Re: [PATCH 2/4] signal: Make flush_sigqueue() use free_q to release memory
  2019-03-22 19:39             ` Christopher Lameter
@ 2019-03-22 19:59               ` Matthew Wilcox
  2019-03-25 14:15                 ` Christopher Lameter
  0 siblings, 1 reply; 22+ messages in thread
From: Matthew Wilcox @ 2019-03-22 19:59 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Waiman Long, Oleg Nesterov, Andrew Morton, Pekka Enberg,
	David Rientjes, Joonsoo Kim, linux-kernel, linux-mm, selinux,
	Paul Moore, Stephen Smalley, Eric Paris, Peter Zijlstra (Intel)

On Fri, Mar 22, 2019 at 07:39:31PM +0000, Christopher Lameter wrote:
> On Fri, 22 Mar 2019, Waiman Long wrote:
> 
> > >
> > >> I am looking forward to it.
> > > There is also already rcu being used in these paths. Would kfree_rcu() not
> > > be enough? It is an established mechanism that is mature and well
> > > understood.
> > >
> > In this case, the memory objects are from kmem caches, so they can't be
> > freed using kfree_rcu().
> 
> Oh they can. kfree() can free memory from any slab cache.

Only for SLAB and SLUB.  SLOB requires that you pass a pointer to the
slab cache; it has no way to look up the slab cache from the object.



* Re: [PATCH 2/4] signal: Make flush_sigqueue() use free_q to release memory
  2019-03-22 19:59               ` Matthew Wilcox
@ 2019-03-25 14:15                 ` Christopher Lameter
  2019-03-25 15:26                   ` Matthew Wilcox
  2019-03-26 13:36                   ` Oleg Nesterov
  0 siblings, 2 replies; 22+ messages in thread
From: Christopher Lameter @ 2019-03-25 14:15 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Waiman Long, Oleg Nesterov, Andrew Morton, Pekka Enberg,
	David Rientjes, Joonsoo Kim, linux-kernel, linux-mm, selinux,
	Paul Moore, Stephen Smalley, Eric Paris, Peter Zijlstra (Intel)

On Fri, 22 Mar 2019, Matthew Wilcox wrote:

> On Fri, Mar 22, 2019 at 07:39:31PM +0000, Christopher Lameter wrote:
> > On Fri, 22 Mar 2019, Waiman Long wrote:
> >
> > > >
> > > >> I am looking forward to it.
> > > > There is also already rcu being used in these paths. Would kfree_rcu() not
> > > > be enough? It is an established mechanism that is mature and well
> > > > understood.
> > > >
> > > In this case, the memory objects are from kmem caches, so they can't be
> > > freed using kfree_rcu().
> >
> > Oh they can. kfree() can free memory from any slab cache.
>
> Only for SLAB and SLUB.  SLOB requires that you pass a pointer to the
> slab cache; it has no way to look up the slab cache from the object.

Well then we could either fix SLOB to conform to the others or add a
kmem_cache_free_rcu() variant.



* Re: [PATCH 2/4] signal: Make flush_sigqueue() use free_q to release memory
  2019-03-25 14:15                 ` Christopher Lameter
@ 2019-03-25 15:26                   ` Matthew Wilcox
  2019-03-25 16:16                     ` Christopher Lameter
  2019-03-26 13:36                   ` Oleg Nesterov
  1 sibling, 1 reply; 22+ messages in thread
From: Matthew Wilcox @ 2019-03-25 15:26 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Waiman Long, Oleg Nesterov, Andrew Morton, Pekka Enberg,
	David Rientjes, Joonsoo Kim, linux-kernel, linux-mm, selinux,
	Paul Moore, Stephen Smalley, Eric Paris, Peter Zijlstra (Intel)

On Mon, Mar 25, 2019 at 02:15:25PM +0000, Christopher Lameter wrote:
> On Fri, 22 Mar 2019, Matthew Wilcox wrote:
> 
> > On Fri, Mar 22, 2019 at 07:39:31PM +0000, Christopher Lameter wrote:
> > > On Fri, 22 Mar 2019, Waiman Long wrote:
> > >
> > > > >
> > > > >> I am looking forward to it.
> > > > > There is also already rcu being used in these paths. Would kfree_rcu() not
> > > > > be enough? It is an established mechanism that is mature and well
> > > > > understood.
> > > > >
> > > > In this case, the memory objects are from kmem caches, so they can't be
> > > > freed using kfree_rcu().
> > >
> > > Oh they can. kfree() can free memory from any slab cache.
> >
> > Only for SLAB and SLUB.  SLOB requires that you pass a pointer to the
> > slab cache; it has no way to look up the slab cache from the object.
> 
> Well then we could either fix SLOB to conform to the others or add a
> kmem_cache_free_rcu() variant.

The problem with a kmem_cache_free_rcu() variant is that we now have
three pointers to store -- the object pointer, the slab pointer and the
rcu next pointer.

I spent some time looking at how SLOB might be fixed, and I didn't come
up with a good idea.  Currently SLOB stores the size of the object in the
four bytes before the object, unless the object is "allocated from a slab
cache", in which case the size is taken from the cache pointer instead.
So calling kfree() on a pointer allocated using kmem_cache_alloc() will
cause corruption as it attempts to determine the length of the object.

Options:

1. Dispense with this optimisation and always store the size of the
object before the object.

2. Add a kmem_cache flag that says whether objects in this cache may be
freed with kfree().  Only dispense with this optimisation for slabs
with this flag set.

3. Change SLOB to segregate objects by size.  If someone has gone to
the trouble of creating a kmem_cache, this is a pretty good hint that
there will be a lot of objects of this size allocated, so this could
help SLOB fight fragmentation.

Any other bright ideas?



* Re: [PATCH 2/4] signal: Make flush_sigqueue() use free_q to release memory
  2019-03-25 15:26                   ` Matthew Wilcox
@ 2019-03-25 16:16                     ` Christopher Lameter
  0 siblings, 0 replies; 22+ messages in thread
From: Christopher Lameter @ 2019-03-25 16:16 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Waiman Long, Oleg Nesterov, Andrew Morton, Pekka Enberg,
	David Rientjes, Joonsoo Kim, linux-kernel, linux-mm, selinux,
	Paul Moore, Stephen Smalley, Eric Paris, Peter Zijlstra (Intel)

On Mon, 25 Mar 2019, Matthew Wilcox wrote:

> Options:
>
> 1. Dispense with this optimisation and always store the size of the
> object before the object.

I think that's how SLOB handled it at some point in the past. Let's go back
to that setup so it's compatible with the other allocators?



* Re: [PATCH 2/4] signal: Make flush_sigqueue() use free_q to release memory
  2019-03-22 17:50         ` Christopher Lameter
  2019-03-22 18:12           ` Waiman Long
@ 2019-03-26 13:29           ` Oleg Nesterov
  1 sibling, 0 replies; 22+ messages in thread
From: Oleg Nesterov @ 2019-03-26 13:29 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Waiman Long, Matthew Wilcox, Andrew Morton, Pekka Enberg,
	David Rientjes, Joonsoo Kim, linux-kernel, linux-mm, selinux,
	Paul Moore, Stephen Smalley, Eric Paris, Peter Zijlstra (Intel)

Sorry, I am sick and can't work, hopefully I'll return tomorrow.

On 03/22, Christopher Lameter wrote:
>
> On Fri, 22 Mar 2019, Waiman Long wrote:
>
> > I am looking forward to it.
>
> There is also already rcu being used in these paths. Would kfree_rcu() not
> be enough? It is an established mechanism that is mature and well
> understood.

But why do we want to increase the number of rcu callbacks in flight?

For the moment, let's discuss the exiting tasks only. The only reason why
flush_sigqueue(&tsk->pending) needs spin_lock_irq() is the race with
release_posix_timer()->sigqueue_free() from another thread which can remove
a SIGQUEUE_PREALLOC'ed sigqueue from list. With the simple patch below
flush_sigqueue() can be called lockless with irqs enabled.

However, this change is not enough, we need to do something similar with
do_sigaction()->flush_sigqueue_mask(), and this is less simple.

So I won't really argue with kfree_rcu(), but I am not sure this is the best
option.

Oleg.


--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -85,6 +85,17 @@ static void __unhash_process(struct task_struct *p, bool group_dead)
 	list_del_rcu(&p->thread_node);
 }
 
+// Rename me and move into signal.c
+void remove_prealloced(struct sigpending *queue)
+{
+	struct sigqueue *q, *t;
+
+	list_for_each_entry_safe(q, t, &queue->list, list) {
+		if (q->flags & SIGQUEUE_PREALLOC)
+			list_del_init(&q->list);
+	}
+}
+
 /*
  * This function expects the tasklist_lock write-locked.
  */
@@ -160,16 +171,15 @@ static void __exit_signal(struct task_struct *tsk)
 	 * Do this under ->siglock, we can race with another thread
 	 * doing sigqueue_free() if we have SIGQUEUE_PREALLOC signals.
 	 */
-	flush_sigqueue(&tsk->pending);
+	if (!group_dead)
+		remove_prealloced(&tsk->pending);
 	tsk->sighand = NULL;
 	spin_unlock(&sighand->siglock);
 
 	__cleanup_sighand(sighand);
 	clear_tsk_thread_flag(tsk, TIF_SIGPENDING);
-	if (group_dead) {
-		flush_sigqueue(&sig->shared_pending);
+	if (group_dead)
 		tty_kref_put(tty);
-	}
 }
 
 static void delayed_put_task_struct(struct rcu_head *rhp)
@@ -221,6 +231,11 @@ void release_task(struct task_struct *p)
 	write_unlock_irq(&tasklist_lock);
 	cgroup_release(p);
 	release_thread(p);
+
+	flush_sigqueue(&p->pending);
+	if (thread_group_leader(p))
+		flush_sigqueue(&p->signal->shared_pending);
+
 	call_rcu(&p->rcu, delayed_put_task_struct);
 
 	p = leader;



* Re: [PATCH 2/4] signal: Make flush_sigqueue() use free_q to release memory
  2019-03-25 14:15                 ` Christopher Lameter
  2019-03-25 15:26                   ` Matthew Wilcox
@ 2019-03-26 13:36                   ` Oleg Nesterov
  1 sibling, 0 replies; 22+ messages in thread
From: Oleg Nesterov @ 2019-03-26 13:36 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Matthew Wilcox, Waiman Long, Andrew Morton, Pekka Enberg,
	David Rientjes, Joonsoo Kim, linux-kernel, linux-mm, selinux,
	Paul Moore, Stephen Smalley, Eric Paris, Peter Zijlstra (Intel)

On 03/25, Christopher Lameter wrote:
>
> On Fri, 22 Mar 2019, Matthew Wilcox wrote:
>
> > Only for SLAB and SLUB.  SLOB requires that you pass a pointer to the
> > slab cache; it has no way to look up the slab cache from the object.
>
> Well then we could either fix SLOB to conform to the others or add a
> kmem_cache_free_rcu() variant.

Speaking of struct sigqueue, we can simply use call_rcu(), but see my
previous email; I am not sure this is the best option.

Oleg.



end of thread, other threads:[~2019-03-26 13:36 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-03-21 21:45 [PATCH 0/4] Signal: Fix hard lockup problem in flush_sigqueue() Waiman Long
2019-03-21 21:45 ` [PATCH 1/4] mm: Implement kmem objects freeing queue Waiman Long
2019-03-22 17:47   ` Christopher Lameter
2019-03-21 21:45 ` [PATCH 2/4] signal: Make flush_sigqueue() use free_q to release memory Waiman Long
2019-03-22  1:52   ` Matthew Wilcox
2019-03-22 11:16     ` Oleg Nesterov
2019-03-22 16:10       ` Waiman Long
2019-03-22 17:50         ` Christopher Lameter
2019-03-22 18:12           ` Waiman Long
2019-03-22 19:39             ` Christopher Lameter
2019-03-22 19:59               ` Matthew Wilcox
2019-03-25 14:15                 ` Christopher Lameter
2019-03-25 15:26                   ` Matthew Wilcox
2019-03-25 16:16                     ` Christopher Lameter
2019-03-26 13:36                   ` Oleg Nesterov
2019-03-26 13:29           ` Oleg Nesterov
2019-03-21 21:45 ` [PATCH 3/4] signal: Add free_uid_to_q() Waiman Long
2019-03-21 21:45 ` [PATCH 4/4] mm: Do periodic rescheduling when freeing objects in kmem_free_up_q() Waiman Long
2019-03-21 22:00   ` Peter Zijlstra
2019-03-22 14:35     ` Waiman Long
2019-03-22 10:15 ` [PATCH 0/4] Signal: Fix hard lockup problem in flush_sigqueue() Matthew Wilcox
2019-03-22 11:49   ` Oleg Nesterov
