* [patch V4 0/2] signals: Allow caching one sigqueue object per task
@ 2021-03-22  9:19 Thomas Gleixner
  2021-03-22  9:19 ` [patch V4 1/2] signal: Hand SIGQUEUE_PREALLOC flag to __sigqueue_alloc() Thomas Gleixner
  2021-03-22  9:19 ` [patch V4 2/2] signal: Allow tasks to cache one sigqueue struct Thomas Gleixner
  0 siblings, 2 replies; 9+ messages in thread
From: Thomas Gleixner @ 2021-03-22  9:19 UTC (permalink / raw)
  To: LKML
  Cc: Oleg Nesterov, Sebastian Andrzej Siewior, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Matt Fleming, Eric W. Biederman

This is a follow-up to the V2/V3 submission, which can be found here:

  https://lore.kernel.org/r/20210311132036.228542540@linutronix.de

Sending a signal requires a kmem cache allocation on the sender side, and
the receiver hands the object back to the kmem cache when consuming the
signal.

This works pretty well even for realtime workloads, except when the kmem
cache allocation has to take the slow path, which is rare but happens.

Preempt-RT carries a patch which allows caching one sigqueue object per
task. The object is not preallocated; it is cached when the task receives a
signal, and the cache is freed when the task exits.
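
Conceptually it is a single-slot, lockless stash in task_struct; a minimal
sketch distilled from the patches below (simplified, not the actual code):

	/* One cached sigqueue per task (see patch 2/2): */
	struct task_struct {
		/* ... */
		struct sigqueue *sigqueue_cache;
	};

	/* Producer side: current frees a consumed signal */
	if (!READ_ONCE(current->sigqueue_cache))
		WRITE_ONCE(current->sigqueue_cache, q);
	else
		kmem_cache_free(sigqueue_cachep, q);

	/* Consumer side: sender allocates, holding t->sighand->siglock */
	q = READ_ONCE(t->sigqueue_cache);
	if (!q)
		q = kmem_cache_alloc(sigqueue_cachep, gfp_flags);
	else
		WRITE_ONCE(t->sigqueue_cache, NULL);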

The memory overhead for a standard distro setup is pretty small. After boot
there are fewer than 10 objects cached in about 1500 tasks. The speedup for
sending a signal from a cached sigqueue object is small (~3us per signal)
and almost invisible, but for signal heavy workloads it is definitely
measurable, and for the targeted realtime workloads it solves a real world
latency issue.

Changes vs V2/3:

   - Drop the previous wrapper function and explicitly drop
     the sigqueue cache at the end of __exit_signal() to
     handle the self reaping case correctly

Thanks,

	tglx
---
 include/linux/sched.h  |    1 
 include/linux/signal.h |    1 
 kernel/exit.c          |    1 
 kernel/fork.c          |    1 
 kernel/signal.c        |   55 +++++++++++++++++++++++++++++++++++++++----------
 5 files changed, 48 insertions(+), 11 deletions(-)



* [patch V4 1/2] signal: Hand SIGQUEUE_PREALLOC flag to __sigqueue_alloc()
  2021-03-22  9:19 [patch V4 0/2] signals: Allow caching one sigqueue object per task Thomas Gleixner
@ 2021-03-22  9:19 ` Thomas Gleixner
  2021-04-15  8:37   ` [tip: sched/core] " tip-bot2 for Thomas Gleixner
  2021-03-22  9:19 ` [patch V4 2/2] signal: Allow tasks to cache one sigqueue struct Thomas Gleixner
  1 sibling, 1 reply; 9+ messages in thread
From: Thomas Gleixner @ 2021-03-22  9:19 UTC (permalink / raw)
  To: LKML
  Cc: Oleg Nesterov, Sebastian Andrzej Siewior, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Matt Fleming, Eric W. Biederman

There is no point in having the conditional at the callsite.

Just hand in the allocation mode flag to __sigqueue_alloc() and use it to
initialize sigqueue::flags.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/signal.c |   17 +++++++----------
 1 file changed, 7 insertions(+), 10 deletions(-)

--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -410,7 +410,8 @@ void task_join_group_stop(struct task_st
  *   appropriate lock must be held to stop the target task from exiting
  */
 static struct sigqueue *
-__sigqueue_alloc(int sig, struct task_struct *t, gfp_t flags, int override_rlimit)
+__sigqueue_alloc(int sig, struct task_struct *t, gfp_t gfp_flags,
+		 int override_rlimit, const unsigned int sigqueue_flags)
 {
 	struct sigqueue *q = NULL;
 	struct user_struct *user;
@@ -432,7 +433,7 @@ static struct sigqueue *
 	rcu_read_unlock();
 
 	if (override_rlimit || likely(sigpending <= task_rlimit(t, RLIMIT_SIGPENDING))) {
-		q = kmem_cache_alloc(sigqueue_cachep, flags);
+		q = kmem_cache_alloc(sigqueue_cachep, gfp_flags);
 	} else {
 		print_dropped_signal(sig);
 	}
@@ -442,7 +443,7 @@ static struct sigqueue *
 			free_uid(user);
 	} else {
 		INIT_LIST_HEAD(&q->list);
-		q->flags = 0;
+		q->flags = sigqueue_flags;
 		q->user = user;
 	}
 
@@ -1113,7 +1114,8 @@ static int __send_signal(int sig, struct
 	else
 		override_rlimit = 0;
 
-	q = __sigqueue_alloc(sig, t, GFP_ATOMIC, override_rlimit);
+	q = __sigqueue_alloc(sig, t, GFP_ATOMIC, override_rlimit, 0);
+
 	if (q) {
 		list_add_tail(&q->list, &pending->list);
 		switch ((unsigned long) info) {
@@ -1807,12 +1809,7 @@ EXPORT_SYMBOL(kill_pid);
  */
 struct sigqueue *sigqueue_alloc(void)
 {
-	struct sigqueue *q = __sigqueue_alloc(-1, current, GFP_KERNEL, 0);
-
-	if (q)
-		q->flags |= SIGQUEUE_PREALLOC;
-
-	return q;
+	return __sigqueue_alloc(-1, current, GFP_KERNEL, 0, SIGQUEUE_PREALLOC);
 }
 
 void sigqueue_free(struct sigqueue *q)



* [patch V4 2/2] signal: Allow tasks to cache one sigqueue struct
  2021-03-22  9:19 [patch V4 0/2] signals: Allow caching one sigqueue object per task Thomas Gleixner
  2021-03-22  9:19 ` [patch V4 1/2] signal: Hand SIGQUEUE_PREALLOC flag to __sigqueue_alloc() Thomas Gleixner
@ 2021-03-22  9:19 ` Thomas Gleixner
  2021-03-23 18:04   ` Oleg Nesterov
  1 sibling, 1 reply; 9+ messages in thread
From: Thomas Gleixner @ 2021-03-22  9:19 UTC (permalink / raw)
  To: LKML
  Cc: Oleg Nesterov, Sebastian Andrzej Siewior, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Matt Fleming, Eric W. Biederman

From: Thomas Gleixner <tglx@linutronix.de>

The idea for this originates from the real time tree to make signal
delivery for realtime applications more efficient. In quite a few of these
application scenarios a control task signals workers to start their
computations. There is usually only one signal per worker in flight.  This
works nicely as long as the kmem cache allocations do not hit the slow path
and cause latencies.

To cure this an optimistic caching was introduced (limited to RT tasks)
which allows a task to cache a single sigqueue in a pointer in task_struct
instead of handing it back to the kmem cache after consuming a signal. When
the next signal is sent to the task then the cached sigqueue is used
instead of allocating a new one. This solved the problem for this set of
application scenarios nicely.

The task cache is not preallocated, so the first signal sent to a task
always goes to the kmem cache allocator. The cached sigqueue stays around until the
task exits and is freed when task::sighand is dropped.

After posting this solution for mainline the discussion came up whether
this would be useful in general and should not be limited to realtime
tasks: https://lore.kernel.org/r/m11rcu7nbr.fsf@fess.ebiederm.org

One concern leading to the original limitation was to avoid a large number
of pointlessly cached sigqueues in live tasks. The other concern was
RLIMIT_SIGPENDING, as these cached sigqueues are not accounted for.

The accounting problem is real, but on the other hand slightly academic.
After gathering some statistics it turned out that after boot of a regular
distro install there are fewer than 10 sigqueues cached in ~1500 tasks.

In case of a 'mass fork and fire signal to child' scenario the extra 80
bytes of memory per task are well in the noise of the overall memory
consumption of the fork bomb.

If this should be limited then it would need an extra counter in struct
user, more atomic instructions and a separate rlimit: yet another tunable
which would be mostly unused.

The caching is actually used. After boot and a full kernel compile on a
64-CPU machine with make -j128, the number of 'allocations' looks like this:

  From slab: 	   23996
  From task cache: 52223

I.e. it reduces the number of slab cache operations by ~68%.
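
The percentage is simply the task cache hits over total allocations:

	52223 / (23996 + 52223) = 52223 / 76219 ≈ 0.685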

A typical pattern there is:

<...>-58490 __sigqueue_alloc:  for 58488 from slab ffff8881132df460
<...>-58488 __sigqueue_free:   cache ffff8881132df460
<...>-58488 __sigqueue_alloc:  for 1149 from cache ffff8881103dc550
  bash-1149 exit_task_sighand: free ffff8881132df460
  bash-1149 __sigqueue_free:   cache ffff8881103dc550

The interesting sequence is that the exiting task 58488 grabs the sigqueue
from bash's task cache to signal exit, and bash sticks it back into its own
cache. Lather, rinse and repeat.

The caching is probably not noticeable for the general use case, but the
benefit for latency sensitive applications is clear. While kmem caches
usually just serve from the fast path, slab merging (the default) can,
depending on the usage pattern of the merged slabs, cause occasional slow
path allocations.

The time saved per cached entry is a few microseconds per signal, which is
not relevant for e.g. a kernel build, but for signal heavy workloads it is
measurable.

As there is no real downside to this caching mechanism, making it
unconditionally available is preferred over more conditional code or new
magic tunables.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
V4: Handle the self reaping case correctly (Oleg)

V3: Use READ/WRITE_ONCE() for the cache operations and add commentary
    for it.

V2: Remove the realtime task restriction and get rid of the cmpxchg()
    (Eric, Oleg)
    Add more information to the changelog.
---
 include/linux/sched.h  |    1 +
 include/linux/signal.h |    1 +
 kernel/exit.c          |    1 +
 kernel/fork.c          |    1 +
 kernel/signal.c        |   41 +++++++++++++++++++++++++++++++++++++++--
 5 files changed, 43 insertions(+), 2 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -984,6 +984,7 @@ struct task_struct {
 	/* Signal handlers: */
 	struct signal_struct		*signal;
 	struct sighand_struct __rcu		*sighand;
+	struct sigqueue			*sigqueue_cache;
 	sigset_t			blocked;
 	sigset_t			real_blocked;
 	/* Restored if set_restore_sigmask() was used: */
--- a/include/linux/signal.h
+++ b/include/linux/signal.h
@@ -265,6 +265,7 @@ static inline void init_sigpending(struc
 }
 
 extern void flush_sigqueue(struct sigpending *queue);
+extern void exit_task_sigqueue_cache(struct task_struct *tsk);
 
 /* Test if 'sig' is valid signal. Use this instead of testing _NSIG directly */
 static inline int valid_signal(unsigned long sig)
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -162,6 +162,7 @@ static void __exit_signal(struct task_st
 		flush_sigqueue(&sig->shared_pending);
 		tty_kref_put(tty);
 	}
+	exit_task_sigqueue_cache(tsk);
 }
 
 static void delayed_put_task_struct(struct rcu_head *rhp)
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2003,6 +2003,7 @@ static __latent_entropy struct task_stru
 	spin_lock_init(&p->alloc_lock);
 
 	init_sigpending(&p->pending);
+	p->sigqueue_cache = NULL;
 
 	p->utime = p->stime = p->gtime = 0;
 #ifdef CONFIG_ARCH_HAS_SCALED_CPUTIME
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -433,7 +433,16 @@ static struct sigqueue *
 	rcu_read_unlock();
 
 	if (override_rlimit || likely(sigpending <= task_rlimit(t, RLIMIT_SIGPENDING))) {
-		q = kmem_cache_alloc(sigqueue_cachep, gfp_flags);
+		/*
+		 * Preallocation does not hold sighand::siglock so it can't
+		 * use the cache. The lockless caching requires that only
+		 * one consumer and only one producer run at a time.
+		 */
+		q = READ_ONCE(t->sigqueue_cache);
+		if (!q || sigqueue_flags)
+			q = kmem_cache_alloc(sigqueue_cachep, gfp_flags);
+		else
+			WRITE_ONCE(t->sigqueue_cache, NULL);
 	} else {
 		print_dropped_signal(sig);
 	}
@@ -450,13 +459,41 @@ static struct sigqueue *
 	return q;
 }
 
+static void sigqueue_cache_or_free(struct sigqueue *q, bool cache)
+{
+	/*
+	 * Cache one sigqueue per task. This pairs with the consumer side
+	 * in __sigqueue_alloc() and needs READ/WRITE_ONCE() to prevent the
+	 * compiler from store tearing and to tell KCSAN that the data race
+	 * is intentional when run without holding current->sighand->siglock,
+	 * which is fine as current obviously cannot run __sigqueue_free()
+	 * concurrently.
+	 */
+	if (cache && !READ_ONCE(current->sigqueue_cache))
+		WRITE_ONCE(current->sigqueue_cache, q);
+	else
+		kmem_cache_free(sigqueue_cachep, q);
+}
+
+void exit_task_sigqueue_cache(struct task_struct *tsk)
+{
+	/* Race free because @tsk is mopped up */
+	struct sigqueue *q = tsk->sigqueue_cache;
+
+	if (q) {
+		tsk->sigqueue_cache = NULL;
+		/* If task is self reaping, don't cache it back */
+		sigqueue_cache_or_free(q, tsk != current);
+	}
+}
+
 static void __sigqueue_free(struct sigqueue *q)
 {
 	if (q->flags & SIGQUEUE_PREALLOC)
 		return;
 	if (atomic_dec_and_test(&q->user->sigpending))
 		free_uid(q->user);
-	kmem_cache_free(sigqueue_cachep, q);
+	sigqueue_cache_or_free(q, true);
 }
 
 void flush_sigqueue(struct sigpending *queue)



* Re: [patch V4 2/2] signal: Allow tasks to cache one sigqueue struct
  2021-03-22  9:19 ` [patch V4 2/2] signal: Allow tasks to cache one sigqueue struct Thomas Gleixner
@ 2021-03-23 18:04   ` Oleg Nesterov
  2021-03-23 19:24     ` Thomas Gleixner
  0 siblings, 1 reply; 9+ messages in thread
From: Oleg Nesterov @ 2021-03-23 18:04 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Sebastian Andrzej Siewior, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Matt Fleming,
	Eric W. Biederman

On 03/22, Thomas Gleixner wrote:
>
> +static void sigqueue_cache_or_free(struct sigqueue *q, bool cache)
> +{
> +	/*
> +	 * Cache one sigqueue per task. This pairs with the consumer side
> +	 * in __sigqueue_alloc() and needs READ/WRITE_ONCE() to prevent the
> +	 * compiler from store tearing and to tell KCSAN that the data race
> +	 * is intentional when run without holding current->sighand->siglock,
> +	 * which is fine as current obviously cannot run __sigqueue_free()
> +	 * concurrently.
> +	 */
> +	if (cache && !READ_ONCE(current->sigqueue_cache))
> +		WRITE_ONCE(current->sigqueue_cache, q);
> +	else
> +		kmem_cache_free(sigqueue_cachep, q);
> +}
> +
> +void exit_task_sigqueue_cache(struct task_struct *tsk)
> +{
> +	/* Race free because @tsk is mopped up */
> +	struct sigqueue *q = tsk->sigqueue_cache;
> +
> +	if (q) {
> +		tsk->sigqueue_cache = NULL;
> +		/* If task is self reaping, don't cache it back */
> +		sigqueue_cache_or_free(q, tsk != current);
                                          ^^^^^^^^^^^^^^
Still not right or I am totally confused.

tsk != current can be true if an exiting (and autoreaping) sub-thread
releases its group leader.

IOW. Suppose a process has 2 threads, its parent ignores SIGCHLD.

The group leader L exits. Then its sub-thread T exits too and calls
release_task(T). In this case tsk != current is false.

But after that T calls release_task(L), and there tsk != current is true.
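
Spelled out as a timeline (T is always current, tsk is the release_task()
argument; illustration only):

	T: release_task(T) -> exit_task_sigqueue_cache(T): tsk == current,
	                      so T's own cached sigqueue is freed to the slab
	T: release_task(L) -> exit_task_sigqueue_cache(L): tsk != current,
	                      so L's cached sigqueue is stashed into
	                      T->sigqueue_cache
	T is then reaped   -> nothing frees T->sigqueue_cache again: the
	                      object leaks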

I'd suggest freeing tsk->sigqueue_cache in __exit_signal() unconditionally and
removing the "bool cache" argument from sigqueue_cache_or_free().

Oleg.



* Re: [patch V4 2/2] signal: Allow tasks to cache one sigqueue struct
  2021-03-23 18:04   ` Oleg Nesterov
@ 2021-03-23 19:24     ` Thomas Gleixner
  2021-03-23 21:05       ` [patch V5 " Thomas Gleixner
  0 siblings, 1 reply; 9+ messages in thread
From: Thomas Gleixner @ 2021-03-23 19:24 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: LKML, Sebastian Andrzej Siewior, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Matt Fleming,
	Eric W. Biederman

On Tue, Mar 23 2021 at 19:04, Oleg Nesterov wrote:
> On 03/22, Thomas Gleixner wrote:
>> +static void sigqueue_cache_or_free(struct sigqueue *q, bool cache)
>> +	if (q) {
>> +		tsk->sigqueue_cache = NULL;
>> +		/* If task is self reaping, don't cache it back */
>> +		sigqueue_cache_or_free(q, tsk != current);
>                                           ^^^^^^^^^^^^^^
> Still not right or I am totally confused.
>
> tsk != current can be true if an exiting (and autoreaping) sub-thread
> releases its group leader.
>
> IOW. Suppose a process has 2 threads, its parent ignores SIGCHLD.
>
> The group leader L exits. Then its sub-thread T exits too and calls
> release_task(T). In this case tsk != current is false.
>
> But after that T calls release_task(L), and there tsk != current is true.

Bah. yes.

> I'd suggest freeing tsk->sigqueue_cache in __exit_signal() unconditionally and
> removing the "bool cache" argument from sigqueue_cache_or_free().

That's what you get from trying to be clever, dammit.

Thanks for walking me through the oddities of exit!

       tglx


* [patch V5 2/2] signal: Allow tasks to cache one sigqueue struct
  2021-03-23 19:24     ` Thomas Gleixner
@ 2021-03-23 21:05       ` Thomas Gleixner
  2021-03-24 18:03         ` Oleg Nesterov
  2021-04-15  8:37         ` [tip: sched/core] " tip-bot2 for Thomas Gleixner
  0 siblings, 2 replies; 9+ messages in thread
From: Thomas Gleixner @ 2021-03-23 21:05 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: LKML, Sebastian Andrzej Siewior, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Matt Fleming,
	Eric W. Biederman

The idea for this originates from the real time tree to make signal
delivery for realtime applications more efficient. In quite a few of these
application scenarios a control task signals workers to start their
computations. There is usually only one signal per worker in flight.  This
works nicely as long as the kmem cache allocations do not hit the slow path
and cause latencies.

To cure this an optimistic caching was introduced (limited to RT tasks)
which allows a task to cache a single sigqueue in a pointer in task_struct
instead of handing it back to the kmem cache after consuming a signal. When
the next signal is sent to the task then the cached sigqueue is used
instead of allocating a new one. This solved the problem for this set of
application scenarios nicely.

The task cache is not preallocated, so the first signal sent to a task
always goes to the kmem cache allocator. The cached sigqueue stays around until the
task exits and is freed when task::sighand is dropped.

After posting this solution for mainline the discussion came up whether
this would be useful in general and should not be limited to realtime
tasks: https://lore.kernel.org/r/m11rcu7nbr.fsf@fess.ebiederm.org

One concern leading to the original limitation was to avoid a large number
of pointlessly cached sigqueues in live tasks. The other concern was
RLIMIT_SIGPENDING, as these cached sigqueues are not accounted for.

The accounting problem is real, but on the other hand slightly academic.
After gathering some statistics it turned out that after boot of a regular
distro install there are fewer than 10 sigqueues cached in ~1500 tasks.

In case of a 'mass fork and fire signal to child' scenario the extra 80
bytes of memory per task are well in the noise of the overall memory
consumption of the fork bomb.

If this should be limited then it would need an extra counter in struct
user, more atomic instructions and a separate rlimit: yet another tunable
which would be mostly unused.

The caching is actually used. After boot and a full kernel compile on a
64-CPU machine with make -j128, the number of 'allocations' looks like this:

  From slab: 	   23996
  From task cache: 52223

I.e. it reduces the number of slab cache operations by ~68%.

A typical pattern there is:

<...>-58490 __sigqueue_alloc:  for 58488 from slab ffff8881132df460
<...>-58488 __sigqueue_free:   cache ffff8881132df460
<...>-58488 __sigqueue_alloc:  for 1149 from cache ffff8881103dc550
  bash-1149 exit_task_sighand: free ffff8881132df460
  bash-1149 __sigqueue_free:   cache ffff8881103dc550

The interesting sequence is that the exiting task 58488 grabs the sigqueue
from bash's task cache to signal exit, and bash sticks it back into its own
cache. Lather, rinse and repeat.

The caching is probably not noticeable for the general use case, but the
benefit for latency sensitive applications is clear. While kmem caches
usually just serve from the fast path, slab merging (the default) can,
depending on the usage pattern of the merged slabs, cause occasional slow
path allocations.

The time saved per cached entry is a few microseconds per signal, which is
not relevant for e.g. a kernel build, but for signal heavy workloads it is
measurable.

As there is no real downside to this caching mechanism, making it
unconditionally available is preferred over more conditional code or new
magic tunables.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
V5: The self reaping handling was only mostly correct. Don't try to be
    smart; make it simple _and_ correct (Oleg)

V4: Handle the self reaping case correctly (Oleg)

V3: Use READ/WRITE_ONCE() for the cache operations and add commentary
    for it.

V2: Remove the realtime task restriction and get rid of the cmpxchg()
    (Eric, Oleg)
    Add more information to the changelog.
---
 include/linux/sched.h  |    1 +
 include/linux/signal.h |    1 +
 kernel/exit.c          |    1 +
 kernel/fork.c          |    1 +
 kernel/signal.c        |   44 ++++++++++++++++++++++++++++++++++++++++++--
 5 files changed, 46 insertions(+), 2 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -984,6 +984,7 @@ struct task_struct {
 	/* Signal handlers: */
 	struct signal_struct		*signal;
 	struct sighand_struct __rcu		*sighand;
+	struct sigqueue			*sigqueue_cache;
 	sigset_t			blocked;
 	sigset_t			real_blocked;
 	/* Restored if set_restore_sigmask() was used: */
--- a/include/linux/signal.h
+++ b/include/linux/signal.h
@@ -265,6 +265,7 @@ static inline void init_sigpending(struc
 }
 
 extern void flush_sigqueue(struct sigpending *queue);
+extern void exit_task_sigqueue_cache(struct task_struct *tsk);
 
 /* Test if 'sig' is valid signal. Use this instead of testing _NSIG directly */
 static inline int valid_signal(unsigned long sig)
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -162,6 +162,7 @@ static void __exit_signal(struct task_st
 		flush_sigqueue(&sig->shared_pending);
 		tty_kref_put(tty);
 	}
+	exit_task_sigqueue_cache(tsk);
 }
 
 static void delayed_put_task_struct(struct rcu_head *rhp)
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2003,6 +2003,7 @@ static __latent_entropy struct task_stru
 	spin_lock_init(&p->alloc_lock);
 
 	init_sigpending(&p->pending);
+	p->sigqueue_cache = NULL;
 
 	p->utime = p->stime = p->gtime = 0;
 #ifdef CONFIG_ARCH_HAS_SCALED_CPUTIME
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -434,7 +434,16 @@ static struct sigqueue *
 	rcu_read_unlock();
 
 	if (override_rlimit || likely(sigpending <= task_rlimit(t, RLIMIT_SIGPENDING))) {
-		q = kmem_cache_alloc(sigqueue_cachep, gfp_flags);
+		/*
+		 * Preallocation does not hold sighand::siglock so it can't
+		 * use the cache. The lockless caching requires that only
+		 * one consumer and only one producer run at a time.
+		 */
+		q = READ_ONCE(t->sigqueue_cache);
+		if (!q || sigqueue_flags)
+			q = kmem_cache_alloc(sigqueue_cachep, gfp_flags);
+		else
+			WRITE_ONCE(t->sigqueue_cache, NULL);
 	} else {
 		print_dropped_signal(sig);
 	}
@@ -451,13 +460,44 @@ static struct sigqueue *
 	return q;
 }
 
+void exit_task_sigqueue_cache(struct task_struct *tsk)
+{
+	/* Race free because @tsk is mopped up */
+	struct sigqueue *q = tsk->sigqueue_cache;
+
+	if (q) {
+		tsk->sigqueue_cache = NULL;
+		/*
+		 * Hand it back to the cache as the task might
+		 * be self reaping which would leak the object.
+		 */
+		 kmem_cache_free(sigqueue_cachep, q);
+	}
+}
+
+static void sigqueue_cache_or_free(struct sigqueue *q)
+{
+	/*
+	 * Cache one sigqueue per task. This pairs with the consumer side
+	 * in __sigqueue_alloc() and needs READ/WRITE_ONCE() to prevent the
+	 * compiler from store tearing and to tell KCSAN that the data race
+	 * is intentional when run without holding current->sighand->siglock,
+	 * which is fine as current obviously cannot run __sigqueue_free()
+	 * concurrently.
+	 */
+	if (!READ_ONCE(current->sigqueue_cache))
+		WRITE_ONCE(current->sigqueue_cache, q);
+	else
+		kmem_cache_free(sigqueue_cachep, q);
+}
+
 static void __sigqueue_free(struct sigqueue *q)
 {
 	if (q->flags & SIGQUEUE_PREALLOC)
 		return;
 	if (atomic_dec_and_test(&q->user->sigpending))
 		free_uid(q->user);
-	kmem_cache_free(sigqueue_cachep, q);
+	sigqueue_cache_or_free(q);
 }
 
 void flush_sigqueue(struct sigpending *queue)


* Re: [patch V5 2/2] signal: Allow tasks to cache one sigqueue struct
  2021-03-23 21:05       ` [patch V5 " Thomas Gleixner
@ 2021-03-24 18:03         ` Oleg Nesterov
  2021-04-15  8:37         ` [tip: sched/core] " tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 9+ messages in thread
From: Oleg Nesterov @ 2021-03-24 18:03 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Sebastian Andrzej Siewior, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Matt Fleming,
	Eric W. Biederman

On 03/23, Thomas Gleixner wrote:
>
>  include/linux/sched.h  |    1 +
>  include/linux/signal.h |    1 +
>  kernel/exit.c          |    1 +
>  kernel/fork.c          |    1 +
>  kernel/signal.c        |   44 ++++++++++++++++++++++++++++++++++++++++++--
>  5 files changed, 46 insertions(+), 2 deletions(-)

both patches look good to me, feel free to add

Reviewed-by: Oleg Nesterov <oleg@redhat.com>



* [tip: sched/core] signal: Allow tasks to cache one sigqueue struct
  2021-03-23 21:05       ` [patch V5 " Thomas Gleixner
  2021-03-24 18:03         ` Oleg Nesterov
@ 2021-04-15  8:37         ` tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 9+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2021-04-15  8:37 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Peter Zijlstra (Intel),
	Oleg Nesterov, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     4bad58ebc8bc4f20d89cff95417c9b4674769709
Gitweb:        https://git.kernel.org/tip/4bad58ebc8bc4f20d89cff95417c9b4674769709
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Tue, 23 Mar 2021 22:05:39 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 14 Apr 2021 18:04:08 +02:00

signal: Allow tasks to cache one sigqueue struct

The idea for this originates from the real time tree to make signal
delivery for realtime applications more efficient. In quite a few of these
application scenarios a control task signals workers to start their
computations. There is usually only one signal per worker in flight.  This
works nicely as long as the kmem cache allocations do not hit the slow path
and cause latencies.

To cure this an optimistic caching was introduced (limited to RT tasks)
which allows a task to cache a single sigqueue in a pointer in task_struct
instead of handing it back to the kmem cache after consuming a signal. When
the next signal is sent to the task then the cached sigqueue is used
instead of allocating a new one. This solved the problem for this set of
application scenarios nicely.

The task cache is not preallocated, so the first signal sent to a task
always goes to the kmem cache allocator. The cached sigqueue stays around until the
task exits and is freed when task::sighand is dropped.

After posting this solution for mainline the discussion came up whether
this would be useful in general and should not be limited to realtime
tasks: https://lore.kernel.org/r/m11rcu7nbr.fsf@fess.ebiederm.org

One concern leading to the original limitation was to avoid a large number
of pointlessly cached sigqueues in live tasks. The other concern was
RLIMIT_SIGPENDING, as these cached sigqueues are not accounted for.

The accounting problem is real, but on the other hand slightly academic.
After gathering some statistics it turned out that after boot of a regular
distro install there are fewer than 10 sigqueues cached in ~1500 tasks.

In case of a 'mass fork and fire signal to child' scenario the extra 80
bytes of memory per task are well in the noise of the overall memory
consumption of the fork bomb.

If this should be limited then it would need an extra counter in struct
user, more atomic instructions and a separate rlimit: yet another tunable
which would be mostly unused.

The caching is actually used. After boot and a full kernel compile on a
64-CPU machine with make -j128, the number of 'allocations' looks like this:

  From slab:	   23996
  From task cache: 52223

I.e. it reduces the number of slab cache operations by ~68%.

A typical pattern there is:

<...>-58490 __sigqueue_alloc:  for 58488 from slab ffff8881132df460
<...>-58488 __sigqueue_free:   cache ffff8881132df460
<...>-58488 __sigqueue_alloc:  for 1149 from cache ffff8881103dc550
  bash-1149 exit_task_sighand: free ffff8881132df460
  bash-1149 __sigqueue_free:   cache ffff8881103dc550

The interesting sequence is that the exiting task 58488 grabs the sigqueue
from bash's task cache to signal exit, and bash sticks it back into its own
cache. Lather, rinse and repeat.

The caching is probably not noticeable for the general use case, but the
benefit for latency sensitive applications is clear. While kmem caches
usually just serve from the fast path, slab merging (the default) can,
depending on the usage pattern of the merged slabs, cause occasional slow
path allocations.

The time saved per cached entry is a few microseconds per signal, which is
not relevant for e.g. a kernel build, but for signal heavy workloads it is
measurable.

As there is no real downside to this caching mechanism, making it
unconditionally available is preferred over more conditional code or new
magic tunables.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lkml.kernel.org/r/87sg4lbmxo.fsf@nanos.tec.linutronix.de
---
 include/linux/sched.h  |  1 +-
 include/linux/signal.h |  1 +-
 kernel/exit.c          |  1 +-
 kernel/fork.c          |  1 +-
 kernel/signal.c        | 44 +++++++++++++++++++++++++++++++++++++++--
 5 files changed, 46 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 05572e2..f5ca798 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -984,6 +984,7 @@ struct task_struct {
 	/* Signal handlers: */
 	struct signal_struct		*signal;
 	struct sighand_struct __rcu		*sighand;
+	struct sigqueue			*sigqueue_cache;
 	sigset_t			blocked;
 	sigset_t			real_blocked;
 	/* Restored if set_restore_sigmask() was used: */
diff --git a/include/linux/signal.h b/include/linux/signal.h
index 205526c..c3cbea2 100644
--- a/include/linux/signal.h
+++ b/include/linux/signal.h
@@ -265,6 +265,7 @@ static inline void init_sigpending(struct sigpending *sig)
 }
 
 extern void flush_sigqueue(struct sigpending *queue);
+extern void exit_task_sigqueue_cache(struct task_struct *tsk);
 
 /* Test if 'sig' is valid signal. Use this instead of testing _NSIG directly */
 static inline int valid_signal(unsigned long sig)
diff --git a/kernel/exit.c b/kernel/exit.c
index 04029e3..0596526 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -162,6 +162,7 @@ static void __exit_signal(struct task_struct *tsk)
 		flush_sigqueue(&sig->shared_pending);
 		tty_kref_put(tty);
 	}
+	exit_task_sigqueue_cache(tsk);
 }
 
 static void delayed_put_task_struct(struct rcu_head *rhp)
diff --git a/kernel/fork.c b/kernel/fork.c
index d3171e8..3c43a9f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1995,6 +1995,7 @@ static __latent_entropy struct task_struct *copy_process(
 	spin_lock_init(&p->alloc_lock);
 
 	init_sigpending(&p->pending);
+	p->sigqueue_cache = NULL;
 
 	p->utime = p->stime = p->gtime = 0;
 #ifdef CONFIG_ARCH_HAS_SCALED_CPUTIME
diff --git a/kernel/signal.c b/kernel/signal.c
index 568a2e2..2d9463e 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -433,7 +433,16 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t gfp_flags,
 	rcu_read_unlock();
 
 	if (override_rlimit || likely(sigpending <= task_rlimit(t, RLIMIT_SIGPENDING))) {
-		q = kmem_cache_alloc(sigqueue_cachep, gfp_flags);
+		/*
+		 * Preallocation does not hold sighand::siglock so it can't
+		 * use the cache. The lockless caching requires that only
+		 * one consumer and only one producer run at a time.
+		 */
+		q = READ_ONCE(t->sigqueue_cache);
+		if (!q || sigqueue_flags)
+			q = kmem_cache_alloc(sigqueue_cachep, gfp_flags);
+		else
+			WRITE_ONCE(t->sigqueue_cache, NULL);
 	} else {
 		print_dropped_signal(sig);
 	}
@@ -450,13 +459,44 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t gfp_flags,
 	return q;
 }
 
+void exit_task_sigqueue_cache(struct task_struct *tsk)
+{
+	/* Race free because @tsk is mopped up */
+	struct sigqueue *q = tsk->sigqueue_cache;
+
+	if (q) {
+		tsk->sigqueue_cache = NULL;
+		/*
+		 * Hand it back to the cache as the task might
+		 * be self reaping which would leak the object.
+		 */
+		 kmem_cache_free(sigqueue_cachep, q);
+	}
+}
+
+static void sigqueue_cache_or_free(struct sigqueue *q)
+{
+	/*
+	 * Cache one sigqueue per task. This pairs with the consumer side
+	 * in __sigqueue_alloc() and needs READ/WRITE_ONCE() to prevent the
+	 * compiler from store tearing and to tell KCSAN that the data race
+	 * is intentional when run without holding current->sighand->siglock,
+	 * which is fine as current obviously cannot run __sigqueue_free()
+	 * concurrently.
+	 */
+	if (!READ_ONCE(current->sigqueue_cache))
+		WRITE_ONCE(current->sigqueue_cache, q);
+	else
+		kmem_cache_free(sigqueue_cachep, q);
+}
+
 static void __sigqueue_free(struct sigqueue *q)
 {
 	if (q->flags & SIGQUEUE_PREALLOC)
 		return;
 	if (atomic_dec_and_test(&q->user->sigpending))
 		free_uid(q->user);
-	kmem_cache_free(sigqueue_cachep, q);
+	sigqueue_cache_or_free(q);
 }
 
 void flush_sigqueue(struct sigpending *queue)


* [tip: sched/core] signal: Hand SIGQUEUE_PREALLOC flag to __sigqueue_alloc()
  2021-03-22  9:19 ` [patch V4 1/2] signal: Hand SIGQUEUE_PREALLOC flag to __sigqueue_alloc() Thomas Gleixner
@ 2021-04-15  8:37   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 9+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2021-04-15  8:37 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     69995ebbb9d3717306a165db88a1292b63f77a37
Gitweb:        https://git.kernel.org/tip/69995ebbb9d3717306a165db88a1292b63f77a37
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Mon, 22 Mar 2021 10:19:42 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 14 Apr 2021 18:04:08 +02:00

signal: Hand SIGQUEUE_PREALLOC flag to __sigqueue_alloc()

There is no point in having the conditional at the callsite.

Just hand in the allocation mode flag to __sigqueue_alloc() and use it to
initialize sigqueue::flags.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210322092258.898677147@linutronix.de
---
 kernel/signal.c | 17 +++++++----------
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index ba4d1ef..568a2e2 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -410,7 +410,8 @@ void task_join_group_stop(struct task_struct *task)
  *   appropriate lock must be held to stop the target task from exiting
  */
 static struct sigqueue *
-__sigqueue_alloc(int sig, struct task_struct *t, gfp_t flags, int override_rlimit)
+__sigqueue_alloc(int sig, struct task_struct *t, gfp_t gfp_flags,
+		 int override_rlimit, const unsigned int sigqueue_flags)
 {
 	struct sigqueue *q = NULL;
 	struct user_struct *user;
@@ -432,7 +433,7 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t flags, int override_rlimi
 	rcu_read_unlock();
 
 	if (override_rlimit || likely(sigpending <= task_rlimit(t, RLIMIT_SIGPENDING))) {
-		q = kmem_cache_alloc(sigqueue_cachep, flags);
+		q = kmem_cache_alloc(sigqueue_cachep, gfp_flags);
 	} else {
 		print_dropped_signal(sig);
 	}
@@ -442,7 +443,7 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t flags, int override_rlimi
 			free_uid(user);
 	} else {
 		INIT_LIST_HEAD(&q->list);
-		q->flags = 0;
+		q->flags = sigqueue_flags;
 		q->user = user;
 	}
 
@@ -1113,7 +1114,8 @@ static int __send_signal(int sig, struct kernel_siginfo *info, struct task_struc
 	else
 		override_rlimit = 0;
 
-	q = __sigqueue_alloc(sig, t, GFP_ATOMIC, override_rlimit);
+	q = __sigqueue_alloc(sig, t, GFP_ATOMIC, override_rlimit, 0);
+
 	if (q) {
 		list_add_tail(&q->list, &pending->list);
 		switch ((unsigned long) info) {
@@ -1807,12 +1809,7 @@ EXPORT_SYMBOL(kill_pid);
  */
 struct sigqueue *sigqueue_alloc(void)
 {
-	struct sigqueue *q = __sigqueue_alloc(-1, current, GFP_KERNEL, 0);
-
-	if (q)
-		q->flags |= SIGQUEUE_PREALLOC;
-
-	return q;
+	return __sigqueue_alloc(-1, current, GFP_KERNEL, 0, SIGQUEUE_PREALLOC);
 }
 
 void sigqueue_free(struct sigqueue *q)

