* [PATCH v3] signal: Let tasks cache one sigqueue struct.
From: Sebastian Andrzej Siewior @ 2023-04-06 19:40 UTC
  To: linux-kernel
  Cc: Eric W. Biederman, Ben Segall, Christian Brauner,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Ingo Molnar,
	Juri Lelli, Mel Gorman, Oleg Nesterov, Peter Zijlstra,
	Steven Rostedt, Thomas Gleixner, Valentin Schneider,
	Vincent Guittot

The sigqueue caching originated in the PREEMPT_RT tree. A few of the
applications that were ported to Linux came from OS-9. Sending
notifications from one task to another via a signal was a common
communication model there, so the applications are heavy signal
users. Removing the allocation shortens the critical path, avoids
locks and so lowers the maximal latency of the task while sending a
signal.
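
As an illustration of that model, a minimal user-space sketch (a
hypothetical example, not taken from the OS-9 applications): every
sigqueue(3) call below allocates a struct sigqueue in the kernel,
which is the allocation this patch caches.

  /* Hypothetical sketch: signal-based notification between threads;
   * build with cc -pthread. */
  #define _POSIX_C_SOURCE 200809L
  #include <pthread.h>
  #include <signal.h>
  #include <stdio.h>
  #include <unistd.h>

  #define NOTIFICATIONS 3

  static void *receiver(void *arg)
  {
  	sigset_t set;
  	siginfo_t info;
  	int i;

  	sigemptyset(&set);
  	sigaddset(&set, SIGRTMIN);
  	/* Dequeueing the signal frees (with this patch: caches) the
  	 * sigqueue in collect_signal(). */
  	for (i = 0; i < NOTIFICATIONS; i++)
  		if (sigwaitinfo(&set, &info) == SIGRTMIN)
  			printf("notified: %d\n", info.si_value.sival_int);
  	return NULL;
  }

  int main(void)
  {
  	union sigval v = { .sival_int = 42 };
  	sigset_t set;
  	pthread_t t;
  	int i;

  	sigemptyset(&set);
  	sigaddset(&set, SIGRTMIN);
  	/* Block the signal before creating the receiver so all threads
  	 * inherit the mask and the receiver can collect it. */
  	pthread_sigmask(SIG_BLOCK, &set, NULL);
  	pthread_create(&t, NULL, receiver, NULL);
  	/* Each call allocates a sigqueue via __sigqueue_alloc(). */
  	for (i = 0; i < NOTIFICATIONS; i++)
  		sigqueue(getpid(), SIGRTMIN, v);
  	pthread_join(t, NULL);
  	return 0;
  }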

After the first version was posted, a discussion came up about whether
it would make sense to do this caching unconditionally rather than
restricting it to PREEMPT_RT.

The sigqueue flagged SIGQUEUE_PREALLOC is used by the POSIX timer code.
It is allocated when the timer is initialised and reused during the
lifetime of the timer. The sigqueue is freed once the timer is deleted.
The POSIX timer sigqueue has its own caching and a unique lifetime
pattern and therefore does not fit the generic caching.
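
For contrast, that preallocated sigqueue belongs to the user-visible
POSIX timer API; a minimal sketch of the corresponding user-space side
(a hypothetical example, not part of the patch):

  /* Hypothetical sketch: the per-timer sigqueue is allocated once at
   * timer_create() (SIGQUEUE_PREALLOC), reused on every expiry and
   * freed only by timer_delete(). Older glibc needs -lrt. */
  #define _POSIX_C_SOURCE 199309L
  #include <signal.h>
  #include <time.h>

  static int start_periodic_timer(timer_t *id)
  {
  	struct sigevent sev = {
  		.sigev_notify = SIGEV_SIGNAL,
  		.sigev_signo  = SIGRTMIN,
  	};
  	struct itimerspec its = {
  		.it_value    = { .tv_sec = 1 },
  		.it_interval = { .tv_sec = 1 },
  	};

  	/* Allocates the SIGQUEUE_PREALLOC sigqueue up front. */
  	if (timer_create(CLOCK_MONOTONIC, &sev, id))
  		return -1;
  	/* Every expiry reuses that same sigqueue. */
  	return timer_settime(*id, 0, &its, NULL);
  }

Because that sigqueue lives as long as the timer itself, routing it
through the one-slot cache would mix two different lifetimes.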

In the common case the signal is dequeued and freed in collect_signal().
At this point, the 'current' task receives the signal and its
sighand_struct::siglock is held.

__sigqueue_alloc() is used to allocate a new sigqueue. The task_struct
passed as argument is the task that will receive the signal, and its
sighand_struct::siglock is already held by the caller (the exception
is the SIGQUEUE_PREALLOC allocation, which is ignored by the cache).
A cached sigqueue is used before a new one is allocated.

As a result of this pattern, the task sending a signal uses the cache
of the task that will receive the signal, which in turn caches the
sigqueue again once the signal has been collected.

Numbers from a system boot followed by an allmod kernel build:
Out of 333216 allocations, 194876 (~58%) were served from the cache.
Of all free invocations, 4212 were in a path where caching is not done
and 329002 sigqueues were cached.

Cache the struct sigqueue in collect_signal() and reuse it for the
next allocation. Rely on sighand_struct::siglock for the cache
handling; the lock is held during both allocation and free. The cache
is emptied once the sighand_struct is freed.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 v2…v3: Wrote the patch and its description from scratch without looking
        at the previous version.

 include/linux/sched/signal.h |    2 +
 kernel/fork.c                |   11 +++++++++
 kernel/signal.c              |   48 +++++++++++++++++++++++++++++++++++++++----
 3 files changed, 57 insertions(+), 4 deletions(-)

--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -22,6 +22,7 @@ struct sighand_struct {
 	refcount_t		count;
 	wait_queue_head_t	signalfd_wqh;
 	struct k_sigaction	action[_NSIG];
+	struct sigqueue		*sigqueue_cache;
 };
 
 /*
@@ -349,6 +350,7 @@ extern int send_sig(int, struct task_str
 extern int zap_other_threads(struct task_struct *p);
 extern struct sigqueue *sigqueue_alloc(void);
 extern void sigqueue_free(struct sigqueue *);
+extern void sigqueue_free_cached_entry(struct sigqueue *q);
 extern int send_sigqueue(struct sigqueue *, struct pid *, enum pid_type);
 extern int do_sigaction(int, struct k_sigaction *, struct k_sigaction *);
 
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1661,6 +1661,7 @@ static int copy_sighand(unsigned long cl
 	RCU_INIT_POINTER(tsk->sighand, sig);
 	if (!sig)
 		return -ENOMEM;
+	sig->sigqueue_cache = NULL;
 
 	refcount_set(&sig->count, 1);
 	spin_lock_irq(&current->sighand->siglock);
@@ -1677,7 +1678,17 @@ static int copy_sighand(unsigned long cl
 void __cleanup_sighand(struct sighand_struct *sighand)
 {
 	if (refcount_dec_and_test(&sighand->count)) {
+		struct sigqueue *sigqueue = NULL;
+
 		signalfd_cleanup(sighand);
+		spin_lock_irq(&sighand->siglock);
+		if (sighand->sigqueue_cache) {
+			sigqueue = sighand->sigqueue_cache;
+			sighand->sigqueue_cache = NULL;
+		}
+		spin_unlock_irq(&sighand->siglock);
+
+		sigqueue_free_cached_entry(sigqueue);
 		/*
 		 * sighand_cachep is SLAB_TYPESAFE_BY_RCU so we can free it
 		 * without an RCU grace period, see __lock_task_sighand().
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -432,7 +432,18 @@ static struct sigqueue *
 		return NULL;
 
 	if (override_rlimit || likely(sigpending <= task_rlimit(t, RLIMIT_SIGPENDING))) {
-		q = kmem_cache_alloc(sigqueue_cachep, gfp_flags);
+
+		if (!sigqueue_flags) {
+			struct sighand_struct *sighand = t->sighand;
+
+			lockdep_assert_held(&sighand->siglock);
+			if (sighand->sigqueue_cache) {
+				q = sighand->sigqueue_cache;
+				sighand->sigqueue_cache = NULL;
+			}
+		}
+		if (!q)
+			q = kmem_cache_alloc(sigqueue_cachep, gfp_flags);
 	} else {
 		print_dropped_signal(sig);
 	}
@@ -447,14 +458,43 @@ static struct sigqueue *
 	return q;
 }
 
-static void __sigqueue_free(struct sigqueue *q)
+static bool sigqueue_cleanup_accounting(struct sigqueue *q)
 {
 	if (q->flags & SIGQUEUE_PREALLOC)
-		return;
+		return false;
 	if (q->ucounts) {
 		dec_rlimit_put_ucounts(q->ucounts, UCOUNT_RLIMIT_SIGPENDING);
 		q->ucounts = NULL;
 	}
+	return true;
+}
+
+static void __sigqueue_free(struct sigqueue *q)
+{
+	if (!sigqueue_cleanup_accounting(q))
+		return;
+	kmem_cache_free(sigqueue_cachep, q);
+}
+
+void sigqueue_free_cached_entry(struct sigqueue *q)
+{
+	if (!q)
+		return;
+	kmem_cache_free(sigqueue_cachep, q);
+}
+
+static void sigqueue_cache_or_free(struct sigqueue *q)
+{
+	struct sighand_struct *sighand = current->sighand;
+
+	if (!sigqueue_cleanup_accounting(q))
+		return;
+
+	lockdep_assert_held(&sighand->siglock);
+	if (!sighand->sigqueue_cache) {
+		sighand->sigqueue_cache = q;
+		return;
+	}
 	kmem_cache_free(sigqueue_cachep, q);
 }
 
@@ -594,7 +634,7 @@ static void collect_signal(int sig, stru
 			(info->si_code == SI_TIMER) &&
 			(info->si_sys_private);
 
-		__sigqueue_free(first);
+		sigqueue_cache_or_free(first);
 	} else {
 		/*
 		 * Ok, it wasn't in the queue.  This must be


* [PATCH v4] signal: Let tasks cache one sigqueue struct.
From: Sebastian Andrzej Siewior @ 2023-04-06 20:47 UTC
  To: linux-kernel
  Cc: Eric W. Biederman, Ben Segall, Christian Brauner,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Ingo Molnar,
	Juri Lelli, Mel Gorman, Oleg Nesterov, Peter Zijlstra,
	Steven Rostedt, Thomas Gleixner, Valentin Schneider,
	Vincent Guittot, Linus Torvalds

The sigqueue caching originated in the PREEMPT_RT tree. A few of the
applications that were ported to Linux came from OS-9. Sending
notifications from one task to another via a signal was a common
communication model there, so the applications are heavy signal
users. Removing the allocation shortens the critical path, avoids
locks and so lowers the maximal latency of the task while sending a
signal.

After the first version was posted, a discussion came up about whether
it would make sense to do this caching unconditionally rather than
restricting it to PREEMPT_RT.

The sigqueue flagged SIGQUEUE_PREALLOC is used by the POSIX timer code.
It is allocated when the timer is initialised and reused during the
lifetime of the timer. The sigqueue is freed once the timer is deleted.
The POSIX timer sigqueue has its own caching and a unique lifetime
pattern and therefore does not fit the generic caching.

In the common case the signal is dequeued and freed in collect_signal().
At this point, the 'current' task receives the signal and its
sighand_struct::siglock is held.

__sigqueue_alloc() is used to allocate a new sigqueue. The task_struct
passed as argument is the task that will receive the signal, and its
sighand_struct::siglock is already held by the caller (the exception
is the SIGQUEUE_PREALLOC allocation, which is ignored by the cache).
A cached sigqueue is used before a new one is allocated.

As a result of this pattern, the task sending a signal uses the cache
of the task that will receive the signal, which in turn caches the
sigqueue again once the signal has been collected.

Numbers from a system boot followed by an allmod kernel build:
Out of 333216 allocations, 194876 (~58%) were served from the cache.
Of all free invocations, 4212 were in a path where caching is not done
and 329002 sigqueues were cached.

Cache the struct sigqueue in collect_signal() and reuse it for the
next allocation. Rely on sighand_struct::siglock for the cache
handling; the lock is held during both allocation and free. The cache
is emptied once the sighand_struct is freed.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
v3…v4: Repost with Brauner and Torvalds on Cc.
v2…v3: Wrote the patch and its description from scratch without looking
       at the previous version.

 include/linux/sched/signal.h |    2 +
 kernel/fork.c                |   11 +++++++++
 kernel/signal.c              |   48 +++++++++++++++++++++++++++++++++++++++----
 3 files changed, 57 insertions(+), 4 deletions(-)

--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -22,6 +22,7 @@ struct sighand_struct {
 	refcount_t		count;
 	wait_queue_head_t	signalfd_wqh;
 	struct k_sigaction	action[_NSIG];
+	struct sigqueue		*sigqueue_cache;
 };
 
 /*
@@ -349,6 +350,7 @@ extern int send_sig(int, struct task_str
 extern int zap_other_threads(struct task_struct *p);
 extern struct sigqueue *sigqueue_alloc(void);
 extern void sigqueue_free(struct sigqueue *);
+extern void sigqueue_free_cached_entry(struct sigqueue *q);
 extern int send_sigqueue(struct sigqueue *, struct pid *, enum pid_type);
 extern int do_sigaction(int, struct k_sigaction *, struct k_sigaction *);
 
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1661,6 +1661,7 @@ static int copy_sighand(unsigned long cl
 	RCU_INIT_POINTER(tsk->sighand, sig);
 	if (!sig)
 		return -ENOMEM;
+	sig->sigqueue_cache = NULL;
 
 	refcount_set(&sig->count, 1);
 	spin_lock_irq(&current->sighand->siglock);
@@ -1677,7 +1678,17 @@ static int copy_sighand(unsigned long cl
 void __cleanup_sighand(struct sighand_struct *sighand)
 {
 	if (refcount_dec_and_test(&sighand->count)) {
+		struct sigqueue *sigqueue = NULL;
+
 		signalfd_cleanup(sighand);
+		spin_lock_irq(&sighand->siglock);
+		if (sighand->sigqueue_cache) {
+			sigqueue = sighand->sigqueue_cache;
+			sighand->sigqueue_cache = NULL;
+		}
+		spin_unlock_irq(&sighand->siglock);
+
+		sigqueue_free_cached_entry(sigqueue);
 		/*
 		 * sighand_cachep is SLAB_TYPESAFE_BY_RCU so we can free it
 		 * without an RCU grace period, see __lock_task_sighand().
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -432,7 +432,18 @@ static struct sigqueue *
 		return NULL;
 
 	if (override_rlimit || likely(sigpending <= task_rlimit(t, RLIMIT_SIGPENDING))) {
-		q = kmem_cache_alloc(sigqueue_cachep, gfp_flags);
+
+		if (!sigqueue_flags) {
+			struct sighand_struct *sighand = t->sighand;
+
+			lockdep_assert_held(&sighand->siglock);
+			if (sighand->sigqueue_cache) {
+				q = sighand->sigqueue_cache;
+				sighand->sigqueue_cache = NULL;
+			}
+		}
+		if (!q)
+			q = kmem_cache_alloc(sigqueue_cachep, gfp_flags);
 	} else {
 		print_dropped_signal(sig);
 	}
@@ -447,14 +458,43 @@ static struct sigqueue *
 	return q;
 }
 
-static void __sigqueue_free(struct sigqueue *q)
+static bool sigqueue_cleanup_accounting(struct sigqueue *q)
 {
 	if (q->flags & SIGQUEUE_PREALLOC)
-		return;
+		return false;
 	if (q->ucounts) {
 		dec_rlimit_put_ucounts(q->ucounts, UCOUNT_RLIMIT_SIGPENDING);
 		q->ucounts = NULL;
 	}
+	return true;
+}
+
+static void __sigqueue_free(struct sigqueue *q)
+{
+	if (!sigqueue_cleanup_accounting(q))
+		return;
+	kmem_cache_free(sigqueue_cachep, q);
+}
+
+void sigqueue_free_cached_entry(struct sigqueue *q)
+{
+	if (!q)
+		return;
+	kmem_cache_free(sigqueue_cachep, q);
+}
+
+static void sigqueue_cache_or_free(struct sigqueue *q)
+{
+	struct sighand_struct *sighand = current->sighand;
+
+	if (!sigqueue_cleanup_accounting(q))
+		return;
+
+	lockdep_assert_held(&sighand->siglock);
+	if (!sighand->sigqueue_cache) {
+		sighand->sigqueue_cache = q;
+		return;
+	}
 	kmem_cache_free(sigqueue_cachep, q);
 }
 
@@ -594,7 +634,7 @@ static void collect_signal(int sig, stru
 			(info->si_code == SI_TIMER) &&
 			(info->si_sys_private);
 
-		__sigqueue_free(first);
+		sigqueue_cache_or_free(first);
 	} else {
 		/*
 		 * Ok, it wasn't in the queue.  This must be



* Re: [PATCH v4] signal: Let tasks cache one sigqueue struct.
From: Hillf Danton @ 2023-04-07  0:03 UTC
  To: Sebastian Andrzej Siewior
  Cc: linux-kernel, Eric W. Biederman, Ingo Molnar, Mel Gorman,
	Oleg Nesterov, Peter Zijlstra, Thomas Gleixner, Vincent Guittot,
	linux-mm, Linus Torvalds

On Thu, 6 Apr 2023 22:47:21 +0200 Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> The sigqueue caching originated in the PREEMPT_RT tree. A few of the
> applications that were ported to Linux came from OS-9. Sending
> notifications from one task to another via a signal was a common
> communication model there, so the applications are heavy signal
> users. Removing the allocation shortens the critical path, avoids
> locks and so lowers the maximal latency of the task while sending a
> signal.

This is needed only if slab does not work for you.
Why not make slab work better instead of peeling one more layer off it
just to make signal/RT richer?
> 
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -432,7 +432,18 @@ static struct sigqueue *
>  		return NULL;
>  
>  	if (override_rlimit || likely(sigpending <= task_rlimit(t, RLIMIT_SIGPENDING))) {
> -		q = kmem_cache_alloc(sigqueue_cachep, gfp_flags);
> +
> +		if (!sigqueue_flags) {
> +			struct sighand_struct *sighand = t->sighand;
> +
> +			lockdep_assert_held(&sighand->siglock);
> +			if (sighand->sigqueue_cache) {
> +				q = sighand->sigqueue_cache;
> +				sighand->sigqueue_cache = NULL;
> +			}
> +		}
> +		if (!q)
> +			q = kmem_cache_alloc(sigqueue_cachep, gfp_flags);
>  	} else {
>  		print_dropped_signal(sig);
>  	}



* Re: [PATCH v4] signal: Let tasks cache one sigqueue struct.
From: Matthew Wilcox @ 2023-04-07  3:53 UTC
  To: Hillf Danton
  Cc: Sebastian Andrzej Siewior, linux-kernel, Eric W. Biederman,
	Ingo Molnar, Mel Gorman, Oleg Nesterov, Peter Zijlstra,
	Thomas Gleixner, Vincent Guittot, linux-mm, Linus Torvalds

On Fri, Apr 07, 2023 at 08:03:06AM +0800, Hillf Danton wrote:
> On Thu, 6 Apr 2023 22:47:21 +0200 Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> > The sigqueue caching originated in the PREEMPT_RT tree. A few of the
> > applications that were ported to Linux came from OS-9. Sending
> > notifications from one task to another via a signal was a common
> > communication model there, so the applications are heavy signal
> > users. Removing the allocation shortens the critical path, avoids
> > locks and so lowers the maximal latency of the task while sending a
> > signal.

It might lower the _average_ latency, but it certainly doesn't lower
the _maximum_ latency, because you have to assume worst case scenario
for maximum latency.  Which is that there's no sigqueue cached, so you
have to go into the slab allocator.  And again, worst case scenario is
that you have to go into the page allocator from there, and further that
you have to run reclaim, and ...

What I find odd about the numbers that you quote:

> Numbers from a system boot followed by an allmod kernel build:
> Out of 333216 allocations, 194876 (~58%) were served from the cache.
> Of all free invocations, 4212 were in a path where caching is not done
> and 329002 sigqueues were cached.

... is that there are no absolute numbers.  Does it save 1% of the cost
of sending a signal?  10%?  What does perf say about the cost saved
by no longer going into slab?  Because the fast path in slab is very
fast.  It might even be quicker than your fast path for multithreaded
applications which have threads running on different NUMA nodes.


* Re: [PATCH v4] signal: Let tasks cache one sigqueue struct.
From: Sebastian Andrzej Siewior @ 2023-05-24 15:33 UTC
  To: Matthew Wilcox
  Cc: Hillf Danton, linux-kernel, Eric W. Biederman, Ingo Molnar,
	Mel Gorman, Oleg Nesterov, Peter Zijlstra, Thomas Gleixner,
	Vincent Guittot, linux-mm, Linus Torvalds

On 2023-04-07 04:53:27 [+0100], Matthew Wilcox wrote:
> On Fri, Apr 07, 2023 at 08:03:06AM +0800, Hillf Danton wrote:
> > On Thu, 6 Apr 2023 22:47:21 +0200 Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> > > The sigqueue caching originated in the PREEMPT_RT tree. A few of the
> > > applications that were ported to Linux came from OS-9. Sending
> > > notifications from one task to another via a signal was a common
> > > communication model there, so the applications are heavy signal
> > > users. Removing the allocation shortens the critical path, avoids
> > > locks and so lowers the maximal latency of the task while sending a
> > > signal.
> 
> It might lower the _average_ latency, but it certainly doesn't lower
> the _maximum_ latency, because you have to assume worst case scenario
> for maximum latency.  Which is that there's no sigqueue cached, so you
> have to go into the slab allocator.  And again, worst case scenario is
> that you have to go into the page allocator from there, and further that
> you have to run reclaim, and ...

Yes. The idea is in general not to send more signals in parallel than
the number of available cached slots.

> What I find odd about the numbers that you quote:
> 
> > Numbers from a system boot followed by an allmod kernel build:
> > Out of 333216 allocations, 194876 (~58%) were served from the cache.
> > Of all free invocations, 4212 were in a path where caching is not done
> > and 329002 sigqueues were cached.
> 
> ... is that there are no absolute numbers.  Does it save 1% of the cost
> of sending a signal?  10%?  What does perf say about the cost saved
> by no longer going into slab?  Because the fast path in slab is very
> fast.  It might even be quicker than your fast path for multithreaded
> applications which have threads running on different NUMA nodes.

I asked for updated numbers and the improvement is not as big as
initially reported. There might have been a change in the test
configuration, so the gain is smaller than initially assumed.
I'm sorry, but I didn't get any numbers to back anything up. I'm
dropping the effort here, thanks for the review :)

Sebastian

