linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/5 v3 RESEND] seccomp: add the synchronous mode for seccomp_unotify
@ 2023-01-10 21:30 Andrei Vagin
  2023-01-10 21:30 ` [PATCH 1/5] seccomp: don't use semaphore and wait_queue together Andrei Vagin
                   ` (4 more replies)
  0 siblings, 5 replies; 22+ messages in thread
From: Andrei Vagin @ 2023-01-10 21:30 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Dietmar Eggemann
  Cc: linux-kernel, Kees Cook, Christian Brauner, Andrei Vagin,
	Andy Lutomirski, Juri Lelli, Peter Oskolkov, Tycho Andersen,
	Will Drewry

From: Andrei Vagin <avagin@gmail.com>

seccomp_unotify allows more privileged processes to perform actions on
behalf of less privileged processes.

In many cases, the workflow is fully synchronous: a target process
triggers a system call and passes control to a supervisor process that
handles the system call and returns control back to the target process.
In this context, "synchronous" means that only one process is running
while the other one is waiting.

The new WF_CURRENT_CPU flag advises the scheduler to move the wakee to
the current CPU. For such synchronous workflows, it makes context
switches a few times faster.

Right now, each interaction takes 12µs. With this patch set, it takes
about 3µs.

v2: clean up the first patch and add the test.
v3: update commit messages and apply a few fixes suggested by Kees Cook.

Kees is ready to take this patch set, but wants to get Acks from the
sched folks.

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Peter Oskolkov <posk@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tycho Andersen <tycho@tycho.pizza>
Cc: Will Drewry <wad@chromium.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>

Andrei Vagin (4):
  seccomp: don't use semaphore and wait_queue together
  sched: add a few helpers to wake up tasks on the current cpu
  seccomp: add the synchronous mode for seccomp_unotify
  selftest/seccomp: add a new test for the sync mode of
    seccomp_user_notify

Peter Oskolkov (1):
  sched: add WF_CURRENT_CPU and externise ttwu

 include/linux/completion.h                    |  1 +
 include/linux/swait.h                         |  1 +
 include/linux/wait.h                          |  3 +
 include/uapi/linux/seccomp.h                  |  4 +
 kernel/sched/completion.c                     | 12 +++
 kernel/sched/core.c                           |  5 +-
 kernel/sched/fair.c                           |  4 +
 kernel/sched/sched.h                          | 13 +--
 kernel/sched/swait.c                          | 11 +++
 kernel/sched/wait.c                           |  5 ++
 kernel/seccomp.c                              | 72 +++++++++++++--
 tools/testing/selftests/seccomp/seccomp_bpf.c | 88 +++++++++++++++++++
 12 files changed, 204 insertions(+), 15 deletions(-)

-- 
2.37.2


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 1/5] seccomp: don't use semaphore and wait_queue together
  2023-01-10 21:30 [PATCH 0/5 v3 RESEND] seccomp: add the synchronous mode for seccomp_unotify Andrei Vagin
@ 2023-01-10 21:30 ` Andrei Vagin
  2023-01-12 14:58   ` Tycho Andersen
  2023-01-16  9:48   ` Peter Zijlstra
  2023-01-10 21:30 ` [PATCH 2/5] sched: add WF_CURRENT_CPU and externise ttwu Andrei Vagin
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 22+ messages in thread
From: Andrei Vagin @ 2023-01-10 21:30 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Dietmar Eggemann
  Cc: linux-kernel, Kees Cook, Christian Brauner, Andrei Vagin,
	Andy Lutomirski, Juri Lelli, Peter Oskolkov, Tycho Andersen,
	Will Drewry

From: Andrei Vagin <avagin@gmail.com>

The main reason is to use new wake_up helpers that will be added in the
following patches. But here are a few other reasons:

* If we use two different primitives, we always need to call both of
  them. This patch fixes seccomp_notify_recv, where we forgot to call
  wake_up_poll in the error path.

* If we use one primitive, we can control how many waiters are woken up
  for each request. Our goal is to wake up just one waiter that will
  handle the request. Right now, wake_up_poll can wake up one waiter and
  up(&match->notif->request) can wake up one more.

Signed-off-by: Andrei Vagin <avagin@gmail.com>
---
 kernel/seccomp.c | 41 ++++++++++++++++++++++++++++++++++++-----
 1 file changed, 36 insertions(+), 5 deletions(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index e9852d1b4a5e..876022e9c88c 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -145,7 +145,7 @@ struct seccomp_kaddfd {
  * @notifications: A list of struct seccomp_knotif elements.
  */
 struct notification {
-	struct semaphore request;
+	atomic_t requests;
 	u64 next_id;
 	struct list_head notifications;
 };
@@ -1116,7 +1116,7 @@ static int seccomp_do_user_notification(int this_syscall,
 	list_add_tail(&n.list, &match->notif->notifications);
 	INIT_LIST_HEAD(&n.addfd);
 
-	up(&match->notif->request);
+	atomic_add(1, &match->notif->requests);
 	wake_up_poll(&match->wqh, EPOLLIN | EPOLLRDNORM);
 
 	/*
@@ -1450,6 +1450,37 @@ find_notification(struct seccomp_filter *filter, u64 id)
 	return NULL;
 }
 
+static int recv_wake_function(wait_queue_entry_t *wait, unsigned int mode, int sync,
+				  void *key)
+{
+	/* Avoid a wakeup if event not interesting for us. */
+	if (key && !(key_to_poll(key) & (EPOLLIN | EPOLLERR)))
+		return 0;
+	return autoremove_wake_function(wait, mode, sync, key);
+}
+
+static int recv_wait_event(struct seccomp_filter *filter)
+{
+	DEFINE_WAIT_FUNC(wait, recv_wake_function);
+	int ret;
+
+	if (atomic_add_unless(&filter->notif->requests, -1, 0) != 0)
+		return 0;
+
+	for (;;) {
+		ret = prepare_to_wait_event(&filter->wqh, &wait, TASK_INTERRUPTIBLE);
+
+		if (atomic_add_unless(&filter->notif->requests, -1, 0) != 0)
+			break;
+
+		if (ret)
+			return ret;
+
+		schedule();
+	}
+	finish_wait(&filter->wqh, &wait);
+	return 0;
+}
 
 static long seccomp_notify_recv(struct seccomp_filter *filter,
 				void __user *buf)
@@ -1467,7 +1498,7 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
 
 	memset(&unotif, 0, sizeof(unotif));
 
-	ret = down_interruptible(&filter->notif->request);
+	ret = recv_wait_event(filter);
 	if (ret < 0)
 		return ret;
 
@@ -1515,7 +1546,8 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
 			if (should_sleep_killable(filter, knotif))
 				complete(&knotif->ready);
 			knotif->state = SECCOMP_NOTIFY_INIT;
-			up(&filter->notif->request);
+			atomic_add(1, &filter->notif->requests);
+			wake_up_poll(&filter->wqh, EPOLLIN | EPOLLRDNORM);
 		}
 		mutex_unlock(&filter->notify_lock);
 	}
@@ -1777,7 +1809,6 @@ static struct file *init_listener(struct seccomp_filter *filter)
 	if (!filter->notif)
 		goto out;
 
-	sema_init(&filter->notif->request, 0);
 	filter->notif->next_id = get_random_u64();
 	INIT_LIST_HEAD(&filter->notif->notifications);
 
-- 
2.39.0.314.g84b9a713c41-goog



* [PATCH 2/5] sched: add WF_CURRENT_CPU and externise ttwu
  2023-01-10 21:30 [PATCH 0/5 v3 RESEND] seccomp: add the synchronous mode for seccomp_unotify Andrei Vagin
  2023-01-10 21:30 ` [PATCH 1/5] seccomp: don't use semaphore and wait_queue together Andrei Vagin
@ 2023-01-10 21:30 ` Andrei Vagin
  2023-01-12  7:35   ` Chen Yu
  2023-01-10 21:30 ` [PATCH 3/5] sched: add a few helpers to wake up tasks on the current cpu Andrei Vagin
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 22+ messages in thread
From: Andrei Vagin @ 2023-01-10 21:30 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Dietmar Eggemann
  Cc: linux-kernel, Kees Cook, Christian Brauner, Andrei Vagin,
	Andy Lutomirski, Juri Lelli, Peter Oskolkov, Tycho Andersen,
	Will Drewry

From: Peter Oskolkov <posk@google.com>

Add WF_CURRENT_CPU wake flag that advises the scheduler to
move the wakee to the current CPU. This is useful for fast on-CPU
context switching use cases.

In addition, make ttwu external rather than static so that
the flag could be passed to it from outside of sched/core.c.

Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
---
 kernel/sched/core.c  |  3 +--
 kernel/sched/fair.c  |  4 ++++
 kernel/sched/sched.h | 13 ++++++++-----
 3 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 25b582b6ee5f..6478e819eb99 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4063,8 +4063,7 @@ bool ttwu_state_match(struct task_struct *p, unsigned int state, int *success)
  * Return: %true if @p->state changes (an actual wakeup was done),
  *	   %false otherwise.
  */
-static int
-try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
+int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 {
 	unsigned long flags;
 	int cpu, success = 0;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c36aa54ae071..d6f76bead3c5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7380,6 +7380,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 	if (wake_flags & WF_TTWU) {
 		record_wakee(p);
 
+		if ((wake_flags & WF_CURRENT_CPU) &&
+		    cpumask_test_cpu(cpu, p->cpus_ptr))
+			return cpu;
+
 		if (sched_energy_enabled()) {
 			new_cpu = find_energy_efficient_cpu(p, prev_cpu);
 			if (new_cpu >= 0)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 771f8ddb7053..34b4c54b2a2a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2088,12 +2088,13 @@ static inline int task_on_rq_migrating(struct task_struct *p)
 }
 
 /* Wake flags. The first three directly map to some SD flag value */
-#define WF_EXEC     0x02 /* Wakeup after exec; maps to SD_BALANCE_EXEC */
-#define WF_FORK     0x04 /* Wakeup after fork; maps to SD_BALANCE_FORK */
-#define WF_TTWU     0x08 /* Wakeup;            maps to SD_BALANCE_WAKE */
+#define WF_EXEC         0x02 /* Wakeup after exec; maps to SD_BALANCE_EXEC */
+#define WF_FORK         0x04 /* Wakeup after fork; maps to SD_BALANCE_FORK */
+#define WF_TTWU         0x08 /* Wakeup;            maps to SD_BALANCE_WAKE */
 
-#define WF_SYNC     0x10 /* Waker goes to sleep after wakeup */
-#define WF_MIGRATED 0x20 /* Internal use, task got migrated */
+#define WF_SYNC         0x10 /* Waker goes to sleep after wakeup */
+#define WF_MIGRATED     0x20 /* Internal use, task got migrated */
+#define WF_CURRENT_CPU  0x40 /* Prefer to move the wakee to the current CPU. */
 
 #ifdef CONFIG_SMP
 static_assert(WF_EXEC == SD_BALANCE_EXEC);
@@ -3245,6 +3246,8 @@ static inline bool is_per_cpu_kthread(struct task_struct *p)
 extern void swake_up_all_locked(struct swait_queue_head *q);
 extern void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait);
 
+extern int try_to_wake_up(struct task_struct *tsk, unsigned int state, int wake_flags);
+
 #ifdef CONFIG_PREEMPT_DYNAMIC
 extern int preempt_dynamic_mode;
 extern int sched_dynamic_mode(const char *str);
-- 
2.39.0.314.g84b9a713c41-goog



* [PATCH 3/5] sched: add a few helpers to wake up tasks on the current cpu
  2023-01-10 21:30 [PATCH 0/5 v3 RESEND] seccomp: add the synchronous mode for seccomp_unotify Andrei Vagin
  2023-01-10 21:30 ` [PATCH 1/5] seccomp: don't use semaphore and wait_queue together Andrei Vagin
  2023-01-10 21:30 ` [PATCH 2/5] sched: add WF_CURRENT_CPU and externise ttwu Andrei Vagin
@ 2023-01-10 21:30 ` Andrei Vagin
  2023-01-16  9:59   ` Peter Zijlstra
  2023-01-10 21:30 ` [PATCH 4/5] seccomp: add the synchronous mode for seccomp_unotify Andrei Vagin
  2023-01-10 21:30 ` [PATCH 5/5] selftest/seccomp: add a new test for the sync mode of seccomp_user_notify Andrei Vagin
  4 siblings, 1 reply; 22+ messages in thread
From: Andrei Vagin @ 2023-01-10 21:30 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Dietmar Eggemann
  Cc: linux-kernel, Kees Cook, Christian Brauner, Andrei Vagin,
	Andy Lutomirski, Juri Lelli, Peter Oskolkov, Tycho Andersen,
	Will Drewry

From: Andrei Vagin <avagin@gmail.com>

Add complete_on_current_cpu and wake_up_poll_on_current_cpu helpers to
wake up tasks on the current CPU.

These two helpers are useful when a task needs to make a synchronous
context switch to another task: it wakes up the target task and goes to
sleep right after that.

One example of such a workload is seccomp user notifications. This
mechanism allows a supervisor process to handle system calls on behalf
of a target process. While the supervisor is handling an intercepted
system call, the target process is blocked in the kernel, waiting for a
response.

On-CPU context switches are much faster than regular ones.

Signed-off-by: Andrei Vagin <avagin@gmail.com>
---
 include/linux/completion.h |  1 +
 include/linux/swait.h      |  1 +
 include/linux/wait.h       |  3 +++
 kernel/sched/completion.c  | 12 ++++++++++++
 kernel/sched/core.c        |  2 +-
 kernel/sched/swait.c       | 11 +++++++++++
 kernel/sched/wait.c        |  5 +++++
 7 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/include/linux/completion.h b/include/linux/completion.h
index 62b32b19e0a8..fb2915676574 100644
--- a/include/linux/completion.h
+++ b/include/linux/completion.h
@@ -116,6 +116,7 @@ extern bool try_wait_for_completion(struct completion *x);
 extern bool completion_done(struct completion *x);
 
 extern void complete(struct completion *);
+extern void complete_on_current_cpu(struct completion *x);
 extern void complete_all(struct completion *);
 
 #endif
diff --git a/include/linux/swait.h b/include/linux/swait.h
index 6a8c22b8c2a5..1f27b254adf5 100644
--- a/include/linux/swait.h
+++ b/include/linux/swait.h
@@ -147,6 +147,7 @@ static inline bool swq_has_sleeper(struct swait_queue_head *wq)
 extern void swake_up_one(struct swait_queue_head *q);
 extern void swake_up_all(struct swait_queue_head *q);
 extern void swake_up_locked(struct swait_queue_head *q);
+extern void swake_up_locked_on_current_cpu(struct swait_queue_head *q);
 
 extern void prepare_to_swait_exclusive(struct swait_queue_head *q, struct swait_queue *wait, int state);
 extern long prepare_to_swait_event(struct swait_queue_head *q, struct swait_queue *wait, int state);
diff --git a/include/linux/wait.h b/include/linux/wait.h
index a0307b516b09..5ec7739400f4 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -210,6 +210,7 @@ __remove_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq
 }
 
 int __wake_up(struct wait_queue_head *wq_head, unsigned int mode, int nr, void *key);
+void __wake_up_on_current_cpu(struct wait_queue_head *wq_head, unsigned int mode, void *key);
 void __wake_up_locked_key(struct wait_queue_head *wq_head, unsigned int mode, void *key);
 void __wake_up_locked_key_bookmark(struct wait_queue_head *wq_head,
 		unsigned int mode, void *key, wait_queue_entry_t *bookmark);
@@ -237,6 +238,8 @@ void __wake_up_pollfree(struct wait_queue_head *wq_head);
 #define key_to_poll(m) ((__force __poll_t)(uintptr_t)(void *)(m))
 #define wake_up_poll(x, m)							\
 	__wake_up(x, TASK_NORMAL, 1, poll_to_key(m))
+#define wake_up_poll_on_current_cpu(x, m)					\
+	__wake_up_on_current_cpu(x, TASK_NORMAL, poll_to_key(m))
 #define wake_up_locked_poll(x, m)						\
 	__wake_up_locked_key((x), TASK_NORMAL, poll_to_key(m))
 #define wake_up_interruptible_poll(x, m)					\
diff --git a/kernel/sched/completion.c b/kernel/sched/completion.c
index d57a5c1c1cd9..a1931a79c05a 100644
--- a/kernel/sched/completion.c
+++ b/kernel/sched/completion.c
@@ -38,6 +38,18 @@ void complete(struct completion *x)
 }
 EXPORT_SYMBOL(complete);
 
+void complete_on_current_cpu(struct completion *x)
+{
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&x->wait.lock, flags);
+
+	if (x->done != UINT_MAX)
+		x->done++;
+	swake_up_locked_on_current_cpu(&x->wait);
+	raw_spin_unlock_irqrestore(&x->wait.lock, flags);
+}
+
 /**
  * complete_all: - signals all threads waiting on this completion
  * @x:  holds the state of this particular completion
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6478e819eb99..c81866821139 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6874,7 +6874,7 @@ asmlinkage __visible void __sched preempt_schedule_irq(void)
 int default_wake_function(wait_queue_entry_t *curr, unsigned mode, int wake_flags,
 			  void *key)
 {
-	WARN_ON_ONCE(IS_ENABLED(CONFIG_SCHED_DEBUG) && wake_flags & ~WF_SYNC);
+	WARN_ON_ONCE(IS_ENABLED(CONFIG_SCHED_DEBUG) && wake_flags & ~(WF_SYNC|WF_CURRENT_CPU));
 	return try_to_wake_up(curr->private, mode, wake_flags);
 }
 EXPORT_SYMBOL(default_wake_function);
diff --git a/kernel/sched/swait.c b/kernel/sched/swait.c
index 76b9b796e695..9ebe23868942 100644
--- a/kernel/sched/swait.c
+++ b/kernel/sched/swait.c
@@ -31,6 +31,17 @@ void swake_up_locked(struct swait_queue_head *q)
 }
 EXPORT_SYMBOL(swake_up_locked);
 
+void swake_up_locked_on_current_cpu(struct swait_queue_head *q)
+{
+	struct swait_queue *curr;
+
+	if (list_empty(&q->task_list))
+		return;
+
+	curr = list_first_entry(&q->task_list, typeof(*curr), task_list);
+	try_to_wake_up(curr->task, TASK_NORMAL, WF_CURRENT_CPU);
+	list_del_init(&curr->task_list);
+}
 /*
  * Wake up all waiters. This is an interface which is solely exposed for
  * completions and not for general usage.
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 133b74730738..47803a0b8d5d 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -161,6 +161,11 @@ int __wake_up(struct wait_queue_head *wq_head, unsigned int mode,
 }
 EXPORT_SYMBOL(__wake_up);
 
+void __wake_up_on_current_cpu(struct wait_queue_head *wq_head, unsigned int mode, void *key)
+{
+	__wake_up_common_lock(wq_head, mode, 1, WF_CURRENT_CPU, key);
+}
+
 /*
  * Same as __wake_up but called with the spinlock in wait_queue_head_t held.
  */
-- 
2.39.0.314.g84b9a713c41-goog



* [PATCH 4/5] seccomp: add the synchronous mode for seccomp_unotify
  2023-01-10 21:30 [PATCH 0/5 v3 RESEND] seccomp: add the synchronous mode for seccomp_unotify Andrei Vagin
                   ` (2 preceding siblings ...)
  2023-01-10 21:30 ` [PATCH 3/5] sched: add a few helpers to wake up tasks on the current cpu Andrei Vagin
@ 2023-01-10 21:30 ` Andrei Vagin
  2023-01-12 15:00   ` Tycho Andersen
  2023-01-10 21:30 ` [PATCH 5/5] selftest/seccomp: add a new test for the sync mode of seccomp_user_notify Andrei Vagin
  4 siblings, 1 reply; 22+ messages in thread
From: Andrei Vagin @ 2023-01-10 21:30 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Dietmar Eggemann
  Cc: linux-kernel, Kees Cook, Christian Brauner, Andrei Vagin,
	Andy Lutomirski, Juri Lelli, Peter Oskolkov, Tycho Andersen,
	Will Drewry

From: Andrei Vagin <avagin@gmail.com>

seccomp_unotify allows more privileged processes to perform actions on
behalf of less privileged processes.

In many cases, the workflow is fully synchronous: a target process
triggers a system call and passes control to a supervisor process that
handles the system call and returns control to the target process. In
this context, "synchronous" means that only one process is running
while the other one is waiting.

There is the WF_CURRENT_CPU flag that is used to advise the scheduler to
move the wakee to the current CPU. For such synchronous workflows, it
makes context switches a few times faster.

Right now, each interaction takes 12µs. With this patch, it takes about
3µs.

This change introduces the SECCOMP_USER_NOTIF_FD_SYNC_WAKE_UP flag that
is used to enable the sync mode.

Signed-off-by: Andrei Vagin <avagin@gmail.com>
---
 include/uapi/linux/seccomp.h |  4 ++++
 kernel/seccomp.c             | 31 +++++++++++++++++++++++++++++--
 2 files changed, 33 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index 0fdc6ef02b94..dbfc9b37fcae 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -115,6 +115,8 @@ struct seccomp_notif_resp {
 	__u32 flags;
 };
 
+#define SECCOMP_USER_NOTIF_FD_SYNC_WAKE_UP (1UL << 0)
+
 /* valid flags for seccomp_notif_addfd */
 #define SECCOMP_ADDFD_FLAG_SETFD	(1UL << 0) /* Specify remote fd */
 #define SECCOMP_ADDFD_FLAG_SEND		(1UL << 1) /* Addfd and return it, atomically */
@@ -150,4 +152,6 @@ struct seccomp_notif_addfd {
 #define SECCOMP_IOCTL_NOTIF_ADDFD	SECCOMP_IOW(3, \
 						struct seccomp_notif_addfd)
 
+#define SECCOMP_IOCTL_NOTIF_SET_FLAGS	SECCOMP_IOW(4, __u64)
+
 #endif /* _UAPI_LINUX_SECCOMP_H */
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 876022e9c88c..0a62d44f4898 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -143,9 +143,12 @@ struct seccomp_kaddfd {
  *           filter->notify_lock.
  * @next_id: The id of the next request.
  * @notifications: A list of struct seccomp_knotif elements.
+ * @flags: A set of SECCOMP_USER_NOTIF_FD_* flags.
  */
+
 struct notification {
 	atomic_t requests;
+	u32 flags;
 	u64 next_id;
 	struct list_head notifications;
 };
@@ -1117,7 +1120,10 @@ static int seccomp_do_user_notification(int this_syscall,
 	INIT_LIST_HEAD(&n.addfd);
 
 	atomic_add(1, &match->notif->requests);
-	wake_up_poll(&match->wqh, EPOLLIN | EPOLLRDNORM);
+	if (match->notif->flags & SECCOMP_USER_NOTIF_FD_SYNC_WAKE_UP)
+		wake_up_poll_on_current_cpu(&match->wqh, EPOLLIN | EPOLLRDNORM);
+	else
+		wake_up_poll(&match->wqh, EPOLLIN | EPOLLRDNORM);
 
 	/*
 	 * This is where we wait for a reply from userspace.
@@ -1593,7 +1599,10 @@ static long seccomp_notify_send(struct seccomp_filter *filter,
 	knotif->error = resp.error;
 	knotif->val = resp.val;
 	knotif->flags = resp.flags;
-	complete(&knotif->ready);
+	if (filter->notif->flags & SECCOMP_USER_NOTIF_FD_SYNC_WAKE_UP)
+		complete_on_current_cpu(&knotif->ready);
+	else
+		complete(&knotif->ready);
 out:
 	mutex_unlock(&filter->notify_lock);
 	return ret;
@@ -1623,6 +1632,22 @@ static long seccomp_notify_id_valid(struct seccomp_filter *filter,
 	return ret;
 }
 
+static long seccomp_notify_set_flags(struct seccomp_filter *filter,
+				    unsigned long flags)
+{
+	long ret;
+
+	if (flags & ~SECCOMP_USER_NOTIF_FD_SYNC_WAKE_UP)
+		return -EINVAL;
+
+	ret = mutex_lock_interruptible(&filter->notify_lock);
+	if (ret < 0)
+		return ret;
+	filter->notif->flags = flags;
+	mutex_unlock(&filter->notify_lock);
+	return 0;
+}
+
 static long seccomp_notify_addfd(struct seccomp_filter *filter,
 				 struct seccomp_notif_addfd __user *uaddfd,
 				 unsigned int size)
@@ -1752,6 +1777,8 @@ static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,
 	case SECCOMP_IOCTL_NOTIF_ID_VALID_WRONG_DIR:
 	case SECCOMP_IOCTL_NOTIF_ID_VALID:
 		return seccomp_notify_id_valid(filter, buf);
+	case SECCOMP_IOCTL_NOTIF_SET_FLAGS:
+		return seccomp_notify_set_flags(filter, arg);
 	}
 
 	/* Extensible Argument ioctls */
-- 
2.39.0.314.g84b9a713c41-goog



* [PATCH 5/5] selftest/seccomp: add a new test for the sync mode of seccomp_user_notify
  2023-01-10 21:30 [PATCH 0/5 v3 RESEND] seccomp: add the synchronous mode for seccomp_unotify Andrei Vagin
                   ` (3 preceding siblings ...)
  2023-01-10 21:30 ` [PATCH 4/5] seccomp: add the synchronous mode for seccomp_unotify Andrei Vagin
@ 2023-01-10 21:30 ` Andrei Vagin
  4 siblings, 0 replies; 22+ messages in thread
From: Andrei Vagin @ 2023-01-10 21:30 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Dietmar Eggemann
  Cc: linux-kernel, Kees Cook, Christian Brauner, Andrei Vagin,
	Andy Lutomirski, Juri Lelli, Peter Oskolkov, Tycho Andersen,
	Will Drewry

From: Andrei Vagin <avagin@gmail.com>

Test output:
RUN           global.user_notification_sync ...
seccomp_bpf.c:4279:user_notification_sync:basic: 8655 nsec/syscall
seccomp_bpf.c:4279:user_notification_sync:sync:	 2919 nsec/syscall
OK  global.user_notification_sync

Signed-off-by: Andrei Vagin <avagin@gmail.com>
---
 tools/testing/selftests/seccomp/seccomp_bpf.c | 88 +++++++++++++++++++
 1 file changed, 88 insertions(+)

diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 9c2f448bb3a9..e4207cddd668 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -4243,6 +4243,94 @@ TEST(user_notification_addfd_rlimit)
 	close(memfd);
 }
 
+/* USER_NOTIF_BENCH_TIMEOUT is 100 milliseconds. */
+#define USER_NOTIF_BENCH_TIMEOUT  100000000ULL
+#define NSECS_PER_SEC            1000000000ULL
+
+#ifndef SECCOMP_USER_NOTIF_FD_SYNC_WAKE_UP
+#define SECCOMP_USER_NOTIF_FD_SYNC_WAKE_UP (1UL << 0)
+#define SECCOMP_IOCTL_NOTIF_SET_FLAGS  SECCOMP_IOW(4, __u64)
+#endif
+
+static uint64_t user_notification_sync_loop(struct __test_metadata *_metadata,
+					    char *test_name, int listener)
+{
+	struct timespec ts;
+	uint64_t start, end, nr;
+	struct seccomp_notif req = {};
+	struct seccomp_notif_resp resp = {};
+
+	clock_gettime(CLOCK_MONOTONIC, &ts);
+	start = ts.tv_nsec + ts.tv_sec * NSECS_PER_SEC;
+	for (end = start, nr = 0; end - start < USER_NOTIF_BENCH_TIMEOUT; nr++) {
+		memset(&req, 0, sizeof(req));
+		req.pid = 0;
+		ASSERT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
+
+		ASSERT_EQ(req.data.nr,  __NR_getppid);
+
+		resp.id = req.id;
+		resp.error = 0;
+		resp.val = USER_NOTIF_MAGIC;
+		resp.flags = 0;
+		ASSERT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0);
+
+		clock_gettime(CLOCK_MONOTONIC, &ts);
+		end = ts.tv_nsec + ts.tv_sec * NSECS_PER_SEC;
+	}
+	TH_LOG("%s:\t%lld nsec/syscall", test_name, USER_NOTIF_BENCH_TIMEOUT / nr);
+	return nr;
+}
+
+TEST(user_notification_sync)
+{
+	pid_t pid;
+	long ret;
+	int status, listener;
+	unsigned long calls, sync_calls;
+
+	ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+	ASSERT_EQ(0, ret) {
+		TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!");
+	}
+
+	listener = user_notif_syscall(__NR_getppid,
+				      SECCOMP_FILTER_FLAG_NEW_LISTENER);
+	ASSERT_GE(listener, 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		while (1) {
+			ret = syscall(__NR_getppid);
+			if (ret == USER_NOTIF_MAGIC)
+				continue;
+			break;
+		}
+		_exit(1);
+	}
+
+	calls = user_notification_sync_loop(_metadata, "basic", listener);
+
+	/* Try to set invalid flags. */
+	EXPECT_SYSCALL_RETURN(-EINVAL,
+		ioctl(listener, SECCOMP_IOCTL_NOTIF_SET_FLAGS, 0xffffffff, 0));
+
+	ASSERT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SET_FLAGS,
+			SECCOMP_USER_NOTIF_FD_SYNC_WAKE_UP, 0), 0);
+
+	sync_calls = user_notification_sync_loop(_metadata, "sync", listener);
+
+	EXPECT_GT(sync_calls, calls);
+
+	kill(pid, SIGKILL);
+	ASSERT_EQ(waitpid(pid, &status, 0), pid);
+	ASSERT_EQ(true, WIFSIGNALED(status));
+	ASSERT_EQ(SIGKILL, WTERMSIG(status));
+}
+
+
 /* Make sure PTRACE_O_SUSPEND_SECCOMP requires CAP_SYS_ADMIN. */
 FIXTURE(O_SUSPEND_SECCOMP) {
 	pid_t pid;
-- 
2.39.0.314.g84b9a713c41-goog



* Re: [PATCH 2/5] sched: add WF_CURRENT_CPU and externise ttwu
  2023-01-10 21:30 ` [PATCH 2/5] sched: add WF_CURRENT_CPU and externise ttwu Andrei Vagin
@ 2023-01-12  7:35   ` Chen Yu
  2023-01-13 21:39     ` Andrei Vagin
  0 siblings, 1 reply; 22+ messages in thread
From: Chen Yu @ 2023-01-12  7:35 UTC (permalink / raw)
  To: Andrei Vagin
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
	linux-kernel, Kees Cook, Christian Brauner, Andrei Vagin,
	Andy Lutomirski, Juri Lelli, Peter Oskolkov, Tycho Andersen,
	Will Drewry

On 2023-01-10 at 13:30:07 -0800, Andrei Vagin wrote:
> From: Peter Oskolkov <posk@google.com>
> 
> Add WF_CURRENT_CPU wake flag that advises the scheduler to
> move the wakee to the current CPU. This is useful for fast on-CPU
> context switching use cases.
> 
> In addition, make ttwu external rather than static so that
> the flag could be passed to it from outside of sched/core.c.
> 
> Signed-off-by: Peter Oskolkov <posk@google.com>
> Signed-off-by: Andrei Vagin <avagin@gmail.com>
> @@ -7380,6 +7380,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>  	if (wake_flags & WF_TTWU) {
>  		record_wakee(p);
>  
> +		if ((wake_flags & WF_CURRENT_CPU) &&
> +		    cpumask_test_cpu(cpu, p->cpus_ptr))
> +			return cpu;
I agree that cross-CPU wakeups bring pain to fast context-switching use
cases, especially on systems with a high core count. We suffered from
this issue as well and previously proposed a different approach: a
dynamic "WF_CURRENT_CPU" mechanism[1]. That is, if the waker and the
wakee are both short-duration tasks, the waker wakes up the wakee on the
current CPU. This way, not only seccomp but also other
components/workloads could benefit without having to set the
WF_CURRENT_CPU flag.

Link [1]:
https://lore.kernel.org/lkml/cover.1671158588.git.yu.c.chen@intel.com/

thanks,
Chenyu


* Re: [PATCH 1/5] seccomp: don't use semaphore and wait_queue together
  2023-01-10 21:30 ` [PATCH 1/5] seccomp: don't use semaphore and wait_queue together Andrei Vagin
@ 2023-01-12 14:58   ` Tycho Andersen
  2023-01-13 21:51     ` Andrei Vagin
  2023-01-16  9:48   ` Peter Zijlstra
  1 sibling, 1 reply; 22+ messages in thread
From: Tycho Andersen @ 2023-01-12 14:58 UTC (permalink / raw)
  To: Andrei Vagin
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
	linux-kernel, Kees Cook, Christian Brauner, Andrei Vagin,
	Andy Lutomirski, Juri Lelli, Peter Oskolkov, Will Drewry

On Tue, Jan 10, 2023 at 01:30:06PM -0800, Andrei Vagin wrote:
> From: Andrei Vagin <avagin@gmail.com>
> 
> The main reason is to use new wake_up helpers that will be added in the
> following patches. But here are a few other reasons:
> 
> * If we use two different primitives, we always need to call both of
>   them. This patch fixes seccomp_notify_recv, where we forgot to call
>   wake_up_poll in the error path.

[snip]

> @@ -1515,7 +1546,8 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
>  			if (should_sleep_killable(filter, knotif))
>  				complete(&knotif->ready);
>  			knotif->state = SECCOMP_NOTIFY_INIT;
> -			up(&filter->notif->request);
> +			atomic_add(1, &filter->notif->requests);
> +			wake_up_poll(&filter->wqh, EPOLLIN | EPOLLRDNORM);
>  		}
>  		mutex_unlock(&filter->notify_lock);
>  	}

I wonder if this shouldn't be a separate patch that you can send now
independent of this series?

Tycho


* Re: [PATCH 4/5] seccomp: add the synchronous mode for seccomp_unotify
  2023-01-10 21:30 ` [PATCH 4/5] seccomp: add the synchronous mode for seccomp_unotify Andrei Vagin
@ 2023-01-12 15:00   ` Tycho Andersen
  2023-01-14  1:16     ` Andrei Vagin
  0 siblings, 1 reply; 22+ messages in thread
From: Tycho Andersen @ 2023-01-12 15:00 UTC (permalink / raw)
  To: Andrei Vagin
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
	linux-kernel, Kees Cook, Christian Brauner, Andrei Vagin,
	Andy Lutomirski, Juri Lelli, Peter Oskolkov, Will Drewry

On Tue, Jan 10, 2023 at 01:30:09PM -0800, Andrei Vagin wrote:
> From: Andrei Vagin <avagin@gmail.com>
> 
> seccomp_unotify allows more privileged processes to perform actions on
> behalf of less privileged processes.
> 
> In many cases, the workflow is fully synchronous: a target process
> triggers a system call and passes control to a supervisor process that
> handles the system call and returns control to the target process. In
> this context, "synchronous" means that only one process is running
> while the other one is waiting.
> 
> There is the WF_CURRENT_CPU flag that is used to advise the scheduler to
> move the wakee to the current CPU. For such synchronous workflows, it
> makes context switches a few times faster.
> 
> Right now, each interaction takes 12µs. With this patch, it takes about
> 3µs.
> 
> This change introduces the SECCOMP_USER_NOTIF_FD_SYNC_WAKE_UP flag that
> is used to enable the sync mode.

What about just not having a flag and using the new primitives all the
time? Is there any reason not to?

Tycho

* Re: [PATCH 2/5] sched: add WF_CURRENT_CPU and externise ttwu
  2023-01-12  7:35   ` Chen Yu
@ 2023-01-13 21:39     ` Andrei Vagin
  2023-01-14 15:59       ` Chen Yu
  0 siblings, 1 reply; 22+ messages in thread
From: Andrei Vagin @ 2023-01-13 21:39 UTC (permalink / raw)
  To: Chen Yu
  Cc: Andrei Vagin, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, linux-kernel, Kees Cook, Christian Brauner,
	Andy Lutomirski, Juri Lelli, Peter Oskolkov, Tycho Andersen,
	Will Drewry

On Wed, Jan 11, 2023 at 11:36 PM Chen Yu <yu.c.chen@intel.com> wrote:
>
> On 2023-01-10 at 13:30:07 -0800, Andrei Vagin wrote:
> > From: Peter Oskolkov <posk@google.com>
> >
> > > Add WF_CURRENT_CPU wake flag that advises the scheduler to
> > move the wakee to the current CPU. This is useful for fast on-CPU
> > context switching use cases.
> >
> > In addition, make ttwu external rather than static so that
> > the flag could be passed to it from outside of sched/core.c.
> >
> > Signed-off-by: Peter Oskolkov <posk@google.com>
> > Signed-off-by: Andrei Vagin <avagin@gmail.com>
> > @@ -7380,6 +7380,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> >       if (wake_flags & WF_TTWU) {
> >               record_wakee(p);
> >
> > +             if ((wake_flags & WF_CURRENT_CPU) &&
> > +                 cpumask_test_cpu(cpu, p->cpus_ptr))
> > +                     return cpu;
> I agree that cross-CPU wakeups bring pain to fast context switching
> use cases, especially on high core count systems. We suffered from this
> issue as well and presented it previously. The difference is that we used
> a dynamic "WF_CURRENT_CPU" mechanism[1] to deal with it. That is, if the
> waker and wakee are both short duration tasks, let the waker wake up the
> wakee on the current CPU. So not only seccomp but also other
> components/workloads could benefit from this without having to set the
> WF_CURRENT_CPU flag.
>
> Link [1]:
> https://lore.kernel.org/lkml/cover.1671158588.git.yu.c.chen@intel.com/

Thanks for the link. I like the idea, but this change has no impact on the
seccomp notify case. I used the benchmark from the fifth patch. It is a
ping-pong benchmark in which one process triggers system calls and another
process handles them. It measures the number of system calls that can be
processed within a specified time slice.

$ cd tools/testing/selftests/seccomp/
$ make
$ ./seccomp_bpf  2>&1| grep user_notification_sync
#  RUN           global.user_notification_sync ...
# seccomp_bpf.c:4281:user_notification_sync:basic: 8489 nsec/syscall
# seccomp_bpf.c:4281:user_notification_sync:sync: 3078 nsec/syscall
#            OK  global.user_notification_sync
ok 51 global.user_notification_sync

The results are the same with and without your change. I expected that
your change would improve the basic case so that it reaches the sync one.

I did some experiments and found that we can achieve the desired
outcome if we move the "short-task" checks prior to considering waking
up on prev_cpu.

For example, with this patch:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2f89e44e237d..af20b58e3972 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6384,6 +6384,11 @@ static int wake_wide(struct task_struct *p)
 static int
 wake_affine_idle(int this_cpu, int prev_cpu, int sync)
 {
+       /* The only running task is a short duration one. */
+       if (cpu_rq(this_cpu)->nr_running == 1 &&
+           is_short_task(cpu_curr(this_cpu)))
+               return this_cpu;
+
        /*
         * If this_cpu is idle, it implies the wakeup is from interrupt
         * context. Only allow the move if cache is shared. Otherwise an
@@ -6405,11 +6410,6 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
        if (available_idle_cpu(prev_cpu))
                return prev_cpu;

-       /* The only running task is a short duration one. */
-       if (cpu_rq(this_cpu)->nr_running == 1 &&
-           is_short_task(cpu_curr(this_cpu)))
-               return this_cpu;
-
        return nr_cpumask_bits;
 }

@@ -6897,6 +6897,10 @@ static int select_idle_sibling(struct
task_struct *p, int prev, int target)
            asym_fits_cpu(task_util, util_min, util_max, target))
                return target;

+       if (!has_idle_core && cpu_rq(target)->nr_running == 1 &&
+           is_short_task(cpu_curr(target)) && is_short_task(p))
+               return target;
+
        /*
         * If the previous CPU is cache affine and idle, don't be stupid:
         */


the basic test case shows almost the same results as the sync one:

$ ./seccomp_bpf  2>&1| grep user_notification_sync
#  RUN           global.user_notification_sync ...
# seccomp_bpf.c:4281:user_notification_sync:basic: 3082 nsec/syscall
# seccomp_bpf.c:4281:user_notification_sync:sync: 2690 nsec/syscall
#            OK  global.user_notification_sync
ok 51 global.user_notification_sync

If you want to do any experiments, you can find my tree here:
[1] https://github.com/avagin/linux-task-diag/tree/wip/seccom-notify-sync-and-shed-short-task

Thanks,
Andrei

* Re: [PATCH 1/5] seccomp: don't use semaphore and wait_queue together
  2023-01-12 14:58   ` Tycho Andersen
@ 2023-01-13 21:51     ` Andrei Vagin
  0 siblings, 0 replies; 22+ messages in thread
From: Andrei Vagin @ 2023-01-13 21:51 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Andrei Vagin, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, linux-kernel, Kees Cook, Christian Brauner,
	Andy Lutomirski, Juri Lelli, Peter Oskolkov, Will Drewry

On Thu, Jan 12, 2023 at 6:58 AM Tycho Andersen <tycho@tycho.pizza> wrote:
>
> On Tue, Jan 10, 2023 at 01:30:06PM -0800, Andrei Vagin wrote:
> > From: Andrei Vagin <avagin@gmail.com>
> >
> > The main reason is to use new wake_up helpers that will be added in the
> > following patches. But here are a few other reasons:
> >
> > * if we use two different ways, we always need to call them both. This
> >   patch fixes seccomp_notify_recv where we forgot to call wake_up_poll
> >   in the error path.
>
> [snip]
>
> > @@ -1515,7 +1546,8 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
> >                       if (should_sleep_killable(filter, knotif))
> >                               complete(&knotif->ready);
> >                       knotif->state = SECCOMP_NOTIFY_INIT;
> > -                     up(&filter->notif->request);
> > +                     atomic_add(1, &filter->notif->requests);
> > +                     wake_up_poll(&filter->wqh, EPOLLIN | EPOLLRDNORM);
> >               }
> >               mutex_unlock(&filter->notify_lock);
> >       }
>
> I wonder if this shouldn't be a separate patch that you can send now
> independent of this series?

You are right. It is a bug fix and I can send it in a separate patch.
I didn't expect it would take so long to merge the whole set.

Thanks,
Andrei

* Re: [PATCH 4/5] seccomp: add the synchronous mode for seccomp_unotify
  2023-01-12 15:00   ` Tycho Andersen
@ 2023-01-14  1:16     ` Andrei Vagin
  0 siblings, 0 replies; 22+ messages in thread
From: Andrei Vagin @ 2023-01-14  1:16 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Andrei Vagin, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, linux-kernel, Kees Cook, Christian Brauner,
	Andy Lutomirski, Juri Lelli, Peter Oskolkov, Will Drewry

On Thu, Jan 12, 2023 at 7:00 AM Tycho Andersen <tycho@tycho.pizza> wrote:
>
> On Tue, Jan 10, 2023 at 01:30:09PM -0800, Andrei Vagin wrote:
> > From: Andrei Vagin <avagin@gmail.com>
> >
> > seccomp_unotify allows more privileged processes to perform actions on
> > behalf of less privileged processes.
> >
> > In many cases, the workflow is fully synchronous. It means a target
> > process triggers a system call and passes controls to a supervisor
> > process that handles the system call and returns controls to the target
> > process. In this context, "synchronous" means that only one process is
> > running and another one is waiting.
> >
> > There is the WF_CURRENT_CPU flag that is used to advise the scheduler to
> > move the wakee to the current CPU. For such synchronous workflows, it
> > makes context switches a few times faster.
> >
> > Right now, each interaction takes 12µs. With this patch, it takes about
> > 3µs.
> >
> > This change introduces the SECCOMP_USER_NOTIF_FD_SYNC_WAKE_UP flag that
> > is used to enable the sync mode.
>
> What about just not having a flag and using the new primitives all the
> time? Is there any reason not to?

I was thinking about that, but I decided that it can have a negative
impact in cases when workflows are not synchronous. This can happen when
one process wakes up another one and continues running on the CPU. With
the flag, both processes are scheduled on the same CPU. Without the flag,
they can be scheduled on different CPUs and run concurrently.

In seccomp unotify, switches from the tracee to the supervisor are always
synchronous. Switches in the opposite direction can be of either type.

I think it is better to let users decide what type is more suitable for
their workloads.
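For context, the sync mode in this series is opt-in per listener fd. A minimal sketch of how a supervisor would enable it is below; the ioctl and flag names come from the patch set, the `#ifdef` guard covers headers that don't carry them, and the helper name `enable_sync_mode()` is invented for illustration:

```c
#include <assert.h>
#include <errno.h>
#include <sys/ioctl.h>
#include <linux/seccomp.h>

/* Ask the kernel to keep the tracee/supervisor pair on the current CPU.
 * SECCOMP_IOCTL_NOTIF_SET_FLAGS and SECCOMP_USER_NOTIF_FD_SYNC_WAKE_UP
 * exist only with this series applied; on older headers we fall through
 * to the stub branch. Returns 0 on success, -1 with errno set otherwise. */
static int enable_sync_mode(int notify_fd)
{
#ifdef SECCOMP_IOCTL_NOTIF_SET_FLAGS
	return ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_SET_FLAGS,
		     SECCOMP_USER_NOTIF_FD_SYNC_WAKE_UP, 0);
#else
	errno = ENOTSUP;
	return -1;
#endif
}
```

A supervisor that knows its reply path is asynchronous would simply not make this call, which is exactly the choice being left to users here.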

Thanks,
Andrei

* Re: [PATCH 2/5] sched: add WF_CURRENT_CPU and externise ttwu
  2023-01-13 21:39     ` Andrei Vagin
@ 2023-01-14 15:59       ` Chen Yu
  2023-01-18  6:10         ` Andrei Vagin
  0 siblings, 1 reply; 22+ messages in thread
From: Chen Yu @ 2023-01-14 15:59 UTC (permalink / raw)
  To: Andrei Vagin
  Cc: Andrei Vagin, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, linux-kernel, Kees Cook, Christian Brauner,
	Andy Lutomirski, Juri Lelli, Peter Oskolkov, Tycho Andersen,
	Will Drewry

On 2023-01-13 at 13:39:46 -0800, Andrei Vagin wrote:
> On Wed, Jan 11, 2023 at 11:36 PM Chen Yu <yu.c.chen@intel.com> wrote:
> >
> > On 2023-01-10 at 13:30:07 -0800, Andrei Vagin wrote:
> > > From: Peter Oskolkov <posk@google.com>
> > >
> > > > Add WF_CURRENT_CPU wake flag that advises the scheduler to
> > > move the wakee to the current CPU. This is useful for fast on-CPU
> > > context switching use cases.
> > >
> > > In addition, make ttwu external rather than static so that
> > > the flag could be passed to it from outside of sched/core.c.
> > >
> > > Signed-off-by: Peter Oskolkov <posk@google.com>
> > > Signed-off-by: Andrei Vagin <avagin@gmail.com>
> > > @@ -7380,6 +7380,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> > >       if (wake_flags & WF_TTWU) {
> > >               record_wakee(p);
> > >
> > > +             if ((wake_flags & WF_CURRENT_CPU) &&
> > > +                 cpumask_test_cpu(cpu, p->cpus_ptr))
> > > +                     return cpu;
> > I agree that cross-CPU wakeups bring pain to fast context switching
> > use cases, especially on high core count systems. We suffered from this
> > issue as well and presented it previously. The difference is that we
> > used a dynamic "WF_CURRENT_CPU" mechanism[1] to deal with it. That is,
> > if the waker and wakee are both short duration tasks, let the waker
> > wake up the wakee on the current CPU. So not only seccomp but also
> > other components/workloads could benefit from this without having to
> > set the WF_CURRENT_CPU flag.
> >
> > Link [1]:
> > https://lore.kernel.org/lkml/cover.1671158588.git.yu.c.chen@intel.com/
> 
> Thanks for the link. I like the idea, but this change has no impact on the
> seccomp notify case. I used the benchmark from the fifth patch. It is a
> ping-pong benchmark in which one process triggers system calls and another
> process handles them. It measures the number of system calls that can be
> processed within a specified time slice.
>
Thanks for this information.
> $ cd tools/testing/selftests/seccomp/
> $ make
> $ ./seccomp_bpf  2>&1| grep user_notification_sync
> #  RUN           global.user_notification_sync ...
> # seccomp_bpf.c:4281:user_notification_sync:basic: 8489 nsec/syscall
> # seccomp_bpf.c:4281:user_notification_sync:sync: 3078 nsec/syscall
> #            OK  global.user_notification_sync
> ok 51 global.user_notification_sync
> 
> The results are the same with and without your change. I expected that
> your change would improve the basic case so that it reaches the sync one.
> 
The reason why the patch did not bring a benefit might be that it aims to
wake the task on the current CPU only when there are no idle cores. The
seccomp notify test launches 2 processes, which makes it hard to saturate
the system, if I understand correctly? I built a kernel based on your repo
and launched the above test; the average load is quite low on my system.
Is this expected? Besides, is 100 ms long enough to test
(USER_NOTIF_BENCH_TIMEOUT)?
> I did some experiments and found that we can achieve the desirable
> outcome if we move the "short-task" checks prior to considering waking
> up on prev_cpu.
> 
> For example, with this patch:
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2f89e44e237d..af20b58e3972 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6384,6 +6384,11 @@ static int wake_wide(struct task_struct *p)
>  static int
>  wake_affine_idle(int this_cpu, int prev_cpu, int sync)
>  {
> +       /* The only running task is a short duration one. */
> +       if (cpu_rq(this_cpu)->nr_running == 1 &&
> +           is_short_task(cpu_curr(this_cpu)))
> +               return this_cpu;
> +
In v1 we put the above code before checking prev_cpu, but it was reported
to bring a regression on some systems[2]. The results suggested that we
should try to pick an idle CPU first, no matter whether it is the current
CPU or the previous CPU; if that fails, we can lower the bar and pick the
current CPU.

[2] https://lore.kernel.org/lkml/6c626e8d-4133-00ba-a765-bafe08034517@amd.com/
>         /*
>          * If this_cpu is idle, it implies the wakeup is from interrupt
>          * context. Only allow the move if cache is shared. Otherwise an
> @@ -6405,11 +6410,6 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
>         if (available_idle_cpu(prev_cpu))
>                 return prev_cpu;
> 
> -       /* The only running task is a short duration one. */
> -       if (cpu_rq(this_cpu)->nr_running == 1 &&
> -           is_short_task(cpu_curr(this_cpu)))
> -               return this_cpu;
> -
>         return nr_cpumask_bits;
>  }
> 
> @@ -6897,6 +6897,10 @@ static int select_idle_sibling(struct
> task_struct *p, int prev, int target)
>             asym_fits_cpu(task_util, util_min, util_max, target))
>                 return target;
> 
> +       if (!has_idle_core && cpu_rq(target)->nr_running == 1 &&
> +           is_short_task(cpu_curr(target)) && is_short_task(p))
> +               return target;
> +
>         /*
>          * If the previous CPU is cache affine and idle, don't be stupid:
>          */
> 
> 
> the basic test case shows almost the same results as the sync one:
> 
> $ ./seccomp_bpf  2>&1| grep user_notification_sync
> #  RUN           global.user_notification_sync ...
> # seccomp_bpf.c:4281:user_notification_sync:basic: 3082 nsec/syscall
> # seccomp_bpf.c:4281:user_notification_sync:sync: 2690 nsec/syscall
> #            OK  global.user_notification_sync
> ok 51 global.user_notification_sync
> 
> If you want to do any experiments, you can find my tree here:
> [1] https://github.com/avagin/linux-task-diag/tree/wip/seccom-notify-sync-and-shed-short-task
> 
I'm happy to further test this. One thing I'm curious about is, where does
the benefit of waking up the seccomp task on the current CPU come from? Is
it due to hot L1 cache?

thanks,
Chenyu
> Thanks,
> Andrei

* Re: [PATCH 1/5] seccomp: don't use semaphore and wait_queue together
  2023-01-10 21:30 ` [PATCH 1/5] seccomp: don't use semaphore and wait_queue together Andrei Vagin
  2023-01-12 14:58   ` Tycho Andersen
@ 2023-01-16  9:48   ` Peter Zijlstra
  1 sibling, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2023-01-16  9:48 UTC (permalink / raw)
  To: Andrei Vagin
  Cc: Ingo Molnar, Vincent Guittot, Dietmar Eggemann, linux-kernel,
	Kees Cook, Christian Brauner, Andrei Vagin, Andy Lutomirski,
	Juri Lelli, Peter Oskolkov, Tycho Andersen, Will Drewry

On Tue, Jan 10, 2023 at 01:30:06PM -0800, Andrei Vagin wrote:

> +	atomic_add(1, &match->notif->requests);

atomic_inc() ?


* Re: [PATCH 3/5] sched: add a few helpers to wake up tasks on the current cpu
  2023-01-10 21:30 ` [PATCH 3/5] sched: add a few helpers to wake up tasks on the current cpu Andrei Vagin
@ 2023-01-16  9:59   ` Peter Zijlstra
  2023-01-19  6:45     ` Andrei Vagin
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2023-01-16  9:59 UTC (permalink / raw)
  To: Andrei Vagin
  Cc: Ingo Molnar, Vincent Guittot, Dietmar Eggemann, linux-kernel,
	Kees Cook, Christian Brauner, Andrei Vagin, Andy Lutomirski,
	Juri Lelli, Peter Oskolkov, Tycho Andersen, Will Drewry

On Tue, Jan 10, 2023 at 01:30:08PM -0800, Andrei Vagin wrote:

> diff --git a/kernel/sched/completion.c b/kernel/sched/completion.c
> index d57a5c1c1cd9..a1931a79c05a 100644
> --- a/kernel/sched/completion.c
> +++ b/kernel/sched/completion.c
> @@ -38,6 +38,18 @@ void complete(struct completion *x)
>  }
>  EXPORT_SYMBOL(complete);
>  
> +void complete_on_current_cpu(struct completion *x)
> +{
> +	unsigned long flags;
> +
> +	raw_spin_lock_irqsave(&x->wait.lock, flags);
> +
> +	if (x->done != UINT_MAX)
> +		x->done++;
> +	swake_up_locked_on_current_cpu(&x->wait);
> +	raw_spin_unlock_irqrestore(&x->wait.lock, flags);
> +}
> +
>  /**
>   * complete_all: - signals all threads waiting on this completion
>   * @x:  holds the state of this particular completion
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 6478e819eb99..c81866821139 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6874,7 +6874,7 @@ asmlinkage __visible void __sched preempt_schedule_irq(void)
>  int default_wake_function(wait_queue_entry_t *curr, unsigned mode, int wake_flags,
>  			  void *key)
>  {
> -	WARN_ON_ONCE(IS_ENABLED(CONFIG_SCHED_DEBUG) && wake_flags & ~WF_SYNC);
> +	WARN_ON_ONCE(IS_ENABLED(CONFIG_SCHED_DEBUG) && wake_flags & ~(WF_SYNC|WF_CURRENT_CPU));
>  	return try_to_wake_up(curr->private, mode, wake_flags);
>  }
>  EXPORT_SYMBOL(default_wake_function);
> diff --git a/kernel/sched/swait.c b/kernel/sched/swait.c
> index 76b9b796e695..9ebe23868942 100644
> --- a/kernel/sched/swait.c
> +++ b/kernel/sched/swait.c
> @@ -31,6 +31,17 @@ void swake_up_locked(struct swait_queue_head *q)
>  }
>  EXPORT_SYMBOL(swake_up_locked);
>  
> +void swake_up_locked_on_current_cpu(struct swait_queue_head *q)
> +{
> +	struct swait_queue *curr;
> +
> +	if (list_empty(&q->task_list))
> +		return;
> +
> +	curr = list_first_entry(&q->task_list, typeof(*curr), task_list);
> +	try_to_wake_up(curr->task, TASK_NORMAL, WF_CURRENT_CPU);
> +	list_del_init(&curr->task_list);
> +}
>  /*
>   * Wake up all waiters. This is an interface which is solely exposed for
>   * completions and not for general usage.
> diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
> index 133b74730738..47803a0b8d5d 100644
> --- a/kernel/sched/wait.c
> +++ b/kernel/sched/wait.c
> @@ -161,6 +161,11 @@ int __wake_up(struct wait_queue_head *wq_head, unsigned int mode,
>  }
>  EXPORT_SYMBOL(__wake_up);
>  
> +void __wake_up_on_current_cpu(struct wait_queue_head *wq_head, unsigned int mode, void *key)
> +{
> +	__wake_up_common_lock(wq_head, mode, 1, WF_CURRENT_CPU, key);
> +}
> +
>  /*
>   * Same as __wake_up but called with the spinlock in wait_queue_head_t held.
>   */

So I hate this patch, it adds a whole ton of duplication and litter for
no real reason afaict. For instance you could've just added an
wake_flags argument to swake_up_locked(), there's not *that* many users
-- unlike complete().

And for complete() instead of fully duplicating the function (always a
bad idea) you could've made complete_flags() and implemented complete()
using that, or something.

Anyway, let me go (finally) have a look at Chen's patch, since that
might render all this moot.

* Re: [PATCH 2/5] sched: add WF_CURRENT_CPU and externise ttwu
  2023-01-14 15:59       ` Chen Yu
@ 2023-01-18  6:10         ` Andrei Vagin
  0 siblings, 0 replies; 22+ messages in thread
From: Andrei Vagin @ 2023-01-18  6:10 UTC (permalink / raw)
  To: Chen Yu
  Cc: Andrei Vagin, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, linux-kernel, Kees Cook, Christian Brauner,
	Andy Lutomirski, Juri Lelli, Peter Oskolkov, Tycho Andersen,
	Will Drewry

On Sat, Jan 14, 2023 at 8:00 AM Chen Yu <yu.c.chen@intel.com> wrote:
>
> On 2023-01-13 at 13:39:46 -0800, Andrei Vagin wrote:
> > On Wed, Jan 11, 2023 at 11:36 PM Chen Yu <yu.c.chen@intel.com> wrote:
> > >
> > > On 2023-01-10 at 13:30:07 -0800, Andrei Vagin wrote:
> > > > From: Peter Oskolkov <posk@google.com>
> > > >
> > > > > Add WF_CURRENT_CPU wake flag that advises the scheduler to
> > > > move the wakee to the current CPU. This is useful for fast on-CPU
> > > > context switching use cases.
> > > >
> > > > In addition, make ttwu external rather than static so that
> > > > the flag could be passed to it from outside of sched/core.c.
> > > >
> > > > Signed-off-by: Peter Oskolkov <posk@google.com>
> > > > Signed-off-by: Andrei Vagin <avagin@gmail.com>
> > > > @@ -7380,6 +7380,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> > > >       if (wake_flags & WF_TTWU) {
> > > >               record_wakee(p);
> > > >
> > > > +             if ((wake_flags & WF_CURRENT_CPU) &&
> > > > +                 cpumask_test_cpu(cpu, p->cpus_ptr))
> > > > +                     return cpu;
> > > I agree that cross-CPU wakeups bring pain to fast context switching
> > > use cases, especially on high core count systems. We suffered from
> > > this issue as well and presented it previously. The difference is
> > > that we used a dynamic "WF_CURRENT_CPU" mechanism[1] to deal with it.
> > > That is, if the waker and wakee are both short duration tasks, let
> > > the waker wake up the wakee on the current CPU. So not only seccomp
> > > but also other components/workloads could benefit from this without
> > > having to set the WF_CURRENT_CPU flag.
> > >
> > > Link [1]:
> > > https://lore.kernel.org/lkml/cover.1671158588.git.yu.c.chen@intel.com/
> >
> > Thanks for the link. I like the idea, but this change has no impact on
> > the seccomp notify case. I used the benchmark from the fifth patch. It
> > is a ping-pong benchmark in which one process triggers system calls and
> > another process handles them. It measures the number of system calls
> > that can be processed within a specified time slice.
> >
> Thanks for this information.
> > $ cd tools/testing/selftests/seccomp/
> > $ make
> > $ ./seccomp_bpf  2>&1| grep user_notification_sync
> > #  RUN           global.user_notification_sync ...
> > # seccomp_bpf.c:4281:user_notification_sync:basic: 8489 nsec/syscall
> > # seccomp_bpf.c:4281:user_notification_sync:sync: 3078 nsec/syscall
> > #            OK  global.user_notification_sync
> > ok 51 global.user_notification_sync
> >
> > The results are the same with and without your change. I expected that
> > your change would improve the basic case so that it reaches the sync
> > one.
> >
> The reason why the patch did not bring a benefit might be that it aims to
> wake the task on the current CPU only when there are no idle cores. The
> seccomp notify test launches 2 processes, which makes it hard to saturate
> the system, if I understand correctly?

Yes, you understand it right. Our workloads do not always scale to all
CPUs, and it is critical to work with maximum performance even when there
are some idle cores. I feel I need to say a few words about our use case.
I am working on gVisor, a sandbox solution. In a few words, we intercept
guest system calls and handle them in our user-mode kernel. All this means
that we are limited by the user application that is running inside the
sandbox and how well it scales on multiple CPUs. The current benchmark
emulates a single-threaded process running in a sandbox.

> I built a kernel based on your repo and launched the above test; the
> average load is quite low on my system. Is this expected? Besides, is
> 100 ms long enough to test (USER_NOTIF_BENCH_TIMEOUT)?

The accuracy within the 20% range is good enough for this test. But if we
want to have a real benchmark, we need to implement it in a separate
binary, or we can add it to the perf benchmark suite.

> > I did some experiments and found that we can achieve the desirable
> > outcome if we move the "short-task" checks prior to considering waking
> > up on prev_cpu.
> >
> > For example, with this patch:
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 2f89e44e237d..af20b58e3972 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6384,6 +6384,11 @@ static int wake_wide(struct task_struct *p)
> >  static int
> >  wake_affine_idle(int this_cpu, int prev_cpu, int sync)
> >  {
> > +       /* The only running task is a short duration one. */
> > +       if (cpu_rq(this_cpu)->nr_running == 1 &&
> > +           is_short_task(cpu_curr(this_cpu)))
> > +               return this_cpu;
> > +
> In v1 we put the above code before checking prev_cpu, but it was reported
> to bring a regression on some systems[2]. The results suggested that we
> should try to pick an idle CPU first, no matter whether it is the current
> CPU or the previous CPU; if that fails, we can lower the bar and pick the
> current CPU.

Maybe the criterion for a short task should be lower for idle CPUs. It
should be close to the cost of waking up an idle CPU.

>
> [2] https://lore.kernel.org/lkml/6c626e8d-4133-00ba-a765-bafe08034517@amd.com/
> >         /*
> >          * If this_cpu is idle, it implies the wakeup is from interrupt
> >          * context. Only allow the move if cache is shared. Otherwise an
> > @@ -6405,11 +6410,6 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
> >         if (available_idle_cpu(prev_cpu))
> >                 return prev_cpu;
> >
> > -       /* The only running task is a short duration one. */
> > -       if (cpu_rq(this_cpu)->nr_running == 1 &&
> > -           is_short_task(cpu_curr(this_cpu)))
> > -               return this_cpu;
> > -
> >         return nr_cpumask_bits;
> >  }
> >
> > @@ -6897,6 +6897,10 @@ static int select_idle_sibling(struct
> > task_struct *p, int prev, int target)
> >             asym_fits_cpu(task_util, util_min, util_max, target))
> >                 return target;
> >
> > +       if (!has_idle_core && cpu_rq(target)->nr_running == 1 &&
> > +           is_short_task(cpu_curr(target)) && is_short_task(p))
> > +               return target;
> > +
> >         /*
> >          * If the previous CPU is cache affine and idle, don't be stupid:
> >          */
> >
> >
> > the basic test case shows almost the same results as the sync one:
> >
> > $ ./seccomp_bpf  2>&1| grep user_notification_sync
> > #  RUN           global.user_notification_sync ...
> > # seccomp_bpf.c:4281:user_notification_sync:basic: 3082 nsec/syscall
> > # seccomp_bpf.c:4281:user_notification_sync:sync: 2690 nsec/syscall
> > #            OK  global.user_notification_sync
> > ok 51 global.user_notification_sync
> >
> > If you want to do any experiments, you can find my tree here:
> > [1] https://github.com/avagin/linux-task-diag/tree/wip/seccom-notify-sync-and-shed-short-task
> >
I'm happy to further test this. One thing I'm curious about is, where does
the benefit of waking up the seccomp task on the current CPU come from? Is
it due to hot L1 cache?

I don't think it is due to CPU caches. It should be due to the overhead of
queuing a task to a non-current runqueue, sending an IPI, and waking up
from the idle loop.

Thanks,
Andrei

* Re: [PATCH 3/5] sched: add a few helpers to wake up tasks on the current cpu
  2023-01-16  9:59   ` Peter Zijlstra
@ 2023-01-19  6:45     ` Andrei Vagin
  2023-01-19 10:09       ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Andrei Vagin @ 2023-01-19  6:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrei Vagin, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
	linux-kernel, Kees Cook, Christian Brauner, Andy Lutomirski,
	Juri Lelli, Peter Oskolkov, Tycho Andersen, Will Drewry

On Mon, Jan 16, 2023 at 1:59 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Jan 10, 2023 at 01:30:08PM -0800, Andrei Vagin wrote:
>
> > diff --git a/kernel/sched/completion.c b/kernel/sched/completion.c
> > index d57a5c1c1cd9..a1931a79c05a 100644
> > --- a/kernel/sched/completion.c
> > +++ b/kernel/sched/completion.c
> > @@ -38,6 +38,18 @@ void complete(struct completion *x)
> >  }
> >  EXPORT_SYMBOL(complete);
> >
> > +void complete_on_current_cpu(struct completion *x)
> > +{
> > +     unsigned long flags;
> > +
> > +     raw_spin_lock_irqsave(&x->wait.lock, flags);
> > +
> > +     if (x->done != UINT_MAX)
> > +             x->done++;
> > +     swake_up_locked_on_current_cpu(&x->wait);
> > +     raw_spin_unlock_irqrestore(&x->wait.lock, flags);
> > +}
> > +
> >  /**
> >   * complete_all: - signals all threads waiting on this completion
> >   * @x:  holds the state of this particular completion
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 6478e819eb99..c81866821139 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -6874,7 +6874,7 @@ asmlinkage __visible void __sched preempt_schedule_irq(void)
> >  int default_wake_function(wait_queue_entry_t *curr, unsigned mode, int wake_flags,
> >                         void *key)
> >  {
> > -     WARN_ON_ONCE(IS_ENABLED(CONFIG_SCHED_DEBUG) && wake_flags & ~WF_SYNC);
> > +     WARN_ON_ONCE(IS_ENABLED(CONFIG_SCHED_DEBUG) && wake_flags & ~(WF_SYNC|WF_CURRENT_CPU));
> >       return try_to_wake_up(curr->private, mode, wake_flags);
> >  }
> >  EXPORT_SYMBOL(default_wake_function);
> > diff --git a/kernel/sched/swait.c b/kernel/sched/swait.c
> > index 76b9b796e695..9ebe23868942 100644
> > --- a/kernel/sched/swait.c
> > +++ b/kernel/sched/swait.c
> > @@ -31,6 +31,17 @@ void swake_up_locked(struct swait_queue_head *q)
> >  }
> >  EXPORT_SYMBOL(swake_up_locked);
> >
> > +void swake_up_locked_on_current_cpu(struct swait_queue_head *q)
> > +{
> > +     struct swait_queue *curr;
> > +
> > +     if (list_empty(&q->task_list))
> > +             return;
> > +
> > +     curr = list_first_entry(&q->task_list, typeof(*curr), task_list);
> > +     try_to_wake_up(curr->task, TASK_NORMAL, WF_CURRENT_CPU);
> > +     list_del_init(&curr->task_list);
> > +}
> >  /*
> >   * Wake up all waiters. This is an interface which is solely exposed for
> >   * completions and not for general usage.
> > diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
> > index 133b74730738..47803a0b8d5d 100644
> > --- a/kernel/sched/wait.c
> > +++ b/kernel/sched/wait.c
> > @@ -161,6 +161,11 @@ int __wake_up(struct wait_queue_head *wq_head, unsigned int mode,
> >  }
> >  EXPORT_SYMBOL(__wake_up);
> >
> > +void __wake_up_on_current_cpu(struct wait_queue_head *wq_head, unsigned int mode, void *key)
> > +{
> > +     __wake_up_common_lock(wq_head, mode, 1, WF_CURRENT_CPU, key);
> > +}
> > +
> >  /*
> >   * Same as __wake_up but called with the spinlock in wait_queue_head_t held.
> >   */
>
> So I hate this patch, it adds a whole ton of duplication and litter for
> no real reason afaict. For instance you could've just added an
> wake_flags argument to swake_up_locked(), there's not *that* many users
> -- unlike complete().
>
> And for complete() instead of fully duplicating the function (always a
> bad idea) you could've made complete_flags() and implemented complete()
> using that, or something.
>
> Anyway, let me go (finally) have a look at Chen's patch, since that
> might render all this moot.

If that is the only concern, it is easy to fix. I took this approach
because swake_up_locked() is declared in include/linux/swait.h, while the
wakeup flags live in kernel/sched/sched.h. I agree that it is better to
avoid this code duplication.

Regarding Chen's proposed patch, unfortunately, it does not solve our
problem. It works only under narrow conditions. One of the major problems
is that it does nothing if there are idle cores. The advantage of my
changes is that they work predictably, and they provide benefits
regardless of other workloads running on the same host.

Thanks,
Andrei

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/5] sched: add a few helpers to wake up tasks on the current cpu
  2023-01-19  6:45     ` Andrei Vagin
@ 2023-01-19 10:09       ` Peter Zijlstra
  0 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2023-01-19 10:09 UTC (permalink / raw)
  To: Andrei Vagin
  Cc: Andrei Vagin, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
	linux-kernel, Kees Cook, Christian Brauner, Andy Lutomirski,
	Juri Lelli, Peter Oskolkov, Tycho Andersen, Will Drewry

On Wed, Jan 18, 2023 at 10:45:33PM -0800, Andrei Vagin wrote:
> On Mon, Jan 16, 2023 at 1:59 AM Peter Zijlstra <peterz@infradead.org> wrote:

> > So I hate this patch, it adds a whole ton of duplication and litter for
> > no real reason afaict. For instance you could've just added an
> > wake_flags argument to swake_up_locked(), there's not *that* many users
> > -- unlike complete().
> >
> > And for complete() instead of fully duplicating the function (always a
> > bad idea) you could've made complete_flags() and implemented complete()
> > using that, or something.
> >
> > Anyway, let me go (finally) have a look at Chen's patch, since that
> > might render all this moot.
> 
> If that is the only concern, it is easy to fix. I took this approach
> because swake_up_locked() is declared in include/linux/swait.h, while
> the wakeup flags live in kernel/sched/sched.h. I agree that it is
> better to avoid this code duplication.

Thanks.

> Regarding Chen's proposed patch, unfortunately, it does not solve our
> problem. It works only under narrow conditions. One of the major problems
> is that it does nothing if there are idle cores. The advantage of my
> changes is that they work predictably, and they provide benefits
> regardless of other workloads running on the same host.

Oh well.. carry on then. Can't always have two birds thing I suppose :-)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 2/5] sched: add WF_CURRENT_CPU and externise ttwu
  2022-11-11  7:31 [PATCH 0/5 v3] seccomp: add the synchronous mode for seccomp_unotify Andrei Vagin
@ 2022-11-11  7:31 ` Andrei Vagin
  0 siblings, 0 replies; 22+ messages in thread
From: Andrei Vagin @ 2022-11-11  7:31 UTC (permalink / raw)
  To: Kees Cook, Peter Zijlstra, Christian Brauner
  Cc: linux-kernel, Andrei Vagin, Andy Lutomirski, Dietmar Eggemann,
	Ingo Molnar, Juri Lelli, Peter Oskolkov, Tycho Andersen,
	Will Drewry, Vincent Guittot

From: Peter Oskolkov <posk@google.com>

Add a WF_CURRENT_CPU wake flag that advises the scheduler to
move the wakee to the current CPU. This is useful for fast on-CPU
context-switching use cases such as UMCG.

In addition, make ttwu external rather than static so that
the flag can be passed to it from outside of sched/core.c.

Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
---
 kernel/sched/core.c  |  3 +--
 kernel/sched/fair.c  |  4 ++++
 kernel/sched/sched.h | 13 ++++++++-----
 3 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cb2aa2b54c7a..4b591e7773fd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4039,8 +4039,7 @@ bool ttwu_state_match(struct task_struct *p, unsigned int state, int *success)
  * Return: %true if @p->state changes (an actual wakeup was done),
  *	   %false otherwise.
  */
-static int
-try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
+int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 {
 	unsigned long flags;
 	int cpu, success = 0;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e4a0b8bd941c..4ebe7222664c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7204,6 +7204,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 	if (wake_flags & WF_TTWU) {
 		record_wakee(p);
 
+		if ((wake_flags & WF_CURRENT_CPU) &&
+		    cpumask_test_cpu(cpu, p->cpus_ptr))
+			return cpu;
+
 		if (sched_energy_enabled()) {
 			new_cpu = find_energy_efficient_cpu(p, prev_cpu);
 			if (new_cpu >= 0)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a4a20046e586..4c275f41773c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2077,12 +2077,13 @@ static inline int task_on_rq_migrating(struct task_struct *p)
 }
 
 /* Wake flags. The first three directly map to some SD flag value */
-#define WF_EXEC     0x02 /* Wakeup after exec; maps to SD_BALANCE_EXEC */
-#define WF_FORK     0x04 /* Wakeup after fork; maps to SD_BALANCE_FORK */
-#define WF_TTWU     0x08 /* Wakeup;            maps to SD_BALANCE_WAKE */
+#define WF_EXEC         0x02 /* Wakeup after exec; maps to SD_BALANCE_EXEC */
+#define WF_FORK         0x04 /* Wakeup after fork; maps to SD_BALANCE_FORK */
+#define WF_TTWU         0x08 /* Wakeup;            maps to SD_BALANCE_WAKE */
 
-#define WF_SYNC     0x10 /* Waker goes to sleep after wakeup */
-#define WF_MIGRATED 0x20 /* Internal use, task got migrated */
+#define WF_SYNC         0x10 /* Waker goes to sleep after wakeup */
+#define WF_MIGRATED     0x20 /* Internal use, task got migrated */
+#define WF_CURRENT_CPU  0x40 /* Prefer to move the wakee to the current CPU. */
 
 #ifdef CONFIG_SMP
 static_assert(WF_EXEC == SD_BALANCE_EXEC);
@@ -3167,6 +3168,8 @@ static inline bool is_per_cpu_kthread(struct task_struct *p)
 extern void swake_up_all_locked(struct swait_queue_head *q);
 extern void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait);
 
+extern int try_to_wake_up(struct task_struct *tsk, unsigned int state, int wake_flags);
+
 #ifdef CONFIG_PREEMPT_DYNAMIC
 extern int preempt_dynamic_mode;
 extern int sched_dynamic_mode(const char *str);
-- 
2.38.1.493.g58b659f92b-goog


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/5] sched: add WF_CURRENT_CPU and externise ttwu
  2022-10-20  4:54   ` Kees Cook
@ 2022-10-21  0:48     ` Andrei Vagin
  0 siblings, 0 replies; 22+ messages in thread
From: Andrei Vagin @ 2022-10-21  0:48 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel, Andy Lutomirski, Christian Brauner,
	Dietmar Eggemann, Kees Cook, Ingo Molnar, Juri Lelli,
	Peter Oskolkov, Peter Zijlstra, Tycho Andersen, Will Drewry,
	Vincent Guittot

On Wed, Oct 19, 2022 at 09:54:15PM -0700, Kees Cook wrote:
> On October 19, 2022 6:10:45 PM PDT, Andrei Vagin <avagin@gmail.com> wrote:
> >From: Peter Oskolkov <posk@google.com>
> >
> >Add a WF_CURRENT_CPU wake flag that advises the scheduler to
> >move the wakee to the current CPU. This is useful for fast on-CPU
> >context-switching use cases such as UMCG.
> 
> UMCG is https://lwn.net/Articles/879398/ ?
> 

Yes, this is it. https://lkml.org/lkml/2021/11/4/830 is the most recent
version that I've seen.

> >In addition, make ttwu external rather than static so that
> >the flag can be passed to it from outside of sched/core.c.
> >
> >Signed-off-by: Peter Oskolkov <posk@google.com>
> >Signed-off-by: Andrei Vagin <avagin@gmail.com>
> >---
> > kernel/sched/core.c  |  3 +--
> > kernel/sched/fair.c  |  4 ++++
> > kernel/sched/sched.h | 13 ++++++++-----
> > 3 files changed, 13 insertions(+), 7 deletions(-)
> 
> This would need an Ack from the sched maintainers...

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/5] sched: add WF_CURRENT_CPU and externise ttwu
  2022-10-20  1:10 ` [PATCH 2/5] sched: add WF_CURRENT_CPU and externise ttwu Andrei Vagin
@ 2022-10-20  4:54   ` Kees Cook
  2022-10-21  0:48     ` Andrei Vagin
  0 siblings, 1 reply; 22+ messages in thread
From: Kees Cook @ 2022-10-20  4:54 UTC (permalink / raw)
  To: Andrei Vagin, linux-kernel
  Cc: Andy Lutomirski, Christian Brauner, Dietmar Eggemann, Kees Cook,
	Ingo Molnar, Juri Lelli, Peter Oskolkov, Peter Zijlstra,
	Tycho Andersen, Will Drewry, Vincent Guittot

On October 19, 2022 6:10:45 PM PDT, Andrei Vagin <avagin@gmail.com> wrote:
>From: Peter Oskolkov <posk@google.com>
>
>Add a WF_CURRENT_CPU wake flag that advises the scheduler to
>move the wakee to the current CPU. This is useful for fast on-CPU
>context-switching use cases such as UMCG.

UMCG is https://lwn.net/Articles/879398/ ?

>In addition, make ttwu external rather than static so that
>the flag can be passed to it from outside of sched/core.c.
>
>Signed-off-by: Peter Oskolkov <posk@google.com>
>Signed-off-by: Andrei Vagin <avagin@gmail.com>
>---
> kernel/sched/core.c  |  3 +--
> kernel/sched/fair.c  |  4 ++++
> kernel/sched/sched.h | 13 ++++++++-----
> 3 files changed, 13 insertions(+), 7 deletions(-)

This would need an Ack from the sched maintainers...


-- 
Kees Cook

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 2/5] sched: add WF_CURRENT_CPU and externise ttwu
  2022-10-20  1:10 [PATCH 0/5 v2] seccomp: add the synchronous mode for seccomp_unotify Andrei Vagin
@ 2022-10-20  1:10 ` Andrei Vagin
  2022-10-20  4:54   ` Kees Cook
  0 siblings, 1 reply; 22+ messages in thread
From: Andrei Vagin @ 2022-10-20  1:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrei Vagin, Andy Lutomirski, Christian Brauner,
	Dietmar Eggemann, Kees Cook, Ingo Molnar, Juri Lelli,
	Peter Oskolkov, Peter Zijlstra, Tycho Andersen, Will Drewry,
	Vincent Guittot

From: Peter Oskolkov <posk@google.com>

Add a WF_CURRENT_CPU wake flag that advises the scheduler to
move the wakee to the current CPU. This is useful for fast on-CPU
context-switching use cases such as UMCG.

In addition, make ttwu external rather than static so that
the flag can be passed to it from outside of sched/core.c.

Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
---
 kernel/sched/core.c  |  3 +--
 kernel/sched/fair.c  |  4 ++++
 kernel/sched/sched.h | 13 ++++++++-----
 3 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5800b0623ff3..cffa8f314c9a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4039,8 +4039,7 @@ bool ttwu_state_match(struct task_struct *p, unsigned int state, int *success)
  * Return: %true if @p->state changes (an actual wakeup was done),
  *	   %false otherwise.
  */
-static int
-try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
+int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 {
 	unsigned long flags;
 	int cpu, success = 0;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e4a0b8bd941c..4ebe7222664c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7204,6 +7204,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 	if (wake_flags & WF_TTWU) {
 		record_wakee(p);
 
+		if ((wake_flags & WF_CURRENT_CPU) &&
+		    cpumask_test_cpu(cpu, p->cpus_ptr))
+			return cpu;
+
 		if (sched_energy_enabled()) {
 			new_cpu = find_energy_efficient_cpu(p, prev_cpu);
 			if (new_cpu >= 0)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1644242ecd11..ee24141c4942 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2071,12 +2071,13 @@ static inline int task_on_rq_migrating(struct task_struct *p)
 }
 
 /* Wake flags. The first three directly map to some SD flag value */
-#define WF_EXEC     0x02 /* Wakeup after exec; maps to SD_BALANCE_EXEC */
-#define WF_FORK     0x04 /* Wakeup after fork; maps to SD_BALANCE_FORK */
-#define WF_TTWU     0x08 /* Wakeup;            maps to SD_BALANCE_WAKE */
+#define WF_EXEC         0x02 /* Wakeup after exec; maps to SD_BALANCE_EXEC */
+#define WF_FORK         0x04 /* Wakeup after fork; maps to SD_BALANCE_FORK */
+#define WF_TTWU         0x08 /* Wakeup;            maps to SD_BALANCE_WAKE */
 
-#define WF_SYNC     0x10 /* Waker goes to sleep after wakeup */
-#define WF_MIGRATED 0x20 /* Internal use, task got migrated */
+#define WF_SYNC         0x10 /* Waker goes to sleep after wakeup */
+#define WF_MIGRATED     0x20 /* Internal use, task got migrated */
+#define WF_CURRENT_CPU  0x40 /* Prefer to move the wakee to the current CPU. */
 
 #ifdef CONFIG_SMP
 static_assert(WF_EXEC == SD_BALANCE_EXEC);
@@ -3161,6 +3162,8 @@ static inline bool is_per_cpu_kthread(struct task_struct *p)
 extern void swake_up_all_locked(struct swait_queue_head *q);
 extern void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait);
 
+extern int try_to_wake_up(struct task_struct *tsk, unsigned int state, int wake_flags);
+
 #ifdef CONFIG_PREEMPT_DYNAMIC
 extern int preempt_dynamic_mode;
 extern int sched_dynamic_mode(const char *str);
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2023-01-19 10:10 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-10 21:30 [PATCH 0/5 v3 RESEND] seccomp: add the synchronous mode for seccomp_unotify Andrei Vagin
2023-01-10 21:30 ` [PATCH 1/5] seccomp: don't use semaphore and wait_queue together Andrei Vagin
2023-01-12 14:58   ` Tycho Andersen
2023-01-13 21:51     ` Andrei Vagin
2023-01-16  9:48   ` Peter Zijlstra
2023-01-10 21:30 ` [PATCH 2/5] sched: add WF_CURRENT_CPU and externise ttwu Andrei Vagin
2023-01-12  7:35   ` Chen Yu
2023-01-13 21:39     ` Andrei Vagin
2023-01-14 15:59       ` Chen Yu
2023-01-18  6:10         ` Andrei Vagin
2023-01-10 21:30 ` [PATCH 3/5] sched: add a few helpers to wake up tasks on the current cpu Andrei Vagin
2023-01-16  9:59   ` Peter Zijlstra
2023-01-19  6:45     ` Andrei Vagin
2023-01-19 10:09       ` Peter Zijlstra
2023-01-10 21:30 ` [PATCH 4/5] seccomp: add the synchronous mode for seccomp_unotify Andrei Vagin
2023-01-12 15:00   ` Tycho Andersen
2023-01-14  1:16     ` Andrei Vagin
2023-01-10 21:30 ` [PATCH 5/5] selftest/seccomp: add a new test for the sync mode of seccomp_user_notify Andrei Vagin
  -- strict thread matches above, loose matches on Subject: below --
2022-11-11  7:31 [PATCH 0/5 v3] seccomp: add the synchronous mode for seccomp_unotify Andrei Vagin
2022-11-11  7:31 ` [PATCH 2/5] sched: add WF_CURRENT_CPU and externise ttwu Andrei Vagin
2022-10-20  1:10 [PATCH 0/5 v2] seccomp: add the synchronous mode for seccomp_unotify Andrei Vagin
2022-10-20  1:10 ` [PATCH 2/5] sched: add WF_CURRENT_CPU and externise ttwu Andrei Vagin
2022-10-20  4:54   ` Kees Cook
2022-10-21  0:48     ` Andrei Vagin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).