linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jann Horn <jannh@google.com>
To: Peter Oskolkov <posk@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	kernel list <linux-kernel@vger.kernel.org>,
	Linux API <linux-api@vger.kernel.org>,
	Paul Turner <pjt@google.com>, Ben Segall <bsegall@google.com>,
	Peter Oskolkov <posk@posk.io>,
	Joel Fernandes <joel@joelfernandes.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Andrei Vagin <avagin@google.com>,
	Jim Newsome <jnewsome@torproject.org>
Subject: Re: [RFC PATCH v0.1 4/9] sched/umcg: implement core UMCG API
Date: Fri, 21 May 2021 23:33:14 +0200	[thread overview]
Message-ID: <CAG48ez3Ur61rpOZduQRFabB9R=RbSin9Th+=0=z9FUpcZ21C=w@mail.gmail.com> (raw)
In-Reply-To: <20210520183614.1227046-5-posk@google.com>

On Thu, May 20, 2021 at 8:36 PM Peter Oskolkov <posk@google.com> wrote:
> Implement version 1 of core UMCG API (wait/wake/swap).
>
> As has been outlined in
> https://lore.kernel.org/lkml/20200722234538.166697-1-posk@posk.io/,
> efficient and synchronous on-CPU context switching is key
> to enabling two broad use cases: in-process M:N userspace scheduling
> and fast X-process RPCs for security wrappers.
>
> High-level design considerations/approaches used:
> - wait & wake can race with each other;
> - offload as much work as possible to libumcg in tools/lib/umcg,
>   specifically:
>   - most state changes, e.g. RUNNABLE <=> RUNNING, are done in
>     the userspace (libumcg);
>   - retries are offloaded to the userspace.
[...]
> diff --git a/kernel/sched/umcg.c b/kernel/sched/umcg.c
[...]
> +static int get_state(struct umcg_task __user *ut, u32 *state)
> +{
> +       return get_user(*state, (u32 __user *)ut);

Why the cast instead of get_user(*state, &ut->state)?
And maybe do this inline instead of adding a separate helper for it?

> +}
> +
> +static int put_state(struct umcg_task __user *ut, u32 state)
> +{
> +       return put_user(state, (u32 __user *)ut);
> +}
[...]
> +static int do_context_switch(struct task_struct *next)
> +{
> +       struct umcg_task_data *utd = rcu_access_pointer(current->umcg_task_data);
> +
> +       /*
> +        * It is important to set_current_state(TASK_INTERRUPTIBLE) before
> +        * waking @next, as @next may immediately try to wake current back
> +        * (e.g. current is a server, @next is a worker that immediately
> +        * blocks or waits), and this next wakeup must not be lost.
> +        */
> +       set_current_state(TASK_INTERRUPTIBLE);
> +
> +       WRITE_ONCE(utd->in_wait, true);
> +
> +       if (!try_to_wake_up(next, TASK_NORMAL, WF_CURRENT_CPU))
> +               return -EAGAIN;
> +
> +       freezable_schedule();
> +
> +       WRITE_ONCE(utd->in_wait, false);
> +
> +       if (signal_pending(current))
> +               return -EINTR;

What is this -EINTR supposed to tell userspace? We can't tell whether
we were woken up by a signal or by do_context_switch() or the
umcg_wake syscall, right? If we're woken by another thread calling
do_context_switch() and then get a signal immediately afterwards,
can't that lead to a lost wakeup?

I don't know whether trying to track the origin of the wakeup is a
workable approach here; you might have to instead do cmpxchg() on the
->in_wait field and give it three states (default, waiting-for-wake
and successfully-woken)?
Or you give up on trying to figure out who woke you, just always
return zero, and let userspace deal with figuring out whether the
wakeup was real or not. I don't know whether that'd be acceptable.

> +       return 0;
> +}
> +
> +static int do_wait(void)
> +{
> +       struct umcg_task_data *utd = rcu_access_pointer(current->umcg_task_data);
> +
> +       if (!utd)
> +               return -EINVAL;
> +
> +       WRITE_ONCE(utd->in_wait, true);
> +
> +       set_current_state(TASK_INTERRUPTIBLE);
> +       freezable_schedule();
> +
> +       WRITE_ONCE(utd->in_wait, false);
> +
> +       if (signal_pending(current))
> +               return -EINTR;
> +
> +       return 0;
>  }
>
>  /**
> @@ -90,7 +228,23 @@ SYSCALL_DEFINE1(umcg_unregister_task, u32, flags)
>  SYSCALL_DEFINE2(umcg_wait, u32, flags,
>                 const struct __kernel_timespec __user *, timeout)
>  {
> -       return -ENOSYS;
> +       struct umcg_task_data *utd;
> +
> +       if (flags)
> +               return -EINVAL;
> +       if (timeout)
> +               return -EOPNOTSUPP;
> +
> +       rcu_read_lock();
> +       utd = rcu_dereference(current->umcg_task_data);
> +       if (!utd) {
> +               rcu_read_unlock();
> +               return -EINVAL;
> +       }
> +
> +       rcu_read_unlock();

rcu_access_pointer() instead of the locking and unlocking?

> +       return do_wait();
>  }
>
>  /**
> @@ -110,7 +264,39 @@ SYSCALL_DEFINE2(umcg_wait, u32, flags,
>   */
>  SYSCALL_DEFINE2(umcg_wake, u32, flags, u32, next_tid)
>  {
> -       return -ENOSYS;
> +       struct umcg_task_data *next_utd;
> +       struct task_struct *next;
> +       int ret = -EINVAL;
> +
> +       if (!next_tid)
> +               return -EINVAL;
> +       if (flags)
> +               return -EINVAL;
> +
> +       next = find_get_task_by_vpid(next_tid);
> +       if (!next)
> +               return -ESRCH;
> +       rcu_read_lock();

Wouldn't it be more efficient to replace the last 4 lines with the following?

rcu_read_lock();
next = find_task_by_vpid(next_tid);
if (!next) {
  err = -ESRCH;
  goto out;
}

Then you don't need to use refcounting here...

> +       next_utd = rcu_dereference(next->umcg_task_data);
> +       if (!next_utd)
> +               goto out;
> +
> +       if (!READ_ONCE(next_utd->in_wait)) {
> +               ret = -EAGAIN;
> +               goto out;
> +       }
> +
> +       ret = wake_up_process(next);
> +       put_task_struct(next);

... and you'd be able to drop this put_task_struct(), too.

> +       if (ret)
> +               ret = 0;
> +       else
> +               ret = -EAGAIN;
> +
> +out:
> +       rcu_read_unlock();
> +       return ret;
>  }
>
>  /**
> @@ -139,5 +325,44 @@ SYSCALL_DEFINE2(umcg_wake, u32, flags, u32, next_tid)
>  SYSCALL_DEFINE4(umcg_swap, u32, wake_flags, u32, next_tid, u32, wait_flags,
>                 const struct __kernel_timespec __user *, timeout)
>  {
> -       return -ENOSYS;
> +       struct umcg_task_data *curr_utd;
> +       struct umcg_task_data *next_utd;
> +       struct task_struct *next;
> +       int ret = -EINVAL;
> +
> +       rcu_read_lock();
> +       curr_utd = rcu_dereference(current->umcg_task_data);
> +
> +       if (!next_tid || wake_flags || wait_flags || !curr_utd)
> +               goto out;
> +
> +       if (timeout) {
> +               ret = -EOPNOTSUPP;
> +               goto out;
> +       }
> +
> +       next = find_get_task_by_vpid(next_tid);
> +       if (!next) {
> +               ret = -ESRCH;
> +               goto out;
> +       }

There isn't any type of access check here, right? Any task can wake up
any other task? That feels a bit weird to me - and if you want to keep
it as-is, it should probably at least be documented that any task on
the system can send you spurious wakeups if you opt in to umcg.

In contrast, shared futexes can avoid this because they get their
access control implicitly from the VMA.

> +       next_utd = rcu_dereference(next->umcg_task_data);
> +       if (!next_utd) {
> +               ret = -EINVAL;
> +               goto out;
> +       }
> +
> +       if (!READ_ONCE(next_utd->in_wait)) {
> +               ret = -EAGAIN;
> +               goto out;
> +       }
> +
> +       rcu_read_unlock();
> +
> +       return do_context_switch(next);

It looks like the refcount of the target task is incremented but never
decremented, so this probably currently leaks references?

I'd maybe try to split do_context_switch() into two parts, one that
does the non-blocking waking of another task and one that does the
sleeping. Then you can avoid taking a reference on the task as above -
this is supposed to be a really hot fastpath, so it's a good idea to
avoid atomic instructions if possible, right?



> +out:
> +       rcu_read_unlock();
> +       return ret;
>  }

  parent reply	other threads:[~2021-05-21 21:33 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-05-20 18:36 [RFC PATCH v0.1 0/9] UMCG early preview/RFC patchset Peter Oskolkov
2021-05-20 18:36 ` [RFC PATCH v0.1 1/9] sched/umcg: add UMCG syscall stubs and CONFIG_UMCG Peter Oskolkov
2021-05-20 18:36 ` [RFC PATCH v0.1 2/9] sched/umcg: add uapi/linux/umcg.h and sched/umcg.c Peter Oskolkov
2021-05-20 18:36 ` [RFC PATCH v0.1 3/9] sched: add WF_CURRENT_CPU and externise ttwu Peter Oskolkov
2021-05-20 18:36 ` [RFC PATCH v0.1 4/9] sched/umcg: implement core UMCG API Peter Oskolkov
2021-05-21 19:06   ` Andrei Vagin
2021-05-21 21:31     ` Jann Horn
2021-05-21 22:03       ` Peter Oskolkov
2021-05-21 19:32   ` Andy Lutomirski
2021-05-21 22:01     ` Peter Oskolkov
2021-05-21 21:33   ` Jann Horn [this message]
2021-06-09 13:01     ` Peter Zijlstra
2021-05-20 18:36 ` [RFC PATCH v0.1 5/9] lib/umcg: implement UMCG core API for userspace Peter Oskolkov
2021-05-20 18:36 ` [RFC PATCH v0.1 6/9] selftests/umcg: add UMCG core API selftest Peter Oskolkov
2021-05-20 18:36 ` [RFC PATCH v0.1 7/9] sched/umcg: add UMCG server/worker API (early RFC) Peter Oskolkov
2021-05-21 20:17   ` Andrei Vagin
2021-05-20 18:36 ` [RFC PATCH v0.1 8/9] lib/umcg: " Peter Oskolkov
2021-05-20 18:36 ` [RFC PATCH v0.1 9/9] selftests/umcg: add UMCG server/worker API selftest Peter Oskolkov
2021-05-20 21:17 ` [RFC PATCH v0.1 0/9] UMCG early preview/RFC patchset Jonathan Corbet
2021-05-20 21:38   ` Peter Oskolkov
2021-05-21  0:15     ` Randy Dunlap
2021-05-21  8:04       ` Peter Zijlstra
2021-05-21 15:08     ` Jonathan Corbet
2021-05-21 16:03       ` Peter Oskolkov
2021-05-21 19:17         ` Jonathan Corbet
2021-05-27  0:06           ` Peter Oskolkov
2021-05-27 15:41             ` Jonathan Corbet
     [not found] ` <CAEWA0a72SvpcuN4ov=98T3uWtExPCr7BQePOgjkqD1ofWKEASw@mail.gmail.com>
2021-05-21 19:13   ` Peter Oskolkov
2021-05-21 23:08     ` Jann Horn
2021-06-09 12:54 ` Peter Zijlstra
2021-06-09 20:18   ` Peter Oskolkov
2021-06-10 18:02     ` Peter Zijlstra
2021-06-10 20:06       ` Peter Oskolkov
2021-07-07 17:45       ` Thierry Delisle
2021-07-08 21:44         ` Peter Oskolkov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAG48ez3Ur61rpOZduQRFabB9R=RbSin9Th+=0=z9FUpcZ21C=w@mail.gmail.com' \
    --to=jannh@google.com \
    --cc=akpm@linux-foundation.org \
    --cc=avagin@google.com \
    --cc=bsegall@google.com \
    --cc=jnewsome@torproject.org \
    --cc=joel@joelfernandes.org \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=pjt@google.com \
    --cc=posk@google.com \
    --cc=posk@posk.io \
    --cc=tglx@linutronix.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).