From: ebiederm@xmission.com (Eric W. Biederman)
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: LKML <linux-kernel@vger.kernel.org>,
	Linux FS Devel <linux-fsdevel@vger.kernel.org>,
	Alexey Dobriyan <adobriyan@gmail.com>,
	Alexey Gladkov <legion@kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Oleg Nesterov <oleg@redhat.com>,
	Alexey Gladkov <gladkov.alexey@gmail.com>
Subject: Re: [PATCH v2 2/2] proc: Ensure we see the exit of each process tid exactly
Date: Fri, 24 Apr 2020 14:51:25 -0500
Message-ID: <87mu70psqq.fsf@x220.int.ebiederm.org>
In-Reply-To: <CAHk-=wj-K3fqdMr-r8WgS8RKPuZOuFbPXCEUe9APrdShn99xsA@mail.gmail.com> (Linus Torvalds's message of "Fri, 24 Apr 2020 11:02:35 -0700")

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Thu, Apr 23, 2020 at 8:36 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> At one point I had forgotten that xchg can not take two memory
>> arguments and had hoped to be able to provide stronger guarantees than I
>> can.  Which is where I think the structure of exchange_pids came from.
>
> Note that even if we were to have a "exchange two memory locations
> atomically" instruction (and we don't - even a "double cmpxchg" is
> actually just a double-_sized_ one, not a two different locations
> one), I'm not convinced it makes sense.
>
> There's no way to _walk_ two lists atomically. Any user will only ever
> walk one or the other, so it's not sensible to try to make the two
> list updates be atomic.
>
> And if a user for some reason walks both, the walking itself will
> obviously then be racy - it does one or the other first, and can see
> either the old state, or the new state - or see _neither_ (ie if you
> walk it twice, you might see neither task, or you might see both, just
> depending on order of walk).
>
>> I do agree the clearer we can write things, the easier it is for
>> someone else to come along and follow.
>
> Your alternate write of the function seems a bit more readable to me,
> even if the main effect might be just that it was split up a bit and
> added a few comments and whitespace.
>
> So I'm happier with that one. That said:
>
>> We can not use a remove and reinsert model because that does break rcu
>> accesses, and complicates everything else.  With a swap model we have
>> the struct pids pointer at either of the tasks that are swapped but
>> never at nothing.
>
> I'm not suggesting removing the pid entirely - like making task->pid
> be NULL. I'm literally suggesting just doing the RCU list operations
> as "remove and re-insert".
>
> And that shouldn't break anything, for the same reason that an atomic
> exchange doesn't make sense: you can only ever walk one of the lists
> at a time. And regardless of how you walk it, you might not see the
> new state (or the old state) reliably.
>
> Put another way:
>
>>         void hlist_swap_before_rcu(struct hlist_node *left, struct hlist_node *right)
>>         {
>>                 struct hlist_node **lpprev = left->pprev;
>>                 struct hlist_node **rpprev = right->pprev;
>>
>>                 rcu_assign_pointer(*lpprev, right);
>>                 rcu_assign_pointer(*rpprev, left);
>
> These are the only two assignments that matter for anything that walks
> the list (the pprev ones are for things that change the list, and they
> have to have exclusions in place).
>
> And those two writes cannot be atomic anyway, so you fundamentally
> will always be in the situation that a walker can miss one of the
> tasks.
>
> Which is why I think it would be ok to just do the RCU list swap as a
> "remove left, remove right, add left, add right" operation. It doesn't
> seem fundamentally different to a walker than the "switch left/right"
> operation, and it seems much simpler.
>
> Is there something I'm missing?


The problem with

remove
remove
add
add
is:

A lookup that hits between the remove and the add could return nothing.

Here is what kill_pid_info, which does everything it can to handle this
case, does today:

int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid)
{
	int error = -ESRCH;
	struct task_struct *p;

	for (;;) {
		rcu_read_lock();
		p = pid_task(pid, PIDTYPE_PID);
		if (p)
			error = group_send_sig_info(sig, info, p, PIDTYPE_TGID);
		rcu_read_unlock();
		if (likely(!p || error != -ESRCH))
			return error;

		/*
		 * The task was unhashed in between, try again.  If it
		 * is dead, pid_task() will return NULL, if we race with
		 * de_thread() it will find the new leader.
		 */
	}
}

Now kill_pid_info is signalling the entire task and is just using
PIDTYPE_PID to find a thread in the task.

With the remove then add model there will be a point where pid_task
will return nothing, because ever so briefly the lists will be
empty.

However with an actual swap we will always find a task and kill_pid_info
will work.  In pathological cases lock_task_sighand might have to loop
and we would need to find the new task that has the given pid.  But
kill_pid_info is guaranteed to work with swaps and will fail with
remove/add.
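
As a rough illustration of the difference, here is a userspace sketch,
not kernel code.  Everything in it (struct task, struct pid_slot,
slot_lookup, slot_publish, run_model and friends) is a hypothetical
stand-in for the one-entry PIDTYPE_PID list of a struct pid, with C11
atomics standing in for rcu_assign_pointer/rcu_dereference.  It is
single threaded and only counts the states a lockless reader could
observe after each individual store, so it shows the windows that
exist, not how likely a real reader is to hit them.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins, not the kernel's task_struct / struct pid. */
struct task { int id; };

struct pid_slot {
	/* Models the one-entry PIDTYPE_PID list head of a struct pid. */
	_Atomic(struct task *) first;
};

/* Models pid_task(): a lockless reader samples the head once. */
struct task *slot_lookup(struct pid_slot *slot)
{
	return atomic_load_explicit(&slot->first, memory_order_acquire);
}

/* Models a single rcu_assign_pointer() style store to the head. */
void slot_publish(struct pid_slot *slot, struct task *t)
{
	atomic_store_explicit(&slot->first, t, memory_order_release);
}

/* How many of the two slots a reader would find empty right now. */
int empty_now(struct pid_slot *l, struct pid_slot *r)
{
	return (slot_lookup(l) == NULL) + (slot_lookup(r) == NULL);
}

/* remove, remove, add, add: sum the empty sightings after every store. */
int empty_sightings_remove_add(struct pid_slot *l, struct pid_slot *r)
{
	struct task *lt = slot_lookup(l), *rt = slot_lookup(r);
	int seen = 0;

	slot_publish(l, NULL);	seen += empty_now(l, r);
	slot_publish(r, NULL);	seen += empty_now(l, r);
	slot_publish(l, rt);	seen += empty_now(l, r);
	slot_publish(r, lt);	seen += empty_now(l, r);
	return seen;
}

/* swap: each store replaces one task pointer with the other one. */
int empty_sightings_swap(struct pid_slot *l, struct pid_slot *r)
{
	struct task *lt = slot_lookup(l), *rt = slot_lookup(r);
	int seen = 0;

	slot_publish(l, rt);	seen += empty_now(l, r);
	slot_publish(r, lt);	seen += empty_now(l, r);
	return seen;
}

/* Runs one model on two fresh slots and checks the tasks were exchanged. */
int run_model(int (*model)(struct pid_slot *, struct pid_slot *))
{
	struct task a = { 1 }, b = { 2 };
	struct pid_slot l, r;
	int seen;

	atomic_init(&l.first, &a);
	atomic_init(&r.first, &b);
	seen = model(&l, &r);
	assert(slot_lookup(&l) == &b && slot_lookup(&r) == &a);
	return seen;
}
```

In this model run_model(empty_sightings_swap) never reports an empty
slot, while run_model(empty_sightings_remove_add) does, and each of
those empty sightings is a point where a pid_task style lookup would
return NULL for a pid that is still in use.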


> But I'm *not* suggesting that we change these simple parts to be
> "remove thread_pid or pid pointer, and then insert a new one":
>
>>                 /* Swap thread_pid */
>>                 rpid = left->thread_pid;
>>                 lpid = right->thread_pid;
>>                 rcu_assign_pointer(left->thread_pid, lpid);
>>                 rcu_assign_pointer(right->thread_pid, rpid);
>>
>>                 /* Swap the cached pid value */
>>                 WRITE_ONCE(left->pid, pid_nr(lpid));
>>                 WRITE_ONCE(right->pid, pid_nr(rpid));
>>         }
>
> because I agree that for things that don't _walk_ the list, but just
> look up "thread_pid" vs "pid" atomically but asynchronously, we
> obviously need to get one or the other, not some kind of "empty"
> state.

For PIDTYPE_PID and PIDTYPE_TGID these practically aren't lists but
pointers to the appropriate task.  Only for PIDTYPE_PGID and PIDTYPE_SID
do these become lists in practice.

That not-really-a-list status allows for signal delivery to individual
processes to happen in rcu context.  Which is where we would get into
trouble with add/remove.

Since signals are guaranteed to be delivered to the entire session
or the entire process group, all of the list walking currently happens
under the tasklist_lock.  That really keeps list walking from
being a concern.

>> Does that look a little more readable?
>
> Regardless, I find your new version at least a lot more readable, so
> I'm ok with it.

Good. Then I will finish cleaning it up and go with that version.

> It looks like Oleg found an independent issue, though.

Yes, and I will definitely work through those.

Eric

