linux-trace-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Beau Belgrave <beaub@linux.microsoft.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Steven Rostedt <rostedt@goodmis.org>,
	Alexei Starovoitov <alexei.starovoitov@gmail.com>,
	Masami Hiramatsu <mhiramat@kernel.org>,
	LKML <linux-kernel@vger.kernel.org>,
	linux-trace-kernel@vger.kernel.org,
	Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Andrii Nakryiko <andrii@kernel.org>, bpf <bpf@vger.kernel.org>,
	David Vernet <void@manifault.com>,
	dthaler@microsoft.com, brauner@kernel.org, hch@infradead.org
Subject: Re: [PATCH] tracing/user_events: Run BPF program if attached
Date: Wed, 17 May 2023 12:07:50 -0700	[thread overview]
Message-ID: <20230517190750.GA366@W11-BEAU-MD.localdomain> (raw)
In-Reply-To: <CAHk-=whzzuNEW8UcV2_8OyuKcXPrk7-j_8GzOoroxz9JiZiD3w@mail.gmail.com>

On Wed, May 17, 2023 at 11:15:14AM -0700, Linus Torvalds wrote:
> On Wed, May 17, 2023 at 10:22 AM Beau Belgrave
> <beaub@linux.microsoft.com> wrote:
> >
> > On Tue, May 16, 2023 at 08:03:09PM -0700, Linus Torvalds wrote:
> > > So what is it that could even race and change the list that is the
> > > cause of that rcu-ness?
> >
> > Processes that fork() with previous user_events need to be duplicated.
> 
> BS.
> 
> Really. Stop making stuff up.
> 
> The above statement is clearly not true - just LOOK AT THE CODE.
> 

user_event_mm_dup() puts a new mm into the global list before the
enablers list is fully populated. As it stands now, since it's in the
global list, it can get enumerated in a small timing window via the
tracing subsystem register callbacks when someone enables the event via
ftrace/perf.

> Here's the loop in question:
> 
>                 list_for_each_entry_rcu(enabler, &mm->enablers, link) {
>                         if (enabler->event == user) {
>                                 attempt = 0;
>                                 user_event_enabler_write(mm, enabler,
> true, &attempt);
>                         }
>                 }
> 
> and AT THE VERY TOP OF user_event_enabler_write() we have this:
> 
>         lockdep_assert_held(&event_mutex);
> 
> so either nobody has ever tested this code with lockdep enabled, or we
> hold that lock.
> 
> And if nobody has ever tested the code, then it's broken anyway. That
> code N#EEDS the mutex lock. It needs to stop thinking it's RCU-safe,
> when it clearly isn't.
> 
> So I ask again: why is that code using RCU list traversal, when it
> already holds the lock that makes the RCU'ness COMPLETELY POINTLESS.
> 
> And again, that pointless RCU locking around this all seems to be the
> *only* reason for all these issues with pin_user_pages_remote().
> 
> So I claim that this code is garbage.  Somebody didn't think about locking.
> 
> Now, it's true that during fork, we have *another* RCU loop, but that
> one is harmless: that's not the one that does all this page pinning.
> 
> Now, that one *does* do
> 
>         list_add_rcu(&enabler->link, &mm->enablers);
> 
> without actually holding any locks, but in this case 'mm' is a newly
> allocated private thing of a task that hasn't even been exposed to the
> world yet, so nobody should be able to even see it. So that code lacks
> the proper locking for the new list, but it does so because there is
> nothing that can race with the new list (and the old list is
> read-only, so RCU traversal of the old list works).
> 

Well, that's the problem I was trying to point out. The fork path calls
user_event_mm_dup() -> user_event_mm_create(), which DO expose this to
the trace world. I definitely need to fix that, then I can drop these RCU
paths in the enablers. It has been exposed out to the tracing tracepoint
paths, but it has yet to be exposed out to the newly forked process that
could cause data writes, add new events, disable them, etc.

> So that "list_add_rcu()" there could probably be just a "list_add()",
> with a comment saying "this is new, nobody can see it".
> 

Yes, after I fix the ordering of the mm add to the tracing global list.
That is clearly something I should have done originally and caused
confusion and extra RCU usage that is unneeded.

> And if something *can* race it it and can see the new list, then it
> had damn well needs that mutex lock anyway, because that "something"
> could be actually modifying it. But that's separate from the page
> pinning situation.
> 
> So again, I claim that the RCU'ness of the pin_user_pages part is
> broken and should simply not exist.
> 
> > > Other code in that file happily just does
> > >
> > >         mutex_lock(&event_mutex);
> > >
> > >         list_for_each_entry_safe(enabler, next, &mm->enablers, link)
> > >
> > > with no RCU anywhere. Why does user_event_enabler_update() not do that?
> >
> > This is due to the fork() case above without taking the event_mutex.
> 
> See above. Your thinking is confused, and the code is broken.
> 
> If somebody can see the new list while it is created during fork(),
> then you need the event_mutex to protect the creation of it.
> 
> And if nobody can see it, then you don't need any RCU protection against it.
> 
> Those are the two choices. You can't have it both ways.
> 
> > > Oh, and even those other loops are a bit strange. Why do they use the
> > > "_safe" variant, even when they just traverse the list without
> > > changing it? Look at user_event_enabler_exists(), for example.
> >
> > The other places in the code that do this either will remove the event
> > depending on the situation during the for_each, or they only hold the
> > register lock and don't hold the event_mutex.
> 
> So?
> 
> That "safe" variant doesn't imply any locking. It does *not* protect
> against events being removed. It *purely* protects against the loop
> itself removing entries.
> 
> So this code:
> 
>         list_for_each_entry_safe(enabler, next, &mm->enablers, link) {
>                 if (enabler->addr == uaddr &&
>                     (enabler->values & ENABLE_VAL_BIT_MASK) == bit)
>                         return true;
>         }
> 
> is simply nonsensical. There is no reason for the "safe". It does not
> make anything safer.
> 
> The above loop is only safe under the mutex (it would need to be the
> "rcu" variant to be safe to traverse without locking), and since it
> isn't modifying the list, there's no reason for the safe.
> 
> End result: the "safe" part is simply wrong.
> 

Got it, I was confused.

> If the intention is "rcu" because of lack of locking, then the code needs to
>  (a) get the rcu read lock
>  (b) use the _rcu variant of the list traversal
> 
> And if the intention is that it's done under the proper 'event_mutex'
> lock, then the "safe" part should simply be dropped.
> 
>                Linus

Thanks,
-Beau

  reply	other threads:[~2023-05-17 19:08 UTC|newest]

Thread overview: 52+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-05-08 16:37 [PATCH] tracing/user_events: Run BPF program if attached Beau Belgrave
2023-05-09 15:24 ` Alexei Starovoitov
2023-05-09 17:01   ` Steven Rostedt
2023-05-09 20:30     ` Steven Rostedt
2023-05-09 20:42       ` Steven Rostedt
2023-05-15 16:57       ` Alexei Starovoitov
2023-05-15 18:33         ` Steven Rostedt
2023-05-15 19:35           ` Beau Belgrave
2023-05-15 21:38             ` Steven Rostedt
2023-05-15 19:24         ` Beau Belgrave
2023-05-15 21:57           ` Steven Rostedt
2023-05-17  0:36           ` Alexei Starovoitov
2023-05-17  0:56             ` Linus Torvalds
2023-05-17  1:46               ` Linus Torvalds
2023-05-17  2:29                 ` Steven Rostedt
2023-05-17  3:03                   ` Linus Torvalds
2023-05-17 17:22                     ` Beau Belgrave
2023-05-17 18:15                       ` Linus Torvalds
2023-05-17 19:07                         ` Beau Belgrave [this message]
2023-05-17 19:26                           ` Linus Torvalds
2023-05-17 19:36                             ` Beau Belgrave
2023-05-17 19:36                             ` Linus Torvalds
2023-05-17 19:37                               ` Linus Torvalds
2023-05-17 23:00                                 ` Beau Belgrave
2023-05-17 23:14                                   ` Linus Torvalds
2023-05-17 23:25                                     ` Steven Rostedt
2023-05-18  0:14                                       ` Beau Belgrave
2023-05-18  0:23                                         ` Linus Torvalds
2023-05-17 20:08                               ` Linus Torvalds
2023-05-17  1:26             ` Steven Rostedt
2023-05-17 16:50               ` Beau Belgrave
2023-05-18  0:10                 ` Alexei Starovoitov
2023-05-18  0:19                   ` Beau Belgrave
2023-05-18  0:56                     ` Alexei Starovoitov
2023-05-18  1:18                       ` Beau Belgrave
2023-05-18  2:08                         ` Steven Rostedt
2023-05-18  3:14                           ` Alexei Starovoitov
2023-05-18 13:36                             ` Steven Rostedt
2023-05-18 17:28                               ` Beau Belgrave
2023-06-01  9:46                   ` Christian Brauner
2023-06-01 15:24                     ` Beau Belgrave
2023-06-01 15:57                       ` Christian Brauner
2023-06-01 16:29                         ` Beau Belgrave
2023-06-06 13:37                           ` Masami Hiramatsu
2023-06-06 17:05                             ` Beau Belgrave
2023-06-07 14:07                               ` Masami Hiramatsu
2023-06-07 19:26                                 ` Beau Belgrave
2023-06-08  0:25                                   ` Masami Hiramatsu
2023-05-17 17:51             ` Beau Belgrave
2023-06-06 13:57             ` Masami Hiramatsu
2023-06-06 16:57               ` Andrii Nakryiko
2023-06-06 20:57                 ` Beau Belgrave

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230517190750.GA366@W11-BEAU-MD.localdomain \
    --to=beaub@linux.microsoft.com \
    --cc=alexei.starovoitov@gmail.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=brauner@kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=dthaler@microsoft.com \
    --cc=hch@infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-trace-kernel@vger.kernel.org \
    --cc=mhiramat@kernel.org \
    --cc=rostedt@goodmis.org \
    --cc=torvalds@linux-foundation.org \
    --cc=void@manifault.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).