linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Andy Lutomirski <luto@amacapital.net>
To: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Thomas Gleixner <tglx@linutronix.de>,
	"linux-doc@vger.kernel.org" <linux-doc@vger.kernel.org>,
	Christoph Lameter <cl@linux.com>, Michal Hocko <mhocko@suse.com>,
	Gilad Ben Yossef <giladb@mellanox.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Viresh Kumar <viresh.kumar@linaro.org>,
	Linux API <linux-api@vger.kernel.org>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ingo Molnar <mingo@kernel.org>, Tejun Heo <tj@kernel.org>,
	Rik van Riel <riel@redhat.com>, Will Deacon <will.deacon@arm.com>,
	Frederic Weisbecker <fweisbec@gmail.com>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Peter Zijlstra <peterz@infradead.org>
Subject: Re: [PATCH v15 04/13] task_isolation: add initial support
Date: Fri, 2 Sep 2016 10:28:00 -0700	[thread overview]
Message-ID: <CALCETrXrsZjMjdd1jACbrz8GMXQC5FmF8BbkHobmMCbG5GPN7w@mail.gmail.com> (raw)
In-Reply-To: <3f84f736-ed7f-adff-d5f0-4f7db664208f@mellanox.com>

On Sep 2, 2016 7:04 AM, "Chris Metcalf" <cmetcalf@mellanox.com> wrote:
>
> On 8/30/2016 3:50 PM, Andy Lutomirski wrote:
>>
>> On Tue, Aug 30, 2016 at 12:37 PM, Chris Metcalf <cmetcalf@mellanox.com> wrote:
>>>
>>> On 8/30/2016 2:43 PM, Andy Lutomirski wrote:
>>>>
>>>> What if we did it the other way around: set a percpu flag saying
>>>> "going quiescent; disallow new deferred work", then finish all
>>>> existing work and return to userspace.  Then, on the next entry, clear
>>>> that flag.  With the flag set, vmstat would just flush anything that
>>>> it accumulates immediately, nothing would be added to the LRU list,
>>>> etc.
>>>
>>>
>>> This is an interesting idea!
>>>
>>> However, there are a number of implementation ideas that make me
>>> worry that it might be a trickier approach overall.
>>>
>>> First, "on the next entry" hides a world of hurt in four simple words.
>>> Some platforms (arm64 and tile, that I'm familiar with) have a common
>>> chunk of code that always runs on every entry to the kernel.  It would
>>> not be too hard to poke at the assembly and make those platforms
>>> always run some task-isolation specific code on entry.  But x86 scares
>>> me - there seem to be a whole lot of ways to get into the kernel, and
>>> I'm not convinced there is a lot of shared macrology or whatever that
>>> would make it straightforward to intercept all of them.
>>
>> Just use the context tracking entry hook.  It's 100% reliable.  The
>> relevant x86 function is enter_from_user_mode(), but I would just hook
>> into user_exit() in the common code.  (This code is had better be
>> reliable, because context tracking depends on it, and, if context
>> tracking doesn't work on a given arch, then isolation isn't going to
>> work regardless.
>
>
> This looks a lot cleaner than last time I looked at the x86 code. So yes, I think
> we could do an entry-point approach plausibly now.
>
> This is also good for when we want to look at deferring the kernel TLB flush,
> since it's the same mechanism that would be required for that.
>
>

There's at least one gotcha for the latter: NMIs aren't currently
guaranteed to go through context tracking.  Instead they use their own
RCU hooks.  Deferred TLB flushes can still be made to work, but a bit
more care will be needed.  I would probably approach it with an
additional NMI hook in the same places as rcu_nmi_enter() that does,
more or less:

if (need_tlb_flush) flush();

and then make sure that the normal exit hook looks like:

if (need_tlb_flush) {
  flush();
  barrier(); /* An NMI must not see !need_tlb_flush if the TLB hasn't
been flushed */
  flush the TLB;
}

>>> But it does seem like we are adding noticeable maintenance cost on
>>> the mainline kernel to support task isolation by doing this.  My guess
>>> is that it is easier to support the kind of "are you clean?" / "get clean"
>>> APIs for subsystems, rather than weaving a whole set of "stay clean"
>>> mechanism into each subsystem.
>>
>> My intuition is that it's the other way around.  For the mainline
>> kernel, having a nice clean well-integrated implementation is nicer
>> than having a bolted-on implementation that interacts in potentially
>> complicated ways.  Once quiescence support is in mainline, the size of
>> the diff or the degree to which it's scattered around is irrelevant
>> because it's not a diff any more.
>
>
> I'm not concerned with the size of the diff, just with the intrusiveness into
> the various subsystems.
>
> That said, code talks, so let me take a swing at doing it the way you suggest
> for vmstat/lru and we'll see what it looks like.

Thanks :)

>
>
>>> So to pop up a level, what is your actual concern about the existing
>>> "do it in a loop" model?  The macrology currently in use means there
>>> is zero cost if you don't configure TASK_ISOLATION, and the software
>>> maintenance cost seems low since the idioms used for task isolation
>>> in the loop are generally familiar to people reading that code.
>>
>> My concern is that it's not obvious to readers of the code that the
>> loop ever terminates.  It really ought to, but it's doing something
>> very odd.  Normally we can loop because we get scheduled out, but
>> actually blocking in the return-to-userspace path, especially blocking
>> on a condition that doesn't have a wakeup associated with it, is odd.
>
>
> True, although, comments :-)
>
> Regardless, though, this doesn't seem at all weird to me in the
> context of the vmstat and lru stuff, though.  It's exactly parallel to
> the fact that we loop around on checking need_resched and signal, and
> in some cases you could imagine multiple loops around when we schedule
> out and get a signal, so loop around again, and then another
> reschedule event happens during signal processing so we go around
> again, etc.  Eventually it settles down.  It's the same with the
> vmstat/lru stuff.

Only kind of.

When we say, effectively, while (need_resched()) schedule();, we're
not waiting for an event or condition per se.  We're runnable (in the
sense that userspace wants to run and we're not blocked on anything)
the entire time -- we're simply yielding to some other thread that is
also runnable.  So if that loop runs forever, it either means that
we're at low priority and we genuinely shouldn't be running or that
there's a scheduler bug.

If, on the other hand, we say while (not quiesced) schedule(); (or
equivalent), we're saying that we're *not* really ready to run and
that we're waiting for some condition to change.  The condition in
question is fairly complicated and won't wake us when we are ready.  I
can also imagine the scheduler getting rather confused, since, as far
as the scheduler knows, we are runnable and we are supposed to be
running.

>
>
>>>> Also, this cond_resched stuff doesn't worry me too much at a
>>>> fundamental level -- if we're really going quiescent, shouldn't we be
>>>> able to arrange that there are no other schedulable tasks on the CPU
>>>> in question?
>>>
>>> We aren't currently planning to enforce things in the scheduler, so if
>>> the application affinitizes another task on top of an existing task
>>> isolation task, by default the task isolation task just dies. (Unless
>>> it's using NOSIG mode, in which case it just ends up stuck in the
>>> kernel trying to wait out the dyntick until you either kill it, or
>>> re-affinitize the offending task.)  But I'm reluctant to guarantee
>>> every possible way that you might (perhaps briefly) have some
>>> schedulable task, and the current approach seems pretty robust if that
>>> sort of thing happens.
>>
>> This kind of waiting out the dyntick scares me.  Why is there ever a
>> dyntick that you're waiting out?  If quiescence is to be a supported
>> mainline feature, shouldn't the scheduler be integrated well enough
>> with it that you don't need to wait like this?
>
>
> Well, this is certainly the funkiest piece of the task isolation
> stuff.  The problem is that the dyntick stuff may, for example, need
> one more tick 4us from now (or whatever) just to close out the current
> RCU period.  We can't return to userspace until that happens.  So what
> else can we do when the task is ready to return to userspace?  We
> could punt into the idle task instead of waiting in this task, which
> was my earlier schedule_time() suggestion.  Do you think that's cleaner?
>

Unless I'm missing something (which is reasonably likely), couldn't
the isolation code just force or require rcu_nocbs on the isolated
CPUs to avoid this problem entirely.

I admit I still don't understand why the RCU context tracking code
can't just run the callback right away instead of waiting however many
microseconds in general.  I feel like paulmck has explained it to me
at least once, but that doesn't mean I remember the answer.

--Andy

  reply	other threads:[~2016-09-02 17:28 UTC|newest]

Thread overview: 80+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-08-16 21:19 [PATCH v15 00/13] support "task_isolation" mode Chris Metcalf
2016-08-16 21:19 ` [PATCH v15 01/13] vmstat: add quiet_vmstat_sync function Chris Metcalf
2016-08-16 21:19 ` [PATCH v15 02/13] vmstat: add vmstat_idle function Chris Metcalf
2016-08-16 21:19 ` [PATCH v15 03/13] lru_add_drain_all: factor out lru_add_drain_needed Chris Metcalf
2016-08-16 21:19 ` [PATCH v15 04/13] task_isolation: add initial support Chris Metcalf
2016-08-29 16:33   ` Peter Zijlstra
2016-08-29 16:40     ` Chris Metcalf
2016-08-29 16:48       ` Peter Zijlstra
2016-08-29 16:53         ` Chris Metcalf
2016-08-30  7:59           ` Peter Zijlstra
2016-08-30  7:58       ` Peter Zijlstra
2016-08-30 15:32         ` Chris Metcalf
2016-08-30 16:30           ` Andy Lutomirski
2016-08-30 17:02             ` Chris Metcalf
2016-08-30 18:43               ` Andy Lutomirski
2016-08-30 19:37                 ` Chris Metcalf
2016-08-30 19:50                   ` Andy Lutomirski
2016-09-02 14:04                     ` Chris Metcalf
2016-09-02 17:28                       ` Andy Lutomirski [this message]
2016-09-09 17:40                         ` Chris Metcalf
2016-09-12 17:41                           ` Andy Lutomirski
2016-09-12 19:25                             ` Chris Metcalf
2016-09-27 14:22                         ` Frederic Weisbecker
2016-09-27 14:39                           ` Peter Zijlstra
2016-09-27 14:51                             ` Frederic Weisbecker
2016-09-27 14:48                           ` Paul E. McKenney
2016-09-30 16:59                 ` Chris Metcalf
2016-09-01 10:06           ` Peter Zijlstra
2016-09-02 14:03             ` Chris Metcalf
2016-09-02 16:40               ` Peter Zijlstra
2017-02-02 16:13   ` Eugene Syromiatnikov
2017-02-02 18:12     ` Chris Metcalf
2016-08-16 21:19 ` [PATCH v15 05/13] task_isolation: track asynchronous interrupts Chris Metcalf
2016-08-16 21:19 ` [PATCH v15 06/13] arch/x86: enable task isolation functionality Chris Metcalf
2016-08-30 21:46   ` Andy Lutomirski
2016-08-16 21:19 ` [PATCH v15 07/13] arm64: factor work_pending state machine to C Chris Metcalf
2016-08-17  8:05   ` Will Deacon
2016-08-16 21:19 ` [PATCH v15 08/13] arch/arm64: enable task isolation functionality Chris Metcalf
2016-08-26 16:25   ` Catalin Marinas
2016-08-16 21:19 ` [PATCH v15 09/13] arch/tile: " Chris Metcalf
2016-08-16 21:19 ` [PATCH v15 10/13] arm, tile: turn off timer tick for oneshot_stopped state Chris Metcalf
2016-08-16 21:19 ` [PATCH v15 11/13] task_isolation: support CONFIG_TASK_ISOLATION_ALL Chris Metcalf
2016-08-16 21:19 ` [PATCH v15 12/13] task_isolation: add user-settable notification signal Chris Metcalf
2016-08-16 21:19 ` [PATCH v15 13/13] task_isolation self test Chris Metcalf
2016-08-17 19:37 ` [PATCH] Fix /proc/stat freezes (was [PATCH v15] "task_isolation" mode) Christoph Lameter
2016-08-20  1:42   ` Chris Metcalf
2016-09-28 13:16   ` Frederic Weisbecker
2016-08-29 16:27 ` Ping: [PATCH v15 00/13] support "task_isolation" mode Chris Metcalf
2016-09-07 21:11   ` Francis Giraldeau
2016-09-07 21:39     ` Francis Giraldeau
2016-09-08 16:21     ` Francis Giraldeau
2016-09-12 16:01     ` Chris Metcalf
2016-09-12 16:14       ` Peter Zijlstra
2016-09-12 21:15         ` Rafael J. Wysocki
2016-09-13  0:05           ` Rafael J. Wysocki
2016-09-13 16:00             ` Francis Giraldeau
2016-09-13  0:20       ` Francis Giraldeau
2016-09-13 16:12         ` Chris Metcalf
2016-09-27 14:49         ` Frederic Weisbecker
2016-09-27 14:35   ` Frederic Weisbecker
2016-09-30 17:07     ` Chris Metcalf
2016-11-05  4:04 ` task isolation discussion at Linux Plumbers Chris Metcalf
2016-11-05 16:05   ` Christoph Lameter
2016-11-07 16:55   ` Thomas Gleixner
2016-11-07 18:36     ` Thomas Gleixner
2016-11-07 19:12       ` Rik van Riel
2016-11-07 19:16         ` Will Deacon
2016-11-07 19:18           ` Rik van Riel
2016-11-11 20:54     ` Luiz Capitulino
2016-11-09  1:40   ` Paul E. McKenney
2016-11-09 11:14     ` Andy Lutomirski
2016-11-09 17:38       ` Paul E. McKenney
2016-11-09 18:57         ` Will Deacon
2016-11-09 19:11           ` Paul E. McKenney
2016-11-10  1:44         ` Andy Lutomirski
2016-11-10  4:52           ` Paul E. McKenney
2016-11-10  5:10             ` Paul E. McKenney
2016-11-11 17:00             ` Andy Lutomirski
2016-11-09 11:07   ` Frederic Weisbecker
2016-12-19 14:37   ` Paul E. McKenney

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CALCETrXrsZjMjdd1jACbrz8GMXQC5FmF8BbkHobmMCbG5GPN7w@mail.gmail.com \
    --to=luto@amacapital.net \
    --cc=akpm@linux-foundation.org \
    --cc=catalin.marinas@arm.com \
    --cc=cl@linux.com \
    --cc=cmetcalf@mellanox.com \
    --cc=fweisbec@gmail.com \
    --cc=giladb@mellanox.com \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.com \
    --cc=mingo@kernel.org \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=peterz@infradead.org \
    --cc=riel@redhat.com \
    --cc=rostedt@goodmis.org \
    --cc=tglx@linutronix.de \
    --cc=tj@kernel.org \
    --cc=viresh.kumar@linaro.org \
    --cc=will.deacon@arm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).