linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Chris Metcalf <cmetcalf@mellanox.com>
To: Andy Lutomirski <luto@amacapital.net>
Cc: "linux-doc@vger.kernel.org" <linux-doc@vger.kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Christoph Lameter <cl@linux.com>, Michal Hocko <mhocko@suse.com>,
	Gilad Ben Yossef <giladb@mellanox.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linux API <linux-api@vger.kernel.org>,
	"Viresh Kumar" <viresh.kumar@linaro.org>,
	Ingo Molnar <mingo@kernel.org>,
	"Steven Rostedt" <rostedt@goodmis.org>, Tejun Heo <tj@kernel.org>,
	Will Deacon <will.deacon@arm.com>, Rik van Riel <riel@redhat.com>,
	Frederic Weisbecker <fweisbec@gmail.com>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Peter Zijlstra <peterz@infradead.org>
Subject: Re: [PATCH v15 04/13] task_isolation: add initial support
Date: Mon, 12 Sep 2016 15:25:04 -0400	[thread overview]
Message-ID: <84b47c50-6f1b-dcdb-c90e-471d1b4adbac@mellanox.com> (raw)
In-Reply-To: <CALCETrV7n5Qs0Kkb-cwC3JWsngfR8hd+JrW87OvPCgRHmCdK8A@mail.gmail.com>

On 9/12/2016 1:41 PM, Andy Lutomirski wrote:
> On Sep 9, 2016 1:40 PM, "Chris Metcalf" <cmetcalf@mellanox.com> wrote:
>> On 9/2/2016 1:28 PM, Andy Lutomirski wrote:
>>> On Sep 2, 2016 7:04 AM, "Chris Metcalf" <cmetcalf@mellanox.com> wrote:
>>>> On 8/30/2016 3:50 PM, Andy Lutomirski wrote:
>>>>> On Tue, Aug 30, 2016 at 12:37 PM, Chris Metcalf <cmetcalf@mellanox.com> wrote:
>>>>>> So to pop up a level, what is your actual concern about the existing
>>>>>> "do it in a loop" model?  The macrology currently in use means there
>>>>>> is zero cost if you don't configure TASK_ISOLATION, and the software
>>>>>> maintenance cost seems low since the idioms used for task isolation
>>>>>> in the loop are generally familiar to people reading that code.
>>>>> My concern is that it's not obvious to readers of the code that the
>>>>> loop ever terminates.  It really ought to, but it's doing something
>>>>> very odd.  Normally we can loop because we get scheduled out, but
>>>>> actually blocking in the return-to-userspace path, especially blocking
>>>>> on a condition that doesn't have a wakeup associated with it, is odd.
>>>>
>>>> True, although, comments :-)
>>>>
>>>> Regardless, though, this doesn't seem at all weird to me in the
>>>> context of the vmstat and lru stuff, though.  It's exactly parallel to
>>>> the fact that we loop around on checking need_resched and signal, and
>>>> in some cases you could imagine multiple loops around when we schedule
>>>> out and get a signal, so loop around again, and then another
>>>> reschedule event happens during signal processing so we go around
>>>> again, etc.  Eventually it settles down.  It's the same with the
>>>> vmstat/lru stuff.
>>> Only kind of.
>>>
>>> When we say, effectively, while (need_resched()) schedule();, we're
>>> not waiting for an event or condition per se.  We're runnable (in the
>>> sense that userspace wants to run and we're not blocked on anything)
>>> the entire time -- we're simply yielding to some other thread that is
>>> also runnable.  So if that loop runs forever, it either means that
>>> we're at low priority and we genuinely shouldn't be running or that
>>> there's a scheduler bug.
>>>
>>> If, on the other hand, we say while (not quiesced) schedule(); (or
>>> equivalent), we're saying that we're *not* really ready to run and
>>> that we're waiting for some condition to change.  The condition in
>>> question is fairly complicated and won't wake us when we are ready.  I
>>> can also imagine the scheduler getting rather confused, since, as far
>>> as the scheduler knows, we are runnable and we are supposed to be
>>> running.
>>
>> So, how about a code structure like this?
>>
>> In the main return-to-userspace loop where we check TIF flags,
>> we keep the notion of our TIF_TASK_ISOLATION flag that causes
>> us to invoke a task_isolation_prepare() routine.  This routine
>> does the following things:
>>
>> 1. As you suggested, set a new TIF bit (or equivalent) that says the
>> system should no longer create deferred work on this core, and then
>> flush any necessary already-deferred work (currently, the LRU cache
>> and the vmstat stuff).  We never have to go flush the deferred work
>> again during this task's return to userspace.  Note that this bit can
>> only be set on a core marked for task isolation, so it can't be used
>> for denial of service type attacks on normal cores that are trying to
>> multitask normal Linux processes.
> I think it can't be a TIF flag unless you can do the whole mess with
> preemption off because, if you get preempted, other tasks on the CPU
> won't see the flag.  You could do it with a percpu flag, I think.

Yes, a percpu flag - you're right.  I think it will make sense for this to
be a flag declared in linux/isolation.h which can be read by vmstat, LRU, etc.

>> 2. Check if the dyntick is stopped, and if not, wait on a completion
>> that will be set when it does stop.  This means we may schedule out at
>> this point, but when we return, the deferred work stuff is still safe
>> since your bit is still set, and in principle the dyn tick is
>> stopped.
>>
>> Then, after we disable interrupts and re-read the thread-info flags,
>> we check to see if the TIF_TASK_ISOLATION flag is the ONLY flag still
>> set that would keep us in the loop.  This will always end up happening
>> on each return to userspace, since the only thing that actually clears
>> the bit is a prctl() call.  When that happens we know we are about to
>> return to userspace, so we call task_isolation_ready(), which now has
>> two things to do:
> Would it perhaps be more straightforward to do the stuff before the
> loop and not check TIF_TASK_ISOLATION in the loop?

We can certainly play around with just not looping in this case, but
in particular I can imagine an isolated task entering the kernel and
then doing something that requires scheduling a kernel task.  We'd
clearly like that other task to run before the isolated task returns to
userspace.  But then, that other task might do something to re-enable
the dyntick.  That's why we'd like to recheck that dyntick is off in
the loop after each potential call to schedule().

>> 1. We check that the dyntick is in fact stopped, since it's possible
>> that a race condition led to it being somehow restarted by an interrupt.
>> If it is not stopped, we go around the loop again so we can go back in
>> to the completion discussed above and wait some more.  This may merit
>> a WARN_ON or other notice since it seems like people aren't convinced
>> there are things that could restart it, but honestly the dyntick stuff
>> is complex enough that I think a belt-and-suspenders kind of test here
>> at the last minute is just defensive programming.
> Seems reasonable.  But maybe this could go after the loop and, if the
> dyntick is back, it could be treated like any other kernel bug that
> interrupts an isolated task?  That would preserve more of the existing
> code structure.

Well, we can certainly try it that way.  If I move it out and my testing
doesn't trigger the bug, that's at least an initial sign that it might be
OK.  But I worry/suspect that it will trip up at some point in some use
case and we'll have to fix it at that point.

> If that works, it could go in user_enter().

Presumably with trace_user_enter() and vtime_user_enter()
in __context_tracking_enter()?

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

  reply	other threads:[~2016-09-12 20:59 UTC|newest]

Thread overview: 80+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-08-16 21:19 [PATCH v15 00/13] support "task_isolation" mode Chris Metcalf
2016-08-16 21:19 ` [PATCH v15 01/13] vmstat: add quiet_vmstat_sync function Chris Metcalf
2016-08-16 21:19 ` [PATCH v15 02/13] vmstat: add vmstat_idle function Chris Metcalf
2016-08-16 21:19 ` [PATCH v15 03/13] lru_add_drain_all: factor out lru_add_drain_needed Chris Metcalf
2016-08-16 21:19 ` [PATCH v15 04/13] task_isolation: add initial support Chris Metcalf
2016-08-29 16:33   ` Peter Zijlstra
2016-08-29 16:40     ` Chris Metcalf
2016-08-29 16:48       ` Peter Zijlstra
2016-08-29 16:53         ` Chris Metcalf
2016-08-30  7:59           ` Peter Zijlstra
2016-08-30  7:58       ` Peter Zijlstra
2016-08-30 15:32         ` Chris Metcalf
2016-08-30 16:30           ` Andy Lutomirski
2016-08-30 17:02             ` Chris Metcalf
2016-08-30 18:43               ` Andy Lutomirski
2016-08-30 19:37                 ` Chris Metcalf
2016-08-30 19:50                   ` Andy Lutomirski
2016-09-02 14:04                     ` Chris Metcalf
2016-09-02 17:28                       ` Andy Lutomirski
2016-09-09 17:40                         ` Chris Metcalf
2016-09-12 17:41                           ` Andy Lutomirski
2016-09-12 19:25                             ` Chris Metcalf [this message]
2016-09-27 14:22                         ` Frederic Weisbecker
2016-09-27 14:39                           ` Peter Zijlstra
2016-09-27 14:51                             ` Frederic Weisbecker
2016-09-27 14:48                           ` Paul E. McKenney
2016-09-30 16:59                 ` Chris Metcalf
2016-09-01 10:06           ` Peter Zijlstra
2016-09-02 14:03             ` Chris Metcalf
2016-09-02 16:40               ` Peter Zijlstra
2017-02-02 16:13   ` Eugene Syromiatnikov
2017-02-02 18:12     ` Chris Metcalf
2016-08-16 21:19 ` [PATCH v15 05/13] task_isolation: track asynchronous interrupts Chris Metcalf
2016-08-16 21:19 ` [PATCH v15 06/13] arch/x86: enable task isolation functionality Chris Metcalf
2016-08-30 21:46   ` Andy Lutomirski
2016-08-16 21:19 ` [PATCH v15 07/13] arm64: factor work_pending state machine to C Chris Metcalf
2016-08-17  8:05   ` Will Deacon
2016-08-16 21:19 ` [PATCH v15 08/13] arch/arm64: enable task isolation functionality Chris Metcalf
2016-08-26 16:25   ` Catalin Marinas
2016-08-16 21:19 ` [PATCH v15 09/13] arch/tile: " Chris Metcalf
2016-08-16 21:19 ` [PATCH v15 10/13] arm, tile: turn off timer tick for oneshot_stopped state Chris Metcalf
2016-08-16 21:19 ` [PATCH v15 11/13] task_isolation: support CONFIG_TASK_ISOLATION_ALL Chris Metcalf
2016-08-16 21:19 ` [PATCH v15 12/13] task_isolation: add user-settable notification signal Chris Metcalf
2016-08-16 21:19 ` [PATCH v15 13/13] task_isolation self test Chris Metcalf
2016-08-17 19:37 ` [PATCH] Fix /proc/stat freezes (was [PATCH v15] "task_isolation" mode) Christoph Lameter
2016-08-20  1:42   ` Chris Metcalf
2016-09-28 13:16   ` Frederic Weisbecker
2016-08-29 16:27 ` Ping: [PATCH v15 00/13] support "task_isolation" mode Chris Metcalf
2016-09-07 21:11   ` Francis Giraldeau
2016-09-07 21:39     ` Francis Giraldeau
2016-09-08 16:21     ` Francis Giraldeau
2016-09-12 16:01     ` Chris Metcalf
2016-09-12 16:14       ` Peter Zijlstra
2016-09-12 21:15         ` Rafael J. Wysocki
2016-09-13  0:05           ` Rafael J. Wysocki
2016-09-13 16:00             ` Francis Giraldeau
2016-09-13  0:20       ` Francis Giraldeau
2016-09-13 16:12         ` Chris Metcalf
2016-09-27 14:49         ` Frederic Weisbecker
2016-09-27 14:35   ` Frederic Weisbecker
2016-09-30 17:07     ` Chris Metcalf
2016-11-05  4:04 ` task isolation discussion at Linux Plumbers Chris Metcalf
2016-11-05 16:05   ` Christoph Lameter
2016-11-07 16:55   ` Thomas Gleixner
2016-11-07 18:36     ` Thomas Gleixner
2016-11-07 19:12       ` Rik van Riel
2016-11-07 19:16         ` Will Deacon
2016-11-07 19:18           ` Rik van Riel
2016-11-11 20:54     ` Luiz Capitulino
2016-11-09  1:40   ` Paul E. McKenney
2016-11-09 11:14     ` Andy Lutomirski
2016-11-09 17:38       ` Paul E. McKenney
2016-11-09 18:57         ` Will Deacon
2016-11-09 19:11           ` Paul E. McKenney
2016-11-10  1:44         ` Andy Lutomirski
2016-11-10  4:52           ` Paul E. McKenney
2016-11-10  5:10             ` Paul E. McKenney
2016-11-11 17:00             ` Andy Lutomirski
2016-11-09 11:07   ` Frederic Weisbecker
2016-12-19 14:37   ` Paul E. McKenney

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=84b47c50-6f1b-dcdb-c90e-471d1b4adbac@mellanox.com \
    --to=cmetcalf@mellanox.com \
    --cc=akpm@linux-foundation.org \
    --cc=catalin.marinas@arm.com \
    --cc=cl@linux.com \
    --cc=fweisbec@gmail.com \
    --cc=giladb@mellanox.com \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=luto@amacapital.net \
    --cc=mhocko@suse.com \
    --cc=mingo@kernel.org \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=peterz@infradead.org \
    --cc=riel@redhat.com \
    --cc=rostedt@goodmis.org \
    --cc=tglx@linutronix.de \
    --cc=tj@kernel.org \
    --cc=viresh.kumar@linaro.org \
    --cc=will.deacon@arm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).