From: Chris Metcalf <cmetcalf@mellanox.com>
To: Andy Lutomirski <luto@amacapital.net>
Cc: "linux-doc@vger.kernel.org" <linux-doc@vger.kernel.org>,
Thomas Gleixner <tglx@linutronix.de>,
Christoph Lameter <cl@linux.com>, Michal Hocko <mhocko@suse.com>,
Gilad Ben Yossef <giladb@mellanox.com>,
Andrew Morton <akpm@linux-foundation.org>,
Linux API <linux-api@vger.kernel.org>,
"Viresh Kumar" <viresh.kumar@linaro.org>,
Ingo Molnar <mingo@kernel.org>,
"Steven Rostedt" <rostedt@goodmis.org>, Tejun Heo <tj@kernel.org>,
Will Deacon <will.deacon@arm.com>, Rik van Riel <riel@redhat.com>,
Frederic Weisbecker <fweisbec@gmail.com>,
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
Catalin Marinas <catalin.marinas@arm.com>,
Peter Zijlstra <peterz@infradead.org>
Subject: Re: [PATCH v15 04/13] task_isolation: add initial support
Date: Mon, 12 Sep 2016 15:25:04 -0400 [thread overview]
Message-ID: <84b47c50-6f1b-dcdb-c90e-471d1b4adbac@mellanox.com> (raw)
In-Reply-To: <CALCETrV7n5Qs0Kkb-cwC3JWsngfR8hd+JrW87OvPCgRHmCdK8A@mail.gmail.com>
On 9/12/2016 1:41 PM, Andy Lutomirski wrote:
> On Sep 9, 2016 1:40 PM, "Chris Metcalf" <cmetcalf@mellanox.com> wrote:
>> On 9/2/2016 1:28 PM, Andy Lutomirski wrote:
>>> On Sep 2, 2016 7:04 AM, "Chris Metcalf" <cmetcalf@mellanox.com> wrote:
>>>> On 8/30/2016 3:50 PM, Andy Lutomirski wrote:
>>>>> On Tue, Aug 30, 2016 at 12:37 PM, Chris Metcalf <cmetcalf@mellanox.com> wrote:
>>>>>> So to pop up a level, what is your actual concern about the existing
>>>>>> "do it in a loop" model? The macrology currently in use means there
>>>>>> is zero cost if you don't configure TASK_ISOLATION, and the software
>>>>>> maintenance cost seems low since the idioms used for task isolation
>>>>>> in the loop are generally familiar to people reading that code.
>>>>> My concern is that it's not obvious to readers of the code that the
>>>>> loop ever terminates. It really ought to, but it's doing something
>>>>> very odd. Normally we can loop because we get scheduled out, but
>>>>> actually blocking in the return-to-userspace path, especially blocking
>>>>> on a condition that doesn't have a wakeup associated with it, is odd.
>>>>
>>>> True, although, comments :-)
>>>>
>>>> Regardless, though, this doesn't seem at all weird to me in the
>>>> context of the vmstat and lru stuff, though. It's exactly parallel to
>>>> the fact that we loop around on checking need_resched and signal, and
>>>> in some cases you could imagine multiple loops around when we schedule
>>>> out and get a signal, so loop around again, and then another
>>>> reschedule event happens during signal processing so we go around
>>>> again, etc. Eventually it settles down. It's the same with the
>>>> vmstat/lru stuff.
>>> Only kind of.
>>>
>>> When we say, effectively, while (need_resched()) schedule();, we're
>>> not waiting for an event or condition per se. We're runnable (in the
>>> sense that userspace wants to run and we're not blocked on anything)
>>> the entire time -- we're simply yielding to some other thread that is
>>> also runnable. So if that loop runs forever, it either means that
>>> we're at low priority and we genuinely shouldn't be running or that
>>> there's a scheduler bug.
>>>
>>> If, on the other hand, we say while (not quiesced) schedule(); (or
>>> equivalent), we're saying that we're *not* really ready to run and
>>> that we're waiting for some condition to change. The condition in
>>> question is fairly complicated and won't wake us when we are ready. I
>>> can also imagine the scheduler getting rather confused, since, as far
>>> as the scheduler knows, we are runnable and we are supposed to be
>>> running.
>>
>> So, how about a code structure like this?
>>
>> In the main return-to-userspace loop where we check TIF flags,
>> we keep the notion of our TIF_TASK_ISOLATION flag that causes
>> us to invoke a task_isolation_prepare() routine. This routine
>> does the following things:
>>
>> 1. As you suggested, set a new TIF bit (or equivalent) that says the
>> system should no longer create deferred work on this core, and then
>> flush any necessary already-deferred work (currently, the LRU cache
>> and the vmstat stuff). We never have to go flush the deferred work
>> again during this task's return to userspace. Note that this bit can
>> only be set on a core marked for task isolation, so it can't be used
>> for denial of service type attacks on normal cores that are trying to
>> multitask normal Linux processes.
> I think it can't be a TIF flag unless you can do the whole mess with
> preemption off because, if you get preempted, other tasks on the CPU
> won't see the flag. You could do it with a percpu flag, I think.
Yes, a percpu flag - you're right. I think it will make sense for this to
be a flag declared in linux/isolation.h which can be read by vmstat, LRU, etc.
>> 2. Check if the dyntick is stopped, and if not, wait on a completion
>> that will be set when it does stop. This means we may schedule out at
>> this point, but when we return, the deferred work stuff is still safe
>> since your bit is still set, and in principle the dyn tick is
>> stopped.
>>
>> Then, after we disable interrupts and re-read the thread-info flags,
>> we check to see if the TIF_TASK_ISOLATION flag is the ONLY flag still
>> set that would keep us in the loop. This will always end up happening
>> on each return to userspace, since the only thing that actually clears
>> the bit is a prctl() call. When that happens we know we are about to
>> return to userspace, so we call task_isolation_ready(), which now has
>> two things to do:
> Would it perhaps be more straightforward to do the stuff before the
> loop and not check TIF_TASK_ISOLATION in the loop?
We can certainly play around with just not looping in this case, but
in particular I can imagine an isolated task entering the kernel and
then doing something that requires scheduling a kernel task. We'd
clearly like that other task to run before the isolated task returns to
userspace. But then, that other task might do something to re-enable
the dyntick. That's why we'd like to recheck that dyntick is off in
the loop after each potential call to schedule().
>> 1. We check that the dyntick is in fact stopped, since it's possible
>> that a race condition led to it being somehow restarted by an interrupt.
>> If it is not stopped, we go around the loop again so we can go back in
>> to the completion discussed above and wait some more. This may merit
>> a WARN_ON or other notice since it seems like people aren't convinced
>> there are things that could restart it, but honestly the dyntick stuff
>> is complex enough that I think a belt-and-suspenders kind of test here
>> at the last minute is just defensive programming.
> Seems reasonable. But maybe this could go after the loop and, if the
> dyntick is back, it could be treated like any other kernel bug that
> interrupts an isolated task? That would preserve more of the existing
> code structure.
Well, we can certainly try it that way. If I move it out and my testing
doesn't trigger the bug, that's at least an initial sign that it might be
OK. But I worry/suspect that it will trip up at some point in some use
case and we'll have to fix it at that point.
> If that works, it could go in user_enter().
Presumably with trace_user_enter() and vtime_user_enter()
in __context_tracking_enter()?
--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com
next prev parent reply other threads:[~2016-09-12 20:59 UTC|newest]
Thread overview: 80+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-08-16 21:19 [PATCH v15 00/13] support "task_isolation" mode Chris Metcalf
2016-08-16 21:19 ` [PATCH v15 01/13] vmstat: add quiet_vmstat_sync function Chris Metcalf
2016-08-16 21:19 ` [PATCH v15 02/13] vmstat: add vmstat_idle function Chris Metcalf
2016-08-16 21:19 ` [PATCH v15 03/13] lru_add_drain_all: factor out lru_add_drain_needed Chris Metcalf
2016-08-16 21:19 ` [PATCH v15 04/13] task_isolation: add initial support Chris Metcalf
2016-08-29 16:33 ` Peter Zijlstra
2016-08-29 16:40 ` Chris Metcalf
2016-08-29 16:48 ` Peter Zijlstra
2016-08-29 16:53 ` Chris Metcalf
2016-08-30 7:59 ` Peter Zijlstra
2016-08-30 7:58 ` Peter Zijlstra
2016-08-30 15:32 ` Chris Metcalf
2016-08-30 16:30 ` Andy Lutomirski
2016-08-30 17:02 ` Chris Metcalf
2016-08-30 18:43 ` Andy Lutomirski
2016-08-30 19:37 ` Chris Metcalf
2016-08-30 19:50 ` Andy Lutomirski
2016-09-02 14:04 ` Chris Metcalf
2016-09-02 17:28 ` Andy Lutomirski
2016-09-09 17:40 ` Chris Metcalf
2016-09-12 17:41 ` Andy Lutomirski
2016-09-12 19:25 ` Chris Metcalf [this message]
2016-09-27 14:22 ` Frederic Weisbecker
2016-09-27 14:39 ` Peter Zijlstra
2016-09-27 14:51 ` Frederic Weisbecker
2016-09-27 14:48 ` Paul E. McKenney
2016-09-30 16:59 ` Chris Metcalf
2016-09-01 10:06 ` Peter Zijlstra
2016-09-02 14:03 ` Chris Metcalf
2016-09-02 16:40 ` Peter Zijlstra
2017-02-02 16:13 ` Eugene Syromiatnikov
2017-02-02 18:12 ` Chris Metcalf
2016-08-16 21:19 ` [PATCH v15 05/13] task_isolation: track asynchronous interrupts Chris Metcalf
2016-08-16 21:19 ` [PATCH v15 06/13] arch/x86: enable task isolation functionality Chris Metcalf
2016-08-30 21:46 ` Andy Lutomirski
2016-08-16 21:19 ` [PATCH v15 07/13] arm64: factor work_pending state machine to C Chris Metcalf
2016-08-17 8:05 ` Will Deacon
2016-08-16 21:19 ` [PATCH v15 08/13] arch/arm64: enable task isolation functionality Chris Metcalf
2016-08-26 16:25 ` Catalin Marinas
2016-08-16 21:19 ` [PATCH v15 09/13] arch/tile: " Chris Metcalf
2016-08-16 21:19 ` [PATCH v15 10/13] arm, tile: turn off timer tick for oneshot_stopped state Chris Metcalf
2016-08-16 21:19 ` [PATCH v15 11/13] task_isolation: support CONFIG_TASK_ISOLATION_ALL Chris Metcalf
2016-08-16 21:19 ` [PATCH v15 12/13] task_isolation: add user-settable notification signal Chris Metcalf
2016-08-16 21:19 ` [PATCH v15 13/13] task_isolation self test Chris Metcalf
2016-08-17 19:37 ` [PATCH] Fix /proc/stat freezes (was [PATCH v15] "task_isolation" mode) Christoph Lameter
2016-08-20 1:42 ` Chris Metcalf
2016-09-28 13:16 ` Frederic Weisbecker
2016-08-29 16:27 ` Ping: [PATCH v15 00/13] support "task_isolation" mode Chris Metcalf
2016-09-07 21:11 ` Francis Giraldeau
2016-09-07 21:39 ` Francis Giraldeau
2016-09-08 16:21 ` Francis Giraldeau
2016-09-12 16:01 ` Chris Metcalf
2016-09-12 16:14 ` Peter Zijlstra
2016-09-12 21:15 ` Rafael J. Wysocki
2016-09-13 0:05 ` Rafael J. Wysocki
2016-09-13 16:00 ` Francis Giraldeau
2016-09-13 0:20 ` Francis Giraldeau
2016-09-13 16:12 ` Chris Metcalf
2016-09-27 14:49 ` Frederic Weisbecker
2016-09-27 14:35 ` Frederic Weisbecker
2016-09-30 17:07 ` Chris Metcalf
2016-11-05 4:04 ` task isolation discussion at Linux Plumbers Chris Metcalf
2016-11-05 16:05 ` Christoph Lameter
2016-11-07 16:55 ` Thomas Gleixner
2016-11-07 18:36 ` Thomas Gleixner
2016-11-07 19:12 ` Rik van Riel
2016-11-07 19:16 ` Will Deacon
2016-11-07 19:18 ` Rik van Riel
2016-11-11 20:54 ` Luiz Capitulino
2016-11-09 1:40 ` Paul E. McKenney
2016-11-09 11:14 ` Andy Lutomirski
2016-11-09 17:38 ` Paul E. McKenney
2016-11-09 18:57 ` Will Deacon
2016-11-09 19:11 ` Paul E. McKenney
2016-11-10 1:44 ` Andy Lutomirski
2016-11-10 4:52 ` Paul E. McKenney
2016-11-10 5:10 ` Paul E. McKenney
2016-11-11 17:00 ` Andy Lutomirski
2016-11-09 11:07 ` Frederic Weisbecker
2016-12-19 14:37 ` Paul E. McKenney
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=84b47c50-6f1b-dcdb-c90e-471d1b4adbac@mellanox.com \
--to=cmetcalf@mellanox.com \
--cc=akpm@linux-foundation.org \
--cc=catalin.marinas@arm.com \
--cc=cl@linux.com \
--cc=fweisbec@gmail.com \
--cc=giladb@mellanox.com \
--cc=linux-api@vger.kernel.org \
--cc=linux-doc@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=luto@amacapital.net \
--cc=mhocko@suse.com \
--cc=mingo@kernel.org \
--cc=paulmck@linux.vnet.ibm.com \
--cc=peterz@infradead.org \
--cc=riel@redhat.com \
--cc=rostedt@goodmis.org \
--cc=tglx@linutronix.de \
--cc=tj@kernel.org \
--cc=viresh.kumar@linaro.org \
--cc=will.deacon@arm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).