From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [PATCH v9 04/13] task_isolation: add initial support
From: Chris Metcalf
To: Frederic Weisbecker
Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
 Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
 "Paul E. McKenney", Christoph Lameter, Viresh Kumar, Catalin Marinas,
 Will Deacon, Andy Lutomirski, linux-doc@vger.kernel.org,
 linux-api@vger.kernel.org, linux-kernel@vger.kernel.org
Date: Wed, 9 Mar 2016 14:39:28 -0500
Message-ID: <56E07BF0.3060509@mellanox.com>
In-Reply-To: <20160304125603.GA23906@lerouge>
References: <1451936091-29247-1-git-send-email-cmetcalf@ezchip.com>
 <1451936091-29247-5-git-send-email-cmetcalf@ezchip.com>
 <20160119154214.GA2722@lerouge> <569EA050.5030106@ezchip.com>
 <20160128002801.GA14313@lerouge> <56ABACDD.5090500@ezchip.com>
 <20160130211125.GB7856@lerouge> <56BCDFE9.10200@ezchip.com>
 <20160304125603.GA23906@lerouge>
MIME-Version: 1.0
Content-Type: text/plain; charset="windows-1252"; format=flowed

Frederic,

Thanks for the detailed feedback on the task isolation stuff. This reply
kind of turned into an essay, so I've added a little "TL;DR" sentence
before each section.

TL;DR: Let's make an explicit decision about whether task isolation
should be "persistent" or "one-shot". Both have some advantages.
=====

An important high-level issue is how "sticky" task isolation mode is.
We need to choose one of these two options:

"Persistent mode": A task switches state to "task isolation" mode (kind
of a level-triggered analogy) and stays there indefinitely. It can make
a syscall, take a page fault, etc., if it wants to, but the kernel
protects it from incurring any further asynchronous interrupts. This is
the model I've been advocating for.

"One-shot mode": A task requests isolation via prctl(), the kernel
ensures it is isolated on return from the prctl(), but then as soon as
it enters the kernel again, task isolation is switched off until
another prctl() is issued. This is what you recommended in your last
email.

There are a number of pros and cons to the two models. I think on
balance I still like the "persistent mode" approach, but here are all
the pros and cons I can think of:

PRO for persistent mode: A somewhat easier programming model. Users can
just imagine "task isolation" as a way for them to still be able to use
the kernel exactly as they always have; it's just slower to get back
out of the kernel, so you use it judiciously. For example, a process is
free to call write() on a socket to perform a diagnostic, but when
returning from the write() syscall, the kernel will hold the task in
kernel mode until any timer ticks (perhaps from networking stuff) are
complete, and then let it return to userspace to continue in task
isolation mode. This is convenient to the user since they don't have to
fret about re-enabling task isolation after that syscall, page fault,
or whatever; they can just continue running. With your suggestion, the
user pretty much has to leave STRICT mode enabled so they get notified
of any unexpected return to kernel space (in fact we might make it
required, so you always get a signal when leaving task isolation unless
it's via a prctl() or exit syscall).

PRO for one-shot mode: A somewhat crisper interaction with
sched_setaffinity() etc. With a persistent mode approach, a task can
start up task isolation, then later another task can be placed on its
cpu and break it (it won't return to userspace until killed, or until
the new process affinitizes itself away or stops running). By contrast,
in one-shot mode, any return to kernel space turns off task isolation
anyway, so it's very clear what the interaction looks like. I suspect
this is more a theoretical advantage to one-shot mode than a practical
one, though.

CON for one-shot mode: It's actually hard to catch every kernel entry
so we can turn the task-isolation flag off again - and we really do
need to have a flag, just so that we can suitably debug any bad actions
that bring us into the kernel when we're not expecting it. Right now
there are things that bring us into the kernel that we don't bother
annotating for task isolation STRICT mode, just because they're visible
to the user anyway: e.g., a bus fault or segmentation violation.

I think we can actually make both modes available to users with just
another flag bit, so maybe we can look at what that looks like in v11:
adding a PR_TASK_ISOLATION_ONESHOT flag would turn off task isolation
at the next syscall entry, page fault, etc. Then we can think more
specifically about whether we want to remove the flag or not, and if we
remove it, whether we want to make the code that was controlled by it
unconditionally true or unconditionally false (i.e. remove it again).
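To make the two modes concrete, here is a rough userspace sketch.
PR_SET_TASK_ISOLATION and PR_TASK_ISOLATION_ENABLE are the names from
this patch series (so you'd need the patched headers); the ONESHOT flag
is only the hypothetical bit proposed above:

    #include <sys/prctl.h>

    static void isolate_persistent(void)
    {
            /* Persistent mode: enable once; isolation survives later
             * syscalls, page faults, etc. */
            prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE,
                  0, 0, 0);
    }

    static void isolate_oneshot(void)
    {
            /* Hypothetical one-shot mode: isolation lapses at the
             * next kernel entry, so the task must re-issue the
             * prctl() after, e.g., a diagnostic write(). */
            prctl(PR_SET_TASK_ISOLATION,
                  PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_ONESHOT,
                  0, 0, 0);
    }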
TL;DR: We should be more willing to return -EINVAL from prctl().
=====

One thing you've argued is that we should be more aggressive about
failing the prctl() call. I think, in any case, that this is probably
reasonable. We already check that the task's affinity is limited to the
current core and that that core is a task_isolation cpu; I think we can
also require that can_stop_full_tick() return true (or the moral
equivalent given your recent patch series). This will mean you can't
even try to go into task isolation mode if another task is schedulable,
among other things, which seems like a good thing.

However, it is important to note that the current task_isolation_ready()
and task_isolation_enter() calls in the prepare_exit_to_usermode()
routine are still required even with your proposed one-shot mode. We
have to be sure that no interrupts occur on the way back to userspace
that might then in principle lead to timer interrupts being scheduled,
and the way to do that is to make sure task_isolation_ready() returns
true with interrupts disabled, and that interrupts are not then
re-enabled before the return to userspace. Anything else is just
keeping your fingers crossed and guessing.

TL;DR: Returning -EBUSY from prctl() isn't really that helpful.
=====

Frederic wonders if we can test for various things not being ready
(dynticks not off yet, etc.) and just return -EBUSY and let userspace
do the spinning.

First, note that this is only possible for one-shot mode. For
persistent mode, we have the potential to run up against this on return
from any syscall, and we obviously can't add new error returns to other
syscalls. So it doesn't really make sense to add -EBUSY semantics to
prctl() if nothing else can use it.

But even in one-shot mode, I'm not really sure what the advantage is
here. We still need to do something like task_isolation_ready() in the
prepare_exit_to_usermode() loop, since that's where we have interrupts
disabled and can do a final assessment of the state of the kernel for
this core. So, while you could imagine having that code just hook in
and call syscall_set_return_value() there instead of causing things to
loop back, that doesn't really save us much complexity in the kernel,
and instead pushes complexity back to userspace, which may well handle
it just by busywaiting on the prctl() anyway. You might argue that if
we just return to userspace, userspace can sleep briefly and retry,
thus avoiding spinning in the scheduler. But it's relatively easy to do
that (or better) in the kernel, so I'm not sure that's more than a
straw man. See the next point.
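Just to picture what we'd be pushing out to userspace: under the
hypothetical -EBUSY semantics, the userspace fallback would presumably
collapse to a retry loop like this (same patch-series prctl() constants
as above; the -EBUSY return is only the proposal under discussion):

    #include <errno.h>
    #include <time.h>
    #include <sys/prctl.h>

    static void isolate_with_retry(void)
    {
            /* Nap briefly and retry while the kernel says this core
             * is not yet quiesced (hypothetical -EBUSY return). */
            struct timespec ts = { .tv_sec = 0, .tv_nsec = 100 * 1000 };

            while (prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE,
                         0, 0, 0) < 0 && errno == EBUSY)
                    nanosleep(&ts, NULL);
    }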
TL;DR: Should we arrange to actually use a completion in
task_isolation_enter() when dynticks are ticking, and call complete()
in tick-sched.c when we shut down dynticks, or just spin in schedule()
and not worry about burning a little cpu?
=====

One question that keeps getting asked is how useful it is to just call
schedule() while we're waiting for dynticks to shut off, since it could
just be a busy spin going into schedule() over and over. Even if
another task is ready to run, we might not switch to it right away.

So one thing we could think about is arranging so that whenever we turn
off dynticks, we also notify any tasks that were waiting for it to be
turned off; that way we can just sleep in task_isolation_enter() and
wait to be notified, thus guaranteeing that any other task that wants
to run can run, or even just waiting in cpu idle for a little while.
Does this seem like it's worth coding up?

My impression has always been that we wait pretty briefly for dynticks
to shut down, so it doesn't really matter if we spin - and even if we
do spin, in principle we already arranged for this cpu to be dedicated
to this task anyway, so it doesn't really do anything bad except maybe
burn a little bit of extra cpu power. But I'm willing to be
convinced...
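For concreteness, the completion-based variant I have in mind would be
something along these lines. This is purely a sketch, not part of the
patch series; boot-time init of the completions and the obvious races
between the check and the wait are glossed over:

    #include <linux/completion.h>
    #include <linux/percpu.h>
    #include <linux/tick.h>

    /* One completion per cpu, signaled when that cpu's tick stops. */
    static DEFINE_PER_CPU(struct completion, dynticks_stopped);

    /* In task_isolation_enter(), instead of looping into schedule(): */
    if (!tick_nohz_tick_stopped())
            wait_for_completion(this_cpu_ptr(&dynticks_stopped));

    /* In tick-sched.c, at the point where the tick finally stops: */
    complete_all(this_cpu_ptr(&dynticks_stopped));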
TL;DR: We should turn off task isolation mode for signals.
=====

One thing that occurs to me is that we should arrange so that any
signal delivery turns off task isolation mode. This is easily
documented semantics even in persistent mode, and it allows the
userspace program to run and discover that something bad has happened,
rather than potentially hanging in the kernel trying to wait for
isolation to be possible before calling the signal handler. I'll make
this change for v11 in any case.

Also, doing this is something of a requirement for the proposed
one-shot mode, since if we have STRICT mode enabled, then any entry
into the kernel is either a syscall, or else ends up causing a signal,
and by hooking the signal mechanism we have a place to catch all the
non-syscall entrypoints, more or less.

TL;DR: Maybe we should use seccomp for STRICT mode syscall detection.
=====

This is being investigated in a separate email thread with Andy
Lutomirski. Whether it gets included in v11 is still TBD.

TL;DR: Various minor issues in answer to Frederic's comments :-)
=====

On 03/04/2016 07:56 AM, Frederic Weisbecker wrote:
> On Thu, Feb 11, 2016 at 02:24:25PM -0500, Chris Metcalf wrote:
>> We don't want to wait for preemption points or interrupts, and there are
>> no other voluntary reschedules in the prepare_exit_to_usermode() loop.
>>
>> If the other task had been woken up for some completion, then yes we would
>> already have had TIF_RESCHED set, but if the other runnable task was (for
>> example) pre-empted on a timer tick, we wouldn't have TIF_RESCHED set at
>> this point, and thus we might need to call schedule() explicitly.
>
> There can't be another task in the runqueue waiting to be preempted since
> we (the current task) are running on the CPU.

My earlier sentence may not have been clear. By saying "if the other
runnable task was pre-empted on a timer tick", I meant that TIF_RESCHED
wasn't set on our task, and we'd only eventually schedule to that other
task once a timer interrupt fired and ended our scheduler slice. I know
you can't have a different task in the runqueue waiting to be
preempted, since that doesn't make sense :-)

> Besides, if we aren't alone in the runqueue, this breaks the task isolation
> mode.

Indeed. We can and will do better catching that at prctl() time. So the
question is, if we adopt the "persistent mode", how do we handle this
case on some other return from kernel space?

>>>> By invoking the scheduler here, we allow any tasks that are ready to
>>>> run to run immediately, rather than waiting for an interrupt to wake
>>>> the scheduler.
>>>
>>> Well, in this case here we are interested in the current CPU. And if a
>>> task got awoken and waits for the current CPU, it will have an
>>> opportunity to get scheduled on syscall exit.
>>
>> That's true if TIF_RESCHED was set because a completion occurred that
>> the other task was waiting for. But there might not be any such completion
>> and the task just got preempted earlier and is still ready to run.
>
> But if another task waits for the CPU, this breaks task isolation mode.
> Now assuming we want a pending task to resume such that we get the CPU
> for ourselves, we have no idea if the scheduler is going to schedule
> that task; it depends on vruntime and other things. TIF_RESCHED only
> makes us enter the scheduler; it doesn't guarantee any context switch.

Yes, true. So we have to decide if we feel spinning into the scheduler
is so harmful that we should set up a new completion driven by entering
dynticks fullmode, and handle it that way instead.

>> My point is that setting TIF_RESCHED is never harmful, and there are
>> cases like involuntary preemption where it might help.
>
> Sure, but we don't write code just because it doesn't harm. Strange code
> hurts the brain of reviewers.

Fair enough, and it certainly means that, at a minimum, we need a good
comment there!

> Now concerning involuntary preemption, it's a matter of a millisecond;
> userspace needs to wait a few milliseconds before retrying anyway.
> Sleeping at that point is what can be useful, as we leave the CPU for
> the resuming task.
>
> Also if we have any task on the runqueue anyway, whether we hope that it
> resumes quickly or not, it's a very bad sign for a task isolation
> session. Either we did not affine tasks correctly or there is a kernel
> thread that might run again at some time ahead.

Note that it might also be a one-time kernel task or kworker that is
queued by some random syscall in "persistent mode" and we need to let
it run until it quiesces again. Then we can context switch back to our
task isolation task, and safely return from it to userspace.

>> (2) What about times when we are leaving the kernel after already
>> doing the prctl()? For example a core doing packet forwarding might
>> want to report some error condition up to the kernel, and remove itself
>> from the set of cores handling packets, then do some syscall(s) to
>> generate logging data, and then go back and continue handling packets.
>> Or, the process might have created some large anonymous mapping where
>> every now and then it needs to cross a page boundary for some structure
>> and touch a new page, and it knows to expect a page fault in that case.
>> In those cases we are returning from the kernel, not at prctl() time,
>> and we still want to enforce the semantics that no further interrupts
>> will occur to disturb the task. These kinds of use cases are why we
>> have as general-purpose a mechanism as we do for task isolation.
>
> If any interrupt or any kind of disturbance happens, we should leave that
> task isolation mode and warn the isolated task about that. SIGTERM?

That's the goal of STRICT mode. By default it uses SIGTERM. You can
also choose a different signal via the prctl() API.
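For completeness, choosing the signal from userspace looks roughly like
this. The STRICT and SET_SIG names follow this patch series as I recall
them - treat them as illustrative and check the series' prctl.h:

    #include <signal.h>
    #include <sys/prctl.h>

    static void on_violation(int sig)
    {
            /* We were disturbed: log it, re-isolate, or exit. */
    }

    static void isolate_strict(void)
    {
            signal(SIGUSR1, on_violation);
            prctl(PR_SET_TASK_ISOLATION,
                  PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT |
                  PR_TASK_ISOLATION_SET_SIG(SIGUSR1), 0, 0, 0);
    }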
Thanks again, Frederic!

I'll work to put together a new version of the patch incorporating a
selectable one-shot mode, plus the other things mentioned in this
reply. I think I will still not add the suggested "dynticks full
enabled completion" thing for now, and just add a big comment on the
code that makes us call schedule(), unless folks really agree it's a
necessary thing to have there.

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com