From mboxrd@z Thu Jan 1 00:00:00 1970
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1754736AbcH3UMT (ORCPT ); Tue, 30 Aug 2016 16:12:19 -0400
Received: from mail-db5eur01on0051.outbound.protection.outlook.com
	([104.47.2.51]:10304 "EHLO EUR01-DB5-obe.outbound.protection.outlook.com"
	rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP
	id S1751702AbcH3UMP (ORCPT ); Tue, 30 Aug 2016 16:12:15 -0400
Authentication-Results: spf=none (sender IP is )
	smtp.mailfrom=cmetcalf@mellanox.com;
Subject: Re: [PATCH v15 04/13] task_isolation: add initial support
To: Andy Lutomirski
References: <1471382376-5443-1-git-send-email-cmetcalf@mellanox.com>
	<1471382376-5443-5-git-send-email-cmetcalf@mellanox.com>
	<20160829163352.GV10153@twins.programming.kicks-ass.net>
	<20160830075854.GZ10153@twins.programming.kicks-ass.net>
CC: "linux-doc@vger.kernel.org", Thomas Gleixner, Christoph Lameter,
	Michal Hocko, Gilad Ben Yossef, Andrew Morton, Linux API,
	"Viresh Kumar", Ingo Molnar, "Steven Rostedt", Tejun Heo,
	Will Deacon, Rik van Riel, Frederic Weisbecker, "Paul E. McKenney",
	"linux-mm@kvack.org", "linux-kernel@vger.kernel.org",
	Catalin Marinas, Peter Zijlstra
From: Chris Metcalf
Date: Tue, 30 Aug 2016 15:37:02 -0400
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101
	Thunderbird/45.2.0
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"; format=flowed
Content-Transfer-Encoding: 7bit
X-OriginatorOrg: Mellanox.com
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

On 8/30/2016 2:43 PM, Andy Lutomirski wrote:
> On Aug 30, 2016 10:02 AM, "Chris Metcalf" wrote:
>> On 8/30/2016 12:30 PM, Andy Lutomirski wrote:
>>> On Tue, Aug 30, 2016 at 8:32 AM, Chris Metcalf wrote:
>>>> The basic idea is just that we don't want to be at risk from the
>>>> dyntick getting enabled. Similarly, we don't want to be at risk of a
>>>> later global IPI due to lru_add_drain stuff, for example. And, we may
>>>> want to add additional stuff, like catching kernel TLB flushes and
>>>> deferring them when a remote core is in userspace. To do all of this
>>>> kind of stuff, we need to run in the return to user path so we are
>>>> late enough to guarantee no further kernel things will happen to
>>>> perturb our carefully-arranged isolation state that includes dyntick
>>>> off, per-cpu lru cache empty, etc etc.
>>> None of the above should need to *loop*, though, AFAIK.
>> Ordering is a problem, though.
>>
>> We really want to run task isolation last, so we can guarantee that
>> all the isolation prerequisites are met (dynticks stopped, per-cpu lru
>> cache empty, etc). But achieving that state can require enabling
>> interrupts - most obviously if we have to schedule, e.g. for vmstat
>> clearing or whatnot (see the cond_resched in refresh_cpu_vm_stats), or
>> just while waiting for that last dyntick interrupt to occur. I'm also
>> not sure that even something as simple as draining the per-cpu lru
>> cache can be done holding interrupts disabled throughout - certainly
>> there's a !SMP code path there that just re-enables interrupts
>> unconditionally, which gives me pause.
>>
>> At any rate at that point you need to retest for signals, resched,
>> etc, all as usual, and then you need to recheck the task isolation
>> prerequisites once more.
>>
>> I may be missing something here, but it's really not obvious to me
>> that there's a way to do this without having task isolation integrated
>> into the usual return-to-userspace loop.
>>
> What if we did it the other way around: set a percpu flag saying
> "going quiescent; disallow new deferred work", then finish all
> existing work and return to userspace. Then, on the next entry, clear
> that flag. With the flag set, vmstat would just flush anything that
> it accumulates immediately, nothing would be added to the LRU list,
> etc.

This is an interesting idea!

However, there are a number of implementation details that make me
worry that it might be a trickier approach overall.

First, "on the next entry" hides a world of hurt in four simple words.
Some platforms (arm64 and tile are the ones I'm familiar with) have a
common chunk of code that always runs on every entry to the kernel. It
would not be too hard to poke at the assembly and make those platforms
always run some task-isolation specific code on entry. But x86 scares
me - there seem to be a whole lot of ways to get into the kernel, and
I'm not convinced there is a lot of shared macrology or whatever that
would make it straightforward to intercept all of them.

Then, there are the two actual subsystems in question. It looks like
we could intercept LRU reasonably cleanly by hooking pagevec_add() to
return zero when we are in this "going quiescent" mode, and that would
keep the per-cpu vectors empty. The vmstat stuff is a little trickier
since all the existing code is built around updating the per-cpu stuff
and then only later copying it off to the global state. I suppose we
could add a test-and-flush at the end of every public API and not
worry about the implementation cost.

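As a rough illustration of the "stay clean" variant, here is a small
user-space C sketch. It is not kernel code. Every name in it
(going_quiescent, toy_pagevec_add, toy_mod_vm_stat, and so on) is
hypothetical, and the batching is only loosely modelled on the idea
that an "add" returning zero tells the caller to drain right away.
The real pagevec and vmstat code paths are more involved than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PAGEVEC_SIZE 14	/* illustrative batch size */

/* Toy stand-in for a per-CPU page vector; pages are just ints here. */
struct toy_pagevec {
	size_t nr;
	int pages[TOY_PAGEVEC_SIZE];
};

/* Hypothetical per-CPU flag: "going quiescent; disallow new deferred work". */
static bool going_quiescent;

static struct toy_pagevec toy_cpu_pvec;	/* per-CPU batch (one CPU modelled) */
static long toy_global_lru;		/* pages that reached the global list */

/* Add a page; return remaining space, where 0 means "drain now". */
static size_t toy_pagevec_add(struct toy_pagevec *pvec, int page)
{
	assert(pvec->nr < TOY_PAGEVEC_SIZE);
	pvec->pages[pvec->nr++] = page;
	if (going_quiescent)
		return 0;	/* pretend the vector is full: no batching */
	return TOY_PAGEVEC_SIZE - pvec->nr;
}

/* Move everything batched on this CPU to the global state. */
static void toy_pagevec_drain(struct toy_pagevec *pvec)
{
	toy_global_lru += (long)pvec->nr;
	pvec->nr = 0;
}

/* Caller side: drain whenever the add reports no space left. */
static void toy_lru_cache_add(int page)
{
	if (!toy_pagevec_add(&toy_cpu_pvec, page))
		toy_pagevec_drain(&toy_cpu_pvec);
}

/* vmstat-like counter: per-CPU delta folded into a global value later. */
static long toy_vm_stat_global;
static long toy_vm_stat_cpu_diff;

static void toy_vm_stat_flush(void)
{
	toy_vm_stat_global += toy_vm_stat_cpu_diff;
	toy_vm_stat_cpu_diff = 0;
}

/* Public API with a test-and-flush at the end. */
static void toy_mod_vm_stat(long delta)
{
	toy_vm_stat_cpu_diff += delta;
	if (going_quiescent)
		toy_vm_stat_flush();
}
```

With going_quiescent clear, toy_lru_cache_add() leaves pages batched
in toy_cpu_pvec. Once the flag is set, each add drains at once, and
each toy_mod_vm_stat() call folds its delta into the global counter
before returning. The sketch also shows the cost: every public entry
point of each subsystem needs its own check of the flag.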
But it does seem like we are adding noticeable maintenance cost on
the mainline kernel to support task isolation by doing this. My guess
is that it is easier to support the kind of "are you clean?" / "get
clean" APIs for subsystems, rather than weaving a whole set of "stay
clean" mechanisms into each subsystem.

So to pop up a level, what is your actual concern about the existing
"do it in a loop" model? The macrology currently in use means there
is zero cost if you don't configure TASK_ISOLATION, and the software
maintenance cost seems low since the idioms used for task isolation
in the loop are generally familiar to people reading that code.

> Also, this cond_resched stuff doesn't worry me too much at a
> fundamental level -- if we're really going quiescent, shouldn't we be
> able to arrange that there are no other schedulable tasks on the CPU
> in question?

We aren't currently planning to enforce things in the scheduler, so if
the application affinitizes another task on top of an existing task
isolation task, by default the task isolation task just dies. (Unless
it's using NOSIG mode, in which case it just ends up stuck in the
kernel trying to wait out the dyntick until you either kill it, or
re-affinitize the offending task.) But I'm reluctant to try to guard
against every possible way that you might (perhaps briefly) have some
schedulable task, and the current approach seems pretty robust if that
sort of thing happens.

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com
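
For comparison, the "are you clean?" / "get clean" loop model can be
sketched in user-space C as well. This is a toy under stated
assumptions, not the patch series' code. The struct, the hook names
and the two fake subsystems are all hypothetical. The toy also
assumes that draining the LRU dirties the vmstat counters once,
purely to force a second pass and show why the checks are repeated.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-subsystem hooks: "are you clean?" and "get clean". */
struct toy_subsys {
	const char *name;
	bool (*is_clean)(void);
	void (*get_clean)(void);
};

static int toy_lru_pending = 3;		/* pages still batched per-CPU */
static int toy_vmstat_pending = 2;	/* counter deltas not yet folded */

static bool toy_vmstat_is_clean(void) { return toy_vmstat_pending == 0; }
static void toy_vmstat_get_clean(void) { toy_vmstat_pending = 0; }

static bool toy_lru_is_clean(void) { return toy_lru_pending == 0; }
static void toy_lru_get_clean(void)
{
	toy_lru_pending = 0;
	toy_vmstat_pending += 1;	/* toy assumption: draining dirties vmstat */
}

static const struct toy_subsys toy_subsystems[] = {
	{ "vmstat", toy_vmstat_is_clean, toy_vmstat_get_clean },
	{ "lru",    toy_lru_is_clean,    toy_lru_get_clean },
};

/*
 * Repeat until one full pass finds every subsystem clean. In the
 * real return-to-user path this is also where signals and resched
 * would be retested; that part is omitted here.
 * Returns the number of passes taken.
 */
static int toy_prepare_return_to_user(const struct toy_subsys *subs, int n)
{
	int passes = 0;
	bool dirty;

	do {
		assert(passes < 1000);	/* guard against a toy that never settles */
		dirty = false;
		for (int i = 0; i < n; i++) {
			if (!subs[i].is_clean()) {
				subs[i].get_clean();	/* may perturb other state */
				dirty = true;
			}
		}
		passes++;
	} while (dirty);

	return passes;
}
```

Calling toy_prepare_return_to_user(toy_subsystems, 2) keeps going
until a whole pass does no cleaning. Each subsystem only has to
expose the two hooks, and the loop does the rechecking.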