From: Chris Metcalf
Date: Sat, 5 Nov 2016 00:04:45 -0400
Subject: task isolation discussion at Linux Plumbers
To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
    Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
    Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar,
    Catalin Marinas, Will Deacon, Andy Lutomirski, Daniel Lezcano,
    Francis Giraldeau, Andi Kleen, Arnd Bergmann
Message-ID: <1605b087-2b3b-77c1-01ac-084e378f5f28@mellanox.com>
In-Reply-To: <1471382376-5443-1-git-send-email-cmetcalf@mellanox.com>
References: <1471382376-5443-1-git-send-email-cmetcalf@mellanox.com>
X-Mailing-List: linux-kernel@vger.kernel.org

A bunch of people got together this week at the Linux Plumbers
Conference to discuss nohz_full, task isolation, and related stuff.
(Thanks to Thomas for getting everyone gathered at one place and time!)

Here are the notes I took; I welcome any corrections and follow-up.

== rcu_nocbs ==

We started out by discussing this option. It is automatically enabled
by nohz_full, but we spent a little while side-tracking on the
implementation of one kthread per rcu flavor per core.
The suggestion was made (by Peter or Andy; I forget) that each kthread
could handle all flavors per core by using a dedicated worklist. It
certainly seems like removing potentially dozens or hundreds of
kthreads from larger systems will be a win if this works out.

Paul said he would look into this possibility.

== Remote statistics ==

We discussed the possibility of remote statistics gathering, i.e. load
average etc. The idea would be that we could have housekeeping
core(s) periodically iterate over the nohz cores to load their rq
remotely and do update_current etc. Presumably it should be possible
for a single housekeeping core to handle doing this for all the
nohz_full cores, as we only need to do it quite infrequently.

Thomas suggested that this might be the last remaining thing that
needed to be done to allow disabling the current behavior of falling
back to a 1 Hz clock in nohz_full.

I believe Thomas said he had a patch to do this already.

== Remote LRU cache drain ==

One of the issues with task isolation currently is that the LRU cache
drain must be done prior to entering userspace, but it requires
interrupts enabled and thus can't be done atomically. My previous
patch series have handled this by checking with interrupts disabled,
but then looping around with interrupts enabled to try to drain the
LRU pagevecs. Experimentally this works, but it's not provable that
it terminates, which is worrisome. Andy suggested adding a percpu
flag to disable creation of deferred work like LRU cache pages.

Thomas suggested using an RT "local lock" to guard the LRU cache
flush; he is planning on bringing the concept to mainline in any case.
However, after some discussion we converged on simply using a spinlock
to guard the appropriate resources. As a result, the
lru_add_drain_all() code that currently queues work on each remote cpu
to drain it can instead simply acquire the lock and drain it remotely.
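
As a rough userspace sketch of the spinlock idea (this is not the
actual patch; the names sketch_*, struct cpu_pagevec, NR_CPUS and
PAGEVEC_SIZE are all made up for illustration, and the page batch is
reduced to a counter feeding a fake LRU size), each cpu's batch gets
its own lock, so any cpu can drain any other cpu's batch directly:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NR_CPUS      4   /* hypothetical cpu count for the sketch */
#define PAGEVEC_SIZE 14  /* hypothetical batch size */

/* Hypothetical stand-in for a per-cpu LRU pagevec. */
struct cpu_pagevec {
	atomic_flag lock;         /* stands in for a kernel spinlock */
	size_t nr;                /* pages currently batched */
	int pages[PAGEVEC_SIZE];  /* page ids; real code holds page pointers */
};

static struct cpu_pagevec pvecs[NR_CPUS];
static atomic_size_t lru_size;    /* pages moved to the (fake) LRU list */

static void sketch_lock(atomic_flag *l)
{
	while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
		;  /* spin until the holder releases the lock */
}

static void sketch_unlock(atomic_flag *l)
{
	atomic_flag_clear_explicit(l, memory_order_release);
}

static void sketch_init(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		atomic_flag_clear(&pvecs[cpu].lock);
		pvecs[cpu].nr = 0;
	}
}

/* Move a batch onto the fake LRU. Caller must hold pv->lock. */
static void sketch_drain_locked(struct cpu_pagevec *pv)
{
	atomic_fetch_add(&lru_size, pv->nr);
	pv->nr = 0;
}

/* Local path: batch one page on the owning cpu. */
static void sketch_lru_cache_add(int cpu, int page)
{
	assert(cpu >= 0 && cpu < NR_CPUS);

	struct cpu_pagevec *pv = &pvecs[cpu];

	sketch_lock(&pv->lock);
	pv->pages[pv->nr++] = page;
	if (pv->nr == PAGEVEC_SIZE)
		sketch_drain_locked(pv);
	sketch_unlock(&pv->lock);
}

/* Drain every cpu's batch from the caller; no work is queued remotely. */
static void sketch_drain_all_remote(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		sketch_lock(&pvecs[cpu].lock);
		sketch_drain_locked(&pvecs[cpu]);
		sketch_unlock(&pvecs[cpu].lock);
	}
}
```

The cost is that the local add path now takes a lock it did not need
before; the gain is that the drain runs entirely on the calling cpu,
with nothing queued on, or sent to, the cpus being drained.
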
This means that a task isolation task no longer needs to worry about
being interrupted by SMP function call IPIs, so we don't have to deal
with this in the task isolation framework any more.

I don't recall anyone else volunteering to tackle this, so I will plan
to look at it. The patch to do that should be orthogonal to the
revised task isolation patch series.

== Quiescing vmstat ==

Another issue that task isolation handles is ensuring that the vmstat
worker is quiesced before returning to user space. Currently we
cancel the vmstat delayed work, then invoke refresh_cpu_vm_stats().
Neither of these things appears safe to do in the interrupts-disabled
context just before return to userspace, because they both can call
schedule(): refresh_cpu_vm_stats() via a cond_resched() under
CONFIG_NUMA, and cancel_delayed_work_sync() via a schedule() in
__cancel_work_timer().

Christoph offered to work with me to make sure that we could do the
appropriate quiescing with interrupts disabled, and seemed confident
it should be doable.

== Remote kernel TLB flush ==

Andy then brought up the issue of remote kernel TLB flush, which I've
been trying to sweep under the rug for the initial task isolation
series. Remote TLB flush causes an interrupt on many systems (x86 and
tile, for example, although not arm64), so to the extent that it
occurs frequently, it becomes important to handle for task isolation.
With the recent addition of vmap kernel stacks, this becomes suddenly
much more important than it used to be, to the point where we now
really have to handle it for task isolation.

The basic insight here is that you can safely skip interrupting
userspace cores when you are sending remote kernel TLB flushes, since
by definition they can't touch the kernel pages in question anyway.
Then you just need to guarantee to flush the kernel TLB space next
time the userspace task re-enters the kernel.
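
To make the insight concrete, here is a minimal userspace sketch,
assuming one atomic state word per cpu. It is not taken from any
existing patch: the state names, the sketch_* functions and the
counters that stand in for IPIs and local flushes are all
hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

#define NR_CPUS 4  /* hypothetical cpu count for the sketch */

/* Hypothetical per-cpu states; the names are made up for this sketch. */
enum cpu_state {
	CPU_IN_KERNEL = 0,
	CPU_IN_USER,
	CPU_IN_USER_FLUSH_PENDING,
};

static atomic_int cpu_state[NR_CPUS];  /* zero-initialized: all in kernel */
static int ipis_sent[NR_CPUS];         /* stand-in for real flush IPIs */
static int local_flushes[NR_CPUS];     /* stand-in for a local TLB flush */

static void sketch_local_flush(int cpu)
{
	local_flushes[cpu]++;
}

/* Return-to-user path: from here on the cpu can't touch kernel pages. */
static void sketch_user_enter(int cpu)
{
	atomic_store(&cpu_state[cpu], CPU_IN_USER);
}

/* Kernel-entry path: pick up any flush deferred while in userspace. */
static void sketch_user_exit(int cpu)
{
	int old = atomic_exchange(&cpu_state[cpu], CPU_IN_KERNEL);

	if (old == CPU_IN_USER_FLUSH_PENDING)
		sketch_local_flush(cpu);
}

/* Flush kernel TLB entries everywhere, initiated from cpu "self". */
static void sketch_flush_tlb_kernel_all(int self)
{
	sketch_local_flush(self);

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		int expected = CPU_IN_USER;

		if (cpu == self)
			continue;
		/* Userspace cpu: tag it for a later flush and skip the IPI. */
		if (atomic_compare_exchange_strong(&cpu_state[cpu], &expected,
						   CPU_IN_USER_FLUSH_PENDING))
			continue;
		/* Already tagged by an earlier flush: nothing more to do. */
		if (expected == CPU_IN_USER_FLUSH_PENDING)
			continue;
		/* In the kernel: it still has to be interrupted. */
		ipis_sent[cpu]++;
	}
}
```

In this sketch a cpu that stays in userspace across many flushes is
never interrupted and pays for a single flush on its next kernel
entry. The price is one atomic operation on every kernel entry and
exit.
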
The original Tilera dataplane code handled this by tracking task state
(kernel, user, or user-flushed) and manipulating the state atomically
at TLB flush time and kernel entry time. After some discussion of the
overheads of such atomics, Andy pointed out that there is already an
atomic increment being done in the RCU code, and we should be able to
leverage that word to achieve this effect. The idea is that remote
cores would do a compare-exchange of 0 to 1, which if it succeeded
would indicate that the remote core was in userspace and thus didn't
need to be IPI'd, but that it was now tagged for a kernel flush next
time the remote task entered the kernel. Then, when the remote task
enters the kernel, it does an atomic update of its own dynticks and,
if it discovers the low bit set, does a kernel TLB flush before
continuing.

It was agreed that this makes sense to do unconditionally, since it's
not just helpful for nohz_full and task isolation, but also for idle,
since interrupting an idle core periodically just to do repeated
kernel TLB flushes isn't good for power consumption.

One open question is whether we discover the low bit set early enough
in kernel entry that we can trust that we haven't tried to touch any
pages that have been invalidated in the TLB.

Paul agreed to take a look at implementing this.

== Optimizing vfree via RCU ==

An orthogonal issue was also brought up, which is whether we could use
RCU to handle the kernel TLB flush from freeing vmaps; presumably if
we have enough vmap space, we can arrange to return the freed VA space
via RCU, and simply defer the TLB flush until the next grace period.

I'm not sure if this is practical if we encounter a high volume of
vfrees, but I don't think we really reached a definitive agreement on
it during the discussion either.

== Disabling the dyn tick ==

One issue that the current task isolation patch series encounters is
when we request disabling the dyntick, but it doesn't happen.
At the moment we just wait until the tick is properly disabled, by
busy-waiting in the kernel (calling schedule etc as needed). No one
is particularly fond of this scheme. The consensus seems to be to try
harder to figure out what is going on, fix whatever problems exist,
then consider it a regression going forward if something causes the
dyntick to become difficult to disable again in the future. I will
take a look at this and try to gather more data on if and when this is
happening in 4.9.

== Missing oneshot_stopped callbacks ==

I raised the issue that various clock_event_device sources don't
always support oneshot_stopped, which can cause an additional final
interrupt to occur after the timer infrastructure believes the
interrupt has been stopped. I have patches to fix this for tile and
arm64 in my patch series; Thomas volunteered to look at adding
equivalent support for x86.

Many thanks to all those who participated in the discussion.
Frederic, we wished you had been there!

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com