From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S964841AbdKPNDW (ORCPT ); Thu, 16 Nov 2017 08:03:22 -0500 Received: from mail-lf0-f47.google.com ([209.85.215.47]:51571 "EHLO mail-lf0-f47.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S964817AbdKPNDQ (ORCPT ); Thu, 16 Nov 2017 08:03:16 -0500 X-Google-Smtp-Source: AGs4zMZwgaRv0MsIjm1gwYmmbUGlgpBwD/Urgb2KviT2Q6iecT+414EAzXwqQLvNJOl+W1YOiti8uscacwbLTi5kDuM= MIME-Version: 1.0 In-Reply-To: <20171116112753.jzsxg4cv6udzqotb@hirez.programming.kicks-ass.net> References: <20171026140750.119265-1-maco@android.com> <20171026140750.119265-2-maco@android.com> <20171115130111.qq5jjgdqc55xpu3i@hirez.programming.kicks-ass.net> <20171116112753.jzsxg4cv6udzqotb@hirez.programming.kicks-ass.net> From: Martijn Coenen Date: Thu, 16 Nov 2017 14:03:13 +0100 Message-ID: Subject: Re: [PATCH v3 1/6] ANDROID: binder: add support for RT prio inheritance. To: Peter Zijlstra Cc: Greg KH , John Stultz , Todd Kjos , "Arve Hj??nnev??g" , Sherry Yang , Thomas Gleixner , Amit Pundir , LKML , devel@driverdev.osuosl.org, Martijn Coenen Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Nov 16, 2017 at 12:27 PM, Peter Zijlstra wrote: >> On Wed, Nov 15, 2017 at 2:01 PM, Peter Zijlstra wrote: >> >> + * 1) binder supports a "minimum node priority", meaning that all transactions >> >> + * into a node must run at this priority at a minimum. This means that the >> >> + * desired priority for handling a transaction is not necessarily equal to >> >> + * the priority of the caller. >> > >> > Isn't that the same as running everything in that node (whatever that >> > is) at that given prio? >> >> Not for synchronous transactions; in that case it's max(min_node_prio, >> caller_prio). > > But that's exactly what you get!? How would it ever get below > min_node_prio? PI only ever (temporarily) raises prio, it never lowers > it. Ah I guess we have different definitions of inheritance then. This behavior is also related to binder's threadpool - all those threads run at a default prio of 120 (nice 0), and call into the kernel to wait for work. If somebody then makes a call with a lower prio (>120), we do change the binder thread that handles that call to use the same low prio as well. Why run at nice 0 if the caller runs at nice 10? > >> One reason min_node_prio exists is that for certain >> critical nodes, we don't want calls into it to run at a very low >> priority, because the implementation of said calls may take locks that >> are contended. I realize one possible solution to that is to use PI >> mutexes in userspace, but Android doesn't support that today. > > Fix that? That's probably easier said than done :) I don't know why we don't support them, I'll ask around. > > I'm confused -- again. Why not run those threads at the min prio to > begin with? This also gets back to binder's threadpool; they are generic threads in a process waiting for work at nice 0. The prio they get depends on the work they get, and in case of synchronous transactions, who calls into them. If we had dedicated threads for each node, we could indeed run them at the min_prio of that node by default. But this is not how binder works today. > Also, this doesn't seem related to PI. Yes, in case of asynchronous transactions it's not PI at all. > I'm not understanding; are you saying that some app does an async > request for sensor data, then later this sensor triggers and we need to > route the data back to the app? The way this typically works is that an interested app creates a binder node to receive sensor data on (basically a callback). The app then passes this node into the Sensor HAL and says "call me back when you have relevant sensor data". When the Sensor HAL has data, it makes calls on this node with the relevant data. Because the receiving app process has a few generic binder threads that may receive any type of data, not just sensor data, we want to raise the priority only when handling sensor data, and restore it to the default when the thread is done. > > But above you said that async requests do not boost the priority; so > then I would expect this sensor's data to land on regular min-prio > thread for dispatch. As you found out later in this e-mail, this doesn't work because of binder's current threadpool design. > *blink* what? Binder has its own threads? Its not just an RPC/ORB? Yeah :-) > That smells like fail; if you need to raise your prio it means you are > low(er) prio and could at that point be subject to inversion. True - it does expose a race condition, but it's still better than not raising the priority at all. I'm looking into selecting a busy thread and boost that. Though in general we don't run into this very often by sizing the threadpool large enough. > > Assuming no idle threads in the pool, and all 'busy' at lower priority. > A medium prio something is running, keeping these pool thingies from > asking for more work. A high prio request comes in.. nobody around to > service. You loose, game over. Agree, we're trying to fix that. > Yeah, that sounds really dodgy. I've been a _long_ time since I read the > RT CORBA stuff, so I'm not sure if there's a sensible solution to the > problem, but it sounds really really dodgy. > > Regular RPCs are synchronous; if A asks B to do something, A blocks. B > can then not turn around and ask A something. That is called a deadlock. In case of binder, this actually works, we call it "recursive calls"; the same thread in A that called into B can receive an incoming call from B and handle that. Of course you still need to be careful when that happens - eg if you hold a lock when making the first outbound call, you better make sure you can't take that same lock on any inbound call that can result from it. "Don't hold locks over synchronous binder calls" is a well-adapted Android design paradigm. > Alternatively, if you consider A and B to be services/nodes and you get > service A to call into B and then have B call back into A, you need to > guarantee there are sufficient tasks/threads to complete the entire > transaction _ahead_of_time_. That is, when task 1 of A does the request > into B and blocks, it needs to have reserved another thread (2) of A to > be idle and waiting for B's request. See above - we're guaranteed to run on the same thread that made the original call. > So I really don't understand how all this works.. how can 'important' > device threads not already be of the right priority. All this fixing up > after the fact just reeks of fail. This gets back to the binder threadpool design. That actually pre-dates my involvement with binder, but there are probably good reasons for it being the way it is. For example, some Android processes host as much as 100 binder nodes, some having different prio requirements; having one or more threads dedicated to each of those individual nodes seems quite wasteful. > >> What parts of the system aren't designed for RT? > > None of it as far I can tell. The shared thread pools, the quick fix > priority, the nested transaction stuff, those all reek of utter fail and > confusion. Ack. FWIW I don't think it's really fail and confusion - I believe these were conscious design choices for binder and at the time I doubt that RT was even considered. Like every system binder has its warts, but it has worked very well for Android. The same holds for this patchset - it has very measurable improvements on Android system performance, and I don't think the code or behavior is incorrect. It's major weakness is that it has to work with the existing threadpool paradigm and therefore depends on some internal scheduler APIs. > Have tasks at appropriate priorities a priory wherever and whenever > possible. PI is a (least bad) fix for when two threads interact on a > shared resource. The best fix is taking out the shared resource. > > You don't have a shared resource in most of the scenarios described > here. > > Use dedicated IPC channels, and avoid mixing priority msgs on a channel. > If you have to, make sure the channels are priority ordered. > > The primary rule for systems is simplicity, RT systems more so than > others. What you're describing is far too complicated and riddled with > holes. > > Also, have a look at RT CORBA, there's a really nice open-source one > here: > > http://www.cs.wustl.edu/~schmidt/TAO.html > > That group also has a whole bunch of research papers on the subject. As > said, its been many years since I used any of that so I'm bound to not > know all the new shiny stuff ;-) > > (As a bonus, its written in a semi sane language, as opposed to most of > Android :-) Thanks for all the feedback. I'll do some more reading and talking internally. As far as this series goes: since all the things we talked about which really only affect Android, is your main concern with merging this the use of some internal scheduler APIs? In general I'd really like to keep the upstream binder driver as close as possible to what we use in Android, and until we have an alternative for PI (if we can agree on something and build it), this is a pretty significant delta.