References: <20170825093335.100892-1-maco@android.com> <20170825093335.100892-4-maco@android.com>
From: Martijn Coenen
Date: Fri, 25 Aug 2017 20:47:39 +0200
Subject: Re: [PATCH 03/13] ANDROID: binder: add support for RT prio inheritance.
To: Thomas Gleixner
Cc: Greg KH, John Stultz, Todd Kjos, Arve Hjønnevåg, Amit Pundir, LKML,
    devel@driverdev.osuosl.org, Martijn Coenen, Iliyan Malchev, Colin Cross,
    Peter Zijlstra, Christoph Hellwig

Hi Thomas,

On Fri, Aug 25, 2017 at 5:08 PM, Thomas Gleixner wrote:
> Sorry, but this has not much to do with real priority inheritance.

Can you clarify what "real priority inheritance" is, or are you more
concerned about this particular implementation of it?

>
> It's a poor mans pseudo PI implementation. What I can't see from the sparse
> changelog is how all of this is supposed to work.

Ok, guess the latter :-) I agree with you that I should have included a
more verbose description of how this works, and I should also have
mentioned that I did look at rtmutex and didn't think it could easily be
used for our purposes (more below).

>
> My interpretation of it is, that you need to make sure that the other
> thread in that IPC mechanism gets boosted enough to not block the thread
> which needs a reply or action.

This is true, but it's not the whole picture - more below.

>
> Therefor you create a unreadable maze of capability checks and other things
> which do not make any sense to me due to lack of comments and something
> which explains the big picture.

I thought the code itself was reasonably self-explanatory, but I'm happy
to comment it more. The binder driver has done "priority inheritance"
with nice values for a long time, and this is an extension of that which
more or less works in the same way.

>
> The whole thing looks wrong and engineered sideways circumventing the
> existing facilities and making weird assumptions about priority settings.

I'd be happy to use existing facilities if they can do what we need. Can
you describe the "weird assumptions" in more detail?

> Current state:
>
> 1) thread queues work to worker via binder
>
> 2) thread waits for the work to complete (I deduced that from a stray
>    comment about sync work)
>
> If the waiting thread is a high priority thread it might wait for a long
> time if the worker thread is low priority or SCHED_OTHER.
>
> Desired state:
>
> 1) thread queues work to worker thread via binder
>
> 1a) thread boosts the worker to its own priority
>
> 2) thread waits for the work to complete
>
> 2a) worker completes and drops priority boost
>
> Is that about right?

It is correct for synchronous transactions. Synchronous transactions are
transactions for which the caller blocks until they are completed (e.g.
a reply has been received). The receiving process has a threadpool of
one or more threads waiting for transactions.
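To put your 1a/2a steps in code terms, here is a rough sketch of what
this patchset does to the thread that ends up handling a synchronous
transaction. The struct and helper names are made up for illustration
and this is not the actual driver code (the real thing would apply the
change with something like sched_setscheduler_nocheck() and has to deal
with locking):

    /* Sketch only: what the caller's priority does to the target thread. */
    struct binder_prio_sketch {
            unsigned int sched_policy;  /* SCHED_NORMAL, SCHED_FIFO, ... */
            int prio;                   /* kernel prio, lower value = higher */
    };

    struct target_thread_sketch {
            struct binder_prio_sketch prio;       /* what it runs at now */
            struct binder_prio_sketch saved_prio; /* restored on reply */
    };

    /* 1a) the caller picked 'target' and hands it its own scheduling
     *     settings before queueing the work */
    static void inherit_prio(struct target_thread_sketch *target,
                             struct binder_prio_sketch caller)
    {
            target->saved_prio = target->prio;
            target->prio = caller;
    }

    /* 2a) the reply has been sent: drop back to the saved settings */
    static void restore_prio(struct target_thread_sketch *target)
    {
            target->prio = target->saved_prio;
    }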
Typically those receiving threads call into the binder driver with an
ioctl and then wait for work on a waitqueue. Before this patch series,
all threads (available for new transactions) would wait on the same
waitqueue, so it was hard to do any sort of PI, because we didn't know
which thread was going to wake up. The first patches in this series
change this behavior by getting rid of the process-wide waitqueue - the
caller itself picks a thread to wake up. Anyway, after that things more
or less go as you describe: the thread picks up the work, returns to
userspace so userspace can do the actual work, and eventually a reply
comes back, at which point we unblock the caller (which is blocked on
its own waitqueue).

One place where using an rtmutex becomes tricky is that there may be no
threads waiting for work: all threads could be busy handling
transactions, so we can't pick one to wake up and do the work. In that
case, we push the work to a process-wide work queue, and the first
thread to become available will pick it up and do it. Here a "proxy
rtmutex" doesn't really seem to work, because the proxy doesn't know
the owner when the work is queued. The current implementation in this
patchset just has the thread that picks up the work boost its own
priority in this scenario.

Another reason rtmutex is not straightforward is that we support
something called "node priority inheritance". A node is binder
terminology for an object that you can make binder transactions to. For
some nodes, we like to set a minimum scheduling priority; an example is
the set of nodes in our system_server process. The reason for this is
that binder calls into the system_server process often take critical
userspace locks. Say a thread calls into system_server with priority
130; the system_server binder thread inherits that, runs at 130, takes
a lock in userspace and gets preempted; now, when somebody else tries
to take the same lock, it can get blocked for a long time. So we set a
minimum scheduling priority for system_server nodes at prio 120. This
makes using the rtmutex APIs hard, because we don't necessarily want
the binder thread to have the priority of the caller. I guess you could
say the proper way to fix this is that our userspace mutexes should
also support priority inheritance; I don't think they do today, though
I don't know the exact reason behind it.

The final reason rtmutex is not straightforward to use is asynchronous
transactions. Those are transactions for which the caller does not
block until completion. We push the work, wake up a thread, and then
the caller returns to userspace immediately. In that case we don't
inherit the priority from the caller, but we still want to run at the
minimum node priority. So there really is no caller that can be
"blocked" on a lock.

I will look into the rtmutex code a bit more - to be honest, I hadn't
seen the "proxy" mechanism before, which is why I thought rtmutex
wouldn't even work for synchronous transactions. But if you have
suggestions for how to deal with these other scenarios, I'd be happy to
see if I can rework this.

Thanks,
Martijn
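
P.S.: to make the "minimum node priority" part concrete, the decision
about what the handling thread should run at is roughly the following
(made-up helper, not the patch code; lower kernel prio value means
higher priority, 120 == nice 0):

    #include <stdbool.h>

    /*
     * Priority the thread handling a transaction on a node should get.
     * Sync: inherit from the caller, but never run below the node's
     * minimum (caller at 130, node minimum 120 -> run at 120).
     * Async: there is no blocked caller, so just use the node minimum.
     */
    static int effective_prio(int node_min_prio, int caller_prio, bool sync)
    {
            if (!sync)
                    return node_min_prio;
            return caller_prio < node_min_prio ? caller_prio : node_min_prio;
    }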