From: Saravana Kannan <saravanak@google.com>
To: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Georgi Djakov <georgi.djakov@linaro.org>,
	Rob Herring <robh+dt@kernel.org>,
	Mark Rutland <mark.rutland@arm.com>,
	Viresh Kumar <vireshk@kernel.org>, Nishanth Menon <nm@ti.com>,
	Stephen Boyd <sboyd@kernel.org>,
	"Rafael J. Wysocki" <rjw@rjwysocki.net>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	"Sweeney, Sean" <seansw@qti.qualcomm.com>,
	David Dai <daidavid1@codeaurora.org>,
	Rajendra Nayak <rnayak@codeaurora.org>,
	Sibi Sankar <sibis@codeaurora.org>,
	Bjorn Andersson <bjorn.andersson@linaro.org>,
	Evan Green <evgreen@chromium.org>,
	Android Kernel Team <kernel-team@android.com>,
	Linux PM <linux-pm@vger.kernel.org>,
	"open list:OPEN FIRMWARE AND FLATTENED DEVICE TREE BINDINGS" 
	<devicetree@vger.kernel.org>, LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v3 0/6] Introduce Bandwidth OPPs for interconnect paths
Date: Sat, 3 Aug 2019 00:36:40 -0700
Message-ID: <CAGETcx-Hp0CVH8SUyW=HtYPbvfqk-SetHecK8Gg4=n2rFGLOAw@mail.gmail.com>
In-Reply-To: <20190730030157.aml7z6vfsiqgyief@vireshk-i7>

Resending due to HTML.

On Mon, Jul 29, 2019 at 8:02 PM Viresh Kumar <viresh.kumar@linaro.org> wrote:
>
> On 29-07-19, 13:12, Saravana Kannan wrote:
> > On Mon, Jul 29, 2019 at 2:24 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
> > >
> > > On 18-07-19, 21:12, Saravana Kannan wrote:
> > > > On Wed, Jul 17, 2019 at 10:37 PM Viresh Kumar <viresh.kumar@linaro.org> wrote:
> > > > > I would like
> > > > > to put this data in the GPU OPP table only. What about putting a
> > > > > range in the GPU OPP table for the Bandwidth if it can change so much
> > > > > for the same frequency.
> > > >
> > > > I don't think the range is going to work.
> > >
> > > Any specific reason for that ?
> >
> > The next sentence was literally explaining this :) Fine to debate
> > that, but ignoring that and asking this question is kinda funny.
>
> Okay, but ...
>
> > > > If a GPU is doing purely
> > > > computational work, it's not unreasonable for it to vote for the
> > > > lowest bandwidth for any GPU frequency.
>
> ... it wasn't clear to me even after reading this sentence again now
> :)
>
> I understand that you may have to vote for the lowest bandwidth but
> that doesn't explain why a range can't work (sorry if it was just me
> who didn't understand it :)).

Well, it "doesn't work" in the sense that it doesn't give any
additional info. I can just vote for 0 or UINT_MAX if I want to stay
at the lowest or highest bandwidth. Having the actual values of the
lowest or highest point doesn't help for cases where you need to skip
intermediate bandwidth levels when going from low to high (as the need
increases).

>
> > > I think that is fine, but if the GPU is able to find how much
> > > bandwidth it needs why can't it just pass that value without needing
> > > to have another OPP table for the path ?
> >
> > You were asking this question in the context of "can the GPU OPP just
> > list the range of bandwidth it might use per GPU frequency". My point
> > is that the range would be useless because it would be the entire
> > available bandwidth range (because purely compute work might not need
> > any bandwidth).
>
> If it is useless to have the entire range here, then why bother
> providing one? Why can't the GPU request what it needs in exact terms,
> based on its calculations? And then based on these requests, let the
> interconnect find the best/stable values it really wants to
> program the path for (and for that the interconnect can use its own
> OPP table, which would be fine).

Let's say the actual path can support 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10 GB/s.

Let's say 2, 3 and 4 GB/s need the same voltage level as 5 GB/s on
this path. Then, for the GPU's needs, using 2, 3 or 4 GB/s might not
be good, because the power savings from the frequency difference
aren't worth the performance and power impact (if you run the
interconnect slower, the GPU has to run faster to achieve the same
performance) compared to running the interconnect at 5 GB/s.
Similarly, the GPU might skip 6 GB/s. So even if the GPU can somehow
calculate (or, say, measure) the exact bandwidth it requires, it'll
need to know to skip 2, 3 and 4 because they aren't
power/perf-efficient levels to use.

But all of these bandwidth levels might be usable for a smaller HW IP
whose power cost isn't high. There, the power savings from running the
interconnect at 3 GB/s might be worth it -- because even if the small
HW IP ran faster to achieve the same performance, the power increase
in the HW IP won't be higher than the power savings from running the
interconnect slower.
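
To make this concrete, here's a minimal sketch of what picking a
"useful" level could look like (the table contents and helper name are
made up for illustration; the GB/s values are the ones from the
example above):

#include <linux/kernel.h>
#include <linux/types.h>

/*
 * Hypothetical table of "useful" levels for the GPU's path, in GB/s.
 * 2, 3, 4 and 6 are omitted because they need the same voltage as the
 * next level up, so they aren't power/perf efficient for the GPU. A
 * smaller HW IP's table could keep all ten levels.
 */
static const u32 gpu_useful_gbps[] = { 1, 5, 7, 8, 9, 10 };

/* Round a raw bandwidth estimate up to the next useful level. */
static u32 gpu_pick_bw(u32 needed_gbps)
{
	int i;

	for (i = 0; i < ARRAY_SIZE(gpu_useful_gbps); i++)
		if (gpu_useful_gbps[i] >= needed_gbps)
			return gpu_useful_gbps[i];

	return gpu_useful_gbps[ARRAY_SIZE(gpu_useful_gbps) - 1];
}

A computed need of 2 GB/s comes back as 5 GB/s, which is the whole
point: the driver never requests the inefficient intermediate levels.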

> > Whereas, what the GPU's algorithm actually needs might be the list of
> > "useful" bandwidth levels to use.
>
> Hmm, I am not sure the GPU's algorithm needs this table, AFAIU, based
> on all the conversations we have had until now. It is very capable of
> finding how much bandwidth it needs,

Not really. If you have a monitor that can actually measure the
bandwidth, yes. But most often that's not the case. If you just have a
monitor that can give you the bus port's busy%, then it'll have to use
this table to pick the useful levels. As in the example above: if the
bus is still too busy at 1 GB/s, it would directly ask for 5 GB/s
instead of going through 2, 3 and 4.
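
A minimal sketch of that stepping logic, reusing the hypothetical
gpu_useful_gbps table from above (the busy% threshold is made up):

#define GPU_BUSY_HIGH_PCT	80	/* hypothetical threshold */

/*
 * With only a busy% monitor there's no absolute bandwidth estimate,
 * so step to the next useful level: from 1 GB/s this jumps straight
 * to 5 GB/s without ever requesting 2, 3 or 4.
 */
static u32 gpu_next_bw(u32 cur_gbps, unsigned int busy_pct)
{
	int i;

	if (busy_pct < GPU_BUSY_HIGH_PCT)
		return cur_gbps;

	for (i = 0; i < ARRAY_SIZE(gpu_useful_gbps) - 1; i++)
		if (gpu_useful_gbps[i] == cur_gbps)
			return gpu_useful_gbps[i + 1];

	return cur_gbps;	/* already at the highest level */
}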

> you just want the GPU driver to finally align
> that with a stable bandwidth for the platform later on. And what I am
> asking is that it is not required for the end driver to look for
> stable values; it just requests what it wants and lets the interconnect
> core/driver decide the stable values.

I think you've misunderstood my prior statements then.

The interconnect driver would then still have to aggregate the
requests and pick the final frequency for the interconnect. That's
where it comes in -- executing/implementing the requests of all the
clients.
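
For example, a provider-side .aggregate callback in the interconnect
framework might look roughly like this sketch (one common policy: sum
the average-bandwidth votes, take the max of the peak votes):

#include <linux/kernel.h>
#include <linux/interconnect-provider.h>

/*
 * Sketch of an aggregation policy across all clients sharing a node:
 * average-bandwidth votes add up, peak votes don't.
 */
static int example_aggregate(struct icc_node *node, u32 avg_bw,
			     u32 peak_bw, u32 *agg_avg, u32 *agg_peak)
{
	*agg_avg += avg_bw;
	*agg_peak = max(*agg_peak, peak_bw);

	return 0;
}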

> Very much like the clock framework, most of the user drivers just ask
> for a clk value to be programmed and it is the clock driver which
> keeps a table of the stable values and then aligns the requested value
> to one of those.

Think of this as similar to the clock API and the OPP tables for CPUs.
The clock API could run the CPU at many different frequencies, but the
CPU driver uses the CPU OPP table to pick a certain set of frequencies
that are "useful". If your CPU clock is shared with another block, say
L3, and there's an L3 driver requesting a different frequency range
(using clk_set_rate_range(x, UINT_MAX)), that's where the clock
framework aggregates the requests and sets the final clock frequency.
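
For instance, an L3 driver could do something like this sketch
(clk_set_rate_range() is the existing clk API; the function name and
rate are made up):

#include <linux/clk.h>
#include <linux/limits.h>

/*
 * Sketch: ask for "at least min_rate" on the clock the L3 shares with
 * the CPU. The clock framework aggregates this range with the CPU
 * driver's own request and programs one rate that satisfies both.
 */
static int l3_request_min_rate(struct clk *clk, unsigned long min_rate)
{
	return clk_set_rate_range(clk, min_rate, ULONG_MAX);
}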

Similarly, the GPU driver wants to pick useful interconnect path
bandwidth levels using BW OPP tables. And the interconnect framework
aggregates the requests across multiple drivers requesting different
bandwidths.
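
On the consumer side, that's just the existing interconnect API, as in
this sketch (the "gpu-mem" path name is made up, and a real driver
would get the path once at probe and icc_put() it on remove):

#include <linux/err.h>
#include <linux/interconnect.h>

/*
 * Sketch: the GPU driver votes a level it picked from its BW OPP
 * table; the interconnect framework aggregates this with every other
 * client's vote on the shared path. Units are kB/s.
 */
static int gpu_vote_bw(struct device *dev, u32 avg_kbps, u32 peak_kbps)
{
	struct icc_path *path = of_icc_get(dev, "gpu-mem");

	if (IS_ERR(path))
		return PTR_ERR(path);

	return icc_set_bw(path, avg_kbps, peak_kbps);
}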

Does that make sense?

-Saravana

