Re: [PATCH 0/3] drm/panfrost: Expose HW counters to userspace

From: Boris Brezillon <boris.brezillon@collabora.com>
To: Steven Price <steven.price@arm.com>
Cc: Neil Armstrong <narmstrong@baylibre.com>,
	Emil Velikov <emil.l.velikov@gmail.com>,
	dri-devel@lists.freedesktop.org, Rob Herring <robh+dt@kernel.org>,
	Mark Janes <mark.a.janes@intel.com>,
	kernel@collabora.com, Alyssa Rosenzweig <alyssa@rosenzweig.io>
Subject: Re: [PATCH 0/3] drm/panfrost: Expose HW counters to userspace
Date: Tue, 30 Apr 2019 14:42:38 +0200
Message-ID: <20190430144238.49963521@collabora.com>
In-Reply-To: <ba54e655-6316-8d36-dfd1-c5df418cee3a@arm.com>

+Rob, Eric, Mark and more

Hi,

On Fri, 5 Apr 2019 16:20:45 +0100
Steven Price <steven.price@arm.com> wrote:

> On 04/04/2019 16:20, Boris Brezillon wrote:
> > Hello,
> > 
> > This patch adds new ioctls to expose GPU counters to userspace.
> > These will be used by the mesa driver (should be posted soon).
> > 
> > A few words about the implementation: I followed the VC4/Etnaviv model
> > where perf counters are retrieved on a per-job basis. This allows one
> > to get accurate results when there are users using the GPU
> > concurrently.
> > AFAICT, the mali kbase is using a different approach where several
> > users can register a performance monitor but with no way to have
> > fine-grained control over what job/GPU-context to track.
> 
> mali_kbase submits overlapping jobs. The jobs on slot 0 and slot 1 can
> be from different contexts (address spaces), and mali_kbase also fully
> uses the _NEXT registers. So there can be a job from one context
> executing on slot 0 and a job from a different context waiting in the
> _NEXT registers. (And the same for slot 1). This means that there's no
> (visible) gap between the first job finishing and the second job
> starting. Early versions of the driver even had a throttle to avoid
> interrupt storms (see JOB_IRQ_THROTTLE) which would further delay the
> IRQ - but thankfully that's gone.
> 
> The upshot is that it's basically impossible to measure "per-job"
> counters when running at full speed. Because multiple jobs are running
> and the driver doesn't actually know when one ends and the next starts.
> 
> Since one of the primary use cases is to draw pretty graphs of the
> system load [1], this "per-job" information isn't all that relevant (and
> minimal performance overhead is important). And if you want to monitor
> just one application it is usually easiest to ensure that it is the only
> thing running.
> 
> [1]
> https://developer.arm.com/tools-and-software/embedded/arm-development-studio/components/streamline-performance-analyzer
> 
> > This design choice comes at a cost: every time the perfmon context
> > changes (the perfmon context is the list of currently active
> > perfmons), the driver has to add a fence to prevent new jobs from
> > corrupting counters that will be dumped by previous jobs.
> > 
> > Let me know if that's an issue and if you think we should approach
> > things differently.  
> 
> It depends what you expect to do with the counters. Per-job counters are
> certainly useful sometimes. But serialising all jobs can mess up the
> thing you are trying to measure the performance of.

I finally found some time to work on v2 this morning, and it turns out
implementing global perf monitors as done in mali_kbase means rewriting
almost everything (apart from the perfcnt layout stuff). I'm not against
doing that, but I'd like to be sure this is really what we want.

Eric, Rob, any opinion on that? Is it acceptable to expose counters
through the pipe_query/AMD_perfmon interface if we don't have this
job (or at least draw call) granularity? If not, should we keep the
solution I'm proposing here to make sure counter values are accurate,
or should we expose perf counters through a non-standard API?

BTW, I'd like to remind you that serialization (waiting on the perfcnt
fence) only happens if we have a perfmon context change between 2
consecutive jobs, which only happens when:
* 2 applications are running in parallel and at least one of them is
  monitored
* or when userspace decides to stop monitoring things and dump counter
  values

That means that, for the usual case (all perfmons disabled), there's
almost zero overhead (just a few more checks in the submit job code).
That also means that, if we ever decide to support global perfmons (perf
monitors that track things globally) on top of the current approach,
and only global perfmons are enabled, jobs won't be serialized the way
they are with per-job perfmons, because everyone will share the same
perfmon ctx (the same set of perfmons).
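
To make the serialization rule above concrete, here is a minimal sketch
of the check, not the actual panfrost code. All names (perfmon_ctx,
job_needs_perfcnt_fence, count_serialization_points) are hypothetical,
and representing a perfmon context as a bitmask of active perfmon IDs is
an assumption made purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model: the perfmon context of a job is the set of
 * perfmons active for it, stored as a bitmask (bit n = perfmon n).
 * An empty mask means "no perfmon attached".
 */
struct perfmon_ctx {
	uint64_t active_mask;
};

struct job {
	struct perfmon_ctx ctx;
};

/*
 * A job only has to wait on the perfcnt fence when its perfmon
 * context differs from the one of the previously submitted job.
 * A NULL prev (first job) is treated as an empty context.
 */
static bool job_needs_perfcnt_fence(const struct job *prev,
				    const struct job *next)
{
	uint64_t prev_mask;

	assert(next != NULL);
	prev_mask = prev ? prev->ctx.active_mask : 0;
	return prev_mask != next->ctx.active_mask;
}

/* Count how many jobs in a submission queue would be serialized. */
static size_t count_serialization_points(const struct job *jobs, size_t n)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++) {
		const struct job *prev = i ? &jobs[i - 1] : NULL;

		if (job_needs_perfcnt_fence(prev, &jobs[i]))
			count++;
	}
	return count;
}
```

With this model, a queue where no job has a perfmon attached never hits
the fence, a queue where every job shares the same set of perfmons (the
global perfmon case) hits it once when monitoring starts, and only a
queue that alternates between monitored and unmonitored jobs is
serialized at every submission.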

I'd appreciate any feedback from people that have used perf counters
(or implemented a way to dump them) on their platform.

Thanks,

Boris