From: Arnd Bergmann <arnd@arndb.de>
To: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Arnd Bergmann <arnd@arndb.de>,
	Linus Walleij <linus.walleij@linaro.org>,
	 Dave Airlie <airlied@gmail.com>, Greg KH <greg@kroah.com>,
	Leon Romanovsky <leon@kernel.org>,
	 Laurent Pinchart <laurent.pinchart@ideasonboard.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	 Josh Triplett <josh@joshtriplett.org>,
	Mauro Carvalho Chehab <mchehab@kernel.org>,
	 Jonathan Corbet <corbet@lwn.net>,
	ksummit@lists.linux.dev, dev@tvm.apache.org
Subject: Re: [MAINTAINER SUMMIT] User-space requirements for accelerator drivers
Date: Tue, 14 Sep 2021 00:04:56 +0200	[thread overview]
Message-ID: <CAK8P3a2pvCwuSic9yevW0xmMy0-F1FEgfQ-_Rc7wWDs7PJEf_w@mail.gmail.com> (raw)
In-Reply-To: <CAKMK7uFrOpH9NG3XB1dT889r4HrUHaotte1D4Nh2=5tjD9VEpg@mail.gmail.com>

On Mon, Sep 13, 2021 at 3:54 PM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:

> > One straightforward hardware independent low-level API would
> > be the traditional BLAS GEMM call[1] for matrix multiplication
> > and its variants (integer, float, bfloat16, ...).  Most of the frameworks
> > are able to use SGEMM to do the actual calculation since that
> > has optimized versions for most CPUs and GPUs, and most
> > hardware accelerators should be able to provide an
> > implementation of this that doesn't completely suck. This
> > can be used for both inferencing and training.
>
> I think BLAS is too high-level for these. Sure, for perfect speed the
> vendor probably wants to have their own BLAS thing, their own NN
> optimizer and a heap of other things, but for the low-level userspace
> we're talking about here that pretty much doesn't matter.

I suppose high-level vs low-level is not the right distinction here;
it's more like fixed-function vs programmable.

As a fixed-function interface, something like GEMM is probably as
low-level as you would want to get, as it's big enough to make sense
as a single atomic command, but small enough to be able to build on
top of it.

> I think a really good example of this is the compute stack Intel is building:
> - level0 is the absolute bare-bones low level driver. For this
> discussion here that's enough of a userspace to make at least Dave&me
> happy. In 3d this would be vulkan. In AI/NN space, there's nothing
> here, at least nothing cross-vendor.
> - Then there's the entire OneApi ecosystem on top. Lots of this is
> open, some of it is closed, but from the pov of an accel stack it's
> all looking like applications, not like driver code. BLAS is sitting
> here. For AI/NN this is pytorch, tensorflow and all these higher-level
> frameworks (which often have quite sophisticated optimizers of their
> own)

Looking at OneAPI, I see a BLAS implementation (oneMKL) next to
somewhat higher-level abstraction (oneDNN). Which of the two are
the generic frameworks (pytorch/tensorflow/...) built on top of?

The oneDNN interface looks like it could be implemented not only on
top of level0 but also layered above some BLAS library or as a thin
wrapper above a fixed-function kernel interface that provides similar
high-level abstractions. Is that a correct understanding? It also seems
like this is similar in purpose to Apple's BNNS library.

> Especially BLAS isn't the most impressive, since largely it's a fused
> multiply-add benchmark and not much else. Ok, enormous amounts of
> tuning to perfectly exploit the execution bw and interconnect/cache
> hierarchy of your chip, whatever it is. That's often something vendors
> don't like sharing (intel's math kernels are still closed afaik)
> because it leaks a bit much about actual implementation details of the
> chip as opposed to how it's programmed. Also not something I really
> care about with my maintainer hat on.

It's not /just/ benchmarks, it's actually being used directly underneath
the high-level frameworks precisely because it is simple, portable and
well optimized. If there is a higher-level interface like oneDNN that
is usable by the common frameworks, using a subset of that as a
fixed-function interface for the kernel may be a good alternative
(or at least complementary) to a fully programmable interface.

I realize that fixed-function is not fashionable on GPUs, but they
are widely used in other areas (video codecs, crypto, ...) even when
you are running precompiled code on the accelerator hardware.
This would of course replace the question of open-source user space
with the question of open-source firmware, as the user side would
become mostly trivial while the code running on the accelerator goes
from being dynamically created to a firmware blob.

       Arnd

