* Habanalabs Open-Source TPC LLVM compiler and SynapseAI Core library
@ 2021-09-10  7:26 Oded Gabbay
  2021-09-10  7:58 ` Greg Kroah-Hartman
  2021-09-12  7:38 ` Michael Zuckerman
  0 siblings, 2 replies; 15+ messages in thread
From: Oded Gabbay @ 2021-09-10  7:26 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: mzuckerman, dsinger, Linus Torvalds, Dave Airlie, Daniel Vetter,
      Jason Gunthorpe, linux-kernel@vger.kernel.org

Hi Greg,

Following our conversations a couple of months ago, I'm happy to tell you
that Habanalabs has open-sourced its TPC (Tensor Processing Core) LLVM
compiler, which is a fork of the LLVM open-source project.

The project can be found on the Habanalabs GitHub website at:
https://github.com/HabanaAI/tpc_llvm

There is a companion guide on how to write TPC kernels at:
https://docs.habana.ai/en/latest/TPC_User_Guide/TPC_User_Guide.html

The guide details the TPC compute engine's architecture, how to write TPC
kernels using the TPC-C language, etc.

In addition, we have written a reference implementation of the SynapseAI
API, called SynapseAI Core, and released its code under the MIT license to
the open-source community at:
https://github.com/HabanaAI/SynapseAI_Core

SynapseAI Core contains all the necessary building blocks to run Deep
Learning training on Gaudi, although it is not as optimized as the
closed-source library. The project repository contains a couple of TPC
kernels that implement basic DL operators. These kernels can serve as an
example of how to implement more complex operators.

To work with the Gaudi device, the library calls the Habanalabs kernel
driver uAPI through the already open-source hl-thunk library at:
https://github.com/HabanaAI/hl-thunk

Moreover, the library contains a few tests (and more will follow soon)
that demonstrate how to use the SynapseAI API to run workloads which
utilize the TPC engines on Gaudi devices. We provide a short readme that
explains how to build and run the included tests.

It is important to note that we have provided all the APIs necessary to
connect this library to any Deep Learning framework, by writing an
appropriate backend in the framework and by writing more TPC kernels to
implement the different operators.

Once the driver(s) for the Gaudi NIC ports are upstreamed, this library
may be used together with IBverbs to perform training on multiple Gaudi
devices.

Thanks,
Oded

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: Habanalabs Open-Source TPC LLVM compiler and SynapseAI Core library
  2021-09-10  7:26 Habanalabs Open-Source TPC LLVM compiler and SynapseAI Core library Oded Gabbay
@ 2021-09-10  7:58 ` Greg Kroah-Hartman
  2021-09-10 16:09   ` Daniel Vetter
  2021-10-27  6:53   ` Habanalabs Open-Source TPC LLVM compiler and SynapseAI Core library Oded Gabbay
  2021-09-12  7:38 ` Michael Zuckerman
  1 sibling, 2 replies; 15+ messages in thread
From: Greg Kroah-Hartman @ 2021-09-10  7:58 UTC (permalink / raw)
  To: Oded Gabbay
  Cc: mzuckerman, dsinger, Linus Torvalds, Dave Airlie, Daniel Vetter,
      Jason Gunthorpe, linux-kernel@vger.kernel.org

On Fri, Sep 10, 2021 at 10:26:56AM +0300, Oded Gabbay wrote:
> Hi Greg,
>
> Following our conversations a couple of months ago, I'm happy to tell you that
> Habanalabs has open-sourced its TPC (Tensor Processing Core) LLVM compiler,
> which is a fork of the LLVM open-source project.
>
> The project can be found on Habanalabs GitHub website at:
> https://github.com/HabanaAI/tpc_llvm
>
> There is a companion guide on how to write TPC kernels at:
> https://docs.habana.ai/en/latest/TPC_User_Guide/TPC_User_Guide.html

That's great news, thanks for pushing for this and releasing it all!

greg k-h
* Re: Habanalabs Open-Source TPC LLVM compiler and SynapseAI Core library
  2021-09-10  7:58 ` Greg Kroah-Hartman
@ 2021-09-10 16:09   ` Daniel Vetter
  2021-09-10 16:10     ` Daniel Vetter
  2021-10-27  6:53   ` Habanalabs Open-Source TPC LLVM compiler and SynapseAI Core library Oded Gabbay
  1 sibling, 1 reply; 15+ messages in thread
From: Daniel Vetter @ 2021-09-10 16:09 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Oded Gabbay, mzuckerman, dsinger, Linus Torvalds, Dave Airlie,
      Jason Gunthorpe, linux-kernel@vger.kernel.org

On Fri, Sep 10, 2021 at 9:58 AM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
> On Fri, Sep 10, 2021 at 10:26:56AM +0300, Oded Gabbay wrote:
> > Hi Greg,
> >
> > Following our conversations a couple of months ago, I'm happy to tell you that
> > Habanalabs has open-sourced its TPC (Tensor Processing Core) LLVM compiler,
> > which is a fork of the LLVM open-source project.
> >
> > The project can be found on Habanalabs GitHub website at:
> > https://github.com/HabanaAI/tpc_llvm
> >
> > There is a companion guide on how to write TPC kernels at:
> > https://docs.habana.ai/en/latest/TPC_User_Guide/TPC_User_Guide.html
>
> That's great news, thanks for pushing for this and releasing it all!

Yeah this is neat.

There's still the problem that we spent the past 2.5 years pissing off
a lot of people for an imo questionable political project, bypassing
all the technical review and expertise. Now that the political
nonsense is resolved I think we need to look at at least the technical
cleanup. The angered people are much harder to fix, so let's maybe
ignore that (or perhaps a ks topic, no idea, I'm honestly not super
motivated to rehash this entire story again). Here's what I think we
should do:

- move drivers/misc/habanalabs under drivers/gpu/habanalabs and
  review/discussions on dri-devel
- grandfather the entire current situation in as-is, it's not the only
  driver we have with a funny uapi of its own (but the other driver did
  manage to get their compiler into upstream llvm even, and not like 2
  years late)
- review the dma-buf stuff on dri-devel and then land it through
  standard flows, not the gregk-misc bypass
- close drivers/misc backdoor for further accel driver submissions,
  I'd like to focus on technical stuff in this area going forward and
  not pointless exercises in bypassing due process and all that

I expect we'll have a proper discussion what the stack should look
like with the next submission (from a different vendor maybe), that
ship kinda sailed with habanalabs.

Cheers, Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
* Re: Habanalabs Open-Source TPC LLVM compiler and SynapseAI Core library
  2021-09-10 16:09 ` Daniel Vetter
@ 2021-09-10 16:10   ` Daniel Vetter
  0 siblings, 0 replies; 15+ messages in thread
From: Daniel Vetter @ 2021-09-10 16:10 UTC (permalink / raw)
  To: Greg Kroah-Hartman, dri-devel
  Cc: Oded Gabbay, mzuckerman, dsinger, Linus Torvalds, Dave Airlie,
      Jason Gunthorpe, linux-kernel@vger.kernel.org

Forgot to add dri-devel.

On Fri, Sep 10, 2021 at 6:09 PM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
>
> On Fri, Sep 10, 2021 at 9:58 AM Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> > On Fri, Sep 10, 2021 at 10:26:56AM +0300, Oded Gabbay wrote:
> > > Hi Greg,
> > >
> > > Following our conversations a couple of months ago, I'm happy to tell you that
> > > Habanalabs has open-sourced its TPC (Tensor Processing Core) LLVM compiler,
> > > which is a fork of the LLVM open-source project.
> > >
> > > The project can be found on Habanalabs GitHub website at:
> > > https://github.com/HabanaAI/tpc_llvm
> > >
> > > There is a companion guide on how to write TPC kernels at:
> > > https://docs.habana.ai/en/latest/TPC_User_Guide/TPC_User_Guide.html
> >
> > That's great news, thanks for pushing for this and releasing it all!
>
> Yeah this is neat.
>
> There's still the problem that we spent the past 2.5 years pissing off
> a lot of people for an imo questionable political project, bypassing
> all the technical review and expertise. Now that the political
> nonsense is resolved I think we need to look at at least the technical
> cleanup. The angered people are much harder to fix, so let's maybe
> ignore that (or perhaps a ks topic, no idea, I'm honestly not super
> motivated to rehash this entire story again). Here's what I think we
> should do:
>
> - move drivers/misc/habanalabs under drivers/gpu/habanalabs and
>   review/discussions on dri-devel
> - grandfather the entire current situation in as-is, it's not the only
>   driver we have with a funny uapi of its own (but the other driver did
>   manage to get their compiler into upstream llvm even, and not like 2
>   years late)
> - review the dma-buf stuff on dri-devel and then land it through
>   standard flows, not the gregk-misc bypass
> - close drivers/misc backdoor for further accel driver submissions,
>   I'd like to focus on technical stuff in this area going forward and
>   not pointless exercises in bypassing due process and all that
>
> I expect we'll have a proper discussion what the stack should look
> like with the next submission (from a different vendor maybe), that
> ship kinda sailed with habanalabs.
>
> Cheers, Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
* Accelerator drivers going forward (was Re: Habanalabs Open-Source TPC LLVM compiler and SynapseAI Core library)
  2021-09-10 16:10 ` Daniel Vetter
@ 2021-09-12 13:55   ` Greg Kroah-Hartman
  2021-09-12 16:37     ` Simon Ser
  2021-09-12 19:32     ` Dave Airlie
  2 replies; 15+ messages in thread
From: Greg Kroah-Hartman @ 2021-09-12 13:55 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: dri-devel, Oded Gabbay, mzuckerman, dsinger, Linus Torvalds,
      Dave Airlie, Jason Gunthorpe, linux-kernel@vger.kernel.org

On Fri, Sep 10, 2021 at 06:10:27PM +0200, Daniel Vetter wrote:
> Forgot to add dri-devel.
>
> On Fri, Sep 10, 2021 at 6:09 PM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> >
> > On Fri, Sep 10, 2021 at 9:58 AM Greg Kroah-Hartman
> > <gregkh@linuxfoundation.org> wrote:
> > > On Fri, Sep 10, 2021 at 10:26:56AM +0300, Oded Gabbay wrote:
> > > > Hi Greg,
> > > >
> > > > Following our conversations a couple of months ago, I'm happy to tell you that
> > > > Habanalabs has open-sourced its TPC (Tensor Processing Core) LLVM compiler,
> > > > which is a fork of the LLVM open-source project.
> > > >
> > > > The project can be found on Habanalabs GitHub website at:
> > > > https://github.com/HabanaAI/tpc_llvm
> > > >
> > > > There is a companion guide on how to write TPC kernels at:
> > > > https://docs.habana.ai/en/latest/TPC_User_Guide/TPC_User_Guide.html
> > >
> > > That's great news, thanks for pushing for this and releasing it all!
> >
> > Yeah this is neat.
> >
> > There's still the problem that we spent the past 2.5 years pissing off
> > a lot of people for an imo questionable political project, bypassing
> > all the technical review and expertise. Now that the political
> > nonsense is resolved I think we need to look at at least the technical
> > cleanup. The angered people are much harder to fix, so let's maybe
> > ignore that (or perhaps a ks topic, no idea, I'm honestly not super
> > motivated to rehash this entire story again). Here's what I think we
> > should do:
> >
> > - move drivers/misc/habanalabs under drivers/gpu/habanalabs and
> >   review/discussions on dri-devel

Wait, why move into gpu?  Are we going to do that for all hardware
accelerators that we currently have in the kernel tree?

These things are not GPUs in the sense of them being "do some work and
write out to a screen", which is what I would associate with a GPU (G
does stand for "Graphical", right?)

Yes, GPUs can do things that some accelerators can do, but they can do
things that accelerators can not do, and the other way around as well.
I doubt you want all of the existing gpu drivers to be only treated as
an "accelerator driver" now, as where would the logic that has to happen
to get the bits out to a screen live?

And since we have a long history of accepting accelerator drivers (I see
some in our tree since 2018 at the least), and there is no common
userspace collation trying to make a common userspace api, why do they
have to live in the same place?  What makes them common except for the
fact that they use the kernel as a semi-dumb pipe to send work to and
from a different processor?

Look at drivers/misc/cxl/ and drivers/misc/ocxl and drivers/misc/uacce/
and drivers/misc/sgi-gru and drivers/misc/bcm-vk/ even drivers/misc/mei/
as that is an off-load engine we talk to, right?

What about the drivers/fpga/ api we have, it handles accelerators as
well.  I'm sure we have many other examples in the kernel tree as well,
I just did a quick look and found these.

All the above accelerators do things in different ways because their
hardware is different, so they need different user/kernel apis, right?
How are we going to unify them?  Who is going to unify them?

So drivers/accel/ perhaps?  I would be able to get rid of loads of
drivers/misc/ code that way :)

Who is going to be the new maintainer of this subsystem?

So far they have all been going into drivers/misc/ because no one else
stepped up to do the review of them except me.  I would _LOVE_ the help
here as I end up reviewing a new one every kernel release at the least,
but companies do not seem to be willing to fund developers to be
maintainers these days :(

And yes, I have been reviewing the fpga code as well, even though they
do have a good maintainer, as those patches flow through my tree due to
historical reasons.  I know the fpga developers would have loved some
help with review of those patches.

> > - grandfather the entire current situation in as-is, it's not the only
> >   driver we have with a funny uapi of its own (but the other driver did
> >   manage to get their compiler into upstream llvm even, and not like 2
> >   years late)

We have many many accelerator drivers with odd uapis as they all work
differently.  Are we going to have to have any new company that comes
along use one of the existing apis we have (and if so, which one?) or do
we allow them to create their own as everyone does do things
differently, which really is fine as far as a kernel is concerned
(again, semi-dumb pipe.)

> > - review the dma-buf stuff on dri-devel and then land it through
> >   standard flows, not the gregk-misc bypass

Are dma-bufs somehow required to be reviewed on dri-devel?  As others
have asked in the past, they are already being used in other subsystems
(like IB) today, did those authors also get review there?

If so, great, if not, that feels odd to me, as I am seeing lots of
out-of-tree drivers start to use these structures, which is why the api
was created (to stop the roll-your-own-implementations.)  Does dri-devel
want me to have those vendors cc: you all when those get submitted?

> > - close drivers/misc backdoor for further accel driver submissions,
> >   I'd like to focus on technical stuff in this area going forward and
> >   not pointless exercises in bypassing due process and all that

I will be glad to not accept any more, but as I say above, what are the
new requirements going to be so that those companies that do want to
submit their code know what to do?

And what exactly are we using as a definition of an accelerator?  We
have networking cards that are "accelerators" as well as crypto
"accelerators" :)

> > I expect we'll have a proper discussion what the stack should look
> > like with the next submission (from a different vendor maybe), that
> > ship kinda sailed with habanalabs.

Who is going to define this stack?  As there is no industry standard,
why would we define this?

thanks,

greg k-h
* Re: Accelerator drivers going forward (was Re: Habanalabs Open-Source TPC LLVM compiler and SynapseAI Core library)
  2021-09-12 13:55 ` Accelerator drivers going forward (was Re: Habanalabs Open-Source TPC LLVM compiler and SynapseAI Core library) Greg Kroah-Hartman
@ 2021-09-12 16:37   ` Simon Ser
  2021-09-12 19:32   ` Dave Airlie
  1 sibling, 0 replies; 15+ messages in thread
From: Simon Ser @ 2021-09-12 16:37 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Daniel Vetter, dri-devel, Oded Gabbay, mzuckerman, dsinger,
      Linus Torvalds, Dave Airlie, Jason Gunthorpe,
      linux-kernel@vger.kernel.org

> > > - move drivers/misc/habanalabs under drivers/gpu/habanalabs and
> > >   review/discussions on dri-devel
>
> Wait, why move into gpu?  Are we going to do that for all hardware
> accelerators that we currently have in the kernel tree?
>
> These things are not GPUs in the sense of them being "do some work and
> write out to a screen", which is what I would associate with a GPU (G
> does stand for "Graphical", right?)
>
> Yes, GPUs can do things that some accelerators can do, but they can do
> things that accelerators can not do, and the other way around as well.
> I doubt you want all of the existing gpu drivers to be only treated as
> an "accelerator driver" now, as where would the logic that has to happen
> to get the bits out to a screen live?

This seems like a description of the "display" part of the drivers,
driven by KMS. There are many chips which can't do the "display" part,
only the "render" part. Their drivers are living in drivers/gpu/ as
well.
* Re: Accelerator drivers going forward (was Re: Habanalabs Open-Source TPC LLVM compiler and SynapseAI Core library)
  2021-09-12 13:55 ` Accelerator drivers going forward (was Re: Habanalabs Open-Source TPC LLVM compiler and SynapseAI Core library) Greg Kroah-Hartman
  2021-09-12 16:37   ` Simon Ser
@ 2021-09-12 19:32   ` Dave Airlie
  1 sibling, 0 replies; 15+ messages in thread
From: Dave Airlie @ 2021-09-12 19:32 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Daniel Vetter, dri-devel, Oded Gabbay, mzuckerman, dsinger,
      Linus Torvalds, Jason Gunthorpe, linux-kernel@vger.kernel.org

On Sun, 12 Sept 2021 at 23:55, Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> On Fri, Sep 10, 2021 at 06:10:27PM +0200, Daniel Vetter wrote:
> > Forgot to add dri-devel.
> >
> > On Fri, Sep 10, 2021 at 6:09 PM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> > >
> > > On Fri, Sep 10, 2021 at 9:58 AM Greg Kroah-Hartman
> > > <gregkh@linuxfoundation.org> wrote:
> > > > On Fri, Sep 10, 2021 at 10:26:56AM +0300, Oded Gabbay wrote:
> > > > > Hi Greg,
> > > > >
> > > > > Following our conversations a couple of months ago, I'm happy to tell you that
> > > > > Habanalabs has open-sourced its TPC (Tensor Processing Core) LLVM compiler,
> > > > > which is a fork of the LLVM open-source project.
> > > > >
> > > > > The project can be found on Habanalabs GitHub website at:
> > > > > https://github.com/HabanaAI/tpc_llvm
> > > > >
> > > > > There is a companion guide on how to write TPC kernels at:
> > > > > https://docs.habana.ai/en/latest/TPC_User_Guide/TPC_User_Guide.html
> > > >
> > > > That's great news, thanks for pushing for this and releasing it all!
> > >
> > > Yeah this is neat.
> > >
> > > There's still the problem that we spent the past 2.5 years pissing off
> > > a lot of people for an imo questionable political project, bypassing
> > > all the technical review and expertise. Now that the political
> > > nonsense is resolved I think we need to look at at least the technical
> > > cleanup. The angered people are much harder to fix, so let's maybe
> > > ignore that (or perhaps a ks topic, no idea, I'm honestly not super
> > > motivated to rehash this entire story again). Here's what I think we
> > > should do:
> > >
> > > - move drivers/misc/habanalabs under drivers/gpu/habanalabs and
> > >   review/discussions on dri-devel
>
> Wait, why move into gpu?  Are we going to do that for all hardware
> accelerators that we currently have in the kernel tree?

We could just mv drivers/gpu drivers/accel if that helps your mental
model here.

> These things are not GPUs in the sense of them being "do some work and
> write out to a screen", which is what I would associate with a GPU (G
> does stand for "Graphical", right?)

Neither are a lot of the gpu drivers, it's almost like we evolved the
subsystem in 20 years, and the name got away from us.

As an example: etnaviv, panfrost, lima and vgem drivers have no display
interfaces at all. Nada, they do nothing except accelerate and use
dma-buf to talk to other drivers.

> Yes, GPUs can do things that some accelerators can do, but they can do
> things that accelerators can not do, and the other way around as well.
> I doubt you want all of the existing gpu drivers to be only treated as
> an "accelerator driver" now, as where would the logic that has to happen
> to get the bits out to a screen live?

Don't care, totally doesn't matter if a driver is accelerator + display,
you could write in-driver buses if you wanted to abstract this more,
since internally most GPUs are just SoCs, the display and accelerator
pieces talk to power management, irqs and dma-buf like functionality
internally in the driver, the thing is for most GPUs there is a single
PCI device to bind to, so historically nobody has seen the value in
splitting them more or adding an in-driver bus for one set of devices.

> And since we have a long history of accepting accelerator drivers (I see
> some in our tree since 2018 at the least), and there is no common
> userspace collation trying to make a common userspace api, why do they
> have to live in the same place?  What makes them common except for the
> fact that they use the kernel as a semi-dumb pipe to send work to and
> from a different processor?
>
> Look at drivers/misc/cxl/ and drivers/misc/ocxl and drivers/misc/uacce/
> and drivers/misc/sgi-gru and drivers/misc/bcm-vk/ even drivers/misc/mei/
> as that is an off-load engine we talk to, right?
>
> What about the drivers/fpga/ api we have, it handles accelerators as
> well.  I'm sure we have many other examples in the kernel tree as well,
> I just did a quick look and found these.
>
> All the above accelerators do things in different ways because their
> hardware is different, so they need different user/kernel apis, right?
> How are we going to unify them?  Who is going to unify them?
>
> So drivers/accel/ perhaps?  I would be able to get rid of loads of
> drivers/misc/ code that way :)
>
> Who is going to be the new maintainer of this subsystem?

We already said if we could get agreement on having things follow the
rules, then they can be merged under drm trees or we'd start a new
accel tree.

The problem is the free-for-all merge with no barriers approach that
you and, I believe, Olof are campaigning for, doesn't seem to create
communities, it may create consulting or training opportunities for the
Linux Foundation, but thus far I don't see any communities.

Graphics accelerator community exists because of and has itself refined
the rules over time. I don't think our rules will necessarily work for
other groups immediately but I think other groups need to construct
acceptable merge criteria beyond the kernel, and kernel maintainers
have to take more responsibility for saying no if they don't have time
for community building.

> So far they have all been going into drivers/misc/ because no one else
> stepped up to do the review of them except me.  I would _LOVE_ the help
> here as I end up reviewing a new one every kernel release at the least,
> but companies do not seem to be willing to fund developers to be
> maintainers these days :(
>
> And yes, I have been reviewing the fpga code as well, even though they
> do have a good maintainer, as those patches flow through my tree due to
> historical reasons.  I know the fpga developers would have loved some
> help with review of those patches.

Lack of reviewing isn't the problem here, lack of responsibility for
creating a long term mess is. You are creating long term dumping
grounds for badly thought out stuff. Saying people keep adding more
trash to my dump and it's overloading me is just the effect of having
created the dump with no rules to follow in the first place.

> > > - review the dma-buf stuff on dri-devel and then land it through
> > >   standard flows, not the gregk-misc bypass
>
> Are dma-bufs somehow required to be reviewed on dri-devel?  As others
> have asked in the past, they are already being used in other subsystems
> (like IB) today, did those authors also get review there?

Yes, any use of dma-buf has to be cc'ed to dri-devel and linux-media
per MAINTAINERS.

> If so, great, if not, that feels odd to me, as I am seeing lots of
> out-of-tree drivers start to use these structures, which is why the api
> was created (to stop the roll-your-own-implementations.)  Does dri-devel
> want me to have those vendors cc: you all when those get submitted?

Yes. MAINTAINERS has matching for this, are you not advising people to
use the proper submission techniques and thus bypassing that file?

The reason is dma-buf and later by extension dma-fence can create
really bad problems for the kernel around memory management.

https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html#indefinite-dma-fences

When a driver is self contained and doesn't interact with other kernel
drivers nobody really has to care. However once a driver starts
interacting with other drivers in the kernel, a responsible maintainer
has to check that these new drivers aren't going to crap all over the
existing drivers and destabilise the kernel.

Someone has to review the hardware design to see if page faulting works
or if preemption works or a bunch of other gotchas. Someone has to
review the userspace to make sure it isn't doing knowingly bad things
or making assumptions based on the kernel driver doing bad things.

The thing is we've had code merged into our in-tree i915 driver that
broke a bunch of these assumptions, and have had to spend a year
cleaning it out, now this happened post-merge and diligence had
lessened, having the expertise to spot this in new dma-buf/fence users
is why we insist on having access to way more than just the 1000 line
kernel driver submission.

> I will be glad to not accept any more, but as I say above, what are the
> new requirements going to be so that those companies that do want to
> submit their code know what to do?

I'm proposing a patch for documentation that maintainers can sign up
for (it's mentioned in the ksummit thread).

> And what exactly are we using as a definition of an accelerator?  We
> have networking cards that are "accelerators" as well as crypto
> "accelerators" :)
>
> > > I expect we'll have a proper discussion what the stack should look
> > > like with the next submission (from a different vendor maybe), that
> > > ship kinda sailed with habanalabs.
>
> Who is going to define this stack?  As there is no industry standard,
> why would we define this?

Because someone has to help, saying yes isn't helping, it's enabling
bad behaviour. Parenting and maintaining both involve saying No for the
future prosperity of the ecosystem.

Dave.
https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html#indefinite-dma-fences When a driver is self contained and doesn't interact with other kernel drivers nobody really has to care. However once a driver starts interacting with other drivers in the kernel, a responsible maintainer has to check that these new drivers aren't going to crap all over the existing drivers and destabilise the kernel. Someone has to review the hardware design to see if page faulting works or if preemption works or a bunch of other gotchas. Someone has to review the userspace to make sure it isn't doing knowingly bad things or making assumptions based on the kernel driver doing bad things. The thing is we've had code merged into our in-tree i915 driver that broke a bunch of these assumptions, and have had to spend a year cleaning it out, now this happened post-merge and diligence had lessened, having the expertise to spot this in new dma-buf/fence users is why we insist on having access to way more than just the 1000 line kernel driver submission. > I will be glad to not accept any more, but as I say above, what are the > new requirements going to be so that those companies that do want to > submit their code know what to do? I'm proposing a patch for documentation that maintainers can sign up for (it's mentioned in the ksummit thread). > And what exactly are we using as a definition of an accelerator? We > have networking cards that are "accelerators" as well as crypto > "accelerators" :) > > > > I expect we'll have a proper discussion what the stack should look > > > like with the next submission (from a different vendor maybe), that > > > ship kinda sailed with habanalabs. > > Who is going to define this stack? As there is no industry standard, > > why would we define this? Because someone has to help, saying yes isn't helping, it's enabling bad behaviour. Parenting and maintaining both involve saying No for the future prosperity of the ecosystem. Dave. 
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Accelerator drivers going forward (was Re: Habanalabs Open-Source TPC LLVM compiler and SynapseAI Core library) 2021-09-12 19:32 ` Dave Airlie @ 2021-09-14 8:42 ` Oded Gabbay -1 siblings, 0 replies; 15+ messages in thread From: Oded Gabbay @ 2021-09-14 8:42 UTC (permalink / raw) To: Dave Airlie Cc: Greg Kroah-Hartman, Daniel Vetter, dri-devel, mzuckerman, dsinger, Linus Torvalds, Jason Gunthorpe, Linux-Kernel@Vger. Kernel. Org On Sun, Sep 12, 2021 at 10:32 PM Dave Airlie <airlied@gmail.com> wrote: > > On Sun, 12 Sept 2021 at 23:55, Greg Kroah-Hartman > <gregkh@linuxfoundation.org> wrote: > > > > On Fri, Sep 10, 2021 at 06:10:27PM +0200, Daniel Vetter wrote: > > > Forgot to add dri-devel. > > > > > > On Fri, Sep 10, 2021 at 6:09 PM Daniel Vetter <daniel.vetter@ffwll.ch> wrote: > > > > > > > > On Fri, Sep 10, 2021 at 9:58 AM Greg Kroah-Hartman > > > > <gregkh@linuxfoundation.org> wrote: > > > > > On Fri, Sep 10, 2021 at 10:26:56AM +0300, Oded Gabbay wrote: > > > > > > Hi Greg, > > > > > > > > > > > > Following our conversations a couple of months ago, I'm happy to tell you that > > > > > > Habanalabs has open-sourced its TPC (Tensor Processing Core) LLVM compiler, > > > > > > which is a fork of the LLVM open-source project. > > > > > > > > > > > > The project can be found on Habanalabs GitHub website at: > > > > > > https://github.com/HabanaAI/tpc_llvm > > > > > > > > > > > > There is a companion guide on how to write TPC kernels at: > > > > > > https://docs.habana.ai/en/latest/TPC_User_Guide/TPC_User_Guide.html > > > > > > > > > > That's great news, thanks for pushing for this and releasing it all! > > > > > > > > Yeah this is neat. > > > > > > > > There's still the problem that we spent the past 2.5 years pissing off > > > > a lot of people for an imo questionable political project, bypassing > > > > all the technical review and expertise. Now that the political > > > > nonsense is resolved I think we need to look at at least the technical > > > > cleanup. 
The angered people are much harder to fix, so let's maybe > > > > ignore that (or perhaps a ks topic, no idea, I'm honestly not super > > > > motivated to rehash this entire story again). Here's what I think we > > > > should do: > > > > > > > > - move drivers/misc/habanalabs under drivers/gpu/habanalabs and > > > > review/discussions on dri-devel > > > > Wait, why move into gpu? Are we going to do that for all hardware > > accelerators that we currently have in the kernel tree? > > > > We could just mv drivers/gpu drivers/accel if that helps your mental model here. > > > These things are not GPUs in the sense of them being "do some work and > > write out to a screen", which is what I would associate with a GPU (G > > does stand for "Graphical", right?) > > Neither are a lot of the gpu drivers, it's almost like we evolved the > subsystem in 20 years, > and the name got away from us. > > As an example: > etnaviv, panfrost, lima and vgem drivers have no display interfaces at > all. Nada, they do nothing except accelerate and use dma-buf to talk > to other drivers. > > > > Yes, GPUs can do things that some accelerators can do, but they can do > > things that accelerators can not do, and the other way around as well. > > I doubt you want all of the existing gpu drivers to be only treated as > > an "accelerator driver" now, as where would the logic that has to happen > > to get the bits out to a screen live? > > Don't care, totally doesn't matter if a driver is accelerator + > display, you could write in-driver buses if you wanted to abstract > this more, since internally most GPUs are just SoCs, the display and > accelerator pieces talk to power management, irqs and dma-buf like > functionality internally in the driver, the thing is for most GPUs > there is a single PCI device to bind to, so historically nobody has > seen the value in splitting them more or adding an in-driver bus for > one set of devices. 
> > > And since we have a long history of accepting accelerator drivers (I see > > some in our tree since 2018 at the least), and there is no common > > userspace collation trying to make a common userspace api, why do they > > have to live in the same place? What makes them common except for the > > fact that they use the kernel as a semi-dumb pipe to send work to and > > from a different processor? > > > > Look at drivers/misc/cxl/ and drivers/misc/ocxl and drivers/misc/uacce/ > > and drivers/misc/sgi-gru and drivers/misc/bcm-vk/ even drivers/misc/mei/ > > as that is an off-load engine we talk to, right? > > > > What about the drivers/fpga/ api we have, it handles accelerators as > > well. I'm sure we have many other examples in the kernel tree as well, > > I just did a quick look and found these. > > > > All the above accelerators do things in different ways because their > > hardware is different, so they need different user/kernel apis, right? > > How are we going to unify them? Who is going to unify them? > > > > So drivers/accel/ perhaps? I would be able to get rid of loads of > > drivers/misc/ code that way :) > > > > Who is going to be the new maintainer of this subsystem? > > We already said if we could get agreement on having things follow the > rules, then they can be merged under drm trees or we'd start a new > accel tree. > > The problem is the free-for-all merge with no barriers approach that > you and I believe Olof are campaigning for, doesn't seem to create > communities, it may create consulting or training opportunities for > the Linux Foundation, but thus far I don't see any communities. > > Graphics accelerator community exists because of and has itself > refined the rules over time. 
I don't think our rules will necessarily > work for other groups immediately but I think other groups need to > construct acceptable merge criteria beyond the kernel, and kernel > maintainers have to take more responsibility for saying no if they > don't have time for community building. > > > > So far they have all been going into drivers/misc/ because no one else > > stepped up to do the review of them except me. I would _LOVE_ the help > > here as I end up reviewing a new one every kernel release at the least, > > but companies do not seem to be willing to fund developers to be > > maintainers these days :( > > > > And yes, I have been reviewing the fpga code as well, even though they > > do have a good maintainer, as those patches flow through my tree due to > > historical reasons. I know the fpga developers would have loved some > > help with review of those patches. > > Lack of reviewing isn't the problem here, lack of responsibility for > creating a long term mess is. You are creating long term dumping > grounds for badly thought out stuff. Saying people keeping adding more > trash to my dump and it's overloading me is just the effect of having > created the dump with no rules to follow in the first place. > > > > > > > - review the dma-buf stuff on dri-devel and then land it through > > > > standard flows, not the gregk-misc bypass > > > > Are dma-bufs somehow required to be reviewed on dri-devel? As others > > have asked in the past, they are already being used in other subsystems > > (like IB) today, did those authors also get review there? > > Yes any use of dma-buf has to be cc'ed to dri-devel and linux-media > per MAINTAINERS > Hi Dave/Daniel, Now that we opened up the user-space compiler and provided a library with which you can load compiled kernels and run them, I've re-sent the two dma-buf patches to dri-devel and linux-media (and to specific people) on Sunday evening. Can you please help review them ? 
They already got reviewed by Christian and Jason on previous iterations and I fixed them according to their reviews so I believe they are fundamentally correct. Thanks, Oded > > > > If so, great, if not, that feels odd to me, as I am seeing lots of > > out-of-tree drivers start to use these structures, which is why the api > > was created (to stop the roll-your-own-implementations.) Does dri-devel > > want me to have those vendors cc: you all when those get submitted? > > Yes. MAINTAINERS has matching for this, are you not advising people to > use the proper submission techniques and thus bypassing that file? > > The reason is dma-buf and later by extension dma-fence can create > really bad problems for the kernel around memory management. > > https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html#indefinite-dma-fences > > When a driver is self contained and doesn't interact with other kernel > drivers nobody really has to care. However once a driver starts > interacting with other drivers in the kernel, a responsible maintainer > has to check that these new drivers aren't going to crap all over the > existing drivers and destabilise the kernel. Someone has to review the > hardware design to see if page faulting works or if preemption works > or a bunch of other gotchas. Someone has to review the userspace to > make sure it isn't doing knowingly bad things or making assumptions > based on the kernel driver doing bad things. > > The thing is we've had code merged into our in-tree i915 driver that > broke a bunch of these assumptions, and have had to spend a year > cleaning it out, now this happened post-merge and diligence had > lessened, having the expertise to spot this in new dma-buf/fence users > is why we insist on having access to way more than just the 1000 line > kernel driver submission. 
> > > > I will be glad to not accept any more, but as I say above, what are the > > new requirements going to be so that those companies that do want to > > submit their code know what to do? > > I'm proposing a patch for documentation that maintainers can sign up > for (it's mentioned in the ksummit thread). > > > And what exactly are we using as a definition of an accelerator? We > > have networking cards that are "accelerators" as well as crypto > > "accelerators" :) > > > > > > I expect we'll have a proper discussion what the stack should look > > > > like with the next submission (from a different vendor maybe), that > > > > ship kinda sailed with habanalabs. > > > > Who is going to define this stack? As there is no industry standard, > > why would we define this? > > Because someone has to help, saying yes isn't helping, it's enabling > bad behaviour. Parenting and maintaining both involve saying No for > the future prosperity of the ecosystem. > > Dave. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Habanalabs Open-Source TPC LLVM compiler and SynapseAI Core library 2021-09-10 7:58 ` Greg Kroah-Hartman 2021-09-10 16:09 ` Daniel Vetter @ 2021-10-27 6:53 ` Oded Gabbay 2021-10-28 7:38 ` Daniel Vetter 1 sibling, 1 reply; 15+ messages in thread From: Oded Gabbay @ 2021-10-27 6:53 UTC (permalink / raw) To: Greg Kroah-Hartman Cc: Linus Torvalds, Dave Airlie, Daniel Vetter, Jason Gunthorpe, Linux-Kernel@Vger. Kernel. Org On Fri, Sep 10, 2021 at 10:58 AM Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote: > > On Fri, Sep 10, 2021 at 10:26:56AM +0300, Oded Gabbay wrote: > > Hi Greg, > > > > Following our conversations a couple of months ago, I'm happy to tell you that > > Habanalabs has open-sourced its TPC (Tensor Processing Core) LLVM compiler, > > which is a fork of the LLVM open-source project. > > > > The project can be found on Habanalabs GitHub website at: > > https://github.com/HabanaAI/tpc_llvm > > > > There is a companion guide on how to write TPC kernels at: > > https://docs.habana.ai/en/latest/TPC_User_Guide/TPC_User_Guide.html > > That's great news, thanks for pushing for this and releasing it all! > > greg k-h Hi Greg, I would like to update that yesterday AWS launched new EC2 instances powered by the Gaudi accelerators. It is now in general availability, and anyone can launch an instance with those devices. Therefore, one can now take the upstream driver, hl-thunk, tpc llvm compiler and SynapseAI core and execute compute kernels on the Gaudi devices. I have verified this to be working with the driver in kernel 5.15-rc6. We are still missing the networking parts, but I hope to start upstreaming them in the next coming months. Thanks, Oded ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Habanalabs Open-Source TPC LLVM compiler and SynapseAI Core library 2021-10-27 6:53 ` Habanalabs Open-Source TPC LLVM compiler and SynapseAI Core library Oded Gabbay @ 2021-10-28 7:38 ` Daniel Vetter 2021-10-28 12:00 ` Oded Gabbay 0 siblings, 1 reply; 15+ messages in thread From: Daniel Vetter @ 2021-10-28 7:38 UTC (permalink / raw) To: Oded Gabbay Cc: Greg Kroah-Hartman, Linus Torvalds, Dave Airlie, Jason Gunthorpe, Linux-Kernel@Vger. Kernel. Org On Wed, Oct 27, 2021 at 8:53 AM Oded Gabbay <ogabbay@kernel.org> wrote: > > On Fri, Sep 10, 2021 at 10:58 AM Greg Kroah-Hartman > <gregkh@linuxfoundation.org> wrote: > > > > On Fri, Sep 10, 2021 at 10:26:56AM +0300, Oded Gabbay wrote: > > > Hi Greg, > > > > > > Following our conversations a couple of months ago, I'm happy to tell you that > > > Habanalabs has open-sourced its TPC (Tensor Processing Core) LLVM compiler, > > > which is a fork of the LLVM open-source project. > > > > > > The project can be found on Habanalabs GitHub website at: > > > https://github.com/HabanaAI/tpc_llvm > > > > > > There is a companion guide on how to write TPC kernels at: > > > https://docs.habana.ai/en/latest/TPC_User_Guide/TPC_User_Guide.html > > > > That's great news, thanks for pushing for this and releasing it all! > > > > greg k-h > > Hi Greg, > I would like to update that yesterday AWS launched new EC2 instances > powered by the Gaudi accelerators. It is now in general availability, > and anyone can launch an instance with those devices. > Therefore, one can now take the upstream driver, hl-thunk, tpc llvm > compiler and SynapseAI core and execute compute kernels on the Gaudi > devices. I have verified this to be working with the driver in kernel > 5.15-rc6. Nice! Now that the llvm part is open, any plans to upstream that? 
Years ago when amd upstreamed their backend there was the hope that llvm would grow some competent support for gpu style accelerator isa, but since for years now amd's the only backend that ever was merged it's stuck in a chicken-egg situation of upstream llvm complaining why amd backend has all these special requirements. And other accel backends (at least the gpu-style simd ones) not having a good path to upstream llvm since a lot of the infrastructure and understanding isn't there. Getting a 2nd accel backend into upstream llvm would be a huge step towards fixing this mess. As far as I know the only other open accel backend based on llvm is intel's igc (for intel gpus), and that one is such a massive fork that's been out of upstream llvm for so long that it's not going to land anytime soon, if ever (in it's current form at least). Once we do have an accel backend in upstream llvm we can finally start building a real stack here I think, so whomever is first will win quite some advantage I think. Cheers, Daniel > We are still missing the networking parts, but I hope to start > upstreaming them in the next coming months. > > Thanks, > Oded -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Habanalabs Open-Source TPC LLVM compiler and SynapseAI Core library 2021-10-28 7:38 ` Daniel Vetter @ 2021-10-28 12:00 ` Oded Gabbay 0 siblings, 0 replies; 15+ messages in thread From: Oded Gabbay @ 2021-10-28 12:00 UTC (permalink / raw) To: Daniel Vetter Cc: Greg Kroah-Hartman, Linus Torvalds, Dave Airlie, Jason Gunthorpe, Linux-Kernel@Vger. Kernel. Org On Thu, Oct 28, 2021 at 10:38 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote: > > On Wed, Oct 27, 2021 at 8:53 AM Oded Gabbay <ogabbay@kernel.org> wrote: > > > > On Fri, Sep 10, 2021 at 10:58 AM Greg Kroah-Hartman > > <gregkh@linuxfoundation.org> wrote: > > > > > > On Fri, Sep 10, 2021 at 10:26:56AM +0300, Oded Gabbay wrote: > > > > Hi Greg, > > > > > > > > Following our conversations a couple of months ago, I'm happy to tell you that > > > > Habanalabs has open-sourced its TPC (Tensor Processing Core) LLVM compiler, > > > > which is a fork of the LLVM open-source project. > > > > > > > > The project can be found on Habanalabs GitHub website at: > > > > https://github.com/HabanaAI/tpc_llvm > > > > > > > > There is a companion guide on how to write TPC kernels at: > > > > https://docs.habana.ai/en/latest/TPC_User_Guide/TPC_User_Guide.html > > > > > > That's great news, thanks for pushing for this and releasing it all! > > > > > > greg k-h > > > > Hi Greg, > > I would like to update that yesterday AWS launched new EC2 instances > > powered by the Gaudi accelerators. It is now in general availability, > > and anyone can launch an instance with those devices. > > Therefore, one can now take the upstream driver, hl-thunk, tpc llvm > > compiler and SynapseAI core and execute compute kernels on the Gaudi > > devices. I have verified this to be working with the driver in kernel > > 5.15-rc6. > > Nice! > > Now that the llvm part is open, any plans to upstream that? Years ago AFAIK, there were internal discussions about doing that and the decision was to pursue that goal somewhere in the future. 
Not sure how far in the future they were talking about... Having said that, I'm not at all involved on the compiler front, so I might have outdated information. If you want, I can connect you with the compiler group leader to discuss that with him. Oded > Years ago when amd upstreamed their backend there was the hope that llvm would > grow some competent support for gpu style accelerator isa, but since > for years now amd's the only backend that ever was merged it's stuck > in a chicken-egg situation of upstream llvm complaining why amd > backend has all these special requirements. And other accel backends > (at least the gpu-style simd ones) not having a good path to upstream > llvm since a lot of the infrastructure and understanding isn't there. > > Getting a 2nd accel backend into upstream llvm would be a huge step > towards fixing this mess. As far as I know the only other open accel > backend based on llvm is intel's igc (for intel gpus), and that one is > such a massive fork that's been out of upstream llvm for so long that > it's not going to land anytime soon, if ever (in its current form at > least). > > Once we do have an accel backend in upstream llvm we can finally start > building a real stack here I think, so whoever is first will win > quite some advantage I think. > > Cheers, Daniel > > > We are still missing the networking parts, but I hope to start > > upstreaming them in the next coming months. > > > > Thanks, > > Oded > > > > -- > Daniel Vetter > Software Engineer, Intel Corporation > http://blog.ffwll.ch ^ permalink raw reply [flat|nested] 15+ messages in thread
* RE: Habanalabs Open-Source TPC LLVM compiler and SynapseAI Core library 2021-09-10 7:26 Habanalabs Open-Source TPC LLVM compiler and SynapseAI Core library Oded Gabbay 2021-09-10 7:58 ` Greg Kroah-Hartman @ 2021-09-12 7:38 ` Michael Zuckerman 1 sibling, 0 replies; 15+ messages in thread From: Michael Zuckerman @ 2021-09-12 7:38 UTC (permalink / raw) To: Oded Gabbay, Greg Kroah-Hartman, Tzachi Cohen Cc: Doron Singer, Linus Torvalds, Dave Airlie, Daniel Vetter, Jason Gunthorpe, linux-kernel@vger.kernel.org Add @Tzachi Cohen -----Original Message----- From: Oded Gabbay <ogabbay@kernel.org> Sent: Friday, 10 September 2021 10:27 To: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Michael Zuckerman <mzuckerman@habana.ai>; Doron Singer <dsinger@habana.ai>; Linus Torvalds <torvalds@linux-foundation.org>; Dave Airlie <airlied@gmail.com>; Daniel Vetter <daniel.vetter@ffwll.ch>; Jason Gunthorpe <jgg@ziepe.ca>; linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org> Subject: Habanalabs Open-Source TPC LLVM compiler and SynapseAI Core library Hi Greg, Following our conversations a couple of months ago, I'm happy to tell you that Habanalabs has open-sourced its TPC (Tensor Processing Core) LLVM compiler, which is a fork of the LLVM open-source project.
The project can be found on Habanalabs GitHub website at: https://github.com/HabanaAI/tpc_llvm There is a companion guide on how to write TPC kernels at: https://docs.habana.ai/en/latest/TPC_User_Guide/TPC_User_Guide.html The guide details the TPC compute engine's architecture, how to write TPC kernels using the TPC-C language, etc. In addition, we have written a reference implementation of the SynapseAI API, called SynapseAI Core, and released its code under the MIT license to the open-source community at: https://github.com/HabanaAI/SynapseAI_Core SynapseAI Core contains all the necessary building blocks to run Deep Learning training on Gaudi, although not as optimized as the closed-source library. The project repository contains a couple of TPC kernels that implement basic DL operators. These kernels can serve as an example of how to implement more complex operators.
To work with the Gaudi device, the library calls the Habanalabs kernel driver uAPI through the already open-source hl-thunk library at: https://github.com/HabanaAI/hl-thunk Moreover, the library contains a few tests (and more will follow soon) that demonstrate how to use the SynapseAI API to run workloads which utilize the TPC engines on Gaudi devices. We provided a short readme that explains how to build and run the included tests. It is important to note we provided all the necessary APIs to connect this library to any Deep Learning frameworks by writing appropriate backends in the frameworks and by writing more TPC kernels to implement the different operators. Once the driver(s) for the Gaudi NIC ports will be upstreamed, this library may be used together with IBverbs to perform training on multiple Gaudi devices. Thanks, Oded ^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2021-10-28 12:01 UTC | newest] Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-09-10 7:26 Habanalabs Open-Source TPC LLVM compiler and SynapseAI Core library Oded Gabbay 2021-09-10 7:58 ` Greg Kroah-Hartman 2021-09-10 16:09 ` Daniel Vetter 2021-09-10 16:10 ` Daniel Vetter 2021-09-10 16:10 ` Daniel Vetter 2021-09-12 13:55 ` Accelerator drivers going forward (was Re: Habanalabs Open-Source TPC LLVM compiler and SynapseAI Core library) Greg Kroah-Hartman 2021-09-12 16:37 ` Simon Ser 2021-09-12 19:32 ` Dave Airlie 2021-09-12 19:32 ` Dave Airlie 2021-09-14 8:42 ` Oded Gabbay 2021-09-14 8:42 ` Oded Gabbay 2021-10-27 6:53 ` Habanalabs Open-Source TPC LLVM compiler and SynapseAI Core library Oded Gabbay 2021-10-28 7:38 ` Daniel Vetter 2021-10-28 12:00 ` Oded Gabbay 2021-09-12 7:38 ` Michael Zuckerman