dri-devel.lists.freedesktop.org archive mirror
 help / color / mirror / Atom feed
* Re: [RFC] Add BPF_PROG_TYPE_CGROUP_IOCTL
       [not found]               ` <20201103232805.6uq4zg3gdvw2iiki@ast-mbp.dhcp.thefacebook.com>
@ 2021-02-01 14:49                 ` Daniel Vetter
  2021-02-01 16:47                   ` Kenny Ho
  2021-02-01 16:51                   ` Kenny Ho
  0 siblings, 2 replies; 22+ messages in thread
From: Daniel Vetter @ 2021-02-01 14:49 UTC (permalink / raw)
  To: Alexei Starovoitov, Dave Airlie
  Cc: Song Liu, DRI Development, Daniel Borkmann, Kenny Ho,
	open list:CONTROL GROUP (CGROUP),
	Brian Welty, John Fastabend, Alexei Starovoitov, amd-gfx list,
	Linux-Fsdevel, Kenny Ho, Alexander Viro, Network Development,
	KP Singh, Yonghong Song, bpf, Andrii Nakryiko, Martin KaFai Lau,
	Alex Deucher

Adding gpu folks.

On Tue, Nov 03, 2020 at 03:28:05PM -0800, Alexei Starovoitov wrote:
> On Tue, Nov 03, 2020 at 05:57:47PM -0500, Kenny Ho wrote:
> > On Tue, Nov 3, 2020 at 4:04 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Tue, Nov 03, 2020 at 02:19:22PM -0500, Kenny Ho wrote:
> > > > On Tue, Nov 3, 2020 at 12:43 AM Alexei Starovoitov
> > > > <alexei.starovoitov@gmail.com> wrote:
> > > > > On Mon, Nov 2, 2020 at 9:39 PM Kenny Ho <y2kenny@gmail.com> wrote:
> > >
> > > Sounds like either bpf_lsm needs to be made aware of cgv2 (which would
> > > be a great thing to have regardless) or cgroup-bpf needs a drm/gpu specific hook.
> > > I think generic ioctl hook is too broad for this use case.
> > > I suspect drm/gpu internal state would be easier to access inside
> > > bpf program if the hook is next to gpu/drm. At ioctl level there is 'file'.
> > > It's probably too abstract for the things you want to do.
> > > Like how VRAM/shader/etc can be accessed through file?
> > > Probably possible through a bunch of lookups and dereferences, but
> > > if the hook is custom to GPU that info is likely readily available.
> > > Then such cgroup-bpf check would be suitable in execution paths where
> > > ioctl-based hook would be too slow.
> > Just to clarify, when you say drm specific hook, did you mean just a
> > unique attach_type or a unique prog_type+attach_type combination?  (I
> > am still a bit fuzzy on when a new prog type is needed vs a new attach
> > type.  I think prog type is associated with a unique type of context
> > that the bpf prog will get but I could be missing some nuances.)
> > 
> > When I was thinking of doing an ioctl wide hook, the file would be the
> > device file and the thinking was to have a helper function provided by
> > device drivers to further disambiguate.  For our (AMD's) driver, we
> > have a bunch of ioctls for set/get/create/destroy
> > (https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c#L1763)
> > so the bpf prog can make the decision after the disambiguation.  For
> > example, we have an ioctl called "kfd_ioctl_set_cu_mask."  You can
> 
> Thanks for the pointer.
> That's one monster ioctl. So much copy_from_user.
> BPF prog would need to be sleepable to able to examine the args in such depth.
> After quick glance at the code I would put a new hook into
> kfd_ioctl() right before
> retcode = func(filep, process, kdata);
> At this point kdata is already copied from user space 
> and usize, that is cmd specific, is known.
> So bpf prog wouldn't need to copy that data again.
> That will save one copy.
> To drill into details of kfd_ioctl_set_cu_mask() the prog would
> need to be sleepable to do second copy_from_user of cu_mask.
> At least it's not that big.
> Yes, the attachment point will be amd driver specific,
> but the program doesn't need to be.
> It can be generic tracing prog that is agumented to use BTF.
> Something like writeable tracepoint with BTF support would do.
> So on the bpf side there will be minimal amount of changes.
> And in the driver you'll add one or few writeable tracepoints
> and the result of the tracepoint will gate
> retcode = func(filep, process, kdata);
> call in kfd_ioctl().
> The writeable tracepoint would need to be cgroup-bpf based.
> So that's the only tricky part. BPF infra doesn't have
> cgroup+tracepoint scheme. It's probably going to be useful
> in other cases like this. See trace_nbd_send_request.


Yeah I think this proposal doesn't work:

- inspecting ioctl arguments that need copying outside of the
  driver/subsystem doing that copying is fundamentally racy

- there's been a pile of cgroups proposal to manage gpus at the drm
  subsystem level, some by Kenny, and frankly this at least looks a bit
  like a quick hack to sidestep the consensus process for that.

So once we push this into drivers it's not going to be a bpf hook anymore
I think.

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] Add BPF_PROG_TYPE_CGROUP_IOCTL
  2021-02-01 14:49                 ` [RFC] Add BPF_PROG_TYPE_CGROUP_IOCTL Daniel Vetter
@ 2021-02-01 16:47                   ` Kenny Ho
  2021-02-01 16:51                   ` Kenny Ho
  1 sibling, 0 replies; 22+ messages in thread
From: Kenny Ho @ 2021-02-01 16:47 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Song Liu, Andrii Nakryiko, DRI Development, Daniel Borkmann,
	Kenny Ho, open list:CONTROL GROUP (CGROUP),
	Brian Welty, John Fastabend, Alexei Starovoitov, amd-gfx list,
	Martin KaFai Lau, Linux-Fsdevel, Alexander Viro,
	Network Development, KP Singh, Yonghong Song, bpf,
	Alexei Starovoitov, Alex Deucher


[-- Attachment #1.1: Type: text/plain, Size: 875 bytes --]

On Mon, Feb 1, 2021 at 9:49 AM Daniel Vetter <daniel@ffwll.ch> wrote:

>
> - there's been a pile of cgroups proposal to manage gpus at the drm
>   subsystem level, some by Kenny, and frankly this at least looks a bit
>   like a quick hack to sidestep the consensus process for that.
>
No Daniel, this is quick *draft* to get a conversation going.  Bpf was
actually a path suggested by Tejun back in 2018 so I think you are
mischaracterizing this quite a bit.

"2018-11-20 Kenny Ho:
To put the questions in more concrete terms, let say a user wants to
 expose certain part of a gpu to a particular cgroup similar to the
 way selective cpu cores are exposed to a cgroup via cpuset, how
 should we go about enabling such functionality?

2018-11-20 Tejun Heo:
Do what the intel driver or bpf is doing?  It's not difficult to hook
into cgroup for identification purposes."

Kenny

[-- Attachment #1.2: Type: text/html, Size: 1392 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] Add BPF_PROG_TYPE_CGROUP_IOCTL
  2021-02-01 14:49                 ` [RFC] Add BPF_PROG_TYPE_CGROUP_IOCTL Daniel Vetter
  2021-02-01 16:47                   ` Kenny Ho
@ 2021-02-01 16:51                   ` Kenny Ho
  2021-02-03 11:09                     ` Daniel Vetter
  1 sibling, 1 reply; 22+ messages in thread
From: Kenny Ho @ 2021-02-01 16:51 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Song Liu, Andrii Nakryiko, DRI Development, Daniel Borkmann,
	Kenny Ho, open list:CONTROL GROUP (CGROUP),
	Brian Welty, John Fastabend, Alexei Starovoitov, amd-gfx list,
	Martin KaFai Lau, Linux-Fsdevel, Alexander Viro,
	Network Development, KP Singh, Yonghong Song, bpf,
	Alexei Starovoitov, Alex Deucher

[Resent in plain text.]

On Mon, Feb 1, 2021 at 9:49 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> - there's been a pile of cgroups proposal to manage gpus at the drm
>   subsystem level, some by Kenny, and frankly this at least looks a bit
>   like a quick hack to sidestep the consensus process for that.
No Daniel, this is quick *draft* to get a conversation going.  Bpf was
actually a path suggested by Tejun back in 2018 so I think you are
mischaracterizing this quite a bit.

"2018-11-20 Kenny Ho:
To put the questions in more concrete terms, let say a user wants to
 expose certain part of a gpu to a particular cgroup similar to the
 way selective cpu cores are exposed to a cgroup via cpuset, how
 should we go about enabling such functionality?

2018-11-20 Tejun Heo:
Do what the intel driver or bpf is doing?  It's not difficult to hook
into cgroup for identification purposes."

Kenny
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] Add BPF_PROG_TYPE_CGROUP_IOCTL
  2021-02-01 16:51                   ` Kenny Ho
@ 2021-02-03 11:09                     ` Daniel Vetter
  2021-02-03 19:01                       ` Kenny Ho
  0 siblings, 1 reply; 22+ messages in thread
From: Daniel Vetter @ 2021-02-03 11:09 UTC (permalink / raw)
  To: Kenny Ho
  Cc: Song Liu, Andrii Nakryiko, DRI Development, Daniel Borkmann,
	Kenny Ho, open list:CONTROL GROUP (CGROUP),
	Brian Welty, John Fastabend, Alexei Starovoitov, amd-gfx list,
	Martin KaFai Lau, Linux-Fsdevel, Alexander Viro,
	Network Development, KP Singh, Yonghong Song, bpf,
	Alexei Starovoitov, Alex Deucher

On Mon, Feb 01, 2021 at 11:51:07AM -0500, Kenny Ho wrote:
> [Resent in plain text.]
> 
> On Mon, Feb 1, 2021 at 9:49 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> > - there's been a pile of cgroups proposal to manage gpus at the drm
> >   subsystem level, some by Kenny, and frankly this at least looks a bit
> >   like a quick hack to sidestep the consensus process for that.
> No Daniel, this is quick *draft* to get a conversation going.  Bpf was
> actually a path suggested by Tejun back in 2018 so I think you are
> mischaracterizing this quite a bit.
> 
> "2018-11-20 Kenny Ho:
> To put the questions in more concrete terms, let say a user wants to
>  expose certain part of a gpu to a particular cgroup similar to the
>  way selective cpu cores are exposed to a cgroup via cpuset, how
>  should we go about enabling such functionality?
> 
> 2018-11-20 Tejun Heo:
> Do what the intel driver or bpf is doing?  It's not difficult to hook
> into cgroup for identification purposes."

Yeah, but if you go full amd specific for this, you might as well have a
specific BPF hook which is called in amdgpu/kfd and returns you the CU
mask for a given cgroups (and figures that out however it pleases).

Not a generic framework which lets you build pretty much any possible
cgroups controller for anything else using BPF. Trying to filter anything
at the generic ioctl just doesn't feel like a great idea that's long term
maintainable. E.g. what happens if there's new uapi for command
submission/context creation and now your bpf filter isn't catching all
access anymore? If it's an explicit hook that explicitly computes the CU
mask, then we can add more checks as needed. With ioctl that's impossible.

Plus I'm also not sure whether that's really a good idea still, since if
cloud companies have to built their own bespoke container stuff for every
gpu vendor, that's quite a bad platform we're building. And "I'd like to
make sure my gpu is used fairly among multiple tenents" really isn't a
use-case that's specific to amd.

If this would be something very hw specific like cache assignment and
quality of service stuff or things like that, then vendor specific imo
makes sense. But for CU masks essentially we're cutting the compute
resources up in some way, and I kinda expect everyone with a gpu who cares
about isolating workloads with cgroups wants to do that.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] Add BPF_PROG_TYPE_CGROUP_IOCTL
  2021-02-03 11:09                     ` Daniel Vetter
@ 2021-02-03 19:01                       ` Kenny Ho
  2021-02-05 13:49                         ` Daniel Vetter
  0 siblings, 1 reply; 22+ messages in thread
From: Kenny Ho @ 2021-02-03 19:01 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Song Liu, Andrii Nakryiko, DRI Development, Daniel Borkmann,
	Kenny Ho, open list:CONTROL GROUP (CGROUP),
	Brian Welty, John Fastabend, Alexei Starovoitov, amd-gfx list,
	Martin KaFai Lau, Linux-Fsdevel, Alexander Viro,
	Network Development, KP Singh, Yonghong Song, bpf,
	Alexei Starovoitov, Alex Deucher

Daniel,

I will have to get back to you later on the details of this because my
head is currently context switched to some infrastructure and
Kubernetes/golang work, so I am having a hard time digesting what you
are saying.  I am new to the bpf stuff so this is about my own
learning as well as a conversation starter.  The high level goal here
is to have a path for flexibility via a bpf program.  Not just GPU or
DRM or CU mask, but devices making decisions via an operator-written
bpf-prog attached to a cgroup.  More inline.

On Wed, Feb 3, 2021 at 6:09 AM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Mon, Feb 01, 2021 at 11:51:07AM -0500, Kenny Ho wrote:
> > On Mon, Feb 1, 2021 at 9:49 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> > > - there's been a pile of cgroups proposal to manage gpus at the drm
> > >   subsystem level, some by Kenny, and frankly this at least looks a bit
> > >   like a quick hack to sidestep the consensus process for that.
> > No Daniel, this is quick *draft* to get a conversation going.  Bpf was
> > actually a path suggested by Tejun back in 2018 so I think you are
> > mischaracterizing this quite a bit.
> >
> > "2018-11-20 Kenny Ho:
> > To put the questions in more concrete terms, let say a user wants to
> >  expose certain part of a gpu to a particular cgroup similar to the
> >  way selective cpu cores are exposed to a cgroup via cpuset, how
> >  should we go about enabling such functionality?
> >
> > 2018-11-20 Tejun Heo:
> > Do what the intel driver or bpf is doing?  It's not difficult to hook
> > into cgroup for identification purposes."
>
> Yeah, but if you go full amd specific for this, you might as well have a
> specific BPF hook which is called in amdgpu/kfd and returns you the CU
> mask for a given cgroups (and figures that out however it pleases).
>
> Not a generic framework which lets you build pretty much any possible
> cgroups controller for anything else using BPF. Trying to filter anything
> at the generic ioctl just doesn't feel like a great idea that's long term
> maintainable. E.g. what happens if there's new uapi for command
> submission/context creation and now your bpf filter isn't catching all
> access anymore? If it's an explicit hook that explicitly computes the CU
> mask, then we can add more checks as needed. With ioctl that's impossible.
>
> Plus I'm also not sure whether that's really a good idea still, since if
> cloud companies have to built their own bespoke container stuff for every
> gpu vendor, that's quite a bad platform we're building. And "I'd like to
> make sure my gpu is used fairly among multiple tenents" really isn't a
> use-case that's specific to amd.

I don't understand what you are saying about containers here since
bpf-progs are not the same as container nor are they deployed from
inside a container (as far as I know, I am actually not sure how
bpf-cgroup works with higher level cloud orchestration since folks
like Docker just migrated to cgroup v2 very recently... I don't think
you can specify a bpf-prog to load as part of a k8s pod definition.)
That said, the bit I understand ("not sure whether that's really a
good idea....cloud companies have to built their own bespoke container
stuff for every gpu vendor...") is in fact the current status quo.  If
you look into some of the popular ML/AI-oriented containers/apps, you
will likely see things are mostly hardcoded to CUDA.  Since I work for
AMD, I wouldn't say that's a good thing but this is just the reality.
For Kubernetes at least (where my head is currently), the official
mechanisms are Device Plugins (I am the author for the one for AMD but
there are a few ones from Intel too, you can confirm with your
colleagues)  and Node Feature/Labels.  Kubernetes schedules
pod/container launched by users to the node/servers by the affinity of
the node resources/labels, and the resources/labels in the pod
specification created by the users.

> If this would be something very hw specific like cache assignment and
> quality of service stuff or things like that, then vendor specific imo
> makes sense. But for CU masks essentially we're cutting the compute
> resources up in some way, and I kinda expect everyone with a gpu who cares
> about isolating workloads with cgroups wants to do that.

Right, but isolating workloads is quality of service stuff and *how*
compute resources are cut up are vendor specific.

Anyway, as I said at the beginning of this reply, this is about
flexibility in support of the diversity of devices and architectures.
CU mask is simply a concrete example of hw diversity that a
bpf-program can encapsulate.  I can see this framework (a custom
program making decisions in a specific cgroup and device context) use
for other things as well.  It may even be useful within a vendor to
handle the diversity between SKUs.

Kenny
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] Add BPF_PROG_TYPE_CGROUP_IOCTL
  2021-02-03 19:01                       ` Kenny Ho
@ 2021-02-05 13:49                         ` Daniel Vetter
  2021-05-07  2:06                           ` Kenny Ho
  0 siblings, 1 reply; 22+ messages in thread
From: Daniel Vetter @ 2021-02-05 13:49 UTC (permalink / raw)
  To: Kenny Ho
  Cc: Song Liu, Andrii Nakryiko, DRI Development, Daniel Borkmann,
	Kenny Ho, open list:CONTROL GROUP (CGROUP),
	Brian Welty, John Fastabend, Alexei Starovoitov, amd-gfx list,
	Martin KaFai Lau, Linux-Fsdevel, Alexander Viro,
	Network Development, KP Singh, Yonghong Song, bpf,
	Alexei Starovoitov, Alex Deucher

Hi Kenny

On Wed, Feb 3, 2021 at 8:01 PM Kenny Ho <y2kenny@gmail.com> wrote:
>
> Daniel,
>
> I will have to get back to you later on the details of this because my
> head is currently context switched to some infrastructure and
> Kubernetes/golang work, so I am having a hard time digesting what you
> are saying.  I am new to the bpf stuff so this is about my own
> learning as well as a conversation starter.  The high level goal here
> is to have a path for flexibility via a bpf program.  Not just GPU or
> DRM or CU mask, but devices making decisions via an operator-written
> bpf-prog attached to a cgroup.  More inline.

If you have some pointers on this, I'm happy to do some reading and
learning too.

> On Wed, Feb 3, 2021 at 6:09 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> >
> > On Mon, Feb 01, 2021 at 11:51:07AM -0500, Kenny Ho wrote:
> > > On Mon, Feb 1, 2021 at 9:49 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> > > > - there's been a pile of cgroups proposal to manage gpus at the drm
> > > >   subsystem level, some by Kenny, and frankly this at least looks a bit
> > > >   like a quick hack to sidestep the consensus process for that.
> > > No Daniel, this is quick *draft* to get a conversation going.  Bpf was
> > > actually a path suggested by Tejun back in 2018 so I think you are
> > > mischaracterizing this quite a bit.
> > >
> > > "2018-11-20 Kenny Ho:
> > > To put the questions in more concrete terms, let say a user wants to
> > >  expose certain part of a gpu to a particular cgroup similar to the
> > >  way selective cpu cores are exposed to a cgroup via cpuset, how
> > >  should we go about enabling such functionality?
> > >
> > > 2018-11-20 Tejun Heo:
> > > Do what the intel driver or bpf is doing?  It's not difficult to hook
> > > into cgroup for identification purposes."
> >
> > Yeah, but if you go full amd specific for this, you might as well have a
> > specific BPF hook which is called in amdgpu/kfd and returns you the CU
> > mask for a given cgroups (and figures that out however it pleases).
> >
> > Not a generic framework which lets you build pretty much any possible
> > cgroups controller for anything else using BPF. Trying to filter anything
> > at the generic ioctl just doesn't feel like a great idea that's long term
> > maintainable. E.g. what happens if there's new uapi for command
> > submission/context creation and now your bpf filter isn't catching all
> > access anymore? If it's an explicit hook that explicitly computes the CU
> > mask, then we can add more checks as needed. With ioctl that's impossible.
> >
> > Plus I'm also not sure whether that's really a good idea still, since if
> > cloud companies have to built their own bespoke container stuff for every
> > gpu vendor, that's quite a bad platform we're building. And "I'd like to
> > make sure my gpu is used fairly among multiple tenents" really isn't a
> > use-case that's specific to amd.
>
> I don't understand what you are saying about containers here since
> bpf-progs are not the same as container nor are they deployed from
> inside a container (as far as I know, I am actually not sure how
> bpf-cgroup works with higher level cloud orchestration since folks
> like Docker just migrated to cgroup v2 very recently... I don't think
> you can specify a bpf-prog to load as part of a k8s pod definition.)
> That said, the bit I understand ("not sure whether that's really a
> good idea....cloud companies have to built their own bespoke container
> stuff for every gpu vendor...") is in fact the current status quo.  If
> you look into some of the popular ML/AI-oriented containers/apps, you
> will likely see things are mostly hardcoded to CUDA.  Since I work for
> AMD, I wouldn't say that's a good thing but this is just the reality.
> For Kubernetes at least (where my head is currently), the official
> mechanisms are Device Plugins (I am the author for the one for AMD but
> there are a few ones from Intel too, you can confirm with your
> colleagues)  and Node Feature/Labels.  Kubernetes schedules
> pod/container launched by users to the node/servers by the affinity of
> the node resources/labels, and the resources/labels in the pod
> specification created by the users.

Sure the current gpu compute ecosystem is pretty badly fragmented,
forcing higher levels (like containers, but also hpc runtimes, or
anything else) to paper over that with more plugins and abstraction
layers.

That's not really a good excuse that when we upstream these features,
that we should continue with the fragmentation.

> > If this would be something very hw specific like cache assignment and
> > quality of service stuff or things like that, then vendor specific imo
> > makes sense. But for CU masks essentially we're cutting the compute
> > resources up in some way, and I kinda expect everyone with a gpu who cares
> > about isolating workloads with cgroups wants to do that.
>
> Right, but isolating workloads is quality of service stuff and *how*
> compute resources are cut up are vendor specific.
>
> Anyway, as I said at the beginning of this reply, this is about
> flexibility in support of the diversity of devices and architectures.
> CU mask is simply a concrete example of hw diversity that a
> bpf-program can encapsulate.  I can see this framework (a custom
> program making decisions in a specific cgroup and device context) use
> for other things as well.  It may even be useful within a vendor to
> handle the diversity between SKUs.

So I agree that on one side CU mask can be used for low-level quality
of service guarantees (like the CLOS cache stuff on intel cpus as an
example), and that's going to be rather hw specific no matter what.

But my understanding of AMD's plans here is that CU mask is the only
thing you'll have to partition gpu usage in a multi-tenant environment
- whether that's cloud or also whether that's containing apps to make
sure the compositor can still draw the desktop (except for fullscreen
ofc) doesn't really matter I think. And since there's clearly a need
for more general (but necessarily less well-defined) gpu usage
controlling and accounting I don't think exposing just the CU mask is
a good idea. That just perpetuates the current fragmented landscape,
and I really don't see why it's not possible to have a generic "I want
50% of my gpu available for these 2 containers each" solution

Of course on top of that having a bfp hook in amd to do the fine
grained QOS assignement for e.g. embedded application which are very
carefully tuned, should still be possible. But that's on top, not as
the exclusive thing available.

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] Add BPF_PROG_TYPE_CGROUP_IOCTL
  2021-02-05 13:49                         ` Daniel Vetter
@ 2021-05-07  2:06                           ` Kenny Ho
  2021-05-07  8:59                             ` Daniel Vetter
  0 siblings, 1 reply; 22+ messages in thread
From: Kenny Ho @ 2021-05-07  2:06 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Song Liu, Andrii Nakryiko, DRI Development, Daniel Borkmann,
	Kenny Ho, open list:CONTROL GROUP (CGROUP),
	Brian Welty, John Fastabend, Alexei Starovoitov, amd-gfx list,
	Martin KaFai Lau, Linux-Fsdevel, Alexander Viro,
	Network Development, KP Singh, Yonghong Song, bpf,
	Alexei Starovoitov, Alex Deucher

Sorry for the late reply (I have been working on other stuff.)

On Fri, Feb 5, 2021 at 8:49 AM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> So I agree that on one side CU mask can be used for low-level quality
> of service guarantees (like the CLOS cache stuff on intel cpus as an
> example), and that's going to be rather hw specific no matter what.
>
> But my understanding of AMD's plans here is that CU mask is the only
> thing you'll have to partition gpu usage in a multi-tenant environment
> - whether that's cloud or also whether that's containing apps to make
> sure the compositor can still draw the desktop (except for fullscreen
> ofc) doesn't really matter I think.
This is not correct.  Even in the original cgroup proposal, it
supports both mask and count as a way to define unit(s) of sub-device.
For AMD, we already have SRIOV that supports GPU partitioning in a
time-sliced-of-a-whole-GPU fashion.

Kenny

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] Add BPF_PROG_TYPE_CGROUP_IOCTL
  2021-05-07  2:06                           ` Kenny Ho
@ 2021-05-07  8:59                             ` Daniel Vetter
  2021-05-07 15:33                               ` Kenny Ho
  0 siblings, 1 reply; 22+ messages in thread
From: Daniel Vetter @ 2021-05-07  8:59 UTC (permalink / raw)
  To: Kenny Ho
  Cc: Song Liu, Andrii Nakryiko, DRI Development, Daniel Borkmann,
	Kenny Ho, open list:CONTROL GROUP (CGROUP),
	Brian Welty, John Fastabend, Alexei Starovoitov, amd-gfx list,
	Martin KaFai Lau, Linux-Fsdevel, Alexander Viro,
	Network Development, KP Singh, Yonghong Song, bpf,
	Alexei Starovoitov, Alex Deucher

On Thu, May 06, 2021 at 10:06:32PM -0400, Kenny Ho wrote:
> Sorry for the late reply (I have been working on other stuff.)
> 
> On Fri, Feb 5, 2021 at 8:49 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> >
> > So I agree that on one side CU mask can be used for low-level quality
> > of service guarantees (like the CLOS cache stuff on intel cpus as an
> > example), and that's going to be rather hw specific no matter what.
> >
> > But my understanding of AMD's plans here is that CU mask is the only
> > thing you'll have to partition gpu usage in a multi-tenant environment
> > - whether that's cloud or also whether that's containing apps to make
> > sure the compositor can still draw the desktop (except for fullscreen
> > ofc) doesn't really matter I think.
> This is not correct.  Even in the original cgroup proposal, it
> supports both mask and count as a way to define unit(s) of sub-device.
> For AMD, we already have SRIOV that supports GPU partitioning in a
> time-sliced-of-a-whole-GPU fashion.

Hm I missed that. I feel like time-sliced-of-a-whole gpu is the easier gpu
cgroups controler to get started, since it's much closer to other cgroups
that control bandwidth of some kind. Whether it's i/o bandwidth or compute
bandwidht is kinda a wash.

CU mask feels a lot more like an isolation/guaranteed forward progress
kind of thing, and I suspect that's always going to be a lot more gpu hw
specific than anything we can reasonably put into a general cgroups
controller.

Also for the time slice cgroups thing, can you pls give me pointers to
these old patches that had it, and how it's done? I very obviously missed
that part.

Thanks, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] Add BPF_PROG_TYPE_CGROUP_IOCTL
  2021-05-07  8:59                             ` Daniel Vetter
@ 2021-05-07 15:33                               ` Kenny Ho
  2021-05-07 16:13                                 ` Daniel Vetter
  0 siblings, 1 reply; 22+ messages in thread
From: Kenny Ho @ 2021-05-07 15:33 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Song Liu, Andrii Nakryiko, DRI Development, Daniel Borkmann,
	Kenny Ho, open list:CONTROL GROUP (CGROUP),
	Brian Welty, John Fastabend, Alexei Starovoitov, amd-gfx list,
	Martin KaFai Lau, Linux-Fsdevel, Alexander Viro,
	Network Development, KP Singh, Yonghong Song, bpf,
	Alexei Starovoitov, Alex Deucher

On Fri, May 7, 2021 at 4:59 AM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> Hm I missed that. I feel like time-sliced-of-a-whole gpu is the easier gpu
> cgroups controler to get started, since it's much closer to other cgroups
> that control bandwidth of some kind. Whether it's i/o bandwidth or compute
> bandwidht is kinda a wash.
sriov/time-sliced-of-a-whole gpu does not really need a cgroup
interface since each slice appears as a stand alone device.  This is
already in production (not using cgroup) with users.  The cgroup
proposal has always been parallel to that in many sense: 1) spatial
partitioning as an independent but equally valid use case as time
sharing, 2) sub-device resource control as opposed to full device
control motivated by the workload characterization paper.  It was
never about time vs space in terms of use cases but having new API for
users to be able to do spatial subdevice partitioning.

> CU mask feels a lot more like an isolation/guaranteed forward progress
> kind of thing, and I suspect that's always going to be a lot more gpu hw
> specific than anything we can reasonably put into a general cgroups
> controller.
The first half is correct but I disagree with the conclusion.  The
analogy I would use is multi-core CPU.  The capability of individual
CPU cores, core count and core arrangement may be hw specific but
there are general interfaces to support selection of these cores.  CU
mask may be hw specific but spatial partitioning as an idea is not.
Most gpu vendors have the concept of sub-device compute units (EU, SE,
etc.); OpenCL has the concept of subdevice in the language.  I don't
see any obstacle for vendors to implement spatial partitioning just
like many CPU vendors support the idea of multi-core.

> Also for the time slice cgroups thing, can you pls give me pointers to
> these old patches that had it, and how it's done? I very obviously missed
> that part.
I think you misunderstood what I wrote earlier.  The original proposal
was about spatial partitioning of subdevice resources not time sharing
using cgroup (since time sharing is already supported elsewhere.)

Kenny

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] Add BPF_PROG_TYPE_CGROUP_IOCTL
  2021-05-07 15:33                               ` Kenny Ho
@ 2021-05-07 16:13                                 ` Daniel Vetter
  2021-05-07 16:19                                   ` Alex Deucher
  0 siblings, 1 reply; 22+ messages in thread
From: Daniel Vetter @ 2021-05-07 16:13 UTC (permalink / raw)
  To: Kenny Ho
  Cc: Song Liu, Andrii Nakryiko, DRI Development, Daniel Borkmann,
	Kenny Ho, open list:CONTROL GROUP (CGROUP),
	Brian Welty, John Fastabend, Alexei Starovoitov, amd-gfx list,
	Martin KaFai Lau, Linux-Fsdevel, Alexander Viro,
	Network Development, KP Singh, Yonghong Song, bpf,
	Alexei Starovoitov, Alex Deucher

On Fri, May 07, 2021 at 11:33:46AM -0400, Kenny Ho wrote:
> On Fri, May 7, 2021 at 4:59 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> >
> > Hm I missed that. I feel like time-sliced-of-a-whole gpu is the easier gpu
> > cgroups controler to get started, since it's much closer to other cgroups
> > that control bandwidth of some kind. Whether it's i/o bandwidth or compute
> > bandwidht is kinda a wash.
> sriov/time-sliced-of-a-whole gpu does not really need a cgroup
> interface since each slice appears as a stand alone device.  This is
> already in production (not using cgroup) with users.  The cgroup
> proposal has always been parallel to that in many sense: 1) spatial
> partitioning as an independent but equally valid use case as time
> sharing, 2) sub-device resource control as opposed to full device
> control motivated by the workload characterization paper.  It was
> never about time vs space in terms of use cases but having new API for
> users to be able to do spatial subdevice partitioning.
> 
> > CU mask feels a lot more like an isolation/guaranteed forward progress
> > kind of thing, and I suspect that's always going to be a lot more gpu hw
> > specific than anything we can reasonably put into a general cgroups
> > controller.
> The first half is correct but I disagree with the conclusion.  The
> analogy I would use is multi-core CPU.  The capability of individual
> CPU cores, core count and core arrangement may be hw specific but
> there are general interfaces to support selection of these cores.  CU
> mask may be hw specific but spatial partitioning as an idea is not.
> Most gpu vendors have the concept of sub-device compute units (EU, SE,
> etc.); OpenCL has the concept of subdevice in the language.  I don't
> see any obstacle for vendors to implement spatial partitioning just
> like many CPU vendors support the idea of multi-core.
> 
> > Also for the time slice cgroups thing, can you pls give me pointers to
> > these old patches that had it, and how it's done? I very obviously missed
> > that part.
> I think you misunderstood what I wrote earlier.  The original proposal
> was about spatial partitioning of subdevice resources not time sharing
> using cgroup (since time sharing is already supported elsewhere.)

Well SRIOV time-sharing is for virtualization. cgroups is for
containerization, which is just virtualization but with less overhead and
more security bugs.

More or less.

So either I get things still wrong, or we'll get time-sharing for
virtualization, and partitioning of CU for containerization. That doesn't
make that much sense to me.

Since time-sharing is the first thing that's done for virtualization I
think it's probably also the most reasonable to start with for containers.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] Add BPF_PROG_TYPE_CGROUP_IOCTL
  2021-05-07 16:13                                 ` Daniel Vetter
@ 2021-05-07 16:19                                   ` Alex Deucher
  2021-05-07 16:26                                     ` Daniel Vetter
  0 siblings, 1 reply; 22+ messages in thread
From: Alex Deucher @ 2021-05-07 16:19 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Song Liu, Daniel Borkmann, Kenny Ho, Brian Welty, John Fastabend,
	Alexei Starovoitov, DRI Development, Alexei Starovoitov,
	Yonghong Song, KP Singh, Kenny Ho, amd-gfx list,
	Network Development, Linux-Fsdevel,
	open list:CONTROL GROUP (CGROUP),
	bpf, Andrii Nakryiko, Martin KaFai Lau, Alex Deucher,
	Alexander Viro

On Fri, May 7, 2021 at 12:13 PM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Fri, May 07, 2021 at 11:33:46AM -0400, Kenny Ho wrote:
> > On Fri, May 7, 2021 at 4:59 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> > >
> > > Hm I missed that. I feel like time-sliced-of-a-whole gpu is the easier gpu
> > > cgroups controler to get started, since it's much closer to other cgroups
> > > that control bandwidth of some kind. Whether it's i/o bandwidth or compute
> > > bandwidht is kinda a wash.
> > sriov/time-sliced-of-a-whole gpu does not really need a cgroup
> > interface since each slice appears as a stand alone device.  This is
> > already in production (not using cgroup) with users.  The cgroup
> > proposal has always been parallel to that in many sense: 1) spatial
> > partitioning as an independent but equally valid use case as time
> > sharing, 2) sub-device resource control as opposed to full device
> > control motivated by the workload characterization paper.  It was
> > never about time vs space in terms of use cases but having new API for
> > users to be able to do spatial subdevice partitioning.
> >
> > > CU mask feels a lot more like an isolation/guaranteed forward progress
> > > kind of thing, and I suspect that's always going to be a lot more gpu hw
> > > specific than anything we can reasonably put into a general cgroups
> > > controller.
> > The first half is correct but I disagree with the conclusion.  The
> > analogy I would use is multi-core CPU.  The capability of individual
> > CPU cores, core count and core arrangement may be hw specific but
> > there are general interfaces to support selection of these cores.  CU
> > mask may be hw specific but spatial partitioning as an idea is not.
> > Most gpu vendors have the concept of sub-device compute units (EU, SE,
> > etc.); OpenCL has the concept of subdevice in the language.  I don't
> > see any obstacle for vendors to implement spatial partitioning just
> > like many CPU vendors support the idea of multi-core.
> >
> > > Also for the time slice cgroups thing, can you pls give me pointers to
> > > these old patches that had it, and how it's done? I very obviously missed
> > > that part.
> > I think you misunderstood what I wrote earlier.  The original proposal
> > was about spatial partitioning of subdevice resources not time sharing
> > using cgroup (since time sharing is already supported elsewhere.)
>
> Well SRIOV time-sharing is for virtualization. cgroups is for
> containerization, which is just virtualization but with less overhead and
> more security bugs.
>
> More or less.
>
> So either I get things still wrong, or we'll get time-sharing for
> virtualization, and partitioning of CU for containerization. That doesn't
> make that much sense to me.

You could still potentially do SR-IOV for containerization.  You'd
just pass one of the PCI VFs (virtual functions) to the container and
you'd automatically get the time slice.  I don't see why cgroups would
be a factor there.

Alex

>
> Since time-sharing is the first thing that's done for virtualization I
> think it's probably also the most reasonable to start with for containers.
> -Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] Add BPF_PROG_TYPE_CGROUP_IOCTL
  2021-05-07 16:19                                   ` Alex Deucher
@ 2021-05-07 16:26                                     ` Daniel Vetter
  2021-05-07 16:31                                       ` Alex Deucher
  0 siblings, 1 reply; 22+ messages in thread
From: Daniel Vetter @ 2021-05-07 16:26 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Song Liu, Alexei Starovoitov, DRI Development,
	Alexei Starovoitov, Daniel Borkmann, Brian Welty, John Fastabend,
	amd-gfx list, Yonghong Song, Andrii Nakryiko, Linux-Fsdevel,
	Kenny Ho, Alexander Viro, KP Singh,
	open list:CONTROL GROUP (CGROUP),
	Kenny Ho, Network Development, Alex Deucher, bpf,
	Martin KaFai Lau

On Fri, May 07, 2021 at 12:19:13PM -0400, Alex Deucher wrote:
> On Fri, May 7, 2021 at 12:13 PM Daniel Vetter <daniel@ffwll.ch> wrote:
> >
> > On Fri, May 07, 2021 at 11:33:46AM -0400, Kenny Ho wrote:
> > > On Fri, May 7, 2021 at 4:59 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> > > >
> > > > Hm I missed that. I feel like time-sliced-of-a-whole gpu is the easier gpu
> > > > cgroups controler to get started, since it's much closer to other cgroups
> > > > that control bandwidth of some kind. Whether it's i/o bandwidth or compute
> > > > bandwidht is kinda a wash.
> > > sriov/time-sliced-of-a-whole gpu does not really need a cgroup
> > > interface since each slice appears as a stand alone device.  This is
> > > already in production (not using cgroup) with users.  The cgroup
> > > proposal has always been parallel to that in many sense: 1) spatial
> > > partitioning as an independent but equally valid use case as time
> > > sharing, 2) sub-device resource control as opposed to full device
> > > control motivated by the workload characterization paper.  It was
> > > never about time vs space in terms of use cases but having new API for
> > > users to be able to do spatial subdevice partitioning.
> > >
> > > > CU mask feels a lot more like an isolation/guaranteed forward progress
> > > > kind of thing, and I suspect that's always going to be a lot more gpu hw
> > > > specific than anything we can reasonably put into a general cgroups
> > > > controller.
> > > The first half is correct but I disagree with the conclusion.  The
> > > analogy I would use is multi-core CPU.  The capability of individual
> > > CPU cores, core count and core arrangement may be hw specific but
> > > there are general interfaces to support selection of these cores.  CU
> > > mask may be hw specific but spatial partitioning as an idea is not.
> > > Most gpu vendors have the concept of sub-device compute units (EU, SE,
> > > etc.); OpenCL has the concept of subdevice in the language.  I don't
> > > see any obstacle for vendors to implement spatial partitioning just
> > > like many CPU vendors support the idea of multi-core.
> > >
> > > > Also for the time slice cgroups thing, can you pls give me pointers to
> > > > these old patches that had it, and how it's done? I very obviously missed
> > > > that part.
> > > I think you misunderstood what I wrote earlier.  The original proposal
> > > was about spatial partitioning of subdevice resources not time sharing
> > > using cgroup (since time sharing is already supported elsewhere.)
> >
> > Well SRIOV time-sharing is for virtualization. cgroups is for
> > containerization, which is just virtualization but with less overhead and
> > more security bugs.
> >
> > More or less.
> >
> > So either I get things still wrong, or we'll get time-sharing for
> > virtualization, and partitioning of CU for containerization. That doesn't
> > make that much sense to me.
> 
> You could still potentially do SR-IOV for containerization.  You'd
> just pass one of the PCI VFs (virtual functions) to the container and
> you'd automatically get the time slice.  I don't see why cgroups would
> be a factor there.

Standard interface to manage that time-slicing. I guess for SRIOV it's all
vendor sauce (intel as guilty as anyone else from what I can see), but for
cgroups that feels like it's falling a bit short of what we should aim
for.

But dunno, maybe I'm just dreaming too much :-)
-Daniel

> Alex
> 
> >
> > Since time-sharing is the first thing that's done for virtualization I
> > think it's probably also the most reasonable to start with for containers.
> > -Daniel
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch
> > _______________________________________________
> > amd-gfx mailing list
> > amd-gfx@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/amd-gfx

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] Add BPF_PROG_TYPE_CGROUP_IOCTL
  2021-05-07 16:26                                     ` Daniel Vetter
@ 2021-05-07 16:31                                       ` Alex Deucher
  2021-05-07 16:50                                         ` Alex Deucher
  0 siblings, 1 reply; 22+ messages in thread
From: Alex Deucher @ 2021-05-07 16:31 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Song Liu, Daniel Borkmann, Kenny Ho, Brian Welty, John Fastabend,
	Alexei Starovoitov, DRI Development, Alexei Starovoitov,
	Yonghong Song, KP Singh, Kenny Ho, amd-gfx list,
	Network Development, Linux-Fsdevel,
	open list:CONTROL GROUP (CGROUP),
	bpf, Andrii Nakryiko, Martin KaFai Lau, Alex Deucher,
	Alexander Viro

On Fri, May 7, 2021 at 12:26 PM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Fri, May 07, 2021 at 12:19:13PM -0400, Alex Deucher wrote:
> > On Fri, May 7, 2021 at 12:13 PM Daniel Vetter <daniel@ffwll.ch> wrote:
> > >
> > > On Fri, May 07, 2021 at 11:33:46AM -0400, Kenny Ho wrote:
> > > > On Fri, May 7, 2021 at 4:59 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> > > > >
> > > > > Hm I missed that. I feel like time-sliced-of-a-whole gpu is the easier gpu
> > > > > cgroups controler to get started, since it's much closer to other cgroups
> > > > > that control bandwidth of some kind. Whether it's i/o bandwidth or compute
> > > > > bandwidht is kinda a wash.
> > > > sriov/time-sliced-of-a-whole gpu does not really need a cgroup
> > > > interface since each slice appears as a stand alone device.  This is
> > > > already in production (not using cgroup) with users.  The cgroup
> > > > proposal has always been parallel to that in many sense: 1) spatial
> > > > partitioning as an independent but equally valid use case as time
> > > > sharing, 2) sub-device resource control as opposed to full device
> > > > control motivated by the workload characterization paper.  It was
> > > > never about time vs space in terms of use cases but having new API for
> > > > users to be able to do spatial subdevice partitioning.
> > > >
> > > > > CU mask feels a lot more like an isolation/guaranteed forward progress
> > > > > kind of thing, and I suspect that's always going to be a lot more gpu hw
> > > > > specific than anything we can reasonably put into a general cgroups
> > > > > controller.
> > > > The first half is correct but I disagree with the conclusion.  The
> > > > analogy I would use is multi-core CPU.  The capability of individual
> > > > CPU cores, core count and core arrangement may be hw specific but
> > > > there are general interfaces to support selection of these cores.  CU
> > > > mask may be hw specific but spatial partitioning as an idea is not.
> > > > Most gpu vendors have the concept of sub-device compute units (EU, SE,
> > > > etc.); OpenCL has the concept of subdevice in the language.  I don't
> > > > see any obstacle for vendors to implement spatial partitioning just
> > > > like many CPU vendors support the idea of multi-core.
> > > >
> > > > > Also for the time slice cgroups thing, can you pls give me pointers to
> > > > > these old patches that had it, and how it's done? I very obviously missed
> > > > > that part.
> > > > I think you misunderstood what I wrote earlier.  The original proposal
> > > > was about spatial partitioning of subdevice resources not time sharing
> > > > using cgroup (since time sharing is already supported elsewhere.)
> > >
> > > Well SRIOV time-sharing is for virtualization. cgroups is for
> > > containerization, which is just virtualization but with less overhead and
> > > more security bugs.
> > >
> > > More or less.
> > >
> > > So either I get things still wrong, or we'll get time-sharing for
> > > virtualization, and partitioning of CU for containerization. That doesn't
> > > make that much sense to me.
> >
> > You could still potentially do SR-IOV for containerization.  You'd
> > just pass one of the PCI VFs (virtual functions) to the container and
> > you'd automatically get the time slice.  I don't see why cgroups would
> > be a factor there.
>
> Standard interface to manage that time-slicing. I guess for SRIOV it's all
> vendor sauce (intel as guilty as anyone else from what I can see), but for
> cgroups that feels like it's falling a bit short of what we should aim
> for.
>
> But dunno, maybe I'm just dreaming too much :-)

I don't disagree, I'm just not sure how it would apply to SR-IOV.
Once you've created the virtual functions, you've already created the
partitioning (regardless of whether it's spatial or temporal) so where
would cgroups come into play?

Alex

> -Daniel
>
> > Alex
> >
> > >
> > > Since time-sharing is the first thing that's done for virtualization I
> > > think it's probably also the most reasonable to start with for containers.
> > > -Daniel
> > > --
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > http://blog.ffwll.ch
> > > _______________________________________________
> > > amd-gfx mailing list
> > > amd-gfx@lists.freedesktop.org
> > > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] Add BPF_PROG_TYPE_CGROUP_IOCTL
  2021-05-07 16:31                                       ` Alex Deucher
@ 2021-05-07 16:50                                         ` Alex Deucher
  2021-05-07 16:54                                           ` Daniel Vetter
  0 siblings, 1 reply; 22+ messages in thread
From: Alex Deucher @ 2021-05-07 16:50 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Song Liu, Daniel Borkmann, Kenny Ho, Brian Welty, John Fastabend,
	Alexei Starovoitov, DRI Development, Alexei Starovoitov,
	Yonghong Song, KP Singh, Kenny Ho, amd-gfx list,
	Network Development, Linux-Fsdevel,
	open list:CONTROL GROUP (CGROUP),
	bpf, Andrii Nakryiko, Martin KaFai Lau, Alex Deucher,
	Alexander Viro

On Fri, May 7, 2021 at 12:31 PM Alex Deucher <alexdeucher@gmail.com> wrote:
>
> On Fri, May 7, 2021 at 12:26 PM Daniel Vetter <daniel@ffwll.ch> wrote:
> >
> > On Fri, May 07, 2021 at 12:19:13PM -0400, Alex Deucher wrote:
> > > On Fri, May 7, 2021 at 12:13 PM Daniel Vetter <daniel@ffwll.ch> wrote:
> > > >
> > > > On Fri, May 07, 2021 at 11:33:46AM -0400, Kenny Ho wrote:
> > > > > On Fri, May 7, 2021 at 4:59 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> > > > > >
> > > > > > Hm I missed that. I feel like time-sliced-of-a-whole gpu is the easier gpu
> > > > > > cgroups controler to get started, since it's much closer to other cgroups
> > > > > > that control bandwidth of some kind. Whether it's i/o bandwidth or compute
> > > > > > bandwidht is kinda a wash.
> > > > > sriov/time-sliced-of-a-whole gpu does not really need a cgroup
> > > > > interface since each slice appears as a stand alone device.  This is
> > > > > already in production (not using cgroup) with users.  The cgroup
> > > > > proposal has always been parallel to that in many sense: 1) spatial
> > > > > partitioning as an independent but equally valid use case as time
> > > > > sharing, 2) sub-device resource control as opposed to full device
> > > > > control motivated by the workload characterization paper.  It was
> > > > > never about time vs space in terms of use cases but having new API for
> > > > > users to be able to do spatial subdevice partitioning.
> > > > >
> > > > > > CU mask feels a lot more like an isolation/guaranteed forward progress
> > > > > > kind of thing, and I suspect that's always going to be a lot more gpu hw
> > > > > > specific than anything we can reasonably put into a general cgroups
> > > > > > controller.
> > > > > The first half is correct but I disagree with the conclusion.  The
> > > > > analogy I would use is multi-core CPU.  The capability of individual
> > > > > CPU cores, core count and core arrangement may be hw specific but
> > > > > there are general interfaces to support selection of these cores.  CU
> > > > > mask may be hw specific but spatial partitioning as an idea is not.
> > > > > Most gpu vendors have the concept of sub-device compute units (EU, SE,
> > > > > etc.); OpenCL has the concept of subdevice in the language.  I don't
> > > > > see any obstacle for vendors to implement spatial partitioning just
> > > > > like many CPU vendors support the idea of multi-core.
> > > > >
> > > > > > Also for the time slice cgroups thing, can you pls give me pointers to
> > > > > > these old patches that had it, and how it's done? I very obviously missed
> > > > > > that part.
> > > > > I think you misunderstood what I wrote earlier.  The original proposal
> > > > > was about spatial partitioning of subdevice resources not time sharing
> > > > > using cgroup (since time sharing is already supported elsewhere.)
> > > >
> > > > Well SRIOV time-sharing is for virtualization. cgroups is for
> > > > containerization, which is just virtualization but with less overhead and
> > > > more security bugs.
> > > >
> > > > More or less.
> > > >
> > > > So either I get things still wrong, or we'll get time-sharing for
> > > > virtualization, and partitioning of CU for containerization. That doesn't
> > > > make that much sense to me.
> > >
> > > You could still potentially do SR-IOV for containerization.  You'd
> > > just pass one of the PCI VFs (virtual functions) to the container and
> > > you'd automatically get the time slice.  I don't see why cgroups would
> > > be a factor there.
> >
> > Standard interface to manage that time-slicing. I guess for SRIOV it's all
> > vendor sauce (intel as guilty as anyone else from what I can see), but for
> > cgroups that feels like it's falling a bit short of what we should aim
> > for.
> >
> > But dunno, maybe I'm just dreaming too much :-)
>
> I don't disagree, I'm just not sure how it would apply to SR-IOV.
> Once you've created the virtual functions, you've already created the
> partitioning (regardless of whether it's spatial or temporal) so where
> would cgroups come into play?

For some background, the SR-IOV virtual functions show up like actual
PCI endpoints on the bus, so SR-IOV is sort of like cgroups
implemented in hardware.  When you enable SR-IOV, the endpoints that
are created are the partitions.

Alex

>
> Alex
>
> > -Daniel
> >
> > > Alex
> > >
> > > >
> > > > Since time-sharing is the first thing that's done for virtualization I
> > > > think it's probably also the most reasonable to start with for containers.
> > > > -Daniel
> > > > --
> > > > Daniel Vetter
> > > > Software Engineer, Intel Corporation
> > > > http://blog.ffwll.ch
> > > > _______________________________________________
> > > > amd-gfx mailing list
> > > > amd-gfx@lists.freedesktop.org
> > > > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> >
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] Add BPF_PROG_TYPE_CGROUP_IOCTL
  2021-05-07 16:50                                         ` Alex Deucher
@ 2021-05-07 16:54                                           ` Daniel Vetter
  2021-05-07 17:04                                             ` Kenny Ho
  2021-05-07 19:33                                             ` Tejun Heo
  0 siblings, 2 replies; 22+ messages in thread
From: Daniel Vetter @ 2021-05-07 16:54 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Song Liu, Alexei Starovoitov, DRI Development,
	Alexei Starovoitov, Daniel Borkmann, Brian Welty, John Fastabend,
	amd-gfx list, Yonghong Song, Andrii Nakryiko, Linux-Fsdevel,
	Kenny Ho, Alexander Viro, KP Singh,
	open list:CONTROL GROUP (CGROUP),
	Kenny Ho, Network Development, Alex Deucher, bpf,
	Martin KaFai Lau

On Fri, May 07, 2021 at 12:50:07PM -0400, Alex Deucher wrote:
> On Fri, May 7, 2021 at 12:31 PM Alex Deucher <alexdeucher@gmail.com> wrote:
> >
> > On Fri, May 7, 2021 at 12:26 PM Daniel Vetter <daniel@ffwll.ch> wrote:
> > >
> > > On Fri, May 07, 2021 at 12:19:13PM -0400, Alex Deucher wrote:
> > > > On Fri, May 7, 2021 at 12:13 PM Daniel Vetter <daniel@ffwll.ch> wrote:
> > > > >
> > > > > On Fri, May 07, 2021 at 11:33:46AM -0400, Kenny Ho wrote:
> > > > > > On Fri, May 7, 2021 at 4:59 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> > > > > > >
> > > > > > > Hm I missed that. I feel like time-sliced-of-a-whole gpu is the easier gpu
> > > > > > > cgroups controler to get started, since it's much closer to other cgroups
> > > > > > > that control bandwidth of some kind. Whether it's i/o bandwidth or compute
> > > > > > > bandwidht is kinda a wash.
> > > > > > sriov/time-sliced-of-a-whole gpu does not really need a cgroup
> > > > > > interface since each slice appears as a stand alone device.  This is
> > > > > > already in production (not using cgroup) with users.  The cgroup
> > > > > > proposal has always been parallel to that in many sense: 1) spatial
> > > > > > partitioning as an independent but equally valid use case as time
> > > > > > sharing, 2) sub-device resource control as opposed to full device
> > > > > > control motivated by the workload characterization paper.  It was
> > > > > > never about time vs space in terms of use cases but having new API for
> > > > > > users to be able to do spatial subdevice partitioning.
> > > > > >
> > > > > > > CU mask feels a lot more like an isolation/guaranteed forward progress
> > > > > > > kind of thing, and I suspect that's always going to be a lot more gpu hw
> > > > > > > specific than anything we can reasonably put into a general cgroups
> > > > > > > controller.
> > > > > > The first half is correct but I disagree with the conclusion.  The
> > > > > > analogy I would use is multi-core CPU.  The capability of individual
> > > > > > CPU cores, core count and core arrangement may be hw specific but
> > > > > > there are general interfaces to support selection of these cores.  CU
> > > > > > mask may be hw specific but spatial partitioning as an idea is not.
> > > > > > Most gpu vendors have the concept of sub-device compute units (EU, SE,
> > > > > > etc.); OpenCL has the concept of subdevice in the language.  I don't
> > > > > > see any obstacle for vendors to implement spatial partitioning just
> > > > > > like many CPU vendors support the idea of multi-core.
> > > > > >
> > > > > > > Also for the time slice cgroups thing, can you pls give me pointers to
> > > > > > > these old patches that had it, and how it's done? I very obviously missed
> > > > > > > that part.
> > > > > > I think you misunderstood what I wrote earlier.  The original proposal
> > > > > > was about spatial partitioning of subdevice resources not time sharing
> > > > > > using cgroup (since time sharing is already supported elsewhere.)
> > > > >
> > > > > Well SRIOV time-sharing is for virtualization. cgroups is for
> > > > > containerization, which is just virtualization but with less overhead and
> > > > > more security bugs.
> > > > >
> > > > > More or less.
> > > > >
> > > > > So either I get things still wrong, or we'll get time-sharing for
> > > > > virtualization, and partitioning of CU for containerization. That doesn't
> > > > > make that much sense to me.
> > > >
> > > > You could still potentially do SR-IOV for containerization.  You'd
> > > > just pass one of the PCI VFs (virtual functions) to the container and
> > > > you'd automatically get the time slice.  I don't see why cgroups would
> > > > be a factor there.
> > >
> > > Standard interface to manage that time-slicing. I guess for SRIOV it's all
> > > vendor sauce (intel as guilty as anyone else from what I can see), but for
> > > cgroups that feels like it's falling a bit short of what we should aim
> > > for.
> > >
> > > But dunno, maybe I'm just dreaming too much :-)
> >
> > I don't disagree, I'm just not sure how it would apply to SR-IOV.
> > Once you've created the virtual functions, you've already created the
> > partitioning (regardless of whether it's spatial or temporal) so where
> > would cgroups come into play?
> 
> For some background, the SR-IOV virtual functions show up like actual
> PCI endpoints on the bus, so SR-IOV is sort of like cgroups
> implemented in hardware.  When you enable SR-IOV, the endpoints that
> are created are the partitions.

Yeah I think we're massively agreeing right now :-)

SRIOV is kinda by design vendor specific. You set up the VF endpoint, it
shows up, it's all hw+fw magic. Nothing for cgroups to manage here at all.

All I meant is that for the container/cgroups world starting out with
time-sharing feels like the best fit, least because your SRIOV designers
also seem to think that's the best first cut for cloud-y computing.
Whether it's virtualized or containerized is a distinction that's getting
ever more blurry, with virtualization become a lot more dynamic and
container runtimes als possibly using hw virtualization underneath.
-Daniel

> 
> Alex
> 
> >
> > Alex
> >
> > > -Daniel
> > >
> > > > Alex
> > > >
> > > > >
> > > > > Since time-sharing is the first thing that's done for virtualization I
> > > > > think it's probably also the most reasonable to start with for containers.
> > > > > -Daniel
> > > > > --
> > > > > Daniel Vetter
> > > > > Software Engineer, Intel Corporation
> > > > > http://blog.ffwll.ch
> > > > > _______________________________________________
> > > > > amd-gfx mailing list
> > > > > amd-gfx@lists.freedesktop.org
> > > > > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> > >
> > > --
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] Add BPF_PROG_TYPE_CGROUP_IOCTL
  2021-05-07 16:54                                           ` Daniel Vetter
@ 2021-05-07 17:04                                             ` Kenny Ho
  2021-05-07 19:33                                             ` Tejun Heo
  1 sibling, 0 replies; 22+ messages in thread
From: Kenny Ho @ 2021-05-07 17:04 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Song Liu, KP Singh, Daniel Borkmann, Kenny Ho, Brian Welty,
	John Fastabend, Alexei Starovoitov, DRI Development,
	Alexei Starovoitov, Yonghong Song, Linux-Fsdevel, amd-gfx list,
	Network Development, open list:CONTROL GROUP (CGROUP),
	bpf, Andrii Nakryiko, Martin KaFai Lau, Alex Deucher,
	Alexander Viro

On Fri, May 7, 2021 at 12:54 PM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> SRIOV is kinda by design vendor specific. You set up the VF endpoint, it
> shows up, it's all hw+fw magic. Nothing for cgroups to manage here at all.
Right, so in theory you just use the device cgroup with the VF endpoints.

> All I meant is that for the container/cgroups world starting out with
> time-sharing feels like the best fit, least because your SRIOV designers
> also seem to think that's the best first cut for cloud-y computing.
> Whether it's virtualized or containerized is a distinction that's getting
> ever more blurry, with virtualization become a lot more dynamic and
> container runtimes als possibly using hw virtualization underneath.
I disagree.  By the same logic, the existence of CU mask would imply
it being the preferred way for sub-device control per process.

Kenny

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] Add BPF_PROG_TYPE_CGROUP_IOCTL
  2021-05-07 16:54                                           ` Daniel Vetter
  2021-05-07 17:04                                             ` Kenny Ho
@ 2021-05-07 19:33                                             ` Tejun Heo
  2021-05-07 19:55                                               ` Alex Deucher
  1 sibling, 1 reply; 22+ messages in thread
From: Tejun Heo @ 2021-05-07 19:33 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Song Liu, Alexei Starovoitov, DRI Development, Alex Deucher,
	Alexei Starovoitov, Daniel Borkmann, Brian Welty, John Fastabend,
	amd-gfx list, Yonghong Song, Andrii Nakryiko, Kenny Ho,
	Alexander Viro, KP Singh, open list:CONTROL GROUP (CGROUP),
	Kenny Ho, Network Development, Linux-Fsdevel, bpf,
	Martin KaFai Lau

Hello,

On Fri, May 07, 2021 at 06:54:13PM +0200, Daniel Vetter wrote:
> All I meant is that for the container/cgroups world starting out with
> time-sharing feels like the best fit, least because your SRIOV designers
> also seem to think that's the best first cut for cloud-y computing.
> Whether it's virtualized or containerized is a distinction that's getting
> ever more blurry, with virtualization become a lot more dynamic and
> container runtimes als possibly using hw virtualization underneath.

FWIW, I'm completely on the same boat. There are two fundamental issues with
hardware-mask based control - control granularity and work conservation.
Combined, they make it a significantly more difficult interface to use which
requires hardware-specific tuning rather than simply being able to say "I
wanna prioritize this job twice over that one".

My knoweldge of gpus is really limited but my understanding is also that the
gpu cores and threads aren't as homogeneous as the CPU counterparts across
the vendors, product generations and possibly even within a single chip,
which makes the problem even worse.

Given that GPUs are time-shareable to begin with, the most universal
solution seems pretty clear.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] Add BPF_PROG_TYPE_CGROUP_IOCTL
  2021-05-07 19:33                                             ` Tejun Heo
@ 2021-05-07 19:55                                               ` Alex Deucher
  2021-05-07 20:59                                                 ` Tejun Heo
  0 siblings, 1 reply; 22+ messages in thread
From: Alex Deucher @ 2021-05-07 19:55 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Song Liu, Alexei Starovoitov, DRI Development,
	Alexei Starovoitov, Daniel Borkmann, Brian Welty, John Fastabend,
	amd-gfx list, Yonghong Song, Andrii Nakryiko, Linux-Fsdevel,
	Kenny Ho, Alexander Viro, KP Singh,
	open list:CONTROL GROUP (CGROUP),
	Kenny Ho, Network Development, Alex Deucher, bpf,
	Martin KaFai Lau

On Fri, May 7, 2021 at 3:33 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Fri, May 07, 2021 at 06:54:13PM +0200, Daniel Vetter wrote:
> > All I meant is that for the container/cgroups world starting out with
> > time-sharing feels like the best fit, least because your SRIOV designers
> > also seem to think that's the best first cut for cloud-y computing.
> > Whether it's virtualized or containerized is a distinction that's getting
> > ever more blurry, with virtualization become a lot more dynamic and
> > container runtimes als possibly using hw virtualization underneath.
>
> FWIW, I'm completely on the same boat. There are two fundamental issues with
> hardware-mask based control - control granularity and work conservation.
> Combined, they make it a significantly more difficult interface to use which
> requires hardware-specific tuning rather than simply being able to say "I
> wanna prioritize this job twice over that one".
>
> My knoweldge of gpus is really limited but my understanding is also that the
> gpu cores and threads aren't as homogeneous as the CPU counterparts across
> the vendors, product generations and possibly even within a single chip,
> which makes the problem even worse.
>
> Given that GPUs are time-shareable to begin with, the most universal
> solution seems pretty clear.

The problem is temporal partitioning on GPUs is much harder to enforce
unless you have a special case like SR-IOV.  Spatial partitioning, on
AMD GPUs at least, is widely available and easily enforced.  What is
the point of implementing temporal style cgroups if no one can enforce
it effectively?

Alex

>
> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] Add BPF_PROG_TYPE_CGROUP_IOCTL
  2021-05-07 19:55                                               ` Alex Deucher
@ 2021-05-07 20:59                                                 ` Tejun Heo
  2021-05-07 22:30                                                   ` Alex Deucher
  0 siblings, 1 reply; 22+ messages in thread
From: Tejun Heo @ 2021-05-07 20:59 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Song Liu, Alexei Starovoitov, DRI Development,
	Alexei Starovoitov, Daniel Borkmann, Brian Welty, John Fastabend,
	amd-gfx list, Yonghong Song, Andrii Nakryiko, Linux-Fsdevel,
	Kenny Ho, Alexander Viro, KP Singh,
	open list:CONTROL GROUP (CGROUP),
	Kenny Ho, Network Development, Alex Deucher, bpf,
	Martin KaFai Lau

Hello,

On Fri, May 07, 2021 at 03:55:39PM -0400, Alex Deucher wrote:
> The problem is temporal partitioning on GPUs is much harder to enforce
> unless you have a special case like SR-IOV.  Spatial partitioning, on
> AMD GPUs at least, is widely available and easily enforced.  What is
> the point of implementing temporal style cgroups if no one can enforce
> it effectively?

So, if generic fine-grained partitioning can't be implemented, the right
thing to do is stopping pushing for full-blown cgroup interface for it. The
hardware simply isn't capable of being managed in a way which allows generic
fine-grained hierarchical scheduling and there's no point in bloating the
interface with half baked hardware dependent features.

This isn't to say that there's no way to support them, but what have been
being proposed is way too generic and ambitious in terms of interface while
being poorly developed on the internal abstraction and mechanism front. If
the hardware can't do generic, either implement the barest minimum interface
(e.g. be a part of misc controller) or go driver-specific - the feature is
hardware specific anyway. I've repeated this multiple times in these
discussions now but it'd be really helpful to try to minimize the interace
while concentrating more on internal abstractions and actual control
mechanisms.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] Add BPF_PROG_TYPE_CGROUP_IOCTL
  2021-05-07 20:59                                                 ` Tejun Heo
@ 2021-05-07 22:30                                                   ` Alex Deucher
  2021-05-07 23:45                                                     ` Tejun Heo
  0 siblings, 1 reply; 22+ messages in thread
From: Alex Deucher @ 2021-05-07 22:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Song Liu, Alexei Starovoitov, DRI Development,
	Alexei Starovoitov, Daniel Borkmann, Brian Welty, John Fastabend,
	amd-gfx list, Yonghong Song, Andrii Nakryiko, Linux-Fsdevel,
	Kenny Ho, Alexander Viro, KP Singh,
	open list:CONTROL GROUP (CGROUP),
	Kenny Ho, Network Development, Alex Deucher, bpf,
	Martin KaFai Lau

On Fri, May 7, 2021 at 4:59 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Fri, May 07, 2021 at 03:55:39PM -0400, Alex Deucher wrote:
> > The problem is temporal partitioning on GPUs is much harder to enforce
> > unless you have a special case like SR-IOV.  Spatial partitioning, on
> > AMD GPUs at least, is widely available and easily enforced.  What is
> > the point of implementing temporal style cgroups if no one can enforce
> > it effectively?
>
> So, if generic fine-grained partitioning can't be implemented, the right
> thing to do is stopping pushing for full-blown cgroup interface for it. The
> hardware simply isn't capable of being managed in a way which allows generic
> fine-grained hierarchical scheduling and there's no point in bloating the
> interface with half baked hardware dependent features.
>
> This isn't to say that there's no way to support them, but what have been
> being proposed is way too generic and ambitious in terms of interface while
> being poorly developed on the internal abstraction and mechanism front. If
> the hardware can't do generic, either implement the barest minimum interface
> (e.g. be a part of misc controller) or go driver-specific - the feature is
> hardware specific anyway. I've repeated this multiple times in these
> discussions now but it'd be really helpful to try to minimize the interace
> while concentrating more on internal abstractions and actual control
> mechanisms.

Maybe we are speaking past each other.  I'm not following.  We got
here because a device specific cgroup didn't make sense.  With my
Linux user hat on, that makes sense.  I don't want to write code to a
bunch of device specific interfaces if I can avoid it.  But as for
temporal vs spatial partitioning of the GPU, the argument seems to be
a sort of hand-wavy one that both spatial and temporal partitioning
make sense on CPUs, but only temporal partitioning makes sense on
GPUs.  I'm trying to understand that assertion.  There are some GPUs
that can more easily be temporally partitioned and some that can be
more easily spatially partitioned.  It doesn't seem any different than
CPUs.

Alex

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] Add BPF_PROG_TYPE_CGROUP_IOCTL
  2021-05-07 22:30                                                   ` Alex Deucher
@ 2021-05-07 23:45                                                     ` Tejun Heo
  2021-05-11 15:48                                                       ` Alex Deucher
  0 siblings, 1 reply; 22+ messages in thread
From: Tejun Heo @ 2021-05-07 23:45 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Song Liu, Alexei Starovoitov, DRI Development,
	Alexei Starovoitov, Daniel Borkmann, Brian Welty, John Fastabend,
	amd-gfx list, Yonghong Song, Andrii Nakryiko, Linux-Fsdevel,
	Kenny Ho, Alexander Viro, KP Singh,
	open list:CONTROL GROUP (CGROUP),
	Kenny Ho, Network Development, Alex Deucher, bpf,
	Martin KaFai Lau

Hello,

On Fri, May 07, 2021 at 06:30:56PM -0400, Alex Deucher wrote:
> Maybe we are speaking past each other.  I'm not following.  We got
> here because a device specific cgroup didn't make sense.  With my
> Linux user hat on, that makes sense.  I don't want to write code to a
> bunch of device specific interfaces if I can avoid it.  But as for
> temporal vs spatial partitioning of the GPU, the argument seems to be
> a sort of hand-wavy one that both spatial and temporal partitioning
> make sense on CPUs, but only temporal partitioning makes sense on
> GPUs.  I'm trying to understand that assertion.  There are some GPUs

Spatial partitioning as implemented in cpuset isn't a desirable model. It's
there partly because it has historically been there. It doesn't really
require dynamic hierarchical distribution of anything and is more of a way
to batch-update per-task configuration, which is how it's actually
implemented. It's broken too in that it interferes with per-task affinity
settings. So, not exactly a good example to follow. In addition, this sort
of partitioning requires more hardware knowledge and GPUs are worse than
CPUs in that hardwares differ more.

Features like this are trivial to implement from userland side by making
per-process settings inheritable and restricting who can update the
settings.

> that can more easily be temporally partitioned and some that can be
> more easily spatially partitioned.  It doesn't seem any different than
> CPUs.

Right, it doesn't really matter how the resource is distributed. What
matters is how granular and generic the distribution can be. If gpus can
implement work-conserving proportional distribution, that's something which
is widely useful and inherently requires dynamic scheduling from kernel
side. If it's about setting per-vendor affinities, this is way too much
cgroup interface for a feature which can be easily implemented outside
cgroup. Just do per-process (or whatever handles gpus use) and confine their
configurations from cgroup side however way.

While the specific theme changes a bit, we're basically having the same
discussion with the same conclusion over the past however many months.
Hopefully, the point is clear by now.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] Add BPF_PROG_TYPE_CGROUP_IOCTL
  2021-05-07 23:45                                                     ` Tejun Heo
@ 2021-05-11 15:48                                                       ` Alex Deucher
  0 siblings, 0 replies; 22+ messages in thread
From: Alex Deucher @ 2021-05-11 15:48 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Song Liu, Alexei Starovoitov, DRI Development,
	Alexei Starovoitov, Daniel Borkmann, Brian Welty, John Fastabend,
	amd-gfx list, Yonghong Song, Andrii Nakryiko, Linux-Fsdevel,
	Kenny Ho, Alexander Viro, KP Singh,
	open list:CONTROL GROUP (CGROUP),
	Kenny Ho, Network Development, Alex Deucher, bpf,
	Martin KaFai Lau

On Fri, May 7, 2021 at 7:45 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Fri, May 07, 2021 at 06:30:56PM -0400, Alex Deucher wrote:
> > Maybe we are speaking past each other.  I'm not following.  We got
> > here because a device specific cgroup didn't make sense.  With my
> > Linux user hat on, that makes sense.  I don't want to write code to a
> > bunch of device specific interfaces if I can avoid it.  But as for
> > temporal vs spatial partitioning of the GPU, the argument seems to be
> > a sort of hand-wavy one that both spatial and temporal partitioning
> > make sense on CPUs, but only temporal partitioning makes sense on
> > GPUs.  I'm trying to understand that assertion.  There are some GPUs
>
> Spatial partitioning as implemented in cpuset isn't a desirable model. It's
> there partly because it has historically been there. It doesn't really
> require dynamic hierarchical distribution of anything and is more of a way
> to batch-update per-task configuration, which is how it's actually
> implemented. It's broken too in that it interferes with per-task affinity
> settings. So, not exactly a good example to follow. In addition, this sort
> of partitioning requires more hardware knowledge and GPUs are worse than
> CPUs in that hardwares differ more.
>
> Features like this are trivial to implement from userland side by making
> per-process settings inheritable and restricting who can update the
> settings.
>
> > that can more easily be temporally partitioned and some that can be
> > more easily spatially partitioned.  It doesn't seem any different than
> > CPUs.
>
> Right, it doesn't really matter how the resource is distributed. What
> matters is how granular and generic the distribution can be. If gpus can
> implement work-conserving proportional distribution, that's something which
> is widely useful and inherently requires dynamic scheduling from kernel
> side. If it's about setting per-vendor affinities, this is way too much
> cgroup interface for a feature which can be easily implemented outside
> cgroup. Just do per-process (or whatever handles gpus use) and confine their
> configurations from cgroup side however way.
>
> While the specific theme changes a bit, we're basically having the same
> discussion with the same conclusion over the past however many months.
> Hopefully, the point is clear by now.

Thanks, that helps a lot.

Alex

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2021-05-11 15:48 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20201007152355.2446741-1-Kenny.Ho@amd.com>
     [not found] ` <CAOWid-d=a1Q3R92s7GrzxWhXx7_dc8NQvQg7i7RYTVv3+jHxkQ@mail.gmail.com>
     [not found]   ` <20201103053244.khibmr66p7lhv7ge@ast-mbp.dhcp.thefacebook.com>
     [not found]     ` <CAOWid-eQSPru0nm8+Xo3r6C0pJGq+5r8mzM8BL2dgNn2c9mt2Q@mail.gmail.com>
     [not found]       ` <CAADnVQKuoZDB-Xga5STHdGSxvSP=B6jQ40kLdpL1u+J98bv65A@mail.gmail.com>
     [not found]         ` <CAOWid-czZphRz6Y-H3OcObKCH=bLLC3=bOZaSB-6YBE56+Qzrg@mail.gmail.com>
     [not found]           ` <20201103210418.q7hddyl7rvdplike@ast-mbp.dhcp.thefacebook.com>
     [not found]             ` <CAOWid-djQ_NRfCbOTnZQ-A8Pr7jMP7KuZEJDSsvzWkdw7qc=yA@mail.gmail.com>
     [not found]               ` <20201103232805.6uq4zg3gdvw2iiki@ast-mbp.dhcp.thefacebook.com>
2021-02-01 14:49                 ` [RFC] Add BPF_PROG_TYPE_CGROUP_IOCTL Daniel Vetter
2021-02-01 16:47                   ` Kenny Ho
2021-02-01 16:51                   ` Kenny Ho
2021-02-03 11:09                     ` Daniel Vetter
2021-02-03 19:01                       ` Kenny Ho
2021-02-05 13:49                         ` Daniel Vetter
2021-05-07  2:06                           ` Kenny Ho
2021-05-07  8:59                             ` Daniel Vetter
2021-05-07 15:33                               ` Kenny Ho
2021-05-07 16:13                                 ` Daniel Vetter
2021-05-07 16:19                                   ` Alex Deucher
2021-05-07 16:26                                     ` Daniel Vetter
2021-05-07 16:31                                       ` Alex Deucher
2021-05-07 16:50                                         ` Alex Deucher
2021-05-07 16:54                                           ` Daniel Vetter
2021-05-07 17:04                                             ` Kenny Ho
2021-05-07 19:33                                             ` Tejun Heo
2021-05-07 19:55                                               ` Alex Deucher
2021-05-07 20:59                                                 ` Tejun Heo
2021-05-07 22:30                                                   ` Alex Deucher
2021-05-07 23:45                                                     ` Tejun Heo
2021-05-11 15:48                                                       ` Alex Deucher

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).