* [Ksummit-discuss] [CORE TOPIC] Core Kernel support for Compute-Offload Devices
@ 2015-07-30 13:00 Joerg Roedel
  2015-07-30 13:31 ` David Woodhouse
                   ` (3 more replies)
  0 siblings, 4 replies; 27+ messages in thread
From: Joerg Roedel @ 2015-07-30 13:00 UTC (permalink / raw)
  To: ksummit-discuss

[
 The topic is highly technical and could be a tech topic. But it also
 touches multiple subsystems, so I decided to submit it as a core
 topic.
]

Across architectures and vendors, new devices are coming up for
offloading tasks from the CPUs. Most of these devices are capable of
operating on user address spaces.

Besides the commonalities there are important differences in the memory
model these devices offer. Some work only on system RAM, others come
with their own memory which may or may not be accessible by the CPU.

I'd like to discuss what support we need in the core kernel for these
devices. A probably incomplete list of open questions:

	(1) Do we need the concept of an off-CPU task in the kernel
	    together with a common interface to create and manage them
	    and probably a (collection of) batch scheduler(s) for these
	    tasks?

	(2) Changes in memory management for devices accessing user
	    address spaces:
	    
	    (2.1) How can we best support the different memory models
	          these devices support?
	    
	    (2.2) How do we handle the off-CPU users of an mm_struct?
	    
	    (2.3) How can we attach common state for off-CPU tasks to
	          mm_struct (and what needs to be in there)?

	(3) Does it make sense to implement automatic migration of
	    system memory to device memory (when available) and vice
	    versa? How do we decide what and when to migrate?

	(4) What features do we require in the hardware to support it
	    with a common interface?

I think it would be great if the kernel had a common interface for
these kinds of devices. Currently every vendor develops its own
interface with various hacks to work around core code behavior.

I am particularly interested in this topic because, on PCIe, newer
IOMMUs are often an integral part of supporting these devices
(ARM-SMMUv3, Intel VT-d with SVM, AMD IOMMUv2), so core work here will
also touch the IOMMU code.

Probably interested people (incomplete list):

	David Woodhouse
	Jesse Barnes
	Will Deacon
	Paul E. McKenney
	Rik van Riel
	Mel Gorman
	Andrea Arcangeli
	Christoph Lameter
	Jérôme Glisse

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Core Kernel support for Compute-Offload Devices
  2015-07-30 13:00 [Ksummit-discuss] [CORE TOPIC] Core Kernel support for Compute-Offload Devices Joerg Roedel
@ 2015-07-30 13:31 ` David Woodhouse
  2015-07-30 13:54   ` Joerg Roedel
  2015-07-30 22:32 ` Benjamin Herrenschmidt
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 27+ messages in thread
From: David Woodhouse @ 2015-07-30 13:31 UTC (permalink / raw)
  To: Joerg Roedel, ksummit-discuss

On Thu, 2015-07-30 at 15:00 +0200, Joerg Roedel wrote:
> [
>  The topic is highly technical and could be a tech topic. But it also
>  touches multiple subsystems, so I decided to submit it as a core
>  topic.
> ]
> 
> Across architectures and vendors, new devices are coming up for
> offloading tasks from the CPUs. Most of these devices are capable of
> operating on user address spaces.
> 
> Besides the commonalities there are important differences in the memory
> model these devices offer. Some work only on system RAM, others come
> with their own memory which may or may not be accessible by the CPU.
> 
> I'd like to discuss what support we need in the core kernel for these
> devices. A probably incomplete list of open questions:
> 
> 	(1) Do we need the concept of an off-CPU task in the kernel
> 	    together with a common interface to create and manage them
> 	    and probably a (collection of) batch scheduler(s) for these
> 	    tasks?
> 
> 	(2) Changes in memory management for devices accessing user
> 	    address spaces:
> 	    
> 	    (2.1) How can we best support the different memory models
> 	          these devices support?
> 	    
> 	    (2.2) How do we handle the off-CPU users of an mm_struct?
> 	    
> 	    (2.3) How can we attach common state for off-CPU tasks to
> 	          mm_struct (and what needs to be in there)?

And how do we handle the assignment of Address Space IDs? The AMD
implementation currently allows the PASID space to be managed
per-device, but I understand ARM systems handle the TLB shootdown
broadcasts in hardware and need the PASID that the device sees to be
identical to the ASID on the CPU's MMU? And there are reasons why we
might actually want that model on Intel systems too. I'm working on the
Intel SVM right now, and looking at a single-PASID-space model (partly
because the PASID tables have to be physically contiguous, and they can
be huge!).

> 	(3) Does it make sense to implement automatic migration of
> 	    system memory to device memory (when available) and vice
> 	    versa? How do we decide what and when to migrate?

This is quite a horrid one, but perhaps ties into generic NUMA
considerations — if a memory page is being frequently accessed by
something that it's far away from, can we move it to closer memory? 

The question is how we handle that. We do have Extended Accessed bits
in the Intel implementation of SVM that let us know that a given PTE
was used from a device. Although not *which* device, in cases where
there might be more than one.

> 	(4) What features do we require in the hardware to support it
> 	    with a common interface?
> 
> I think it would be great if the kernel had a common interface for
> these kinds of devices. Currently every vendor develops its own
> interface with various hacks to work around core code behavior.

Right. For now it's almost all internal on-chip stuff, so it's kind of
tolerable to have vendor-specific implementations. But we are starting
to see PCIe root ports which support the necessary TLP prefixes to
support SVM on discrete devices. And then it'll be really important to
have this working cross-platform.

-- 
dwmw2

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Core Kernel support for Compute-Offload Devices
  2015-07-30 13:31 ` David Woodhouse
@ 2015-07-30 13:54   ` Joerg Roedel
  2015-07-31 16:34     ` Jerome Glisse
  0 siblings, 1 reply; 27+ messages in thread
From: Joerg Roedel @ 2015-07-30 13:54 UTC (permalink / raw)
  To: David Woodhouse; +Cc: ksummit-discuss

On Thu, Jul 30, 2015 at 02:31:38PM +0100, David Woodhouse wrote:
> On Thu, 2015-07-30 at 15:00 +0200, Joerg Roedel wrote:
> > 	    (2.3) How can we attach common state for off-CPU tasks to
> > 	          mm_struct (and what needs to be in there)?
> 
> And how do we handle the assignment of Address Space IDs? The AMD
> implementation currently allows the PASID space to be managed
> per-device, but I understand ARM systems handle the TLB shootdown
> broadcasts in hardware and need the PASID that the device sees to be
> identical to the ASID on the CPU's MMU? And there are reasons why we
> might actually want that model on Intel systems too. I'm working on the
> Intel SVM right now, and looking at a single-PASID-space model (partly
> because the PASID tables have to be physically contiguous, and they can
> be huge!).

True, ASIDs would be one thing that needs to be attached to an
mm_struct, but I am also interested in what other platforms might need
here. For example, is there a better way to track these off-CPU users
than using mmu-notifiers?
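
For reference, a minimal sketch of the mmu-notifier approach using the
existing API; the my_device_* names are hypothetical driver hooks, not
existing code:

#include <linux/mmu_notifier.h>
#include <linux/slab.h>

struct my_device;				/* hypothetical device handle */
void my_device_flush_tlb(struct my_device *dev,
			 unsigned long start, unsigned long end);
void my_device_detach(struct my_device *dev);

struct my_device_mm {
	struct mmu_notifier	mn;
	struct my_device	*dev;
};

/* CPU page tables are about to change: shoot down the device TLB. */
static void my_dev_invalidate_range_start(struct mmu_notifier *mn,
					  struct mm_struct *mm,
					  unsigned long start,
					  unsigned long end)
{
	struct my_device_mm *dmm = container_of(mn, struct my_device_mm, mn);

	my_device_flush_tlb(dmm->dev, start, end);
}

/* The address space is going away: stop all device access to it. */
static void my_dev_release(struct mmu_notifier *mn, struct mm_struct *mm)
{
	struct my_device_mm *dmm = container_of(mn, struct my_device_mm, mn);

	my_device_detach(dmm->dev);
}

static const struct mmu_notifier_ops my_dev_mn_ops = {
	.invalidate_range_start	= my_dev_invalidate_range_start,
	.release		= my_dev_release,
};

/* Called when a process binds the device to its address space. */
int my_device_attach_mm(struct my_device *dev, struct mm_struct *mm)
{
	struct my_device_mm *dmm = kzalloc(sizeof(*dmm), GFP_KERNEL);

	if (!dmm)
		return -ENOMEM;
	dmm->dev = dev;
	dmm->mn.ops = &my_dev_mn_ops;
	return mmu_notifier_register(&dmm->mn, mm);
}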

> > 	(3) Does it make sense to implement automatic migration of
> > 	    system memory to device memory (when available) and vice
> > 	    versa? How do we decide what and when to migrate?
> 
> This is quite a horrid one, but perhaps ties into generic NUMA
> considerations — if a memory page is being frequently accessed by
> something that it's far away from, can we move it to closer memory?

Yeah, conceptually it is NUMA, so it might fit there. But the
difference from the current NUMA handling is that the device memory is
not always completely visible to the CPU, so I think some significant
changes are necessary to make this work.

Another idea is to handle migration like swapping. The difference from
real swapping is that it would not rely on the LRU lists but on the
device access patterns we measure.

> The question is how we handle that. We do have Extended Accessed bits
> in the Intel implementation of SVM that let us know that a given PTE
> was used from a device. Although not *which* device, in cases where
> there might be more than one.

One way would be to use separate page-tables for the devices (which,
on the other hand, somewhat contradicts the design of the hardware,
because it is designed to reuse CPU page-tables).

And I don't know which features other devices have (like the CAPI
devices on Power that Paul wrote about) to help in this decision.



	Joerg

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Core Kernel support for Compute-Offload Devices
  2015-07-30 13:00 [Ksummit-discuss] [CORE TOPIC] Core Kernel support for Compute-Offload Devices Joerg Roedel
  2015-07-30 13:31 ` David Woodhouse
@ 2015-07-30 22:32 ` Benjamin Herrenschmidt
  2015-08-01 16:10   ` Joerg Roedel
  2015-07-31 14:52 ` Rik van Riel
  2015-08-01 20:46 ` Arnd Bergmann
  3 siblings, 1 reply; 27+ messages in thread
From: Benjamin Herrenschmidt @ 2015-07-30 22:32 UTC (permalink / raw)
  To: Joerg Roedel; +Cc: ksummit-discuss

On Thu, 2015-07-30 at 15:00 +0200, Joerg Roedel wrote:
> [
>  The topic is highly technical and could be a tech topic. But it also
>  touches multiple subsystems, so I decided to submit it as a core
>  topic.
> ]
> 
> Across architectures and vendors, new devices are coming up for
> offloading tasks from the CPUs. Most of these devices are capable of
> operating on user address spaces.

There is cross-over with the proposed FPGA topic as well; for example,
CAPI devices are typically FPGAs that can operate on user address
spaces ;-)

> Besides the commonalities there are important differences in the memory
> model these devices offer. Some work only on system RAM, others come
> with their own memory which may or may not be accessible by the CPU.
> 
> I'd like to discuss what support we need in the core kernel for these
> devices. A probably incomplete list of open questions:

I would definitely like to attend this.

> 	(1) Do we need the concept of an off-CPU task in the kernel
> 	    together with a common interface to create and manage them
> 	    and probably a (collection of) batch scheduler(s) for these
> 	    tasks?

It might be interesting at least to clean up how we handle & account
page faults for these things. Scheduling is a different matter; for
CAPI, for example, the scheduling is entirely done in HW. For things
like GPUs, it's a mixture of HW and generally some kind of on-GPU
kernel, isn't it? Quite proprietary in any case. Back in the Cell days,
the kernel did schedule the SPUs, so this would have been a use case of
what you propose.

So I'd think that such an off-core scheduler, while a useful thing for
some of these devices, should be an optional component, i.e., the other
functionalities shouldn't necessarily depend on it.

> 	(2) Changes in memory management for devices accessing user
> 	    address spaces:
> 	    
> 	    (2.1) How can we best support the different memory models
> 	          these devices support?
> 	    
> 	    (2.2) How do we handle the off-CPU users of an mm_struct?
> 	    
> 	    (2.3) How can we attach common state for off-CPU tasks to
> 	          mm_struct (and what needs to be in there)?

Right. Some of these (GPUs, Mellanox) use the proposed HMM
infrastructure that Jerome Glisse has been developing, which hooks into
the existing MM, so he would be an interested party here. Some of them,
like CAPI (or more stuff I can't quite talk about just yet), will just
share the MMU data structures (direct access to the host page tables).

The refcounting of mm_struct comes to mind, but also dealing with the
tracking of which CPU accessed a given context (for example, on POWER,
with CAPI, we need to "upgrade" to global TLB invalidations even for
single-threaded apps if the context was used by such an accelerator).
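
A rough sketch of what that tracking could look like; the copro_count
field and the helpers are assumptions for illustration, not existing
powerpc code:

/*
 * Hypothetical: once an accelerator has attached to the mm, even a
 * single-threaded process must use broadcast (global) invalidations,
 * because the device's MMU sits on another agent.
 */
static inline bool mm_needs_global_tlb_inval(struct mm_struct *mm)
{
	/* Any coprocessor user forces a broadcast invalidation, no
	 * matter how many CPUs the mm is otherwise running on. */
	if (atomic_read(&mm->context.copro_count) > 0)
		return true;

	/* Otherwise fall back to the usual "is this mm live on another
	 * CPU?" check (arch-specific details omitted here). */
	return mm_is_running_elsewhere(mm);	/* hypothetical helper */
}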

> 	(3) Does it make sense to implement automatic migration of
> 	    system memory to device memory (when available) and vice
> 	    versa? How do we decide what and when to migrate?

Definitely a hot subject. I don't know if you have seen the "proposal"
that Paul McKenney posted a while back. This is in part what HMM does
for non-cache-coherent devices. There are lots of open questions for
cache-coherent ones, such as whether we should provide struct page for
them, how we keep normal kernel allocations off the device, etc. Ideas
like memory-only NUMA nodes with a large distance did crop up.

> 	(4) What features do we require in the hardware to support it
> 	    with a common interface?
>
> I think it would be great if the kernel had a common interface for
> these kinds of devices. Currently every vendor develops its own
> interface with various hacks to work around core code behavior.
> 
> I am particularly interested in this topic because, on PCIe, newer
> IOMMUs are often an integral part of supporting these devices
> (ARM-SMMUv3, Intel VT-d with SVM, AMD IOMMUv2), so core work here will
> also touch the IOMMU code.
> 
> Probably interested people (incomplete list):
> 
> 	David Woodhouse
> 	Jesse Barnes
> 	Will Deacon
> 	Paul E. McKenney
> 	Rik van Riel
> 	Mel Gorman
> 	Andrea Arcangeli
> 	Christoph Lameter
> 	Jérôme Glisse

Add me :)

Cheers,
Ben.

> _______________________________________________
> Ksummit-discuss mailing list
> Ksummit-discuss@lists.linuxfoundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/ksummit-discuss

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Core Kernel support for Compute-Offload Devices
  2015-07-30 13:00 [Ksummit-discuss] [CORE TOPIC] Core Kernel support for Compute-Offload Devices Joerg Roedel
  2015-07-30 13:31 ` David Woodhouse
  2015-07-30 22:32 ` Benjamin Herrenschmidt
@ 2015-07-31 14:52 ` Rik van Riel
  2015-07-31 16:13   ` Jerome Glisse
  2015-08-01 20:46 ` Arnd Bergmann
  3 siblings, 1 reply; 27+ messages in thread
From: Rik van Riel @ 2015-07-31 14:52 UTC (permalink / raw)
  To: ksummit-discuss, Jerome Glisse

On 07/30/2015 09:00 AM, Joerg Roedel wrote:

> 	(1) Do we need the concept of an off-CPU task in the kernel
> 	    together with a common interface to create and manage them
> 	    and probably a (collection of) batch scheduler(s) for these
> 	    tasks?

Given that some of these compute offload devices share the
same address space (mm_struct) as the threads running on
CPUs, it would be easiest if there was a reference on the
mm_struct for the threads that are running off-CPU.

I do not know if a generic scheduler would work, since
it is common to have N threads on compute devices all bound
to the same address space, etc.

Different devices might even require different schedulers, but having
a common data structure that pins the mm_struct, provides a place to
store state (like register contents), and has pointers to scheduler,
driver, and cleanup functions could be really useful.
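
Something along these lines, all names hypothetical:

#include <linux/list.h>
#include <linux/mm_types.h>

struct offload_task;

struct offload_task_ops {
	int  (*run)(struct offload_task *task);	    /* hand the task to the device    */
	void (*save)(struct offload_task *task);    /* snapshot device register state */
	void (*cleanup)(struct offload_task *task); /* free state, drop mm reference  */
};

struct offload_task {
	struct mm_struct		*mm;	   /* pinned via mm_users while off-CPU */
	void				*hw_state; /* saved register/context image      */
	const struct offload_task_ops	*ops;	   /* per-device scheduler/driver hooks */
	struct list_head		node;	   /* on a per-device run or wait list  */
};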

> 	(2) Changes in memory management for devices accessing user
> 	    address spaces:
> 	    
> 	    (2.1) How can we best support the different memory models
> 	          these devices support?
> 	    
> 	    (2.2) How do we handle the off-CPU users of an mm_struct?
> 	    
> 	    (2.3) How can we attach common state for off-CPU tasks to
> 	          mm_struct (and what needs to be in there)?

Jerome has a bunch of code for this already.

> 	(3) Does it make sense to implement automatic migration of
> 	    system memory to device memory (when available) and vice
> 	    versa? How do we decide what and when to migrate?

I believe he has looked at migration too, but not implemented
it yet.

If compute-offload devices are a kernel summit topic this year,
it would be useful to invite Jerome Glisse.

> 	(4) What features do we require in the hardware to support it
> 	    with a common interface?
> 
> I think it would be great if the kernel had a common interface for
> these kinds of devices. Currently every vendor develops its own
> interface with various hacks to work around core code behavior.
> 
> I am particularly interested in this topic because, on PCIe, newer
> IOMMUs are often an integral part of supporting these devices
> (ARM-SMMUv3, Intel VT-d with SVM, AMD IOMMUv2), so core work here will
> also touch the IOMMU code.
> 
> Probably interested people (incomplete list):
> 
> 	David Woodhouse
> 	Jesse Barnes
> 	Will Deacon
> 	Paul E. McKenney
> 	Rik van Riel
> 	Mel Gorman
> 	Andrea Arcangeli
> 	Christoph Lameter
> 	Jérôme Glisse

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Core Kernel support for Compute-Offload Devices
  2015-07-31 14:52 ` Rik van Riel
@ 2015-07-31 16:13   ` Jerome Glisse
  2015-08-01 15:57     ` Joerg Roedel
  0 siblings, 1 reply; 27+ messages in thread
From: Jerome Glisse @ 2015-07-31 16:13 UTC (permalink / raw)
  To: Rik van Riel; +Cc: ksummit-discuss

On Fri, Jul 31, 2015 at 10:52:34AM -0400, Rik van Riel wrote:
> On 07/30/2015 09:00 AM, Joerg Roedel wrote:
> 
> > 	(1) Do we need the concept of an off-CPU task in the kernel
> > 	    together with a common interface to create and manage them
> > 	    and probably a (collection of) batch scheduler(s) for these
> > 	    tasks?
> 
> Given that some of these compute offload devices share the
> same address space (mm_struct) as the threads running on
> CPUs, it would be easiest if there was a reference on the
> mm_struct for the threads that are running off-CPU.
> 
> I do not know if a generic scheduler would work, since
> it is common to have N threads on compute devices all bound
> to the same address space, etc.
> 
> Different devices might even require different schedulers,
> but having a common data structure that pins mm_struct,
> provides for a place to have state (like register content)
> stored, and has pointers to scheduler, driver, and cleanup
> functions could be really useful.

Kernel scheduling does not match what the hardware (today and
tomorrow) can do. You have to think in terms of 10,000 or 100,000
threads when it comes to GPUs (and I would not be surprised if a couple
of years down the road we reach a million threads).

With so many threads you do not want to stop them midway; what you
really want is to rush to completion so you never have to save their
state.

Hence scheduling here is different. On a GPU it is more about a queue
of several thousand threads, and you just move things up and down to
decide what needs to be executed first. Then the GPU has hardware
scheduling that constantly switches between active threads; this is
why memory latency is so well hidden on GPUs.

That being said, like Rik said, some common framework would probably
make sense, especially to keep some kind of fairness. But it is
definitely not the preempt-a-task-and-schedule-another-one model.

It is more: wait for the currently active threads of process A to
finish, then schedule a bunch of threads of process B.

> 
> > 	(2) Changes in memory management for devices accessing user
> > 	    address spaces:
> > 	    
> > 	    (2.1) How can we best support the different memory models
> > 	          these devices support?
> > 	    
> > 	    (2.2) How do we handle the off-CPU users of an mm_struct?
> > 	    
> > 	    (2.3) How can we attach common state for off-CPU tasks to
> > 	          mm_struct (and what needs to be in there)?
> 
> Jerome has a bunch of code for this already.

Yes, HMM is all about that. It is the first step to provide a common
framework inside the kernel (not only for GPUs but for any device that
wishes to transparently access a process address space).

> 
> > 	(3) Does it make sense to implement automatic migration of
> > 	    system memory to device memory (when available) and vice
> > 	    versa? How do we decide what and when to migrate?
> 
> I believe he has looked at migration too, but not implemented
> it yet.

I already implemented several versions of it and posted a couple of
them for review. You do not want automatic migration because the
kernel does not have enough information here.

The HMM design is to let the device driver decide; the device driver
can take clues from userspace and use any kind of heuristic to decide
what to migrate.
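
Purely as an illustration (this is not the actual HMM interface), the
split could look like this: userspace passes a hint, and the driver
applies its own policy. All names and the threshold are assumptions:

#include <linux/sizes.h>
#include <linux/types.h>

#define MY_GPU_PREFETCH_READ_MOSTLY	(1u << 0)	/* hypothetical flag */

struct my_gpu_prefetch_args {
	__u64 addr;	/* start of the range the GPU kernel will work on */
	__u64 len;	/* length in bytes                                 */
	__u32 flags;
};

struct my_gpu_ctx;					/* hypothetical context     */
int my_gpu_migrate_to_device(struct my_gpu_ctx *ctx,	/* hypothetical migrate hook */
			     u64 addr, u64 len);

/* Driver policy: only migrate large, device-written ranges; small or
 * read-mostly data keeps being accessed over the bus. */
static long my_gpu_ioctl_prefetch(struct my_gpu_ctx *ctx,
				  struct my_gpu_prefetch_args *args)
{
	if (args->len < SZ_2M || (args->flags & MY_GPU_PREFETCH_READ_MOSTLY))
		return 0;

	return my_gpu_migrate_to_device(ctx, args->addr, args->len);
}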

> 
> If compute-offload devices are a kernel summit topic this year,
> it would be useful to invite Jerome Glisse.

I would be happy to discuss this topic. I have worked on open source
GPU drivers for a long time, and the last couple of years I have spent
working on compute and how to integrate this into the kernel.


> > 	(4) What features do we require in the hardware to support it
> > 	    with a common interface?
> > 
> > I think it would be great if the kernel would have a common interface
> > for these kind of devices. Currently every vendor develops its own
> > interface with various hacks to work around core code behavior.
> > 
> > I am particularily interested in this topic because on PCIe newer IOMMUs
> > are often an integral part in supporting these devices (ARM-SMMUv3,
> > Intel VT-d with SVM, AMD IOMMUv2). so that core work here will also
> > touch the IOMMU code.
> > 
> > Probably (uncomplete list of) interested people:
> > 
> > 	David Woodhouse
> > 	Jesse Barnes
> > 	Will Deacon
> > 	Paul E. McKenney
> > 	Rik van Riel
> > 	Mel Gorman
> > 	Andrea Arcangeli
> > 	Christoph Lameter
> > 	Jérôme Glisse

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Core Kernel support for Compute-Offload Devices
  2015-07-30 13:54   ` Joerg Roedel
@ 2015-07-31 16:34     ` Jerome Glisse
  2015-08-03 18:51       ` David Woodhouse
  0 siblings, 1 reply; 27+ messages in thread
From: Jerome Glisse @ 2015-07-31 16:34 UTC (permalink / raw)
  To: Joerg Roedel; +Cc: ksummit-discuss

On Thu, Jul 30, 2015 at 13:54:40 UTC 2015, Joerg Roedel wrote:
> On Thu, Jul 30, 2015 at 02:31:38PM +0100, David Woodhouse wrote:
> > On Thu, 2015-07-30 at 15:00 +0200, Joerg Roedel wrote:
> > > 	    (2.3) How can we attach common state for off-CPU tasks to
> > > 	          mm_struct (and what needs to be in there)?
> > 
> > And how do we handle the assignment of Address Space IDs? The AMD
> > implementation currently allows the PASID space to be managed per
> > -device, but I understand ARM systems handle the TLB shootdown
> > broadcasts in hardware and need the PASID that the device sees to be
> > identical to the ASID on the CPU's MMU? And there are reasons why we
> > might actually want that model on Intel systems too. I'm working on the
> > Intel SVM right now, and looking at a single-PASID-space model (partly
> > because the PASID tables have to be physically contiguous, and they can
> > be huge!).
> 
> True, ASIDs would be one thing that needs to be attached to an
> mm_struct, but I am also interested in what other platforms might need
> here. For example, is there a better way to track these off-CPU users
> than using mmu-notifiers?

No, the ASID should not be associated with the mm_struct. There are
too few ASIDs to have enough of them; I think currently there are only
8 bits worth of ASID. So what happens is that the GPU device driver
schedules processes and recycles ASIDs as it does so.

Which means that ASIDs really need to be under device driver control.
As I explained in another mail, only the device driver knows how to
schedule things for a given device, and it is too hardware-specific to
be moved to common code.
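
A minimal sketch of that per-device recycling, assuming an 8-bit ASID
space; gpu_invalidate_asid() is a hypothetical device TLB flush, and a
real driver would also make sure the victim context is not currently
running:

#include <linux/list.h>
#include <linux/spinlock.h>

#define GPU_NR_ASIDS	256		/* 8 bits, per the example above */

void gpu_invalidate_asid(int asid);	/* hypothetical device TLB flush */

struct gpu_ctx {
	int			asid;	/* -1 while no ASID is assigned */
	struct list_head	lru_node;
};

struct gpu_asid_pool {
	spinlock_t		lock;
	int			next_free;
	struct list_head	lru;	/* contexts holding an ASID, oldest first */
};

/* Assign (or re-use) an ASID for @ctx before scheduling it on the GPU. */
static int gpu_asid_get(struct gpu_asid_pool *pool, struct gpu_ctx *ctx)
{
	spin_lock(&pool->lock);
	if (ctx->asid >= 0) {
		/* Still holds one: just mark it most recently scheduled. */
		list_move_tail(&ctx->lru_node, &pool->lru);
	} else if (pool->next_free < GPU_NR_ASIDS) {
		ctx->asid = pool->next_free++;
		list_add_tail(&ctx->lru_node, &pool->lru);
	} else {
		/* Pool exhausted: steal the least recently scheduled ASID. */
		struct gpu_ctx *victim = list_first_entry(&pool->lru,
						struct gpu_ctx, lru_node);

		list_del_init(&victim->lru_node);
		ctx->asid = victim->asid;
		victim->asid = -1;
		gpu_invalidate_asid(ctx->asid);
		list_add_tail(&ctx->lru_node, &pool->lru);
	}
	spin_unlock(&pool->lock);
	return ctx->asid;
}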


> > > 	(3) Does it make sense to implement automatic migration of
> > > 	    system memory to device memory (when available) and vice
> > > 	    versa? How do we decide what and when to migrate?
> > 
> > This is quite a horrid one, but perhaps ties into generic NUMA
> > considerations — if a memory page is being frequently accessed by
> > something that it's far away from, can we move it to closer memory?
>
> Yeah, conceptually it is NUMA, so it might fit there. But the
> difference from the current NUMA handling is that the device memory is
> not always completely visible to the CPU, so I think some significant
> changes are necessary to make this work.

My HMM patchset already handles all this for anonymous memory. I
showed a proof of concept for file-backed memory, but I am exploring
other methods for that.


> Another idea is to handle migration like swapping. The difference from
> real swapping is that it would not rely on the LRU lists but on the
> device access patterns we measure.
> 
> > The question is how we handle that. We do have Extended Accessed bits
> > in the Intel implementation of SVM that let us know that a given PTE
> > was used from a device. Although not *which* device, in cases where
> > there might be more than one.
> 
> One way would be to use separate page-tables for the devices (which,
> on the other hand, somewhat contradicts the design of the hardware,
> because it is designed to reuse CPU page-tables).

So HMM uses a separate page table for storing information relating to
migrated memory. Note that not all hardware reuses the CPU page table;
some hardware does not, and it is very much a platform thing.

> And I don't know which features other devices have (like the CAPI
> devices on Power that Paul wrote about) to help in this decission.

CAPI would not need special PTEs, as with CAPI the device memory is
accessible by the CPU as regular memory. Only platforms that cannot
offer this need some special handling. AFAICT x86 and ARM have nothing
planned to offer such a level of integration (though lately I have not
paid close attention to what new features the PCIe consortium is
discussing).

Joerg, I think you really want to take a look at my patchset to see
how I implemented this. I have been discussing this with AMD, Mellanox,
NVIDIA and a couple of other smaller specialized hw manufacturers.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Core Kernel support for Compute-Offload Devices
  2015-07-31 16:13   ` Jerome Glisse
@ 2015-08-01 15:57     ` Joerg Roedel
  2015-08-01 19:08       ` Jerome Glisse
  0 siblings, 1 reply; 27+ messages in thread
From: Joerg Roedel @ 2015-08-01 15:57 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: ksummit-discuss

On Fri, Jul 31, 2015 at 12:13:04PM -0400, Jerome Glisse wrote:
> Hence scheduling here is different. On a GPU it is more about a queue
> of several thousand threads, and you just move things up and down to
> decide what needs to be executed first. Then the GPU has hardware
> scheduling that constantly switches between active threads; this is
> why memory latency is so well hidden on GPUs.

That's why I wrote "batch" scheduler in the proposal. It's right that
it does not make sense to schedule out a GPU process, and some devices
do scheduling in hardware anyway.

But the Linux kernel still needs to decide which jobs are sent to the
offload device in which order, more like an io-scheduler.

There might be a compute job that only utilizes 60% of the device
resources, so the in-kernel scheduler could start another job there to
utilize the other 40%.

I think it's worth a discussion whether some common schedulers (like
for blk-io) make sense here too.
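
As a strawman for that discussion, a capacity-based batch queue could
look roughly like this; the structures, the percent-based capacity
model and the hooks are all assumptions, not existing code:

#include <linux/list.h>
#include <linux/spinlock.h>

struct offload_job {
	struct list_head	node;
	unsigned int		weight;		/* share of the device, in percent  */
	int (*submit)(struct offload_job *job);	/* driver hands the job to the HW   */
};

struct offload_queue {
	spinlock_t		lock;
	struct list_head	pending;	/* jobs not yet on the device, FIFO */
	unsigned int		capacity;	/* free device capacity, in percent */
};

/* Dispatch as many pending jobs as fit into the free capacity, so a 60%
 * job and a 40% job can run side by side. (Calling ->submit() under the
 * lock is a simplification for the sketch.) */
static void offload_queue_kick(struct offload_queue *q)
{
	struct offload_job *job, *tmp;

	spin_lock(&q->lock);
	list_for_each_entry_safe(job, tmp, &q->pending, node) {
		if (job->weight > q->capacity)
			continue;	/* maybe a smaller job further down fits */
		list_del(&job->node);
		q->capacity -= job->weight;
		job->submit(job);
	}
	spin_unlock(&q->lock);
}

/* Driver callback when a job completes: return its capacity and refill. */
static void offload_queue_complete(struct offload_queue *q,
				   struct offload_job *job)
{
	spin_lock(&q->lock);
	q->capacity += job->weight;
	spin_unlock(&q->lock);
	offload_queue_kick(q);
}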

> I already implemented several versions of it and posted a couple of
> them for review. You do not want automatic migration because the
> kernel does not have enough information here.

Some devices might provide that information, see the extended-access bit
of Intel VT-d.



	Joerg

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Core Kernel support for Compute-Offload Devices
  2015-07-30 22:32 ` Benjamin Herrenschmidt
@ 2015-08-01 16:10   ` Joerg Roedel
  0 siblings, 0 replies; 27+ messages in thread
From: Joerg Roedel @ 2015-08-01 16:10 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: ksummit-discuss

Hi Ben,

thanks for your thoughts.

On Fri, Jul 31, 2015 at 08:32:21AM +1000, Benjamin Herrenschmidt wrote:
> > Across architectures and vendors there are new devices coming up for
> > offloading tasks from the CPUs. Most of these devices are capable to
> > operate on user address spaces.
> 
> There is cross-over with the proposed FPGA topic as well; for example,
> CAPI devices are typically FPGAs that can operate on user address spaces ;-)

True, I was not sure how to put this into the proposal, as FPGAs are a
bit different from other compute-offload devices. GPUs take a kernel to
execute that is basically a piece of software, while FPGAs take a
hardware description which in the end might be able to execute its own
software. But there is overlap between the topics, that's right.

> So I'd think that such an off-core scheduler, while a useful thing for
> some of these devices, should be an optional component, ie, the other
> functionalities shouldn't necessarily depend on it.

Yes, of course. The scheduler(s) could be implemented as a library and
optionally be used by the device drivers.

> Right. Some of these (GPUs, Mellanox) use the proposed HMM
> infrastructure that Jerome Glisse has been developing, which hooks into
> the existing MM, so he would be an interested party here. Some of them,
> like CAPI (or more stuff I can't quite talk about just yet), will just
> share the MMU data structures (direct access to the host page tables).

Everything I am aware of, besides the hardware HMM targets, reuses the
CPU MMU structures :) For example, all three hardware implementations
of ATS/PRI/PASID I am aware of can share them, and, as you said, CAPI
on Power too.

But they also need to attach some state to mm_struct. As David already
said, there will be a need for global PASID allocation, for example.



	Joerg

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Core Kernel support for Compute-Offload Devices
  2015-08-01 15:57     ` Joerg Roedel
@ 2015-08-01 19:08       ` Jerome Glisse
  2015-08-03 16:02         ` Joerg Roedel
  0 siblings, 1 reply; 27+ messages in thread
From: Jerome Glisse @ 2015-08-01 19:08 UTC (permalink / raw)
  To: Joerg Roedel; +Cc: ksummit-discuss

On Sat, Aug 01, 2015 at 05:57:29PM +0200, Joerg Roedel wrote:
> On Fri, Jul 31, 2015 at 12:13:04PM -0400, Jerome Glisse wrote:
> > Hence scheduling here is different, on GPU it is more about
> > a queue of several thousand thread and you just move things
> > up and down on what need to be executed first. Then GPU have
> > hw scheduling that constantly switch btw active thread this
> > why memory latency is so well hidden on GPU.
> 
> That's why I wrote "batch" scheduler in the proposal. It's right that
> it does not make sense to schedule out a GPU process, and some devices
> do scheduling in hardware anyway.
> 
> But the Linux kernel still needs to decide which jobs are sent to the
> offload device in which order, more like an io-scheduler.
> 
> There might be a compute job that only utilizes 60% of the device
> resources, so the in-kernel scheduler could start another job there to
> utilize the other 40%.
> 
> I think it's worth a discussion whether some common schedulers (like
> for blk-io) make sense here too.

It is definitely worth a discussion, but I fear right now there is
little room for anything in the kernel. Hardware scheduling is done
almost 100% in hardware. The idea of a GPU is that you have 1000
compute units, but the hardware keeps track of 10,000 threads, and at
any point in time there is a huge probability that 1000 of those
10,000 threads are ready to compute something. So if a job is only
using 60% of the GPU, then the remaining 40% would automatically be
used by the next batch of threads. This is a simplification, as the
number of threads the hw can keep track of depends on several factors
and varies from one model to the other, even inside the same family
from the same manufacturer.

Where the kernel has control is over which command queues (today GPUs
have several command queues that run concurrently) can spawn threads
inside the GPU, and things like which queue gets priority over another
one. You even have mechanisms where you can "divide" the GPU among
queues (you assign a fraction of the GPU compute units to a particular
queue), though I expect this last one is vanishing.

Also note that many GPU manufacturers are pushing for userspace queues
(I think it is some Microsoft requirement), in which case the kernel
has even less control.

I agree that the blk-io design is probably the closest thing that might fit.


> > I already implemented several version of it and posted for review
> > couple of them. You do not want automatic migration because kernel
> > as not enough informations here.
> 
> Some devices might provide that information, see the extended-access bit
> of Intel VT-d.

This would be limited to integrated GPUs, and so far only on one
platform. My point was more that userspace has way more information to
make a good decision here. The userspace program is more likely to know
what part of the dataset is going to be repeatedly accessed by the GPU
threads.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Core Kernel support for Compute-Offload Devices
  2015-07-30 13:00 [Ksummit-discuss] [CORE TOPIC] Core Kernel support for Compute-Offload Devices Joerg Roedel
                   ` (2 preceding siblings ...)
  2015-07-31 14:52 ` Rik van Riel
@ 2015-08-01 20:46 ` Arnd Bergmann
  2015-08-03 16:10   ` Joerg Roedel
  2015-08-04 15:40   ` Christoph Lameter
  3 siblings, 2 replies; 27+ messages in thread
From: Arnd Bergmann @ 2015-08-01 20:46 UTC (permalink / raw)
  To: ksummit-discuss

On Thursday 30 July 2015 15:00:27 Joerg Roedel wrote:
> [
>  The topic is highly technical and could be a tech topic. But it also
>  touches multiple subsystems, so I decided to submit it as a core
>  topic.
> ]
> 
> Across architectures and vendors, new devices are coming up for
> offloading tasks from the CPUs. Most of these devices are capable of
> operating on user address spaces.
> 
> Besides the commonalities there are important differences in the memory
> model these devices offer. Some work only on system RAM, others come
> with their own memory which may or may not be accessible by the CPU.
> 
> I'd like to discuss what support we need in the core kernel for these
> devices. A probably incomplete list of open questions:
> 
> 	(1) Do we need the concept of an off-CPU task in the kernel
> 	    together with a common interface to create and manage them
> 	    and probably a (collection of) batch scheduler(s) for these
> 	    tasks?

I think we did this part right with the Cell SPUs 10 years ago: A
task is a task, and you just switch between running in user mode and
running on the offload engine through some syscall or ioctl.
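
For reference, the spufs model in userspace looked roughly like this;
spu_create() and spu_run() are the real spufs syscalls (powerpc only),
but error handling and the step that loads the SPU program image are
omitted:

#include <fcntl.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	unsigned int npc = 0;	/* SPU program counter: start of the image */
	unsigned int event = 0;
	int ctx, status;

	/* Create an SPU context; it shows up as a directory in spufs. */
	ctx = syscall(SYS_spu_create, "/spu/myjob", 0, 0755);

	/* ... mmap the context files and load the SPU program here ... */

	/* The calling thread now "runs on" the SPU: spu_run() blocks
	 * until the SPU program stops or needs kernel help. */
	status = syscall(SYS_spu_run, ctx, &npc, &event);
	printf("SPU stopped, status 0x%x\n", status);

	close(ctx);
	return 0;
}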

The part that got us into endless trouble, though, was trying to
satisfy two opposite requirements:

a) having the kernel schedule tasks automatically onto the offload
   engines and take care of context switches and placement, so you
   can do multi-user and multi-tasking processing on them.

b) getting the most performance out of the offload engines, by giving
   a single user total control over the placement and doing no
   scheduling in the kernel at all.

I would strongly recommend now that any new interface tries to do
only one of the two models, but does it right.

> I am particularly interested in this topic because, on PCIe, newer
> IOMMUs are often an integral part of supporting these devices
> (ARM-SMMUv3, Intel VT-d with SVM, AMD IOMMUv2), so core work here will
> also touch the IOMMU code.
> 
> Probably interested people (incomplete list):
> 
> 	David Woodhouse
> 	Jesse Barnes
> 	Will Deacon
> 	Paul E. McKenney
> 	Rik van Riel
> 	Mel Gorman
> 	Andrea Arcangeli
> 	Christoph Lameter
> 	Jérôme Glisse

Add me in as well,

	Arnd

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Core Kernel support for Compute-Offload Devices
  2015-08-01 19:08       ` Jerome Glisse
@ 2015-08-03 16:02         ` Joerg Roedel
  2015-08-03 18:28           ` Jerome Glisse
  0 siblings, 1 reply; 27+ messages in thread
From: Joerg Roedel @ 2015-08-03 16:02 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: ksummit-discuss

Hi Jerome,

On Sat, Aug 01, 2015 at 03:08:48PM -0400, Jerome Glisse wrote:
> It is definitely worth a discussion, but I fear right now there is
> little room for anything in the kernel. Hardware scheduling is done
> almost 100% in hardware. The idea of a GPU is that you have 1000
> compute units, but the hardware keeps track of 10,000 threads, and at
> any point in time there is a huge probability that 1000 of those
> 10,000 threads are ready to compute something. So if a job is only
> using 60% of the GPU, then the remaining 40% would automatically be
> used by the next batch of threads. This is a simplification, as the
> number of threads the hw can keep track of depends on several factors
> and varies from one model to the other, even inside the same family
> from the same manufacturer.

So the hardware schedules individual threads, that is right. But
still, as you say, there are limits on how many threads the hardware
can handle, which the device driver needs to take care of when
deciding which job will be sent to the offload device next. Same with
the priorities for the queues.

> > Some devices might provide that information, see the extended-access bit
> > of Intel VT-d.
> 
> This would be limited to integrated GPUs, and so far only on one
> platform. My point was more that userspace has way more information to
> make a good decision here. The userspace program is more likely to know
> what part of the dataset is going to be repeatedly accessed by the GPU
> threads.

Hmm, so what is the point of HMM then? If userspace is going to decide
which part of the address space the device needs, it could just copy
the data over (keeping the address space layout and thus the pointers
stable), and you would basically achieve the same without adding a lot
of code to memory management, no?


	Joerg

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Core Kernel support for Compute-Offload Devices
  2015-08-01 20:46 ` Arnd Bergmann
@ 2015-08-03 16:10   ` Joerg Roedel
  2015-08-03 19:23     ` Arnd Bergmann
  2015-08-04 15:40   ` Christoph Lameter
  1 sibling, 1 reply; 27+ messages in thread
From: Joerg Roedel @ 2015-08-03 16:10 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: ksummit-discuss

Hi Arnd,

On Sat, Aug 01, 2015 at 10:46:49PM +0200, Arnd Bergmann wrote:
> I think we did this part right with the Cell SPUs 10 years ago: A
> task is a task, and you just switch between running in user mode and
> running on the offload engine through some syscall or ioctl.

Do you mean that on Cell the offload is synchronous, so that a task that
schedules something to the SPU sleeps until the job there is done?

> The part that got us into endless trouble though was trying to
> satisfy two opposite requirements: 
> 
> a) having the kernel schedule tasks automatically onto the offload
>    engines and take care of context switches and placement, so you
>    can do multi-user and multi-tasking processing on them.
> 
> b) getting the most performance out of the offload engines, by giving
>    a single user total control over the placement and doing no
>    scheduling in the kernel at all.
> 
> I would strongly recommend now that any new interface tries to do
> only one of the two models, but does it right.

I think which approach makes more sense mostly depends on the
use-cases. An HPC environment would certainly want to have full control over
the placement and scheduling on the offload devices. A desktop
environment with typical applications that offload stuff to one or
multiple GPUs through optimized libraries would benefit more from
automatic scheduling and placement.

It is probably good to hear about the use-cases of the different
offload devices to make a good decision here.



	Joerg

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Core Kernel support for Compute-Offload Devices
  2015-08-03 16:02         ` Joerg Roedel
@ 2015-08-03 18:28           ` Jerome Glisse
  0 siblings, 0 replies; 27+ messages in thread
From: Jerome Glisse @ 2015-08-03 18:28 UTC (permalink / raw)
  To: Joerg Roedel; +Cc: ksummit-discuss

On Mon, Aug 03, 2015 at 06:02:03PM +0200, Joerg Roedel wrote:
> Hi Jerome,
> 
> On Sat, Aug 01, 2015 at 03:08:48PM -0400, Jerome Glisse wrote:
> > It is definitly worth a discussion but i fear right now there is little
> > room for anything in the kernel. Hardware scheduling is done is almost
> > 100% hardware. The idea of GPU is that you have 1000 compute unit but
> > the hardware keep track of 10000 threads and at any point in time there
> > is huge probability that 1000 of those 10000 threads are ready to compute
> > something. So if a job is only using 60% of the GPU then the remaining
> > 40% would automaticly be use by the next batch of threads. This is a
> > simplification as the number of thread the hw can keep track of depend
> > of several factor and vary from one model to the other even inside same
> > family of the same manufacturer.
> 
> So the hardware scheduled individual threads, that is right. But still,
> as you say, there are limits of how many threads the hardware can handle
> which the device driver needs to take care of, and decide which job will
> be sent to the offload device next. Same with the priorities for the
> queues.

What I was pointing to is that right now you do not have such
granularity of choice from the device driver's point of view. Right
now it is either let a command queue spawn threads or not; you either
stop a command queue or let it run. Though how and when you can stop a
queue varies: on some hw you can only stop it at an execution
boundary, i.e. if you have a packet in a command queue that requests
500k threads to be launched, you can only stop that queue once the
500k threads are launched, and you cannot stop it in the middle.

Given that some of those queues are programmed directly from
userspace, you cannot even force the queue to only schedule small
batches of threads (i.e. something like no more than 1000 threads per
command packet in the queue).

But newer hw is becoming more capable on that front.

> 
> > > Some devices might provide that information, see the extended-access bit
> > > of Intel VT-d.
> > 
> > This would be limited to integrated GPU and so far only on one platform.
> > My point was more that userspace have way more informations to make good
> > decision here. The userspace program is more likely to know what part of
> > the dataset gonna be repeatedly access by the GPU threads.
> 
> Hmm, so what is the point of HMM then? If userspace is going to decide
> which part of the address space the device needs, it could just copy
> the data over (keeping the address space layout and thus the pointers
> stable), and you would basically achieve the same without adding a lot
> of code to memory management, no?

Well, no, you cannot be "transparent" if you do it in userspace. Say
userspace decides to migrate; then the CPU will not be able to access
that memory, so you have to either PROT_NONE the range or unmap it.
This is not what we want. If we get a CPU access to memory that was
migrated to device memory, then we want to migrate it back (at the
very least one page of it) so the CPU can access it. We want this
migration back to be transparent from the process point of view, as if
the memory had been swapped to disk.

Even on hw where the CPU can access device memory properly
(maintaining CPU atomic operations, for instance, which is not the
case on PCIe), like with CAPI on powerpc, you either have to have
struct pages for the device memory or the kernel must know how to
handle those special ranges of memory.

So here HMM never makes any decision; it leaves that to the device
driver, which can gather more information from hw and userspace to
make the best decision. But it might get things wrong, or the
userspace program might do something stupid like trying to access a
data set with the CPU while the GPU is churning on it. Still, we do
not want CPU access to be handled as a fault or forbidden; when this
happens HMM will force a migration back to service the CPU page fault.
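
Purely as an illustration of that flow (none of these names are real
HMM entry points), the CPU-fault path conceptually does:

#include <linux/mm.h>

struct my_mirror;						/* all hypothetical */
struct my_mirror *my_mirror_from_vma(struct vm_area_struct *vma);
bool my_range_is_on_device(struct my_mirror *m, unsigned long addr);
int my_migrate_back(struct my_mirror *m, unsigned long addr, unsigned long len);

/* When the CPU faults on an address whose page currently lives in
 * device memory, ask the owning driver to migrate at least that page
 * back to system RAM, then let the fault retry, just like swap-in. */
static int migrate_back_on_cpu_fault(struct vm_area_struct *vma,
				     unsigned long addr)
{
	struct my_mirror *mirror = my_mirror_from_vma(vma);

	if (!my_range_is_on_device(mirror, addr))
		return VM_FAULT_SIGBUS;		/* not our page after all */

	/* The driver may pull a bigger chunk back while it is at it,
	 * but one page is the minimum to make the CPU access work. */
	if (my_migrate_back(mirror, addr & PAGE_MASK, PAGE_SIZE))
		return VM_FAULT_SIGBUS;

	return VM_FAULT_NOPAGE;	/* PTE now points at system RAM again */
}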


HMM also intends to provide more features that are not doable from
userspace, like exclusive write access on a range for the device so
that the device can perform atomic operations. Again, PCIe offers
limited atomic capabilities, so the only way to provide more advanced
atomic operations is to map things read-only for the CPU and other
devices while an atomic operation on a device is in progress.

Another feature is sharing device memory between different devices.
Some devices (not necessarily from the same manufacturer) can
communicate with one another and access one another's device memory.
When a range is migrated to one device of such a pair, there must be a
way for the other device to find out about it. Having userspace device
drivers try to exchange that kind of information is racy in many ways,
so it is easier and better to have it in the kernel.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Core Kernel support for Compute-Offload Devices
  2015-07-31 16:34     ` Jerome Glisse
@ 2015-08-03 18:51       ` David Woodhouse
  2015-08-03 19:01         ` Jerome Glisse
  2015-08-03 22:10         ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 27+ messages in thread
From: David Woodhouse @ 2015-08-03 18:51 UTC (permalink / raw)
  To: Jerome Glisse, Joerg Roedel; +Cc: ksummit-discuss

On Fri, 2015-07-31 at 12:34 -0400, Jerome Glisse wrote:
> No, the ASID should not be associated with the mm_struct. There are
> too few ASIDs to have enough of them; I think currently there are only
> 8 bits worth of ASID. So what happens is that the GPU device driver
> schedules processes and recycles ASIDs as it does so.

In PCIe we have 20 bits of PASID. And we are going to expect hardware
to implement them all, even if it can only do caching for fewer PASIDs
than that.

There is also an expectation that a given MM will have the *same* PASID
across all devices.

-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Core Kernel support for Compute-Offload Devices
  2015-08-03 18:51       ` David Woodhouse
@ 2015-08-03 19:01         ` Jerome Glisse
  2015-08-03 19:07           ` Andy Lutomirski
  2015-08-03 21:10           ` Joerg Roedel
  2015-08-03 22:10         ` Benjamin Herrenschmidt
  1 sibling, 2 replies; 27+ messages in thread
From: Jerome Glisse @ 2015-08-03 19:01 UTC (permalink / raw)
  To: David Woodhouse; +Cc: Jerome Glisse, ksummit-discuss

On Mon, Aug 03, 2015 at 07:51:02PM +0100, David Woodhouse wrote:
> On Fri, 2015-07-31 at 12:34 -0400, Jerome Glisse wrote:
> > No the ASID should not be associated with mm_struct. There is to
> > few ASID to have enough of them. I think currently there is only
> > 8bits worth of ASID. So what happen is that the GPU device driver
> > schedule process and recycle ASID as it does.
> 
> In PCIe we have 20 bits of PASID. And we are going to expect hardware
> to implement them all, even if it can only do caching for fewer PASIDs
> than that.
> 

This is not the case with current AMD hw, which IIRC only supports 8
or 9 bits of PASID. I don't know if their next hardware will have more
bits or not. I need to check the PCIe spec, but I do not think 20 bits
is a mandatory limit.

> There is also an expectation that a given MM will have the *same* PASID
> across all devices.

I understand that this would be preferred. But in the case of hw that
has only a limited number of PASID bits, you surely do not want to
starve it, i.e. it would be better to have the device recycle PASIDs
to maximize their usage.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Core Kernel support for Compute-Offload Devices
  2015-08-03 19:01         ` Jerome Glisse
@ 2015-08-03 19:07           ` Andy Lutomirski
  2015-08-03 19:56             ` Jerome Glisse
  2015-08-03 21:10           ` Joerg Roedel
  1 sibling, 1 reply; 27+ messages in thread
From: Andy Lutomirski @ 2015-08-03 19:07 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: Jerome Glisse, ksummit-discuss

On Mon, Aug 3, 2015 at 12:01 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
> On Mon, Aug 03, 2015 at 07:51:02PM +0100, David Woodhouse wrote:
>> On Fri, 2015-07-31 at 12:34 -0400, Jerome Glisse wrote:
>> > No the ASID should not be associated with mm_struct. There is to
>> > few ASID to have enough of them. I think currently there is only
>> > 8bits worth of ASID. So what happen is that the GPU device driver
>> > schedule process and recycle ASID as it does.
>>
>> In PCIe we have 20 bits of PASID. And we are going to expect hardware
>> to implement them all, even if it can only do caching for fewer PASIDs
>> than that.
>>
>
> This is not the case with current AMD hw which IIRC only support 8bits or
> 9bits for PASID. Dunno if there next hardware will have more bits or not.
> So i need to check PCIE spec but i do not think the 20bits is a mandatory
> limit.
>
>> There is also an expectation that a given MM will have the *same* PASID
>> across all devices.
>
> I understand that this would be prefered. But in case of hw that have only
> limited number of bit for PASID you surely do not want to starve it ie it
> would be better to have the device recycle PASID to maximize its usage.
>

FWIW, x86 PCID has 12 bits, and, if I ever try to implement support
for it, my thought would be to only use 3 or 4 of those bits and
aggressively recycle PCIDs.

I have no idea whether ASIDs and PCIDs are supposed to be related at
all, given that I don't know anything about how to program these
unified memory contraptions.

--Andy

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Core Kernel support for Compute-Offload Devices
  2015-08-03 16:10   ` Joerg Roedel
@ 2015-08-03 19:23     ` Arnd Bergmann
  0 siblings, 0 replies; 27+ messages in thread
From: Arnd Bergmann @ 2015-08-03 19:23 UTC (permalink / raw)
  To: Joerg Roedel; +Cc: ksummit-discuss

On Monday 03 August 2015 18:10:27 Joerg Roedel wrote:
> Hi Arnd,
> 
> On Sat, Aug 01, 2015 at 10:46:49PM +0200, Arnd Bergmann wrote:
> > I think we did this part right with the Cell SPUs 10 years ago: A
> > task is a task, and you just switch between running in user mode and
> > running on the offload engine through some syscall or ioctl.
> 
> Do you mean that on Cell the offload is synchronous, so that a task that
> schedules something to the SPU sleeps until the job there is done?

Correct. The tradeoff here is that accounting in the kernel is
greatly simplified, but you need to create a separate thread for
each instance of the offload engine that you want to use.

> > The part that got us into endless trouble though was trying to
> > satisfy two opposite requirements: 
> > 
> > a) having the kernel schedule tasks automatically onto the offload
> >    engines and take care of context switches and placement, so you
> >    can do multi-user and multi-tasking processing on them.
> > 
> > b) getting most performance out of the of offload engines, by giving
> >    a single user total control over the placement and no do any
> >    scheduling in the kernel at all.
> > 
> > I would strongly recommend now that any new interface tries to do
> > only one of the two models, but does it right.
> 
> I think it mostly depends on the use-cases which approach makes more
> sense. An HPC environment would certainly want to have full control over
> the placement and scheduling on the offload devices. A desktop
> environment with typical applications that offload stuff to one or
> multiple GPUs through optimized libraries would benefit more from
> automatic scheduling and placement.
> 
> It is probably good to hear about the use-cases of the different offload
> devices to make a good decision here.

It's also quite likely that not all offload engines fit in one model.
E.g. the realtime units on some SoCs should never be scheduled dynamically
because that clearly breaks the realtime behavior, while for a lot of
others, a scheduler might be the preferred interface.

It may be helpful to have one interface centered around the needs of
OpenCL for all those engines that fit into that model, and then do something
else for the ones that can't run OpenCL anyway and have some other requirements.

	Arnd

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Core Kernel support for Compute-Offload Devices
  2015-08-03 19:07           ` Andy Lutomirski
@ 2015-08-03 19:56             ` Jerome Glisse
  0 siblings, 0 replies; 27+ messages in thread
From: Jerome Glisse @ 2015-08-03 19:56 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Jerome Glisse, ksummit-discuss

On Mon, Aug 03, 2015 at 12:07:49PM -0700, Andy Lutomirski wrote:
> On Mon, Aug 3, 2015 at 12:01 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
> > On Mon, Aug 03, 2015 at 07:51:02PM +0100, David Woodhouse wrote:
> >> On Fri, 2015-07-31 at 12:34 -0400, Jerome Glisse wrote:
> >> > No the ASID should not be associated with mm_struct. There is to
> >> > few ASID to have enough of them. I think currently there is only
> >> > 8bits worth of ASID. So what happen is that the GPU device driver
> >> > schedule process and recycle ASID as it does.
> >>
> >> In PCIe we have 20 bits of PASID. And we are going to expect hardware
> >> to implement them all, even if it can only do caching for fewer PASIDs
> >> than that.
> >>
> >
> > This is not the case with current AMD hw which IIRC only support 8bits or
> > 9bits for PASID. Dunno if there next hardware will have more bits or not.
> > So i need to check PCIE spec but i do not think the 20bits is a mandatory
> > limit.
> >
> >> There is also an expectation that a given MM will have the *same* PASID
> >> across all devices.
> >
> > I understand that this would be prefered. But in case of hw that have only
> > limited number of bit for PASID you surely do not want to starve it ie it
> > would be better to have the device recycle PASID to maximize its usage.
> >
> 
> FWIW, x86 PCID has 12 bits, and, if I ever try to implement support
> for it, my thought would be to only use 3 or 4 of those bits and
> aggressively recycle PCIDs.
> 
> I have no idea whether ASIDs and PCIDs are supposed to be related at
> all, given that I don't know anything about how to program these
> unified memory contraptions.

So it is even worse than I thought: the PASID spec says that the PASID
capability register has a field indicating the number of PASID bits
the device supports. Valid values include 1 bit, which is kind of low,
and there does not seem to be any mandatory lower limit.

So if we want one page table (mm) associated with one PASID, this
might be tedious if not impossible. Say one device supports 12 bits
and a process starts using that device, and we set its PASID to
((1 << 12) - 1); then the process wants to use another device that
only supports 8 bits, so we would need to switch the PASID to
something that fits. The first device might not allow switching the
PASID without a big hammer like fully stopping the current job, or
waiting for the current job to complete with no guarantee on how long
that could take: minutes, hours, days, ...

I agree and wish we could have one PASID per mm, but given the
hardware, I think we need to do a reality check here :(
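
To make the mismatch concrete, a bind path that insists on one
process-wide PASID would have to refuse narrow devices, roughly like
this (pci_max_pasids() should report the device's Max PASID Width; the
rest is illustrative):

#include <linux/pci.h>
#include <linux/pci-ats.h>

/* Bind an mm's already-allocated global PASID to one more device. */
static int bind_mm_pasid_to_device(struct pci_dev *pdev, int pasid)
{
	int max = pci_max_pasids(pdev);		/* 1 << "Max PASID Width" */

	if (max <= 0)
		return -ENODEV;		/* no PASID support at all */
	if (pasid >= max)
		return -ERANGE;		/* our process-wide PASID does not fit */

	/* ... program the IOMMU context entry for (pdev, pasid) ... */
	return 0;
}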

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Core Kernel support for Compute-Offload Devices
  2015-08-03 19:01         ` Jerome Glisse
  2015-08-03 19:07           ` Andy Lutomirski
@ 2015-08-03 21:10           ` Joerg Roedel
  2015-08-03 21:12             ` David Woodhouse
  1 sibling, 1 reply; 27+ messages in thread
From: Joerg Roedel @ 2015-08-03 21:10 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: Jerome Glisse, ksummit-discuss

On Mon, Aug 03, 2015 at 03:01:45PM -0400, Jerome Glisse wrote:
> This is not the case with current AMD hw which IIRC only support 8bits or
> 9bits for PASID. Dunno if there next hardware will have more bits or not.
> So i need to check PCIE spec but i do not think the 20bits is a mandatory
> limit.

AMD hardware currently implements PASIDs with 16 bits. Given that only
mm_structs which are used by offload devices get one, this should be
enough to put them into a global pool and have one PASID per mm_struct.
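
Roughly like the sketch below (only a sketch: mm->pasid is not an
existing mm_struct field, and the 1 << 16 limit just reflects what
current AMD IOMMUs implement):

#include <linux/idr.h>
#include <linux/mm_types.h>

static DEFINE_IDA(pasid_ida);

/* One PASID per mm_struct, allocated lazily on the first device bind. */
static int mm_get_pasid(struct mm_struct *mm)
{
        int pasid;

        if (mm->pasid)
                return mm->pasid;

        pasid = ida_simple_get(&pasid_ida, 1, 1 << 16, GFP_KERNEL);
        if (pasid < 0)
                return pasid;

        mm->pasid = pasid;
        return pasid;
}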


	Joerg

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Core Kernel support for Compute-Offload Devices
  2015-08-03 21:10           ` Joerg Roedel
@ 2015-08-03 21:12             ` David Woodhouse
  2015-08-03 21:31               ` Joerg Roedel
                                 ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: David Woodhouse @ 2015-08-03 21:12 UTC (permalink / raw)
  To: Joerg Roedel, Jerome Glisse; +Cc: Jerome Glisse, ksummit-discuss

[-- Attachment #1: Type: text/plain, Size: 1022 bytes --]

On Mon, 2015-08-03 at 23:10 +0200, Joerg Roedel wrote:
> On Mon, Aug 03, 2015 at 03:01:45PM -0400, Jerome Glisse wrote:
> > This is not the case with current AMD hw which IIRC only support 8bits or
> > 9bits for PASID. Dunno if there next hardware will have more bits or not.
> > So i need to check PCIE spec but i do not think the 20bits is a mandatory
> > limit.
> 
> AMD hardware currently implements PASIDs with 16 bits. Given that only
> mm_structs which are used by offload devices get one, this should be
> enough to put them into a global pool and have one PASID per mm_struct.

I think there are many ARM systems which need this model because of the
way TLB shootdowns are handled in hardware, and shared with the IOMMU?
So we have to use the same ASID for both MMU and IOMMU there, AIUI.

Not that I claim to be an expert on the ARM IOMMUs.

-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5691 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Core Kernel support for Compute-Offload Devices
  2015-08-03 21:12             ` David Woodhouse
@ 2015-08-03 21:31               ` Joerg Roedel
  2015-08-03 21:34               ` Jerome Glisse
  2015-08-04 18:11               ` Catalin Marinas
  2 siblings, 0 replies; 27+ messages in thread
From: Joerg Roedel @ 2015-08-03 21:31 UTC (permalink / raw)
  To: David Woodhouse; +Cc: Jerome Glisse, ksummit-discuss

On Mon, Aug 03, 2015 at 10:12:54PM +0100, David Woodhouse wrote:
> On Mon, 2015-08-03 at 23:10 +0200, Joerg Roedel wrote:
> > AMD hardware currently implements PASIDs with 16 bits. Given that only
> > mm_structs which are used by offload devices get one, this should be
> > enough to put them into a global pool and have one PASID per mm_struct.
> 
> I think there are many ARM systems which need this model because of the
> way TLB shootdowns are handled in hardware, and shared with the IOMMU?
> So we have to use the same ASID for both MMU and IOMMU there, AIUI.

Yes, I heard the same; this hardware clearly forces the
one-PASID-per-mm_struct model.

On x86 we probably have to look at what offload devices will appear;
the ones currently available support 16 bits, which allows the same
model.



	Joerg

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Core Kernel support for Compute-Offload Devices
  2015-08-03 21:12             ` David Woodhouse
  2015-08-03 21:31               ` Joerg Roedel
@ 2015-08-03 21:34               ` Jerome Glisse
  2015-08-03 21:51                 ` David Woodhouse
  2015-08-04 18:11               ` Catalin Marinas
  2 siblings, 1 reply; 27+ messages in thread
From: Jerome Glisse @ 2015-08-03 21:34 UTC (permalink / raw)
  To: David Woodhouse; +Cc: Jerome Glisse, ksummit-discuss

On Mon, Aug 03, 2015 at 10:12:54PM +0100, David Woodhouse wrote:
> On Mon, 2015-08-03 at 23:10 +0200, Joerg Roedel wrote:
> > On Mon, Aug 03, 2015 at 03:01:45PM -0400, Jerome Glisse wrote:
> > > This is not the case with current AMD hw which IIRC only support 8bits or
> > > 9bits for PASID. Dunno if there next hardware will have more bits or not.
> > > So i need to check PCIE spec but i do not think the 20bits is a mandatory
> > > limit.
> > 
> > AMD hardware currently implements PASIDs with 16 bits. Given that only
> > mm_structs which are used by offload devices get one, this should be
> > enough to put them into a global pool and have one PASID per mm_struct.
> 
> I think there are many ARM systems which need this model because of the
> way TLB shootdowns are handled in hardware, and shared with the IOMMU?
> So we have to use the same ASID for both MMU and IOMMU there, AIUI.
> 
> Not that I claim to be an expert on the ARM IOMMUs.
> 

I see that on some platforms the ASID <-> page table relationship must
be 1 to 1. My experience so far on AMD is that it need not be: while
the IOMMU has 16 bits, their GPUs have only 8 or 9 bits. Also, given
that the PASID spec says devices can support different numbers of bits,
it seems this is going to end up being a mess of arch- and
device-specific quirks.

Note that I really would like the 1-to-1 ASID <-> mm_struct match, but
I fear this is not something that can be common to all platforms.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Core Kernel support for Compute-Offload Devices
  2015-08-03 21:34               ` Jerome Glisse
@ 2015-08-03 21:51                 ` David Woodhouse
  0 siblings, 0 replies; 27+ messages in thread
From: David Woodhouse @ 2015-08-03 21:51 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: Jerome Glisse, ksummit-discuss

[-- Attachment #1: Type: text/plain, Size: 1103 bytes --]

On Mon, 2015-08-03 at 17:34 -0400, Jerome Glisse wrote:
> Note that i really would like the ASID <-> mm struct 1 to 1 match but
> i am just fearing this is not something that can be common to all
> platform.

You are quite possibly right. And we don't *have* to force it.

We probably do need to design the core IOMMU interfaces to tolerate
either.

The actual PASID allocation wants to be outside the individual IOMMU
driver anyway. The main thing we need to do is let the IOMMU know if it
can share PASID tables or not.

I'm now pondering a 'pasid_space' object which will contain an IDR or
something for allocating PASIDs, and when one of those objects is
created we can also call into an IOMMU driver function for allocating a
set of PASID tables (if it needs to). You then attach a given device to
a pasid space.

I suppose it ends up looking a lot like the existing IOMMU domains that
you can map multiple devices into.
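
Very roughly something like the following; all names are invented on
the spot and nothing here is implemented:

#include <linux/device.h>
#include <linux/idr.h>
#include <linux/list.h>
#include <linux/mm_types.h>
#include <linux/mutex.h>

/* A shared PASID allocation context that devices get attached to. */
struct pasid_space {
        struct idr       pasid_idr;    /* PASID allocation */
        struct mutex     lock;
        struct list_head devices;      /* devices attached to this space */
        void             *pasid_tables; /* IOMMU-owned, if tables are shared */
};

/* Create a space; the IOMMU driver may allocate PASID tables here. */
struct pasid_space *pasid_space_alloc(void);

/* Attach a device; the IOMMU driver can refuse or share the tables. */
int pasid_space_attach_device(struct pasid_space *ps, struct device *dev);

/* Allocate a PASID for an mm within this space. */
int pasid_space_alloc_pasid(struct pasid_space *ps, struct mm_struct *mm);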

-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5691 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Core Kernel support for Compute-Offload Devices
  2015-08-03 18:51       ` David Woodhouse
  2015-08-03 19:01         ` Jerome Glisse
@ 2015-08-03 22:10         ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 27+ messages in thread
From: Benjamin Herrenschmidt @ 2015-08-03 22:10 UTC (permalink / raw)
  To: David Woodhouse; +Cc: Jerome Glisse, ksummit-discuss

On Mon, 2015-08-03 at 19:51 +0100, David Woodhouse wrote:
> In PCIe we have 20 bits of PASID. And we are going to expect hardware
> to implement them all, even if it can only do caching for fewer PASIDs
> than that.
> 
> There is also an expectation that a given MM will have the *same*
> PASID across all devices.

Unless you already have a TLB-coherent fabric keyed on an LPID/PID
whose overall capacity is larger than your 20-bit PASID. In that case
your IOMMU has remapping facilities between PASIDs and LPID/PID which
could in theory provide a different mapping for devices...

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Core Kernel support for Compute-Offload Devices
  2015-08-01 20:46 ` Arnd Bergmann
  2015-08-03 16:10   ` Joerg Roedel
@ 2015-08-04 15:40   ` Christoph Lameter
  1 sibling, 0 replies; 27+ messages in thread
From: Christoph Lameter @ 2015-08-04 15:40 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: ksummit-discuss

[-- Attachment #1: Type: TEXT/PLAIN, Size: 288 bytes --]

On Sat, 1 Aug 2015, Arnd Bergmann wrote:

> > 	David Woodhouse
> > 	Jesse Barnes
> > 	Will Deacon
> > 	Paul E. McKenney
> > 	Rik van Riel
> > 	Mel Gorman
> > 	Andrea Arcangeli
> > 	Christoph Lameter
> > 	Jérôme Glisse
>
> Add me in as well,

Confirming my interest in the subject matter.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Ksummit-discuss] [CORE TOPIC] Core Kernel support for Compute-Offload Devices
  2015-08-03 21:12             ` David Woodhouse
  2015-08-03 21:31               ` Joerg Roedel
  2015-08-03 21:34               ` Jerome Glisse
@ 2015-08-04 18:11               ` Catalin Marinas
  2 siblings, 0 replies; 27+ messages in thread
From: Catalin Marinas @ 2015-08-04 18:11 UTC (permalink / raw)
  To: David Woodhouse; +Cc: Jerome Glisse, ksummit-discuss

On Mon, Aug 03, 2015 at 10:12:54PM +0100, David Woodhouse wrote:
> On Mon, 2015-08-03 at 23:10 +0200, Joerg Roedel wrote:
> > On Mon, Aug 03, 2015 at 03:01:45PM -0400, Jerome Glisse wrote:
> > > This is not the case with current AMD hw which IIRC only support 8bits or
> > > 9bits for PASID. Dunno if there next hardware will have more bits or not.
> > > So i need to check PCIE spec but i do not think the 20bits is a mandatory
> > > limit.
> > 
> > AMD hardware currently implements PASIDs with 16 bits. Given that only
> > mm_structs which are used by offload devices get one, this should be
> > enough to put them into a global pool and have one PASID per mm_struct.
> 
> I think there are many ARM systems which need this model because of the
> way TLB shootdowns are handled in hardware, and shared with the IOMMU?
> So we have to use the same ASID for both MMU and IOMMU there, AIUI.
> 
> Not that I claim to be an expert on the ARM IOMMUs.

Neither am I, cc'ing Will.

As is the case with the ARM architecture (whether CPU or SMMU/IOMMU),
many features are optional. So an implementation may or may not support
handling of TLB invalidation broadcasts from the CPU. Even when it
does, this is configurable, so the SMMU is not forced to share the same
ASID space as the CPU.

AFAICT, the ARM SMMU uses StreamID/SubstreamID to map (1:1?) PCIe
Request-ID and PASID (when stage 1 translation is supported). The
StreamID/SubstreamID can then be mapped onto an ASID via stream tables.
Currently the arm-smmu drivers use their own ASID space but, if we are
going to support compute-offload devices, they could be made to share
the same ASID as the corresponding user processes (with an additional
API).

For a more generic compute-offload API, I guess we would need callbacks
into the (IOMMU) drivers for page table management, including TLB
invalidation and CPU ASID management events (such as renewing the CPU
ASID for an existing task). An implementation supporting sharing of
IOMMU/CPU page tables would just have a minimal implementation of such
an API. We probably need a (void *)iommu_context pointer in mm_struct
that drivers can point at their own context information.
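
For the sake of discussion, such an API could look roughly like the
following (all names are invented here, nothing is implemented):

#include <linux/mm_types.h>

/*
 * Callbacks the core mm would invoke for every IOMMU driver bound to
 * this mm. A driver that shares page tables (and TLB invalidation
 * broadcasting) with the CPU can leave most of them empty.
 */
struct mm_iommu_ops {
        /* CPU invalidated TLBs for [start, end) of this mm */
        void (*tlb_invalidate)(struct mm_struct *mm,
                               unsigned long start, unsigned long end);
        /* the CPU ASID of this mm was recycled or renewed */
        void (*asid_changed)(struct mm_struct *mm, unsigned long new_asid);
        /* the address space is going away */
        void (*release)(struct mm_struct *mm);
};

/*
 * mm_struct would also grow a driver-owned pointer, e.g.:
 *
 *      void *iommu_context;
 *
 * so drivers can hang their per-mm state off it.
 */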

-- 
Catalin

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2015-08-04 18:11 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-07-30 13:00 [Ksummit-discuss] [CORE TOPIC] Core Kernel support for Compute-Offload Devices Joerg Roedel
2015-07-30 13:31 ` David Woodhouse
2015-07-30 13:54   ` Joerg Roedel
2015-07-31 16:34     ` Jerome Glisse
2015-08-03 18:51       ` David Woodhouse
2015-08-03 19:01         ` Jerome Glisse
2015-08-03 19:07           ` Andy Lutomirski
2015-08-03 19:56             ` Jerome Glisse
2015-08-03 21:10           ` Joerg Roedel
2015-08-03 21:12             ` David Woodhouse
2015-08-03 21:31               ` Joerg Roedel
2015-08-03 21:34               ` Jerome Glisse
2015-08-03 21:51                 ` David Woodhouse
2015-08-04 18:11               ` Catalin Marinas
2015-08-03 22:10         ` Benjamin Herrenschmidt
2015-07-30 22:32 ` Benjamin Herrenschmidt
2015-08-01 16:10   ` Joerg Roedel
2015-07-31 14:52 ` Rik van Riel
2015-07-31 16:13   ` Jerome Glisse
2015-08-01 15:57     ` Joerg Roedel
2015-08-01 19:08       ` Jerome Glisse
2015-08-03 16:02         ` Joerg Roedel
2015-08-03 18:28           ` Jerome Glisse
2015-08-01 20:46 ` Arnd Bergmann
2015-08-03 16:10   ` Joerg Roedel
2015-08-03 19:23     ` Arnd Bergmann
2015-08-04 15:40   ` Christoph Lameter
