All of lore.kernel.org
 help / color / mirror / Atom feed
* Generic IOMMU pooled allocator
@ 2015-03-19  2:25 ` David Miller
  0 siblings, 0 replies; 71+ messages in thread
From: David Miller @ 2015-03-19  2:25 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: sparclinux, paulus, sowmini.varadhan


PowerPC folks, we're trying to kill the locking contention in our
IOMMU allocators and noticed that you guys have a nice solution to
this in your IOMMU code.

Sowmini put together a patch series that tries to extract out the
generic parts of your code and place it in lib/iommu-common.c so
that both Sparc and PowerPC can make use of it.

The real test is if powerpc can be converted to use it.

So if you guys could please take a look at the current version of
this series at:

http://patchwork.ozlabs.org/patch/449712/
http://patchwork.ozlabs.org/patch/449713/
http://patchwork.ozlabs.org/patch/449715/

and give us some feedback that would be great.

After we sort out making this work properly for both architectures we
can figure out who merges which parts via what tree(s).

Thanks!

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Generic IOMMU pooled allocator
@ 2015-03-19  2:25 ` David Miller
  0 siblings, 0 replies; 71+ messages in thread
From: David Miller @ 2015-03-19  2:25 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: sparclinux, paulus, sowmini.varadhan


PowerPC folks, we're trying to kill the locking contention in our
IOMMU allocators and noticed that you guys have a nice solution to
this in your IOMMU code.

Sowmini put together a patch series that tries to extract out the
generic parts of your code and place it in lib/iommu-common.c so
that both Sparc and PowerPC can make use of it.

The real test is if powerpc can be converted to use it.

So if you guys could please take a look at the current version of
this series at:

http://patchwork.ozlabs.org/patch/449712/
http://patchwork.ozlabs.org/patch/449713/
http://patchwork.ozlabs.org/patch/449715/

and give us some feedback that would be great.

After we sort out making this work properly for both architectures we
can figure out who merges which parts via what tree(s).

Thanks!

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
  2015-03-19  2:25 ` David Miller
@ 2015-03-19  2:46   ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 71+ messages in thread
From: Benjamin Herrenschmidt @ 2015-03-19  2:46 UTC (permalink / raw)
  To: David Miller
  Cc: sowmini.varadhan, Anton Blanchard, paulus, sparclinux, linuxppc-dev

On Wed, 2015-03-18 at 22:25 -0400, David Miller wrote:
> PowerPC folks, we're trying to kill the locking contention in our
> IOMMU allocators and noticed that you guys have a nice solution to
> this in your IOMMU code.
> 
> Sowmini put together a patch series that tries to extract out the
> generic parts of your code and place it in lib/iommu-common.c so
> that both Sparc and PowerPC can make use of it.
> 
> The real test is if powerpc can be converted to use it.
> 
> So if you guys could please take a look at the current version of
> this series at:
> 
> http://patchwork.ozlabs.org/patch/449712/
> http://patchwork.ozlabs.org/patch/449713/
> http://patchwork.ozlabs.org/patch/449715/
> 
> and give us some feedback that would be great.
> 
> After we sort out making this work properly for both architectures we
> can figure out who merges which parts via what tree(s).

Sounds like a good idea ! CC'ing Anton who wrote the pool stuff. I'll
try to find somebody to work on that here & will let you know asap.

Cheers,
Ben.




^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
@ 2015-03-19  2:46   ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 71+ messages in thread
From: Benjamin Herrenschmidt @ 2015-03-19  2:46 UTC (permalink / raw)
  To: David Miller
  Cc: sowmini.varadhan, Anton Blanchard, paulus, sparclinux, linuxppc-dev

On Wed, 2015-03-18 at 22:25 -0400, David Miller wrote:
> PowerPC folks, we're trying to kill the locking contention in our
> IOMMU allocators and noticed that you guys have a nice solution to
> this in your IOMMU code.
> 
> Sowmini put together a patch series that tries to extract out the
> generic parts of your code and place it in lib/iommu-common.c so
> that both Sparc and PowerPC can make use of it.
> 
> The real test is if powerpc can be converted to use it.
> 
> So if you guys could please take a look at the current version of
> this series at:
> 
> http://patchwork.ozlabs.org/patch/449712/
> http://patchwork.ozlabs.org/patch/449713/
> http://patchwork.ozlabs.org/patch/449715/
> 
> and give us some feedback that would be great.
> 
> After we sort out making this work properly for both architectures we
> can figure out who merges which parts via what tree(s).

Sounds like a good idea ! CC'ing Anton who wrote the pool stuff. I'll
try to find somebody to work on that here & will let you know asap.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
  2015-03-19  2:46   ` Benjamin Herrenschmidt
@ 2015-03-19  2:50     ` David Miller
  -1 siblings, 0 replies; 71+ messages in thread
From: David Miller @ 2015-03-19  2:50 UTC (permalink / raw)
  To: benh; +Cc: sowmini.varadhan, anton, paulus, sparclinux, linuxppc-dev

From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Date: Thu, 19 Mar 2015 13:46:15 +1100

> Sounds like a good idea ! CC'ing Anton who wrote the pool stuff. I'll
> try to find somebody to work on that here & will let you know asap.

Thanks a lot Ben.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
@ 2015-03-19  2:50     ` David Miller
  0 siblings, 0 replies; 71+ messages in thread
From: David Miller @ 2015-03-19  2:50 UTC (permalink / raw)
  To: benh; +Cc: sowmini.varadhan, anton, paulus, sparclinux, linuxppc-dev

From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Date: Thu, 19 Mar 2015 13:46:15 +1100

> Sounds like a good idea ! CC'ing Anton who wrote the pool stuff. I'll
> try to find somebody to work on that here & will let you know asap.

Thanks a lot Ben.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
  2015-03-19  2:25 ` David Miller
@ 2015-03-19  3:01   ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 71+ messages in thread
From: Benjamin Herrenschmidt @ 2015-03-19  3:01 UTC (permalink / raw)
  To: David Miller
  Cc: Alexey Kardashevskiy, sowmini.varadhan, Anton Blanchard, paulus,
	sparclinux, linuxppc-dev

On Wed, 2015-03-18 at 22:25 -0400, David Miller wrote:
> PowerPC folks, we're trying to kill the locking contention in our
> IOMMU allocators and noticed that you guys have a nice solution to
> this in your IOMMU code.

 .../...

Adding Alexei too who is currently doing some changes to our iommu
code to deal with KVM.

One thing I noticed is the asymetry in your code between the alloc
and the free path. The alloc path is similar to us in that the lock
covers the allocation and that's about it, there's no actual mapping to
the HW done, it's done by the caller level right ?

The free path however, in your case, takes the lock and calls back into
"demap" (which I assume is what removes the translation from the HW)
with the lock held. There's also some mapping between cookies
and index which here too isn't exposed to the alloc side but is
exposed to the free side.

I'm sure there's a rationale for that but it would be good if you could
explain it and document the semantics of those 3 callbacks in the iommu
ops.

One thing that Alexey is doing on our side is to move some of the
hooks to manipulate the underlying TCEs (ie. iommu PTEs) from our
global ppc_md. data structure to a new iommu_table_ops, so your patches
will definitely collide with our current work so we'll have to figure
out how things can made to match. We might be able to move more than
just the allocator to the generic code, and the whole implementation of
map_sg/unmap_sg if we have the right set of "ops", unless you see a
reason why that wouldn't work for you ?

We also need to add some additional platform specific fields to certain
iommu table instances to deal with some KVM related tracking of pinned
"DMAble" memory, here too we might have to be creative and possibly
enclose the generic iommu_table in a platform specific variant.

Alexey, any other comment ?

Cheers,
Ben.




^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
@ 2015-03-19  3:01   ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 71+ messages in thread
From: Benjamin Herrenschmidt @ 2015-03-19  3:01 UTC (permalink / raw)
  To: David Miller
  Cc: Alexey Kardashevskiy, sowmini.varadhan, Anton Blanchard, paulus,
	sparclinux, linuxppc-dev

On Wed, 2015-03-18 at 22:25 -0400, David Miller wrote:
> PowerPC folks, we're trying to kill the locking contention in our
> IOMMU allocators and noticed that you guys have a nice solution to
> this in your IOMMU code.

 .../...

Adding Alexei too who is currently doing some changes to our iommu
code to deal with KVM.

One thing I noticed is the asymetry in your code between the alloc
and the free path. The alloc path is similar to us in that the lock
covers the allocation and that's about it, there's no actual mapping to
the HW done, it's done by the caller level right ?

The free path however, in your case, takes the lock and calls back into
"demap" (which I assume is what removes the translation from the HW)
with the lock held. There's also some mapping between cookies
and index which here too isn't exposed to the alloc side but is
exposed to the free side.

I'm sure there's a rationale for that but it would be good if you could
explain it and document the semantics of those 3 callbacks in the iommu
ops.

One thing that Alexey is doing on our side is to move some of the
hooks to manipulate the underlying TCEs (ie. iommu PTEs) from our
global ppc_md. data structure to a new iommu_table_ops, so your patches
will definitely collide with our current work so we'll have to figure
out how things can made to match. We might be able to move more than
just the allocator to the generic code, and the whole implementation of
map_sg/unmap_sg if we have the right set of "ops", unless you see a
reason why that wouldn't work for you ?

We also need to add some additional platform specific fields to certain
iommu table instances to deal with some KVM related tracking of pinned
"DMAble" memory, here too we might have to be creative and possibly
enclose the generic iommu_table in a platform specific variant.

Alexey, any other comment ?

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
  2015-03-19  3:01   ` Benjamin Herrenschmidt
@ 2015-03-19  5:27     ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 71+ messages in thread
From: Alexey Kardashevskiy @ 2015-03-19  5:27 UTC (permalink / raw)
  To: David Miller
  Cc: Alexey Kardashevskiy, Anton Blanchard, paulus, sowmini.varadhan,
	sparclinux, linuxppc-dev

On 03/19/2015 02:01 PM, Benjamin Herrenschmidt wrote:
> On Wed, 2015-03-18 at 22:25 -0400, David Miller wrote:
>> PowerPC folks, we're trying to kill the locking contention in our
>> IOMMU allocators and noticed that you guys have a nice solution to
>> this in your IOMMU code.
>
>   .../...
>
> Adding Alexei too who is currently doing some changes to our iommu
> code to deal with KVM.
>
> One thing I noticed is the asymetry in your code between the alloc
> and the free path. The alloc path is similar to us in that the lock
> covers the allocation and that's about it, there's no actual mapping to
> the HW done, it's done by the caller level right ?
>
> The free path however, in your case, takes the lock and calls back into
> "demap" (which I assume is what removes the translation from the HW)
> with the lock held. There's also some mapping between cookies
> and index which here too isn't exposed to the alloc side but is
> exposed to the free side.
>
> I'm sure there's a rationale for that but it would be good if you could
> explain it and document the semantics of those 3 callbacks in the iommu
> ops.
>
> One thing that Alexey is doing on our side is to move some of the
> hooks to manipulate the underlying TCEs (ie. iommu PTEs) from our
> global ppc_md. data structure to a new iommu_table_ops, so your patches
> will definitely collide with our current work so we'll have to figure
> out how things can made to match. We might be able to move more than
> just the allocator to the generic code, and the whole implementation of
> map_sg/unmap_sg if we have the right set of "ops", unless you see a
> reason why that wouldn't work for you ?
>
> We also need to add some additional platform specific fields to certain
> iommu table instances to deal with some KVM related tracking of pinned
> "DMAble" memory, here too we might have to be creative and possibly
> enclose the generic iommu_table in a platform specific variant.
>
> Alexey, any other comment ?


Agree about missing symmetry. In general, I would call it "zoned 
pool-locked memory allocator"-ish rather than iommu_table and have no 
callbacks there.

The iommu_tbl_range_free() caller could call cookie_to_index() and what the 
reset() callback does here - I do not understand, some documentation would 
help here, and demap() does not have to be executed under the lock (as 
map() is not executed under the lock).

btw why demap, not unmap? :)


-- 
Alexey

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
@ 2015-03-19  5:27     ` Alexey Kardashevskiy
  0 siblings, 0 replies; 71+ messages in thread
From: Alexey Kardashevskiy @ 2015-03-19  5:27 UTC (permalink / raw)
  To: David Miller
  Cc: Alexey Kardashevskiy, Anton Blanchard, paulus, sowmini.varadhan,
	sparclinux, linuxppc-dev

On 03/19/2015 02:01 PM, Benjamin Herrenschmidt wrote:
> On Wed, 2015-03-18 at 22:25 -0400, David Miller wrote:
>> PowerPC folks, we're trying to kill the locking contention in our
>> IOMMU allocators and noticed that you guys have a nice solution to
>> this in your IOMMU code.
>
>   .../...
>
> Adding Alexei too who is currently doing some changes to our iommu
> code to deal with KVM.
>
> One thing I noticed is the asymetry in your code between the alloc
> and the free path. The alloc path is similar to us in that the lock
> covers the allocation and that's about it, there's no actual mapping to
> the HW done, it's done by the caller level right ?
>
> The free path however, in your case, takes the lock and calls back into
> "demap" (which I assume is what removes the translation from the HW)
> with the lock held. There's also some mapping between cookies
> and index which here too isn't exposed to the alloc side but is
> exposed to the free side.
>
> I'm sure there's a rationale for that but it would be good if you could
> explain it and document the semantics of those 3 callbacks in the iommu
> ops.
>
> One thing that Alexey is doing on our side is to move some of the
> hooks to manipulate the underlying TCEs (ie. iommu PTEs) from our
> global ppc_md. data structure to a new iommu_table_ops, so your patches
> will definitely collide with our current work so we'll have to figure
> out how things can made to match. We might be able to move more than
> just the allocator to the generic code, and the whole implementation of
> map_sg/unmap_sg if we have the right set of "ops", unless you see a
> reason why that wouldn't work for you ?
>
> We also need to add some additional platform specific fields to certain
> iommu table instances to deal with some KVM related tracking of pinned
> "DMAble" memory, here too we might have to be creative and possibly
> enclose the generic iommu_table in a platform specific variant.
>
> Alexey, any other comment ?


Agree about missing symmetry. In general, I would call it "zoned 
pool-locked memory allocator"-ish rather than iommu_table and have no 
callbacks there.

The iommu_tbl_range_free() caller could call cookie_to_index() and what the 
reset() callback does here - I do not understand, some documentation would 
help here, and demap() does not have to be executed under the lock (as 
map() is not executed under the lock).

btw why demap, not unmap? :)


-- 
Alexey

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
  2015-03-19  5:27     ` Alexey Kardashevskiy
@ 2015-03-19 13:34       ` Sowmini Varadhan
  -1 siblings, 0 replies; 71+ messages in thread
From: Sowmini Varadhan @ 2015-03-19 13:34 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alexey Kardashevskiy, Anton Blanchard, paulus, sowmini.varadhan,
	sparclinux, linuxppc-dev, David Miller


On 03/19/2015 02:01 PM, Benjamin Herrenschmidt wrote:

Ben> One thing I noticed is the asymetry in your code between the alloc
Ben> and the free path. The alloc path is similar to us in that the lock
Ben> covers the allocation and that's about it, there's no actual mapping to
Ben> the HW done, it's done by the caller level right ?

yes, the only constraint  is that the h/w alloc transaction should be
done after the arena-alloc, whereas for the unmap, the h/w  transaction
should happen first, and arena-unmap should happen after.

Ben> The free path however, in your case, takes the lock and calls back into
Ben> "demap" (which I assume is what removes the translation from the HW)
Ben> with the lock held. There's also some mapping between cookies
Ben> and index which here too isn't exposed to the alloc side but is
Ben> exposed to the free side.

Regarding the ->demap indirection- somewhere between V1 and V2, I 
realized that, at least for sun4v, it's not necessary to hold
the pool lock when doing the unmap, (V1 had originally passed this
a the ->demap). Revisiting the LDC change, I think that even that
has no pool-specific info that needs to be passed in, so possibly the
->demap is not required at all?

I can remove that, and re-verify the LDC code (though I might
not be able to get to this till early next week, as I'm travelling
at the moment).

About the cookie_to_index, that came out of observation of the
LDC code (ldc_cookie_to_index in patchset 3). In the LDC case, 
the workflow is approximately
   base = alloc_npages(..); /* calls iommu_tbl_range_alloc *.
   /* set up cookie_state using base */
   /* populate cookies calling fill_cookies() -> make_cookie() */
The make_cookie() is the inverse operation of cookie_to_index()
(afaict, the code is not very well commented, I'm afraid), but 
I need that indirection to figure out which bitmap to clear.

I dont know if there's a better way to do this, or if
the ->cookie_to_index can get more complex for other IOMMU users

Ben> One thing that Alexey is doing on our side is to move some of the
Ben> hooks to manipulate the underlying TCEs (ie. iommu PTEs) from our
Ben> global ppc_md. data structure to a new iommu_table_ops, so your patches
Ben> will definitely collide with our current work so we'll have to figure
Ben> out how things can made to match. We might be able to move more than
Ben> just the allocator to the generic code, and the whole implementation of
Ben> map_sg/unmap_sg if we have the right set of "ops", unless you see a
Ben> reason why that wouldn't work for you ?

I cant think of why that wont work, though it would help to see
the patch itself..


Ben> We also need to add some additional platform specific fields to certain
Ben> iommu table instances to deal with some KVM related tracking of pinned
Ben> "DMAble" memory, here too we might have to be creative and possibly
Ben> enclose the generic iommu_table in a platform specific variant.
Ben> 
Ben> Alexey, any other comment ?
Ben> 
Ben> Cheers,
Ben> Ben.
Ben> 
Ben> 
Ben> 
Ben> --
Ben> To unsubscribe from this list: send the line "unsubscribe sparclinux" in
Ben> the body of a message to majordomo@vger.kernel.org
Ben> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ben> 
Ben>Alexey, any other comment ?

On (03/19/15 16:27), Alexey Kardashevskiy wrote:
Alexey> 
Alexey> Agree about missing symmetry. In general, I would call it "zoned
Alexey> pool-locked memory allocator"-ish rather than iommu_table and have
Alexey> no callbacks there.
Alexey> 
Alexey> The iommu_tbl_range_free() caller could call cookie_to_index() and

Problem is that tbl_range_free itself needs the `entry' from
-Alexey>cookie_to_index.. dont know if there's a way to move the code
to avoid that..

Alexey> what the reset() callback does here - I do not understand, some

The -Alexey>reset callback came out of the sun4u use-case. Davem might
have more history here than I do, but my understanding is that the
iommu_flushall() was needed on the older sun4u architectures, where
there was on intermediating HV?

Alexey> documentation would help here, and demap() does not have to be
Alexey> executed under the lock (as map() is not executed under the lock).
Alexey> 
Alexey> btw why demap, not unmap? :)

Maybe neither is needed, as you folks made me realize above.

--Sowmini


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
@ 2015-03-19 13:34       ` Sowmini Varadhan
  0 siblings, 0 replies; 71+ messages in thread
From: Sowmini Varadhan @ 2015-03-19 13:34 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alexey Kardashevskiy, Anton Blanchard, paulus, sowmini.varadhan,
	sparclinux, linuxppc-dev, David Miller


On 03/19/2015 02:01 PM, Benjamin Herrenschmidt wrote:

Ben> One thing I noticed is the asymetry in your code between the alloc
Ben> and the free path. The alloc path is similar to us in that the lock
Ben> covers the allocation and that's about it, there's no actual mapping to
Ben> the HW done, it's done by the caller level right ?

yes, the only constraint  is that the h/w alloc transaction should be
done after the arena-alloc, whereas for the unmap, the h/w  transaction
should happen first, and arena-unmap should happen after.

Ben> The free path however, in your case, takes the lock and calls back into
Ben> "demap" (which I assume is what removes the translation from the HW)
Ben> with the lock held. There's also some mapping between cookies
Ben> and index which here too isn't exposed to the alloc side but is
Ben> exposed to the free side.

Regarding the ->demap indirection- somewhere between V1 and V2, I 
realized that, at least for sun4v, it's not necessary to hold
the pool lock when doing the unmap, (V1 had originally passed this
a the ->demap). Revisiting the LDC change, I think that even that
has no pool-specific info that needs to be passed in, so possibly the
->demap is not required at all?

I can remove that, and re-verify the LDC code (though I might
not be able to get to this till early next week, as I'm travelling
at the moment).

About the cookie_to_index, that came out of observation of the
LDC code (ldc_cookie_to_index in patchset 3). In the LDC case, 
the workflow is approximately
   base = alloc_npages(..); /* calls iommu_tbl_range_alloc *.
   /* set up cookie_state using base */
   /* populate cookies calling fill_cookies() -> make_cookie() */
The make_cookie() is the inverse operation of cookie_to_index()
(afaict, the code is not very well commented, I'm afraid), but 
I need that indirection to figure out which bitmap to clear.

I dont know if there's a better way to do this, or if
the ->cookie_to_index can get more complex for other IOMMU users

Ben> One thing that Alexey is doing on our side is to move some of the
Ben> hooks to manipulate the underlying TCEs (ie. iommu PTEs) from our
Ben> global ppc_md. data structure to a new iommu_table_ops, so your patches
Ben> will definitely collide with our current work so we'll have to figure
Ben> out how things can made to match. We might be able to move more than
Ben> just the allocator to the generic code, and the whole implementation of
Ben> map_sg/unmap_sg if we have the right set of "ops", unless you see a
Ben> reason why that wouldn't work for you ?

I cant think of why that wont work, though it would help to see
the patch itself..


Ben> We also need to add some additional platform specific fields to certain
Ben> iommu table instances to deal with some KVM related tracking of pinned
Ben> "DMAble" memory, here too we might have to be creative and possibly
Ben> enclose the generic iommu_table in a platform specific variant.
Ben> 
Ben> Alexey, any other comment ?
Ben> 
Ben> Cheers,
Ben> Ben.
Ben> 
Ben> 
Ben> 
Ben> --
Ben> To unsubscribe from this list: send the line "unsubscribe sparclinux" in
Ben> the body of a message to majordomo@vger.kernel.org
Ben> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ben> 
Ben>Alexey, any other comment ?

On (03/19/15 16:27), Alexey Kardashevskiy wrote:
Alexey> 
Alexey> Agree about missing symmetry. In general, I would call it "zoned
Alexey> pool-locked memory allocator"-ish rather than iommu_table and have
Alexey> no callbacks there.
Alexey> 
Alexey> The iommu_tbl_range_free() caller could call cookie_to_index() and

Problem is that tbl_range_free itself needs the `entry' from
-Alexey>cookie_to_index.. dont know if there's a way to move the code
to avoid that..

Alexey> what the reset() callback does here - I do not understand, some

The -Alexey>reset callback came out of the sun4u use-case. Davem might
have more history here than I do, but my understanding is that the
iommu_flushall() was needed on the older sun4u architectures, where
there was on intermediating HV?

Alexey> documentation would help here, and demap() does not have to be
Alexey> executed under the lock (as map() is not executed under the lock).
Alexey> 
Alexey> btw why demap, not unmap? :)

Maybe neither is needed, as you folks made me realize above.

--Sowmini

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
  2015-03-19  5:27     ` Alexey Kardashevskiy
@ 2015-03-22 19:27       ` Sowmini Varadhan
  -1 siblings, 0 replies; 71+ messages in thread
From: Sowmini Varadhan @ 2015-03-22 19:27 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alexey Kardashevskiy, Anton Blanchard, paulus, sparclinux,
	linuxppc-dev, David Miller

Turned out that I was able to iterate over it, and remove
both the ->cookie_to_index and the ->demap indirection from
iommu_tbl_ops.

That leaves only the odd iommu_flushall() hook, I'm trying
to find the history behind that (needed for sun4u platforms,
afaik, and not sure if there are other ways to achieve this).

Just sent out a new patch-set, which should reach all the 
people/lists on this thread.

If it's possible, it would also help to share any preliminary
code-samples with the iommu_table_ops you folks have in mind.

Please take a look at patchv5, and share your thoughts.

--Sowmini

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
@ 2015-03-22 19:27       ` Sowmini Varadhan
  0 siblings, 0 replies; 71+ messages in thread
From: Sowmini Varadhan @ 2015-03-22 19:27 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alexey Kardashevskiy, Anton Blanchard, paulus, sparclinux,
	linuxppc-dev, David Miller

Turned out that I was able to iterate over it, and remove
both the ->cookie_to_index and the ->demap indirection from
iommu_tbl_ops.

That leaves only the odd iommu_flushall() hook, I'm trying
to find the history behind that (needed for sun4u platforms,
afaik, and not sure if there are other ways to achieve this).

Just sent out a new patch-set, which should reach all the 
people/lists on this thread.

If it's possible, it would also help to share any preliminary
code-samples with the iommu_table_ops you folks have in mind.

Please take a look at patchv5, and share your thoughts.

--Sowmini

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
  2015-03-19  2:25 ` David Miller
@ 2015-03-22 19:36   ` Arnd Bergmann
  -1 siblings, 0 replies; 71+ messages in thread
From: Arnd Bergmann @ 2015-03-22 19:36 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: sparclinux, sowmini.varadhan, paulus, David Miller

On Thursday 19 March 2015, David Miller wrote:
> PowerPC folks, we're trying to kill the locking contention in our
> IOMMU allocators and noticed that you guys have a nice solution to
> this in your IOMMU code.
> 
> Sowmini put together a patch series that tries to extract out the
> generic parts of your code and place it in lib/iommu-common.c so
> that both Sparc and PowerPC can make use of it.
> 
> The real test is if powerpc can be converted to use it.
> 
> So if you guys could please take a look at the current version of
> this series at:
> 
> http://patchwork.ozlabs.org/patch/449712/
> http://patchwork.ozlabs.org/patch/449713/
> http://patchwork.ozlabs.org/patch/449715/
> 
> and give us some feedback that would be great.
> 
> After we sort out making this work properly for both architectures we
> can figure out who merges which parts via what tree(s)

How does this relate to the ARM implementation? There is currently
an effort going on to make that one shared with ARM64 and possibly
x86. Has anyone looked at both the PowerPC and ARM ways of doing the
allocation to see if we could pick one of the two to work on
all architectures?

	Arnd

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
@ 2015-03-22 19:36   ` Arnd Bergmann
  0 siblings, 0 replies; 71+ messages in thread
From: Arnd Bergmann @ 2015-03-22 19:36 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: sparclinux, sowmini.varadhan, paulus, David Miller

On Thursday 19 March 2015, David Miller wrote:
> PowerPC folks, we're trying to kill the locking contention in our
> IOMMU allocators and noticed that you guys have a nice solution to
> this in your IOMMU code.
> 
> Sowmini put together a patch series that tries to extract out the
> generic parts of your code and place it in lib/iommu-common.c so
> that both Sparc and PowerPC can make use of it.
> 
> The real test is if powerpc can be converted to use it.
> 
> So if you guys could please take a look at the current version of
> this series at:
> 
> http://patchwork.ozlabs.org/patch/449712/
> http://patchwork.ozlabs.org/patch/449713/
> http://patchwork.ozlabs.org/patch/449715/
> 
> and give us some feedback that would be great.
> 
> After we sort out making this work properly for both architectures we
> can figure out who merges which parts via what tree(s)

How does this relate to the ARM implementation? There is currently
an effort going on to make that one shared with ARM64 and possibly
x86. Has anyone looked at both the PowerPC and ARM ways of doing the
allocation to see if we could pick one of the two to work on
all architectures?

	Arnd

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
  2015-03-22 19:36   ` Arnd Bergmann
@ 2015-03-22 22:02     ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 71+ messages in thread
From: Benjamin Herrenschmidt @ 2015-03-22 22:02 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: sparclinux, sowmini.varadhan, linuxppc-dev, David Miller, paulus

On Sun, 2015-03-22 at 20:36 +0100, Arnd Bergmann wrote:

> How does this relate to the ARM implementation? There is currently
> an effort going on to make that one shared with ARM64 and possibly
> x86. Has anyone looked at both the PowerPC and ARM ways of doing the
> allocation to see if we could pick one of the two to work on
> all architectures?

What I see in ARM is horribly complex, I can't quite make sense of it
in a couple of minutes of looking at it, and doesn't seem to address the
basic issue we are addressing here which is the splitting of the iommu
table lock.

I think we are looking at two very different approaches here. One is
to deal with horrendously broken HW which is an ARM SoC vendors
signature :) The other one is about performances and scalability on sane
HW.

Cheers,
Ben.


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
@ 2015-03-22 22:02     ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 71+ messages in thread
From: Benjamin Herrenschmidt @ 2015-03-22 22:02 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: sparclinux, sowmini.varadhan, linuxppc-dev, David Miller, paulus

On Sun, 2015-03-22 at 20:36 +0100, Arnd Bergmann wrote:

> How does this relate to the ARM implementation? There is currently
> an effort going on to make that one shared with ARM64 and possibly
> x86. Has anyone looked at both the PowerPC and ARM ways of doing the
> allocation to see if we could pick one of the two to work on
> all architectures?

What I see in ARM is horribly complex, I can't quite make sense of it
in a couple of minutes of looking at it, and doesn't seem to address the
basic issue we are addressing here which is the splitting of the iommu
table lock.

I think we are looking at two very different approaches here. One is
to deal with horrendously broken HW which is an ARM SoC vendors
signature :) The other one is about performances and scalability on sane
HW.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
  2015-03-22 22:02     ` Benjamin Herrenschmidt
@ 2015-03-22 22:07       ` Sowmini Varadhan
  -1 siblings, 0 replies; 71+ messages in thread
From: Sowmini Varadhan @ 2015-03-22 22:07 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: sparclinux, paulus, linuxppc-dev, David Miller, Arnd Bergmann

On (03/23/15 09:02), Benjamin Herrenschmidt wrote:
> > How does this relate to the ARM implementation? There is currently
> > an effort going on to make that one shared with ARM64 and possibly
> > x86. Has anyone looked at both the PowerPC and ARM ways of doing the
> > allocation to see if we could pick one of the two to work on
> > all architectures?
> 
> What I see in ARM is horribly complex, I can't quite make sense of it
> in a couple of minutes of looking at it, and doesn't seem to address the
> basic issue we are addressing here which is the splitting of the iommu
> table lock.

Amen to that.. I thought it was just me :-)

I plan to go through the code to see if/where the armd iommu code
does its locking and achieves its parallelism, but the mapping
between the sparc/powerpc approach and armd is not immediately obvious
to me.

--Sowmini


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
@ 2015-03-22 22:07       ` Sowmini Varadhan
  0 siblings, 0 replies; 71+ messages in thread
From: Sowmini Varadhan @ 2015-03-22 22:07 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: sparclinux, paulus, linuxppc-dev, David Miller, Arnd Bergmann

On (03/23/15 09:02), Benjamin Herrenschmidt wrote:
> > How does this relate to the ARM implementation? There is currently
> > an effort going on to make that one shared with ARM64 and possibly
> > x86. Has anyone looked at both the PowerPC and ARM ways of doing the
> > allocation to see if we could pick one of the two to work on
> > all architectures?
> 
> What I see in ARM is horribly complex, I can't quite make sense of it
> in a couple of minutes of looking at it, and doesn't seem to address the
> basic issue we are addressing here which is the splitting of the iommu
> table lock.

Amen to that.. I thought it was just me :-)

I plan to go through the code to see if/where the armd iommu code
does its locking and achieves its parallelism, but the mapping
between the sparc/powerpc approach and armd is not immediately obvious
to me.

--Sowmini

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
  2015-03-22 22:07       ` Sowmini Varadhan
@ 2015-03-22 22:22         ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 71+ messages in thread
From: Benjamin Herrenschmidt @ 2015-03-22 22:22 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: sparclinux, paulus, linuxppc-dev, David Miller, Arnd Bergmann

On Sun, 2015-03-22 at 18:07 -0400, Sowmini Varadhan wrote:
> On (03/23/15 09:02), Benjamin Herrenschmidt wrote:
> > > How does this relate to the ARM implementation? There is currently
> > > an effort going on to make that one shared with ARM64 and possibly
> > > x86. Has anyone looked at both the PowerPC and ARM ways of doing the
> > > allocation to see if we could pick one of the two to work on
> > > all architectures?
> > 
> > What I see in ARM is horribly complex, I can't quite make sense of it
> > in a couple of minutes of looking at it, and doesn't seem to address the
> > basic issue we are addressing here which is the splitting of the iommu
> > table lock.
> 
> Amen to that.. I thought it was just me :-)
> 
> I plan to go through the code to see if/where the armd iommu code
> does its locking and achieves its parallelism, but the mapping
> between the sparc/powerpc approach and armd is not immediately obvious
> to me.

We might have more chance with x86_64 though... They would surely have
similar scalability issues.

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
@ 2015-03-22 22:22         ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 71+ messages in thread
From: Benjamin Herrenschmidt @ 2015-03-22 22:22 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: sparclinux, paulus, linuxppc-dev, David Miller, Arnd Bergmann

On Sun, 2015-03-22 at 18:07 -0400, Sowmini Varadhan wrote:
> On (03/23/15 09:02), Benjamin Herrenschmidt wrote:
> > > How does this relate to the ARM implementation? There is currently
> > > an effort going on to make that one shared with ARM64 and possibly
> > > x86. Has anyone looked at both the PowerPC and ARM ways of doing the
> > > allocation to see if we could pick one of the two to work on
> > > all architectures?
> > 
> > What I see in ARM is horribly complex, I can't quite make sense of it
> > in a couple of minutes of looking at it, and doesn't seem to address the
> > basic issue we are addressing here which is the splitting of the iommu
> > table lock.
> 
> Amen to that.. I thought it was just me :-)
> 
> I plan to go through the code to see if/where the armd iommu code
> does its locking and achieves its parallelism, but the mapping
> between the sparc/powerpc approach and armd is not immediately obvious
> to me.

We might have more chance with x86_64 though... They would surely have
similar scalability issues.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
  2015-03-22 22:22         ` Benjamin Herrenschmidt
@ 2015-03-23  6:04           ` Arnd Bergmann
  -1 siblings, 0 replies; 71+ messages in thread
From: Arnd Bergmann @ 2015-03-23  6:04 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: sparclinux, Sowmini Varadhan, linuxppc-dev, David Miller, paulus

On Sunday 22 March 2015, Benjamin Herrenschmidt wrote:
> On Sun, 2015-03-22 at 18:07 -0400, Sowmini Varadhan wrote:
> > On (03/23/15 09:02), Benjamin Herrenschmidt wrote:
> > > > How does this relate to the ARM implementation? There is currently
> > > > an effort going on to make that one shared with ARM64 and possibly
> > > > x86. Has anyone looked at both the PowerPC and ARM ways of doing the
> > > > allocation to see if we could pick one of the two to work on
> > > > all architectures?
> > > 
> > > What I see in ARM is horribly complex, I can't quite make sense of it
> > > in a couple of minutes of looking at it, and doesn't seem to address the
> > > basic issue we are addressing here which is the splitting of the iommu
> > > table lock.
> > 
> > Amen to that.. I thought it was just me :-)

Two problems on ARM IOMMU hardware are:

- a very wide range of incompatible approaches, from a simple sw-loaded
  iotlb with just a few hugepage entries to some rather complex
  ones
- a rather overdesigned "SMMU" that will be used by half the SoCs, but
  with some of them implementing slightly incompatible variants.

> > I plan to go through the code to see if/where the armd iommu code
> > does its locking and achieves its parallelism, but the mapping
> > between the sparc/powerpc approach and armd is not immediately obvious
> > to me.

My guess is that the ARM code so far has been concerned mainly with
getting things to work in the first place, but scalability problems
will only be seen when there are faster CPU cores become available.

> We might have more chance with x86_64 though... They would surely have
> similar scalability issues.
> 

I believe the complexity of the modern Intel and AMD IOMMUs is similar
to the ARM SMMU, which appears to have been modeled after the Intel
one to some degree.

	Arnd

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
@ 2015-03-23  6:04           ` Arnd Bergmann
  0 siblings, 0 replies; 71+ messages in thread
From: Arnd Bergmann @ 2015-03-23  6:04 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: sparclinux, Sowmini Varadhan, linuxppc-dev, David Miller, paulus

On Sunday 22 March 2015, Benjamin Herrenschmidt wrote:
> On Sun, 2015-03-22 at 18:07 -0400, Sowmini Varadhan wrote:
> > On (03/23/15 09:02), Benjamin Herrenschmidt wrote:
> > > > How does this relate to the ARM implementation? There is currently
> > > > an effort going on to make that one shared with ARM64 and possibly
> > > > x86. Has anyone looked at both the PowerPC and ARM ways of doing the
> > > > allocation to see if we could pick one of the two to work on
> > > > all architectures?
> > > 
> > > What I see in ARM is horribly complex, I can't quite make sense of it
> > > in a couple of minutes of looking at it, and doesn't seem to address the
> > > basic issue we are addressing here which is the splitting of the iommu
> > > table lock.
> > 
> > Amen to that.. I thought it was just me :-)

Two problems on ARM IOMMU hardware are:

- a very wide range of incompatible approaches, from a simple sw-loaded
  iotlb with just a few hugepage entries to some rather complex
  ones
- a rather overdesigned "SMMU" that will be used by half the SoCs, but
  with some of them implementing slightly incompatible variants.

> > I plan to go through the code to see if/where the armd iommu code
> > does its locking and achieves its parallelism, but the mapping
> > between the sparc/powerpc approach and armd is not immediately obvious
> > to me.

My guess is that the ARM code so far has been concerned mainly with
getting things to work in the first place, but scalability problems
will only be seen when there are faster CPU cores become available.

> We might have more chance with x86_64 though... They would surely have
> similar scalability issues.
> 

I believe the complexity of the modern Intel and AMD IOMMUs is similar
to the ARM SMMU, which appears to have been modeled after the Intel
one to some degree.

	Arnd

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
  2015-03-23  6:04           ` Arnd Bergmann
@ 2015-03-23 11:04             ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 71+ messages in thread
From: Benjamin Herrenschmidt @ 2015-03-23 11:04 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: sparclinux, Sowmini Varadhan, linuxppc-dev, David Miller, paulus

On Mon, 2015-03-23 at 07:04 +0100, Arnd Bergmann wrote:
> 
> My guess is that the ARM code so far has been concerned mainly with
> getting things to work in the first place, but scalability problems
> will only be seen when there are faster CPU cores become available.

In any case, I think this is mostly a non-issue. The complexity of the
ARM code is in various areas related to making shit work (handling
coherent allocations mostly) but only remotely related to the actual
iommu DMA space allocator (iova in ARM as far as I understand the code)
which is pretty standard.

The work Sowmini is doing is about specifically the allocator. Making
our (powerpc) allocator generic since it has some nice scalability
features.

In fact, what Aik and I have been pushing and Sowmini is close to
achieving is to mostly disconnect that allocator from the rest of the
iommu management (the caller).

So in the end, the allocator itself should be splitable into something
separate that resides in lib/ or similar, which ARM can chose to use as
well.

Sowmini: it might be worthwhile creating a separate data structure
iommu_virt_zone or something like that which is what the allocator works
on exclusively, which we would just embed into sparc and powerpc. The
only remaining problem I see is your "reset" op, but I think we should
be able to sort it out once we understand what it's there for :-)

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
@ 2015-03-23 11:04             ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 71+ messages in thread
From: Benjamin Herrenschmidt @ 2015-03-23 11:04 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: sparclinux, Sowmini Varadhan, linuxppc-dev, David Miller, paulus

On Mon, 2015-03-23 at 07:04 +0100, Arnd Bergmann wrote:
> 
> My guess is that the ARM code so far has been concerned mainly with
> getting things to work in the first place, but scalability problems
> will only be seen when there are faster CPU cores become available.

In any case, I think this is mostly a non-issue. The complexity of the
ARM code is in various areas related to making shit work (handling
coherent allocations mostly) but only remotely related to the actual
iommu DMA space allocator (iova in ARM as far as I understand the code)
which is pretty standard.

The work Sowmini is doing is about specifically the allocator. Making
our (powerpc) allocator generic since it has some nice scalability
features.

In fact, what Aik and I have been pushing and Sowmini is close to
achieving is to mostly disconnect that allocator from the rest of the
iommu management (the caller).

So in the end, the allocator itself should be splitable into something
separate that resides in lib/ or similar, which ARM can chose to use as
well.

Sowmini: it might be worthwhile creating a separate data structure
iommu_virt_zone or something like that which is what the allocator works
on exclusively, which we would just embed into sparc and powerpc. The
only remaining problem I see is your "reset" op, but I think we should
be able to sort it out once we understand what it's there for :-)

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
  2015-03-22 19:27       ` Sowmini Varadhan
@ 2015-03-23 16:29         ` David Miller
  -1 siblings, 0 replies; 71+ messages in thread
From: David Miller @ 2015-03-23 16:29 UTC (permalink / raw)
  To: sowmini.varadhan; +Cc: aik, aik, anton, paulus, sparclinux, linuxppc-dev

From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Date: Sun, 22 Mar 2015 15:27:26 -0400

> That leaves only the odd iommu_flushall() hook, I'm trying
> to find the history behind that (needed for sun4u platforms,
> afaik, and not sure if there are other ways to achieve this).

In order to elide the IOMMU flush as much as possible, I implemnented
a scheme for sun4u wherein we always allocated from low IOMMU
addresses to high IOMMU addresses.

In this regime, we only need to flush the IOMMU when we rolled over
back to low IOMMU addresses during an allocation.

It made a noticable difference in performance.

Unfortunately, with sun4v and the hypervisor, I'm not allowed to
control when the IOMMU flush happens, it has to occur on every
single IOMMU mapping change.  So this optimization was no longer
possible there.

Anyways, that's the history behind it.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
@ 2015-03-23 16:29         ` David Miller
  0 siblings, 0 replies; 71+ messages in thread
From: David Miller @ 2015-03-23 16:29 UTC (permalink / raw)
  To: sowmini.varadhan; +Cc: aik, aik, anton, paulus, sparclinux, linuxppc-dev

From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Date: Sun, 22 Mar 2015 15:27:26 -0400

> That leaves only the odd iommu_flushall() hook, I'm trying
> to find the history behind that (needed for sun4u platforms,
> afaik, and not sure if there are other ways to achieve this).

In order to elide the IOMMU flush as much as possible, I implemnented
a scheme for sun4u wherein we always allocated from low IOMMU
addresses to high IOMMU addresses.

In this regime, we only need to flush the IOMMU when we rolled over
back to low IOMMU addresses during an allocation.

It made a noticable difference in performance.

Unfortunately, with sun4v and the hypervisor, I'm not allowed to
control when the IOMMU flush happens, it has to occur on every
single IOMMU mapping change.  So this optimization was no longer
possible there.

Anyways, that's the history behind it.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
  2015-03-23 16:29         ` David Miller
@ 2015-03-23 16:54           ` Sowmini Varadhan
  -1 siblings, 0 replies; 71+ messages in thread
From: Sowmini Varadhan @ 2015-03-23 16:54 UTC (permalink / raw)
  To: David Miller; +Cc: aik, aik, anton, paulus, sparclinux, linuxppc-dev

On (03/23/15 12:29), David Miller wrote:
> 
> In order to elide the IOMMU flush as much as possible, I implemnented
> a scheme for sun4u wherein we always allocated from low IOMMU
> addresses to high IOMMU addresses.
> 
> In this regime, we only need to flush the IOMMU when we rolled over
> back to low IOMMU addresses during an allocation.
> 
> It made a noticable difference in performance.
> 
> Unfortunately, with sun4v and the hypervisor, I'm not allowed to
> control when the IOMMU flush happens, it has to occur on every
> single IOMMU mapping change.  So this optimization was no longer
> possible there.
> 
> Anyways, that's the history behind it.
> --

I see.

If it was only an optimization (i.e., removing it would not break
any functionality), and if this was done for older hardware,
and *if* we believe that the direction of most architectures is to 
follow the sun4v/HV model, then, given that the sun4u code only uses 1 
arena pool anyway, one thought that I have for refactoring this
is the following:

- Caller of iommu_tbl_range_alloc() can do the flush_all if they 
  see start <= end for the one single pool 
- lose the other ->flush_all invocation (i.e., the one that is
  done when iommu_area_alloc() fails for pass = 0, and we reset
  start to 0 to roll-back)

that would avoid the need for any iommu_tbl_ops in my patch-set.

But it would imply that you would still take the perf hit for the roll-back
if we failed the pass = 0 iteration through iommu_area_alloc().
Perhaps this is an acceptable compromise in favor of cleaner code
(again, assuming that current/future archs will all follow the HV
based design).

--Sowmini
 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
@ 2015-03-23 16:54           ` Sowmini Varadhan
  0 siblings, 0 replies; 71+ messages in thread
From: Sowmini Varadhan @ 2015-03-23 16:54 UTC (permalink / raw)
  To: David Miller; +Cc: aik, aik, anton, paulus, sparclinux, linuxppc-dev

On (03/23/15 12:29), David Miller wrote:
> 
> In order to elide the IOMMU flush as much as possible, I implemnented
> a scheme for sun4u wherein we always allocated from low IOMMU
> addresses to high IOMMU addresses.
> 
> In this regime, we only need to flush the IOMMU when we rolled over
> back to low IOMMU addresses during an allocation.
> 
> It made a noticable difference in performance.
> 
> Unfortunately, with sun4v and the hypervisor, I'm not allowed to
> control when the IOMMU flush happens, it has to occur on every
> single IOMMU mapping change.  So this optimization was no longer
> possible there.
> 
> Anyways, that's the history behind it.
> --

I see.

If it was only an optimization (i.e., removing it would not break
any functionality), and if this was done for older hardware,
and *if* we believe that the direction of most architectures is to 
follow the sun4v/HV model, then, given that the sun4u code only uses 1 
arena pool anyway, one thought that I have for refactoring this
is the following:

- Caller of iommu_tbl_range_alloc() can do the flush_all if they 
  see start <= end for the one single pool 
- lose the other ->flush_all invocation (i.e., the one that is
  done when iommu_area_alloc() fails for pass == 0, and we reset
  start to 0 to roll-back)

that would avoid the need for any iommu_tbl_ops in my patch-set.

But it would imply that you would still take the perf hit for the roll-back
if we failed the pass == 0 iteration through iommu_area_alloc().
Perhaps this is an acceptable compromise in favor of cleaner code
(again, assuming that current/future archs will all follow the HV
based design).

--Sowmini
 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
  2015-03-23 11:04             ` Benjamin Herrenschmidt
@ 2015-03-23 18:45               ` Arnd Bergmann
  -1 siblings, 0 replies; 71+ messages in thread
From: Arnd Bergmann @ 2015-03-23 18:45 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: sparclinux, paulus, David Miller, Sowmini Varadhan

On Monday 23 March 2015, Benjamin Herrenschmidt wrote:
> On Mon, 2015-03-23 at 07:04 +0100, Arnd Bergmann wrote:
> > 
> > My guess is that the ARM code so far has been concerned mainly with
> > getting things to work in the first place, but scalability problems
> > will only be seen when there are faster CPU cores become available.
> 
> In any case, I think this is mostly a non-issue. The complexity of the
> ARM code is in various areas related to making shit work (handling
> coherent allocations mostly) but only remotely related to the actual
> iommu DMA space allocator (iova in ARM as far as I understand the code)
> which is pretty standard.

Ok, got it. Thanks for explaining tht part.

> The work Sowmini is doing is about specifically the allocator. Making
> our (powerpc) allocator generic since it has some nice scalability
> features.
> 
> In fact, what Aik and I have been pushing and Sowmini is close to
> achieving is to mostly disconnect that allocator from the rest of the
> iommu management (the caller).
> 
> So in the end, the allocator itself should be splitable into something
> separate that resides in lib/ or similar, which ARM can chose to use as
> well.

Yes, this sounds like a good idea. I'm currently at ELC and will bring this
up with the people working on ARM IOMMU support.

	Arnd

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
@ 2015-03-23 18:45               ` Arnd Bergmann
  0 siblings, 0 replies; 71+ messages in thread
From: Arnd Bergmann @ 2015-03-23 18:45 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: sparclinux, paulus, David Miller, Sowmini Varadhan

On Monday 23 March 2015, Benjamin Herrenschmidt wrote:
> On Mon, 2015-03-23 at 07:04 +0100, Arnd Bergmann wrote:
> > 
> > My guess is that the ARM code so far has been concerned mainly with
> > getting things to work in the first place, but scalability problems
> > will only be seen when there are faster CPU cores become available.
> 
> In any case, I think this is mostly a non-issue. The complexity of the
> ARM code is in various areas related to making shit work (handling
> coherent allocations mostly) but only remotely related to the actual
> iommu DMA space allocator (iova in ARM as far as I understand the code)
> which is pretty standard.

Ok, got it. Thanks for explaining tht part.

> The work Sowmini is doing is about specifically the allocator. Making
> our (powerpc) allocator generic since it has some nice scalability
> features.
> 
> In fact, what Aik and I have been pushing and Sowmini is close to
> achieving is to mostly disconnect that allocator from the rest of the
> iommu management (the caller).
> 
> So in the end, the allocator itself should be splitable into something
> separate that resides in lib/ or similar, which ARM can chose to use as
> well.

Yes, this sounds like a good idea. I'm currently at ELC and will bring this
up with the people working on ARM IOMMU support.

	Arnd

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
  2015-03-23 16:54           ` Sowmini Varadhan
@ 2015-03-23 19:05             ` David Miller
  -1 siblings, 0 replies; 71+ messages in thread
From: David Miller @ 2015-03-23 19:05 UTC (permalink / raw)
  To: sowmini.varadhan; +Cc: aik, aik, anton, paulus, sparclinux, linuxppc-dev

From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Date: Mon, 23 Mar 2015 12:54:06 -0400

> If it was only an optimization (i.e., removing it would not break
> any functionality), and if this was done for older hardware,
> and *if* we believe that the direction of most architectures is to 
> follow the sun4v/HV model, then, given that the sun4u code only uses 1 
> arena pool anyway, one thought that I have for refactoring this
> is the following:

Why add performance regressions to old machines who already are
suffering too much from all the bloat we are constantly adding to the
kernel?

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
@ 2015-03-23 19:05             ` David Miller
  0 siblings, 0 replies; 71+ messages in thread
From: David Miller @ 2015-03-23 19:05 UTC (permalink / raw)
  To: sowmini.varadhan; +Cc: aik, aik, anton, paulus, sparclinux, linuxppc-dev

From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Date: Mon, 23 Mar 2015 12:54:06 -0400

> If it was only an optimization (i.e., removing it would not break
> any functionality), and if this was done for older hardware,
> and *if* we believe that the direction of most architectures is to 
> follow the sun4v/HV model, then, given that the sun4u code only uses 1 
> arena pool anyway, one thought that I have for refactoring this
> is the following:

Why add performance regressions to old machines who already are
suffering too much from all the bloat we are constantly adding to the
kernel?

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
  2015-03-23 19:05             ` David Miller
@ 2015-03-23 19:09               ` Sowmini Varadhan
  -1 siblings, 0 replies; 71+ messages in thread
From: Sowmini Varadhan @ 2015-03-23 19:09 UTC (permalink / raw)
  To: David Miller; +Cc: aik, aik, anton, paulus, sparclinux, linuxppc-dev

On (03/23/15 15:05), David Miller wrote:
> 
> Why add performance regressions to old machines who already are
> suffering too much from all the bloat we are constantly adding to the
> kernel?

I have no personal opinion on this- it's a matter of  choosing
whether we want to have some extra baggage in the library to support
old hardware, or keep the code lean-and-mean for the future,
where the model has changed.

I'll let others on this list make the call.

--Sowmini


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
@ 2015-03-23 19:09               ` Sowmini Varadhan
  0 siblings, 0 replies; 71+ messages in thread
From: Sowmini Varadhan @ 2015-03-23 19:09 UTC (permalink / raw)
  To: David Miller; +Cc: aik, aik, anton, paulus, sparclinux, linuxppc-dev

On (03/23/15 15:05), David Miller wrote:
> 
> Why add performance regressions to old machines who already are
> suffering too much from all the bloat we are constantly adding to the
> kernel?

I have no personal opinion on this- it's a matter of  choosing
whether we want to have some extra baggage in the library to support
old hardware, or keep the code lean-and-mean for the future,
where the model has changed.

I'll let others on this list make the call.

--Sowmini

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
  2015-03-23 19:05             ` David Miller
@ 2015-03-23 22:21               ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 71+ messages in thread
From: Benjamin Herrenschmidt @ 2015-03-23 22:21 UTC (permalink / raw)
  To: David Miller
  Cc: aik, aik, sowmini.varadhan, anton, paulus, sparclinux, linuxppc-dev

On Mon, 2015-03-23 at 15:05 -0400, David Miller wrote:
> From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> Date: Mon, 23 Mar 2015 12:54:06 -0400
> 
> > If it was only an optimization (i.e., removing it would not break
> > any functionality), and if this was done for older hardware,
> > and *if* we believe that the direction of most architectures is to 
> > follow the sun4v/HV model, then, given that the sun4u code only uses 1 
> > arena pool anyway, one thought that I have for refactoring this
> > is the following:
> 
> Why add performance regressions to old machines who already are
> suffering too much from all the bloat we are constantly adding to the
> kernel?

So we have two choices here that I can see:

 - Keep that old platform use the old/simpler allocator

 - Try to regain the bulk of that benefit with the new one

Sowmini, I see various options for the second choice. We could stick to
1 pool, and basically do as before, ie, if we fail on the first pass of
alloc, it means we wrap around and do a flush, I don't think that will
cause a significant degradation from today, do you ? We might have an
occasional additional flush but I would expect it to be in the noise.

Another option would be trickier, is to keep an additional bitmap
of "stale" entries. When an entry is freed, instead of freeing it
in the main bitmap, set a bit in the "stale" bit map. If we fail to
allocate, then flush, xor off the main bitmap bits using the stale
bitmap, and try again.

However, the second approach means that as the main bitmap gets full, we
will start allocating from remote pools, so it partially defeats the
pool system, unless we do everything locally per-pool (ie flush when the
pool is full before we fallback to another pool), in which case we go
back to flushing more often than we used to. But here too, the
difference might end up in the noise, we still flush order of magnitude
less than once per translation update.

Dave, what's your feeling there ? Does anybody around still have some
HW that we can test with ?

Sowmini, I think we can still kill the ops and have a separate data
structure exclusively concerned by allocations by having the alloc
functions take the lazy flush function as an argument (which can be
NULL), I don't think we should bother with ops.

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
@ 2015-03-23 22:21               ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 71+ messages in thread
From: Benjamin Herrenschmidt @ 2015-03-23 22:21 UTC (permalink / raw)
  To: David Miller
  Cc: aik, aik, sowmini.varadhan, anton, paulus, sparclinux, linuxppc-dev

On Mon, 2015-03-23 at 15:05 -0400, David Miller wrote:
> From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> Date: Mon, 23 Mar 2015 12:54:06 -0400
> 
> > If it was only an optimization (i.e., removing it would not break
> > any functionality), and if this was done for older hardware,
> > and *if* we believe that the direction of most architectures is to 
> > follow the sun4v/HV model, then, given that the sun4u code only uses 1 
> > arena pool anyway, one thought that I have for refactoring this
> > is the following:
> 
> Why add performance regressions to old machines who already are
> suffering too much from all the bloat we are constantly adding to the
> kernel?

So we have two choices here that I can see:

 - Keep that old platform use the old/simpler allocator

 - Try to regain the bulk of that benefit with the new one

Sowmini, I see various options for the second choice. We could stick to
1 pool, and basically do as before, ie, if we fail on the first pass of
alloc, it means we wrap around and do a flush, I don't think that will
cause a significant degradation from today, do you ? We might have an
occasional additional flush but I would expect it to be in the noise.

Another option would be trickier, is to keep an additional bitmap
of "stale" entries. When an entry is freed, instead of freeing it
in the main bitmap, set a bit in the "stale" bit map. If we fail to
allocate, then flush, xor off the main bitmap bits using the stale
bitmap, and try again.

However, the second approach means that as the main bitmap gets full, we
will start allocating from remote pools, so it partially defeats the
pool system, unless we do everything locally per-pool (ie flush when the
pool is full before we fallback to another pool), in which case we go
back to flushing more often than we used to. But here too, the
difference might end up in the noise, we still flush order of magnitude
less than once per translation update.

Dave, what's your feeling there ? Does anybody around still have some
HW that we can test with ?

Sowmini, I think we can still kill the ops and have a separate data
structure exclusively concerned by allocations by having the alloc
functions take the lazy flush function as an argument (which can be
NULL), I don't think we should bother with ops.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
  2015-03-23 16:54           ` Sowmini Varadhan
@ 2015-03-23 22:25             ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 71+ messages in thread
From: Benjamin Herrenschmidt @ 2015-03-23 22:25 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: aik, aik, anton, paulus, sparclinux, linuxppc-dev, David Miller

On Mon, 2015-03-23 at 12:54 -0400, Sowmini Varadhan wrote:

> If it was only an optimization (i.e., removing it would not break
> any functionality), and if this was done for older hardware,
> and *if* we believe that the direction of most architectures is to 
> follow the sun4v/HV model, then, given that the sun4u code only uses 1 
> arena pool anyway, one thought that I have for refactoring this
> is the following:
> 
> - Caller of iommu_tbl_range_alloc() can do the flush_all if they 
>   see start <= end for the one single pool 
> - lose the other ->flush_all invocation (i.e., the one that is
>   done when iommu_area_alloc() fails for pass = 0, and we reset
>   start to 0 to roll-back)
> 
> that would avoid the need for any iommu_tbl_ops in my patch-set.

You must hold the lock until you do the flush, otherwise somebody
else might allocate the not-yet-flushed areas and try to use them...
kaboom. However if that's the only callback left, pass it as an
argument.

> But it would imply that you would still take the perf hit for the roll-back
> if we failed the pass = 0 iteration through iommu_area_alloc().
> Perhaps this is an acceptable compromise in favor of cleaner code
> (again, assuming that current/future archs will all follow the HV
> based design).
> 
> --Sowmini
>  



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
@ 2015-03-23 22:25             ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 71+ messages in thread
From: Benjamin Herrenschmidt @ 2015-03-23 22:25 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: aik, aik, anton, paulus, sparclinux, linuxppc-dev, David Miller

On Mon, 2015-03-23 at 12:54 -0400, Sowmini Varadhan wrote:

> If it was only an optimization (i.e., removing it would not break
> any functionality), and if this was done for older hardware,
> and *if* we believe that the direction of most architectures is to 
> follow the sun4v/HV model, then, given that the sun4u code only uses 1 
> arena pool anyway, one thought that I have for refactoring this
> is the following:
> 
> - Caller of iommu_tbl_range_alloc() can do the flush_all if they 
>   see start <= end for the one single pool 
> - lose the other ->flush_all invocation (i.e., the one that is
>   done when iommu_area_alloc() fails for pass == 0, and we reset
>   start to 0 to roll-back)
> 
> that would avoid the need for any iommu_tbl_ops in my patch-set.

You must hold the lock until you do the flush, otherwise somebody
else might allocate the not-yet-flushed areas and try to use them...
kaboom. However if that's the only callback left, pass it as an
argument.

> But it would imply that you would still take the perf hit for the roll-back
> if we failed the pass == 0 iteration through iommu_area_alloc().
> Perhaps this is an acceptable compromise in favor of cleaner code
> (again, assuming that current/future archs will all follow the HV
> based design).
> 
> --Sowmini
>  

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
  2015-03-23 19:05             ` David Miller
@ 2015-03-23 22:36               ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 71+ messages in thread
From: Benjamin Herrenschmidt @ 2015-03-23 22:36 UTC (permalink / raw)
  To: David Miller
  Cc: aik, aik, sowmini.varadhan, anton, paulus, sparclinux, linuxppc-dev

On Mon, 2015-03-23 at 15:05 -0400, David Miller wrote:
> From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> Date: Mon, 23 Mar 2015 12:54:06 -0400
> 
> > If it was only an optimization (i.e., removing it would not break
> > any functionality), and if this was done for older hardware,
> > and *if* we believe that the direction of most architectures is to 
> > follow the sun4v/HV model, then, given that the sun4u code only uses 1 
> > arena pool anyway, one thought that I have for refactoring this
> > is the following:
> 
> Why add performance regressions to old machines who already are
> suffering too much from all the bloat we are constantly adding to the
> kernel?

Our allocator more/less allocates bottom-up from the last successful
allocation and tries again if it fails, so in essence, it is not so
different from what you have done. What we need to do is:

 - One pool only

 - Whenever the allocation is before the previous hint, do a flush, that
should only happen if a wrap around occurred or in some cases if the
device DMA mask forced it. I think we always update the hint whenever we
successfully allocate from the small pools.

 - Deal with the largealloc case. That's the contentious issue, see
below.

So if we ignore the largealloc split we have, we can pretty much recover
David's behaviour using our allocator. There might be the odd flush here
or there that we do and he didn't due to differences in implementation
details, but I doubt those are statistically relevant, as long as we
don't flush on every network packet, we should be good.

The largealloc issue is a different can of worms. We can try adding an
option to disable the largealloc business completely (instead of hard
wiring the "15", make that a variable and define that 0 means no
largealloc pool).

Or we can decide that large allocs are rare (typically
pci_alloc_consistent, ie, driver init time), and thus always flush on
them (or rather on free of a large chunk). David, what's your take
there ? I have a feeling that should work fine without a noticeable
performance issue...

I would also keep a "dirty" flag set on any free and cleared on any
flush to avoid more spurrious flushes, but here too the benefit might be
in the noise.

Cheers,
Ben.


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
@ 2015-03-23 22:36               ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 71+ messages in thread
From: Benjamin Herrenschmidt @ 2015-03-23 22:36 UTC (permalink / raw)
  To: David Miller
  Cc: aik, aik, sowmini.varadhan, anton, paulus, sparclinux, linuxppc-dev

On Mon, 2015-03-23 at 15:05 -0400, David Miller wrote:
> From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> Date: Mon, 23 Mar 2015 12:54:06 -0400
> 
> > If it was only an optimization (i.e., removing it would not break
> > any functionality), and if this was done for older hardware,
> > and *if* we believe that the direction of most architectures is to 
> > follow the sun4v/HV model, then, given that the sun4u code only uses 1 
> > arena pool anyway, one thought that I have for refactoring this
> > is the following:
> 
> Why add performance regressions to old machines who already are
> suffering too much from all the bloat we are constantly adding to the
> kernel?

Our allocator more/less allocates bottom-up from the last successful
allocation and tries again if it fails, so in essence, it is not so
different from what you have done. What we need to do is:

 - One pool only

 - Whenever the allocation is before the previous hint, do a flush, that
should only happen if a wrap around occurred or in some cases if the
device DMA mask forced it. I think we always update the hint whenever we
successfully allocate from the small pools.

 - Deal with the largealloc case. That's the contentious issue, see
below.

So if we ignore the largealloc split we have, we can pretty much recover
David's behaviour using our allocator. There might be the odd flush here
or there that we do and he didn't due to differences in implementation
details, but I doubt those are statistically relevant, as long as we
don't flush on every network packet, we should be good.

The largealloc issue is a different can of worms. We can try adding an
option to disable the largealloc business completely (instead of hard
wiring the "15", make that a variable and define that 0 means no
largealloc pool).

Or we can decide that large allocs are rare (typically
pci_alloc_consistent, ie, driver init time), and thus always flush on
them (or rather on free of a large chunk). David, what's your take
there ? I have a feeling that should work fine without a noticeable
performance issue...

I would also keep a "dirty" flag set on any free and cleared on any
flush to avoid more spurrious flushes, but here too the benefit might be
in the noise.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
  2015-03-23 22:21               ` Benjamin Herrenschmidt
@ 2015-03-23 23:08                 ` Sowmini Varadhan
  -1 siblings, 0 replies; 71+ messages in thread
From: Sowmini Varadhan @ 2015-03-23 23:08 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: aik, aik, anton, paulus, sparclinux, linuxppc-dev, David Miller

On (03/24/15 09:21), Benjamin Herrenschmidt wrote:
> 
> So we have two choices here that I can see:
> 
>  - Keep that old platform use the old/simpler allocator

Problem with that approach is that the base "struct iommu" structure
for sparc gets a split personality: the older one is used with
the older allocator, and other ugly things ensue.  (alternatively,
you end up duplicating a version of the code with the flush_all 
inlined). 

>  - Try to regain the bulk of that benefit with the new one
> 
> Sowmini, I see various options for the second choice. We could stick to
> 1 pool, and basically do as before, ie, if we fail on the first pass of
> alloc, it means we wrap around and do a flush, I don't think that will
> cause a significant degradation from today, do you ? We might have an
> occasional additional flush but I would expect it to be in the noise.

Isn't this essentially what I have in patch v5 here:
http://www.spinics.net/lists/sparclinux/msg13534.html

(the ops->reset is the flushall indirection, can be renamed if the
latter is preferred)

> Dave, what's your feeling there ? Does anybody around still have some
> HW that we can test with ?

I actually tested this on a V440 and a ultra45 (had a heck of a
time finding these, since the owners keep them turned off because
they are too noisy and consume too much power :-). Thus while I
have no opinion, I would not shed any tears if we lost this extra
perf-tweak in the interest of being earth-friendly  :-))

so testing it is not a problem, though I dont have any perf
benchmarks for them either.

> Sowmini, I think we can still kill the ops and have a separate data
> structure exclusively concerned by allocations by having the alloc
> functions take the lazy flush function as an argument (which can be
> NULL), I don't think we should bother with ops.

I dont quite follow what you have in mind? The caller would somehow
have to specify a flush_all indirection for the legacy platforms 

Also, you mention

> You must hold the lock until you do the flush, otherwise somebody
> else might allocate the not-yet-flushed areas and try to use them...
> kaboom. However if that's the only callback left, pass it as an
> argument.

Passing  in a function pointer to the flushall to iommu_tbl_range_alloc
would work, or we could pass it in as an arg to iommu_tbl_init,
and stash it in the struct iommu_table.

--Sowmini


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
@ 2015-03-23 23:08                 ` Sowmini Varadhan
  0 siblings, 0 replies; 71+ messages in thread
From: Sowmini Varadhan @ 2015-03-23 23:08 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: aik, aik, anton, paulus, sparclinux, linuxppc-dev, David Miller

On (03/24/15 09:21), Benjamin Herrenschmidt wrote:
> 
> So we have two choices here that I can see:
> 
>  - Keep that old platform use the old/simpler allocator

Problem with that approach is that the base "struct iommu" structure
for sparc gets a split personality: the older one is used with
the older allocator, and other ugly things ensue.  (alternatively,
you end up duplicating a version of the code with the flush_all 
inlined). 

>  - Try to regain the bulk of that benefit with the new one
> 
> Sowmini, I see various options for the second choice. We could stick to
> 1 pool, and basically do as before, ie, if we fail on the first pass of
> alloc, it means we wrap around and do a flush, I don't think that will
> cause a significant degradation from today, do you ? We might have an
> occasional additional flush but I would expect it to be in the noise.

Isn't this essentially what I have in patch v5 here:
http://www.spinics.net/lists/sparclinux/msg13534.html

(the ops->reset is the flushall indirection, can be renamed if the
latter is preferred)

> Dave, what's your feeling there ? Does anybody around still have some
> HW that we can test with ?

I actually tested this on a V440 and a ultra45 (had a heck of a
time finding these, since the owners keep them turned off because
they are too noisy and consume too much power :-). Thus while I
have no opinion, I would not shed any tears if we lost this extra
perf-tweak in the interest of being earth-friendly  :-))

so testing it is not a problem, though I dont have any perf
benchmarks for them either.

> Sowmini, I think we can still kill the ops and have a separate data
> structure exclusively concerned by allocations by having the alloc
> functions take the lazy flush function as an argument (which can be
> NULL), I don't think we should bother with ops.

I dont quite follow what you have in mind? The caller would somehow
have to specify a flush_all indirection for the legacy platforms 

Also, you mention

> You must hold the lock until you do the flush, otherwise somebody
> else might allocate the not-yet-flushed areas and try to use them...
> kaboom. However if that's the only callback left, pass it as an
> argument.

Passing  in a function pointer to the flushall to iommu_tbl_range_alloc
would work, or we could pass it in as an arg to iommu_tbl_init,
and stash it in the struct iommu_table.

--Sowmini

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
  2015-03-23 22:36               ` Benjamin Herrenschmidt
@ 2015-03-23 23:19                 ` Sowmini Varadhan
  -1 siblings, 0 replies; 71+ messages in thread
From: Sowmini Varadhan @ 2015-03-23 23:19 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: aik, aik, anton, paulus, sparclinux, linuxppc-dev, David Miller

On (03/24/15 09:36), Benjamin Herrenschmidt wrote:
> 
>  - One pool only
> 
>  - Whenever the allocation is before the previous hint, do a flush, that
> should only happen if a wrap around occurred or in some cases if the
> device DMA mask forced it. I think we always update the hint whenever we
> successfully allocate from the small pools.

right, I believe this is close to what I discussed in the previous
thread, and approx what I have in patchv5, except that the flush 
indirection can be passed as a function pointer, or via the table_ops.

> 
>  - Deal with the largealloc case. That's the contentious issue, see
> below.
  :
> The largealloc issue is a different can of worms. We can try adding an
> option to disable the largealloc business completely (instead of hard
> wiring the "15", make that a variable and define that 0 means no
> largealloc pool).

What I've tried to do is to have a bool large_pool arg passed
to iommu_tbl_pool_init. In my observation (instrumented for scsi, ixgbe), 
we never allocate more than 4 pages at a time, so I pass in 
large_pool = false for all the sparc platforms. 

> Or we can decide that large allocs are rare (typically
> pci_alloc_consistent, ie, driver init time), and thus always flush on
> them (or rather on free of a large chunk). David, what's your take
> there ? I have a feeling that should work fine without a noticeable
> performance issue...
> 
> I would also keep a "dirty" flag set on any free and cleared on any
> flush to avoid more spurrious flushes, but here too the benefit might be
> in the noise.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
@ 2015-03-23 23:19                 ` Sowmini Varadhan
  0 siblings, 0 replies; 71+ messages in thread
From: Sowmini Varadhan @ 2015-03-23 23:19 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: aik, aik, anton, paulus, sparclinux, linuxppc-dev, David Miller

On (03/24/15 09:36), Benjamin Herrenschmidt wrote:
> 
>  - One pool only
> 
>  - Whenever the allocation is before the previous hint, do a flush, that
> should only happen if a wrap around occurred or in some cases if the
> device DMA mask forced it. I think we always update the hint whenever we
> successfully allocate from the small pools.

right, I believe this is close to what I discussed in the previous
thread, and approx what I have in patchv5, except that the flush 
indirection can be passed as a function pointer, or via the table_ops.

> 
>  - Deal with the largealloc case. That's the contentious issue, see
> below.
  :
> The largealloc issue is a different can of worms. We can try adding an
> option to disable the largealloc business completely (instead of hard
> wiring the "15", make that a variable and define that 0 means no
> largealloc pool).

What I've tried to do is to have a bool large_pool arg passed
to iommu_tbl_pool_init. In my observation (instrumented for scsi, ixgbe), 
we never allocate more than 4 pages at a time, so I pass in 
large_pool == false for all the sparc platforms. 

> Or we can decide that large allocs are rare (typically
> pci_alloc_consistent, ie, driver init time), and thus always flush on
> them (or rather on free of a large chunk). David, what's your take
> there ? I have a feeling that should work fine without a noticeable
> performance issue...
> 
> I would also keep a "dirty" flag set on any free and cleared on any
> flush to avoid more spurrious flushes, but here too the benefit might be
> in the noise.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
  2015-03-23 23:08                 ` Sowmini Varadhan
  (?)
@ 2015-03-23 23:29                 ` chase rayfield
  -1 siblings, 0 replies; 71+ messages in thread
From: chase rayfield @ 2015-03-23 23:29 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: aik, aik, anton, paulus, sparclinux, linuxppc-dev, David Miller

[-- Attachment #1: Type: text/plain, Size: 1813 bytes --]

On Mar 23, 2015 7:13 PM, "Sowmini Varadhan" <sowmini.varadhan@oracle.com>
wrote:
>
> On (03/24/15 09:21), Benjamin Herrenschmidt wrote:
> >
> > So we have two choices here that I can see:
> >
> >  - Keep that old platform use the old/simpler allocator
>
> Problem with that approach is that the base "struct iommu" structure
> for sparc gets a split personality: the older one is used with
> the older allocator, and other ugly things ensue.  (alternatively,
> you end up duplicating a version of the code with the flush_all
> inlined).
>
> >  - Try to regain the bulk of that benefit with the new one
> >
> > Sowmini, I see various options for the second choice. We could stick to
> > 1 pool, and basically do as before, ie, if we fail on the first pass of
> > alloc, it means we wrap around and do a flush, I don't think that will
> > cause a significant degradation from today, do you ? We might have an
> > occasional additional flush but I would expect it to be in the noise.
>
> Isn't this essentially what I have in patch v5 here:
> http://www.spinics.net/lists/sparclinux/msg13534.html
>
> (the ops->reset is the flushall indirection, can be renamed if the
> latter is preferred)
>
> > Dave, what's your feeling there ? Does anybody around still have some
> > HW that we can test with ?
>
> I actually tested this on a V440 and a ultra45 (had a heck of a
> time finding these, since the owners keep them turned off because
> they are too noisy and consume too much power :-).

So we need tests then more than hw?.... I have an ultra1, ultra10 and t2000
I can test on if needed. And I'd appreciate my caches not being flushed
excessively on these boxes :-) too. I'll see if I can't get Gentoo on the
u10 tonight.

Also... It would be more "green" to make the code run faster on these boxes
than otherwise!

[-- Attachment #2: Type: text/html, Size: 2358 bytes --]

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
  2015-03-23 23:08                 ` Sowmini Varadhan
@ 2015-03-24  0:47                   ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 71+ messages in thread
From: Benjamin Herrenschmidt @ 2015-03-24  0:47 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: aik, aik, anton, paulus, sparclinux, linuxppc-dev, David Miller

On Mon, 2015-03-23 at 19:08 -0400, Sowmini Varadhan wrote:

> > Sowmini, I see various options for the second choice. We could stick to
> > 1 pool, and basically do as before, ie, if we fail on the first pass of
> > alloc, it means we wrap around and do a flush, I don't think that will
> > cause a significant degradation from today, do you ? We might have an
> > occasional additional flush but I would expect it to be in the noise.
> 
> Isn't this essentially what I have in patch v5 here:
> http://www.spinics.net/lists/sparclinux/msg13534.html

Possibly, I haven't looked at it in details yet.

My point is that while we might flush a bit more often than the original
code, it should remain in the noise performance-wise and thus David
might rescind his objection, but we need to prove this with a
measurement.

> (the ops->reset is the flushall indirection, can be renamed if the
> latter is preferred)
> 
> > Dave, what's your feeling there ? Does anybody around still have some
> > HW that we can test with ?
> 
> I actually tested this on a V440 and a ultra45 (had a heck of a
> time finding these, since the owners keep them turned off because
> they are too noisy and consume too much power :-). Thus while I
> have no opinion, I would not shed any tears if we lost this extra
> perf-tweak in the interest of being earth-friendly  :-))
> 
> so testing it is not a problem, though I dont have any perf
> benchmarks for them either.

That's what we'll need unfortunately to confirm our "gut feeling". It
might be sufficient to add a flush counter and compare it between runs
if actual wall-clock benchmarks are too hard to do (especially if you
don't have things like very fast network cards at hand).

Number of flush / number of packets might be a sufficient metric, it
depends on whether it makes David happy :-)

> > Sowmini, I think we can still kill the ops and have a separate data
> > structure exclusively concerned by allocations by having the alloc
> > functions take the lazy flush function as an argument (which can be
> > NULL), I don't think we should bother with ops.
> 
> I dont quite follow what you have in mind? The caller would somehow
> have to specify a flush_all indirection for the legacy platforms 

Yes, pass a function pointer argument that can be NULL or just make it a
member of the iommu_allocator struct (or whatever you call it) passed to
the init function and that can be NULL. My point is we don't need a
separate "ops" structure.

> Also, you mention
> 
> > You must hold the lock until you do the flush, otherwise somebody
> > else might allocate the not-yet-flushed areas and try to use them...
> > kaboom. However if that's the only callback left, pass it as an
> > argument.
> 
> Passing  in a function pointer to the flushall to iommu_tbl_range_alloc
> would work, or we could pass it in as an arg to iommu_tbl_init,
> and stash it in the struct iommu_table.

Pass it to init and stash it in the table but don't call it
"iommu_table", let's use a name that conveys better the fact that this
is purely a DMA space allocator (to be embedded by the arch specific
iommu table). Something like iommu_alloc_zone or whatever you want to
call it. I keep finding new names whenever I think of it :-)

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
@ 2015-03-24  0:47                   ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 71+ messages in thread
From: Benjamin Herrenschmidt @ 2015-03-24  0:47 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: aik, aik, anton, paulus, sparclinux, linuxppc-dev, David Miller

On Mon, 2015-03-23 at 19:08 -0400, Sowmini Varadhan wrote:

> > Sowmini, I see various options for the second choice. We could stick to
> > 1 pool, and basically do as before, ie, if we fail on the first pass of
> > alloc, it means we wrap around and do a flush, I don't think that will
> > cause a significant degradation from today, do you ? We might have an
> > occasional additional flush but I would expect it to be in the noise.
> 
> Isn't this essentially what I have in patch v5 here:
> http://www.spinics.net/lists/sparclinux/msg13534.html

Possibly, I haven't looked at it in details yet.

My point is that while we might flush a bit more often than the original
code, it should remain in the noise performance-wise and thus David
might rescind his objection, but we need to prove this with a
measurement.

> (the ops->reset is the flushall indirection, can be renamed if the
> latter is preferred)
> 
> > Dave, what's your feeling there ? Does anybody around still have some
> > HW that we can test with ?
> 
> I actually tested this on a V440 and a ultra45 (had a heck of a
> time finding these, since the owners keep them turned off because
> they are too noisy and consume too much power :-). Thus while I
> have no opinion, I would not shed any tears if we lost this extra
> perf-tweak in the interest of being earth-friendly  :-))
> 
> so testing it is not a problem, though I dont have any perf
> benchmarks for them either.

That's what we'll need unfortunately to confirm our "gut feeling". It
might be sufficient to add a flush counter and compare it between runs
if actual wall-clock benchmarks are too hard to do (especially if you
don't have things like very fast network cards at hand).

Number of flush / number of packets might be a sufficient metric, it
depends on whether it makes David happy :-)

> > Sowmini, I think we can still kill the ops and have a separate data
> > structure exclusively concerned by allocations by having the alloc
> > functions take the lazy flush function as an argument (which can be
> > NULL), I don't think we should bother with ops.
> 
> I dont quite follow what you have in mind? The caller would somehow
> have to specify a flush_all indirection for the legacy platforms 

Yes, pass a function pointer argument that can be NULL or just make it a
member of the iommu_allocator struct (or whatever you call it) passed to
the init function and that can be NULL. My point is we don't need a
separate "ops" structure.

> Also, you mention
> 
> > You must hold the lock until you do the flush, otherwise somebody
> > else might allocate the not-yet-flushed areas and try to use them...
> > kaboom. However if that's the only callback left, pass it as an
> > argument.
> 
> Passing  in a function pointer to the flushall to iommu_tbl_range_alloc
> would work, or we could pass it in as an arg to iommu_tbl_init,
> and stash it in the struct iommu_table.

Pass it to init and stash it in the table but don't call it
"iommu_table", let's use a name that conveys better the fact that this
is purely a DMA space allocator (to be embedded by the arch specific
iommu table). Something like iommu_alloc_zone or whatever you want to
call it. I keep finding new names whenever I think of it :-)

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
  2015-03-23 23:19                 ` Sowmini Varadhan
@ 2015-03-24  0:48                   ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 71+ messages in thread
From: Benjamin Herrenschmidt @ 2015-03-24  0:48 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: aik, aik, anton, paulus, sparclinux, linuxppc-dev, David Miller

On Mon, 2015-03-23 at 19:19 -0400, Sowmini Varadhan wrote:

> What I've tried to do is to have a bool large_pool arg passed
> to iommu_tbl_pool_init. In my observation (instrumented for scsi, ixgbe), 
> we never allocate more than 4 pages at a time, so I pass in 
> large_pool = false for all the sparc platforms. 

But that might not be necessary. If indeed we very rarely use the large
pool, then just make it flush always. My feeling is that it will only
ever be used at driver init/remove time when allocating things like
descriptor rings, where the flush overhead dont' matter.

> > Or we can decide that large allocs are rare (typically
> > pci_alloc_consistent, ie, driver init time), and thus always flush on
> > them (or rather on free of a large chunk). David, what's your take
> > there ? I have a feeling that should work fine without a noticeable
> > performance issue...
> > 
> > I would also keep a "dirty" flag set on any free and cleared on any
> > flush to avoid more spurrious flushes, but here too the benefit might be
> > in the noise.

Ben.



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
@ 2015-03-24  0:48                   ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 71+ messages in thread
From: Benjamin Herrenschmidt @ 2015-03-24  0:48 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: aik, aik, anton, paulus, sparclinux, linuxppc-dev, David Miller

On Mon, 2015-03-23 at 19:19 -0400, Sowmini Varadhan wrote:

> What I've tried to do is to have a bool large_pool arg passed
> to iommu_tbl_pool_init. In my observation (instrumented for scsi, ixgbe), 
> we never allocate more than 4 pages at a time, so I pass in 
> large_pool == false for all the sparc platforms. 

But that might not be necessary. If indeed we very rarely use the large
pool, then just make it flush always. My feeling is that it will only
ever be used at driver init/remove time when allocating things like
descriptor rings, where the flush overhead dont' matter.

> > Or we can decide that large allocs are rare (typically
> > pci_alloc_consistent, ie, driver init time), and thus always flush on
> > them (or rather on free of a large chunk). David, what's your take
> > there ? I have a feeling that should work fine without a noticeable
> > performance issue...
> > 
> > I would also keep a "dirty" flag set on any free and cleared on any
> > flush to avoid more spurrious flushes, but here too the benefit might be
> > in the noise.

Ben.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
  2015-03-24  0:47                   ` Benjamin Herrenschmidt
@ 2015-03-24  1:11                     ` Sowmini Varadhan
  -1 siblings, 0 replies; 71+ messages in thread
From: Sowmini Varadhan @ 2015-03-24  1:11 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: aik, aik, anton, paulus, sparclinux, linuxppc-dev, David Miller

On (03/24/15 11:47), Benjamin Herrenschmidt wrote:
> 
> Yes, pass a function pointer argument that can be NULL or just make it a
> member of the iommu_allocator struct (or whatever you call it) passed to
> the init function and that can be NULL. My point is we don't need a
> separate "ops" structure.

Ok, I'll make this a function pointer to the iommu_tbl_pool_init()
function (tomorrow)

fwiw, the ops struct came out as a result of DaveM input, 
  http://www.spinics.net/lists/sparclinux/msg13232.html
albeit the original context is now moot.

> Pass it to init and stash it in the table but don't call it
> "iommu_table", let's use a name that conveys better the fact that this
> is purely a DMA space allocator (to be embedded by the arch specific
> iommu table). Something like iommu_alloc_zone or whatever you want to
> call it. I keep finding new names whenever I think of it :-)

please pick a name that you like as this mooshes all the patches,
and I myself really dont care what it gets called (rose? :-)) as long 
as fat-fingering-risk is minimized

Regarding the relationship with largepool,

> But that might not be necessary. If indeed we very rarely use the large
> pool, then just make it flush always. My feeling is that it will only
> ever be used at driver init/remove time when allocating things like
> descriptor rings, where the flush overhead dont' matter.

ok, will factor this in tomorrow (assuming DaveM has no objections
to anything proposed here, esp around the function pointer for flushall).

--Sowmini

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
@ 2015-03-24  1:11                     ` Sowmini Varadhan
  0 siblings, 0 replies; 71+ messages in thread
From: Sowmini Varadhan @ 2015-03-24  1:11 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: aik, aik, anton, paulus, sparclinux, linuxppc-dev, David Miller

On (03/24/15 11:47), Benjamin Herrenschmidt wrote:
> 
> Yes, pass a function pointer argument that can be NULL or just make it a
> member of the iommu_allocator struct (or whatever you call it) passed to
> the init function and that can be NULL. My point is we don't need a
> separate "ops" structure.

Ok, I'll make this a function pointer to the iommu_tbl_pool_init()
function (tomorrow)

fwiw, the ops struct came out as a result of DaveM input, 
  http://www.spinics.net/lists/sparclinux/msg13232.html
albeit the original context is now moot.

> Pass it to init and stash it in the table but don't call it
> "iommu_table", let's use a name that conveys better the fact that this
> is purely a DMA space allocator (to be embedded by the arch specific
> iommu table). Something like iommu_alloc_zone or whatever you want to
> call it. I keep finding new names whenever I think of it :-)

please pick a name that you like as this mooshes all the patches,
and I myself really dont care what it gets called (rose? :-)) as long 
as fat-fingering-risk is minimized

Regarding the relationship with largepool,

> But that might not be necessary. If indeed we very rarely use the large
> pool, then just make it flush always. My feeling is that it will only
> ever be used at driver init/remove time when allocating things like
> descriptor rings, where the flush overhead dont' matter.

ok, will factor this in tomorrow (assuming DaveM has no objections
to anything proposed here, esp around the function pointer for flushall).

--Sowmini

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
  2015-03-23 22:21               ` Benjamin Herrenschmidt
@ 2015-03-24  1:44                 ` David Miller
  -1 siblings, 0 replies; 71+ messages in thread
From: David Miller @ 2015-03-24  1:44 UTC (permalink / raw)
  To: benh; +Cc: aik, aik, sowmini.varadhan, anton, paulus, sparclinux, linuxppc-dev

From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Date: Tue, 24 Mar 2015 09:21:05 +1100

> Dave, what's your feeling there ? Does anybody around still have
> some HW that we can test with ?

I don't see what the actual problem is.

Even if you use multiple pools, which we should for scalability on
sun4u too, just do the flush when allocation in _any_ pool wraps
around.

That's still better than not doing the optimization at all.

That is always going to be correct, and you can use a separate
spinlock to make sure only one thread of control does the full
IOMMU flush at a time.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
@ 2015-03-24  1:44                 ` David Miller
  0 siblings, 0 replies; 71+ messages in thread
From: David Miller @ 2015-03-24  1:44 UTC (permalink / raw)
  To: benh; +Cc: aik, aik, sowmini.varadhan, anton, paulus, sparclinux, linuxppc-dev

From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Date: Tue, 24 Mar 2015 09:21:05 +1100

> Dave, what's your feeling there ? Does anybody around still have
> some HW that we can test with ?

I don't see what the actual problem is.

Even if you use multiple pools, which we should for scalability on
sun4u too, just do the flush when allocation in _any_ pool wraps
around.

That's still better than not doing the optimization at all.

That is always going to be correct, and you can use a separate
spinlock to make sure only one thread of control does the full
IOMMU flush at a time.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
  2015-03-24  1:44                 ` David Miller
@ 2015-03-24  1:57                   ` Sowmini Varadhan
  -1 siblings, 0 replies; 71+ messages in thread
From: Sowmini Varadhan @ 2015-03-24  1:57 UTC (permalink / raw)
  To: David Miller; +Cc: aik, aik, anton, paulus, sparclinux, linuxppc-dev

benh> It might be sufficient to add a flush counter and compare it between runs
benh> if actual wall-clock benchmarks are too hard to do (especially if you
benh> don't have things like very fast network cards at hand).
benh>
benh> Number of flush / number of packets might be a sufficient metric, it..

I was just going to say: I can add those counters tomorrow,
and get some stats, but seems like it doesn't really matter what 
the outcome is, because: 

On (03/23/15 21:44), David Miller wrote:
> 
> From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Date: Tue, 24 Mar 2015 09:21:05 +1100
> 
> > Dave, what's your feeling there ? Does anybody around still have
> > some HW that we can test with ?
> 
> I don't see what the actual problem is.
> 
> Even if you use multiple pools, which we should for scalability on
> sun4u too, just do the flush when allocation in _any_ pool wraps
> around.
> 
> That's still better than not doing the optimization at all.
> 
> That is always going to be correct, and you can use a separate
> spinlock to make sure only one thread of control does the full
> IOMMU flush at a time.

--Sowmini

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
@ 2015-03-24  1:57                   ` Sowmini Varadhan
  0 siblings, 0 replies; 71+ messages in thread
From: Sowmini Varadhan @ 2015-03-24  1:57 UTC (permalink / raw)
  To: David Miller; +Cc: aik, aik, anton, paulus, sparclinux, linuxppc-dev

benh> It might be sufficient to add a flush counter and compare it between runs
benh> if actual wall-clock benchmarks are too hard to do (especially if you
benh> don't have things like very fast network cards at hand).
benh>
benh> Number of flush / number of packets might be a sufficient metric, it..

I was just going to say: I can add those counters tomorrow,
and get some stats, but seems like it doesn't really matter what 
the outcome is, because: 

On (03/23/15 21:44), David Miller wrote:
> 
> From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Date: Tue, 24 Mar 2015 09:21:05 +1100
> 
> > Dave, what's your feeling there ? Does anybody around still have
> > some HW that we can test with ?
> 
> I don't see what the actual problem is.
> 
> Even if you use multiple pools, which we should for scalability on
> sun4u too, just do the flush when allocation in _any_ pool wraps
> around.
> 
> That's still better than not doing the optimization at all.
> 
> That is always going to be correct, and you can use a separate
> spinlock to make sure only one thread of control does the full
> IOMMU flush at a time.

--Sowmini

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
  2015-03-24  1:44                 ` David Miller
@ 2015-03-24  2:08                   ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 71+ messages in thread
From: Benjamin Herrenschmidt @ 2015-03-24  2:08 UTC (permalink / raw)
  To: David Miller
  Cc: aik, aik, sowmini.varadhan, anton, paulus, sparclinux, linuxppc-dev

On Mon, 2015-03-23 at 21:44 -0400, David Miller wrote:
> From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Date: Tue, 24 Mar 2015 09:21:05 +1100
> 
> > Dave, what's your feeling there ? Does anybody around still have
> > some HW that we can test with ?
> 
> I don't see what the actual problem is.
> 
> Even if you use multiple pools, which we should for scalability on
> sun4u too, just do the flush when allocation in _any_ pool wraps
> around.
> 
> That's still better than not doing the optimization at all.

I agree, I wasn't sure it was good enough for you, so was just putting
other options on the table.

> That is always going to be correct, and you can use a separate
> spinlock to make sure only one thread of control does the full
> IOMMU flush at a time.

That would have to be done inside the of the flush callback but I don't
see a big issue there.

For the large pool, we don't keep a hint so we don't know it's wrapped,
in fact we purposefully don't use a hint to limit fragmentation on it,
but then, it should be used rarely enough that flushing always is, I
suspect, a good option.

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
@ 2015-03-24  2:08                   ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 71+ messages in thread
From: Benjamin Herrenschmidt @ 2015-03-24  2:08 UTC (permalink / raw)
  To: David Miller
  Cc: aik, aik, sowmini.varadhan, anton, paulus, sparclinux, linuxppc-dev

On Mon, 2015-03-23 at 21:44 -0400, David Miller wrote:
> From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Date: Tue, 24 Mar 2015 09:21:05 +1100
> 
> > Dave, what's your feeling there ? Does anybody around still have
> > some HW that we can test with ?
> 
> I don't see what the actual problem is.
> 
> Even if you use multiple pools, which we should for scalability on
> sun4u too, just do the flush when allocation in _any_ pool wraps
> around.
> 
> That's still better than not doing the optimization at all.

I agree, I wasn't sure it was good enough for you, so was just putting
other options on the table.

> That is always going to be correct, and you can use a separate
> spinlock to make sure only one thread of control does the full
> IOMMU flush at a time.

That would have to be done inside the of the flush callback but I don't
see a big issue there.

For the large pool, we don't keep a hint so we don't know it's wrapped,
in fact we purposefully don't use a hint to limit fragmentation on it,
but then, it should be used rarely enough that flushing always is, I
suspect, a good option.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
  2015-03-24  2:08                   ` Benjamin Herrenschmidt
@ 2015-03-24  2:15                     ` David Miller
  -1 siblings, 0 replies; 71+ messages in thread
From: David Miller @ 2015-03-24  2:15 UTC (permalink / raw)
  To: benh; +Cc: aik, aik, sowmini.varadhan, anton, paulus, sparclinux, linuxppc-dev

From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Date: Tue, 24 Mar 2015 13:08:10 +1100

> For the large pool, we don't keep a hint so we don't know it's
> wrapped, in fact we purposefully don't use a hint to limit
> fragmentation on it, but then, it should be used rarely enough that
> flushing always is, I suspect, a good option.

I can't think of any use case where the largepool would be hit a lot
at all.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
@ 2015-03-24  2:15                     ` David Miller
  0 siblings, 0 replies; 71+ messages in thread
From: David Miller @ 2015-03-24  2:15 UTC (permalink / raw)
  To: benh; +Cc: aik, aik, sowmini.varadhan, anton, paulus, sparclinux, linuxppc-dev

From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Date: Tue, 24 Mar 2015 13:08:10 +1100

> For the large pool, we don't keep a hint so we don't know it's
> wrapped, in fact we purposefully don't use a hint to limit
> fragmentation on it, but then, it should be used rarely enough that
> flushing always is, I suspect, a good option.

I can't think of any use case where the largepool would be hit a lot
at all.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
  2015-03-24  2:15                     ` David Miller
@ 2015-03-26  0:43                       ` cascardo
  -1 siblings, 0 replies; 71+ messages in thread
From: cascardo @ 2015-03-26  0:43 UTC (permalink / raw)
  To: David Miller
  Cc: aik, aik, anton, paulus, sowmini.varadhan, sparclinux, linuxppc-dev

On Mon, Mar 23, 2015 at 10:15:08PM -0400, David Miller wrote:
> From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Date: Tue, 24 Mar 2015 13:08:10 +1100
> 
> > For the large pool, we don't keep a hint so we don't know it's
> > wrapped, in fact we purposefully don't use a hint to limit
> > fragmentation on it, but then, it should be used rarely enough that
> > flushing always is, I suspect, a good option.
> 
> I can't think of any use case where the largepool would be hit a lot
> at all.

Well, until recently, IOMMU_PAGE_SIZE was 4KiB on Power, so every time a
driver mapped a whole 64KiB page, it would hit the largepool.

I have been suspicious for some time that after Anton's work on the
pools, the large mappings optimization would throw away the benefit of
using the 4 pools, since some drivers would always hit the largepool.

Of course, drivers that map entire pages, when not buggy, are optimized
already to avoid calling dma_map all the time. I worked on that for
mlx4_en, and I would expect that its receive side would always hit the
largepool.

So, I decided to experiment and count the number of times that
largealloc is true versus false.

On the transmit side, or when using ICMP, I didn't notice many large
allocations with qlge or cxgb4.

However, when using large TCP send/recv (I used uperf with 64KB
writes/reads), I noticed that on the transmit side, largealloc is not
used, but on the receive side, cxgb4 almost only uses largealloc, while
qlge seems to have a 1/1 usage or largealloc/non-largealloc mappings.
When turning GRO off, that ratio is closer to 1/10, meaning there is
still some fair use of largealloc in that scenario.

I confess my experiments are not complete. I would like to test a couple
of other drivers as well, including mlx4_en and bnx2x, and test with
small packet sizes. I suspected that MTU size could make a difference,
but in the case of ICMP, with MTU 9000 and payload of 8000 bytes, I
didn't notice any significant hit of largepool with either qlge or
cxgb4.

Also, we need to keep in mind that IOMMU_PAGE_SIZE is now dynamic in the
latest code, with plans on using 64KiB in some situations, Alexey or Ben
should have more details.

But I believe that on the receive side, all drivers should map entire
pages, using some allocation strategy similar to mlx4_en, in order to
avoid DMA mapping all the time. Some believe that is bad for latency,
and prefer to call something like skb_alloc for every package received,
but I haven't seen any hard numbers, and I don't know why we couldn't
make such an allocator as good as using something like the SLAB/SLUB
allocator. Maybe there is a jitter problem, since the allocator has to
go out and get some new pages and map them, once in a while. But I don't
see why this would not be a problem with SLAB/SLUB as well. Calling
dma_map is even worse with the current implementation. It's just that
some architectures do no work at all when dma_map/unmap is called.

Hope that helps consider the best strategy for the DMA space allocation
as of now.

Regards.
Cascardo.


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
@ 2015-03-26  0:43                       ` cascardo
  0 siblings, 0 replies; 71+ messages in thread
From: cascardo @ 2015-03-26  0:43 UTC (permalink / raw)
  To: David Miller
  Cc: aik, aik, anton, paulus, sowmini.varadhan, sparclinux, linuxppc-dev

On Mon, Mar 23, 2015 at 10:15:08PM -0400, David Miller wrote:
> From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Date: Tue, 24 Mar 2015 13:08:10 +1100
> 
> > For the large pool, we don't keep a hint so we don't know it's
> > wrapped, in fact we purposefully don't use a hint to limit
> > fragmentation on it, but then, it should be used rarely enough that
> > flushing always is, I suspect, a good option.
> 
> I can't think of any use case where the largepool would be hit a lot
> at all.

Well, until recently, IOMMU_PAGE_SIZE was 4KiB on Power, so every time a
driver mapped a whole 64KiB page, it would hit the largepool.

I have been suspicious for some time that after Anton's work on the
pools, the large mappings optimization would throw away the benefit of
using the 4 pools, since some drivers would always hit the largepool.

Of course, drivers that map entire pages, when not buggy, are optimized
already to avoid calling dma_map all the time. I worked on that for
mlx4_en, and I would expect that its receive side would always hit the
largepool.

So, I decided to experiment and count the number of times that
largealloc is true versus false.

On the transmit side, or when using ICMP, I didn't notice many large
allocations with qlge or cxgb4.

However, when using large TCP send/recv (I used uperf with 64KB
writes/reads), I noticed that on the transmit side, largealloc is not
used, but on the receive side, cxgb4 almost only uses largealloc, while
qlge seems to have a 1/1 usage or largealloc/non-largealloc mappings.
When turning GRO off, that ratio is closer to 1/10, meaning there is
still some fair use of largealloc in that scenario.

I confess my experiments are not complete. I would like to test a couple
of other drivers as well, including mlx4_en and bnx2x, and test with
small packet sizes. I suspected that MTU size could make a difference,
but in the case of ICMP, with MTU 9000 and payload of 8000 bytes, I
didn't notice any significant hit of largepool with either qlge or
cxgb4.

Also, we need to keep in mind that IOMMU_PAGE_SIZE is now dynamic in the
latest code, with plans on using 64KiB in some situations, Alexey or Ben
should have more details.

But I believe that on the receive side, all drivers should map entire
pages, using some allocation strategy similar to mlx4_en, in order to
avoid DMA mapping all the time. Some believe that is bad for latency,
and prefer to call something like skb_alloc for every package received,
but I haven't seen any hard numbers, and I don't know why we couldn't
make such an allocator as good as using something like the SLAB/SLUB
allocator. Maybe there is a jitter problem, since the allocator has to
go out and get some new pages and map them, once in a while. But I don't
see why this would not be a problem with SLAB/SLUB as well. Calling
dma_map is even worse with the current implementation. It's just that
some architectures do no work at all when dma_map/unmap is called.

Hope that helps consider the best strategy for the DMA space allocation
as of now.

Regards.
Cascardo.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
  2015-03-26  0:43                       ` cascardo
@ 2015-03-26  0:49                         ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 71+ messages in thread
From: Benjamin Herrenschmidt @ 2015-03-26  0:49 UTC (permalink / raw)
  To: cascardo
  Cc: aik, aik, sowmini.varadhan, anton, paulus, sparclinux,
	linuxppc-dev, David Miller

On Wed, 2015-03-25 at 21:43 -0300, cascardo@linux.vnet.ibm.com wrote:
> On Mon, Mar 23, 2015 at 10:15:08PM -0400, David Miller wrote:
> > From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> > Date: Tue, 24 Mar 2015 13:08:10 +1100
> > 
> > > For the large pool, we don't keep a hint so we don't know it's
> > > wrapped, in fact we purposefully don't use a hint to limit
> > > fragmentation on it, but then, it should be used rarely enough that
> > > flushing always is, I suspect, a good option.
> > 
> > I can't think of any use case where the largepool would be hit a lot
> > at all.
> 
> Well, until recently, IOMMU_PAGE_SIZE was 4KiB on Power, so every time a
> driver mapped a whole 64KiB page, it would hit the largepool.

Yes but I was talking of sparc here ...

> I have been suspicious for some time that after Anton's work on the
> pools, the large mappings optimization would throw away the benefit of
> using the 4 pools, since some drivers would always hit the largepool.

Right, I was thinking we should change the test for large pool from > 15
to > (PAGE_SHIFT * n) where n is TBD by experimentation.

> Of course, drivers that map entire pages, when not buggy, are optimized
> already to avoid calling dma_map all the time. I worked on that for
> mlx4_en, and I would expect that its receive side would always hit the
> largepool.
> 
> So, I decided to experiment and count the number of times that
> largealloc is true versus false.
> 
> On the transmit side, or when using ICMP, I didn't notice many large
> allocations with qlge or cxgb4.
> 
> However, when using large TCP send/recv (I used uperf with 64KB
> writes/reads), I noticed that on the transmit side, largealloc is not
> used, but on the receive side, cxgb4 almost only uses largealloc, while
> qlge seems to have a 1/1 usage or largealloc/non-largealloc mappings.
> When turning GRO off, that ratio is closer to 1/10, meaning there is
> still some fair use of largealloc in that scenario.

What are the sizes involved ? Always just 64K ? Or more ? Maybe just
changing 15 to 16 in the test would be sufficient ? We should make the
threshole a parameter set at init time so archs/platforms can adjust it.

> I confess my experiments are not complete. I would like to test a couple
> of other drivers as well, including mlx4_en and bnx2x, and test with
> small packet sizes. I suspected that MTU size could make a difference,
> but in the case of ICMP, with MTU 9000 and payload of 8000 bytes, I
> didn't notice any significant hit of largepool with either qlge or
> cxgb4.
> 
> Also, we need to keep in mind that IOMMU_PAGE_SIZE is now dynamic in the
> latest code, with plans on using 64KiB in some situations, Alexey or Ben
> should have more details.

We still mostly use 4K afaik... We will use 64K in some KVM setups and I
do plan to switch to 64K under some circumstances when we can but we
have some limits imposed by PAPR under hypervisors here.

> But I believe that on the receive side, all drivers should map entire
> pages, using some allocation strategy similar to mlx4_en, in order to
> avoid DMA mapping all the time. Some believe that is bad for latency,
> and prefer to call something like skb_alloc for every package received,
> but I haven't seen any hard numbers, and I don't know why we couldn't
> make such an allocator as good as using something like the SLAB/SLUB
> allocator. Maybe there is a jitter problem, since the allocator has to
> go out and get some new pages and map them, once in a while. But I don't
> see why this would not be a problem with SLAB/SLUB as well. Calling
> dma_map is even worse with the current implementation. It's just that
> some architectures do no work at all when dma_map/unmap is called.
> 
> Hope that helps consider the best strategy for the DMA space allocation
> as of now.

In any case, I don't think Sparc has the same issue. At this point
that's all I care about, once we adapt powerpc to use the new code, we
can revisit that problem on our side.

Cheers,
Ben.

> Regards.
> Cascardo.



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
@ 2015-03-26  0:49                         ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 71+ messages in thread
From: Benjamin Herrenschmidt @ 2015-03-26  0:49 UTC (permalink / raw)
  To: cascardo
  Cc: aik, aik, sowmini.varadhan, anton, paulus, sparclinux,
	linuxppc-dev, David Miller

On Wed, 2015-03-25 at 21:43 -0300, cascardo@linux.vnet.ibm.com wrote:
> On Mon, Mar 23, 2015 at 10:15:08PM -0400, David Miller wrote:
> > From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> > Date: Tue, 24 Mar 2015 13:08:10 +1100
> > 
> > > For the large pool, we don't keep a hint so we don't know it's
> > > wrapped, in fact we purposefully don't use a hint to limit
> > > fragmentation on it, but then, it should be used rarely enough that
> > > flushing always is, I suspect, a good option.
> > 
> > I can't think of any use case where the largepool would be hit a lot
> > at all.
> 
> Well, until recently, IOMMU_PAGE_SIZE was 4KiB on Power, so every time a
> driver mapped a whole 64KiB page, it would hit the largepool.

Yes but I was talking of sparc here ...

> I have been suspicious for some time that after Anton's work on the
> pools, the large mappings optimization would throw away the benefit of
> using the 4 pools, since some drivers would always hit the largepool.

Right, I was thinking we should change the test for large pool from > 15
to > (PAGE_SHIFT * n) where n is TBD by experimentation.

> Of course, drivers that map entire pages, when not buggy, are optimized
> already to avoid calling dma_map all the time. I worked on that for
> mlx4_en, and I would expect that its receive side would always hit the
> largepool.
> 
> So, I decided to experiment and count the number of times that
> largealloc is true versus false.
> 
> On the transmit side, or when using ICMP, I didn't notice many large
> allocations with qlge or cxgb4.
> 
> However, when using large TCP send/recv (I used uperf with 64KB
> writes/reads), I noticed that on the transmit side, largealloc is not
> used, but on the receive side, cxgb4 almost only uses largealloc, while
> qlge seems to have a 1/1 usage or largealloc/non-largealloc mappings.
> When turning GRO off, that ratio is closer to 1/10, meaning there is
> still some fair use of largealloc in that scenario.

What are the sizes involved ? Always just 64K ? Or more ? Maybe just
changing 15 to 16 in the test would be sufficient ? We should make the
threshole a parameter set at init time so archs/platforms can adjust it.

> I confess my experiments are not complete. I would like to test a couple
> of other drivers as well, including mlx4_en and bnx2x, and test with
> small packet sizes. I suspected that MTU size could make a difference,
> but in the case of ICMP, with MTU 9000 and payload of 8000 bytes, I
> didn't notice any significant hit of largepool with either qlge or
> cxgb4.
> 
> Also, we need to keep in mind that IOMMU_PAGE_SIZE is now dynamic in the
> latest code, with plans on using 64KiB in some situations, Alexey or Ben
> should have more details.

We still mostly use 4K afaik... We will use 64K in some KVM setups and I
do plan to switch to 64K under some circumstances when we can but we
have some limits imposed by PAPR under hypervisors here.

> But I believe that on the receive side, all drivers should map entire
> pages, using some allocation strategy similar to mlx4_en, in order to
> avoid DMA mapping all the time. Some believe that is bad for latency,
> and prefer to call something like skb_alloc for every package received,
> but I haven't seen any hard numbers, and I don't know why we couldn't
> make such an allocator as good as using something like the SLAB/SLUB
> allocator. Maybe there is a jitter problem, since the allocator has to
> go out and get some new pages and map them, once in a while. But I don't
> see why this would not be a problem with SLAB/SLUB as well. Calling
> dma_map is even worse with the current implementation. It's just that
> some architectures do no work at all when dma_map/unmap is called.
> 
> Hope that helps consider the best strategy for the DMA space allocation
> as of now.

In any case, I don't think Sparc has the same issue. At this point
that's all I care about, once we adapt powerpc to use the new code, we
can revisit that problem on our side.

Cheers,
Ben.

> Regards.
> Cascardo.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
  2015-03-26  0:43                       ` cascardo
@ 2015-03-26 10:56                         ` Sowmini Varadhan
  -1 siblings, 0 replies; 71+ messages in thread
From: Sowmini Varadhan @ 2015-03-26 10:56 UTC (permalink / raw)
  To: cascardo; +Cc: aik, aik, anton, paulus, sparclinux, linuxppc-dev, David Miller

On (03/25/15 21:43), cascardo@linux.vnet.ibm.com wrote:
> However, when using large TCP send/recv (I used uperf with 64KB
> writes/reads), I noticed that on the transmit side, largealloc is not
> used, but on the receive side, cxgb4 almost only uses largealloc, while
> qlge seems to have a 1/1 usage or largealloc/non-largealloc mappings.
> When turning GRO off, that ratio is closer to 1/10, meaning there is
> still some fair use of largealloc in that scenario.
> 
> I confess my experiments are not complete. I would like to test a couple
> of other drivers as well, including mlx4_en and bnx2x, and test with
> small packet sizes. I suspected that MTU size could make a difference,
> but in the case of ICMP, with MTU 9000 and payload of 8000 bytes, I
> didn't notice any significant hit of largepool with either qlge or
> cxgb4.

I guess we also need to consider the "average use-case", i.e., 
something that interleaves small packets and interactive data with
jumbo/bulk data.. in those cases, the largepool would not get many
hits, and might actually be undesirable?

> But I believe that on the receive side, all drivers should map entire
> pages, using some allocation strategy similar to mlx4_en, in order to
> avoid DMA mapping all the time. 

good point. I think in the early phase of my perf investigation,
it was brought up that Solaris does pre-mapped DMA buffers (they
have to do this carefully, to avoid resource-starvation vulnerabilities-
see http://www.spinics.net/lists/sparclinux/msg13217.html
and threads leading to it..

This is  not something that the common iommu-arena allocator
can/should get involved in, of course. The scope of the arena-allocator
is much more rigorously defined.

I dont know if there is a way to set up a generalized DMA premapped
buffer infra for linux today.

fwiw, when I instrumented this for solaris (there are hooks to disable
the pre-mapped bufferes) the impact on a T5-2 (8 sockets, 2 numa nodes,
64 cpus) was not very significant for a single 10G ixgbe port- approx
8 Gbps instead of 9.X Gbps. I think the DMA buffer pre-mapping is
only significant when you start trying to scale to multiple ethernet 
ports.

--Sowmini


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
@ 2015-03-26 10:56                         ` Sowmini Varadhan
  0 siblings, 0 replies; 71+ messages in thread
From: Sowmini Varadhan @ 2015-03-26 10:56 UTC (permalink / raw)
  To: cascardo; +Cc: aik, aik, anton, paulus, sparclinux, linuxppc-dev, David Miller

On (03/25/15 21:43), cascardo@linux.vnet.ibm.com wrote:
> However, when using large TCP send/recv (I used uperf with 64KB
> writes/reads), I noticed that on the transmit side, largealloc is not
> used, but on the receive side, cxgb4 almost only uses largealloc, while
> qlge seems to have a 1/1 usage or largealloc/non-largealloc mappings.
> When turning GRO off, that ratio is closer to 1/10, meaning there is
> still some fair use of largealloc in that scenario.
> 
> I confess my experiments are not complete. I would like to test a couple
> of other drivers as well, including mlx4_en and bnx2x, and test with
> small packet sizes. I suspected that MTU size could make a difference,
> but in the case of ICMP, with MTU 9000 and payload of 8000 bytes, I
> didn't notice any significant hit of largepool with either qlge or
> cxgb4.

I guess we also need to consider the "average use-case", i.e., 
something that interleaves small packets and interactive data with
jumbo/bulk data.. in those cases, the largepool would not get many
hits, and might actually be undesirable?

> But I believe that on the receive side, all drivers should map entire
> pages, using some allocation strategy similar to mlx4_en, in order to
> avoid DMA mapping all the time. 

good point. I think in the early phase of my perf investigation,
it was brought up that Solaris does pre-mapped DMA buffers (they
have to do this carefully, to avoid resource-starvation vulnerabilities-
see http://www.spinics.net/lists/sparclinux/msg13217.html
and threads leading to it..

This is  not something that the common iommu-arena allocator
can/should get involved in, of course. The scope of the arena-allocator
is much more rigorously defined.

I dont know if there is a way to set up a generalized DMA premapped
buffer infra for linux today.

fwiw, when I instrumented this for solaris (there are hooks to disable
the pre-mapped bufferes) the impact on a T5-2 (8 sockets, 2 numa nodes,
64 cpus) was not very significant for a single 10G ixgbe port- approx
8 Gbps instead of 9.X Gbps. I think the DMA buffer pre-mapping is
only significant when you start trying to scale to multiple ethernet 
ports.

--Sowmini

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
  2015-03-26  0:43                       ` cascardo
@ 2015-03-26 23:00                         ` David Miller
  -1 siblings, 0 replies; 71+ messages in thread
From: David Miller @ 2015-03-26 22:51 UTC (permalink / raw)
  Cc: aik, aik, anton, paulus, sowmini.varadhan, sparclinux, linuxppc-dev

From: cascardo@linux.vnet.ibm.com
Date: Wed, 25 Mar 2015 21:43:42 -0300

> On Mon, Mar 23, 2015 at 10:15:08PM -0400, David Miller wrote:
>> From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
>> Date: Tue, 24 Mar 2015 13:08:10 +1100
>> 
>> > For the large pool, we don't keep a hint so we don't know it's
>> > wrapped, in fact we purposefully don't use a hint to limit
>> > fragmentation on it, but then, it should be used rarely enough that
>> > flushing always is, I suspect, a good option.
>> 
>> I can't think of any use case where the largepool would be hit a lot
>> at all.
> 
> Well, until recently, IOMMU_PAGE_SIZE was 4KiB on Power, so every time a
> driver mapped a whole 64KiB page, it would hit the largepool.

We don't plan to ever use 64KB page size on sparc64, so I think we're
safe there.

There are many reasons using 64KB pages is really stupid, what you
see here with the IOMMU stuff is just one of many symptoms, but I bet
your powerpc guys are kind of sick of hearing it by now... :-)


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
@ 2015-03-26 23:00                         ` David Miller
  0 siblings, 0 replies; 71+ messages in thread
From: David Miller @ 2015-03-26 23:00 UTC (permalink / raw)
  Cc: aik, aik, anton, paulus, sowmini.varadhan, sparclinux, linuxppc-dev

From: cascardo@linux.vnet.ibm.com
Date: Wed, 25 Mar 2015 21:43:42 -0300

> On Mon, Mar 23, 2015 at 10:15:08PM -0400, David Miller wrote:
>> From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
>> Date: Tue, 24 Mar 2015 13:08:10 +1100
>> 
>> > For the large pool, we don't keep a hint so we don't know it's
>> > wrapped, in fact we purposefully don't use a hint to limit
>> > fragmentation on it, but then, it should be used rarely enough that
>> > flushing always is, I suspect, a good option.
>> 
>> I can't think of any use case where the largepool would be hit a lot
>> at all.
> 
> Well, until recently, IOMMU_PAGE_SIZE was 4KiB on Power, so every time a
> driver mapped a whole 64KiB page, it would hit the largepool.

We don't plan to ever use 64KB page size on sparc64, so I think we're
safe there.

There are many reasons using 64KB pages is really stupid, what you
see here with the IOMMU stuff is just one of many symptoms, but I bet
your powerpc guys are kind of sick of hearing it by now... :-)

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
  2015-03-26 23:00                         ` David Miller
@ 2015-03-26 23:51                           ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 71+ messages in thread
From: Benjamin Herrenschmidt @ 2015-03-26 23:51 UTC (permalink / raw)
  To: David Miller
  Cc: aik, aik, sowmini.varadhan, anton, paulus, sparclinux, linuxppc-dev

On Thu, 2015-03-26 at 16:00 -0700, David Miller wrote:
> From: cascardo@linux.vnet.ibm.com
> Date: Wed, 25 Mar 2015 21:43:42 -0300
> 
> > On Mon, Mar 23, 2015 at 10:15:08PM -0400, David Miller wrote:
> >> From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> >> Date: Tue, 24 Mar 2015 13:08:10 +1100
> >> 
> >> > For the large pool, we don't keep a hint so we don't know it's
> >> > wrapped, in fact we purposefully don't use a hint to limit
> >> > fragmentation on it, but then, it should be used rarely enough that
> >> > flushing always is, I suspect, a good option.
> >> 
> >> I can't think of any use case where the largepool would be hit a lot
> >> at all.
> > 
> > Well, until recently, IOMMU_PAGE_SIZE was 4KiB on Power, so every time a
> > driver mapped a whole 64KiB page, it would hit the largepool.
> 
> We don't plan to ever use 64KB page size on sparc64, so I think we're
> safe there.
> 
> There are many reasons using 64KB pages is really stupid, what you
> see here with the IOMMU stuff is just one of many symptoms, but I bet
> your powerpc guys are kind of sick of hearing it by now... :-)

Lalalala :-)

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Generic IOMMU pooled allocator
@ 2015-03-26 23:51                           ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 71+ messages in thread
From: Benjamin Herrenschmidt @ 2015-03-26 23:51 UTC (permalink / raw)
  To: David Miller
  Cc: aik, aik, sowmini.varadhan, anton, paulus, sparclinux, linuxppc-dev

On Thu, 2015-03-26 at 16:00 -0700, David Miller wrote:
> From: cascardo@linux.vnet.ibm.com
> Date: Wed, 25 Mar 2015 21:43:42 -0300
> 
> > On Mon, Mar 23, 2015 at 10:15:08PM -0400, David Miller wrote:
> >> From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> >> Date: Tue, 24 Mar 2015 13:08:10 +1100
> >> 
> >> > For the large pool, we don't keep a hint so we don't know it's
> >> > wrapped, in fact we purposefully don't use a hint to limit
> >> > fragmentation on it, but then, it should be used rarely enough that
> >> > flushing always is, I suspect, a good option.
> >> 
> >> I can't think of any use case where the largepool would be hit a lot
> >> at all.
> > 
> > Well, until recently, IOMMU_PAGE_SIZE was 4KiB on Power, so every time a
> > driver mapped a whole 64KiB page, it would hit the largepool.
> 
> We don't plan to ever use 64KB page size on sparc64, so I think we're
> safe there.
> 
> There are many reasons using 64KB pages is really stupid, what you
> see here with the IOMMU stuff is just one of many symptoms, but I bet
> your powerpc guys are kind of sick of hearing it by now... :-)

Lalalala :-)

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 71+ messages in thread

end of thread, other threads:[~2015-03-26 23:51 UTC | newest]

Thread overview: 71+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-19  2:25 Generic IOMMU pooled allocator David Miller
2015-03-19  2:25 ` David Miller
2015-03-19  2:46 ` Benjamin Herrenschmidt
2015-03-19  2:46   ` Benjamin Herrenschmidt
2015-03-19  2:50   ` David Miller
2015-03-19  2:50     ` David Miller
2015-03-19  3:01 ` Benjamin Herrenschmidt
2015-03-19  3:01   ` Benjamin Herrenschmidt
2015-03-19  5:27   ` Alexey Kardashevskiy
2015-03-19  5:27     ` Alexey Kardashevskiy
2015-03-19 13:34     ` Sowmini Varadhan
2015-03-19 13:34       ` Sowmini Varadhan
2015-03-22 19:27     ` Sowmini Varadhan
2015-03-22 19:27       ` Sowmini Varadhan
2015-03-23 16:29       ` David Miller
2015-03-23 16:29         ` David Miller
2015-03-23 16:54         ` Sowmini Varadhan
2015-03-23 16:54           ` Sowmini Varadhan
2015-03-23 19:05           ` David Miller
2015-03-23 19:05             ` David Miller
2015-03-23 19:09             ` Sowmini Varadhan
2015-03-23 19:09               ` Sowmini Varadhan
2015-03-23 22:21             ` Benjamin Herrenschmidt
2015-03-23 22:21               ` Benjamin Herrenschmidt
2015-03-23 23:08               ` Sowmini Varadhan
2015-03-23 23:08                 ` Sowmini Varadhan
2015-03-23 23:29                 ` chase rayfield
2015-03-24  0:47                 ` Benjamin Herrenschmidt
2015-03-24  0:47                   ` Benjamin Herrenschmidt
2015-03-24  1:11                   ` Sowmini Varadhan
2015-03-24  1:11                     ` Sowmini Varadhan
2015-03-24  1:44               ` David Miller
2015-03-24  1:44                 ` David Miller
2015-03-24  1:57                 ` Sowmini Varadhan
2015-03-24  1:57                   ` Sowmini Varadhan
2015-03-24  2:08                 ` Benjamin Herrenschmidt
2015-03-24  2:08                   ` Benjamin Herrenschmidt
2015-03-24  2:15                   ` David Miller
2015-03-24  2:15                     ` David Miller
2015-03-26  0:43                     ` cascardo
2015-03-26  0:43                       ` cascardo
2015-03-26  0:49                       ` Benjamin Herrenschmidt
2015-03-26  0:49                         ` Benjamin Herrenschmidt
2015-03-26 10:56                       ` Sowmini Varadhan
2015-03-26 10:56                         ` Sowmini Varadhan
2015-03-26 22:51                       ` David Miller
2015-03-26 23:00                         ` David Miller
2015-03-26 23:51                         ` Benjamin Herrenschmidt
2015-03-26 23:51                           ` Benjamin Herrenschmidt
2015-03-23 22:36             ` Benjamin Herrenschmidt
2015-03-23 22:36               ` Benjamin Herrenschmidt
2015-03-23 23:19               ` Sowmini Varadhan
2015-03-23 23:19                 ` Sowmini Varadhan
2015-03-24  0:48                 ` Benjamin Herrenschmidt
2015-03-24  0:48                   ` Benjamin Herrenschmidt
2015-03-23 22:25           ` Benjamin Herrenschmidt
2015-03-23 22:25             ` Benjamin Herrenschmidt
2015-03-22 19:36 ` Arnd Bergmann
2015-03-22 19:36   ` Arnd Bergmann
2015-03-22 22:02   ` Benjamin Herrenschmidt
2015-03-22 22:02     ` Benjamin Herrenschmidt
2015-03-22 22:07     ` Sowmini Varadhan
2015-03-22 22:07       ` Sowmini Varadhan
2015-03-22 22:22       ` Benjamin Herrenschmidt
2015-03-22 22:22         ` Benjamin Herrenschmidt
2015-03-23  6:04         ` Arnd Bergmann
2015-03-23  6:04           ` Arnd Bergmann
2015-03-23 11:04           ` Benjamin Herrenschmidt
2015-03-23 11:04             ` Benjamin Herrenschmidt
2015-03-23 18:45             ` Arnd Bergmann
2015-03-23 18:45               ` Arnd Bergmann

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.