* Generic IOMMU pooled allocator
@ 2015-03-19  2:25 ` David Miller  (0 siblings, 0 replies; 71+ messages in thread)
From: David Miller @ 2015-03-19 2:25 UTC (permalink / raw)
To: linuxppc-dev; +Cc: sparclinux, paulus, sowmini.varadhan

PowerPC folks, we're trying to kill the locking contention in our
IOMMU allocators and noticed that you guys have a nice solution to
this in your IOMMU code.

Sowmini put together a patch series that tries to extract out the
generic parts of your code and place it in lib/iommu-common.c so
that both Sparc and PowerPC can make use of it.

The real test is if powerpc can be converted to use it.

So if you guys could please take a look at the current version of
this series at:

http://patchwork.ozlabs.org/patch/449712/
http://patchwork.ozlabs.org/patch/449713/
http://patchwork.ozlabs.org/patch/449715/

and give us some feedback that would be great.

After we sort out making this work properly for both architectures we
can figure out who merges which parts via what tree(s).

Thanks!
* Re: Generic IOMMU pooled allocator
@ 2015-03-19  2:46 ` Benjamin Herrenschmidt
From: Benjamin Herrenschmidt @ 2015-03-19 2:46 UTC (permalink / raw)
To: David Miller
Cc: sowmini.varadhan, Anton Blanchard, paulus, sparclinux, linuxppc-dev

On Wed, 2015-03-18 at 22:25 -0400, David Miller wrote:
> PowerPC folks, we're trying to kill the locking contention in our
> IOMMU allocators and noticed that you guys have a nice solution to
> this in your IOMMU code.
>
> Sowmini put together a patch series that tries to extract out the
> generic parts of your code and place it in lib/iommu-common.c so
> that both Sparc and PowerPC can make use of it.
>
> The real test is if powerpc can be converted to use it.
>
> So if you guys could please take a look at the current version of
> this series at:
>
> http://patchwork.ozlabs.org/patch/449712/
> http://patchwork.ozlabs.org/patch/449713/
> http://patchwork.ozlabs.org/patch/449715/
>
> and give us some feedback that would be great.
>
> After we sort out making this work properly for both architectures we
> can figure out who merges which parts via what tree(s).

Sounds like a good idea! CC'ing Anton who wrote the pool stuff. I'll
try to find somebody to work on that here & will let you know asap.

Cheers,
Ben.
* Re: Generic IOMMU pooled allocator
@ 2015-03-19  2:50 ` David Miller
From: David Miller @ 2015-03-19 2:50 UTC (permalink / raw)
To: benh; +Cc: sowmini.varadhan, anton, paulus, sparclinux, linuxppc-dev

From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Date: Thu, 19 Mar 2015 13:46:15 +1100

> Sounds like a good idea ! CC'ing Anton who wrote the pool stuff. I'll
> try to find somebody to work on that here & will let you know asap.

Thanks a lot Ben.
* Re: Generic IOMMU pooled allocator
@ 2015-03-19  3:01 ` Benjamin Herrenschmidt
From: Benjamin Herrenschmidt @ 2015-03-19 3:01 UTC (permalink / raw)
To: David Miller
Cc: Alexey Kardashevskiy, sowmini.varadhan, Anton Blanchard, paulus, sparclinux, linuxppc-dev

On Wed, 2015-03-18 at 22:25 -0400, David Miller wrote:
> PowerPC folks, we're trying to kill the locking contention in our
> IOMMU allocators and noticed that you guys have a nice solution to
> this in your IOMMU code.

.../...

Adding Alexey too, who is currently doing some changes to our iommu
code to deal with KVM.

One thing I noticed is the asymmetry in your code between the alloc
and the free path. The alloc path is similar to ours in that the lock
covers the allocation and that's about it; there's no actual mapping
to the HW done, it's done by the caller level, right?

The free path however, in your case, takes the lock and calls back into
"demap" (which I assume is what removes the translation from the HW)
with the lock held. There's also some mapping between cookies and
indices which here too isn't exposed to the alloc side but is exposed
to the free side.

I'm sure there's a rationale for that, but it would be good if you could
explain it and document the semantics of those three callbacks in the
iommu ops.

One thing that Alexey is doing on our side is to move some of the hooks
to manipulate the underlying TCEs (ie. iommu PTEs) from our global
ppc_md data structure to a new iommu_table_ops, so your patches will
definitely collide with our current work, and we'll have to figure out
how things can be made to match. We might be able to move more than
just the allocator to the generic code, and the whole implementation of
map_sg/unmap_sg if we have the right set of "ops", unless you see a
reason why that wouldn't work for you?

We also need to add some additional platform specific fields to certain
iommu table instances to deal with some KVM related tracking of pinned
"DMAble" memory; here too we might have to be creative and possibly
enclose the generic iommu_table in a platform specific variant.

Alexey, any other comment?

Cheers,
Ben.
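[For readers following the thread: the free-path asymmetry Ben describes can be sketched as a toy user-space model. This is not the actual lib/iommu-common.c code from the patches linked above; the names (iommu_tbl_ops, demap, cookie_to_index, reset) come from the discussion, and the signatures are guesses.]

```c
#include <assert.h>
#include <pthread.h>

/* Guessed shape of the three callbacks under discussion. */
struct iommu_tbl_ops {
    void (*demap)(void *dev, unsigned long entry);          /* remove HW translation */
    unsigned long (*cookie_to_index)(unsigned long cookie); /* cookie -> bitmap slot */
    void (*reset)(void *dev);                               /* e.g. sun4u flushall */
};

struct iommu_pool {
    pthread_mutex_t lock;
    unsigned long map;      /* toy 64-entry allocation bitmap */
    const struct iommu_tbl_ops *ops;
};

/* Free path: cookie_to_index() and demap() are invoked with the pool
 * lock held -- the asymmetry noted above, since the alloc path leaves
 * the HW mapping to the caller, outside the lock. */
static void iommu_tbl_range_free(struct iommu_pool *pool, void *dev,
                                 unsigned long cookie)
{
    pthread_mutex_lock(&pool->lock);
    unsigned long entry = pool->ops->cookie_to_index(cookie);
    if (pool->ops->demap)
        pool->ops->demap(dev, entry);   /* HW teardown under the lock */
    pool->map &= ~(1UL << entry);
    pthread_mutex_unlock(&pool->lock);
}

/* Sample callbacks, for demonstration only. */
static int demap_calls;
static void sample_demap(void *dev, unsigned long entry)
{
    (void)dev; (void)entry;
    demap_calls++;
}
static unsigned long sample_cookie_to_index(unsigned long cookie)
{
    return cookie >> 13;    /* pretend cookie == index << IO_PAGE_SHIFT */
}
static const struct iommu_tbl_ops sample_ops = {
    .demap = sample_demap,
    .cookie_to_index = sample_cookie_to_index,
    .reset = NULL,
};
```

[In this layout the alloc side programs the HW itself after the range allocator returns, while the free side funnels HW teardown through ->demap under the lock, which is exactly the asymmetry being questioned.]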
* Re: Generic IOMMU pooled allocator
@ 2015-03-19  5:27 ` Alexey Kardashevskiy
From: Alexey Kardashevskiy @ 2015-03-19 5:27 UTC (permalink / raw)
To: David Miller
Cc: Alexey Kardashevskiy, Anton Blanchard, paulus, sowmini.varadhan, sparclinux, linuxppc-dev

On 03/19/2015 02:01 PM, Benjamin Herrenschmidt wrote:
> On Wed, 2015-03-18 at 22:25 -0400, David Miller wrote:
>> PowerPC folks, we're trying to kill the locking contention in our
>> IOMMU allocators and noticed that you guys have a nice solution to
>> this in your IOMMU code.
>
> .../...
>
> Adding Alexey too, who is currently doing some changes to our iommu
> code to deal with KVM.
>
> One thing I noticed is the asymmetry in your code between the alloc
> and the free path. The alloc path is similar to ours in that the lock
> covers the allocation and that's about it; there's no actual mapping
> to the HW done, it's done by the caller level, right?
>
> The free path however, in your case, takes the lock and calls back into
> "demap" (which I assume is what removes the translation from the HW)
> with the lock held. There's also some mapping between cookies and
> indices which here too isn't exposed to the alloc side but is exposed
> to the free side.
>
> I'm sure there's a rationale for that, but it would be good if you could
> explain it and document the semantics of those three callbacks in the
> iommu ops.
>
> One thing that Alexey is doing on our side is to move some of the hooks
> to manipulate the underlying TCEs (ie. iommu PTEs) from our global
> ppc_md data structure to a new iommu_table_ops, so your patches will
> definitely collide with our current work, and we'll have to figure out
> how things can be made to match. We might be able to move more than
> just the allocator to the generic code, and the whole implementation of
> map_sg/unmap_sg if we have the right set of "ops", unless you see a
> reason why that wouldn't work for you?
>
> We also need to add some additional platform specific fields to certain
> iommu table instances to deal with some KVM related tracking of pinned
> "DMAble" memory; here too we might have to be creative and possibly
> enclose the generic iommu_table in a platform specific variant.
>
> Alexey, any other comment?

Agree about the missing symmetry. In general, I would call it a "zoned
pool-locked memory allocator"-ish thing rather than an iommu_table, and
have no callbacks there.

The iommu_tbl_range_free() caller could call cookie_to_index(), and what
the reset() callback does here I do not understand - some documentation
would help. Also, demap() does not have to be executed under the lock
(as map() is not executed under the lock).

btw, why demap, not unmap? :)

--
Alexey
* Re: Generic IOMMU pooled allocator
@ 2015-03-19 13:34 ` Sowmini Varadhan
From: Sowmini Varadhan @ 2015-03-19 13:34 UTC (permalink / raw)
To: Alexey Kardashevskiy
Cc: Alexey Kardashevskiy, Anton Blanchard, paulus, sowmini.varadhan, sparclinux, linuxppc-dev, David Miller

On 03/19/2015 02:01 PM, Benjamin Herrenschmidt wrote:

Ben> One thing I noticed is the asymmetry in your code between the alloc
Ben> and the free path. The alloc path is similar to us in that the lock
Ben> covers the allocation and that's about it, there's no actual mapping to
Ben> the HW done, it's done by the caller level right ?

yes, the only constraint is that the h/w alloc transaction should be
done after the arena-alloc, whereas for the unmap, the h/w transaction
should happen first, and the arena-unmap should happen after.

Ben> The free path however, in your case, takes the lock and calls back into
Ben> "demap" (which I assume is what removes the translation from the HW)
Ben> with the lock held. There's also some mapping between cookies
Ben> and index which here too isn't exposed to the alloc side but is
Ben> exposed to the free side.

Regarding the ->demap indirection: somewhere between V1 and V2, I
realized that, at least for sun4v, it's not necessary to hold the pool
lock when doing the unmap (V1 had originally passed this as the ->demap).
Revisiting the LDC change, I think that even that has no pool-specific
info that needs to be passed in, so possibly the ->demap is not required
at all? I can remove that and re-verify the LDC code (though I might not
be able to get to this till early next week, as I'm travelling at the
moment).

About the cookie_to_index, that came out of observation of the LDC code
(ldc_cookie_to_index in patchset 3). In the LDC case, the workflow is
approximately

  base = alloc_npages(..);  /* calls iommu_tbl_range_alloc */
  /* set up cookie_state using base */
  /* populate cookies calling fill_cookies() -> make_cookie() */

The make_cookie() is the inverse operation of cookie_to_index() (afaict;
the code is not very well commented, I'm afraid), but I need that
indirection to figure out which bitmap to clear. I don't know if there's
a better way to do this, or if the ->cookie_to_index can get more
complex for other IOMMU users.

Ben> One thing that Alexey is doing on our side is to move some of the
Ben> hooks to manipulate the underlying TCEs (ie. iommu PTEs) from our
Ben> global ppc_md. data structure to a new iommu_table_ops, so your patches
Ben> will definitely collide with our current work so we'll have to figure
Ben> out how things can made to match. We might be able to move more than
Ben> just the allocator to the generic code, and the whole implementation of
Ben> map_sg/unmap_sg if we have the right set of "ops", unless you see a
Ben> reason why that wouldn't work for you ?

I can't think of why that won't work, though it would help to see the
patch itself..

Ben> We also need to add some additional platform specific fields to certain
Ben> iommu table instances to deal with some KVM related tracking of pinned
Ben> "DMAble" memory, here too we might have to be creative and possibly
Ben> enclose the generic iommu_table in a platform specific variant.
Ben>
Ben> Alexey, any other comment ?

On (03/19/15 16:27), Alexey Kardashevskiy wrote:
Alexey>
Alexey> Agree about missing symmetry. In general, I would call it "zoned
Alexey> pool-locked memory allocator"-ish rather than iommu_table and have
Alexey> no callbacks there.
Alexey>
Alexey> The iommu_tbl_range_free() caller could call cookie_to_index() and

Problem is that tbl_range_free itself needs the `entry' from
->cookie_to_index.. don't know if there's a way to move the code to
avoid that..

Alexey> what the reset() callback does here - I do not understand, some

The ->reset callback came out of the sun4u use-case. Davem might have
more history here than I do, but my understanding is that the
iommu_flushall() was needed on the older sun4u architectures, where
there was no intermediating HV?

Alexey> documentation would help here, and demap() does not have to be
Alexey> executed under the lock (as map() is not executed under the lock).
Alexey>
Alexey> btw why demap, not unmap? :)

Maybe neither is needed, as you folks made me realize above.

--Sowmini
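[To make the inverse relationship above concrete, here is a tiny sketch of a make_cookie()/cookie_to_index() pair. The field layout (index shifted by a page shift, size code in the top nibble) is invented for illustration; the real LDC cookie format lives in arch/sparc/kernel/ldc.c and differs.]

```c
#include <assert.h>

#define TOY_PAGE_SHIFT   13  /* assumed 8K IOMMU page, illustration only */
#define TOY_SZCODE_SHIFT 60  /* assumed: page-size code in the top nibble */

/* Build a cookie from a bitmap index plus a page-size code. */
static unsigned long make_cookie(unsigned long index, unsigned long szcode)
{
    return (szcode << TOY_SZCODE_SHIFT) | (index << TOY_PAGE_SHIFT);
}

/* Recover the bitmap index on the free path: this must invert
 * make_cookie() exactly, because the index is the only thing that tells
 * us which bit in the arena bitmap to clear. */
static unsigned long cookie_to_index(unsigned long cookie)
{
    return (cookie & ((1UL << TOY_SZCODE_SHIFT) - 1)) >> TOY_PAGE_SHIFT;
}
```

[Whether this round-trip stays this simple for other IOMMU users is exactly the open question in the message above.]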
* Re: Generic IOMMU pooled allocator
@ 2015-03-22 19:27 ` Sowmini Varadhan
From: Sowmini Varadhan @ 2015-03-22 19:27 UTC (permalink / raw)
To: Alexey Kardashevskiy
Cc: Alexey Kardashevskiy, Anton Blanchard, paulus, sparclinux, linuxppc-dev, David Miller

Turned out that I was able to iterate over it, and remove both the
->cookie_to_index and the ->demap indirection from iommu_tbl_ops.

That leaves only the odd iommu_flushall() hook. I'm trying to find the
history behind that (needed for sun4u platforms, afaik, and not sure if
there are other ways to achieve this).

Just sent out a new patch-set, which should reach all the people/lists
on this thread. If it's possible, it would also help to share any
preliminary code-samples with the iommu_table_ops you folks have in
mind.

Please take a look at patchv5, and share your thoughts.

--Sowmini
* Re: Generic IOMMU pooled allocator
@ 2015-03-23 16:29 ` David Miller
From: David Miller @ 2015-03-23 16:29 UTC (permalink / raw)
To: sowmini.varadhan; +Cc: aik, aik, anton, paulus, sparclinux, linuxppc-dev

From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Date: Sun, 22 Mar 2015 15:27:26 -0400

> That leaves only the odd iommu_flushall() hook, I'm trying
> to find the history behind that (needed for sun4u platforms,
> afaik, and not sure if there are other ways to achieve this).

In order to elide the IOMMU flush as much as possible, I implemented a
scheme for sun4u wherein we always allocated from low IOMMU addresses
to high IOMMU addresses.

In this regime, we only need to flush the IOMMU when we rolled over
back to low IOMMU addresses during an allocation.

It made a noticeable difference in performance.

Unfortunately, with sun4v and the hypervisor, I'm not allowed to
control when the IOMMU flush happens; it has to occur on every single
IOMMU mapping change. So this optimization was no longer possible
there.

Anyways, that's the history behind it.
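[The rollover scheme described here can be modeled in a few lines. This toy deliberately ignores the bitmap of live mappings and frees; it only shows the single flush point, on wrap-around, that the optimization hinges on. Names are illustrative, not the actual arch/sparc code.]

```c
#include <assert.h>

#define ARENA_SIZE 8   /* toy arena; real tables are far larger */

static int flush_count;
static void iommu_flushall(void)    /* stands in for the HW flush */
{
    flush_count++;
}

struct toy_arena {
    unsigned long hint;  /* next entry to hand out, always moving upward */
};

/* Hand out entries in ascending order; flush only when the hint wraps
 * back to the low addresses, where stale translations may linger. */
static unsigned long toy_arena_alloc(struct toy_arena *a)
{
    if (a->hint >= ARENA_SIZE) {
        a->hint = 0;
        iommu_flushall();   /* the single flush point in this regime */
    }
    return a->hint++;
}
```

[On sun4v this trick is unavailable because the hypervisor flushes on every mapping change, which is why the hook survives only for the sun4u case.]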
* Re: Generic IOMMU pooled allocator
@ 2015-03-23 16:54 ` Sowmini Varadhan
From: Sowmini Varadhan @ 2015-03-23 16:54 UTC (permalink / raw)
To: David Miller; +Cc: aik, aik, anton, paulus, sparclinux, linuxppc-dev

On (03/23/15 12:29), David Miller wrote:
>
> In order to elide the IOMMU flush as much as possible, I implemented
> a scheme for sun4u wherein we always allocated from low IOMMU
> addresses to high IOMMU addresses.
>
> In this regime, we only need to flush the IOMMU when we rolled over
> back to low IOMMU addresses during an allocation.
>
> It made a noticeable difference in performance.
>
> Unfortunately, with sun4v and the hypervisor, I'm not allowed to
> control when the IOMMU flush happens, it has to occur on every
> single IOMMU mapping change. So this optimization was no longer
> possible there.
>
> Anyways, that's the history behind it.

I see. If it was only an optimization (i.e., removing it would not
break any functionality), and if this was done for older hardware, and
*if* we believe that the direction of most architectures is to follow
the sun4v/HV model, then, given that the sun4u code only uses 1 arena
pool anyway, one thought that I have for refactoring this is the
following:

- the caller of iommu_tbl_range_alloc() can do the flush_all if they
  see start <= end for the one single pool

- lose the other ->flush_all invocation (i.e., the one that is done
  when iommu_area_alloc() fails for pass == 0, and we reset start to 0
  to roll back)

that would avoid the need for any iommu_tbl_ops in my patch-set. But it
would imply that you would still take the perf hit for the roll-back if
we failed the pass == 0 iteration through iommu_area_alloc(). Perhaps
this is an acceptable compromise in favor of cleaner code (again,
assuming that current/future archs will all follow the HV based
design).

--Sowmini
* Re: Generic IOMMU pooled allocator 2015-03-23 16:54 ` Sowmini Varadhan @ 2015-03-23 19:05 ` David Miller -1 siblings, 0 replies; 71+ messages in thread From: David Miller @ 2015-03-23 19:05 UTC (permalink / raw) To: sowmini.varadhan; +Cc: aik, aik, anton, paulus, sparclinux, linuxppc-dev From: Sowmini Varadhan <sowmini.varadhan@oracle.com> Date: Mon, 23 Mar 2015 12:54:06 -0400 > If it was only an optimization (i.e., removing it would not break > any functionality), and if this was done for older hardware, > and *if* we believe that the direction of most architectures is to > follow the sun4v/HV model, then, given that the sun4u code only uses 1 > arena pool anyway, one thought that I have for refactoring this > is the following: Why add performance regressions to old machines that are already suffering too much from all the bloat we are constantly adding to the kernel? ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Generic IOMMU pooled allocator 2015-03-23 19:05 ` David Miller @ 2015-03-23 19:09 ` Sowmini Varadhan -1 siblings, 0 replies; 71+ messages in thread From: Sowmini Varadhan @ 2015-03-23 19:09 UTC (permalink / raw) To: David Miller; +Cc: aik, aik, anton, paulus, sparclinux, linuxppc-dev On (03/23/15 15:05), David Miller wrote: > > Why add performance regressions to old machines that are already > suffering too much from all the bloat we are constantly adding to the > kernel? I have no personal opinion on this - it's a matter of choosing whether we want to have some extra baggage in the library to support old hardware, or keep the code lean-and-mean for the future, where the model has changed. I'll let others on this list make the call. --Sowmini ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Generic IOMMU pooled allocator 2015-03-23 19:05 ` David Miller @ 2015-03-23 22:21 ` Benjamin Herrenschmidt -1 siblings, 0 replies; 71+ messages in thread From: Benjamin Herrenschmidt @ 2015-03-23 22:21 UTC (permalink / raw) To: David Miller Cc: aik, aik, sowmini.varadhan, anton, paulus, sparclinux, linuxppc-dev On Mon, 2015-03-23 at 15:05 -0400, David Miller wrote: > From: Sowmini Varadhan <sowmini.varadhan@oracle.com> > Date: Mon, 23 Mar 2015 12:54:06 -0400 > > > If it was only an optimization (i.e., removing it would not break > > any functionality), and if this was done for older hardware, > > and *if* we believe that the direction of most architectures is to > > follow the sun4v/HV model, then, given that the sun4u code only uses 1 > > arena pool anyway, one thought that I have for refactoring this > > is the following: > > Why add performance regressions to old machines that are already > suffering too much from all the bloat we are constantly adding to the > kernel? So we have two choices here that I can see: - Keep that old platform using the old/simpler allocator - Try to regain the bulk of that benefit with the new one Sowmini, I see various options for the second choice. We could stick to 1 pool, and basically do as before, i.e., if we fail on the first pass of alloc, it means we wrap around and do a flush. I don't think that will cause a significant degradation from today, do you? We might have an occasional additional flush, but I would expect it to be in the noise. Another option, which would be trickier, is to keep an additional bitmap of "stale" entries. When an entry is freed, instead of freeing it in the main bitmap, set a bit in the "stale" bitmap. If we fail to allocate, then flush, xor off the main bitmap bits using the stale bitmap, and try again.
However, the second approach means that as the main bitmap gets full, we will start allocating from remote pools, so it partially defeats the pool system, unless we do everything locally per-pool (i.e., flush when the pool is full before we fall back to another pool), in which case we go back to flushing more often than we used to. But here too, the difference might end up in the noise; we still flush orders of magnitude less than once per translation update. Dave, what's your feeling there? Does anybody around still have some HW that we can test with? Sowmini, I think we can still kill the ops and have a separate data structure exclusively concerned with allocations by having the alloc functions take the lazy flush function as an argument (which can be NULL). I don't think we should bother with ops. Cheers, Ben. ^ permalink raw reply [flat|nested] 71+ messages in thread
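[Editor's sketch] The "stale bitmap" idea above can be sketched with a single machine word standing in for the arena bitmap. Every name here is hypothetical; a real implementation would operate on the pool's bitmap under the pool lock. Frees only mark entries stale, and the flush plus reclaim (`used &= ~stale`) happens lazily, only when allocation fails:

```c
/* Sketch of lazy frees via a "stale" bitmap; one word = the bitmap. */
struct lazy_map {
    unsigned long used;   /* entries currently handed out */
    unsigned long stale;  /* entries freed but not yet flushed */
    unsigned long nflush; /* count of full flushes */
};

static void lazy_free(struct lazy_map *m, unsigned int bit)
{
    /* Defer the real free: just mark the entry stale. */
    m->stale |= 1UL << bit;
}

/* Find a free bit among the first 'nbits'. If none, flush once,
 * fold the stale entries back into the free space, and retry.
 * Returns -1 only if the map is truly full. */
static int lazy_alloc(struct lazy_map *m, unsigned int nbits)
{
    int pass;

    for (pass = 0; pass < 2; pass++) {
        unsigned int i;
        for (i = 0; i < nbits; i++) {
            if (!(m->used & (1UL << i))) {
                m->used |= 1UL << i;
                return (int)i;
            }
        }
        if (pass == 0) {
            /* Full: flush the IOMMU, then reclaim stale entries. */
            m->nflush++;
            m->used &= ~m->stale;
            m->stale = 0;
        }
    }
    return -1;
}
```

Note the ordering the thread later insists on: the flush must complete before the stale entries become allocatable again, otherwise another allocation could hand out a not-yet-flushed entry.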
* Re: Generic IOMMU pooled allocator 2015-03-23 22:21 ` Benjamin Herrenschmidt @ 2015-03-23 23:08 ` Sowmini Varadhan -1 siblings, 0 replies; 71+ messages in thread From: Sowmini Varadhan @ 2015-03-23 23:08 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: aik, aik, anton, paulus, sparclinux, linuxppc-dev, David Miller On (03/24/15 09:21), Benjamin Herrenschmidt wrote: > > So we have two choices here that I can see: > > - Keep that old platform use the old/simpler allocator Problem with that approach is that the base "struct iommu" structure for sparc gets a split personality: the older one is used with the older allocator, and other ugly things ensue. (Alternatively, you end up duplicating a version of the code with the flush_all inlined.) > - Try to regain the bulk of that benefit with the new one > > Sowmini, I see various options for the second choice. We could stick to > 1 pool, and basically do as before, ie, if we fail on the first pass of > alloc, it means we wrap around and do a flush, I don't think that will > cause a significant degradation from today, do you ? We might have an > occasional additional flush but I would expect it to be in the noise. Isn't this essentially what I have in patch v5 here: http://www.spinics.net/lists/sparclinux/msg13534.html (the ops->reset is the flushall indirection, can be renamed if the latter is preferred) > Dave, what's your feeling there ? Does anybody around still have some > HW that we can test with ? I actually tested this on a V440 and an ultra45 (had a heck of a time finding these, since the owners keep them turned off because they are too noisy and consume too much power :-). Thus while I have no opinion, I would not shed any tears if we lost this extra perf-tweak in the interest of being earth-friendly :-)) So testing it is not a problem, though I don't have any perf benchmarks for them either.
> Sowmini, I think we can still kill the ops and have a separate data > structure exclusively concerned by allocations by having the alloc > functions take the lazy flush function as an argument (which can be > NULL), I don't think we should bother with ops. I don't quite follow what you have in mind? The caller would somehow have to specify a flush_all indirection for the legacy platforms. Also, you mention > You must hold the lock until you do the flush, otherwise somebody > else might allocate the not-yet-flushed areas and try to use them... > kaboom. However if that's the only callback left, pass it as an > argument. Passing in a function pointer to the flushall to iommu_tbl_range_alloc would work, or we could pass it in as an arg to iommu_tbl_init, and stash it in the struct iommu_table. --Sowmini ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Generic IOMMU pooled allocator 2015-03-23 23:08 ` Sowmini Varadhan (?) @ 2015-03-23 23:29 ` chase rayfield -1 siblings, 0 replies; 71+ messages in thread From: chase rayfield @ 2015-03-23 23:29 UTC (permalink / raw) To: Sowmini Varadhan Cc: aik, aik, anton, paulus, sparclinux, linuxppc-dev, David Miller [-- Attachment #1: Type: text/plain, Size: 1813 bytes --] On Mar 23, 2015 7:13 PM, "Sowmini Varadhan" <sowmini.varadhan@oracle.com> wrote: > > On (03/24/15 09:21), Benjamin Herrenschmidt wrote: > > > > So we have two choices here that I can see: > > > > - Keep that old platform use the old/simpler allocator > > Problem with that approach is that the base "struct iommu" structure > for sparc gets a split personality: the older one is used with > the older allocator, and other ugly things ensue. (alternatively, > you end up duplicating a version of the code with the flush_all > inlined). > > > - Try to regain the bulk of that benefit with the new one > > > > Sowmini, I see various options for the second choice. We could stick to > > 1 pool, and basically do as before, ie, if we fail on the first pass of > > alloc, it means we wrap around and do a flush, I don't think that will > > cause a significant degradation from today, do you ? We might have an > > occasional additional flush but I would expect it to be in the noise. > > Isn't this essentially what I have in patch v5 here: > http://www.spinics.net/lists/sparclinux/msg13534.html > > (the ops->reset is the flushall indirection, can be renamed if the > latter is preferred) > > > Dave, what's your feeling there ? Does anybody around still have some > > HW that we can test with ? > > I actually tested this on a V440 and a ultra45 (had a heck of a > time finding these, since the owners keep them turned off because > they are too noisy and consume too much power :-). So we need tests, then, more than hw? I have an ultra1, an ultra10, and a t2000 I can test on if needed.
And I'd appreciate my caches not being flushed excessively on these boxes :-) too. I'll see if I can't get Gentoo on the u10 tonight. Also... It would be more "green" to make the code run faster on these boxes than otherwise! ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Generic IOMMU pooled allocator 2015-03-23 23:08 ` Sowmini Varadhan @ 2015-03-24 0:47 ` Benjamin Herrenschmidt -1 siblings, 0 replies; 71+ messages in thread From: Benjamin Herrenschmidt @ 2015-03-24 0:47 UTC (permalink / raw) To: Sowmini Varadhan Cc: aik, aik, anton, paulus, sparclinux, linuxppc-dev, David Miller On Mon, 2015-03-23 at 19:08 -0400, Sowmini Varadhan wrote: > > Sowmini, I see various options for the second choice. We could stick to > > 1 pool, and basically do as before, ie, if we fail on the first pass of > > alloc, it means we wrap around and do a flush, I don't think that will > > cause a significant degradation from today, do you ? We might have an > > occasional additional flush but I would expect it to be in the noise. > > Isn't this essentially what I have in patch v5 here: > http://www.spinics.net/lists/sparclinux/msg13534.html Possibly, I haven't looked at it in detail yet. My point is that while we might flush a bit more often than the original code, it should remain in the noise performance-wise, and thus David might rescind his objection, but we need to prove this with a measurement. > (the ops->reset is the flushall indirection, can be renamed if the > latter is preferred) > > > Dave, what's your feeling there ? Does anybody around still have some > > HW that we can test with ? > > I actually tested this on a V440 and a ultra45 (had a heck of a > time finding these, since the owners keep them turned off because > they are too noisy and consume too much power :-). Thus while I > have no opinion, I would not shed any tears if we lost this extra > perf-tweak in the interest of being earth-friendly :-)) > > so testing it is not a problem, though I dont have any perf > benchmarks for them either. That's what we'll need, unfortunately, to confirm our "gut feeling".
It might be sufficient to add a flush counter and compare it between runs if actual wall-clock benchmarks are too hard to do (especially if you don't have things like very fast network cards at hand). Number of flushes / number of packets might be a sufficient metric; it depends on whether it makes David happy :-) > > Sowmini, I think we can still kill the ops and have a separate data > > structure exclusively concerned by allocations by having the alloc > > functions take the lazy flush function as an argument (which can be > > NULL), I don't think we should bother with ops. > > I dont quite follow what you have in mind? The caller would somehow > have to specify a flush_all indirection for the legacy platforms Yes, pass a function pointer argument that can be NULL, or just make it a member of the iommu_allocator struct (or whatever you call it) passed to the init function, and that can be NULL. My point is we don't need a separate "ops" structure. > Also, you mention > > > You must hold the lock until you do the flush, otherwise somebody > > else might allocate the not-yet-flushed areas and try to use them... > > kaboom. However if that's the only callback left, pass it as an > > argument. > > Passing in a function pointer to the flushall to iommu_tbl_range_alloc > would work, or we could pass it in as an arg to iommu_tbl_init, > and stash it in the struct iommu_table. Pass it to init and stash it in the table, but don't call it "iommu_table"; let's use a name that better conveys the fact that this is purely a DMA space allocator (to be embedded by the arch-specific iommu table). Something like iommu_alloc_zone or whatever you want to call it. I keep finding new names whenever I think of it :-) Cheers, Ben. ^ permalink raw reply [flat|nested] 71+ messages in thread
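[Editor's sketch] Ben's suggestion — stash a single optional lazy-flush callback at init time instead of carrying a full ops structure — might look like the following. The names (`dma_zone`, `dma_zone_init`, `count_flush`) are placeholders, not proposed API; a NULL callback models the sun4v/HV case where no software-controlled flush exists:

```c
#include <stddef.h>

/* Hypothetical allocator that stores one optional callback at init
 * time; no ops structure needed. */
struct dma_zone {
    unsigned long hint;    /* next allocation hint */
    unsigned long size;    /* entries in the zone */
    unsigned long nflush;  /* flushes observed (for the demo) */
    void (*lazy_flush)(struct dma_zone *);  /* may be NULL (sun4v/HV) */
};

static void dma_zone_init(struct dma_zone *z, unsigned long size,
                          void (*lazy_flush)(struct dma_zone *))
{
    z->hint = 0;
    z->size = size;
    z->nflush = 0;
    z->lazy_flush = lazy_flush;
}

static long dma_zone_alloc(struct dma_zone *z, unsigned long n)
{
    if (z->hint + n > z->size) {
        z->hint = 0;
        if (z->lazy_flush)   /* only legacy (sun4u-style) callers flush */
            z->lazy_flush(z);
        if (n > z->size)
            return -1;
    }
    z->hint += n;
    return (long)(z->hint - n);
}

/* Demo callback standing in for a hardware flush-all. */
static void count_flush(struct dma_zone *z) { z->nflush++; }
```

Callers that have no lazy-flush semantics simply pass NULL at init and never pay for the indirection on wraparound.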
* Re: Generic IOMMU pooled allocator 2015-03-24 0:47 ` Benjamin Herrenschmidt @ 2015-03-24 1:11 ` Sowmini Varadhan -1 siblings, 0 replies; 71+ messages in thread From: Sowmini Varadhan @ 2015-03-24 1:11 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: aik, aik, anton, paulus, sparclinux, linuxppc-dev, David Miller On (03/24/15 11:47), Benjamin Herrenschmidt wrote: > > Yes, pass a function pointer argument that can be NULL or just make it a > member of the iommu_allocator struct (or whatever you call it) passed to > the init function and that can be NULL. My point is we don't need a > separate "ops" structure. Ok, I'll make this a function pointer to the iommu_tbl_pool_init() function (tomorrow). Fwiw, the ops struct came out as a result of DaveM input, http://www.spinics.net/lists/sparclinux/msg13232.html albeit the original context is now moot. > Pass it to init and stash it in the table but don't call it > "iommu_table", let's use a name that conveys better the fact that this > is purely a DMA space allocator (to be embedded by the arch specific > iommu table). Something like iommu_alloc_zone or whatever you want to > call it. I keep finding new names whenever I think of it :-) Please pick a name that you like, as this mooshes all the patches, and I myself really don't care what it gets called (rose? :-)) as long as fat-fingering risk is minimized. Regarding the relationship with largepool, > But that might not be necessary. If indeed we very rarely use the large > pool, then just make it flush always. My feeling is that it will only > ever be used at driver init/remove time when allocating things like > descriptor rings, where the flush overhead don't matter. OK, will factor this in tomorrow (assuming DaveM has no objections to anything proposed here, esp. around the function pointer for flushall). --Sowmini ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Generic IOMMU pooled allocator 2015-03-23 22:21 ` Benjamin Herrenschmidt @ 2015-03-24 1:44 ` David Miller -1 siblings, 0 replies; 71+ messages in thread From: David Miller @ 2015-03-24 1:44 UTC (permalink / raw) To: benh; +Cc: aik, aik, sowmini.varadhan, anton, paulus, sparclinux, linuxppc-dev From: Benjamin Herrenschmidt <benh@kernel.crashing.org> Date: Tue, 24 Mar 2015 09:21:05 +1100 > Dave, what's your feeling there ? Does anybody around still have > some HW that we can test with ? I don't see what the actual problem is. Even if you use multiple pools, which we should for scalability on sun4u too, just do the flush when allocation in _any_ pool wraps around. That's still better than not doing the optimization at all. That is always going to be correct, and you can use a separate spinlock to make sure only one thread of control does the full IOMMU flush at a time. ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Generic IOMMU pooled allocator 2015-03-24 1:44 ` David Miller @ 2015-03-24 1:57 ` Sowmini Varadhan -1 siblings, 0 replies; 71+ messages in thread From: Sowmini Varadhan @ 2015-03-24 1:57 UTC (permalink / raw) To: David Miller; +Cc: aik, aik, anton, paulus, sparclinux, linuxppc-dev benh> It might be sufficient to add a flush counter and compare it between runs benh> if actual wall-clock benchmarks are too hard to do (especially if you benh> don't have things like very fast network cards at hand). benh> benh> Number of flush / number of packets might be a sufficient metric, it.. I was just going to say: I can add those counters tomorrow, and get some stats, but it seems like it doesn't really matter what the outcome is, because: On (03/23/15 21:44), David Miller wrote: > > From: Benjamin Herrenschmidt <benh@kernel.crashing.org> > Date: Tue, 24 Mar 2015 09:21:05 +1100 > > > Dave, what's your feeling there ? Does anybody around still have > > some HW that we can test with ? > > I don't see what the actual problem is. > > Even if you use multiple pools, which we should for scalability on > sun4u too, just do the flush when allocation in _any_ pool wraps > around. > > That's still better than not doing the optimization at all. > > That is always going to be correct, and you can use a separate > spinlock to make sure only one thread of control does the full > IOMMU flush at a time. --Sowmini ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Generic IOMMU pooled allocator 2015-03-24 1:44 ` David Miller @ 2015-03-24 2:08 ` Benjamin Herrenschmidt 0 siblings, 0 replies; 71+ messages in thread From: Benjamin Herrenschmidt @ 2015-03-24 2:08 UTC (permalink / raw) To: David Miller Cc: aik, aik, sowmini.varadhan, anton, paulus, sparclinux, linuxppc-dev On Mon, 2015-03-23 at 21:44 -0400, David Miller wrote: > From: Benjamin Herrenschmidt <benh@kernel.crashing.org> > Date: Tue, 24 Mar 2015 09:21:05 +1100 > > > Dave, what's your feeling there ? Does anybody around still have > > some HW that we can test with ? > > I don't see what the actual problem is. > > Even if you use multiple pools, which we should for scalability on > sun4u too, just do the flush when allocation in _any_ pool wraps > around. > > That's still better than not doing the optimization at all. I agree, I wasn't sure it was good enough for you, so was just putting other options on the table. > That is always going to be correct, and you can use a separate > spinlock to make sure only one thread of control does the full > IOMMU flush at a time. That would have to be done inside of the flush callback but I don't see a big issue there. For the large pool, we don't keep a hint so we don't know it's wrapped, in fact we purposefully don't use a hint to limit fragmentation on it, but then, it should be used rarely enough that flushing always is, I suspect, a good option. Cheers, Ben. ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Generic IOMMU pooled allocator 2015-03-24 2:08 ` Benjamin Herrenschmidt @ 2015-03-24 2:15 ` David Miller 0 siblings, 0 replies; 71+ messages in thread From: David Miller @ 2015-03-24 2:15 UTC (permalink / raw) To: benh; +Cc: aik, aik, sowmini.varadhan, anton, paulus, sparclinux, linuxppc-dev From: Benjamin Herrenschmidt <benh@kernel.crashing.org> Date: Tue, 24 Mar 2015 13:08:10 +1100 > For the large pool, we don't keep a hint so we don't know it's > wrapped, in fact we purposefully don't use a hint to limit > fragmentation on it, but then, it should be used rarely enough that > flushing always is, I suspect, a good option. I can't think of any use case where the largepool would be hit a lot at all. ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Generic IOMMU pooled allocator 2015-03-24 2:15 ` David Miller @ 2015-03-26 0:43 ` cascardo 0 siblings, 0 replies; 71+ messages in thread From: cascardo @ 2015-03-26 0:43 UTC (permalink / raw) To: David Miller Cc: aik, aik, anton, paulus, sowmini.varadhan, sparclinux, linuxppc-dev On Mon, Mar 23, 2015 at 10:15:08PM -0400, David Miller wrote: > From: Benjamin Herrenschmidt <benh@kernel.crashing.org> > Date: Tue, 24 Mar 2015 13:08:10 +1100 > > > For the large pool, we don't keep a hint so we don't know it's > > wrapped, in fact we purposefully don't use a hint to limit > > fragmentation on it, but then, it should be used rarely enough that > > flushing always is, I suspect, a good option. > > I can't think of any use case where the largepool would be hit a lot > at all. Well, until recently, IOMMU_PAGE_SIZE was 4KiB on Power, so every time a driver mapped a whole 64KiB page, it would hit the largepool. I have been suspicious for some time that after Anton's work on the pools, the large mappings optimization would throw away the benefit of using the 4 pools, since some drivers would always hit the largepool. Of course, drivers that map entire pages, when not buggy, are optimized already to avoid calling dma_map all the time. I worked on that for mlx4_en, and I would expect that its receive side would always hit the largepool. So, I decided to experiment and count the number of times that largealloc is true versus false. On the transmit side, or when using ICMP, I didn't notice many large allocations with qlge or cxgb4. However, when using large TCP send/recv (I used uperf with 64KB writes/reads), I noticed that on the transmit side, largealloc is not used, but on the receive side, cxgb4 almost only uses largealloc, while qlge seems to have a 1/1 usage of largealloc/non-largealloc mappings. When turning GRO off, that ratio is closer to 1/10, meaning there is still some fair use of largealloc in that scenario. I confess my experiments are not complete.
I would like to test a couple of other drivers as well, including mlx4_en and bnx2x, and test with small packet sizes. I suspected that MTU size could make a difference, but in the case of ICMP, with MTU 9000 and payload of 8000 bytes, I didn't notice any significant hit of largepool with either qlge or cxgb4. Also, we need to keep in mind that IOMMU_PAGE_SIZE is now dynamic in the latest code, with plans on using 64KiB in some situations, Alexey or Ben should have more details. But I believe that on the receive side, all drivers should map entire pages, using some allocation strategy similar to mlx4_en, in order to avoid DMA mapping all the time. Some believe that is bad for latency, and prefer to call something like skb_alloc for every packet received, but I haven't seen any hard numbers, and I don't know why we couldn't make such an allocator as good as using something like the SLAB/SLUB allocator. Maybe there is a jitter problem, since the allocator has to go out and get some new pages and map them, once in a while. But I don't see why this would not be a problem with SLAB/SLUB as well. Calling dma_map is even worse with the current implementation. It's just that some architectures do no work at all when dma_map/unmap is called. Hope that helps consider the best strategy for the DMA space allocation as of now. Regards. Cascardo. ^ permalink raw reply [flat|nested] 71+ messages in thread
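The largealloc-versus-smallalloc counting experiment described above can be sketched as a trivial instrumentation hook; the counter and macro names here are made up for illustration, and the "> 15" cutoff mirrors the hard-wired largealloc test discussed in this thread:

```c
#include <assert.h>

/* Illustrative counters: tally how often a mapping request would be
 * routed to the large pool versus the small pools. */
#define LARGEALLOC_THRESH 15UL   /* hard-wired cutoff, in IOMMU pages */

static unsigned long large_hits, small_hits;

static void count_alloc(unsigned long npages)
{
    if (npages > LARGEALLOC_THRESH)
        large_hits++;
    else
        small_hits++;
}
```

With a 4KiB IOMMU page size, mapping a whole 64KiB page is 16 entries and therefore always lands in the large pool, which is the point made above about Power before the page size became dynamic.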
* Re: Generic IOMMU pooled allocator 2015-03-26 0:43 ` cascardo @ 2015-03-26 0:49 ` Benjamin Herrenschmidt 0 siblings, 0 replies; 71+ messages in thread From: Benjamin Herrenschmidt @ 2015-03-26 0:49 UTC (permalink / raw) To: cascardo Cc: aik, aik, sowmini.varadhan, anton, paulus, sparclinux, linuxppc-dev, David Miller On Wed, 2015-03-25 at 21:43 -0300, cascardo@linux.vnet.ibm.com wrote: > On Mon, Mar 23, 2015 at 10:15:08PM -0400, David Miller wrote: > > From: Benjamin Herrenschmidt <benh@kernel.crashing.org> > > Date: Tue, 24 Mar 2015 13:08:10 +1100 > > > > > For the large pool, we don't keep a hint so we don't know it's > > > wrapped, in fact we purposefully don't use a hint to limit > > > fragmentation on it, but then, it should be used rarely enough that > > > flushing always is, I suspect, a good option. > > > > I can't think of any use case where the largepool would be hit a lot > > at all. > > Well, until recently, IOMMU_PAGE_SIZE was 4KiB on Power, so every time a > driver mapped a whole 64KiB page, it would hit the largepool. Yes but I was talking of sparc here ... > I have been suspicious for some time that after Anton's work on the > pools, the large mappings optimization would throw away the benefit of > using the 4 pools, since some drivers would always hit the largepool. Right, I was thinking we should change the test for large pool from > 15 to > (PAGE_SHIFT * n) where n is TBD by experimentation. > Of course, drivers that map entire pages, when not buggy, are optimized > already to avoid calling dma_map all the time. I worked on that for > mlx4_en, and I would expect that its receive side would always hit the > largepool. > > So, I decided to experiment and count the number of times that > largealloc is true versus false. > > On the transmit side, or when using ICMP, I didn't notice many large > allocations with qlge or cxgb4.
> > However, when using large TCP send/recv (I used uperf with 64KB > writes/reads), I noticed that on the transmit side, largealloc is not > used, but on the receive side, cxgb4 almost only uses largealloc, while > qlge seems to have a 1/1 usage or largealloc/non-largealloc mappings. > When turning GRO off, that ratio is closer to 1/10, meaning there is > still some fair use of largealloc in that scenario. What are the sizes involved ? Always just 64K ? Or more ? Maybe just changing 15 to 16 in the test would be sufficient ? We should make the threshold a parameter set at init time so archs/platforms can adjust it. > I confess my experiments are not complete. I would like to test a couple > of other drivers as well, including mlx4_en and bnx2x, and test with > small packet sizes. I suspected that MTU size could make a difference, > but in the case of ICMP, with MTU 9000 and payload of 8000 bytes, I > didn't notice any significant hit of largepool with either qlge or > cxgb4. > > Also, we need to keep in mind that IOMMU_PAGE_SIZE is now dynamic in the > latest code, with plans on using 64KiB in some situations, Alexey or Ben > should have more details. We still mostly use 4K afaik... We will use 64K in some KVM setups and I do plan to switch to 64K under some circumstances when we can but we have some limits imposed by PAPR under hypervisors here. > But I believe that on the receive side, all drivers should map entire > pages, using some allocation strategy similar to mlx4_en, in order to > avoid DMA mapping all the time. Some believe that is bad for latency, > and prefer to call something like skb_alloc for every package received, > but I haven't seen any hard numbers, and I don't know why we couldn't > make such an allocator as good as using something like the SLAB/SLUB > allocator. Maybe there is a jitter problem, since the allocator has to > go out and get some new pages and map them, once in a while.
But I don't > see why this would not be a problem with SLAB/SLUB as well. Calling > dma_map is even worse with the current implementation. It's just that > some architectures do no work at all when dma_map/unmap is called. > > Hope that helps consider the best strategy for the DMA space allocation > as of now. In any case, I don't think Sparc has the same issue. At this point that's all I care about, once we adapt powerpc to use the new code, we can revisit that problem on our side. Cheers, Ben. > Regards. > Cascardo. ^ permalink raw reply [flat|nested] 71+ messages in thread
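Ben's suggestion, making the large-pool cutoff an init-time parameter instead of the hard-wired "> 15", could be sketched like this; the struct and function names are assumptions for illustration, not the actual powerpc iommu code, with 0 taken to mean the large pool is disabled:

```c
#include <assert.h>

/* Illustrative table state; the field and function names are
 * assumptions, not the real powerpc implementation. */
struct iommu_table {
    unsigned long large_pool_thresh;   /* in IOMMU pages; 0 = large pool disabled */
};

/* Init-time hook so each arch/platform can pick its own threshold. */
static void iommu_table_init(struct iommu_table *tbl, unsigned long thresh)
{
    tbl->large_pool_thresh = thresh;
}

/* Pool selection: route to the large pool only when it is enabled and
 * the request exceeds the configured threshold. */
static int use_large_pool(const struct iommu_table *tbl, unsigned long npages)
{
    return tbl->large_pool_thresh != 0 && npages > tbl->large_pool_thresh;
}
```

A platform that sees many 16-page (64KiB) mappings could then raise its threshold at init time rather than patching the comparison.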
* Re: Generic IOMMU pooled allocator 2015-03-26 0:43 ` cascardo @ 2015-03-26 10:56 ` Sowmini Varadhan 0 siblings, 0 replies; 71+ messages in thread From: Sowmini Varadhan @ 2015-03-26 10:56 UTC (permalink / raw) To: cascardo; +Cc: aik, aik, anton, paulus, sparclinux, linuxppc-dev, David Miller On (03/25/15 21:43), cascardo@linux.vnet.ibm.com wrote: > However, when using large TCP send/recv (I used uperf with 64KB > writes/reads), I noticed that on the transmit side, largealloc is not > used, but on the receive side, cxgb4 almost only uses largealloc, while > qlge seems to have a 1/1 usage or largealloc/non-largealloc mappings. > When turning GRO off, that ratio is closer to 1/10, meaning there is > still some fair use of largealloc in that scenario. > > I confess my experiments are not complete. I would like to test a couple > of other drivers as well, including mlx4_en and bnx2x, and test with > small packet sizes. I suspected that MTU size could make a difference, > but in the case of ICMP, with MTU 9000 and payload of 8000 bytes, I > didn't notice any significant hit of largepool with either qlge or > cxgb4. I guess we also need to consider the "average use-case", i.e., something that interleaves small packets and interactive data with jumbo/bulk data.. in those cases, the largepool would not get many hits, and might actually be undesirable? > But I believe that on the receive side, all drivers should map entire > pages, using some allocation strategy similar to mlx4_en, in order to > avoid DMA mapping all the time. good point. I think in the early phase of my perf investigation, it was brought up that Solaris does pre-mapped DMA buffers (they have to do this carefully, to avoid resource-starvation vulnerabilities; see http://www.spinics.net/lists/sparclinux/msg13217.html and threads leading to it). This is not something that the common iommu-arena allocator can/should get involved in, of course. The scope of the arena-allocator is much more rigorously defined.
I don't know if there is a way to set up a generalized DMA premapped buffer infra for linux today. fwiw, when I instrumented this for solaris (there are hooks to disable the pre-mapped buffers) the impact on a T5-2 (8 sockets, 2 numa nodes, 64 cpus) was not very significant for a single 10G ixgbe port - approx 8 Gbps instead of 9.X Gbps. I think the DMA buffer pre-mapping is only significant when you start trying to scale to multiple ethernet ports. --Sowmini ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Generic IOMMU pooled allocator 2015-03-26 0:43 ` cascardo @ 2015-03-26 23:00 ` David Miller 0 siblings, 0 replies; 71+ messages in thread From: David Miller @ 2015-03-26 23:00 UTC (permalink / raw) Cc: aik, aik, anton, paulus, sowmini.varadhan, sparclinux, linuxppc-dev From: cascardo@linux.vnet.ibm.com Date: Wed, 25 Mar 2015 21:43:42 -0300 > On Mon, Mar 23, 2015 at 10:15:08PM -0400, David Miller wrote: >> From: Benjamin Herrenschmidt <benh@kernel.crashing.org> >> Date: Tue, 24 Mar 2015 13:08:10 +1100 >> >> > For the large pool, we don't keep a hint so we don't know it's >> > wrapped, in fact we purposefully don't use a hint to limit >> > fragmentation on it, but then, it should be used rarely enough that >> > flushing always is, I suspect, a good option. >> >> I can't think of any use case where the largepool would be hit a lot >> at all. > > Well, until recently, IOMMU_PAGE_SIZE was 4KiB on Power, so every time a > driver mapped a whole 64KiB page, it would hit the largepool. We don't plan to ever use 64KB page size on sparc64, so I think we're safe there. There are many reasons using 64KB pages is really stupid, what you see here with the IOMMU stuff is just one of many symptoms, but I bet your powerpc guys are kind of sick of hearing it by now... :-) ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Generic IOMMU pooled allocator 2015-03-26 23:00 ` David Miller @ 2015-03-26 23:51 ` Benjamin Herrenschmidt 0 siblings, 0 replies; 71+ messages in thread From: Benjamin Herrenschmidt @ 2015-03-26 23:51 UTC (permalink / raw) To: David Miller Cc: aik, aik, sowmini.varadhan, anton, paulus, sparclinux, linuxppc-dev On Thu, 2015-03-26 at 16:00 -0700, David Miller wrote: > From: cascardo@linux.vnet.ibm.com > Date: Wed, 25 Mar 2015 21:43:42 -0300 > > > On Mon, Mar 23, 2015 at 10:15:08PM -0400, David Miller wrote: > >> From: Benjamin Herrenschmidt <benh@kernel.crashing.org> > >> Date: Tue, 24 Mar 2015 13:08:10 +1100 > >> > >> > For the large pool, we don't keep a hint so we don't know it's > >> > wrapped, in fact we purposefully don't use a hint to limit > >> > fragmentation on it, but then, it should be used rarely enough that > >> > flushing always is, I suspect, a good option. > >> > >> I can't think of any use case where the largepool would be hit a lot > >> at all. > > > > Well, until recently, IOMMU_PAGE_SIZE was 4KiB on Power, so every time a > > driver mapped a whole 64KiB page, it would hit the largepool. > > We don't plan to ever use 64KB page size on sparc64, so I think we're > safe there. > > There are many reasons using 64KB pages is really stupid, what you > see here with the IOMMU stuff is just one of many symptoms, but I bet > your powerpc guys are kind of sick of hearing it by now... :-) Lalalala :-) Cheers, Ben. ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Generic IOMMU pooled allocator 2015-03-23 19:05 ` David Miller @ 2015-03-23 22:36 ` Benjamin Herrenschmidt 0 siblings, 0 replies; 71+ messages in thread From: Benjamin Herrenschmidt @ 2015-03-23 22:36 UTC (permalink / raw) To: David Miller Cc: aik, aik, sowmini.varadhan, anton, paulus, sparclinux, linuxppc-dev On Mon, 2015-03-23 at 15:05 -0400, David Miller wrote: > From: Sowmini Varadhan <sowmini.varadhan@oracle.com> > Date: Mon, 23 Mar 2015 12:54:06 -0400 > > > If it was only an optimization (i.e., removing it would not break > > any functionality), and if this was done for older hardware, > > and *if* we believe that the direction of most architectures is to > > follow the sun4v/HV model, then, given that the sun4u code only uses 1 > > arena pool anyway, one thought that I have for refactoring this > > is the following: > > Why add performance regressions to old machines who already are > suffering too much from all the bloat we are constantly adding to the > kernel? Our allocator more/less allocates bottom-up from the last successful allocation and tries again if it fails, so in essence, it is not so different from what you have done. What we need to do is: - One pool only - Whenever the allocation is before the previous hint, do a flush, that should only happen if a wrap around occurred or in some cases if the device DMA mask forced it. I think we always update the hint whenever we successfully allocate from the small pools. - Deal with the largealloc case. That's the contentious issue, see below. So if we ignore the largealloc split we have, we can pretty much recover David's behaviour using our allocator. There might be the odd flush here or there that we do and he didn't due to differences in implementation details, but I doubt those are statistically relevant, as long as we don't flush on every network packet, we should be good. The largealloc issue is a different can of worms.
We can try adding an option to disable the largealloc business completely (instead of hard wiring the "15", make that a variable and define that 0 means no largealloc pool). Or we can decide that large allocs are rare (typically pci_alloc_consistent, ie, driver init time), and thus always flush on them (or rather on free of a large chunk). David, what's your take there ? I have a feeling that should work fine without a noticeable performance issue... I would also keep a "dirty" flag set on any free and cleared on any flush to avoid more spurrious flushes, but here too the benefit might be in the noise. Cheers, Ben. ^ permalink raw reply [flat|nested] 71+ messages in thread
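[Editorial note: the single-pool scheme Ben describes above — hint-based allocation, a "dirty" flag set on free and cleared on flush, flushing only on wrap-around — can be sketched in plain C. This is a minimal illustration with made-up names and no bitmap or locking, not the actual powerpc/sparc code:]

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative single-pool state; real code also carries a bitmap and a lock. */
struct iommu_pool {
	unsigned long start, end;   /* [start, end) range of table entries */
	unsigned long hint;         /* next entry to try (last success + size) */
	bool dirty;                 /* a free happened since the last flush */
	unsigned long nflush;       /* flush counter, for the sketch only */
};

static void iotlb_flush(struct iommu_pool *p)
{
	p->nflush++;                /* real code would flush the hardware IOTLB */
	p->dirty = false;
}

/* Allocate npages contiguous entries, first-fit from the hint.  A flush is
 * only needed when we wrap back before the previous hint AND something was
 * freed since the last flush. */
static unsigned long iommu_alloc(struct iommu_pool *p, unsigned long npages)
{
	unsigned long n = p->hint;

	if (n + npages > p->end) {  /* wrap-around: restart at pool start */
		n = p->start;
		if (p->dirty)
			iotlb_flush(p); /* stale translations may still be cached */
	}
	p->hint = n + npages;
	return n;
}

static void iommu_free(struct iommu_pool *p, unsigned long entry,
		       unsigned long npages)
{
	(void)entry; (void)npages;  /* real code clears the bitmap bits */
	p->dirty = true;            /* a flush is now owed before reuse */
}
```

The win being discussed is visible here: a steady stream of map/unmap operations (e.g. one per network packet) costs no flush at all until the allocator wraps.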
* Re: Generic IOMMU pooled allocator
  2015-03-23 22:36 ` Benjamin Herrenschmidt
@ 2015-03-23 23:19 ` Sowmini Varadhan
  0 siblings, 0 replies; 71+ messages in thread
From: Sowmini Varadhan @ 2015-03-23 23:19 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: aik, aik, anton, paulus, sparclinux, linuxppc-dev, David Miller

On (03/24/15 09:36), Benjamin Herrenschmidt wrote:
>
> - One pool only
>
> - Whenever the allocation is before the previous hint, do a flush; that
> should only happen if a wrap-around occurred or in some cases if the
> device DMA mask forced it. I think we always update the hint whenever we
> successfully allocate from the small pools.

Right, I believe this is close to what I discussed in the previous
thread, and approximately what I have in patch v5, except that the flush
indirection can be passed as a function pointer, or via the table_ops.

>
> - Deal with the largealloc case. That's the contentious issue, see
> below.
 :
> The largealloc issue is a different can of worms. We can try adding an
> option to disable the largealloc business completely (instead of
> hard-wiring the "15", make that a variable and define that 0 means no
> largealloc pool).

What I've tried to do is to have a bool large_pool arg passed to
iommu_tbl_pool_init. In my observation (instrumented for scsi, ixgbe),
we never allocate more than 4 pages at a time, so I pass in
large_pool = false for all the sparc platforms.

> Or we can decide that large allocs are rare (typically
> pci_alloc_consistent, i.e., driver init time), and thus always flush on
> them (or rather on free of a large chunk). David, what's your take
> there? I have a feeling that should work fine without a noticeable
> performance issue...
>
> I would also keep a "dirty" flag set on any free and cleared on any
> flush to avoid more spurious flushes, but here too the benefit might be
> in the noise.

^ permalink raw reply	[flat|nested] 71+ messages in thread
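[Editorial note: the "function pointer vs. table_ops" indirection Sowmini mentions might look roughly like this. All names here — iommu_tbl_ops, iommu_map_table, the pool counts, the sun4v callback — are illustrative guesses, not the final lib/iommu-common.c API:]

```c
#include <assert.h>
#include <stdbool.h>

/* The arch hands its IOTLB flush to the generic pool code at init time. */
struct iommu_map_table;

struct iommu_tbl_ops {
	void (*flush_all)(struct iommu_map_table *tbl);
};

struct iommu_map_table {
	const struct iommu_tbl_ops *ops;
	unsigned int nr_pools;
	bool large_pool;            /* false on sparc: >4-page allocs not seen */
};

static void iommu_tbl_pool_init(struct iommu_map_table *tbl,
				const struct iommu_tbl_ops *ops,
				bool large_pool)
{
	tbl->ops = ops;
	tbl->large_pool = large_pool;
	/* e.g. 4 small pools, plus one large pool only when requested */
	tbl->nr_pools = large_pool ? 5 : 4;
}

/* Example arch callback: on sun4v this would be a hypervisor call. */
static void sun4v_flush_all(struct iommu_map_table *tbl)
{
	(void)tbl;                  /* hardware/HV flush would go here */
}

static const struct iommu_tbl_ops demo_ops = { .flush_all = sun4v_flush_all };
```

The choice between an ops table and a bare function pointer only matters if more callbacks than flush_all survive the refactoring; with a single callback, Ben's suggestion of passing it as an argument works just as well.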
* Re: Generic IOMMU pooled allocator
  2015-03-23 23:19 ` Sowmini Varadhan
@ 2015-03-24  0:48 ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 71+ messages in thread
From: Benjamin Herrenschmidt @ 2015-03-24 0:48 UTC (permalink / raw)
To: Sowmini Varadhan
Cc: aik, aik, anton, paulus, sparclinux, linuxppc-dev, David Miller

On Mon, 2015-03-23 at 19:19 -0400, Sowmini Varadhan wrote:
> What I've tried to do is to have a bool large_pool arg passed
> to iommu_tbl_pool_init. In my observation (instrumented for scsi, ixgbe),
> we never allocate more than 4 pages at a time, so I pass in
> large_pool = false for all the sparc platforms.

But that might not be necessary. If indeed we very rarely use the large
pool, then just make it flush always. My feeling is that it will only
ever be used at driver init/remove time when allocating things like
descriptor rings, where the flush overhead doesn't matter.

> > Or we can decide that large allocs are rare (typically
> > pci_alloc_consistent, i.e., driver init time), and thus always flush on
> > them (or rather on free of a large chunk). David, what's your take
> > there? I have a feeling that should work fine without a noticeable
> > performance issue...
> >
> > I would also keep a "dirty" flag set on any free and cleared on any
> > flush to avoid more spurious flushes, but here too the benefit might be
> > in the noise.

Ben.

^ permalink raw reply	[flat|nested] 71+ messages in thread
* Re: Generic IOMMU pooled allocator
  2015-03-23 16:54 ` Sowmini Varadhan
@ 2015-03-23 22:25 ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 71+ messages in thread
From: Benjamin Herrenschmidt @ 2015-03-23 22:25 UTC (permalink / raw)
To: Sowmini Varadhan
Cc: aik, aik, anton, paulus, sparclinux, linuxppc-dev, David Miller

On Mon, 2015-03-23 at 12:54 -0400, Sowmini Varadhan wrote:
> If it was only an optimization (i.e., removing it would not break
> any functionality), and if this was done for older hardware,
> and *if* we believe that the direction of most architectures is to
> follow the sun4v/HV model, then, given that the sun4u code only uses 1
> arena pool anyway, one thought that I have for refactoring this
> is the following:
>
> - Caller of iommu_tbl_range_alloc() can do the flush_all if they
>   see start <= end for the one single pool
> - lose the other ->flush_all invocation (i.e., the one that is
>   done when iommu_area_alloc() fails for pass == 0, and we reset
>   start to 0 to roll back)
>
> that would avoid the need for any iommu_tbl_ops in my patch-set.

You must hold the lock until you do the flush, otherwise somebody else
might allocate the not-yet-flushed areas and try to use them... kaboom.
However, if that's the only callback left, pass it as an argument.

> But it would imply that you would still take the perf hit for the roll-back
> if we failed the pass == 0 iteration through iommu_area_alloc().
> Perhaps this is an acceptable compromise in favor of cleaner code
> (again, assuming that current/future archs will all follow the HV
> based design).
>
> --Sowmini

^ permalink raw reply	[flat|nested] 71+ messages in thread
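[Editorial note: the "hold the lock until you do the flush" constraint, with the flush callback passed as an argument as Ben suggests, can be sketched in user-space C. A pthread mutex stands in for the kernel spinlock and all names are hypothetical. If the lock were dropped between picking the recycled range and flushing, another CPU could allocate entries whose old translations are still in the IOTLB:]

```c
#include <assert.h>
#include <pthread.h>

struct iommu_pool {
	pthread_mutex_t lock;       /* a spinlock in the kernel */
	unsigned long start, end, hint;
	int dirty;                  /* set on free, cleared on flush */
	unsigned long nflush;       /* flush counter, for the sketch only */
};

static void flush_all(struct iommu_pool *p)
{
	p->nflush++;                /* real code would flush the IOTLB here */
	p->dirty = 0;
}

/* The single flush callback is passed as an argument instead of living
 * in an ops table. */
static unsigned long iommu_tbl_range_alloc(struct iommu_pool *p,
					   unsigned long npages,
					   void (*flush)(struct iommu_pool *))
{
	unsigned long n;

	pthread_mutex_lock(&p->lock);
	n = p->hint;
	if (n + npages > p->end) {  /* wrapped: recycle from pool start */
		n = p->start;
		if (p->dirty && flush)
			flush(p);   /* MUST happen before the unlock below */
	}
	p->hint = n + npages;
	pthread_mutex_unlock(&p->lock); /* only now may others reuse entries */
	return n;
}
```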
* Re: Generic IOMMU pooled allocator
  2015-03-19  2:25 ` David Miller
@ 2015-03-22 19:36 ` Arnd Bergmann
  0 siblings, 0 replies; 71+ messages in thread
From: Arnd Bergmann @ 2015-03-22 19:36 UTC (permalink / raw)
To: linuxppc-dev; +Cc: sparclinux, sowmini.varadhan, paulus, David Miller

On Thursday 19 March 2015, David Miller wrote:
> PowerPC folks, we're trying to kill the locking contention in our
> IOMMU allocators and noticed that you guys have a nice solution to
> this in your IOMMU code.
>
> Sowmini put together a patch series that tries to extract out the
> generic parts of your code and place it in lib/iommu-common.c so
> that both Sparc and PowerPC can make use of it.
>
> The real test is if powerpc can be converted to use it.
>
> So if you guys could please take a look at the current version of
> this series at:
>
> http://patchwork.ozlabs.org/patch/449712/
> http://patchwork.ozlabs.org/patch/449713/
> http://patchwork.ozlabs.org/patch/449715/
>
> and give us some feedback that would be great.
>
> After we sort out making this work properly for both architectures we
> can figure out who merges which parts via what tree(s)

How does this relate to the ARM implementation? There is currently
an effort going on to make that one shared with ARM64 and possibly
x86. Has anyone looked at both the PowerPC and ARM ways of doing the
allocation to see if we could pick one of the two to work on
all architectures?

	Arnd

^ permalink raw reply	[flat|nested] 71+ messages in thread
* Re: Generic IOMMU pooled allocator
  2015-03-22 19:36 ` Arnd Bergmann
@ 2015-03-22 22:02 ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 71+ messages in thread
From: Benjamin Herrenschmidt @ 2015-03-22 22:02 UTC (permalink / raw)
To: Arnd Bergmann
Cc: sparclinux, sowmini.varadhan, linuxppc-dev, David Miller, paulus

On Sun, 2015-03-22 at 20:36 +0100, Arnd Bergmann wrote:
> How does this relate to the ARM implementation? There is currently
> an effort going on to make that one shared with ARM64 and possibly
> x86. Has anyone looked at both the PowerPC and ARM ways of doing the
> allocation to see if we could pick one of the two to work on
> all architectures?

What I see in ARM is horribly complex; I can't quite make sense of it
in a couple of minutes of looking at it, and it doesn't seem to address
the basic issue we are addressing here, which is the splitting of the
iommu table lock.

I think we are looking at two very different approaches here. One is to
deal with horrendously broken HW, which is an ARM SoC vendor's
signature :) The other one is about performance and scalability on sane
HW.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 71+ messages in thread
* Re: Generic IOMMU pooled allocator
  2015-03-22 22:02 ` Benjamin Herrenschmidt
@ 2015-03-22 22:07 ` Sowmini Varadhan
  0 siblings, 0 replies; 71+ messages in thread
From: Sowmini Varadhan @ 2015-03-22 22:07 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: sparclinux, paulus, linuxppc-dev, David Miller, Arnd Bergmann

On (03/23/15 09:02), Benjamin Herrenschmidt wrote:
> > How does this relate to the ARM implementation? There is currently
> > an effort going on to make that one shared with ARM64 and possibly
> > x86. Has anyone looked at both the PowerPC and ARM ways of doing the
> > allocation to see if we could pick one of the two to work on
> > all architectures?
>
> What I see in ARM is horribly complex, I can't quite make sense of it
> in a couple of minutes of looking at it, and doesn't seem to address the
> basic issue we are addressing here which is the splitting of the iommu
> table lock.

Amen to that.. I thought it was just me :-)

I plan to go through the code to see if/where the ARM iommu code
does its locking and achieves its parallelism, but the mapping
between the sparc/powerpc approach and ARM is not immediately obvious
to me.

--Sowmini

^ permalink raw reply	[flat|nested] 71+ messages in thread
* Re: Generic IOMMU pooled allocator
  2015-03-22 22:07 ` Sowmini Varadhan
@ 2015-03-22 22:22 ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 71+ messages in thread
From: Benjamin Herrenschmidt @ 2015-03-22 22:22 UTC (permalink / raw)
To: Sowmini Varadhan
Cc: sparclinux, paulus, linuxppc-dev, David Miller, Arnd Bergmann

On Sun, 2015-03-22 at 18:07 -0400, Sowmini Varadhan wrote:
> On (03/23/15 09:02), Benjamin Herrenschmidt wrote:
> > > How does this relate to the ARM implementation? There is currently
> > > an effort going on to make that one shared with ARM64 and possibly
> > > x86. Has anyone looked at both the PowerPC and ARM ways of doing the
> > > allocation to see if we could pick one of the two to work on
> > > all architectures?
> >
> > What I see in ARM is horribly complex, I can't quite make sense of it
> > in a couple of minutes of looking at it, and doesn't seem to address the
> > basic issue we are addressing here which is the splitting of the iommu
> > table lock.
>
> Amen to that.. I thought it was just me :-)
>
> I plan to go through the code to see if/where the ARM iommu code
> does its locking and achieves its parallelism, but the mapping
> between the sparc/powerpc approach and ARM is not immediately obvious
> to me.

We might have more luck with x86_64 though... They would surely have
similar scalability issues.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 71+ messages in thread
* Re: Generic IOMMU pooled allocator
  2015-03-22 22:22 ` Benjamin Herrenschmidt
@ 2015-03-23  6:04 ` Arnd Bergmann
  0 siblings, 0 replies; 71+ messages in thread
From: Arnd Bergmann @ 2015-03-23 6:04 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: sparclinux, Sowmini Varadhan, linuxppc-dev, David Miller, paulus

On Sunday 22 March 2015, Benjamin Herrenschmidt wrote:
> On Sun, 2015-03-22 at 18:07 -0400, Sowmini Varadhan wrote:
> > On (03/23/15 09:02), Benjamin Herrenschmidt wrote:
> > > > How does this relate to the ARM implementation? There is currently
> > > > an effort going on to make that one shared with ARM64 and possibly
> > > > x86. Has anyone looked at both the PowerPC and ARM ways of doing the
> > > > allocation to see if we could pick one of the two to work on
> > > > all architectures?
> > >
> > > What I see in ARM is horribly complex, I can't quite make sense of it
> > > in a couple of minutes of looking at it, and doesn't seem to address the
> > > basic issue we are addressing here which is the splitting of the iommu
> > > table lock.
> >
> > Amen to that.. I thought it was just me :-)

Two problems on ARM IOMMU hardware are:

- a very wide range of incompatible approaches, from a simple sw-loaded
  iotlb with just a few hugepage entries to some rather complex ones

- a rather overdesigned "SMMU" that will be used by half the SoCs, but
  with some of them implementing slightly incompatible variants.

> > I plan to go through the code to see if/where the ARM iommu code
> > does its locking and achieves its parallelism, but the mapping
> > between the sparc/powerpc approach and ARM is not immediately obvious
> > to me.

My guess is that the ARM code so far has been concerned mainly with
getting things to work in the first place, but scalability problems
will only be seen when faster CPU cores become available.

> We might have more luck with x86_64 though... They would surely have
> similar scalability issues.

I believe the complexity of the modern Intel and AMD IOMMUs is similar
to the ARM SMMU, which appears to have been modeled after the Intel one
to some degree.

	Arnd

^ permalink raw reply	[flat|nested] 71+ messages in thread
* Re: Generic IOMMU pooled allocator
  2015-03-23  6:04 ` Arnd Bergmann
@ 2015-03-23 11:04 ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 71+ messages in thread
From: Benjamin Herrenschmidt @ 2015-03-23 11:04 UTC (permalink / raw)
To: Arnd Bergmann
Cc: sparclinux, Sowmini Varadhan, linuxppc-dev, David Miller, paulus

On Mon, 2015-03-23 at 07:04 +0100, Arnd Bergmann wrote:
>
> My guess is that the ARM code so far has been concerned mainly with
> getting things to work in the first place, but scalability problems
> will only be seen when faster CPU cores become available.

In any case, I think this is mostly a non-issue. The complexity of the
ARM code is in various areas related to making shit work (handling
coherent allocations mostly) but only remotely related to the actual
iommu DMA space allocator (iova in ARM as far as I understand the code)
which is pretty standard.

The work Sowmini is doing is about specifically the allocator. Making
our (powerpc) allocator generic since it has some nice scalability
features.

In fact, what Aik and I have been pushing and Sowmini is close to
achieving is to mostly disconnect that allocator from the rest of the
iommu management (the caller).

So in the end, the allocator itself should be splittable into something
separate that resides in lib/ or similar, which ARM can choose to use as
well.

Sowmini: it might be worthwhile creating a separate data structure,
iommu_virt_zone or something like that, which is what the allocator
works on exclusively and which we would just embed into sparc and
powerpc.

The only remaining problem I see is your "reset" op, but I think we
should be able to sort it out once we understand what it's there
for :-)

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 71+ messages in thread
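[Editorial note: the separate-structure idea is the usual kernel embedding pattern — the generic allocator sees only the embedded zone, and an arch callback recovers its private state with container_of. A sketch with hypothetical field names; iommu_virt_zone is Ben's placeholder name, not an existing type:]

```c
#include <assert.h>
#include <stddef.h>

/* In the kernel this macro comes from <linux/kernel.h>; reproduced here so
 * the sketch is self-contained. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* The generic allocator in lib/ would operate only on this structure. */
struct iommu_virt_zone {
	unsigned long start, end, hint;
	/* pools, locks and the allocation bitmap would live here */
};

/* What an arch (sparc here, hypothetically) might wrap around it: */
struct sparc_iommu {
	struct iommu_virt_zone zone; /* generic state, passed to lib code */
	void *tsb_base;              /* arch-private translation table */
};

/* An arch callback (e.g. a flush op) given only the zone pointer can
 * recover its private state: */
static struct sparc_iommu *to_sparc_iommu(struct iommu_virt_zone *z)
{
	return container_of(z, struct sparc_iommu, zone);
}
```

This keeps the lib/ code free of any arch knowledge while letting each arch hang whatever hardware state it needs off the same object.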
* Re: Generic IOMMU pooled allocator
  2015-03-23 11:04 ` Benjamin Herrenschmidt
@ 2015-03-23 18:45 ` Arnd Bergmann
  0 siblings, 0 replies; 71+ messages in thread
From: Arnd Bergmann @ 2015-03-23 18:45 UTC (permalink / raw)
To: linuxppc-dev; +Cc: sparclinux, paulus, David Miller, Sowmini Varadhan

On Monday 23 March 2015, Benjamin Herrenschmidt wrote:
> On Mon, 2015-03-23 at 07:04 +0100, Arnd Bergmann wrote:
> >
> > My guess is that the ARM code so far has been concerned mainly with
> > getting things to work in the first place, but scalability problems
> > will only be seen when faster CPU cores become available.
>
> In any case, I think this is mostly a non-issue. The complexity of the
> ARM code is in various areas related to making shit work (handling
> coherent allocations mostly) but only remotely related to the actual
> iommu DMA space allocator (iova in ARM as far as I understand the code)
> which is pretty standard.

Ok, got it. Thanks for explaining that part.

> The work Sowmini is doing is about specifically the allocator. Making
> our (powerpc) allocator generic since it has some nice scalability
> features.
>
> In fact, what Aik and I have been pushing and Sowmini is close to
> achieving is to mostly disconnect that allocator from the rest of the
> iommu management (the caller).
>
> So in the end, the allocator itself should be splittable into something
> separate that resides in lib/ or similar, which ARM can choose to use as
> well.

Yes, this sounds like a good idea. I'm currently at ELC and will bring
this up with the people working on ARM IOMMU support.

	Arnd

^ permalink raw reply	[flat|nested] 71+ messages in thread
end of thread, other threads:[~2015-03-26 23:51 UTC | newest]

Thread overview: 71+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-19  2:25 Generic IOMMU pooled allocator David Miller
2015-03-19  2:46 ` Benjamin Herrenschmidt
2015-03-19  2:50 ` David Miller
2015-03-19  3:01 ` Benjamin Herrenschmidt
2015-03-19  5:27 ` Alexey Kardashevskiy
2015-03-19 13:34 ` Sowmini Varadhan
2015-03-22 19:27 ` Sowmini Varadhan
2015-03-23 16:29 ` David Miller
2015-03-23 16:54 ` Sowmini Varadhan
2015-03-23 19:05 ` David Miller
2015-03-23 19:09 ` Sowmini Varadhan
2015-03-23 22:21 ` Benjamin Herrenschmidt
2015-03-23 23:08 ` Sowmini Varadhan
2015-03-23 23:29 ` chase rayfield
2015-03-24  0:47 ` Benjamin Herrenschmidt
2015-03-24  1:11 ` Sowmini Varadhan
2015-03-24  1:44 ` David Miller
2015-03-24  1:57 ` Sowmini Varadhan
2015-03-24  2:08 ` Benjamin Herrenschmidt
2015-03-24  2:15 ` David Miller
2015-03-26  0:43 ` cascardo
2015-03-26  0:49 ` Benjamin Herrenschmidt
2015-03-26 10:56 ` Sowmini Varadhan
2015-03-26 22:51 ` David Miller
2015-03-26 23:00 ` David Miller
2015-03-26 23:51 ` Benjamin Herrenschmidt
2015-03-23 22:36 ` Benjamin Herrenschmidt
2015-03-23 23:19 ` Sowmini Varadhan
2015-03-24  0:48 ` Benjamin Herrenschmidt
2015-03-23 22:25 ` Benjamin Herrenschmidt
2015-03-22 19:36 ` Arnd Bergmann
2015-03-22 22:02 ` Benjamin Herrenschmidt
2015-03-22 22:07 ` Sowmini Varadhan
2015-03-22 22:22 ` Benjamin Herrenschmidt
2015-03-23  6:04 ` Arnd Bergmann
2015-03-23 11:04 ` Benjamin Herrenschmidt
2015-03-23 18:45 ` Arnd Bergmann