* [patch 1/2]percpu_ida: fix a live lock @ 2013-12-31 3:38 Shaohua Li 2014-01-04 21:08 ` Kent Overstreet 0 siblings, 1 reply; 18+ messages in thread From: Shaohua Li @ 2013-12-31 3:38 UTC (permalink / raw) To: linux-kernel; +Cc: axboe, hch, kmo steal_tags only happens when the number of free tags is more than half of the total tags. This is too restrictive and can cause a livelock. I found one cpu had free tags, but other cpus couldn't steal them (the threads are bound to specific cpus), so threads that want to allocate tags sleep forever. I found this while running the next patch, but I think it could happen without it. I also ran a performance test with null_blk. Neither case (each cpu has enough percpu tags, or total tags are limited) showed a performance change. Signed-off-by: Shaohua Li <shli@fusionio.com> --- lib/percpu_ida.c | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-) Index: linux/lib/percpu_ida.c =================================================================== --- linux.orig/lib/percpu_ida.c 2013-11-25 09:36:52.196626486 +0800 +++ linux/lib/percpu_ida.c 2013-12-31 11:20:03.205400612 +0800 @@ -54,9 +54,7 @@ static inline void move_tags(unsigned *d /* * Try to steal tags from a remote cpu's percpu freelist. * - * We first check how many percpu freelists have tags - we don't steal tags - * unless enough percpu freelists have tags on them that it's possible more than - * half the total tags could be stuck on remote percpu freelists.
+ * We first check how many percpu freelists have tags * * Then we iterate through the cpus until we find some tags - we don't attempt * to find the "best" cpu to steal from, to keep cacheline bouncing to a @@ -69,8 +67,7 @@ static inline void steal_tags(struct per struct percpu_ida_cpu *remote; for (cpus_have_tags = cpumask_weight(&pool->cpus_have_tags); - cpus_have_tags * pool->percpu_max_size > pool->nr_tags / 2; - cpus_have_tags--) { + cpus_have_tags; cpus_have_tags--) { cpu = cpumask_next(cpu, &pool->cpus_have_tags); if (cpu >= nr_cpu_ids) { ^ permalink raw reply [flat|nested] 18+ messages in thread
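The condition the patch deletes can be exercised in a userspace sketch. The function names below are invented for illustration; the real logic is the loop condition in steal_tags() shown in the diff above:

```c
#include <assert.h>
#include <stdbool.h>

/* Old heuristic: only steal when enough percpu freelists are populated
 * that more than half of all tags *could* be stranded remotely. */
static bool may_steal_old(unsigned cpus_have_tags,
                          unsigned percpu_max_size, unsigned nr_tags)
{
	return cpus_have_tags * percpu_max_size > nr_tags / 2;
}

/* Patched behaviour: steal whenever any remote freelist has tags. */
static bool may_steal_new(unsigned cpus_have_tags)
{
	return cpus_have_tags != 0;
}
```

With, say, nr_tags = 64 and percpu_max_size = 8, a single CPU hoarding its whole freelist gives cpus_have_tags * percpu_max_size = 8, which is not more than 32, so the old check refuses to steal and the bound threads sleep forever; the new check steals immediately.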
* Re: [patch 1/2]percpu_ida: fix a live lock 2013-12-31 3:38 [patch 1/2]percpu_ida: fix a live lock Shaohua Li @ 2014-01-04 21:08 ` Kent Overstreet 2014-01-05 13:13 ` Shaohua Li 0 siblings, 1 reply; 18+ messages in thread From: Kent Overstreet @ 2014-01-04 21:08 UTC (permalink / raw) To: Shaohua Li; +Cc: linux-kernel, axboe, hch On Tue, Dec 31, 2013 at 11:38:27AM +0800, Shaohua Li wrote: > > steal_tags only happens when free tags is more than half of the total tags. > This is too restrict and can cause live lock. I found one cpu has free tags, > but other cpu can't steal (thread is bound to specific cpus), threads which > wants to allocate tags are always sleeping. I found this when I run next patch, > but this could happen without it I think. > > I did performance test too with null_blk. Two cases (each cpu has enough percpu > tags, or total tags are limited) haven't performance changes. This doesn't appear to me to fix anything wrong with the current code (and it'll hurt performance) - we explicitly don't guarantee that all the tags will be available for allocation at any given time, only half of them. Can you explain more how this is being used where you're seeing the issue? And I don't see the other patch in your patch series. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [patch 1/2]percpu_ida: fix a live lock 2014-01-04 21:08 ` Kent Overstreet @ 2014-01-05 13:13 ` Shaohua Li 2014-01-06 20:46 ` Kent Overstreet 0 siblings, 1 reply; 18+ messages in thread From: Shaohua Li @ 2014-01-05 13:13 UTC (permalink / raw) To: Kent Overstreet; +Cc: linux-kernel, axboe, hch On Sat, Jan 04, 2014 at 01:08:04PM -0800, Kent Overstreet wrote: > On Tue, Dec 31, 2013 at 11:38:27AM +0800, Shaohua Li wrote: > > > > steal_tags only happens when free tags is more than half of the total tags. > > This is too restrict and can cause live lock. I found one cpu has free tags, > > but other cpu can't steal (thread is bound to specific cpus), threads which > > wants to allocate tags are always sleeping. I found this when I run next patch, > > but this could happen without it I think. > > > > I did performance test too with null_blk. Two cases (each cpu has enough percpu > > tags, or total tags are limited) haven't performance changes. > > This doesn't appear to me to fix anything wrong with the current code > (and it'll hurt performance) I suspected this would hurt performance too, but in my tests it doesn't. > - we explicitly don't guarantee that all > the tags will be available for allocation at any given time, only half > of them. Only half of the tags can be used? That is scary. Of course we want all tags to be available. > Can you explain more how this is being used where you're seeing > the issue? And I don't see the other patch in your patch series. Tasks are bound to specific cpus. I saw one cpu with a lot of free tags (but fewer than half of the total), while other cpus couldn't allocate tags, so I saw a livelock. Thanks, Shaohua ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [patch 1/2]percpu_ida: fix a live lock 2014-01-05 13:13 ` Shaohua Li @ 2014-01-06 20:46 ` Kent Overstreet 2014-01-06 20:52 ` Jens Axboe 0 siblings, 1 reply; 18+ messages in thread From: Kent Overstreet @ 2014-01-06 20:46 UTC (permalink / raw) To: Shaohua Li; +Cc: linux-kernel, axboe, hch On Sun, Jan 05, 2014 at 09:13:00PM +0800, Shaohua Li wrote: > On Sat, Jan 04, 2014 at 01:08:04PM -0800, Kent Overstreet wrote: > > On Tue, Dec 31, 2013 at 11:38:27AM +0800, Shaohua Li wrote: > > > > > > steal_tags only happens when free tags is more than half of the total tags. > > > This is too restrict and can cause live lock. I found one cpu has free tags, > > > but other cpu can't steal (thread is bound to specific cpus), threads which > > > wants to allocate tags are always sleeping. I found this when I run next patch, > > > but this could happen without it I think. > > > > > > I did performance test too with null_blk. Two cases (each cpu has enough percpu > > > tags, or total tags are limited) haven't performance changes. > > > > This doesn't appear to me to fix anything wrong with the current code > > (and it'll hurt performance) > > I suspects this hurts performance too, but it doesn't actually in my test. > > > - we explicitly don't guarantee that all > > the tags will be available for allocation at any given time, only half > > of them. > > only half of the tags can be used? this is scaring. Of course we hope all tags > are available. No: that behaviour is explicitly documented and it's the behaviour we want. There is an inherent performance tradeoff here between internal fragmentation and performance: your patch tries to drive internal fragmentation to 0, but under workloads that hit this, running on enough cores, this will have a VERY high performance cost. > > Can you explain more how this is being used where you're seeing > > the issue? And I don't see the other patch in your patch series. > > tasks are bound to specific cpus. 
And I saw one cpu has a lot of free (but less > than half of total tags), but other cpus can't allocate tags, so I saw a live > lock. You're using it wrong - don't assume all nr_tags can be allocated. If you need to guarantee that you'll always be able to allocate a certain number of tags, you pass twice that to percpu_ida_init() instead. Nacking this patch (and if you want me to look at it more you still need to send me the other patch in the series and show me your code that's using it). ^ permalink raw reply [flat|nested] 18+ messages in thread
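Kent's sizing rule can be stated as code. This is a hypothetical userspace helper, not part of the percpu_ida API; it simply encodes the documented contract that up to half the tags may be stranded on remote freelists:

```c
#include <assert.h>

/* Tags guaranteed allocatable under the "half may be stranded" contract. */
static unsigned guaranteed_tags(unsigned nr_tags)
{
	return nr_tags / 2;
}

/* Pool size to pass to percpu_ida_init() to guarantee 'need' concurrent
 * allocations. */
static unsigned pool_size_for(unsigned need)
{
	return need * 2;
}
```

So a driver that must always be able to allocate 32 tags would initialize the pool with 64.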
* Re: [patch 1/2]percpu_ida: fix a live lock 2014-01-06 20:46 ` Kent Overstreet @ 2014-01-06 20:52 ` Jens Axboe 2014-01-06 21:47 ` Kent Overstreet 0 siblings, 1 reply; 18+ messages in thread From: Jens Axboe @ 2014-01-06 20:52 UTC (permalink / raw) To: Kent Overstreet, Shaohua Li; +Cc: linux-kernel, hch On 01/06/2014 01:46 PM, Kent Overstreet wrote: > On Sun, Jan 05, 2014 at 09:13:00PM +0800, Shaohua Li wrote: >>> - we explicitly don't guarantee that all >>> the tags will be available for allocation at any given time, only half >>> of them. >> >> only half of the tags can be used? this is scaring. Of course we hope all tags >> are available. > > No: that behaviour is explicitly documented and it's the behaviour we want. That is going to end horribly, for the cases (i.e. most cases) where we really want to be able to use all tags. If we can't support that reliably or in a fast manner with percpu ida, then that's pretty close to a show stopper. And I suspect that will be the case for most use cases. We can't just feasibly throw away half of the tag space, that's crazy. > There is an inherent performance tradeoff here between internal fragmentation > and performance: your patch tries to drive internal fragmentation to 0, but > under workloads that hit this, running on enough cores, this will have a VERY > high performance cost. For most workloads that are somewhat localized, we should be able to support this without melting. >>> Can you explain more how this is being used where you're seeing >>> the issue? And I don't see the other patch in your patch series. >> >> tasks are bound to specific cpus. And I saw one cpu has a lot of free (but less >> than half of total tags), but other cpus can't allocate tags, so I saw a live >> lock. > > You're using it wrong - don't assume all nr_tags can be allocated. If you need > to guarantee that you'll always be able to allocate a certain number of tags, > you pass twice that to percpu_ida_init() instead. But that won't work at all.
To take a concrete example, let's assume the device is some sort of NCQ device. We have 0..31 tags. If we throw away half of those tags, we're at 16. You can't double the tag space in software. So what's the proposal? -- Jens Axboe ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [patch 1/2]percpu_ida: fix a live lock 2014-01-06 20:52 ` Jens Axboe @ 2014-01-06 21:47 ` Kent Overstreet 2014-02-09 15:50 ` Alexander Gordeev 0 siblings, 1 reply; 18+ messages in thread From: Kent Overstreet @ 2014-01-06 21:47 UTC (permalink / raw) To: Jens Axboe; +Cc: Shaohua Li, linux-kernel, hch On Mon, Jan 06, 2014 at 01:52:19PM -0700, Jens Axboe wrote: > On 01/06/2014 01:46 PM, Kent Overstreet wrote: > > On Sun, Jan 05, 2014 at 09:13:00PM +0800, Shaohua Li wrote: > > >>> - we explicitly don't guarantee that all > >>> the tags will be available for allocation at any given time, only half > >>> of them. > >> > >> only half of the tags can be used? this is scaring. Of course we hope all tags > >> are available. > > > > No: that behaviour is explicitly documented and it's the behaviour we want. > > That is going to end horribly, for the cases (ie most cases) where we > really want to be able to use all tags. If we can't support that > reliably or in a fast manner with percpu ida, then that's pretty close > to a show stopper. And I suspect that will be the case for most use > cases. We can't just feasibly throw away half of the tag space, that's > crazy. Sounds like we're coming at this from different use cases, maybe we're going to need to add some flexibility for different use cases. For background, I wrote this code for SSDs where you get 16 bits for the tag ID; we really don't want the queue to be anywhere near that deep so potentially wasting half of however many tags you have is no big deal. If you have a device where the max tag id is small enough that you really do need to use the entire tag space... yeah, that's going to be an issue. > But that wont work at all. To take a concrete example, lets assume the > device is some sort of NCQ. We have 0..31 tags. If we throw away half of > those tags, we're at 16. You can't double the tag space in software.
Ok, so I hadn't really given any thought to that kind of use case; insofar as I had I would've been skeptical percpu tag allocation made sense for 32 different tags at all. We really don't want to screw over the users that aren't so constrained by the size of their tag space; there really is a huge performance tradeoff here (otherwise you're stealing tags and bouncing cachelines for _every_ tag allocation when the queue is full, and your percpu tag allocation is no longer very percpu). I'm not sure what the best strategy is for NCQ-type max nr_tags, though - thoughts? Easy thing to do for now is just to add another parameter to percpu_ida_init() for the number of tags that are allowed to sit unused on other cpu's freelists - users that have large relatively unbounded nr_tags can set that to nr_tags / 2, for NCQ you'd set it to 0. I suspect we can do better for NCQ though, w.r.t. worst case performance. ^ permalink raw reply [flat|nested] 18+ messages in thread
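The extra parameter Kent sketches could generalize the steal check roughly as follows (invented names, not actual kernel code): max_stranded = nr_tags / 2 reproduces today's behaviour, while 0 gives the always-steal semantics of Shaohua's patch:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical generalisation of the steal check: 'max_stranded' is the
 * number of tags allowed to sit unused on remote percpu freelists
 * before a starving CPU is permitted to steal. */
static bool may_steal(unsigned cpus_have_tags, unsigned percpu_max_size,
                      unsigned max_stranded)
{
	return cpus_have_tags * percpu_max_size > max_stranded;
}
```

For an NCQ-like pool (32 tags, percpu_max_size = 4), max_stranded = 16 still refuses to steal from a single populated freelist, while max_stranded = 0 steals as soon as any remote freelist has tags.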
* Re: [patch 1/2]percpu_ida: fix a live lock 2014-01-06 21:47 ` Kent Overstreet @ 2014-02-09 15:50 ` Alexander Gordeev 2014-02-10 10:32 ` Christoph Hellwig 0 siblings, 1 reply; 18+ messages in thread From: Alexander Gordeev @ 2014-02-09 15:50 UTC (permalink / raw) To: Kent Overstreet; +Cc: Jens Axboe, Shaohua Li, linux-kernel, hch On Mon, Jan 06, 2014 at 01:47:26PM -0800, Kent Overstreet wrote: > Ok, so I hadn't really given any thought to that kind of use case; insofar as I > had I would've been skeptical percpu tag allocation made sense for 32 different > tags at all. > > We really don't want to screw over the users that aren't so constrained by the > size of their tag space; there really is a huge performance tradeoff here > (otherwise you're stealing tags and bouncing cachelines for _every_ tag > allocation when the queue is full, and your percpu tag allocation is no longer > very percpu). > > I'm not sure what the best strategy is for NCQ-type max nr_tags, though - > thoughts? > > Easy thing to do for now is just to add another parameter to percpu_ida_init() > for the number of tags that are allowed to sit unused on other cpu's freelists - > users that have large relatively unbounded nr_tags can set that to nr_tags / 2, > for NCQ you'd set it to 0. > > I suspect we can do better for NCQ though, w.r.t. worst case performance. Yeah, that was my first thought when I posted the "percpu_ida: Allow variable maximum number of cached tags" patch a few months ago. But I am back-pedalling, as it does not appear to solve the fundamental problem - what is the best threshold? Maybe we can get by with a per-cpu timeout that flushes a batch of tags from local caches to the pool? Each local allocation would restart the timer, but once allocation requests stop coming on a CPU, the tags would not gather dust in local caches. -- Regards, Alexander Gordeev agordeev@redhat.com ^ permalink raw reply [flat|nested] 18+ messages in thread
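Alexander's timeout idea can be modelled with logical ticks standing in for a real per-cpu timer; every name below is hypothetical:

```c
#include <assert.h>

/* Toy model of the proposed per-cpu flush timer. */
struct cpu_cache {
	unsigned nr_cached;        /* tags sitting on this cpu's freelist */
	unsigned long last_alloc;  /* tick of the most recent allocation */
};

/* Take one cached tag (if any) and restart the flush timer. */
static void cache_alloc(struct cpu_cache *c, unsigned long now)
{
	if (c->nr_cached)
		c->nr_cached--;
	c->last_alloc = now;
}

/* Timer callback: after 'timeout' idle ticks, return up to 'batch' tags
 * to the shared pool so they don't gather dust in the local cache. */
static unsigned cache_tick(struct cpu_cache *c, unsigned long now,
			   unsigned long timeout, unsigned batch)
{
	unsigned flushed = 0;

	if (now - c->last_alloc >= timeout && c->nr_cached) {
		flushed = c->nr_cached < batch ? c->nr_cached : batch;
		c->nr_cached -= flushed;
	}
	return flushed;
}
```

A busy CPU keeps its cache (every allocation pushes last_alloc forward); an idle one drains back to the pool one batch per timer tick.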
* Re: [patch 1/2]percpu_ida: fix a live lock 2014-02-09 15:50 ` Alexander Gordeev @ 2014-02-10 10:32 ` Christoph Hellwig 2014-02-10 12:29 ` Alexander Gordeev 2014-02-10 16:26 ` Jens Axboe 0 siblings, 2 replies; 18+ messages in thread From: Christoph Hellwig @ 2014-02-10 10:32 UTC (permalink / raw) To: Alexander Gordeev Cc: Kent Overstreet, Jens Axboe, Shaohua Li, linux-kernel, hch On Sun, Feb 09, 2014 at 04:50:07PM +0100, Alexander Gordeev wrote: > Yeah, that was my first thought when I posted "percpu_ida: Allow variable > maximum number of cached tags" patch some few months ago. But I am back- > pedalling as it does not appear solves the fundamental problem - what is the > best threshold? > > May be we can walk off with a per-cpu timeout that flushes batch nr of tags > from local caches to the pool? Each local allocation would restart the timer, > but once allocation requests stopped coming on a CPU the tags would not gather > dust in local caches. We'll definitely need a fix to be able to allow the whole tag space. For large numbers of tags per device the flush might work, but for devices with a low number of tags we need something more efficient. The case of fewer tags than CPUs isn't that unusual either, and we probably want to switch to an allocator without per cpu allocations for them to avoid all this. E.g. for many ATA devices we just have a single tag, and many scsi drivers also only want single digit outstanding commands per LUN. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [patch 1/2]percpu_ida: fix a live lock 2014-02-10 10:32 ` Christoph Hellwig @ 2014-02-10 12:29 ` Alexander Gordeev 2014-02-10 15:49 ` Alexander Gordeev 2014-02-10 16:26 ` Jens Axboe 1 sibling, 1 reply; 18+ messages in thread From: Alexander Gordeev @ 2014-02-10 12:29 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Kent Overstreet, Jens Axboe, Shaohua Li, linux-kernel On Mon, Feb 10, 2014 at 02:32:11AM -0800, Christoph Hellwig wrote: > > May be we can walk off with a per-cpu timeout that flushes batch nr of tags > > from local caches to the pool? Each local allocation would restart the timer, > > but once allocation requests stopped coming on a CPU the tags would not gather > > dust in local caches. > > We'll defintively need a fix to be able to allow the whole tag space. > For large numbers of tags per device the flush might work, but for > devices with low number of tags we need something more efficient. The > case of less tags than CPUs isn't that unusual either and we probably > want to switch to an allocator without per cpu allocations for them to > avoid all this. E.g. for many ATA devices we just have a single tag, > and many scsi drivers also only want single digit outstanding commands > per LUN. Do we really always need the pool for these classes of devices? Pulling tags from local caches to the pool just to (nearly) drain it at the very next iteration does not seem beneficial. Not to mention the caches-vs-pool locking complexities. -- Regards, Alexander Gordeev agordeev@redhat.com ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [patch 1/2]percpu_ida: fix a live lock 2014-02-10 12:29 ` Alexander Gordeev @ 2014-02-10 15:49 ` Alexander Gordeev 2014-02-10 16:16 ` Christoph Hellwig 0 siblings, 1 reply; 18+ messages in thread From: Alexander Gordeev @ 2014-02-10 15:49 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Kent Overstreet, Jens Axboe, Shaohua Li, linux-kernel On Mon, Feb 10, 2014 at 01:29:42PM +0100, Alexander Gordeev wrote: > > We'll defintively need a fix to be able to allow the whole tag space. > > For large numbers of tags per device the flush might work, but for > > devices with low number of tags we need something more efficient. The > > case of less tags than CPUs isn't that unusual either and we probably > > want to switch to an allocator without per cpu allocations for them to > > avoid all this. E.g. for many ATA devices we just have a single tag, > > and many scsi drivers also only want single digit outstanding commands > > per LUN. > > Do we really always need the pool for these classes of devices? > > Pulling tags from local caches to the pool just to (near to) dry it at > the very next iteration does not seem beneficial. Not to mention caches > vs pool locking complexities. And what I meant here is that we do not scrap per-cpu allocations. -- Regards, Alexander Gordeev agordeev@redhat.com ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [patch 1/2]percpu_ida: fix a live lock 2014-02-10 15:49 ` Alexander Gordeev @ 2014-02-10 16:16 ` Christoph Hellwig 0 siblings, 0 replies; 18+ messages in thread From: Christoph Hellwig @ 2014-02-10 16:16 UTC (permalink / raw) To: Alexander Gordeev Cc: Christoph Hellwig, Kent Overstreet, Jens Axboe, Shaohua Li, linux-kernel On Mon, Feb 10, 2014 at 04:49:17PM +0100, Alexander Gordeev wrote: > > Do we really always need the pool for these classes of devices? > > > > Pulling tags from local caches to the pool just to (near to) dry it at > > the very next iteration does not seem beneficial. Not to mention caches > > vs pool locking complexities. > > And I meant here we do not scrap per cpu allocations. I'm still a bit confused about what you're proposing. Part of the problem might be that I'm not exactly familiar with the allocator and am just looking at it from a consumer perspective. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [patch 1/2]percpu_ida: fix a live lock 2014-02-10 10:32 ` Christoph Hellwig 2014-02-10 12:29 ` Alexander Gordeev @ 2014-02-10 16:26 ` Jens Axboe 2014-02-10 22:41 ` Kent Overstreet 1 sibling, 1 reply; 18+ messages in thread From: Jens Axboe @ 2014-02-10 16:26 UTC (permalink / raw) To: Christoph Hellwig, Alexander Gordeev Cc: Kent Overstreet, Shaohua Li, linux-kernel On 02/10/2014 03:32 AM, Christoph Hellwig wrote: > On Sun, Feb 09, 2014 at 04:50:07PM +0100, Alexander Gordeev wrote: >> Yeah, that was my first thought when I posted "percpu_ida: Allow variable >> maximum number of cached tags" patch some few months ago. But I am back- >> pedalling as it does not appear solves the fundamental problem - what is the >> best threshold? >> >> May be we can walk off with a per-cpu timeout that flushes batch nr of tags >> from local caches to the pool? Each local allocation would restart the timer, >> but once allocation requests stopped coming on a CPU the tags would not gather >> dust in local caches. > > We'll defintively need a fix to be able to allow the whole tag space. Certainly. The current situation of effectively only allowing half the tags (if spread) is pretty crappy with (by far) most hardware. > For large numbers of tags per device the flush might work, but for > devices with low number of tags we need something more efficient. The > case of less tags than CPUs isn't that unusual either and we probably > want to switch to an allocator without per cpu allocations for them to > avoid all this. E.g. for many ATA devices we just have a single tag, > and many scsi drivers also only want single digit outstanding commands > per LUN. Even for cases where you have as many (or more) CPUs than tags, per-cpu allocation is not necessarily a bad idea. It's a rare case where you have all the CPUs touching the device at the same time, after all. -- Jens Axboe ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [patch 1/2]percpu_ida: fix a live lock 2014-02-10 16:26 ` Jens Axboe @ 2014-02-10 22:41 ` Kent Overstreet 2014-02-10 23:06 ` Jens Axboe 0 siblings, 1 reply; 18+ messages in thread From: Kent Overstreet @ 2014-02-10 22:41 UTC (permalink / raw) To: Jens Axboe; +Cc: Christoph Hellwig, Alexander Gordeev, Shaohua Li, linux-kernel On Mon, Feb 10, 2014 at 09:26:15AM -0700, Jens Axboe wrote: > > > On 02/10/2014 03:32 AM, Christoph Hellwig wrote: > >On Sun, Feb 09, 2014 at 04:50:07PM +0100, Alexander Gordeev wrote: > >>Yeah, that was my first thought when I posted "percpu_ida: Allow variable > >>maximum number of cached tags" patch some few months ago. But I am back- > >>pedalling as it does not appear solves the fundamental problem - what is the > >>best threshold? > >> > >>May be we can walk off with a per-cpu timeout that flushes batch nr of tags > >>from local caches to the pool? Each local allocation would restart the timer, > >>but once allocation requests stopped coming on a CPU the tags would not gather > >>dust in local caches. > > > >We'll defintively need a fix to be able to allow the whole tag space. > > Certainly. The current situation of effectively only allowing half > the tags (if spread) is pretty crappy with (by far) most hardware. > > >For large numbers of tags per device the flush might work, but for > >devices with low number of tags we need something more efficient. The > >case of less tags than CPUs isn't that unusual either and we probably > >want to switch to an allocator without per cpu allocations for them to > >avoid all this. E.g. for many ATA devices we just have a single tag, > >and many scsi drivers also only want single digit outstanding commands > >per LUN. > > Even for cases where you have as many (or more) CPUs than tags, > per-cpu allocation is not necessarily a bad idea. It's a rare case > where you have all the CPUs touching the device at the same time, > after all. 
<just back from Switzerland, probably forgetting some of where I left off> You do still need to have enough tags to shard across the number of cpus _currently_ touching the device. I think I'm with Christoph here, I'm not sure how percpu tag allocation would be helpful when we have single digits/low double digits of tags available. I would expect that in that case we're better off with just a well implemented atomic bit vector and waitlist. However, I don't know where the crossover point is and I think Jens has done by far the most and most relevant benchmarking here. How about we just make the number of tags that are allowed to be stranded an explicit parameter (somehow) - then it can be up to device drivers to do something sensible with it. Half is probably an ideal default for devices where that works, but this way more constrained devices will be able to futz with it however they want. ^ permalink raw reply [flat|nested] 18+ messages in thread
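For a tag space that fits in one machine word, the atomic bit vector Kent mentions might look like this userspace sketch (hypothetical interface; a real version would add the waitlist to sleep on when the map is full):

```c
#include <assert.h>
#include <stdatomic.h>

/* One atomic word is the whole allocator - enough for NCQ's 32 tags. */
static _Atomic unsigned long tag_map; /* bit set = tag in use */

static int tag_alloc(unsigned nr_tags)
{
	unsigned long old = atomic_load(&tag_map);

	for (;;) {
		unsigned i;

		/* find the first clear bit */
		for (i = 0; i < nr_tags; i++)
			if (!(old & (1UL << i)))
				break;
		if (i == nr_tags)
			return -1; /* full: a real allocator would sleep here */
		if (atomic_compare_exchange_weak(&tag_map, &old,
						 old | (1UL << i)))
			return i;
		/* lost the race (or spurious failure): 'old' was reloaded */
	}
}

static void tag_free(unsigned tag)
{
	atomic_fetch_and(&tag_map, ~(1UL << tag));
}
```

Every allocation touches the same cacheline, which is exactly the tradeoff under discussion: acceptable for tiny tag spaces, painful for large ones.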
* Re: [patch 1/2]percpu_ida: fix a live lock 2014-02-10 22:41 ` Kent Overstreet @ 2014-02-10 23:06 ` Jens Axboe 2014-02-11 9:12 ` Christoph Hellwig 2014-02-14 10:36 ` Alexander Gordeev 0 siblings, 2 replies; 18+ messages in thread From: Jens Axboe @ 2014-02-10 23:06 UTC (permalink / raw) To: Kent Overstreet Cc: Christoph Hellwig, Alexander Gordeev, Shaohua Li, linux-kernel On 02/10/2014 03:41 PM, Kent Overstreet wrote: > On Mon, Feb 10, 2014 at 09:26:15AM -0700, Jens Axboe wrote: >> >> >> On 02/10/2014 03:32 AM, Christoph Hellwig wrote: >>> On Sun, Feb 09, 2014 at 04:50:07PM +0100, Alexander Gordeev wrote: >>>> Yeah, that was my first thought when I posted "percpu_ida: Allow variable >>>> maximum number of cached tags" patch some few months ago. But I am back- >>>> pedalling as it does not appear solves the fundamental problem - what is the >>>> best threshold? >>>> >>>> May be we can walk off with a per-cpu timeout that flushes batch nr of tags >>> >from local caches to the pool? Each local allocation would restart the timer, >>>> but once allocation requests stopped coming on a CPU the tags would not gather >>>> dust in local caches. >>> >>> We'll defintively need a fix to be able to allow the whole tag space. >> >> Certainly. The current situation of effectively only allowing half >> the tags (if spread) is pretty crappy with (by far) most hardware. >> >>> For large numbers of tags per device the flush might work, but for >>> devices with low number of tags we need something more efficient. The >>> case of less tags than CPUs isn't that unusual either and we probably >>> want to switch to an allocator without per cpu allocations for them to >>> avoid all this. E.g. for many ATA devices we just have a single tag, >>> and many scsi drivers also only want single digit outstanding commands >>> per LUN. >> >> Even for cases where you have as many (or more) CPUs than tags, >> per-cpu allocation is not necessarily a bad idea. 
It's a rare case >> where you have all the CPUs touching the device at the same time, >> after all. > > <just back from Switzerland, probably forgetting some of where I left off> > > You do still need to have enough tags to shard across the number of cpus > _currently_ touching the device. I think I'm with Christoph here, I'm not sure > how percpu tag allocation would be helpful when we have single digits/low double > digits of tags available. For the common case, I'd assume that anywhere between 31..256 tags is "normal". That's where the majority of devices will end up being, largely. So single digits would be an anomaly. And even for the case of 31 tags and, eg, a 64 cpu system, over windows of access I don't think it's unreasonable to expect that you are not going to have 64 threads banging on the same device. It obviously all depends on the access pattern. X threads for X tags would work perfectly well with per-cpu tagging, if they are doing sync IO. And similarly, 8 threads each having low queue depth would be fine. However, it all falls apart pretty quickly if threads*qd > tag space. > I would expect that in that case we're better off with just a well implemented > atomic bit vector and waitlist. However, I don't know where the crossover point > is and I think Jens has done by far the most and most relevant benchmarking > here. The problem with that is when you have some of those threads on different nodes, it ends up collapsing pretty quickly again. Maybe the solution is to have a hierarchy of caching instead - per-node, per-cpu. At least that has the potential to make the common case still perform better. > How about we just make the number of tags that are allowed to be stranded an > explicit parameter (somehow) - then it can be up to device drivers to do > something sensible with it. Half is probably an ideal default for devices where > that works, but this way more constrained devices will be able to futz with it > however they want. 
I don't think we should involve device drivers in this, that's punting a complicated issue to someone who likely has little idea what to do about it. This needs to be handled sensibly in the core, not in a device driver. If we can't come up with a sensible algorithm to handle this, how can we expect someone writing a device driver to do so? -- Jens Axboe ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [patch 1/2]percpu_ida: fix a live lock 2014-02-10 23:06 ` Jens Axboe @ 2014-02-11 9:12 ` Christoph Hellwig 2014-02-11 14:42 ` James Bottomley 0 siblings, 1 reply; 18+ messages in thread From: Christoph Hellwig @ 2014-02-11 9:12 UTC (permalink / raw) To: Jens Axboe Cc: Kent Overstreet, Alexander Gordeev, Shaohua Li, linux-kernel, linux-scsi On Mon, Feb 10, 2014 at 04:06:27PM -0700, Jens Axboe wrote: > For the common case, I'd assume that anywhere between 31..256 tags > is "normal". That's where the majority of devices will end up being, > largely. So single digits would be an anomaly. Unfortunately that's not true in SCSI land, where most drivers do per-lun tagging, and the cmd_per_lun values are very low and very often single digits, as a simple grep for cmd_per_lun will tell. Now it might be that the tag space actually is much bigger in the hardware and the driver authors for some reason want to limit the number of outstanding commands, but the interface to the drivers doesn't allow them to express such a difference at the moment. > >How about we just make the number of tags that are allowed to be stranded an > >explicit parameter (somehow) - then it can be up to device drivers to do > >something sensible with it. Half is probably an ideal default for devices where > >that works, but this way more constrained devices will be able to futz with it > >however they want. > > I don't think we should involve device drivers in this, that's > punting a complicated issue to someone who likely has little idea > what to do about it. This needs to be handled sensibly in the core, > not in a device driver. If we can't come up with a sensible > algorithm to handle this, how can we expect someone writing a device > driver to do so? Agreed, punting this to the drivers is a bad idea. But at least exposing variables for the allowed tag space and allowed outstanding commands to be able to make a smarter decision might be a good idea.
On the other hand this will require us to count the outstanding commands again, introducing more cachelines touched than necessary. To make things worse, for complex topologies like SCSI we might have to limit the outstanding commands at up to three levels in the hierarchy. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [patch 1/2]percpu_ida: fix a live lock
  2014-02-11  9:12 ` Christoph Hellwig
@ 2014-02-11 14:42 ` James Bottomley
  2014-02-11 14:53 ` Christoph Hellwig
  0 siblings, 1 reply; 18+ messages in thread
From: James Bottomley @ 2014-02-11 14:42 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Kent Overstreet, Alexander Gordeev, Shaohua Li,
	linux-kernel, linux-scsi

On Tue, 2014-02-11 at 01:12 -0800, Christoph Hellwig wrote:
> On Mon, Feb 10, 2014 at 04:06:27PM -0700, Jens Axboe wrote:
> > For the common case, I'd assume that anywhere between 31..256 tags
> > is "normal". That's where the majority of devices will end up being,
> > largely. So single digits would be an anomaly.
>
> Unfortunately that's not true in SCSI land, where most drivers do
> per-lun tagging, and the cmd_per_lun values are very low and very
> often single digits, as a simple grep for cmd_per_lun will tell.

Remember we do shared (all queue) tags on qla, aic and a few other
drivers (it's actually the mailbox slot tag for the HBA).

> Now it might be that the tag space actually is much bigger in the
> hardware and the driver authors for some reason want to limit the
> number of outstanding commands, but the interface to the drivers
> doesn't allow them to express such a difference at the moment.

Tag space is dependent on the SCSI protocol. It's 256 for SPI, 65536
for SAS and I'm not sure for FCP.

> > > How about we just make the number of tags that are allowed to be
> > > stranded an explicit parameter (somehow) - then it can be up to
> > > device drivers to do something sensible with it. Half is probably
> > > an ideal default for devices where that works, but this way more
> > > constrained devices will be able to futz with it however they want.
> >
> > I don't think we should involve device drivers in this, that's
> > punting a complicated issue to someone who likely has little idea
> > what to do about it. This needs to be handled sensibly in the core,
> > not in a device driver. If we can't come up with a sensible
> > algorithm to handle this, how can we expect someone writing a device
> > driver to do so?
>
> Agreed, punting this to the drivers is a bad idea. But at least
> exposing variables for the allowed tag space and allowed outstanding
> commands to be able to make a smarter decision might be a good idea.
> On the other hand this will require us to count the outstanding
> commands again, introducing more cachelines touched than necessary.
> To make things worse, for complex topologies like SCSI we might have
> to limit the outstanding commands at up to three levels in the
> hierarchy.

The list seems to be missing prior context, but a few SPI drivers use
the clock algorithm for tag starvation in the driver. The NCR ones are
the ones I know about: tag allocation is the hands of a clock sweeping
around (one for the last tag and one for the last outstanding tag).
The hands are never allowed to cross, so if a tag gets starved the
hands try to cross and the driver stops issuing until the missing tag
returns. Tag starvation used to be a known problem for Parallel
devices; I haven't seen much in the way of tag starvation algorithms
for other types of devices, so I assume the problem went away.

James
* Re: [patch 1/2]percpu_ida: fix a live lock
  2014-02-11 14:42 ` James Bottomley
@ 2014-02-11 14:53 ` Christoph Hellwig
  0 siblings, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2014-02-11 14:53 UTC (permalink / raw)
To: James Bottomley
Cc: Christoph Hellwig, Jens Axboe, Kent Overstreet, Alexander Gordeev,
	Shaohua Li, linux-kernel, linux-scsi

On Tue, Feb 11, 2014 at 06:42:40AM -0800, James Bottomley wrote:
> > Unfortunately that's not true in SCSI land, where most drivers do
> > per-lun tagging, and the cmd_per_lun values are very low and very
> > often single digits, as a simple grep for cmd_per_lun will tell.
>
> Remember we do shared (all queue) tags on qla, aic and a few other
> drivers (it's actually the mailbox slot tag for the HBA).

That's why I said most and not all. We have a fair number of drivers
using host-wide tags. Having to support both models actually is a bit
of an issue with the blk-mq support in SCSI. At this point I suspect
we'll have to detangle the concepts of tag space and queueing limits,
which blk-mq merged into a single problem space, to support SCSI
properly.

> > Now it might be that the tag space actually is much bigger in the
> > hardware and the driver authors for some reason want to limit the
> > number of outstanding commands, but the interface to the drivers
> > doesn't allow them to express such a difference at the moment.
>
> Tag space is dependent on the SCSI protocol. It's 256 for SPI, 65536
> for SAS and I'm not sure for FCP.

But that doesn't mean an HBA can support the full space. Unless we know
each piece of hardware in enough detail, we can't simply increase the
tag space, as we can be sure something will break.

> The list seems to be missing prior context, but a few SPI drivers use
> the clock algorithm for tag starvation in the driver. The NCR ones
> are the ones I know about: tag allocation is the hands of a clock
> sweeping around (one for the last tag and one for the last
> outstanding tag). The hands are never allowed to cross, so if a tag
> gets starved the hands try to cross and the driver stops issuing
> until the missing tag returns. Tag starvation used to be a known
> problem for Parallel devices; I haven't seen much in the way of tag
> starvation algorithms for other types of devices, so I assume the
> problem went away.

As long as the drivers don't rely on the block layer tag handling they
can keep whatever they want to do. Given that it's mostly SPI drivers,
I doubt the tag allocation is a performance bottleneck of any sort for
them anyway.
* Re: [patch 1/2]percpu_ida: fix a live lock
  2014-02-10 23:06 ` Jens Axboe
  2014-02-11  9:12 ` Christoph Hellwig
@ 2014-02-14 10:36 ` Alexander Gordeev
  1 sibling, 0 replies; 18+ messages in thread
From: Alexander Gordeev @ 2014-02-14 10:36 UTC (permalink / raw)
To: Jens Axboe; +Cc: Kent Overstreet, Christoph Hellwig, Shaohua Li, linux-kernel

On Mon, Feb 10, 2014 at 04:06:27PM -0700, Jens Axboe wrote:
> It obviously all depends on the access pattern. X threads for X tags
> would work perfectly well with per-cpu tagging, if they are doing
> sync IO. And similarly, 8 threads each having low queue depth would
> be fine. However, it all falls apart pretty quickly if threads*qd >
> tag space.

What about introducing two modes: gentle and aggressive? Gentle would
be what is implemented now, with occasional switches into aggressive
mode and back (more about this below). In aggressive mode
percpu_ida_alloc() would ignore the nr_cpus/2 threshold and rob remote
caches of tags whenever available. Further, percpu_ida_free() would
always wake waiters, even under the percpu_max_size limit.

In essence, the aggressive mode would be used to overcome tag space
fragmentation (a garbage collector, if you want), but not only. Some
classes of devices would enable it constantly - that would be the case
for e.g. libata, paired with a TASK_RUNNING condition.

The most challenging question here is when to switch from gentle mode
to aggressive and back. I am thinking about a timeout that indicates a
reasonable average time for an IO to complete:

* Once a CPU failed to obtain a tag, cpus_have_tags is not empty and
  the timeout expired, the aggressive mode is enforced, meaning there
  is no IO activity on other CPUs while tags are out there;

* Once no tags were requested and the timeout expired, the aggressive
  mode is switched back to gentle;

* Once a second tag returned to a local cache, the aggressive mode is
  switched back to gentle, meaning no threads were/are in the
  waitqueue, hungry for tags;

The value of the timeout is calculated based on recent IO activity,
probably with some hints from the device driver. The good thing here is
that occasional unnecessary switches to the aggressive mode would not
hit the overall performance.

Quite complicated, huh? :)

--
Regards, Alexander Gordeev
agordeev@redhat.com
end of thread, other threads:[~2014-02-14 10:34 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-12-31  3:38 [patch 1/2]percpu_ida: fix a live lock Shaohua Li
2014-01-04 21:08 ` Kent Overstreet
2014-01-05 13:13 ` Shaohua Li
2014-01-06 20:46 ` Kent Overstreet
2014-01-06 20:52 ` Jens Axboe
2014-01-06 21:47 ` Kent Overstreet
2014-02-09 15:50 ` Alexander Gordeev
2014-02-10 10:32 ` Christoph Hellwig
2014-02-10 12:29 ` Alexander Gordeev
2014-02-10 15:49 ` Alexander Gordeev
2014-02-10 16:16 ` Christoph Hellwig
2014-02-10 16:26 ` Jens Axboe
2014-02-10 22:41 ` Kent Overstreet
2014-02-10 23:06 ` Jens Axboe
2014-02-11  9:12 ` Christoph Hellwig
2014-02-11 14:42 ` James Bottomley
2014-02-11 14:53 ` Christoph Hellwig
2014-02-14 10:36 ` Alexander Gordeev