* Virtio-scsi multiqueue irq affinity @ 2019-03-18 6:21 Peter Xu 2019-03-23 17:15 ` Thomas Gleixner 0 siblings, 1 reply; 16+ messages in thread
From: Peter Xu @ 2019-03-18 6:21 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Thomas Gleixner, Jason Wang, Luiz Capitulino, Linux Kernel Mailing List, Michael S. Tsirkin

Hi, Christoph & all,

I noticed that starting from commit 0d9f0a52c8b9 ("virtio_scsi: use virtio IRQ affinity", 2017-02-27) the virtio scsi driver uses a new way (via irq_create_affinity_masks()) to automatically initialize IRQ affinities for the multi-queues, which is different compared to all the other virtio devices (like virtio-net, which still uses virtqueue_set_affinity(), which is actually irq_set_affinity_hint()).

Firstly, it will definitely break some userspace programs: scripts that want to do the bindings explicitly as before will now simply fail with -EIO every time they echo to /proc/irq/N/smp_affinity of any of the multi-queues (see write_irq_affinity()).

Is there any specific reason to do it the new way? Since AFAIU we should still allow the system admins to decide what to do for such configurations, e.g., what if we only want to provision half of the CPU resources to handle IRQs for a specific virtio-scsi controller? We won't be able to achieve that with the current policy. Or, could this be a question for the IRQ subsystem (irq_create_affinity_masks()) in general? Any special considerations behind the big picture?

I believe I must have missed some context here and there... but I'd like to raise the question. Say, if the new way is preferred and attempted, maybe it would be worth spreading the idea to the rest of the virtio drivers that support multi-queue as well.

Thanks,

-- Peter Xu

^ permalink raw reply [flat|nested] 16+ messages in thread
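[Editor's note: the -EIO behaviour described above can be modeled in a few lines. This is an illustrative Python sketch, not kernel code; the real check lives in write_irq_affinity(), which rejects user-supplied masks for managed interrupts. Names here are invented for demonstration.]

```python
import errno

class IrqDesc:
    def __init__(self, managed):
        self.managed = managed      # models the IRQD_AFFINITY_MANAGED flag
        self.affinity = set()

def write_smp_affinity(desc, mask):
    """Toy model of writing /proc/irq/N/smp_affinity."""
    if desc.managed:                # kernel refuses user changes for managed irqs
        return -errno.EIO           # the echo gets -EIO back
    desc.affinity = set(mask)
    return 0

legacy = IrqDesc(managed=False)     # e.g. a virtio-net queue irq
managed = IrqDesc(managed=True)     # e.g. a virtio-scsi queue irq
print(write_smp_affinity(legacy, {0, 1}))    # 0  (write accepted)
print(write_smp_affinity(managed, {0, 1}))   # -5 (-EIO, write rejected)
```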
* Re: Virtio-scsi multiqueue irq affinity 2019-03-18 6:21 Virtio-scsi multiqueue irq affinity Peter Xu @ 2019-03-23 17:15 ` Thomas Gleixner 2019-03-25 5:02 ` Peter Xu 0 siblings, 1 reply; 16+ messages in thread
From: Thomas Gleixner @ 2019-03-23 17:15 UTC (permalink / raw)
To: Peter Xu
Cc: Christoph Hellwig, Jason Wang, Luiz Capitulino, Linux Kernel Mailing List, Michael S. Tsirkin

Peter,

On Mon, 18 Mar 2019, Peter Xu wrote:
> I noticed that starting from commit 0d9f0a52c8b9 ("virtio_scsi: use > virtio IRQ affinity", 2017-02-27) the virtio scsi driver is using a > new way (via irq_create_affinity_masks()) to automatically initialize > IRQ affinities for the multi-queues, which is different comparing to > all the other virtio devices (like virtio-net, which still uses > virtqueue_set_affinity(), which is actually, irq_set_affinity_hint()). > > Firstly, it will definitely broke some of the userspace programs with > that when the scripts wanted to do the bindings explicitly like before > and they could simply fail with -EIO now every time when echoing to > /proc/irq/N/smp_affinity of any of the multi-queues (see > write_irq_affinity()).

Did it break anything? I did not see a report so far. Assumptions about potential breakage are not really useful.

> Is there any specific reason to do it with the new way? Since AFAIU > we should still allow the system admins to decide what to do for such > configurations, .e.g., what if we only want to provision half of the > CPU resources to handle IRQs for a specific virtio-scsi controller? > We won't be able to achieve that with current policy. Or, could this > be a question for the IRQ system (irq_create_affinity_masks()) in > general? Any special considerations behind the big picture?

That has nothing to do with the irq subsystem. That merely provides the mechanisms.

The reason behind this is that multi-queue devices set up queues per CPU or, if not enough queues are available, queues per CPU group.
So it does not make sense to move the interrupt away from the CPU or the CPU group.

Aside from that, in the CPU hotunplug case, interrupts used to be moved to the online CPUs, which resulted in problems for e.g. hibernation, because on large systems moving all interrupts to the boot CPU does not work due to vector space exhaustion. Also, CPU hotunplug is used for power management purposes, and there it does not make sense either to have the per-CPU queues of the offlined CPUs moved to the still online CPUs, which then end up with several queues.

The new way to deal with this is to strictly bind per-CPU (per-CPU-group) queues. If the CPU, or the last CPU in the group, goes offline, the following happens:

1) The queue is disabled, i.e. no new requests can be queued

2) Wait for the outstanding requests to complete

3) Shut down the interrupt

This avoids having multiple queues moved to the still online CPUs and also prevents vector space exhaustion, because the shut down interrupt does not have to be migrated.

When the CPU (or the first in the group) comes online again:

1) Reenable the interrupt

2) Reenable the queue

Hope that helps.

Thanks, tglx
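[Editor's note: the offline/online sequence above can be sketched as a toy state machine. This is an illustrative Python model with invented names; the real logic lives in the blk-mq and genirq code.]

```python
class ManagedQueue:
    """Toy model of one per-CPU-group queue with a managed interrupt."""
    def __init__(self, cpus):
        self.cpus = set(cpus)        # CPU group this queue is bound to
        self.online = set(cpus)
        self.accepting = True        # queue accepts new requests
        self.irq_enabled = True
        self.inflight = 0

    def cpu_offline(self, cpu):
        self.online.discard(cpu)
        if not self.online:          # last CPU of the group went away
            self.accepting = False   # 1) queue disabled, no new requests
            self.inflight = 0        # 2) stand-in for draining outstanding requests
            self.irq_enabled = False # 3) shut down the interrupt

    def cpu_online(self, cpu):
        if cpu not in self.cpus:
            return
        if not self.online:          # first CPU of the group comes back
            self.irq_enabled = True  # 1) reenable the interrupt
            self.accepting = True    # 2) reenable the queue
        self.online.add(cpu)

q = ManagedQueue({4, 5})             # queue bound to CPU group {4, 5}
q.cpu_offline(4); q.cpu_offline(5)
print(q.accepting, q.irq_enabled)    # False False
q.cpu_online(4)
print(q.accepting, q.irq_enabled)    # True True
```

The key property the model shows: the interrupt is shut down rather than migrated, so nothing has to be moved to the remaining online CPUs.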
* Re: Virtio-scsi multiqueue irq affinity 2019-03-23 17:15 ` Thomas Gleixner @ 2019-03-25 5:02 ` Peter Xu 2019-03-25 7:06 ` Ming Lei 0 siblings, 1 reply; 16+ messages in thread
From: Peter Xu @ 2019-03-25 5:02 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Christoph Hellwig, Jason Wang, Luiz Capitulino, Linux Kernel Mailing List, Michael S. Tsirkin, minlei

On Sat, Mar 23, 2019 at 06:15:59PM +0100, Thomas Gleixner wrote:
> Peter,

Hi, Thomas,

> On Mon, 18 Mar 2019, Peter Xu wrote: > > I noticed that starting from commit 0d9f0a52c8b9 ("virtio_scsi: use > > virtio IRQ affinity", 2017-02-27) the virtio scsi driver is using a > > new way (via irq_create_affinity_masks()) to automatically initialize > > IRQ affinities for the multi-queues, which is different comparing to > > all the other virtio devices (like virtio-net, which still uses > > virtqueue_set_affinity(), which is actually, irq_set_affinity_hint()). > > > > Firstly, it will definitely broke some of the userspace programs with > > that when the scripts wanted to do the bindings explicitly like before > > and they could simply fail with -EIO now every time when echoing to > > /proc/irq/N/smp_affinity of any of the multi-queues (see > > write_irq_affinity()). > > Did it break anything? I did not see a report so far. Assumptions about > potential breakage are not really useful.

It broke some automation scripts, e.g. where they tried to bind CPUs to IRQs before starting IO, but these scripts failed early during setup when trying to echo into the affinity procfs file. Actually I started to look into this because of such script breakage reported by QEs. Initially it was thought to be a kernel bug, but later we noticed that it's a change in policy.

> > > Is there any specific reason to do it with the new way?
Since AFAIU > > we should still allow the system admins to decide what to do for such > > configurations, .e.g., what if we only want to provision half of the > > CPU resources to handle IRQs for a specific virtio-scsi controller? > > We won't be able to achieve that with current policy. Or, could this > > be a question for the IRQ system (irq_create_affinity_masks()) in > > general? Any special considerations behind the big picture? > > That has nothing to do with the irq subsystem. That merily provides the > mechanisms. > > The reason behind this is that multi-queue devices set up queues per cpu or > if not enough queues are available queues per cpu groups. So it does not > make sense to move the interrupt away from the CPU or the CPU group. > > Aside of that in the CPU hotunplug case, interrupts used to be moved to the > online CPUs which resulted in problems for e.g. hibernation because on > large systems moving all interrupts to the boot CPU does not work due to > vector space exhaustion. Also CPU hotunplug is used for power management > purposes and there it does not make sense either to have the per cpu queues > of the offlined CPUs moved to the still online CPUs which then end up with > several queues. > > The new way to deal with this is to strictly bind per CPU (per CPU group) > queues. If the CPU or the last CPU in the group goes offline the following > happens: > > 1) The queue is disabled, i.e. no new requests can be queued > > 2) Wait for the outstanding requests to complete > > 3) Shut down the interrupt > > This avoids having multiple queues moved to the still online CPUs and also > prevents vector space exhaustion because the shut down interrupt does not > have to be migrated. > > When the CPU (or the first in the group) comes online again: > > 1) Reenable the interrupt > > 2) Reenable the queue > > Hope that helps. Thanks for explaining everything! It helps a lot, and yes it makes perfect sense to me. 
If no one reported any issue, I think either the scripts do not check the return code, so they fail silently but it may not matter much (e.g., if the only thing a script wants is to spread the CPUs over the IRQs, it can simply skip that setup step, and the failed echoes won't hurt much either), or they were simply fixed up later on. Now the only thing I am unsure about is whether there could be scenarios where we may not want the default policy of spreading across all the cores.

One thing I can think of is the real-time scenario where "isolcpus=" is provided: logically we should not allow any isolated CPUs to be bound to any of the multi-queue IRQs. Though Ming Lei and I had a discussion offlist before, and Ming explained to me that as long as the isolated CPUs do not generate any IO there will be no IRQ on those isolated (real-time) CPUs at all. Can we guarantee that? Now I'm wondering whether the ideal behaviour should be that, when multi-queue is used with "isolcpus=", we only spread the queues over the housekeeping CPUs somehow. Because AFAIU real-time applications generally should not use block IO at all (and if they don't, the hardware queues bound to isolated CPUs would probably be a pure waste too, because they would always be idle on the isolated cores where the real-time application runs).

CCing Ming too.

Thanks,

-- Peter Xu
* Re: Virtio-scsi multiqueue irq affinity 2019-03-25 5:02 ` Peter Xu @ 2019-03-25 7:06 ` Ming Lei 2019-03-25 8:53 ` Thomas Gleixner 0 siblings, 1 reply; 16+ messages in thread From: Ming Lei @ 2019-03-25 7:06 UTC (permalink / raw) To: Peter Xu Cc: Thomas Gleixner, Christoph Hellwig, Jason Wang, Luiz Capitulino, Linux Kernel Mailing List, Michael S. Tsirkin, minlei On Mon, Mar 25, 2019 at 01:02:13PM +0800, Peter Xu wrote: > On Sat, Mar 23, 2019 at 06:15:59PM +0100, Thomas Gleixner wrote: > > Peter, > > Hi, Thomas, > > > > > On Mon, 18 Mar 2019, Peter Xu wrote: > > > I noticed that starting from commit 0d9f0a52c8b9 ("virtio_scsi: use > > > virtio IRQ affinity", 2017-02-27) the virtio scsi driver is using a > > > new way (via irq_create_affinity_masks()) to automatically initialize > > > IRQ affinities for the multi-queues, which is different comparing to > > > all the other virtio devices (like virtio-net, which still uses > > > virtqueue_set_affinity(), which is actually, irq_set_affinity_hint()). > > > > > > Firstly, it will definitely broke some of the userspace programs with > > > that when the scripts wanted to do the bindings explicitly like before > > > and they could simply fail with -EIO now every time when echoing to > > > /proc/irq/N/smp_affinity of any of the multi-queues (see > > > write_irq_affinity()). > > > > Did it break anything? I did not see a report so far. Assumptions about > > potential breakage are not really useful. > > It broke some automation scripts e.g. where they tried to bind CPUs to > IRQs before staring IO but these scripts failed early during setup > when trying to echo into the affinity procfs file. Actually I started > to look into this because of such script breakage reported by QEs. > Iinitially it was thought as a kernel bug but later we noticed that > it's a change in policy. > > > > > > Is there any specific reason to do it with the new way? 
Since AFAIU > > > we should still allow the system admins to decide what to do for such > > > configurations, .e.g., what if we only want to provision half of the > > > CPU resources to handle IRQs for a specific virtio-scsi controller? > > > We won't be able to achieve that with current policy. Or, could this > > > be a question for the IRQ system (irq_create_affinity_masks()) in > > > general? Any special considerations behind the big picture? > > > > That has nothing to do with the irq subsystem. That merily provides the > > mechanisms. > > > > The reason behind this is that multi-queue devices set up queues per cpu or > > if not enough queues are available queues per cpu groups. So it does not > > make sense to move the interrupt away from the CPU or the CPU group. > > > > Aside of that in the CPU hotunplug case, interrupts used to be moved to the > > online CPUs which resulted in problems for e.g. hibernation because on > > large systems moving all interrupts to the boot CPU does not work due to > > vector space exhaustion. Also CPU hotunplug is used for power management > > purposes and there it does not make sense either to have the per cpu queues > > of the offlined CPUs moved to the still online CPUs which then end up with > > several queues. > > > > The new way to deal with this is to strictly bind per CPU (per CPU group) > > queues. If the CPU or the last CPU in the group goes offline the following > > happens: > > > > 1) The queue is disabled, i.e. no new requests can be queued > > > > 2) Wait for the outstanding requests to complete > > > > 3) Shut down the interrupt > > > > This avoids having multiple queues moved to the still online CPUs and also > > prevents vector space exhaustion because the shut down interrupt does not > > have to be migrated. > > > > When the CPU (or the first in the group) comes online again: > > > > 1) Reenable the interrupt > > > > 2) Reenable the queue > > > > Hope that helps. > > Thanks for explaining everything! 
It helps a lot, and yes it makes > perfect sense to me. > > If no one reported any issue I think either the scripts are not > checking the return code so they might fail silently but it might not > matter much (e.g., if the only thing that a script wants to do is to > spread the CPUs upon the IRQs then the script can simply cancel the > setup procedure of this, and even failing of those echos won't affect > much too), or they're just simpled fixed up later on. Now the only > thing I am unsure about is whether there could be scenarios that we > may not want to use the default policy to spread the cores. > > One thing I can think of is the real-time scenario where "isolcpus=" > is provided, then logically we should not allow any isolated CPUs to > be bound to any of the multi-queue IRQs. Though Ming Lei and I had a

So far, this behaviour is handled by user-space.

From my understanding, the IRQ subsystem doesn't handle "isolcpus=", even though the Kconfig help doesn't mention the effect on irq affinity:

Make sure that CPUs running critical tasks are not disturbed by any source of "noise" such as unbound workqueues, timers, kthreads... Unbound jobs get offloaded to housekeeping CPUs. This is driven by the "isolcpus=" boot parameter.

Yeah, some RT applications may exclude the 'isolcpus=' CPUs from some IRQ's affinity via the /proc/irq interface, and now it is no longer possible to do that for managed IRQs.

> discussion offlist before and Ming explained to me that as long as the > isolated CPUs do not generate any IO then there will be no IRQ on > those isolated (real-time) CPUs at all. Can we guarantee that? Now

It is only guaranteed for 1:1 mapping.
blk-mq uses the managed IRQ's affinity to set up the queue mapping, for example:

1) single hardware queue
- this queue's IRQ affinity includes all CPUs, so the hardware queue's IRQ is only fired on one specific CPU for IO submitted from any CPU

2) multi hardware queue
- there are N hardware queues
- for each hardware queue i (i < N), its IRQ's affinity may include N(i) CPUs, so the IRQ for this hardware queue i is fired on one specific CPU among those N(i)

Thanks, Ming
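[Editor's note: Ming's two cases can be illustrated with a small model that builds the CPU-to-hardware-queue map from per-queue affinity masks. This is a hedged sketch of the idea only, not the kernel's actual blk-mq mapping code; the data layout is invented.]

```python
def build_queue_map(queue_affinities):
    """queue_affinities: one CPU set per hardware queue (toy layout)."""
    cpu_to_queue = {}
    for q, cpus in enumerate(queue_affinities):
        for cpu in sorted(cpus):
            cpu_to_queue[cpu] = q    # IO submitted on `cpu` uses queue q
    return cpu_to_queue

# Case 1: a single hardware queue whose affinity includes all CPUs
print(build_queue_map([{0, 1, 2, 3}]))    # {0: 0, 1: 0, 2: 0, 3: 0}
# Case 2: N hardware queues, each covering a group of N(i) CPUs
print(build_queue_map([{0, 1}, {2, 3}]))  # {0: 0, 1: 0, 2: 1, 3: 1}
```

The point of the 1:1 guarantee discussed above: in case 2, a CPU that never submits IO simply never appears as a submitter for its queue, so its interrupt stays quiet.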
* Re: Virtio-scsi multiqueue irq affinity 2019-03-25 7:06 ` Ming Lei @ 2019-03-25 8:53 ` Thomas Gleixner 2019-03-25 9:43 ` Peter Xu 2019-03-25 9:50 ` Ming Lei 0 siblings, 2 replies; 16+ messages in thread
From: Thomas Gleixner @ 2019-03-25 8:53 UTC (permalink / raw)
To: Ming Lei
Cc: Peter Xu, Christoph Hellwig, Jason Wang, Luiz Capitulino, Linux Kernel Mailing List, Michael S. Tsirkin, minlei

Ming,

On Mon, 25 Mar 2019, Ming Lei wrote:
> On Mon, Mar 25, 2019 at 01:02:13PM +0800, Peter Xu wrote: > > One thing I can think of is the real-time scenario where "isolcpus=" > > is provided, then logically we should not allow any isolated CPUs to > > be bound to any of the multi-queue IRQs. Though Ming Lei and I had a > > So far, this behaviour is made by user-space. > > >From my understanding, IRQ subsystem doesn't handle "isolcpus=", even > though the Kconfig help doesn't mention irq affinity affect: > > Make sure that CPUs running critical tasks are not disturbed by > any source of "noise" such as unbound workqueues, timers, kthreads... > Unbound jobs get offloaded to housekeeping CPUs. This is driven by > the "isolcpus=" boot parameter.

isolcpus has no effect on the interrupts. That's what 'irqaffinity=' is for.

> Yeah, some RT application may exclude 'isolcpus=' from some IRQ's > affinity via /proc/irq interface, and now it becomes not possible any > more to do that for managed IRQ. > > > discussion offlist before and Ming explained to me that as long as the > > isolated CPUs do not generate any IO then there will be no IRQ on > > those isolated (real-time) CPUs at all. Can we guarantee that? Now > > It is only guaranteed for 1:1 mapping. > > blk-mq uses managed IRQ's affinity to setup queue mapping, for example: > > 1) single hardware queue > - this queue's IRQ affinity includes all CPUs, then the hardware queue's > IRQ is only fired on one specific CPU for IO submitted from any CPU

Right.
We can special case that for single HW queue to honor the default affinity setting. That's not hard to achieve.

> 2) multi hardware queue > - there are N hardware queues > - for each hardware queue i(i < N), its IRQ's affinity may include N(i) CPUs, > then IRQ for this hardware queue i is fired on one specific CPU among N(i).

Correct, and that's the sane case where it does not matter much, because if your task on an isolated CPU does I/O then redirecting it through some other CPU does not make sense. If it doesn't do I/O it won't be affected by the dormant queue.

Thanks, tglx
* Re: Virtio-scsi multiqueue irq affinity 2019-03-25 8:53 ` Thomas Gleixner @ 2019-03-25 9:43 ` Peter Xu 2019-03-25 13:27 ` Thomas Gleixner 2019-03-25 9:50 ` Ming Lei 1 sibling, 1 reply; 16+ messages in thread From: Peter Xu @ 2019-03-25 9:43 UTC (permalink / raw) To: Thomas Gleixner Cc: Ming Lei, Christoph Hellwig, Jason Wang, Luiz Capitulino, Linux Kernel Mailing List, Michael S. Tsirkin, minlei On Mon, Mar 25, 2019 at 09:53:28AM +0100, Thomas Gleixner wrote: > Ming, > > On Mon, 25 Mar 2019, Ming Lei wrote: > > On Mon, Mar 25, 2019 at 01:02:13PM +0800, Peter Xu wrote: > > > One thing I can think of is the real-time scenario where "isolcpus=" > > > is provided, then logically we should not allow any isolated CPUs to > > > be bound to any of the multi-queue IRQs. Though Ming Lei and I had a > > > > So far, this behaviour is made by user-space. > > > > >From my understanding, IRQ subsystem doesn't handle "isolcpus=", even > > though the Kconfig help doesn't mention irq affinity affect: > > > > Make sure that CPUs running critical tasks are not disturbed by > > any source of "noise" such as unbound workqueues, timers, kthreads... > > Unbound jobs get offloaded to housekeeping CPUs. This is driven by > > the "isolcpus=" boot parameter. > > isolcpus has no effect on the interupts. That's what 'irqaffinity=' is for. > > > Yeah, some RT application may exclude 'isolcpus=' from some IRQ's > > affinity via /proc/irq interface, and now it becomes not possible any > > more to do that for managed IRQ. > > > > > discussion offlist before and Ming explained to me that as long as the > > > isolated CPUs do not generate any IO then there will be no IRQ on > > > those isolated (real-time) CPUs at all. Can we guarantee that? Now > > > > It is only guaranteed for 1:1 mapping. 
> > > > blk-mq uses managed IRQ's affinity to setup queue mapping, for example: > > > > 1) single hardware queue > > - this queue's IRQ affinity includes all CPUs, then the hardware queue's > > IRQ is only fired on one specific CPU for IO submitted from any CPU > > Right. We can special case that for single HW queue to honor the default > affinity setting. That's not hard to achieve. > > > 2) multi hardware queue > > - there are N hardware queues > > - for each hardware queue i(i < N), its IRQ's affinity may include N(i) CPUs, > > then IRQ for this hardware queue i is fired on one specific CPU among N(i). > > Correct and that's the sane case where it does not matter much, because if > your task on an isolated CPU does I/O then redirecting it through some > other CPU does not make sense. If it doesn't do I/O it wont be affected by > the dormant queue.

(My thanks to both.)

Now I understand it can be guaranteed, so it should not break the determinism of the real-time applications. But again, I'm curious whether we can specify how to spread the hardware queues of a block controller (as I asked in my previous post) instead of using the default policy (which is to spread the queues over all the cores). I'll try to give a detailed example this time: let's assume we have a host with 2 nodes and 8 cores (Node 0 with CPUs 0-3, Node 1 with CPUs 4-7), and a SCSI controller with 4 queues. We want the 2nd node to run the real-time applications, so we set isolcpus=4-7. By default, IIUC the hardware queues will be allocated like this:

- queue 1: CPU 0,1
- queue 2: CPU 2,3
- queue 3: CPU 4,5
- queue 4: CPU 6,7

And the IRQs of the queues will be bound to the same cpuset that the queue is bound to.
So my previous question is: since we know that CPUs 4-7 won't generate any IO after all (and they shouldn't), could it be possible to configure the system somehow to reflect a mapping like below:

- queue 1: CPU 0
- queue 2: CPU 1
- queue 3: CPU 2
- queue 4: CPU 3

Then we disallow CPUs 4-7 from generating IO and return failure if they try to.

Again, I'm pretty uncertain whether this case is anything close to useful... It just came out of my pure curiosity. I think it at least has some benefits: we will guarantee that the realtime CPUs won't send block IO requests (which is good, because block IO could easily break real-time determinism), and we'll save two queues from being totally idle (so if we run non-real-time block applications on cores 0-3 we still gain 4 hardware queues' throughput rather than 2).

Thanks,

-- Peter Xu
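[Editor's note: Peter's two mappings can be reproduced with a small contiguous-chunk spreading function. This is an illustrative simplification of the kernel's spreading logic, which additionally accounts for NUMA nodes; the function name and policy are invented.]

```python
def spread(nr_queues, cpus):
    """Split `cpus` into nr_queues contiguous groups, as evenly as possible."""
    cpus = list(cpus)
    per_q, rest = divmod(len(cpus), nr_queues)
    mapping, i = [], 0
    for q in range(nr_queues):
        n = per_q + (1 if q < rest else 0)   # first `rest` queues get one extra CPU
        mapping.append(cpus[i:i + n])
        i += n
    return mapping

# Default policy: spread over all 8 CPUs
print(spread(4, range(8)))       # [[0, 1], [2, 3], [4, 5], [6, 7]]
# Peter's proposal: spread over housekeeping CPUs only (isolcpus=4-7)
print(spread(4, [0, 1, 2, 3]))   # [[0], [1], [2], [3]]
```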
* Re: Virtio-scsi multiqueue irq affinity 2019-03-25 9:43 ` Peter Xu @ 2019-03-25 13:27 ` Thomas Gleixner 0 siblings, 0 replies; 16+ messages in thread From: Thomas Gleixner @ 2019-03-25 13:27 UTC (permalink / raw) To: Peter Xu Cc: Ming Lei, Christoph Hellwig, Jason Wang, Luiz Capitulino, Linux Kernel Mailing List, Michael S. Tsirkin, minlei Peter, On Mon, 25 Mar 2019, Peter Xu wrote: > Now I understand it can be guaranteed so it should not break > determinism of the real-time applications. But again, I'm curious > whether we can specify how to spread the hardware queues of a block > controller (as I asked in my previous post) instead of the default one > (which is to spread the queues upon all the cores)? I'll try to give > a detailed example on this one this time: Let's assume we've had a > host with 2 nodes and 8 cores (Node 0 with CPUs 0-3, Node 1 with CPUs > 4-7), and a SCSI controller with 4 queues. We want to take the 2nd > node to run the real-time applications so we do isolcpus=4-7. By > default, IIUC the hardware queues will be allocated like this: > > - queue 1: CPU 0,1 > - queue 2: CPU 2,3 > - queue 3: CPU 4,5 > - queue 4: CPU 6,7 > > And the IRQs of the queues will be bound to the same cpuset that the > queue is bound to. > > So my previous question is: since we know that CPU 4-7 won't generate > any IO after all (and they shouldn't), could it be possible that we > configure the system somehow to reflect a mapping like below: > > - queue 1: CPU 0 > - qeueu 2: CPU 1 > - queue 3: CPU 2 > - queue 4: CPU 3 > > Then we disallow the CPUs 4-7 to generate IO and return failure if > they tries to. > > Again, I'm pretty uncertain on whether this case can be anything close > to useful... It just came out of my pure curiosity. 
I think it at > least has some benefits like: we will guarantee that the realtime CPUs > won't send block IO requests (which could be good because it could > simply break real-time determinism), and we'll save two queues from > being totally idle (so if we run non-real-time block applications on > cores 0-3 we still gain 4 hardware queues's throughput rather than 2).

If that _IS_ useful, then the affinity spreading logic can be changed to accommodate that. It's not really hard to do so, but we'd need a proper use case for justification.

Thanks, tglx
* Re: Virtio-scsi multiqueue irq affinity 2019-03-25 8:53 ` Thomas Gleixner 2019-03-25 9:43 ` Peter Xu @ 2019-03-25 9:50 ` Ming Lei 2021-05-08 7:52 ` xuyihang 1 sibling, 1 reply; 16+ messages in thread
From: Ming Lei @ 2019-03-25 9:50 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Peter Xu, Christoph Hellwig, Jason Wang, Luiz Capitulino, Linux Kernel Mailing List, Michael S. Tsirkin, minlei

On Mon, Mar 25, 2019 at 09:53:28AM +0100, Thomas Gleixner wrote:
> Ming, > > On Mon, 25 Mar 2019, Ming Lei wrote: > > On Mon, Mar 25, 2019 at 01:02:13PM +0800, Peter Xu wrote: > > > One thing I can think of is the real-time scenario where "isolcpus=" > > > is provided, then logically we should not allow any isolated CPUs to > > > be bound to any of the multi-queue IRQs. Though Ming Lei and I had a > > > > So far, this behaviour is made by user-space. > > > > >From my understanding, IRQ subsystem doesn't handle "isolcpus=", even > > though the Kconfig help doesn't mention irq affinity affect: > > > > Make sure that CPUs running critical tasks are not disturbed by > > any source of "noise" such as unbound workqueues, timers, kthreads... > > Unbound jobs get offloaded to housekeeping CPUs. This is driven by > > the "isolcpus=" boot parameter. > > isolcpus has no effect on the interupts. That's what 'irqaffinity=' is for.

Indeed.

irq_default_affinity is built from 'irqaffinity='; however, we don't consider irq_default_affinity for managed IRQ affinity.

It looks like Peter wants to exclude some CPUs from the spread of managed IRQs.

Thanks, Ming
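[Editor's note: both 'irqaffinity=' and 'isolcpus=' take the kernel's cpulist syntax (comma-separated CPUs and ranges, e.g. "0-3,8"). Below is a toy parser for that syntax, for illustration only; it is not the kernel's cpulist_parse() and skips extensions like stride suffixes.]

```python
def parse_cpulist(s):
    """Parse a cpulist string such as '0-3,8' into a set of CPU numbers."""
    cpus = set()
    for part in s.split(","):
        if "-" in part:
            lo, hi = map(int, part.split("-"))
            cpus.update(range(lo, hi + 1))   # inclusive range
        else:
            cpus.add(int(part))
    return cpus

print(sorted(parse_cpulist("0-3,8")))   # [0, 1, 2, 3, 8]
```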
* Re: Virtio-scsi multiqueue irq affinity 2019-03-25 9:50 ` Ming Lei @ 2021-05-08 7:52 ` xuyihang 2021-05-08 12:26 ` Thomas Gleixner 0 siblings, 1 reply; 16+ messages in thread
From: xuyihang @ 2021-05-08 7:52 UTC (permalink / raw)
To: Ming Lei, Thomas Gleixner
Cc: Peter Xu, Christoph Hellwig, Jason Wang, Luiz Capitulino, Linux Kernel Mailing List, Michael S. Tsirkin, minlei, liaochang1

On 2019/3/25 17:50, Ming Lei wrote:
> On Mon, Mar 25, 2019 at 09:53:28AM +0100, Thomas Gleixner wrote: >> Ming, >> >> On Mon, 25 Mar 2019, Ming Lei wrote: >>> On Mon, Mar 25, 2019 at 01:02:13PM +0800, Peter Xu wrote: >>>> One thing I can think of is the real-time scenario where "isolcpus=" >>>> is provided, then logically we should not allow any isolated CPUs to >>>> be bound to any of the multi-queue IRQs. Though Ming Lei and I had a >>> So far, this behaviour is made by user-space. >>> >>> >From my understanding, IRQ subsystem doesn't handle "isolcpus=", even >>> though the Kconfig help doesn't mention irq affinity affect: >>> >>> Make sure that CPUs running critical tasks are not disturbed by >>> any source of "noise" such as unbound workqueues, timers, kthreads... >>> Unbound jobs get offloaded to housekeeping CPUs. This is driven by >>> the "isolcpus=" boot parameter. >> isolcpus has no effect on the interupts. That's what 'irqaffinity=' is for. > Indeed. > > irq_default_affinity is built from 'irqaffinity=', however, we don't > consider irq_default_affinity for managed IRQ affinity. > > Looks Peter wants to exclude some CPUs from the spread on managed IRQ.

Hi Ming and Thomas,

We are dealing with a scenario which may need to assign a default irqaffinity for a managed IRQ.

Assume we have a full-CPU-usage RT thread running, bound to a specific CPU. Meanwhile, the interrupt handler registered by a device which is ksoftirqd may never have a chance to run. (And we don't want to use isolated CPUs.)

There could be a couple of ways to deal with this problem:

1.
Adjust the priority of ksoftirqd or the RT thread, so the interrupt handler could preempt the RT thread. However, I am not sure whether it could have side effects or not.

2. Adjust the interrupt CPU affinity or the RT thread affinity. But managed IRQ seems designed to forbid users from manipulating interrupt affinity.

It seems to me that managed IRQ is coupled with the user-side application.

Would you share your thoughts about this issue, please?

Thanks, Yihang
* Re: Virtio-scsi multiqueue irq affinity 2021-05-08 7:52 ` xuyihang @ 2021-05-08 12:26 ` Thomas Gleixner 2021-05-10 3:19 ` liaochang (A) 2021-05-10 8:48 ` xuyihang 0 siblings, 2 replies; 16+ messages in thread
From: Thomas Gleixner @ 2021-05-08 12:26 UTC (permalink / raw)
To: xuyihang, Ming Lei
Cc: Peter Xu, Christoph Hellwig, Jason Wang, Luiz Capitulino, Linux Kernel Mailing List, Michael S. Tsirkin, minlei, liaochang1

Yihang,

On Sat, May 08 2021 at 15:52, xuyihang wrote:
> > We are dealing with a scenario which may need to assign a default > irqaffinity for managed IRQ. > > Assume we have a full CPU usage RT thread running binded to a specific > CPU. > > In the mean while, interrupt handler registered by a device which is > ksoftirqd may never have a chance to run. (And we don't want to use > isolate CPU)

A device cannot register an interrupt handler in ksoftirqd.

> There could be a couple way to deal with this problem: > > 1. Adjust priority of ksoftirqd or RT thread, so the interrupt handler > could preempt > > RT thread. However, I am not sure whether it could have some side > effects or not. > > 2. Adjust interrupt CPU affinity or RT thread affinity. But managed IRQ > seems design to forbid user from manipulating interrupt affinity. > > It seems managed IRQ is coupled with user side application to me. > > Would you share your thoughts about this issue please?

Can you please provide a more detailed description of your system?

- Number of CPUs
- Kernel version
- Is NOHZ full enabled?
- Any isolation mechanisms enabled, and if so how are they configured (e.g. on the kernel command line)?
- Number of queues in the multiqueue device
- Is the RT thread issuing I/O to the multiqueue device?

Thanks,

tglx
* Re: Virtio-scsi multiqueue irq affinity 2021-05-08 12:26 ` Thomas Gleixner @ 2021-05-10 3:19 ` liaochang (A) 2021-05-10 7:54 ` Thomas Gleixner 2021-05-10 8:48 ` xuyihang 1 sibling, 1 reply; 16+ messages in thread
From: liaochang (A) @ 2021-05-10 3:19 UTC (permalink / raw)
To: Thomas Gleixner, xuyihang, Ming Lei
Cc: Peter Xu, Christoph Hellwig, Jason Wang, Luiz Capitulino, Linux Kernel Mailing List, Michael S. Tsirkin, minlei

Hi Thomas,

On 2021/5/8 20:26, Thomas Gleixner wrote:
> Yihang, > > On Sat, May 08 2021 at 15:52, xuyihang wrote: >> >> We are dealing with a scenario which may need to assign a default >> irqaffinity for managed IRQ. >> >> Assume we have a full CPU usage RT thread running binded to a specific >> CPU. >> >> In the mean while, interrupt handler registered by a device which is >> ksoftirqd may never have a chance to run. (And we don't want to use >> isolate CPU) > > A device cannot register and interrupt handler in ksoftirqd.

I learned more about the scenario after communicating with Yihang offline:

1. We have a machine with 36 CPUs, and assign several RT threads to the last two CPUs (CPU-34, CPU-35).
2. The I/O device driver creates a single managed irq, the affinity of which includes CPU-34 and CPU-35.
3. Another regular application launches I/O operations on CPUs different from the ones the RT threads use; CPU-34/35 will then receive hardware interrupts and wake up ksoftirqd to deal with the real I/O work.
4. Because the priority and scheduling policy of the RT threads overwhelm the per-cpu ksoftirqd, ksoftirqd has no chance to run on CPU-34/35, which means I/O processing cannot finish in time and the application gets stuck.

>> There could be a couple way to deal with this problem: >> >> 1. Adjust priority of ksoftirqd or RT thread, so the interrupt handler >> could preempt >> >> RT thread. However, I am not sure whether it could have some side >> effects or not. >> >> 2. Adjust interrupt CPU affinity or RT thread affinity.
But managed IRQ >> seems design to forbid user from manipulating interrupt affinity. >> >> It seems managed IRQ is coupled with user side application to me. >> >> Would you share your thoughts about this issue please? > > Can you please provide a more detailed description of your system? > > - Number of CPUs > > - Kernel version > - Is NOHZ full enabled? > - Any isolation mechanisms enabled, and if so how are they > configured (e.g. on the kernel command line)? > > - Number of queues in the multiqueue device > > - Is the RT thread issuing I/O to the multiqueue device? > > Thanks, > > tglx > . > BR, Liao Chang ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Virtio-scsi multiqueue irq affinity
  2021-05-10  3:19 ` liaochang (A)
@ 2021-05-10  7:54   ` Thomas Gleixner
  2021-05-18  1:37     ` liaochang (A)
  0 siblings, 1 reply; 16+ messages in thread

From: Thomas Gleixner @ 2021-05-10 7:54 UTC (permalink / raw)
To: liaochang (A), xuyihang, Ming Lei
Cc: Peter Xu, Christoph Hellwig, Jason Wang, Luiz Capitulino,
    Linux Kernel Mailing List, Michael S. Tsirkin, minlei

Liao,

On Mon, May 10 2021 at 11:19, liaochang wrote:
> 1. We have a machine with 36 CPUs, and assign several RT threads to
> the last two CPUs (CPU-34, CPU-35).

Which kind of machine? x86?

> 2. The I/O device driver creates a single managed irq, the affinity
> of which includes CPU-34 and CPU-35.

If that driver creates only a single managed interrupt, then the
possible affinity of that interrupt spans CPUs 0 - 35.

That's expected, but what is the effective affinity of that interrupt?

# cat /proc/irq/$N/effective_affinity

Also please provide the full output of

# cat /proc/interrupts

and point out which device we are talking about.

Thanks,

        tglx
* Re: Virtio-scsi multiqueue irq affinity
  2021-05-10  7:54 ` Thomas Gleixner
@ 2021-05-18  1:37   ` liaochang (A)
  0 siblings, 0 replies; 16+ messages in thread

From: liaochang (A) @ 2021-05-18 1:37 UTC (permalink / raw)
To: Thomas Gleixner, xuyihang, Ming Lei
Cc: Peter Xu, Christoph Hellwig, Jason Wang, Luiz Capitulino,
    Linux Kernel Mailing List, Michael S. Tsirkin, minlei

Thomas,

On 2021/5/10 15:54, Thomas Gleixner wrote:
> Liao,
>
> On Mon, May 10 2021 at 11:19, liaochang wrote:
>> 1. We have a machine with 36 CPUs, and assign several RT threads to
>> the last two CPUs (CPU-34, CPU-35).
>
> Which kind of machine? x86?
>
>> 2. The I/O device driver creates a single managed irq, the affinity
>> of which includes CPU-34 and CPU-35.
>
> If that driver creates only a single managed interrupt, then the
> possible affinity of that interrupt spans CPUs 0 - 35.
>
> That's expected, but what is the effective affinity of that interrupt?
>
> # cat /proc/irq/$N/effective_affinity
>
> Also please provide the full output of
>
> # cat /proc/interrupts
>
> and point out which device we are talking about.

The mentioned managed irq is registered by the virtio-scsi driver over
PCI (on an x86 platform, a VM with 4 vCPUs), as shown below.

# lspci -vvv
...
00:04.0 SCSI storage controller: Virtio: Virtio SCSI
        Subsystem: Virtio: Device 0008
        Physical Slot: 4
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 11
        Region 0: I/O ports at c140 [size=64]
        Region 1: Memory at febd2000 (32-bit, non-prefetchable) [size=4K]
        Region 4: Memory at fe004000 (64-bit, prefetchable) [size=16K]
        Capabilities: [98] MSI-X: Enable+ Count=4 Masked-
                Vector table: BAR=1 offset=00000000
                PBA: BAR=1 offset=00000800

# ls /sys/bus/pci/devices/0000:00:04.0/msi_irqs
33 34 35 36

# cat /proc/interrupts
...
 33:          0          0          0          0   PCI-MSI 65536-edge   virtio1-config
 34:          0          0          0          0   PCI-MSI 65537-edge   virtio1-control
 35:          0          0          0          0   PCI-MSI 65538-edge   virtio1-event
 36:      10637          0          0          0   PCI-MSI 65539-edge   virtio1-request

As you can see, virtio-scsi allocates four MSI-X interrupts, 33 to 36.
The last one is supposed to be triggered when the data of the virtqueue
is ready to receive; its interrupt handler then wakes ksoftirqd to
process the I/O. If I assign a FIFO RT thread to CPU0, a simple I/O
operation issued by the command "dd if=/dev/zero of=/test.img bs=1K
count=1 oflag=direct,sync" will never finish.

Although that is expected, do you think it is somewhat risky for Linux
availability? In a cloud-based environment, services from different
teams may seriously affect each other because of a lack of
communication or of a good understanding of the infrastructure.

This problem arises when the RT thread and ksoftirqd are scheduled on
the same CPU. Besides placing the RT thread carefully, I also tried
setting "rq_affinity" to 2, but the cost is a 10%~30% performance
degradation in some I/O benchmarks. So I wonder: could the affinity of
managed irqs support configuration from user space or via kernel boot
args?

Thanks.

BR,
Liao, Chang
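The affinity files Thomas asked about can be read directly from procfs.
A sketch for the virtio1-request interrupt above (the IRQ number 36 is
taken from that listing and is specific to this VM):

```shell
# Which CPUs may the managed virtio1-request interrupt (IRQ 36 above)
# run on, and where is it actually delivered right now?
cat /proc/irq/36/smp_affinity_list          # possible CPUs (managed mask)
cat /proc/irq/36/effective_affinity_list    # CPU(s) it is currently routed to
grep virtio1-request /proc/interrupts       # per-CPU delivery counts
```

For a managed interrupt, writing to these files fails with -EIO; they
are read-only diagnostics here.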
* Re: Virtio-scsi multiqueue irq affinity
  2021-05-08 12:26 ` Thomas Gleixner
  2021-05-10  3:19 ` liaochang (A)
@ 2021-05-10  8:48 ` xuyihang
  2021-05-10 19:56   ` Thomas Gleixner
  1 sibling, 1 reply; 16+ messages in thread

From: xuyihang @ 2021-05-10 8:48 UTC (permalink / raw)
To: Thomas Gleixner, Ming Lei
Cc: Peter Xu, Christoph Hellwig, Jason Wang, Luiz Capitulino,
    Linux Kernel Mailing List, Michael S. Tsirkin, minlei, liaochang1

Thomas,

On 2021/5/8 20:26, Thomas Gleixner wrote:
> Yihang,
>
> On Sat, May 08 2021 at 15:52, xuyihang wrote:
>> We are dealing with a scenario which may need to assign a default
>> irq affinity for a managed IRQ.
>>
>> Assume we have a full-CPU-usage RT thread running bound to a
>> specific CPU.
>>
>> In the meanwhile, the interrupt handler registered by a device which
>> is ksoftirqd may never have a chance to run. (And we don't want to
>> use isolated CPUs.)
> A device cannot register an interrupt handler in ksoftirqd.
>
>> There could be a couple of ways to deal with this problem:
>>
>> 1. Adjust the priority of ksoftirqd or of the RT thread, so that the
>> interrupt handler could preempt the RT thread. However, I am not sure
>> whether that could have side effects or not.
>>
>> 2. Adjust the interrupt CPU affinity or the RT thread affinity. But
>> managed IRQs seem designed to forbid users from manipulating
>> interrupt affinity.
>>
>> It seems to me that managed IRQs are coupled with the user-side
>> application.
>>
>> Would you share your thoughts about this issue, please?
> Can you please provide a more detailed description of your system?
>
> - Number of CPUs

It's a 4-CPU x86 VM.

> - Kernel version

This experiment ran on linux-4.19.

> - Is NOHZ full enabled?

nohz=off

> - Any isolation mechanisms enabled, and if so how are they
>   configured (e.g. on the kernel command line)?

Some cores are isolated via the command line (e.g. isolcpus=3) and
bound to RT threads; no other isolation is configured.

> - Number of queues in the multiqueue device

Only one queue.

[root@localhost ~]# cat /proc/interrupts | grep request
 27:       5499          0          0          0   PCI-MSI 65539-edge   virtio1-request

This environment is a virtual machine and it's a virtio device; I
guess that should not make any difference in this case.

> - Is the RT thread issuing I/O to the multiqueue device?

The RT thread doesn't issue I/O.

We simplified the reproduction procedure:

1. Start a busy-looping program with near 100% CPU usage, named print:
   ./print 1 1 &

2. Make the program a realtime application:
   chrt -f -p 1 11514

3. Bind the RT process to the **managed irq** core:
   taskset -cpa 0 11514

4. Use dd to write to the hard drive; dd never finishes and returns:
   dd if=/dev/zero of=/test.img bs=1K count=1 oflag=direct,sync &

Since the CPU is fully utilized by the RT application, and the hard
drive driver chooses CPU0 to handle its softirq, there is no chance
for dd to run.

  PID USER     PR  NI  VIRT  RES  SHR S  %CPU %MEM  TIME+   COMMAND
11514 root     -2   0  2228  740  676 R 100.0  0.0 3:26.70  print

If we make some changes to this experiment:

1. If we make the RT application use less CPU time instead of 100%,
the problem disappears.

2. If we change rq_affinity to 2, in order to avoid handling the
softirq on the same core as the RT thread, the problem also
disappears. However, this approach results in about 10%-30% random
write performance degradation compared to rq_affinity = 1, which may
have better cache utilization.

echo 2 > /sys/block/sda/queue/rq_affinity

Therefore, I want to exclude some CPUs from managed irqs via a boot
parameter, similar in approach to 11ea68f553e2 ("genirq,
sched/isolation: Isolate from handling managed interrupts").

Thanks,
Yihang
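The commit cited above added exactly such a boot-time knob: since
11ea68f553e2 (merged in v5.6), the isolcpus= parameter accepts a
managed_irq flag that steers managed interrupts away from the listed
CPUs as long as their affinity mask also contains a non-isolated CPU.
A sketch of a command-line fragment for the 4-CPU VM described here
(the CPU number is illustrative):

```shell
# Kernel command line fragment (config only, requires v5.6+ with commit
# 11ea68f553e2): isolate CPU 3 from the scheduler domains and keep
# managed interrupts off it where the affinity mask allows.
isolcpus=managed_irq,domain,3
```

If CPU 3 is the only online CPU left in a managed interrupt's mask, the
interrupt is still delivered there, so this is best-effort rather than
a hard guarantee.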
* Re: Virtio-scsi multiqueue irq affinity
  2021-05-10  8:48 ` xuyihang
@ 2021-05-10 19:56   ` Thomas Gleixner
  2021-05-11 12:38     ` xuyihang
  0 siblings, 1 reply; 16+ messages in thread

From: Thomas Gleixner @ 2021-05-10 19:56 UTC (permalink / raw)
To: xuyihang, Ming Lei
Cc: Peter Xu, Christoph Hellwig, Jason Wang, Luiz Capitulino,
    Linux Kernel Mailing List, Michael S. Tsirkin, minlei, liaochang1

Yihang,

On Mon, May 10 2021 at 16:48, xuyihang wrote:
> On 2021/5/8 20:26, Thomas Gleixner wrote:
>> Can you please provide a more detailed description of your system?
>> - Kernel version
> This experiment ran on linux-4.19.

Again. Please provide reports against the most recent mainline version
and not against some randomly picked kernel variant.

> If we make some changes to this experiment:
>
> 1. If we make the RT application use less CPU time instead of 100%,
> the problem disappears.
>
> 2. If we change rq_affinity to 2, in order to avoid handling the
> softirq on the same core as the RT thread, the problem also
> disappears. However, this approach results in about 10%-30% random
> write performance degradation compared to rq_affinity = 1, which may
> have better cache utilization.
>
> echo 2 > /sys/block/sda/queue/rq_affinity
>
> Therefore, I want to exclude some CPUs from managed irqs via a boot
> parameter,

Why does this realtime thread have to run on CPU0, and why can it not
move to some other CPU?

> similar in approach to 11ea68f553e2 ("genirq, sched/isolation:
> Isolate from handling managed interrupts").

Why can't you use the existing isolation mechanisms?

Thanks,

        tglx
* Re: Virtio-scsi multiqueue irq affinity
  2021-05-10 19:56 ` Thomas Gleixner
@ 2021-05-11 12:38   ` xuyihang
  0 siblings, 0 replies; 16+ messages in thread

From: xuyihang @ 2021-05-11 12:38 UTC (permalink / raw)
To: Thomas Gleixner, Ming Lei
Cc: Peter Xu, Christoph Hellwig, Jason Wang, Luiz Capitulino,
    Linux Kernel Mailing List, Michael S. Tsirkin, minlei, liaochang1

Hi Thomas,

The previous experiment requires a device driver that enables managed
irqs, which I could not easily install on a most recent branch of the
OS. And what I was actually asking is whether we could change the
managed irq behaviour a little bit, rather than reporting a bug.

So, to better illustrate this problem, I did another test to simulate
the scenario. This time I wrote a kernel module: in its module_init
function I use request_irq() to register an IRQ; the irq handler
queues a work item on the workqueue, and the work handler prints
"work handler called".

1. Register an IRQ for a fake new device, queueing the work handler
when the IRQ arrives:

/ # insmod request_irq.ko

2. Bind the IRQ to CPU3:

/ # echo 8 > /proc/irq/7/smp_affinity

3. Start a full-CPU-usage RT process and bind it to CPU3:

./test.sh &
/ # taskset -p 8 100
pid 100's current affinity mask: f
pid 100's new affinity mask: 8
/ # chrt -f -p 1 100
pid 100's current scheduling policy: SCHED_OTHER
pid 100's current scheduling priority: 0
pid 100's new scheduling policy: SCHED_FIFO
pid 100's new scheduling priority: 1
/ # echo -1 >/proc/sys/kernel/sched_rt_runtime_us
/ # echo -1 >/proc/sys/kernel/sched_rt_period_us
/ # top
Mem: 27376K used, 73224K free, 0K shrd, 0K buff, 8368K cached
CPU0:  0.0% usr  0.0% sys  0.0% nic  100% idle  0.0% io  0.0% irq  0.0% sirq
CPU1:  0.0% usr  0.0% sys  0.0% nic  100% idle  0.0% io  0.0% irq  0.0% sirq
CPU2:  0.0% usr  0.0% sys  0.0% nic  100% idle  0.0% io  0.0% irq  0.0% sirq
CPU3:  100% usr  0.0% sys  0.0% nic  0.0% idle  0.0% io  0.0% irq  0.0% sirq
Load average: 4.00 4.00 4.00 5/62 126
  PID  PPID USER     STAT   VSZ %VSZ CPU %CPU COMMAND
  100     1 0        R     3252  3.2   3 26.3 {exe} ash ./test.sh
  126     1 0        R     3252  3.2   1  0.8 top
...

4. Trigger the interrupt:

/ # echo -n trigger > /sys/kernel/debug/irq/irqs/7

From dmesg we can tell that the queued work handler is not called.

I can understand that the behaviour is as expected, but in practice,
say, the people working on the RT side could be in a totally different
team from the device driver people. It feels like it would be nice to
have a feature to exclude some CPUs from managed irq drivers.

On 2021/5/11 3:56, Thomas Gleixner wrote:
>
> Again. Please provide reports against the most recent mainline version
> and not against some randomly picked kernel variant.

This time I tried it on the current master branch:

Linux (none) 5.12.0-next-20210506+ #3 SMP Tue May 11 14:53:58 HKT 2021 x86_64 GNU/Linux

>> If we make some changes to this experiment:
>>
>> 1. If we make the RT application use less CPU time instead of 100%,
>> the problem disappears.
>>
>> 2. If we change rq_affinity to 2, in order to avoid handling the
>> softirq on the same core as the RT thread, the problem also
>> disappears. However, this approach results in about 10%-30% random
>> write performance degradation compared to rq_affinity = 1, which may
>> have better cache utilization.
>>
>> echo 2 > /sys/block/sda/queue/rq_affinity
>>
>> Therefore, I want to exclude some CPUs from managed irqs via a boot
>> parameter,
> Why does this realtime thread have to run on CPU0, and why can it not
> move to some other CPU?

Yes, this realtime thread could move to another CPU, but I think it is
not so good to have to dodge the managed irq CPU. The OS also does not
give much of a hint that an RT thread should not run on this CPU. I
think the kernel should be able to schedule the irq workqueue handler
a little, since the RT thread is more like a user application while
the driver works within kernel space.

>> similar in approach to 11ea68f553e2 ("genirq, sched/isolation:
>> Isolate from handling managed interrupts").
> Why can't you use the existing isolation mechanisms?

CPU isolation forbids other processes from utilizing that CPU.
Sometimes the RT thread may not use up all the CPU time, so other
processes could be scheduled onto this CPU and run for a little while.

Thanks for your time,
Yihang
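The test module described above is not attached to the thread; a
minimal kernel-space sketch reconstructed from the description might
look as follows. The IRQ number (7, matching the /proc/irq/7 steps),
the "request_irq" module name, and all test_* identifiers are
assumptions, and the code is illustrative, not a drop-in driver:

```c
// Sketch of the described test module: register a handler on IRQ 7
// and defer the "real work" to a work item. On a CPU monopolized by a
// SCHED_FIFO task, that deferred work never gets CPU time, which is
// the starvation the thread demonstrates.
#include <linux/module.h>
#include <linux/interrupt.h>
#include <linux/workqueue.h>

#define TEST_IRQ 7  /* assumption: the IRQ line bound via /proc/irq/7 */

static void test_work_fn(struct work_struct *work)
{
	pr_info("work handler called\n");
}
static DECLARE_WORK(test_work, test_work_fn);

static irqreturn_t test_irq_handler(int irq, void *dev_id)
{
	/* Defer processing, as block drivers defer to softirq context.
	 * A real shared-IRQ handler would check whether its device
	 * raised the interrupt before claiming it. */
	schedule_work(&test_work);
	return IRQ_HANDLED;
}

static int __init test_init(void)
{
	/* IRQF_SHARED lets the fake device piggyback on a wired line;
	 * dev_id must be non-NULL for shared interrupts. */
	return request_irq(TEST_IRQ, test_irq_handler, IRQF_SHARED,
			   "irq-starvation-test", &test_work);
}

static void __exit test_exit(void)
{
	free_irq(TEST_IRQ, &test_work);
	flush_work(&test_work);
}

module_init(test_init);
module_exit(test_exit);
MODULE_LICENSE("GPL");
```

The debugfs trigger used in step 4 relies on CONFIG_GENERIC_IRQ_INJECTION
being enabled; without it, writing "trigger" to the irq debugfs file
will not inject the interrupt.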
end of thread, other threads: [~2021-05-18  1:37 UTC | newest]

Thread overview: 16+ messages

2019-03-18  6:21 Virtio-scsi multiqueue irq affinity Peter Xu
2019-03-23 17:15 ` Thomas Gleixner
2019-03-25  5:02 ` Peter Xu
2019-03-25  7:06 ` Ming Lei
2019-03-25  8:53 ` Thomas Gleixner
2019-03-25  9:43 ` Peter Xu
2019-03-25 13:27 ` Thomas Gleixner
2019-03-25  9:50 ` Ming Lei
2021-05-08  7:52 ` xuyihang
2021-05-08 12:26 ` Thomas Gleixner
2021-05-10  3:19 ` liaochang (A)
2021-05-10  7:54 ` Thomas Gleixner
2021-05-18  1:37 ` liaochang (A)
2021-05-10  8:48 ` xuyihang
2021-05-10 19:56 ` Thomas Gleixner
2021-05-11 12:38 ` xuyihang